Distributed deep learning with Horovod and PowerAI DDL

阿新 • Published: 2019-01-16

Horovod is a popular distributed training framework for TensorFlow, Keras, and PyTorch. This blog post explains how to use the efficient PowerAI DDL communication library with Horovod. DDL uses the hierarchical topology of the network to minimize the communication cost.
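The benefit of a topology-aware scheme can be illustrated with a toy two-level allreduce. The sketch below is only an illustration and does not reproduce DDL's actual algorithm; the helper names `hierarchical_allreduce` and `cross_node_participants` are hypothetical. It sums values inside each machine first, so that only one partial result per machine has to take part in the slower cross-machine step.

```python
# Toy two-level allreduce. Assumption: this is NOT how DDL is implemented;
# it only shows why reducing inside a node first shrinks cross-node traffic.

def hierarchical_allreduce(values_per_node):
    """values_per_node holds one list of per-GPU values for each node.

    Returns a structure of the same shape in which every entry is the
    global sum, as an allreduce would leave it.
    """
    # Step 1: reduce inside each node over the fast local links.
    node_sums = [sum(gpu_values) for gpu_values in values_per_node]
    # Step 2: combine one partial sum per node across the network.
    total = sum(node_sums)
    # Step 3: broadcast the total back to every GPU inside each node.
    return [[total] * len(gpu_values) for gpu_values in values_per_node]


def cross_node_participants(values_per_node, hierarchical=True):
    """Number of values that enter the cross-node step."""
    if hierarchical:
        return len(values_per_node)  # one partial sum per node
    return sum(len(gpu_values) for gpu_values in values_per_node)  # one per GPU


# Two nodes with four GPUs each (made-up numbers).
gradients = [[1, 2, 3, 4], [5, 6, 7, 8]]
reduced = hierarchical_allreduce(gradients)
print(reduced[0][0])
print(cross_node_participants(gradients, hierarchical=True))
print(cross_node_participants(gradients, hierarchical=False))
```

In this toy setup the cross-node step handles one value per machine instead of one per GPU, which is the intuition behind exploiting the network hierarchy.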

Minimum requirements:

IBM PowerAI 1.5.2 (1.5.3 is required to use Horovod with Python 3)

Horovod v0.13.11

Setting up Horovod and DDL

The following setup steps need to be executed on all the machines that the distributed run will use.

Obtain PowerAI, either through the PowerAI Docker image or by following the PowerAI ordering information.
You can skip the next two steps if you use the Docker container.
Install the deep learning framework(s) you want to use (TensorFlow, PyTorch). This tutorial focuses on TensorFlow.

Install DDL and its header files
RHEL: sudo yum install ddl ddl-dev
Source the activation scripts for the deep learning framework(s) and DDL
source /opt/DL/tensorflow/bin/tensorflow-activate; source /opt/DL/ddl/bin/ddl-activate
Install Horovod with DDL backend
HOROVOD_GPU_ALLREDUCE=DDL pip install horovod --no-cache-dir

Note: The allreduce backend is selected when Horovod is built, so Horovod needs to be reinstalled to use a different backend.
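Because the setup has to be repeated on every machine, a quick sanity check can save a failed launch later. The following is a minimal sketch, assuming Python 3 and the activated environment; `installed_packages` is a hypothetical helper. It only reports whether the packages can be imported and does not confirm that Horovod was built against DDL.

```python
import importlib.util


def installed_packages(names=("tensorflow", "horovod")):
    """Map each top-level package name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Run this on each machine after sourcing the activation scripts.
status = installed_packages()
for name, present in sorted(status.items()):
    print("{}: {}".format(name, "found" if present else "MISSING"))
```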

Training a model with Horovod+DDL

We will use the TensorFlow framework with the High-Performance Models as an example.

First, copy the model scripts to your current directory (repeat on each machine if the filesystem is not shared across machines)
/opt/DL/tensorflow-performance-models/bin/tensorflow-install-models hpms
Source the activation scripts for the deep learning framework(s) and DDL
source /opt/DL/tensorflow/bin/tensorflow-activate; source /opt/DL/ddl/bin/ddl-activate
Use ddlrun to execute the distributed run

ddlrun -H host1,host2,host3,host4 -mpiarg "-x HOROVOD_FUSION_THRESHOLD=16777216" python hpms/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet50 --batch_size 64 --variable_update=horovod

Note: HOROVOD_FUSION_THRESHOLD=16777216 (16 MiB, expressed in bytes) is recommended because it improves performance by better overlapping communication with computation.
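When the host list or the threshold changes between runs, it can be convenient to assemble the launch command in a small script. The sketch below assumes Python 3; `build_ddlrun_command` is a hypothetical helper, and it uses only the ddlrun options that appear in the command above. It also makes the byte arithmetic behind 16777216 explicit.

```python
import shlex


def build_ddlrun_command(hosts, script_args, fusion_threshold_mib=16):
    """Return the ddlrun invocation as an argument list."""
    # HOROVOD_FUSION_THRESHOLD is given in bytes: 16 MiB = 16 * 1024 * 1024.
    threshold_bytes = fusion_threshold_mib * 1024 * 1024
    return [
        "ddlrun",
        "-H", ",".join(hosts),
        "-mpiarg", "-x HOROVOD_FUSION_THRESHOLD={}".format(threshold_bytes),
        "python",
    ] + list(script_args)


command = build_ddlrun_command(
    ["host1", "host2", "host3", "host4"],
    [
        "hpms/tf_cnn_benchmarks/tf_cnn_benchmarks.py",
        "--model", "resnet50",
        "--batch_size", "64",
        "--variable_update=horovod",
    ],
)
# Print a shell-quoted form that can be pasted into a terminal.
print(" ".join(shlex.quote(arg) for arg in command))
```

The list form can also be passed directly to `subprocess.run(command)`, which avoids quoting problems with the MPI argument.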

The run output should display the IBM Corp. DDL banner and, for this model, the total images/sec.

I 20:42:52.209 12173 12173 DDL:29  ] [MPI:0   ] ==== IBM Corp. DDL 1.1.0 + (MPI 3.1) ====
...
----------------------------------------------------------------
total images/sec: 5682.34
----------------------------------------------------------------
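To compare runs, the throughput line can be extracted from captured output. This is a minimal sketch, assuming the output has the format shown above; `parse_total_images_per_sec` is a hypothetical helper and returns the first match only, even if several processes print their own line.

```python
import re

# Matches a line such as "total images/sec: 5682.34".
THROUGHPUT_PATTERN = re.compile(r"total images/sec:\s*([0-9]+(?:\.[0-9]+)?)")


def parse_total_images_per_sec(output):
    """Return the first reported throughput as a float, or None if absent."""
    match = THROUGHPUT_PATTERN.search(output)
    return float(match.group(1)) if match else None


sample_output = """\
----------------------------------------------------------------
total images/sec: 5682.34
----------------------------------------------------------------
"""
print(parse_total_images_per_sec(sample_output))
```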

For more information on how to integrate your own model with Horovod, see the Horovod GitHub repository: https://github.com/uber/horovod
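As a rough orientation, the usual integration steps can be sketched as follows. This assumes the TensorFlow 1.x graph API and the `horovod.tensorflow` module; the function name `horovod_training_pieces`, the choice of plain gradient descent, and the base learning rate are illustrative assumptions, not part of the benchmark scripts. The Horovod repository remains the authoritative reference.

```python
def horovod_training_pieces(base_learning_rate=0.01):
    """Build the session config, optimizer, and hooks for a Horovod run.

    The imports live inside the function so that this file can be loaded
    on a machine without TensorFlow or Horovod installed.
    """
    import tensorflow as tf
    import horovod.tensorflow as hvd

    # Initialize Horovod; every process started by the launcher calls this.
    hvd.init()

    # Pin each process to one GPU, chosen by its rank on the local machine.
    config = tf.ConfigProto()
    config.gpu_options.visible_device_list = str(hvd.local_rank())

    # Scale the learning rate by the number of workers, then wrap the
    # optimizer so that gradients are averaged across workers with allreduce.
    optimizer = tf.train.GradientDescentOptimizer(base_learning_rate * hvd.size())
    optimizer = hvd.DistributedOptimizer(optimizer)

    # Broadcast the initial variables from rank 0 so all workers start equal.
    hooks = [hvd.BroadcastGlobalVariablesHook(0)]
    return config, optimizer, hooks
```

The returned pieces would typically be passed to `tf.train.MonitoredTrainingSession(config=config, hooks=hooks)`, with `optimizer.minimize(loss)` as the training op.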
