Distributed deep learning with Horovod and PowerAI DDL
Horovod is a popular distributed training framework for TensorFlow, Keras, and PyTorch. This blog post explains how to use the efficient PowerAI DDL communication library with Horovod. DDL uses the hierarchical topology of the network to minimize the communication cost.
Minimum requirements:
- IBM PowerAI 1.5.2 (1.5.3 to use Horovod with Python 3)
- Horovod v0.13.11
Setting up Horovod and DDL
The following setup steps need to be executed on all the machines that the distributed run will use.
- Download PowerAI using the PowerAI Docker image or by following the Ordering information.
You can skip the next two steps if you use the Docker container.
- Install the deep learning framework(s) you want to use (TensorFlow, PyTorch). In this tutorial, we will focus on TensorFlow.
- Install DDL and its header files
RHEL: sudo yum install ddl ddl-dev
- Run the deep learning framework(s) and DDL activation scripts
source /opt/DL/tensorflow/bin/tensorflow-activate; source /opt/DL/ddl/bin/ddl-activate
- Install Horovod with DDL backend
HOROVOD_GPU_ALLREDUCE=DDL pip install horovod --no-cache-dir
Note: Horovod must be reinstalled to use a different backend.
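One way to confirm that the installed Horovod meets the minimum version listed in the requirements is to query the installed package metadata. The following is a minimal sketch, not part of PowerAI or Horovod: the helper names `parse_version` and `horovod_meets_minimum` are hypothetical, and the version parsing is a simplification that assumes plain dotted-integer versions such as 0.13.11.

```python
from importlib import metadata

MINIMUM_HOROVOD = "0.13.11"  # minimum version listed in the requirements


def parse_version(text):
    """Turn a dotted version like '0.13.11' into a tuple of ints.

    Simplifying assumption: each dot-separated component starts with digits;
    anything after the leading digits (e.g. 'rc1') is ignored.
    """
    parts = []
    for piece in text.lstrip("v").split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def horovod_meets_minimum(minimum=MINIMUM_HOROVOD):
    """Return True or False, or None if Horovod is not installed."""
    try:
        installed = metadata.version("horovod")
    except metadata.PackageNotFoundError:
        return None
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    print("Horovod meets minimum:", horovod_meets_minimum())
```

Comparing integer tuples rather than raw strings matters here: as strings, "0.9.0" would sort after "0.13.11", while as tuples it correctly sorts before.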
Training a model with Horovod+DDL
We will use the TensorFlow framework with the High-Performance Models as an example.
- First, copy the model scripts to your current directory (repeat on each machine if the filesystem is not distributed)
/opt/DL/tensorflow-performance-models/bin/tensorflow-install-models hpms
- Run the deep learning framework(s) and DDL activation scripts
source /opt/DL/tensorflow/bin/tensorflow-activate; source /opt/DL/ddl/bin/ddl-activate
- Use ddlrun to execute the distributed run
ddlrun -H host1,host2,host3,host4 -mpiarg "-x HOROVOD_FUSION_THRESHOLD=16777216" python hpms/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet50 --batch_size 64 --variable_update=horovod
Note: HOROVOD_FUSION_THRESHOLD=16777216 is recommended to increase performance by better overlapping communication with computation.
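The value 16777216 is 16 × 1024 × 1024, that is, 16 MiB expressed in bytes. The sketch below derives that number and assembles the same ddlrun command line from a host list. It is illustrative only: `fusion_threshold_bytes` and `build_ddlrun_command` are hypothetical helper names, the flags are copied from the command in this post, and the command is only printed, not executed.

```python
import shlex


def fusion_threshold_bytes(mebibytes):
    """Convert a fusion buffer size in MiB to bytes."""
    return mebibytes * 1024 * 1024


def build_ddlrun_command(hosts, threshold, script_args):
    """Assemble the ddlrun argument list used in this post."""
    return [
        "ddlrun",
        "-H", ",".join(hosts),
        "-mpiarg", "-x HOROVOD_FUSION_THRESHOLD={}".format(threshold),
        "python",
    ] + list(script_args)


hosts = ["host1", "host2", "host3", "host4"]
script_args = [
    "hpms/tf_cnn_benchmarks/tf_cnn_benchmarks.py",
    "--model", "resnet50",
    "--batch_size", "64",
    "--variable_update=horovod",
]
command = build_ddlrun_command(hosts, fusion_threshold_bytes(16), script_args)

# Quote each argument so the printed line can be pasted into a shell.
print(" ".join(shlex.quote(arg) for arg in command))
```

Keeping the host list and threshold as variables makes it easier to try other values without retyping the whole command.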
The run output should display the IBM Corp. DDL banner and, for this model, the total images/sec.
I 20:42:52.209 12173 12173 DDL:29 ] [MPI:0 ] ==== IBM Corp. DDL 1.1.0 + (MPI 3.1) ====
...
----------------------------------------------------------------
total images/sec: 5682.34
----------------------------------------------------------------
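When comparing runs, it can be handy to pull the throughput figure out of the log programmatically. The following sketch assumes the output format shown above; `extract_total_images_per_sec` is a hypothetical helper, and the per-host figure simply divides by the four hosts from the example command.

```python
import re

# Sample copied from the output shown in this post.
SAMPLE_OUTPUT = """\
----------------------------------------------------------------
total images/sec: 5682.34
----------------------------------------------------------------
"""

PATTERN = re.compile(
    r"^total images/sec:\s*([0-9]+(?:\.[0-9]+)?)\s*$", re.MULTILINE
)


def extract_total_images_per_sec(log_text):
    """Return the last 'total images/sec' value in the log, or None."""
    matches = PATTERN.findall(log_text)
    return float(matches[-1]) if matches else None


total = extract_total_images_per_sec(SAMPLE_OUTPUT)
num_hosts = 4  # host1..host4 in the example command
print("total images/sec:", total)
print("average images/sec per host:", total / num_hosts)
```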
For more information on how to integrate your model with Horovod, see the Horovod GitHub repository: https://github.com/uber/horovod