Distributed deep learning with Horovod and PowerAI DDL

Horovod is a popular distributed training framework for TensorFlow, Keras, and PyTorch. This blog post explains how to use the efficient PowerAI DDL communication library with Horovod. DDL uses the hierarchical topology of the network to minimize the communication cost.

Minimum requirements:

  • IBM PowerAI 1.5.2 (1.5.3 to use Horovod with Python 3)
  • Horovod v0.13.11

Setting up Horovod and DDL

The following setup steps need to be executed on all the machines that the distributed run will use.

  1. Obtain PowerAI, either by using the PowerAI Docker image or by following the ordering information.
    You can skip the next two steps if you use the Docker container.
  2. Install the deep learning framework(s) you want to use (TensorFlow, PyTorch). This tutorial focuses on TensorFlow.
  3. Install DDL and its header files
    RHEL: sudo yum install ddl ddl-dev
  4. Run the deep learning framework(s) and DDL activation scripts
    source /opt/DL/tensorflow/bin/tensorflow-activate; source /opt/DL/ddl/bin/ddl-activate
  5. Install Horovod with the DDL backend
    HOROVOD_GPU_ALLREDUCE=DDL pip install horovod --no-cache-dir

    Note: To switch to a different backend, Horovod must be reinstalled.
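After the install step, it can be useful to confirm on each machine that the horovod package is visible from the activated environment. The following is a minimal sketch, not part of PowerAI or Horovod: the helper name `horovod_installed` is hypothetical, and the check only tells you that the package can be found, not that it was built with the DDL backend.

```python
import importlib.util


def horovod_installed() -> bool:
    """Return True if a top-level 'horovod' package can be found.

    Hypothetical helper for illustration only. It does not import
    Horovod and does not verify which allreduce backend was built.
    """
    return importlib.util.find_spec("horovod") is not None


if __name__ == "__main__":
    if horovod_installed():
        print("horovod is importable in this environment")
    else:
        print("horovod not found; rerun the install step on this machine")
```

Run it with the same Python interpreter that the activation scripts put on your path, once per machine.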

Training a model with Horovod+DDL

We will use the TensorFlow framework with the High-Performance Models as an example.

  1. First, copy the model scripts to your current directory (repeat on each machine if the filesystem is not distributed)
    /opt/DL/tensorflow-performance-models/bin/tensorflow-install-models hpms
  2. Run the deep learning framework(s) and DDL activation scripts
    source /opt/DL/tensorflow/bin/tensorflow-activate; source /opt/DL/ddl/bin/ddl-activate
  3. Use ddlrun to execute the distributed run
    ddlrun -H host1,host2,host3,host4 -mpiarg "-x HOROVOD_FUSION_THRESHOLD=16777216" python hpms/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet50 --batch_size 64 --variable_update=horovod

Note: HOROVOD_FUSION_THRESHOLD=16777216 (16 MiB, expressed in bytes) is recommended because it improves performance by overlapping communication with computation more effectively.
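The threshold value is simply 16 MiB written out in bytes. The sketch below shows that arithmetic and assembles the same ddlrun command line as in the step above, using only the flags that appear in this post. The function names and the `fusion_mib` parameter are hypothetical, introduced only for illustration; the script prints the command rather than running it.

```python
import shlex


def fusion_threshold_bytes(mib: int) -> int:
    """Convert a size in MiB to bytes (1 MiB = 1024 * 1024 bytes)."""
    return mib * 1024 * 1024


def build_ddlrun_command(hosts, script_args, fusion_mib=16):
    """Assemble the ddlrun argument list used in this post.

    Hypothetical helper: it only mirrors the -H and -mpiarg flags
    shown above and does not validate hosts or arguments.
    """
    mpi_arg = f"-x HOROVOD_FUSION_THRESHOLD={fusion_threshold_bytes(fusion_mib)}"
    return ["ddlrun", "-H", ",".join(hosts), "-mpiarg", mpi_arg, "python", *script_args]


cmd = build_ddlrun_command(
    ["host1", "host2", "host3", "host4"],
    [
        "hpms/tf_cnn_benchmarks/tf_cnn_benchmarks.py",
        "--model", "resnet50",
        "--batch_size", "64",
        "--variable_update=horovod",
    ],
)

# Print a shell-quoted version of the command for copy and paste.
print(" ".join(shlex.quote(part) for part in cmd))
```

Building the command as a list keeps the quoted -mpiarg value as a single argument, which is what the double quotes in the shell command achieve.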

The run output should display the IBM Corp. DDL banner and, for this model, the total images/sec.

I 20:42:52.209 12173 12173 DDL:29  ] [MPI:0   ] ==== IBM Corp. DDL 1.1.0 + (MPI 3.1) ====
...
----------------------------------------------------------------
total images/sec: 5682.34
----------------------------------------------------------------
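If you want to check these two things automatically, for example when comparing runs, a small parser is enough. The sketch below assumes the output looks like the excerpt above; the function names are hypothetical, and the exact log format may differ between DDL and benchmark versions.

```python
import re

# Matches a line such as "total images/sec: 5682.34".
_TOTAL_RE = re.compile(r"^total images/sec:\s*([0-9]+(?:\.[0-9]+)?)\s*$", re.MULTILINE)


def has_ddl_banner(output: str) -> bool:
    """Return True if the DDL banner text appears in the run output."""
    return "IBM Corp. DDL" in output


def parse_total_images_per_sec(output: str):
    """Return the reported total images/sec as a float, or None if absent."""
    match = _TOTAL_RE.search(output)
    return float(match.group(1)) if match else None


# Sample text copied from the output excerpt in this post.
sample = """I 20:42:52.209 12173 12173 DDL:29  ] [MPI:0   ] ==== IBM Corp. DDL 1.1.0 + (MPI 3.1) ====
----------------------------------------------------------------
total images/sec: 5682.34
----------------------------------------------------------------
"""

print("DDL banner found:", has_ddl_banner(sample))
print("total images/sec:", parse_total_images_per_sec(sample))
```

A missing banner usually suggests that the run did not go through DDL, so it is worth checking before trusting the throughput number.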

For more information on how to integrate your model with Horovod, see the Horovod GitHub repository: https://github.com/uber/horovod
