Distributed Deep Learning with IBM DDL and TensorFlow NMT

by Seetharami Seelam, Geert Janssen, and Luis Lastras

Introduction

Sequence-to-sequence models are used extensively in tasks such as machine translation, speech recognition, caption generation, and text summarization. Training these models to reasonable accuracy can take on the order of weeks even on a GPU, so it can take weeks to months to explore different hyperparameters such as batch sizes, dropout rates, or the number of layers to improve accuracy. IBM DDL (Distributed Deep Learning library) can reduce these long training times by applying multiple GPUs (from 2 to hundreds) to each training job and by optimizing the communication between the GPUs. These capabilities allow one to explore the hyperparameter space more quickly and easily.

In this article, we discuss how to use the IBM DDL technology to reduce the training times of sequence-to-sequence models with multi-GPU training.

IBM DDL provides a compact set of APIs, and only simple steps are needed to modify training code written to run on a single GPU so that it runs on multiple GPUs. We demonstrate this DDL concept with the TensorFlow implementation of sequence-to-sequence models from the TensorFlow NMT tutorial: https://github.com/tensorflow/nmt.

NMT Sequence-to-Sequence models

The Neural Machine Translation example implements a state-of-the-art model for translating sentences between languages. It uses an encoder-decoder architecture built from LSTMs. The code shows evidence of efforts to implement model parallelism by mapping different LSTMs to different GPUs, but our experiments show that this model parallelization does not scale well for these workloads. Since the Python script provides no means of data parallelism, we discuss below how to enable data parallelism with IBM DDL.

IBM Distributed Deep Learning (DDL) library

IBM Distributed Deep Learning (DDL) is a communication library that provides a set of collective functions, much like MPI. These functions are exposed as TensorFlow operators. While constructing a TensorFlow model graph, DDL operators are injected to facilitate certain synchronization and communication actions. Usually this operator insertion is done automatically: a user's Python script merely imports the ddl.py wrapper module, which overrides certain TensorFlow graph-construction functions. In other situations, the integration effort is a bit more involved and requires modifications to comply with the recommendations described below.
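
As an illustration of how small the change can be, here is a minimal sketch (not the NMT code) of a single-GPU TensorFlow 1.x training script where the only DDL-specific line is the import of the ddl.py wrapper; the ddl module is assumed to come from the IBM DDL/PowerAI installation, and everything else is standard TensorFlow.

import tensorflow as tf
import ddl  # wrapper module; overrides certain graph-construction functions to inject DDL operators

# A toy linear model; in a real script this would be the full training graph.
x = tf.placeholder(tf.float32, [None, 10])
y = tf.placeholder(tf.float32, [None, 1])
w = tf.get_variable("w", [10, 1])  # trainable variables receive synchronized initial values
loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))
# Gradient updates computed here are combined across workers by the injected DDL operators.
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # ...feed each worker its own shard of data and run train_op as usual...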

The objective of using DDL is to enable distributed learning by spreading the computational work over many processors, typically GPUs. The type of parallelism employed by DDL is data parallelism, i.e., the data is partitioned and distributed to several identical workers. DDL is based on a very simple symmetric distribution model: all worker processes run the same Python script and there will be one worker per GPU. The worker processes are started and governed by MPI. This makes it easy to deploy a script across
multiple host machines.

Whether done automatically, or under precise user guidance, the steps necessary to use DDL in a TensorFlow Python training script are as follows:

  1. Since there is one worker per GPU, the model parameters, a.k.a. the weights and biases (i.e., all so-called trainable TensorFlow variables), reside in GPU memory and must be kept synchronized across the workers. This synchronization entails two actions: initialization and gradient updates. All trainable parameters must have identical initial values at the onset of computation. If these values are mere constants, this is achieved by default; often, however, the initial values are randomly drawn from some probability distribution. In that case one can initialize a single worker and then broadcast its values to all other workers. Typically the root (rank 0) worker is elected as master, and all workers perform a broadcast operation that also functions as a barrier.
  2. A gradient update to each worker's model parameters must be synchronized across all workers. Each worker computes its own (different) gradients based on the batch of input samples it processes during one iteration. The gradients must then be combined (typically summed or averaged) and the result must be distributed to all workers. The standard (MPI) operation that achieves this is called all-reduce. DDL provides a highly tuned variant of all-reduce (tuned specifically for distributed deep learning) that ensures the best possible communication across the available channels of varying bandwidth, speed, and latency.
  3. Within this symmetric distribution scheme, workers need to be provided with a unique chunk of the overall dataset. Statistically it would be inefficient for workers to operate on the same input data at the same time as nothing is gained by doing the same learning across multiple workers. Therefore one must partition the dataset and hand out a different chunk to each worker. Quite often this is achieved by keeping data in a large buffer and assigning each worker a particular offset location to start reading in this buffer. Alternatively, the partitioning could be done upfront and each worker simply reads from its own dedicated file.
  4. The use of multiple workers changes the calculation and interpretation of some of the hyperparameters, especially the learning rate schedule. With N workers, each processing a batch of B samples per iteration, the total number of samples processed per iteration increases by a factor of N. The learning rate schedule may need to be adjusted to accommodate this increase; a schedule that is unaware of the parallel execution would act too slowly.
  5. Quite often, a training script incorporates progress evaluations at regular intervals. This typically entails the recording of checkpoints. Also, there will be log messages to inform the user of specific events. In debugged code it makes little sense to replicate these actions in all workers; it suffices that only a single dedicated worker takes care of them.
    Summarizing, to be successful at distributed deep learning, one must pay attention to (a minimal sketch of these five points follows this list):
    1. Uniform weight (and bias) initialization,
    2. Synchronized shared gradient updates,
    3. Partitioned data sets,
    4. Hyperparameter adjustment, and
    5. Single-worker checkpointing, evaluation, and logging.
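
The following is a minimal sketch of these five points using plain mpi4py and NumPy, deliberately independent of DDL itself: the broadcast, all-reduce, sharding, learning-rate scaling, and rank-0 checkpointing shown here are the generic pattern, with DDL's tuned all-reduce taking the place of MPI.Allreduce in a real DDL/TensorFlow run. All names and sizes are illustrative.

# Run with, e.g.: mpirun -n 4 python sketch.py (each MPI rank plays the role of one GPU worker)
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, nworkers = comm.Get_rank(), comm.Get_size()

# (1) Uniform initialization: rank 0 draws random weights, all others receive them.
weights = np.random.randn(1000).astype(np.float32) if rank == 0 else np.empty(1000, np.float32)
comm.Bcast(weights, root=0)

# (3) Partitioned data: every worker reads only its own shard of the dataset.
full_data = np.arange(100000, dtype=np.float32)   # stand-in for the real training set
shard = full_data[rank::nworkers]

# (4) Hyperparameter adjustment: with N workers the effective batch grows by N,
#     so the learning rate (or its schedule) is typically rescaled.
lr = 0.01 * nworkers

for step in range(100):
    # (2) Each worker computes its own gradient on its shard, then an all-reduce
    #     combines the gradients so every worker applies the same averaged update.
    local_grad = np.random.randn(1000).astype(np.float32)   # placeholder for a real gradient
    global_grad = np.empty_like(local_grad)
    comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
    weights -= lr * (global_grad / nworkers)

    # (5) Only one dedicated worker logs and writes checkpoints.
    if rank == 0 and step % 20 == 0:
        np.save("checkpoint.npy", weights)
        print("step %d: |w| = %.3f" % (step, np.linalg.norm(weights)))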

Enabling distribution in NMT Sequence-to-Sequence models

Typically a training script creates a single graph, and DDL can be integrated into that graph following the approach described below. In NMT, however, the script creates three graphs: the train, inference, and eval graphs. The latter two have a slightly different top-layer structure than the train graph and operate by loading the variables from checkpoints produced by the train graph. The implications for running multiple parallel workers are:

  1. The train graph needs variable initialization; the inference and eval graphs need no variable initialization.
  2. In principle only one worker needs to create the inference and eval graphs.
  3. One worker will have to do the checkpointing.

The use of multiple graphs also means that DDL (ddl.py) must treat the graphs differently: it injects broadcast initializers for the train graph, but must not do so for the inference and eval graphs. DDL provides new API functions to enable and disable the insertion of broadcast nodes to solve this problem. Note that, for simplicity, the script still constructs all three graphs in each worker but ensures that only the root or master worker deploys the eval and inference graphs; the other workers simply ignore them. See here for an example: https://github.com/seelam/nmt/blob/master/nmt/train.py#L301
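
A rough sketch of this pattern follows, under stated assumptions: the DDL calls that toggle broadcast insertion are not named in this article and are therefore only indicated by comments, the worker rank is read from the Open MPI environment variable for illustration, and the tiny build_graph helper merely stands in for NMT's real model-construction functions.

import os
import tensorflow as tf

# Worker rank as set by Open MPI when the job is launched with mpirun (an assumption
# for this sketch; DDL also exposes the rank, but that call is not named here).
rank = int(os.environ.get("OMPI_COMM_WORLD_RANK", "0"))

def build_graph(name):
    # Stand-in for NMT's create_train_model / create_eval_model / create_infer_model.
    g = tf.Graph()
    with g.as_default():
        tf.get_variable(name + "_w", [4, 4], initializer=tf.random_normal_initializer())
    return g

# Train graph: DDL broadcast initializers are wanted here, so the (unnamed) DDL call
# that enables broadcast insertion would be issued before this point.
train_graph = build_graph("train")

# Eval/inference graphs: broadcast insertion must be switched off before building them;
# every worker still builds them for simplicity...
eval_graph = build_graph("eval")
infer_graph = build_graph("infer")

# ...but only the root worker deploys and runs them (and writes checkpoints).
if rank == 0:
    with tf.Session(graph=eval_graph):
        pass  # external evaluation / inference would run here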

Apart from importing ddl.py in the top-level nmt.py script (see here: https://github.com/seelam/nmt/blob/master/nmt/train.py#L18), most other changes are insertions of if-statements so that certain code blocks are executed only by the rank 0 worker. More specifically:

In train.py:

  • The creation of the train model is passed the total number of workers and the
    jobid as arguments so that the data is correctly partitioned when the input
    iterator is constructed (see the sketch after this list).
  • If-statements are inserted at several places to enforce single-worker logging,
    checkpointing, and progress evaluation, as described above.
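
For example, the per-worker data partitioning might look like the following sketch, assuming num_workers and jobid are threaded through to where the input dataset is built; the exact wiring in the NMT fork differs in detail, and the file arguments here are placeholders.

import tensorflow as tf

def make_training_dataset(src_file, tgt_file, num_workers, jobid, batch_size=64):
    # Zip the parallel source/target line datasets together.
    src = tf.data.TextLineDataset(src_file)
    tgt = tf.data.TextLineDataset(tgt_file)
    dataset = tf.data.Dataset.zip((src, tgt))
    # Each worker keeps every num_workers-th sentence pair, starting at its own
    # offset, so the workers train on disjoint shards of the corpus.
    dataset = dataset.shard(num_workers, jobid)
    return dataset.shuffle(10000).batch(batch_size)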

The final version of DDL enabled NMT code is available here:
https://github.com/seelam/nmt

Experimental Results & Discussion

The distributed NMT code described above was tested on a cluster of AC922 machines with V100 GPUs. Each machine has two POWER9 processors and four V100 GPUs, and the cluster consisted of four machines, for a total of 16 GPUs. The machines are connected by a 100 Gb/s Ethernet network. Additional details about the system and the benchmark configuration are provided below:

System Model: IBM POWER9 AC922
GPUs: 4x V100
OS: Red Hat Enterprise Linux Server release 7.5 (Maipo)
Linux kernel: 4.14.0-49.el7a.ppc64le
CUDA/cuDNN: 9.2 / 7.1.4
MPI Version:

Run command:
DATA_DIR=/tmp/
mpirun -x LD_LIBRARY_PATH -x PATH -x DDL_OPTIONS="-mode p -dbg_level 2" -n -host \
python -m nmt.nmt \
--src=en --tgt=en \
--vocab_prefix=$DATA_DIR/vocab.bpe.32000 \
--train_prefix=$DATA_DIR/train.tok.clean.bpe.32000.shuf \
--dev_prefix=$DATA_DIR/newstest2013.tok.bpe.32000 \
--test_prefix=$DATA_DIR/newstest2015.tok.bpe.32000 \
--out_dir=$DATA_DIR/output2 \
--batch_size=64 \
--num_train_steps=10000 \
--steps_per_stats=100 \
--num_layers=4 \
--num_units=1024 \
--dropout=0.2 \
--metrics=bleu \
--beam_width=15 \
--encoder_type=bi \
--decay_scheme=luong10 \
--subword_option=bpe \
--num_gpus=1 \
--random_seed=1234567

Training details:

The training data above is the same WMT German-English data used on the TensorFlow NMT website, with one change: the training file (train.tok.clean.bpe.32000.shuf) is a random shuffling of the original BPE training data (train.tok.clean.bpe.32000). This was done because the original ordering groups similar sequences together, which causes an odd periodic training behavior as the epoch progresses; shuffling removes that problem.
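
One way to produce such a shuffled copy while keeping the source and target sides aligned is sketched below; the file names are illustrative (the BPE training data is typically stored as separate per-language files, e.g. .en/.de in the WMT tutorial), and the seed is arbitrary.

import random

def shuffle_parallel(src_in, tgt_in, src_out, tgt_out, seed=1234567):
    with open(src_in) as f:
        src = f.readlines()
    with open(tgt_in) as f:
        tgt = f.readlines()
    assert len(src) == len(tgt), "parallel files must have the same number of lines"
    order = list(range(len(src)))
    random.Random(seed).shuffle(order)        # same permutation for both sides
    with open(src_out, "w") as f:
        f.writelines(src[i] for i in order)
    with open(tgt_out, "w") as f:
        f.writelines(tgt[i] for i in order)

shuffle_parallel("train.tok.clean.bpe.32000.en", "train.tok.clean.bpe.32000.de",
                 "train.tok.clean.bpe.32000.shuf.en", "train.tok.clean.bpe.32000.shuf.de")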

Results

The graph below shows the words per second (wps) as a function of the number of GPUs used to run the NMT application enhanced with the distribution capability on the AC922 systems. As noted above, each of these systems has four GPUs, so the data points for 1, 2, and 4 GPUs come from a single system. A single AC922 node achieves a throughput of 51K wps (4 GPUs total). With 16 GPUs across four nodes, the AC922 systems achieve 160K wps, an effective scaling of 8x compared to single-GPU performance.

To understand the performance scaling better, the table below shows the step-time. The step time for 1 GPU is determined by computation time and the rate of data exchange between the GPU and the CPU. Step times for 2 or more GPUs involve, in addition to computation and data exchange between the GPU and the CPU, the communication time to transfer gradients between the GPUs.

Number of GPUs          1     2     4     8     12    16
Time per step (ms)      360   450   560   680   690   700

The gradients for this particular NMT model total about 717 MB and are transferred among the GPUs once per step. In the AC922, pairs of GPUs are connected at 150 GB/s using NVLink 2.0, and the resulting communication overhead is about 90 ms (450 ms - 360 ms). When the communication crosses system boundaries (as in the 8-, 12-, and 16-GPU cases), the data flows over a 100 Gb/s network, yet the communication time grows sub-linearly. From these data we conclude that DDL-enabled NMT on the AC922, with NVLink connecting the GPUs within a node and 100 Gb/s Ethernet connecting the nodes, scales well: training time is reduced by roughly a factor of 8 when using 16 GPUs. DDL therefore allows one to use additional GPUs to cut the execution time of NMT from a week to about a day and enables faster exploration of the hyperparameter space.
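
As a quick consistency check on the quoted scaling, the relative speedup implied by the step times in the table can be computed directly, assuming the per-GPU batch size is fixed so that throughput is proportional to the number of GPUs divided by the time per step.

# Effective scaling implied by the step times above.
step_ms = {1: 360, 2: 450, 4: 560, 8: 680, 12: 690, 16: 700}   # from the table above
t1 = step_ms[1]
for n, t in step_ms.items():
    speedup = n * t1 / t
    print(f"{n:2d} GPUs: {speedup:4.1f}x over 1 GPU ({speedup / n:.0%} efficiency)")
# 16 GPUs -> 16 * 360 / 700 ≈ 8.2x, consistent with the ~8x scaling quoted above.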

Try it yourself… 

In this blog, we discussed the steps to extend the TensorFlow NMT sequence-to-sequence model code with IBM DDL. Our results demonstrate that, with minimal changes to the user code, DDL allows one to efficiently distribute work across GPUs on a single node and across nodes. The efficient use of multiple GPUs can reduce the execution time by an order of magnitude and enables easy exploration of the hyperparameter space.

References:

See the references below for additional information on IBM DDL, PowerAI, NMT, etc.

Seetharami Seelam is a Research Staff Member at IBM Research. His research interests are at the intersection of cloud, systems, and cognitive computing. He is working to deliver systems and cognitive services in the cloud with ease of use and differentiated performance.

Geert Janssen is a Research Staff Member at IBM Research. His research interests are in developing parallel algorithms for distributed deep learning.

Luis Lastras is a Research Staff Member at IBM Research. His interests are in the areas of information retrieval, semantic search, knowledge graphs, linked data, social networks, natural language processing, and very large scale systems.
