
16 On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima 1609.04836v1

Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, Ping Tak Peter Tang (Northwestern University & Intel)

* SGD and its variants show a clear drop in generalization (a generalization gap/degradation) as the batch size grows; so far the reason for this is not well understood.
* This paper supports, with ample numerical evidence, the view that the larger the batch size, the more likely the method is to converge to a sharp local minimizer, while smaller batch sizes tend to converge to flat local minimizers; the experiments suggest this is caused by the inherent noise in the gradient estimates.
* The paper also discusses several empirical strategies that help large-batch methods close the generalization gap, and summarizes future research ideas and open questions. Since the training and test distributions differ somewhat, the sharper the loss is around a local minimizer, the more performance degrades when moving to the test set.
* Training the network is the non-convex optimization problem

$$\min_{x \in \mathbb{R}^n} \; f(x) := \frac{1}{M} \sum_{i=1}^{M} f_i(x),$$

where $f_i$ is the loss for the $i$-th training sample and $x$ denotes the network weights.

### SGD and its variants

SGD and its variants iteratively take steps of the form

$$x_{k+1} = x_k - \alpha_k \left( \frac{1}{|B_k|} \sum_{i \in B_k} \nabla f_i(x_k) \right),$$

where the mini-batch $B_k$ is sampled from the training set and $\alpha_k$ is the step size.
This amounts to gradient descent with a noisy estimate of the gradient.
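As a concrete illustration of the update above, here is a minimal NumPy sketch of one mini-batch SGD step. The `grad_fi` per-sample gradient helper, the toy quadratic loss, and all array shapes are illustrative assumptions, not code from the paper.

```python
import numpy as np

def sgd_step(x, grad_fi, dataset_size, batch_size, lr, rng):
    """One mini-batch SGD step: x <- x - lr * mean of per-sample gradients.

    grad_fi(i, x) is assumed to return the gradient of the i-th sample's
    loss f_i at the current weights x (a hypothetical helper)."""
    batch = rng.choice(dataset_size, size=batch_size, replace=False)
    grad = np.mean([grad_fi(i, x) for i in batch], axis=0)
    return x - lr * grad

# Example usage with a toy quadratic loss f_i(x) = 0.5 * ||x - a_i||^2
rng = np.random.default_rng(0)
anchors = rng.normal(size=(1000, 5))      # one "target" a_i per sample
grad_fi = lambda i, x: x - anchors[i]     # gradient of f_i at x
x = np.zeros(5)
for _ in range(200):
    x = sgd_step(x, grad_fi, dataset_size=1000, batch_size=32, lr=0.1, rng=rng)
```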
In general:

* pros: (a) for strongly convex functions it converges to the minimizer, and for non-convex functions to stationary points; (b) it can avoid converging to saddle points.
* cons: because the method is inherently sequential and the batch size is usually small, the scope for parallelism is limited. There has been prior work on parallelizing SGD, but it remains constrained by the small batch size.
* A natural way to parallelize is to increase the batch size, but this introduces a generalization gap/drop; in practice the accuracy loss can be as large as 5% even for small networks.
* If the batch size could be increased without sacrificing performance, the degree of parallelism could be raised and training time reduced substantially.

### Drawbacks of large-batch (LB) methods

The generalization gap: even though LB and SB (small-batch) methods reach comparable training accuracy, LB generalizes noticeably worse.
Possible causes:

(1) LB methods over-fit more than SB methods;
(2) LB methods lack the explorative property of SB methods (exploration in the sense used in reinforcement learning) and tend to converge to a local minimizer close to the initial point;
(3) LB and SB methods converge to local minimizers with different generalization ability;
(4) deep networks need a certain number of iterations for the objective to reach a local minimizer that generalizes well, and LB methods usually take fewer iterations than SB methods;
(5) LB methods converge to saddle points.

The data in this paper support conjectures (2) and (3) (the conjectures arose from the authors' private communication with Yann LeCun).
* The sharpness of a minimizer can be characterized by the magnitude of the eigenvalues of the Hessian $\nabla^2 f(x)$, but computing them is infeasible for large networks.
* The paper therefore proposes a metric that is computationally feasible even for very large networks. The underlying idea is to search for the worst-case value of the loss within a small neighborhood of the current solution.

### Numerical experiments

LB: batch size equal to 10% of the training data; SB: batch size 256; ADAM optimizer; each experiment is repeated 5 times with different uniform initializations. In all experiments both methods reach high training accuracy, but there is a clear gap in test (generalization) performance. All networks are trained until the loss saturates.

* The authors argue the generalization gap is not caused by over-fitting (i.e., a highly expressive model over-trained on limited data, so that test accuracy peaks at some iterate and then decays as the model keeps fitting peculiarities of the training set). No such behavior is observed in the experiments, so early stopping, the usual heuristic against over-fitting, cannot reduce the generalization gap.

### Parametric plots

These show the value of the loss along a one-dimensional subspace, but they do not reveal how the function behaves around the local minimizer in the full space (its sharpness). Both linear (Figure 3) and curved (Figure 4) parametric plots are shown.

### Sharpness

The metric looks at a neighborhood of the solution that was found. $A$ is randomly generated and $A^+$ is its pseudo-inverse; if $A$ is not specified, it is taken to be the identity $I_{n \times n}$.

* Table 3 uses $A = I_n$; Table 4 uses a random $n \times 100$ matrix $A$ (i.e., sharpness measured in a random subspace); $\epsilon$ = 1e-3 and 5e-4.
* The max problem in (3) is solved with L-BFGS-B, run for 10 iterations.
* Both tables show that the LB solutions are far more sensitive: the sharpness of SB and LB solutions differs by 1-2 orders of magnitude.

### Why does SB not converge to sharp minimizers?

Because small batches give noisier gradient estimates: at the bottom of a sharp minimizer even a little noise is enough to push the iterate out of local optimality, so the noise drives the iterates toward flatter minimizers (where the noise cannot push them away from the bottom). As the batch size grows, test accuracy decreases and sharpness increases.

* The noise in LB gradient estimates is not sufficient to push the iterates away from a sharp minimizer.
* Warm-starting experiment: first train with a 0.25% batch size for 100 epochs, saving the model after every epoch; then use each epoch's model as the initial point and train for another 100 epochs with the LB method (i.e., study the effect of the initial point on LB by pre-training with SB). Accuracy and sharpness are then plotted for both LB and SB. After a sufficient number of SB iterations (once the exploration phase is essentially over and a flat minimizer has been found), the warm-started LB run reaches accuracy comparable to SB. This supports the claim that LB methods indeed tend to converge to sharp minimizers near the initial point, while SB methods move away from them.
* In the plot the horizontal axis is the loss, decreasing from 1 toward 0.1 and below. At large loss values (near the initial point, on the left) LB and SB have similar sharpness. As the loss decreases (moving right) the gap grows, with LB becoming much sharper than SB. For SB the sharpness stays roughly constant and only then drops, indicating an exploration phase followed by convergence to a flat local minimizer. The failure of LB methods is precisely this missing exploration phase: they zoom in on the local minimizer closest to the initial point. Why this zooming-in is harmful, however, is not understood.

### Attempts to mitigate the problem

LB: 10% of the training data; SB: 0.25%; ADAM optimizer.

* Data augmentation. Can the geometry of the loss function be modified to make it friendlier to LB methods? The loss function depends on the form of the objective and on the size and properties of the training set, which motivates data augmentation. Augmentation is domain specific, but the operations applied to the training set are usually controllable, and it effectively acts as a regularizer on the network. For image classification the augmentations are horizontal reflections, random rotations of up to 10 degrees, and random translations of up to 0.2 times the size of the image. With augmentation, LB reaches accuracy comparable to SB (also trained with augmentation), but the sharpness remains, indicating sensitivity to images contained in neither the training nor the testing set. A sketch of such an augmentation pipeline is given below.
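The augmentations described above (horizontal flips, rotations up to 10 degrees, translations up to 0.2 of the image size) can be expressed, for example, with torchvision transforms. The pipeline below is an illustrative sketch, not the authors' code; the dataset used in the usage comment is an assumption.

```python
from torchvision import transforms

# Illustrative augmentation pipeline matching the operations described above:
# horizontal reflections, rotations up to 10 degrees, translations up to
# 0.2 * image size.
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomAffine(degrees=10, translate=(0.2, 0.2)),
    transforms.ToTensor(),
])

# Example usage (hypothetical dataset choice):
# dataset = torchvision.datasets.CIFAR10(root=".", train=True, transform=train_transform)
```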
* Conservative training. Mu Li et al. [ACM SIGKDD'14] argue that the convergence rate of SGD in the large-batch setting can be improved by obtaining the iterates through the following proximal sub-problem:

$$x_{k+1} = \arg\min_{x} \; \frac{1}{|B_k|} \sum_{i \in B_k} f_i(x) + \frac{\lambda}{2} \|x - x_k\|_2^2 .$$
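A minimal sketch of one conservative-training step under these definitions, assuming a PyTorch model and a `batch_loss(model)` closure for the mini-batch loss; the paper reports solving the sub-problem inexactly with a few ADAM iterations (details below), and the hyper-parameters here are illustrative.

```python
import torch

def conservative_step(model, batch_loss, inner_iters=3, lam=1e-3, lr=1e-3):
    """Approximately solve  min_x  batch_loss(x) + (lam/2) * ||x - x_k||^2
    by running a few ADAM iterations, starting from the current weights x_k."""
    anchor = [p.detach().clone() for p in model.parameters()]  # x_k, kept fixed
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(inner_iters):
        opt.zero_grad()
        loss = batch_loss(model)
        # proximal term pulling the iterate back toward x_k
        prox = sum(((p - a) ** 2).sum() for p, a in zip(model.parameters(), anchor))
        (loss + 0.5 * lam * prox).backward()
        opt.step()
```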
Motivation: to better utilize a batch before moving on to the next one. Mu Li et al. solve the sub-problem inexactly with 3-5 iterations of gradient descent, coordinate descent, or L-BFGS, and show that this improves not only the convergence rate of SGD but also the empirical performance on convex machine learning problems. The idea of making fuller use of each batch is not limited to convex problems and applies to deep learning as well, although without theoretical guarantees. In this paper the sub-problem is solved with 3 iterations of ADAM and $\lambda$ = 1e-3. This approach also improves test accuracy, but the sensitivity (sharpness) problem remains.

* Robust training. Here the worst-case cost in a neighborhood of the weights is minimized:

$$\min_x \; \max_{\|\Delta x\| \le \epsilon} f(x + \Delta x), \qquad \epsilon > 0 .$$

Using this formulation directly is impractical: it involves a large-scale second-order cone program (SOCP), which is far too expensive to solve. In deep learning, robustness comes in two interdependent flavors: robustness to the data and robustness of the solution. The former treats $f$ as a statistical model, the latter treats $f$ as a black box. [36] shows that robustness of the solution (with respect to the data) is equivalent to adversarial training [15]. Since data augmentation was partially successful, it is natural to ask whether adversarial training helps. As argued in [15], adversarial training also artificially enlarges the training set, but unlike random augmentation it uses the model's sensitivity to construct the new samples. Despite this intuition, the experiments show no improvement in the generalization gap from adversarial training. Likewise, the stability-training approach of [44] does not improve generalization either. One common sensitivity-based construction is sketched below.
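One widely used way to construct such sensitivity-based examples is the fast gradient sign method. The sketch below is illustrative only; the PyTorch-style model, the `loss_fn`, and the value of `eps` are assumptions, and it is not claimed to be the exact recipe used in the paper's experiments.

```python
import torch

def fgsm_examples(model, loss_fn, x, y, eps=0.01):
    """Construct adversarial examples by perturbing the inputs along the sign
    of the input gradient (fast gradient sign method)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)
    loss.backward()
    with torch.no_grad():
        x_adv = x_adv + eps * x_adv.grad.sign()
    return x_adv.detach()
```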
Whether adversarial training, or any other form of robust training, can be used to improve the performance of large-batch training remains to be verified.

### Summary and discussion

* Numerical experiments show that large-batch training does tend to converge to sharp local minimizers, which hurts generalization. The paper tries data augmentation and conservative training to mitigate this; both can reduce the degradation in generalization to some extent, but neither solves the underlying problem of convergence to sharp minimizers.
* Potentially promising approaches include dynamic sampling or a switching-based strategy: use small batches in the initial epochs and then gradually or abruptly switch to a large-batch method, e.g., as in [7]. Warm-starting with an SB method and then combining it with other techniques is a direction for future research (a sketch of such a schedule follows this list).
* Much earlier work has shown that the loss surface in deep learning has many local minimizers, most of similar depth. This paper shows that minimizers of similar depth can differ in sharpness; the relative sharpness and frequency of minimizers on the loss surface remains to be explored theoretically.
* Open questions: (a) Can one prove that large-batch methods converge to sharp minimizers? (b) What is the relative density of the two kinds of minimizers? (c) Can networks be designed so as to resolve the sensitivity problem of LB methods? (d) Can networks be initialized in a way that lets LB methods succeed? (e) Can LB methods be steered away from sharp minimizers by algorithmic or regulatory means?

If the LB problem can be solved, an obvious payoff is the ability to parallelize deep learning training to a large degree on multi-core CPUs, and even multi-node CPU machines, greatly reducing training time.
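A minimal sketch of the switching-based idea mentioned above, assuming PyTorch DataLoaders; the switch epoch and the two batch sizes are illustrative choices, not values from the paper or from [7].

```python
from torch.utils.data import DataLoader

def make_loader(dataset, epoch, switch_epoch=20, small_bs=256, large_bs=8192):
    """Switching-based strategy: small batches during the first epochs
    (exploration phase), large batches afterwards (parallel-friendly steps)."""
    bs = small_bs if epoch < switch_epoch else large_bs
    return DataLoader(dataset, batch_size=bs, shuffle=True)

# Training loop skeleton (model, optimizer, dataset, train_one_epoch assumed):
# for epoch in range(num_epochs):
#     loader = make_loader(dataset, epoch)
#     train_one_epoch(model, optimizer, loader)
```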


