Beginning deep learning with 500 lines of Julia

阿新 • Published: 2019-02-12

The two axes are w1 and w2, two parameters of our network, and the contour plot represents the loss with a minimum at x. If we start at x0, the Newton direction (in red) points almost towards the minimum, whereas the gradient (in green), perpendicular to the contours, points to the right.
Unfortunately Newton's direction is expensive to compute. However, it is also probably unnecessary for several reasons: (1) Newton gives us the ideal direction for second-degree objective functions, which our neural network loss almost certainly is not; (2) the loss function whose gradient backprop calculated is the loss for the last minibatch/instance only, which at best is a very noisy version of the real loss function, so we shouldn't spend too much effort getting it exactly right.
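To make the contrast between the two directions concrete, here is a minimal sketch on a toy quadratic loss, the one case where Newton's direction is exactly right. The matrix `A`, the vector `b`, the starting point, and the learning rate are all made-up values for illustration; none of this is KUnet code.

```python
# Toy quadratic loss L(w) = 0.5 * w^T A w - b^T w (assumed values, for illustration only).
A = [[10.0, 0.0], [0.0, 1.0]]   # Hessian: steep along w1, shallow along w2
b = [10.0, 1.0]                 # chosen so that the minimum is at w = (1, 1)

def gradient(w):
    """Gradient of the quadratic loss: A w - b."""
    return [A[0][0] * w[0] + A[0][1] * w[1] - b[0],
            A[1][0] * w[0] + A[1][1] * w[1] - b[1]]

def solve2x2(M, v):
    """Solve M x = v for a 2x2 matrix M using Cramer's rule."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(v[0] * M[1][1] - M[0][1] * v[1]) / det,
            (M[0][0] * v[1] - v[0] * M[1][0]) / det]

def newton_step(w):
    """Move along the Newton direction: w - H^{-1} g."""
    d = solve2x2(A, gradient(w))
    return [w[0] - d[0], w[1] - d[1]]

def gradient_step(w, lr):
    """Move along the negative gradient, scaled by a single learning rate."""
    g = gradient(w)
    return [w[0] - lr * g[0], w[1] - lr * g[1]]

w0 = [0.0, 0.0]
w_newton = newton_step(w0)        # lands on the minimum in one step
w_grad = gradient_step(w0, 0.1)   # overshoots nothing, but barely moves along w2

print("Newton step:  ", w_newton)
print("Gradient step:", w_grad)
```

On this bowl the Newton step reaches the minimum directly, while the gradient step makes good progress along the steep axis and very little along the shallow one. A real network loss is not quadratic, so the Newton step would not be this clean there.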
So people have come up with various approximate methods to improve the step direction. Instead of multiplying each component of the gradient by the same learning rate, these methods scale the components separately using their running average (momentum, Nesterov) or their RMS (Adagrad, Rmsprop). I realize this necessarily short summary barely covers what has been implemented in KUnet and doesn't do justice to the literature or cover most of the important ideas. The interested reader can start with a standard textbook on numerical optimization, and peruse the latest papers on optimization in deep learning.
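As a rough illustration of how these update rules differ from plain gradient descent, the sketch below writes out textbook forms of momentum, Adagrad, and Rmsprop on lists of floats. The function names, default hyperparameters, and the small `eps` constant are assumptions made for this example and are not taken from KUnet.

```python
import math

def sgd_momentum(w, g, v, lr=0.1, mu=0.9):
    """Classical momentum: the step is a decaying running sum of past gradients."""
    v_new = [mu * vi - lr * gi for vi, gi in zip(v, g)]
    w_new = [wi + vi for wi, vi in zip(w, v_new)]
    return w_new, v_new

def adagrad(w, g, s, lr=0.1, eps=1e-8):
    """Adagrad: divide each component by the root of its accumulated squared gradients."""
    s_new = [si + gi * gi for si, gi in zip(s, g)]
    w_new = [wi - lr * gi / (math.sqrt(si) + eps) for wi, gi, si in zip(w, g, s_new)]
    return w_new, s_new

def rmsprop(w, g, s, lr=0.1, rho=0.9, eps=1e-8):
    """Rmsprop: like Adagrad, but with an exponentially decaying average of squares."""
    s_new = [rho * si + (1.0 - rho) * gi * gi for si, gi in zip(s, g)]
    w_new = [wi - lr * gi / (math.sqrt(si) + eps) for wi, gi, si in zip(w, g, s_new)]
    return w_new, s_new

# A gradient whose components differ by two orders of magnitude.
g = [10.0, 0.1]

# Momentum applied twice with the same gradient: the second step is larger.
w_m, v = sgd_momentum([0.0, 0.0], g, [0.0, 0.0])
w_m, v = sgd_momentum(w_m, g, v)

# Adagrad equalizes the step size across components on the first step.
w_a, s = adagrad([0.0, 0.0], g, [0.0, 0.0])

print("momentum after two steps:", w_m)
print("adagrad after one step:  ", w_a)
```

With momentum the step keeps growing while the gradient keeps pointing the same way; with Adagrad both components move by roughly the same amount even though their raw gradients differ by a factor of 100.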
Minimize what? The final problem with gradient descent, other than not telling us the ideal step size or direction, is that it is not even minimizing the right objective! We want small loss on never before seen test data, not just on the training data. The truth is, a sufficiently large neural network with a good optimization algorithm can get arbitrarily low loss on any finite training data (e.g. by just memorizing the answers). And it can typically do so in many different ways (typically many different local minima for training loss in weight space exist). Some of those ways will generalize well to unseen data, some won't. And unseen data is (by definition) not seen, so how will we ever know which weight settings will do well on it? There are at least three ways people deal with this problem: (1) Bayes tells us that we should use all possible networks and weigh their answers by how well they do on training data (see Radford Neal's fbm), (2) new methods like dropout or adding distortions and noise to inputs and weights during training seem to help generalization, (3) pressuring the optimization to stay in one corner of the weight space (e.g. L1, L2, maxnorm regularization) helps generalization.
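The second and third approaches are easy to sketch. The example below shows one common form of dropout (the "inverted" variant that rescales the surviving inputs), the gradient terms that L1 and L2 penalties add, and a maxnorm projection. All function names and constants here are hypothetical choices for this illustration, and the exact variants KUnet uses may differ.

```python
import math
import random

def dropout(x, p, rng):
    """Inverted dropout: zero each input with probability p, scale survivors by 1/(1-p)."""
    keep = 1.0 - p
    return [xi / keep if rng.random() < keep else 0.0 for xi in x]

def regularized_gradient(w, g, l1=0.0, l2=0.0):
    """Add the L1 (sign of w) and L2 (proportional to w) penalty terms to the gradient."""
    def sign(v):
        return (v > 0) - (v < 0)
    return [gi + l1 * sign(wi) + l2 * wi for wi, gi in zip(w, g)]

def apply_maxnorm(w, maxnorm):
    """Rescale the weight vector if its Euclidean norm exceeds maxnorm."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    if norm <= maxnorm:
        return list(w)
    return [wi * maxnorm / norm for wi in w]

rng = random.Random(0)                 # fixed seed so the run is repeatable
dropped = dropout([1.0] * 1000, 0.5, rng)

w = [3.0, -4.0]
g_reg = regularized_gradient(w, [0.0, 0.0], l1=0.1, l2=0.01)
w_clipped = apply_maxnorm(w, 1.0)

print("mean activation after dropout:", sum(dropped) / len(dropped))
print("penalty-only gradient:", g_reg)
print("after maxnorm:", w_clipped)
```

Both penalties pull the weights toward zero, L1 by a constant amount and L2 in proportion to the weight, while maxnorm leaves the direction of the weight vector alone and only caps its length.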
KUnet views dropout (and other distortion methods) as a preprocessing step of each layer. The other techniques (learning rate, momentum, regularization, etc.) are declared per layer as UpdateParam's: l.pw holds the settings for the weights l.w, and l.pb holds those for the biases l.b.
