
Why are deep neural networks hard to train?

Imagine you're an engineer who has been asked to design a computer from scratch. One day you're working away in your office, designing logical circuits, setting out AND gates, OR gates, and so on, when your boss walks in with bad news. The customer has just added a surprising design requirement: the circuit for the entire computer must be just two layers deep:

You're dumbfounded, and tell your boss: "The customer is crazy!"

Your boss replies: "I think they're crazy, too. But what the customer wants, they get."

In fact, there's a limited sense in which the customer isn't crazy. Suppose you're allowed to use a special logical gate which lets you AND together as many inputs as you want. And you're also allowed a many-input NAND gate, that is, a gate which can AND multiple inputs and then negate the output. With these special gates it turns out to be possible to compute any function at all using a circuit that's just two layers deep.
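To make this concrete, here's a minimal Python sketch of the classic sum-of-products construction (my own illustration, using negated inputs rather than NAND gates, but the two-layer idea is the same): one many-input AND per input row where the function is 1, feeding a single many-input OR.

from itertools import product

def two_layer_circuit(truth_table):
    """Build a two-layer evaluator for an arbitrary Boolean function,
    given as a dict mapping input tuples to 0/1 outputs."""
    # Each input row where the function outputs 1 gets its own AND gate.
    true_rows = [bits for bits, out in truth_table.items() if out]
    def evaluate(bits):
        # Layer 1: a many-input AND per true row, matching each input bit
        # against the row (i.e., negating inputs where the row wants a 0).
        ands = [all(b == want for b, want in zip(bits, row)) for row in true_rows]
        # Layer 2: a single many-input OR over all the AND outputs.
        return any(ands)
    return evaluate

# Example: the two-layer circuit reproduces 3-bit parity exactly.
parity = {bits: sum(bits) % 2 for bits in product((0, 1), repeat=3)}
f = two_layer_circuit(parity)
assert all(f(bits) == bool(out) for bits, out in parity.items())

The catch is the gate count: the AND layer needs one gate per true row, which for a function like parity means 2^(n-1) gates on n inputs. We'll meet that exponential blowup again in a moment.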

But just because something is possible doesn't make it a good idea. In practice, when solving circuit design problems (or most any kind of algorithmic problem), we usually start by figuring out how to solve sub-problems, and then gradually integrate the solutions. In other words, we build up to a solution through multiple layers of abstraction.

For instance, suppose we're designing a logical circuit to multiply two numbers. Chances are we want to build it up out of sub-circuits doing operations like adding two numbers. The sub-circuits for adding two numbers will, in turn, be built up out of sub-sub-circuits for adding two bits. Very roughly speaking our circuit will look like:

That is, our final circuit contains at least three layers of circuit elements. In fact, it'll probably contain more than three layers, as we break the sub-tasks down into smaller units than I've described. But you get the general idea.
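To see the layering in miniature, here's a sketch (in Python rather than gates; the bit ordering and helper names are just for illustration) of the addition part: a one-bit full adder as the sub-sub-circuit, chained into a ripple-carry adder as the sub-circuit.

def full_adder(a, b, carry_in):
    """The sub-sub-circuit: add two bits plus a carry, using basic gates."""
    s = a ^ b ^ carry_in                         # sum bit
    carry_out = (a & b) | (carry_in & (a ^ b))   # carry bit
    return s, carry_out

def ripple_add(x_bits, y_bits):
    """The sub-circuit: add two equal-length numbers, least-significant
    bit first, by chaining full adders."""
    out, carry = [], 0
    for a, b in zip(x_bits, y_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out + [carry]

# 3 + 5 = 8, with bits written least-significant first.
assert ripple_add([1, 1, 0], [1, 0, 1]) == [0, 0, 0, 1]

A multiplier would then be one more layer up, built by shifting and ripple-adding partial products.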

So deep circuits make the process of design easier. But they're not just helpful for design. There are, in fact, mathematical proofs showing that for some functions very shallow circuits require exponentially more circuit elements to compute than do deep circuits. For instance, a famous series of papers in the early 1980s showed that computing the parity of a set of bits requires exponentially many gates, if done with a shallow circuit. (The history is somewhat complex, so I won't give detailed references; see Johan Håstad's 2012 paper On the correlation of parity and small-depth circuits for an account of the early history and references.) On the other hand, if you use deeper circuits it's easy to compute the parity using a small circuit: you just compute the parity of pairs of bits, then use those results to compute the parity of pairs of pairs of bits, and so on, building up quickly to the overall parity. Deep circuits thus can be intrinsically much more powerful than shallow circuits.
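Here's a minimal sketch of that pairwise scheme, with Python's ^ (XOR) standing in for a small fixed-depth two-bit parity sub-circuit; the number of levels grows only logarithmically in the number of bits, and the total gate count stays linear.

def deep_parity(bits):
    """Compute parity with a logarithmic-depth reduction: XOR pairs of
    bits, then pairs of pairs, and so on."""
    values = list(bits)
    while len(values) > 1:
        # An odd element out is carried up to the next level unchanged.
        carry = [values[-1]] if len(values) % 2 else []
        values = [a ^ b for a, b in zip(values[::2], values[1::2])] + carry
    return values[0]

assert deep_parity([1, 0, 1, 1]) == 1      # three 1s: odd parity
assert deep_parity([1, 0, 1, 1, 1]) == 0   # four 1s: even parity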

Up to now, this book has approached neural networks like the crazy customer. Almost all the networks we've worked with have just a single hidden layer of neurons (plus the input and output layers):

These simple networks have been remarkably useful: in earlier chapters we used networks like this to classify handwritten digits with better than 98 percent accuracy! Nonetheless, intuitively we'd expect networks with many more hidden layers to be more powerful:

Such networks could use the intermediate layers to build up multiple layers of abstraction, just as we do in Boolean circuits. For instance, if we're doing visual pattern recognition, then the neurons in the first layer might learn to recognize edges, the neurons in the second layer could learn to recognize more complex shapes, say triangles or rectangles, built up from edges. The third layer would then recognize still more complex shapes. And so on. These multiple layers of abstraction seem likely to give deep networks a compelling advantage in learning to solve complex pattern recognition problems. Moreover, just as in the case of circuits, there are theoretical results suggesting that deep networks are intrinsically more powerful than shallow networks. (For certain problems and network architectures this is proved in On the number of response regions of deep feed forward networks with piece-wise linear activations, by Razvan Pascanu, Guido Montúfar, and Yoshua Bengio (2014); see also the more informal discussion in section 2 of Learning deep architectures for AI, by Yoshua Bengio (2009).)

How can we train such deep networks? In this chapter, we'll try training deep networks using our workhorse learning algorithm: stochastic gradient descent by backpropagation. But we'll run into trouble, with our deep networks not performing much (if at all) better than shallow networks.

That failure seems surprising in the light of the discussion above. Rather than give up on deep networks, we'll dig down and try to understand what's making our deep networks hard to train. When we look closely, we'll discover that the different layers in our deep network are learning at vastly different speeds. In particular, when later layers in the network are learning well, early layers often get stuck during training, learning almost nothing at all. This stuckness isn't simply due to bad luck. Rather, we'll discover there are fundamental reasons the learning slowdown occurs, connected to our use of gradient-based learning techniques.

As we delve into the problem more deeply, we'll learn that the opposite phenomenon can also occur: the early layers may be learning well, but later layers can become stuck. In fact, we'll find that there's an intrinsic instability associated to learning by gradient descent in deep, many-layer neural networks. This instability tends to result in either the early or the later layers getting stuck during training.

This all sounds like bad news. But by delving into these difficulties, we can begin to gain insight into what's required to train deep networks effectively. And so these investigations are good preparation for the next chapter, where we'll use deep learning to attack image recognition problems.

So, what goes wrong when we try to train a deep network?

To answer that question, let's first revisit the case of a network with just a single hidden layer. As per usual, we'll use the MNIST digit classification problem as our playground for learning and experimentation. (I introduced the MNIST problem and data here and here.)

If you wish, you can follow along by training networks on your computer. It is also, of course, fine to just read along. If you do wish to follow live, then you'll need Python 2.7, Numpy, and a copy of the code, which you can get by cloning the relevant repository from the command line:

git clone https://github.com/mnielsen/neural-networks-and-deep-learning.git
If you don't use git then you can download the data and code here. You'll need to change into the src subdirectory.
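For instance, after cloning, you'd change into that subdirectory with something like:

cd neural-networks-and-deep-learning/src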

Then, from a Python shell we load the MNIST data:

>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()

We set up our network:

>>> import network2
>>> net = network2.Network([784, 30, 10])

This network has 784 neurons in the input layer, corresponding to the 28×28 = 784 pixels in the input image. We use 30 hidden neurons, as well as 10 output neurons, corresponding to the 10 possible classifications for the MNIST digits ('0', '1', '2', …, '9').

Let's try training our network for 30 complete epochs, using mini-batches of 10 training examples at a time, a learning rate η=0.1, and regularization parameter λ=5.0. As we train we'll monitor the classification accuracy on the validation_data (note that the network is likely to take some minutes to train, depending on the speed of your machine; if you're running the code you may wish to continue reading and return later, rather than waiting for it to finish executing):

>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0, 
... evaluation_data=validation_data, monitor_evaluation_accuracy=True)

We get a classification accuracy of 96.48 percent (or thereabouts - it'll vary a bit from run to run), comparable to our earlier results with a similar configuration.

Now, let's add another hidden layer, also with 30 neurons in it, and try training with the same hyper-parameters:

>>> net = network2.Network([784, 30, 30, 10])
>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0, 
... evaluation_data=validation_data, monitor_evaluation_accuracy=True)

This gives an improved classification accuracy, 96.90 percent. That's encouraging: a little more depth is helping. Let's add another 30-neuron hidden layer:

>>> net = network2.Network([784, 30, 30, 30, 10])
>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0, 
... evaluation_data=validation_data, monitor_evaluation_accuracy=True)

That doesn't help at all. In fact, the result drops back down to 96.57 percent, close to our original shallow network. And suppose we insert one further hidden layer:

>>> net = network2.Network([784, 30, 30, 30, 30, 10])
>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0, 
... evaluation_data=validation_data, monitor_evaluation_accuracy=True)

The classification accuracy drops again, to 96.53 percent. That's probably not a statistically significant drop, but it's not encouraging, either.

This behaviour seems strange. Intuitively, extra hidden layers ought to make the network able to learn more complex classification functions, and thus do a better job classifying. Certainly, things shouldn't get worse, since the extra layers can, in the worst case, simply do nothing. (See this later problem to understand how to build a hidden layer that does nothing.) But that's not what's going on.

So what is going on? Let's assume that the extra hidden layers really could help in principle, and the problem is that our learning algorithm isn't finding the right weights and biases. We'd like to figure out what's going wrong in our learning algorithm, and how to do better.

To get some insight into what's going wrong, let's visualize how the network learns. Below, I've plotted part of a [784, 30, 30, 10] network, i.e., a network with two hidden layers, each containing 30 hidden neurons. Each neuron in the diagram has a little bar on it, representing how quickly that neuron is changing as the network learns. A big bar means the neuron's weights and bias are changing rapidly, while a small bar means the weights and bias are changing slowly. More precisely, the bars denote the gradient ∂C/∂b for each neuron, i.e., the rate of change of the cost with respect to the neuron's bias. Back in Chapter 2 we saw that this gradient quantity controlled not just how rapidly the bias changes during learning, but also how rapidly the weights input to the neuron change, too. Don't worry if you don't recall the details: the thing to keep in mind is simply that these bars show how quickly each neuron's weights and bias are changing as the network learns.

To keep the diagram simple, I've shown just the top six neurons in the two hidden layers. I've omitted the input neurons, since they've got no weights or biases to learn. I've also omitted the output neurons, since we're doing layer-wise comparisons, and it makes most sense to compare layers with the same number of neurons. The results are plotted at the very beginning of training, i.e., immediately after the network is initialized. Here they are (the data plotted is generated using the program generate_gradient.py; the same program is also used to generate the results quoted later in this section):

The network was initialized randomly, and so it's not surprising that there's a lot of variation in how rapidly the neurons learn. Still, one thing that jumps out is that the bars in the second hidden layer are mostly much larger than the bars in the first hidden layer. As a result, the neurons in the second hidden layer will learn quite a bit faster than the neurons in the first hidden layer. Is this merely a coincidence, or are the neurons in the second hidden layer likely to learn faster than neurons in the first hidden layer in general?

To determine whether this is the case, it helps to have a global way of comparing the speed of learning in the first and second hidden layers. To do this, let's denote the gradient as δ^l_j = ∂C/∂b^l_j, i.e., the gradient for the jth neuron in the lth layer. (Back in Chapter 2 we referred to this as the error, but here we'll adopt the informal term "gradient". I say "informal" because of course this doesn't explicitly include the partial derivatives of the cost with respect to the weights, ∂C/∂w.) We can think of the gradient δ¹ as a vector whose entries determine how quickly the first hidden layer learns, and δ² as a vector whose entries determine how quickly the second hidden layer learns. We'll then use the lengths of these vectors as (rough!) global measures of the speed at which the layers are learning. So, for instance, the length ∥δ¹∥ measures the speed at which the first hidden layer is learning, while the length ∥δ²∥ measures the speed at which the second hidden layer is learning.

With these definitions, and in the same configuration as was plotted above, we find ∥δ¹∥ = 0.07… and ∥δ²∥ = 0.31…, so the neurons in the second hidden layer really are learning much faster than the neurons in the first hidden layer.
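If you'd like to reproduce numbers of this kind yourself, here's a minimal sketch in the spirit of generate_gradient.py (the details below are my own, not necessarily the program's): it assumes, as in the book's repository, that network2.Network.backprop(x, y) returns the per-layer gradients (nabla_b, nabla_w), and averages the bias gradients over a sample of training examples before taking lengths.

import numpy as np
import mnist_loader
import network2

training_data, _, _ = mnist_loader.load_data_wrapper()
net = network2.Network([784, 30, 30, 10])

# Accumulate ∂C/∂b over a sample of examples; backprop returns one
# gradient array per layer after the input layer.
sample = training_data[:1000]  # assumes a list, as in the Python 2.7 code
nabla_b = [np.zeros(b.shape) for b in net.biases]
for x, y in sample:
    delta_nabla_b, _ = net.backprop(x, y)
    nabla_b = [nb + dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
nabla_b = [nb / len(sample) for nb in nabla_b]

# Each layer's rough "speed of learning" is the length of its gradient vector.
print(np.linalg.norm(nabla_b[0]))  # first hidden layer: ∥δ¹∥
print(np.linalg.norm(nabla_b[1]))  # second hidden layer: ∥δ²∥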
