
Training Very Deep Networks: Paper Notes

Abstract
Theoretical and empirical evidence indicates that the depth of neural networks is crucial for their success. However, training becomes more difficult as depth increases, and training of very deep networks remains an open problem. Here we introduce a new architecture designed to overcome this. Our so-called highway networks allow unimpeded information flow across many layers on information highways. They are inspired by Long Short-Term Memory recurrent networks and use adaptive gating units to regulate the information flow. Even with hundreds of layers, highway networks can be trained directly through simple gradient descent. This enables the study of extremely deep and efficient architectures.

Abstract (translation)
Theoretical and empirical evidence indicates that the depth of neural networks is crucial to their performance. However, training becomes more difficult as depth increases, and training very deep networks remains an open problem. Here we introduce a new architecture designed to overcome this. We call it the highway network; it allows information to flow unimpeded across many layers along information highways. The network is inspired by LSTM and uses adaptive gating units to regulate the information flow. Even with hundreds of layers, highway networks can be trained directly with simple gradient descent. This makes it possible to study extremely deep and efficient architectures.

2 Highway Networks
Notation We use boldface letters for vectors and matrices, and italicized capital letters to denote transformation functions. $\mathbf{0}$ and $\mathbf{1}$ denote vectors of zeros and ones respectively, and $I$ denotes an identity matrix. The function $\sigma(x)$ is defined as $\sigma(x) = \frac{1}{1+e^{-x}}$, $x \in \mathbb{R}$. The dot operator ($\cdot$) is used to denote element-wise multiplication.
A plain feedforward neural network typically consists of $L$ layers where the $l^{th}$ layer ($l \in \{1, 2, \ldots, L\}$) applies a non-linear transformation $H$ (parameterized by $W_{H,l}$) on its input $x_l$ to produce its output $y_l$. Thus, $x_1$ is the input to the network and $y_L$ is the network's output. Omitting the layer index and biases for clarity,
$$y = H(x, W_H) \tag{1}$$
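As a minimal sketch of Equation (1), assuming (as noted just below) that $H$ is an affine transform followed by a simple nonlinearity such as ReLU; the names `plain_layer`, `W_H`, and `b_H` are illustrative, not from the paper:

```python
import numpy as np

def plain_layer(x, W_H, b_H):
    """Plain layer of Eq. (1): y = H(x, W_H), here an affine map followed by ReLU."""
    return np.maximum(0.0, W_H @ x + b_H)

# Example: a 4-dimensional input passed through one plain layer.
rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W_H = rng.standard_normal((4, 4))
b_H = np.zeros(4)
y = plain_layer(x, W_H, b_H)
```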
$H$ is usually an affine transform followed by a non-linear activation function, but in general it may take other forms, possibly convolutional or recurrent. For a highway network, we additionally define two non-linear transforms $T(x, W_T)$ and $C(x, W_C)$ such that
$$y = H(x, W_H) \cdot T(x, W_T) + x \cdot C(x, W_C) \tag{2}$$
We refer to $T$ as the transform gate and $C$ as the carry gate, since they express how much of the output is produced by transforming the input and carrying it, respectively. For simplicity, in this paper we set $C = 1 - T$, giving
$$y = H(x, W_H) \cdot T(x, W_T) + x \cdot \left(1 - T(x, W_T)\right) \tag{3}$$
The dimensionality of $x$, $y$, $H(x, W_H)$ and $T(x, W_T)$ must be the same for Equation (3) to be valid.
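For concreteness, here is a minimal NumPy sketch of the coupled highway layer in Equation (3), assuming the same affine-plus-ReLU form for $H$ and an affine-plus-sigmoid transform gate $T$; the function and parameter names (`highway_layer`, `W_T`, `b_T`) are illustrative rather than taken from the paper.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_H, b_H, W_T, b_T):
    """Highway layer of Eq. (3): y = H(x) * T(x) + x * (1 - T(x)), element-wise."""
    H = np.maximum(0.0, W_H @ x + b_H)   # block transform H(x, W_H)
    T = sigmoid(W_T @ x + b_T)           # transform gate T(x, W_T), values in (0, 1)
    return H * T + x * (1.0 - T)         # carry gate C = 1 - T

# Stack several layers; x, y, H and T must all share one dimensionality (see above).
rng = np.random.default_rng(0)
dim, depth = 8, 5
x = rng.standard_normal(dim)
for _ in range(depth):
    W_H = 0.1 * rng.standard_normal((dim, dim))
    W_T = 0.1 * rng.standard_normal((dim, dim))
    x = highway_layer(x, W_H, np.zeros(dim), W_T, np.full(dim, -2.0))
```

The negative transform-gate bias used in the loop keeps $T$ small at first, so each layer initially favors carrying its input, consistent with the negative $b_T$ initialization the paper recommends.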
Note that this layer transformation is much more flexible than Equation (1). In particular, observe that for particular values of $T$,

$$y = \begin{cases} x, & \text{if } T(x, W_T) = 0 \\ H(x, W_H), & \text{if } T(x, W_T) = 1 \end{cases} \tag{4}$$

Similarly, for the Jacobian of the layer transform,
$$\frac{dy}{dx} = \begin{cases} I, & \text{if } T(x, W_T) = 0 \\ H'(x, W_H), & \text{if } T(x, W_T) = 1 \end{cases} \tag{5}$$
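A short, self-contained numeric check of the two limiting cases in Equations (4) and (5): saturating the gate near 0 makes the layer copy its input (so its Jacobian is the identity), while saturating it near 1 reduces the layer to $H(x, W_H)$. Here $H$ is again assumed to be affine plus ReLU with zero bias; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 6
x = rng.standard_normal(dim)
W_H = rng.standard_normal((dim, dim))
H = np.maximum(0.0, W_H @ x)       # H(x, W_H): affine + ReLU, zero bias

# Gate saturated near 0: output is x itself (Eq. 4) and dy/dx is I (Eq. 5).
T = 1.0 / (1.0 + np.exp(50.0))     # sigma(-50) is approximately 0
y = H * T + x * (1.0 - T)
assert np.allclose(y, x)

# Gate saturated near 1: output is H(x, W_H) and dy/dx is H'(x, W_H).
T = 1.0 / (1.0 + np.exp(-50.0))    # sigma(50) is approximately 1
y = H * T + x * (1.0 - T)
assert np.allclose(y, H)
```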
Thus, depending on the output of the transform gates, a highway layer can smoothly vary its behavior between that of $H$ and that of a layer which simply passes its inputs through. Just as a plain layer consists of multiple computing units such that the $i^{th}$ unit computes $y_i = H_i(x)$, a highway network consists of multiple blocks such that the $i^{th}$