
ResNet論文翻譯——中英文對照

文章作者:Tyan
部落格:noahsnail.com  |  CSDN  |  簡書

Deep Learning

宣告:作者翻譯論文僅為學習,如有侵權請聯絡作者刪除博文,謝謝!

Deep Residual Learning for Image Recognition

Abstract

Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers——8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.

摘要

更深的神經網路更難訓練。我們提出了一種殘差學習框架,用來減輕比以前使用的網路深得多的網路的訓練。我們明確地將這些層重新表述為參考層輸入來學習殘差函式,而不是學習未參考的函式。我們提供了全面的經驗證據,表明這些殘差網路更容易優化,並且可以從顯著增加的深度中獲得準確率提升。在ImageNet資料集上,我們評估了深度高達152層的殘差網路,比VGG[40]深8倍,但仍具有較低的複雜度。這些殘差網路的整合在ImageNet測試集上取得了3.57%的錯誤率。這個結果在ILSVRC 2015分類任務上贏得了第一名。我們也在CIFAR-10上對100層和1000層的殘差網路進行了分析。

The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

對於許多視覺識別任務而言,表示的深度是至關重要的。僅憑藉我們極深的表示,我們便在COCO目標檢測資料集上獲得了28%的相對提升。深度殘差網路是我們提交到ILSVRC和COCO 2015競賽的參賽方案的基礎,我們還在ImageNet檢測、ImageNet定位、COCO檢測和COCO分割任務上贏得了第一名。

1. Introduction

Deep convolutional neural networks [22, 21] have led to a series of breakthroughs for image classification [21, 49, 39]. Deep networks naturally integrate low/mid/high-level features [49] and classifiers in an end-to-end multi-layer fashion, and the “levels” of features can be enriched by the number of stacked layers (depth). Recent evidence [40, 43] reveals that network depth is of crucial importance, and the leading results [40, 43, 12, 16] on the challenging ImageNet dataset [35] all exploit “very deep” [40] models, with a depth of sixteen [40] to thirty [16]. Many other non-trivial visual recognition tasks [7, 11, 6, 32, 27] have also greatly benefited from very deep models.

1. 引言

深度卷積神經網路[22, 21]為影象分類[21, 49, 39]帶來了一系列突破。深度網路以端到端的多層方式自然地整合了低/中/高階特徵[49]和分類器,特徵的“級別”可以通過堆疊層的數量(深度)來豐富。最近的證據[40, 43]顯示網路深度至關重要,在具有挑戰性的ImageNet資料集[35]上領先的結果[40, 43, 12, 16]都採用了“非常深”[40]的模型,深度從16層[40]到30層[16]不等。許多其它重要的視覺識別任務[7, 11, 6, 32, 27]也從非常深的模型中極大地受益。

Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanishing/exploding gradients [14, 1, 8], which hamper convergence from the beginning. This problem, however, has been largely addressed by normalized initialization [23, 8, 36, 12] and intermediate normalization layers [16], which enable networks with tens of layers to start converging for stochastic gradient descent (SGD) with backpropagation [22].

在深度重要性的推動下,出現了一個問題:學習更好的網路是否像堆疊更多的層一樣容易?回答這個問題的一個障礙是梯度消失/爆炸[14, 1, 8]這個眾所周知的問題,它從一開始就阻礙了收斂。然而,這個問題在很大程度上已經通過標準化初始化[23, 8, 36, 12]和中間標準化層[16]得到解決,這使得數十層的網路能夠在帶反向傳播的隨機梯度下降(SGD)下開始收斂。
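
As a concrete illustration of such a normalized initialization scheme, the sketch below applies the He ("MSRA") initialization of [12] to every convolutional and fully-connected layer of a PyTorch model; this is only a minimal sketch, and the helper name `init_weights` and the model variable `net` are my own, not from the paper.

```python
import torch.nn as nn

def init_weights(module):
    """He/'MSRA' init: keeps activation variance roughly constant across layers."""
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(module.weight, mode="fan_out", nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# Usage (hypothetical model `net`): apply recursively to every submodule.
# net.apply(init_weights)
```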

When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error, as reported in [10, 41] and thoroughly verified by our experiments. Fig. 1 shows a typical example.


Figure 1. Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer “plain” networks. The deeper network has higher training error, and thus test error. Similar phenomena on ImageNet are presented in Fig. 4.

當更深的網路能夠開始收斂時,一個退化問題就暴露出來了:隨著網路深度的增加,準確率先達到飽和(這可能並不令人意外),然後迅速退化。出乎意料的是,這種退化不是由過擬合引起的,並且在適當深度的模型上新增更多的層會導致更高的訓練誤差,正如[10, 41]中報告的那樣,並被我們的實驗充分驗證。圖1顯示了一個典型的例子。


圖1. 20層和56層的“簡單”網路在CIFAR-10上的訓練誤差(左)和測試誤差(右)。更深的網路有更高的訓練誤差,從而也有更高的測試誤差。ImageNet上的類似現象如圖4所示。

The degradation (of training accuracy) indicates that not all systems are similarly easy to optimize. Let us consider a shallower architecture and its deeper counterpart that adds more layers onto it. There exists a solution by construction to the deeper model: the added layers are identity mapping, and the other layers are copied from the learned shallower model. The existence of this constructed solution indicates that a deeper model should produce no higher training error than its shallower counterpart. But experiments show that our current solvers on hand are unable to find solutions that are comparably good or better than the constructed solution (or unable to do so in feasible time).

退化(訓練準確率的退化)表明並非所有的系統都同樣容易優化。讓我們考慮一個較淺的架構,以及在其上新增更多層所得到的較深架構。對於這個較深的模型,存在一個通過構建得到的解:新增的層是恆等對映,其他層從已學習的較淺模型中拷貝而來。這種構建解的存在表明,較深的模型不應該產生比其較淺版本更高的訓練誤差。但是實驗表明,我們目前手頭的求解器無法找到與這種構建解相當或者更好的解(或者無法在可行的時間內做到)。
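
The construction argument can be made concrete with a small sketch: copy a trained shallower network and append a layer initialized as an exact identity mapping; the deeper model then computes exactly the same function, so its training error can be no higher. This is my own toy illustration in PyTorch (layer sizes are arbitrary), not code from the paper.

```python
import torch
import torch.nn as nn

# A (toy) "learned" shallower model.
shallow = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))

# Construct the deeper counterpart: reuse the shallow layers and append an
# extra layer whose weights realize an exact identity mapping.
identity_layer = nn.Linear(16, 16)
with torch.no_grad():
    identity_layer.weight.copy_(torch.eye(16))
    identity_layer.bias.zero_()

deeper = nn.Sequential(*shallow, identity_layer)

x = torch.randn(2, 16)
print(torch.allclose(deeper(x), shallow(x)))  # True: identical outputs, hence identical training error
```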

In this paper, we address the degradation problem by introducing a deep residual learning framework. Instead of hoping each few stacked layers directly fit a desired underlying mapping, we explicitly let these layers fit a residual mapping. Formally, denoting the desired underlying mapping as H(x), we let the stacked nonlinear layers fit another mapping of F(x) := H(x) - x. The original mapping is recast into F(x) + x. We hypothesize that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. To the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers.

在本文中,我們通過引入深度殘差學習框架解決了退化問題。我們明確地讓這些層擬合殘差對映,而不是希望每幾個堆疊的層直接擬合期望的基礎對映。形式上,將期望的基礎對映表示為H(x),我們讓堆疊的非線性層擬合另一個對映F(x) := H(x) - x。原始的對映重寫為F(x) + x。我們假設殘差對映比原始的、未參考的對映更容易優化。在極端情況下,如果恆等對映是最優的,那麼將殘差置為零比通過一堆非線性層來擬合恆等對映更容易。

The formulation of F(x)+x can be realized by feedforward neural networks with “shortcut connections” (Fig. 2). Shortcut connections [2, 33, 48] are those skipping one or more layers. In our case, the shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers (Fig. 2). Identity shortcut connections add neither extra parameter nor computational complexity. The entire network can still be trained end-to-end by SGD with backpropagation, and can be easily implemented using common libraries (e.g., Caffe [19]) without modifying the solvers.


Figure 2. Residual learning: a building block.

公式F(x) + x可以通過帶有“快捷連線”的前饋神經網路(圖2)來實現。快捷連線[2, 33, 48]是那些跳過一層或更多層的連線。在我們的例子中,快捷連線簡單地執行恆等對映,並將其輸出與堆疊層的輸出相加(圖2)。恆等快捷連線既不增加額外的引數也不增加計算複雜度。整個網路仍然可以通過帶反向傳播的SGD進行端到端的訓練,並且可以使用常用庫(例如,Caffe [19])輕鬆實現,而無需修改求解器。


圖2. 殘差學習:構建塊
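
To make the building block of Fig. 2 concrete, here is a minimal sketch of a residual block with an identity shortcut, written in PyTorch rather than the Caffe setup mentioned above; the class name and channel count are illustrative and the block assumes the input and output have the same shape, so this is not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Residual block: two 3x3 conv layers (the residual function F) plus an identity shortcut."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # F(x): the residual mapping learned by the stacked layers.
        residual = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        # Identity shortcut: a plain element-wise addition, no extra parameters.
        return self.relu(residual + x)

# Usage: the block maps a feature map to one of the same shape.
block = BasicBlock(64)
y = block(torch.randn(1, 64, 56, 56))  # -> shape (1, 64, 56, 56)
```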

We present comprehensive experiments on ImageNet [35] to show the degradation problem and evaluate our method. We show that: 1) Our extremely deep residual nets are easy to optimize, but the counterpart “plain” nets (that simply stack layers) exhibit higher training error when the depth increases; 2) Our deep residual nets can easily enjoy accuracy gains from greatly increased depth, producing results substantially better than previous networks.

我們在ImageNet[35]上進行了綜合實驗來顯示退化問題並評估我們的方法。我們發現:1)我們極深的殘差網路易於優化,但當深度增加時,對應的“簡單”網路(簡單堆疊層)表現出更高的訓練誤差;2)我們的深度殘差網路可以從大大增加的深度中輕鬆獲得準確性收益,生成的結果實質上比以前的網路更好。

Similar phenomena are also shown on the CIFAR-10 set [20], suggesting that the optimization difficulties and the effects of our method are not just akin to a particular dataset. We present successfully trained models on this dataset with over 100 layers, and explore models with over 1000 layers.

CIFAR-10資料集上[20]也顯示出類似的現象,這表明了優化的困難以及我們的方法的影響不僅僅是針對一個特定的資料集。我們在這個資料集上展示了成功訓練的超過100層的模型,並探索了超過1000層的模型。

On the ImageNet classification dataset [35], we obtain excellent results by extremely deep residual nets. Our 152-layer residual net is the deepest network ever presented on ImageNet, while still having lower complexity than VGG nets [40]. Our ensemble has 3.57% top-5 error on the ImageNet test set, and won the 1st place in the ILSVRC 2015 classification competition. The extremely deep representations also have excellent generalization performance on other recognition tasks, and lead us to further win the 1st places on: ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation in ILSVRC & COCO 2015 competitions. This strong evidence shows that the residual learning principle is generic, and we expect that it is applicable in other vision and non-vision problems.

在ImageNet分類資料集[35]中,我們通過非常深的殘差網路獲得了很好的結果。我們的152層殘差網路是ImageNet上最深的網路,同時還具有比VGG網路[40]更低的複雜性。我們的模型集合在ImageNet測試集上有3.57%的top-5錯誤率,並在ILSVRC 2015分類比賽中獲得了第一名。極深的表示在其它識別任務中也有極好的泛化效能,並帶領我們進一步贏得了第一名:包括ILSVRC & COCO 2015競賽中的ImageNet檢測,ImageNet定位,COCO檢測和COCO分割。這些有力的證據表明殘差學習準則是通用的,並且我們期望它適用於其它的視覺和非視覺問題。

2. Related Work

Residual Representations. In image recognition, VLAD [18] is a representation that encodes by the residual vectors with respect to a dictionary, and Fisher Vector [30] can be formulated as a probabilistic version [18] of VLAD. Both of them are powerful shallow representations for image retrieval and classification [4, 47]. For vector quantization, encoding residual vectors [17] is shown to be more effective than encoding original vectors.

2. 相關工作

殘差表示。在影象識別中,VLAD[18]是一種通過關於字典的殘差向量進行編碼的表示形式,Fisher向量[30]可以表示為VLAD的概率版本[18]。它們都是影象檢索和影象分類[4,47]中強大的淺層表示。對於向量量化,編碼殘差向量[17]被證明比編碼原始向量更有效。

In low-level vision and computer graphics, for solving Partial Differential Equations (PDEs), the widely used Multigrid method [3] reformulates the system as subproblems at multiple scales, where each subproblem is responsible for the residual solution between a coarser and a finer scale. An alternative to Multigrid is hierarchical basis preconditioning [44, 45], which relies on variables that represent residual vectors between two scales. It has been shown [3, 44, 45] that these solvers converge much faster than standard solvers that are unaware of the residual nature of the solutions. These methods suggest that a good reformulation or preconditioning can simplify the optimization.

在低階視覺和計算機圖形學中,為了求解偏微分方程(PDE),廣泛使用的Multigrid方法[3]將系統重構為多個尺度上的子問題,其中每個子問題負責較粗尺度和較細尺度之間的殘差解。Multigrid的一種替代方法是層次化基預處理[44,45],它依賴於表示兩個尺度之間殘差向量的變數。已有研究[3,44,45]表明,這些求解器比不知道解的殘差性質的標準求解器收斂得快得多。這些方法表明,良好的重構或預處理可以簡化優化過程。

Shortcut Connections. Practices and theories that lead to shortcut connections [2, 33, 48] have been studied for a long time. An early practice of training multi-layer perceptrons (MLPs) is to add a linear layer connected from the network input to the output [33, 48]. In [43, 24], a few intermediate layers are directly connected to auxiliary classifiers for addressing vanishing/exploding gradients. The papers of [38, 37, 31, 46] propose methods for centering layer responses, gradients, and propagated errors, implemented by shortcut connections. In [43], an “inception” layer is composed of a shortcut branch and a few deeper branches.

快捷連線。導致快捷連線[2,33,48]的實踐和理論已經被研究了很長時間。訓練多層感知機(MLP)的早期實踐是新增一個從網路輸入連線到輸出的線性層[33,48]。在[43,24]中,一些中間層直接連線到輔助分類器,用於解決梯度消失/爆炸問題。論文[38,37,31,46]提出了對層響應、梯度和傳播誤差進行中心化的方法,這些方法通過快捷連線來實現。在[43]中,一個“inception”層由一個快捷分支和一些更深的分支組成。

Concurrent with our work, “highway networks” [41, 42] present shortcut connections with gating functions [15]. These gates are data-dependent and have parameters, in contrast to our identity shortcuts that are parameter-free. When a gated shortcut is “closed” (approaching zero), the layers in highway networks represent non-residual functions. On the contrary, our formulation always learns residual functions; our identity shortcuts are never closed, and all information is always passed through, with additional residual functions to be learned. In addition, highway networks have not demonstrated accuracy gains with extremely increased depth (e.g., over 100 layers).

與我們的工作同時進行的“highway networks”[41, 42]提出了具有門控函式[15]的快捷連線。這些門是資料相關且有引數的,與之相反,我們的恆等快捷連線是無引數的。當門控快捷連線“關閉”(接近零)時,highway networks中的層表示非殘差函式。相反,我們的公式總是學習殘差函式;我們的恆等快捷連線永遠不會關閉,所有的資訊總是通過,同時還有額外的殘差函式要學習。此外,highway networks還沒有展示出極大增加的深度(例如,超過100層)所帶來的準確性收益。
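
To highlight the contrast drawn here, the following sketch (my own illustration, not code from either paper) places a highway-style gated unit next to a residual unit with an identity shortcut: the gate T(x) is data-dependent and parameterized, while the identity path is parameter-free and always open.

```python
import torch
import torch.nn as nn

class HighwayUnit(nn.Module):
    """Highway-style unit: y = T(x) * H(x) + (1 - T(x)) * x, with a learned, data-dependent gate T."""

    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)  # H(x)
        self.gate = nn.Linear(dim, dim)       # T(x): extra parameters for the shortcut

    def forward(self, x):
        h = torch.relu(self.transform(x))
        t = torch.sigmoid(self.gate(x))
        # When t -> 0 the shortcut is "closed" and the unit represents a non-residual function.
        return t * h + (1.0 - t) * x

class ResidualUnit(nn.Module):
    """Residual unit: y = F(x) + x, identity shortcut with no gating parameters."""

    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, dim)         # F(x)

    def forward(self, x):
        return torch.relu(self.fc(x)) + x     # the identity path is never closed
```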

3. Deep Residual Learning

3.1. Residual Learning

Let us consider H(x) as an underlying mapping to be fit by a few stacked layers (not necessarily the entire net), with x denoting the inputs to the first of these layers. If one hypothesizes that multiple nonlinear layers can asymptotically approximate complicated functions, then it is equivalent to hypothesize that they can asymptotically approximate the residual functions, i.e., H(x) - x (assuming that the input and output are of the same dimensions). So rather than expect stacked layers to approximate H(x), we explicitly let these layers approximate a residual function F(x) := H(x) - x. The original function thus becomes F(x) + x. Although both forms should be able to asymptotically approximate the desired functions (as hypothesized), the ease of learning might be different.

3. 深度殘差學習

3.1. 殘差學習

我們考慮H(x)作為由幾個堆疊層(不必是整個網路)要擬合的基礎對映,x表示這些層中第一層的輸入。假設多個非線性層可以漸近地近似複雜函式,那麼它等價於假設它們可以漸近地近似殘差函式,即H(x) - x(假設輸入和輸出具有相同的維度)。因此,我們明確讓這些層近似殘差函式F(x) := H(x) - x,而不是期望堆疊層近似H(x)。因此原始函式變為F(x) + x。儘管兩種形式應該都能漸近地近似要求的函式(如假設的那樣),但學習的難易程度可能是不同的。

This reformulation is motivated by the counterintuitive phenomena about the degradation problem (Fig. 1, left). As we discussed in the introduction, if the added layers can be constructed as identity mappings, a deeper model should have training error no greater than its shallower counterpart. The degradation problem suggests that the solvers might have difficulties in approximating identity mappings by multiple nonlinear layers. With the residual learning reformulation, if identity mappings are optimal, the solvers may simply drive the weights of the multiple nonlinear layers toward zero to approach identity mappings.

關於退化問題的反直覺現象激發了這種重構(圖1左)。正如我們在引言中討論的那樣,如果新增的層可以被構建為恆等對映,更深模型的訓練誤差應該不大於它對應的更淺版本。退化問題表明求解器通過多個非線性層來近似恆等對映可能有困難。通過殘差學習的重構,如果恆等對映是最優的,求解器可能簡單地將多個非線性層的權重推向零來接近恆等對映。
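
A quick numerical check of this point (my own toy example, not from the paper): if every weight of the residual branch is driven to zero, the block F(x) + x reduces exactly to the identity mapping.

```python
import torch
import torch.nn as nn

# Two stacked nonlinear layers acting as the residual branch F(x).
f = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))

# Drive all weights (and biases) of the residual branch to zero.
for p in f.parameters():
    nn.init.zeros_(p)

x = torch.randn(4, 8)
y = f(x) + x                 # residual block output
print(torch.allclose(y, x))  # True: with zero weights the block is exactly the identity mapping
```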

In real cases, it is unlikely that identity mappings are optimal, but our reformulation may help to precondition the problem. If the optimal function is closer to an identity mapping than to a zero mapping, it should be easier for the solver to find the perturbations with reference to an identity mapping, than to learn the function as a new one. We show by experiments (Fig. 7) that the learned residual functions in general have small responses, suggesting that identity mappings provide reasonable preconditioning.


Figure 7. Standard deviations (std) of layer responses on CIFAR-10. The responses are the outputs of each 3×3 layer, after BN and before nonlinearity. Top: the layers are shown in their original order. Bottom: the responses are ranked in descending order.

在實際情況下,恆等對映不太可能是最優的,但是我們的重構可能有助於對問題進行預處理。如果最優函式比零對映更接近於恆等對映,則求解器應該更容易找到相對於恆等對映的擾動,而不是將該函式作為一個新函式來學習。我們通過實驗(圖7)表明,學習到的殘差函式通常具有較小的響應,這表明恆等對映提供了合理的預處理。


圖7. CIFAR-10上層響應的標準差(std)。這些響應是每個3×3層在BN之後、非線性之前的輸出。上:按原始順序顯示各層。下:響應按降序排列。
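
The layer-response statistics of Fig. 7 can be gathered in the same spirit with forward hooks. The sketch below assumes a trained network `net` whose responses of interest are the outputs of BatchNorm2d layers (i.e., after BN and before the nonlinearity) and a CIFAR-10 `loader`; both names are hypothetical, and this is only a rough reconstruction, not the authors' measurement code.

```python
import torch
import torch.nn as nn

def collect_response_stds(net, loader):
    """Record the average std of each BatchNorm2d output (after BN, before the nonlinearity)."""
    stds, handles = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            stds.setdefault(name, []).append(output.detach().std().item())
        return hook

    for name, module in net.named_modules():
        if isinstance(module, nn.BatchNorm2d):
            handles.append(module.register_forward_hook(make_hook(name)))

    net.eval()
    with torch.no_grad():
        for images, _ in loader:
            net(images)

    for h in handles:
        h.remove()

    # One averaged std per layer, in the order the layers appear in the network.
    return {name: sum(v) / len(v) for name, v in stds.items()}
```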

3.2. Identity Mapping by Shortcuts

We adopt residual learning to every few stacked layers. A building block is shown in Fig. 2. Formally, in this paper we consider a building block defined as:

y = F(x, {W_i}) + x        (1)

Here x and y are the input and output vectors of the layers considered. The function F(x, {W_i}) represents the residual mapping to be learned. For the example in Fig. 2 that has two layers, F