ImageNet Classification with Deep Convolutional Neural Networks論文翻譯——中英文對照

Deep Learning
文章作者:Tyan
部落格:noahsnail.com  |  CSDN  |  簡書

ImageNet Classification with Deep Convolutional Neural Networks

Abstract

We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called “dropout” that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

摘要

我們訓練了一個大型的深度卷積神經網路,將ImageNet LSVRC-2010競賽中的120萬張高解析度影象分到1000個不同的類別中。在測試資料上,我們取得了37.5%的top-1錯誤率和17.0%的top-5錯誤率,這個結果比先前的最好結果要好很多。這個神經網路有6000萬個引數和650,000個神經元,包含5個卷積層(其中一些後面帶有最大池化層)和3個全連線層,最後是一個1000維的softmax。為了使訓練更快,我們使用了非飽和神經元,並對卷積操作採用了非常高效的GPU實現。為了減少全連線層的過擬合,我們採用了一種最近提出的名為dropout的正則化方法,結果證明它非常有效。我們還用這個模型的一個變種參加了ILSVRC-2012競賽,以15.3%的top-5測試錯誤率贏得了冠軍,而第二名的成績是26.2%。

1 Introduction

Current approaches to object recognition make essential use of machine learning methods. To improve their performance, we can collect larger datasets, learn more powerful models, and use better techniques for preventing overfitting. Until recently, datasets of labeled images were relatively small – on the order of tens of thousands of images (e.g., NORB [16], Caltech-101/256 [8, 9], and CIFAR-10/100 [12]). Simple recognition tasks can be solved quite well with datasets of this size, especially if they are augmented with label-preserving transformations. For example, the current best error rate on the MNIST digit-recognition task (<0.3%) approaches human performance [4]. But objects in realistic settings exhibit considerable variability, so to learn to recognize them it is necessary to use much larger training sets. And indeed, the shortcomings of small image datasets have been widely recognized (e.g., Pinto et al. [21]), but it has only recently become possible to collect labeled datasets with millions of images. The new larger datasets include LabelMe [23], which consists of hundreds of thousands of fully-segmented images, and ImageNet [6], which consists of over 15 million labeled high-resolution images in over 22,000 categories.

1 引言

當前的目標識別方法基本上都使用了機器學習方法。為了提高目標識別的效能,我們可以收集更大的資料集,學習更強大的模型,並使用更好的技術來防止過擬合。直到最近,標註影象的資料集都相對較小,只在幾萬張影象的數量級上(例如,NORB[16],Caltech-101/256 [8, 9]和CIFAR-10/100 [12])。簡單的識別任務在這種規模的資料集上可以被解決得相當好,尤其是在通過保留標籤的變換進行資料增強的情況下。例如,目前在MNIST數字識別任務上的最低錯誤率(<0.3%)已經接近了人類水平[4]。但真實環境中的物件表現出了相當大的可變性,因此為了學會識別它們,有必要使用大得多的訓練集。實際上,小影象資料集的缺點已經被廣泛認識到(例如,Pinto et al. [21]),但收集上百萬張標註影象僅在最近才變得可能。新的更大的資料集包括LabelMe [23],它包含了數十萬張完全分割的影象,以及ImageNet [6],它包含了超過22000個類別、超過1500萬張標註的高解析度影象。

To learn about thousands of objects from millions of images, we need a model with a large learning capacity. However, the immense complexity of the object recognition task means that this problem cannot be specified even by a dataset as large as ImageNet, so our model should also have lots of prior knowledge to compensate for all the data we don’t have. Convolutional neural networks (CNNs) constitute one such class of models [16, 11, 13, 18, 15, 22, 26]. Their capacity can be controlled by varying their depth and breadth, and they also make strong and mostly correct assumptions about the nature of images (namely, stationarity of statistics and locality of pixel dependencies). Thus, compared to standard feedforward neural networks with similarly-sized layers, CNNs have much fewer connections and parameters and so they are easier to train, while their theoretically-best performance is likely to be only slightly worse.

為了從數百萬張影象中學習幾千個物件,我們需要一個有很強學習能力的模型。然而,物件識別任務的巨大複雜性意味著即使是像ImageNet這樣大的資料集也無法完全刻畫這個問題,因此我們的模型還應該擁有大量先驗知識來補償我們所沒有的資料。卷積神經網路(CNN)構成了一類這樣的模型[16, 11, 13, 18, 15, 22, 26]。它們的能力可以通過改變其深度和廣度來控制,它們也對影象的本質做出了強大且基本正確的假設(即統計的平穩性和畫素依賴的區域性性)。因此,與具有相似大小的層的標準前饋神經網路相比,CNN的連線和引數要少得多,所以它們更容易訓練,而其理論上的最佳效能可能只比標準前饋神經網路差一點。

Despite the attractive qualities of CNNs, and despite the relative efficiency of their local architecture, they have still been prohibitively expensive to apply in large scale to high-resolution images. Luckily, current GPUs, paired with a highly-optimized implementation of 2D convolution, are powerful enough to facilitate the training of interestingly-large CNNs, and recent datasets such as ImageNet contain enough labeled examples to train such models without severe overfitting.

儘管CNN具有引人注目的特性,儘管它們的區域性架構相對高效,但將它們大規模地應用於高解析度影象的代價仍然過於昂貴。幸運的是,目前的GPU搭配高度優化的2D卷積實現,已經強大到足以支援足夠大的CNN的訓練,而最近的資料集例如ImageNet包含了足夠多的標註樣本,可以在沒有嚴重過擬合的情況下訓練這樣的模型。

The specific contributions of this paper are as follows: we trained one of the largest convolutional neural networks to date on the subsets of ImageNet used in the ILSVRC-2010 and ILSVRC-2012 competitions [2] and achieved by far the best results ever reported on these datasets. We wrote a highly-optimized GPU implementation of 2D convolution and all the other operations inherent in training convolutional neural networks, which we make available publicly. Our network contains a number of new and unusual features which improve its performance and reduce its training time, which are detailed in Section 3. The size of our network made overfitting a significant problem, even with 1.2 million labeled training examples, so we used several effective techniques for preventing overfitting, which are described in Section 4. Our final network contains five convolutional and three fully-connected layers, and this depth seems to be important: we found that removing any convolutional layer (each of which contains no more than 1% of the model’s parameters) resulted in inferior performance.

本文具體的貢獻如下:我們在ILSVRC-2010和ILSVRC-2012競賽[2]使用的ImageNet子集上訓練了迄今為止最大的卷積神經網路之一,並取得了迄今為止在這些資料集上報道過的最好結果。我們編寫了高度優化的2D卷積GPU實現以及訓練卷積神經網路所需的所有其它操作,並將其公開。我們的網路包含許多新穎的、不尋常的特性,這些特性提高了網路的效能並減少了訓練時間,詳見第三節。即使使用了120萬張標註的訓練樣本,我們網路的規模仍然使過擬合成為一個嚴重的問題,因此我們使用了幾種有效的技術來防止過擬合,詳見第四節。我們最終的網路包含5個卷積層和3個全連線層,而這個深度似乎是非常重要的:我們發現移除任何一個卷積層(每個卷積層包含的引數都不超過模型引數的1%)都會導致更差的效能。

In the end, the network’s size is limited mainly by the amount of memory available on current GPUs and by the amount of training time that we are willing to tolerate. Our network takes between five and six days to train on two GTX 580 3GB GPUs. All of our experiments suggest that our results can be improved simply by waiting for faster GPUs and bigger datasets to become available.

最後,網路的規模主要受限於目前GPU的可用記憶體容量和我們能容忍的訓練時間。我們的網路在兩個GTX 580 3GB GPU上需要訓練五到六天。我們所有的實驗都表明,只要等待更快的GPU和更大的資料集出現,我們的結果就可以得到進一步提高。

2 The Dataset

ImageNet is a dataset of over 15 million labeled high-resolution images belonging to roughly 22,000 categories. The images were collected from the web and labeled by human labelers using Amazon’s Mechanical Turk crowd-sourcing tool. Starting in 2010, as part of the Pascal Visual Object Challenge, an annual competition called the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) has been held. ILSVRC uses a subset of ImageNet with roughly 1000 images in each of 1000 categories. In all, there are roughly 1.2 million training images, 50,000 validation images, and 150,000 testing images.

2 資料集

ImageNet是一個包含超過1500萬張標註高解析度影象的資料集,這些影象大約屬於22000個類別。這些影象是從網上收集的,並使用Amazon的Mechanical Turk眾包工具進行人工標註。從2010年起,作為Pascal視覺物件挑戰賽的一部分,每年都會舉辦名為ImageNet大規模視覺識別挑戰賽(ILSVRC)的競賽。ILSVRC使用ImageNet的一個子集,1000個類別中每個類別大約有1000張影象。總計大約有120萬張訓練影象、50000張驗證影象和150000張測試影象。

ILSVRC-2010 is the only version of ILSVRC for which the test set labels are available, so this is the version on which we performed most of our experiments. Since we also entered our model in the ILSVRC-2012 competition, in Section 6 we report our results on this version of the dataset as well, for which test set labels are unavailable. On ImageNet, it is customary to report two error rates: top-1 and top-5, where the top-5 error rate is the fraction of test images for which the correct label is not among the five labels considered most probable by the model.

ILSVRC-2010是ILSVRC競賽中唯一可以獲得測試集標籤的版本,因此我們的大多數實驗都是在這個版本上進行的。由於我們也用我們的模型參加了ILSVRC-2012競賽,在第六節中我們也報告了模型在這個版本資料集上的結果,該版本的測試集標籤是不可獲得的。在ImageNet上,按照慣例報告兩個錯誤率:top-1和top-5,其中top-5錯誤率是指正確標籤不在模型認為最可能的五個標籤之中的測試影象所佔的比例。
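As a concrete illustration of how these two metrics are computed, here is a short NumPy sketch; the array names, shapes, and the use of raw model scores are illustrative assumptions rather than anything specified in the paper.

```python
import numpy as np

def topk_error(scores, labels, k=5):
    """Fraction of examples whose true label is NOT among the k highest-scoring classes.

    scores: (num_examples, num_classes) array of model outputs (e.g. softmax probabilities).
    labels: (num_examples,) array of ground-truth class indices.
    """
    # Indices of the k largest scores per example (their internal order is irrelevant).
    topk = np.argpartition(scores, -k, axis=1)[:, -k:]
    hits = (topk == labels[:, None]).any(axis=1)
    return 1.0 - hits.mean()

# Toy usage with random scores over 1000 classes.
rng = np.random.default_rng(0)
scores = rng.standard_normal((32, 1000))
labels = rng.integers(0, 1000, size=32)
print(topk_error(scores, labels, k=1), topk_error(scores, labels, k=5))
```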

ImageNet consists of variable-resolution images, while our system requires a constant input dimensionality. Therefore, we down-sampled the images to a fixed resolution of 256 × 256. Given a rectangular image, we first rescaled the image such that the shorter side was of length 256, and then cropped out the central 256×256 patch from the resulting image. We did not pre-process the images in any other way, except for subtracting the mean activity over the training set from each pixel. So we trained our network on the (centered) raw RGB values of the pixels.

ImageNet包含各種解析度的影象,而我們的系統要求恆定的輸入維度。因此,我們將影象下采樣到固定的256×256解析度。給定一張矩形影象,我們首先縮放影象使其短邊長度為256,然後從縮放後的影象中裁剪出中心的256×256影象塊。除了從每個畫素中減去訓練集上的平均活躍度之外,我們不對影象做任何其它預處理。因此我們是在(中心化的)原始RGB畫素值上訓練我們的網路的。
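The preprocessing just described (rescale the shorter side to 256, crop the central 256×256 patch, subtract the per-pixel training-set mean) can be sketched as follows. This uses Pillow and NumPy purely for illustration; the `mean_image` argument is assumed to have been computed over the training set beforehand.

```python
import numpy as np
from PIL import Image

def preprocess(path, mean_image, size=256):
    """Rescale the shorter side to `size`, center-crop size x size, subtract the mean image."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    scale = size / min(w, h)                       # shorter side becomes exactly `size`
    img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
    w, h = img.size
    left, top = (w - size) // 2, (h - size) // 2   # central size x size patch
    img = img.crop((left, top, left + size, top + size))
    x = np.asarray(img, dtype=np.float32)          # raw RGB values, shape (size, size, 3)
    return x - mean_image                          # "centered" pixels fed to the network
```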

3 The Architecture

The architecture of our network is summarized in Figure 2. It contains eight learned layers — five convolutional and three fully-connected. Below, we describe some of the novel or unusual features of our network’s architecture. Sections 3.1-3.4 are sorted according to our estimation of their importance, with the most important first.

3 架構

我們的網路架構概括在圖2中。它包含八個學習層:5個卷積層和3個全連線層。下面,我們將描述我們網路架構中的一些新穎的或不尋常的特性。3.1-3.4小節按照我們對其重要性的估計進行排序,最重要的排在最前面。

3.1 ReLU Nonlinearity

The standard way to model a neuron's output f as a function of its input x is with $f(x) = \tanh(x)$ or $f(x) = (1 + e^{-x})^{-1}$. In terms of training time with gradient descent, these saturating nonlinearities are much slower than the non-saturating nonlinearity $f(x) = \max(0, x)$. Following Nair and Hinton [20], we refer to neurons with this nonlinearity as Rectified Linear Units (ReLUs). Deep convolutional neural networks with ReLUs train several times faster than their equivalents with tanh units. This is demonstrated in Figure 1, which shows the number of iterations required to reach 25% training error on the CIFAR-10 dataset for a particular four-layer convolutional network. This plot shows that we would not have been able to experiment with such large neural networks for this work if we had used traditional saturating neuron models.

3.1 ReLU非線性

將神經元輸出f建模為其輸入x的函式的標準方式是使用$f(x) = \tanh(x)$或$f(x) = (1 + e^{-x})^{-1}$。就梯度下降的訓練時間而言,這些飽和非線性比非飽和非線性$f(x) = \max(0, x)$慢得多。按照Nair和Hinton[20]的做法,我們將具有這種非線性的神經元稱為修正線性單元(ReLU)。採用ReLU的深度卷積神經網路的訓練速度比使用tanh單元的等價網路快幾倍。圖1展示了這一點,它顯示了對於一個特定的四層卷積網路,在CIFAR-10資料集上達到25%訓練誤差所需的迭代次數。這幅圖表明,如果我們採用傳統的飽和神經元模型,我們就無法用這麼大的神經網路來完成這項工作的實驗。
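For reference, the three nonlinearities mentioned here take only a few lines of NumPy; the gradient printout at the end illustrates why tanh (and the logistic sigmoid) are called saturating: their derivatives vanish for large |x|, whereas the ReLU's derivative stays at 1 for all positive inputs.

```python
import numpy as np

def tanh(x):
    return np.tanh(x)                  # saturates at -1 and +1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))    # saturates at 0 and 1

def relu(x):
    return np.maximum(0.0, x)          # non-saturating for x > 0

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(relu(x))             # [0. 0. 0. 1. 5.]
# Derivatives: tanh' = 1 - tanh^2, sigmoid' = s * (1 - s), relu' = 1 for x > 0 else 0.
print(1.0 - tanh(x) ** 2)  # ~0.00018 at |x| = 5: gradients vanish, which slows training
```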

Figure 1

Figure 1: A four-layer convolutional neural network with ReLUs (solid line) reaches a 25% training error rate on CIFAR-10 six times faster than an equivalent network with tanh neurons (dashed line). The learning rates for each network were chosen independently to make training as fast as possible. No regularization of any kind was employed. The magnitude of the effect demonstrated here varies with network architecture, but networks with ReLUs consistently learn several times faster than equivalents with saturating neurons.

圖1:使用ReLU的四層卷積神經網路(實線)在CIFAR-10資料集上達到25%的訓練錯誤率,比使用tanh神經元的等價網路(虛線)快六倍。為了使訓練儘可能快,每個網路的學習率是獨立選擇的。沒有采用任何形式的正則化。這裡展示的效果的大小隨網路架構的不同而變化,但使用ReLU的網路始終比使用飽和神經元的等價網路學習快幾倍。

We are not the first to consider alternatives to traditional neuron models in CNNs. For example, Jarrett et al. [11] claim that the nonlinearity f(x) = |tanh(x)| works particularly well with their type of contrast normalization followed by local average pooling on the Caltech-101 dataset. However, on this dataset the primary concern is preventing overfitting, so the effect they are observing is different from the accelerated ability to fit the training set which we report when using ReLUs. Faster learning has a great influence on the performance of large models trained on large datasets.

我們不是第一個考慮在CNN中替代傳統神經元模型的人。例如,Jarrett等人[11]聲稱,非線性函式f(x) = |tanh(x)|與他們那種對比度歸一化及隨後的區域性均值池化一起使用時,在Caltech-101資料集上效果特別好。然而,在這個資料集上,主要關注的是防止過擬合,因此他們觀察到的效果不同於我們所報告的使用ReLU時擬合訓練集的加速能力。更快的學習對在大型資料集上訓練的大型模型的效能有很大影響。

3.2 Training on Multiple GPUs

A single GTX 580 GPU has only 3GB of memory, which limits the maximum size of the networks that can be trained on it. It turns out that 1.2 million training examples are enough to train networks which are too big to fit on one GPU. Therefore we spread the net across two GPUs. Current GPUs are particularly well-suited to cross-GPU parallelization, as they are able to read from and write to one another’s memory directly, without going through host machine memory. The parallelization scheme that we employ essentially puts half of the kernels (or neurons) on each GPU, with one additional trick: the GPUs communicate only in certain layers. This means that, for example, the kernels of layer 3 take input from all kernel maps in layer 2. However, kernels in layer 4 take input only from those kernel maps in layer 3 which reside on the same GPU. Choosing the pattern of connectivity is a problem for cross-validation, but this allows us to precisely tune the amount of communication until it is an acceptable fraction of the amount of computation.

3.2 多GPU訓練

單個GTX 580 GPU只有3GB記憶體,這限制了可以在其上訓練的網路的最大規模。事實證明,120萬個訓練樣本足以訓練那些因太大而無法放入單個GPU的網路。因此我們將網路分佈在兩個GPU上。目前的GPU非常適合跨GPU並行,因為它們可以直接讀寫彼此的記憶體,而不需要經過主機記憶體。我們採用的並行方案基本上是在每個GPU上放置一半的核(或神經元),另外還有一個技巧:GPU只在某些特定的層上進行通訊。這意味著,例如,第3層的核會以第2層的所有核對映作為輸入,而第4層的核只以位於同一GPU上的第3層核對映作為輸入。連線模式的選擇是一個需要交叉驗證的問題,但這使我們能夠精確地調整通訊量,直到它佔計算量的比例在可接受的範圍內。
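One way to reproduce this restricted connectivity pattern in a modern framework is with grouped convolutions: `groups=2` connects each kernel only to the half of the kernel maps that would live on its own GPU, while `groups=1` corresponds to a layer where the two GPUs communicate. The PyTorch snippet below is only an illustration of the idea, not the authors' original two-GPU implementation; the channel counts follow Figure 2.

```python
import torch
import torch.nn as nn

# Layer 3: kernels take input from *all* kernel maps of layer 2 (the GPUs communicate).
conv3 = nn.Conv2d(in_channels=256, out_channels=384, kernel_size=3, padding=1, groups=1)

# Layer 4: kernels only see the layer-3 maps residing on their own "GPU" (two halves).
conv4 = nn.Conv2d(in_channels=384, out_channels=384, kernel_size=3, padding=1, groups=2)

x = torch.randn(1, 256, 13, 13)   # a dummy layer-2 output
print(conv4(conv3(x)).shape)      # torch.Size([1, 384, 13, 13])
```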

The resultant architecture is somewhat similar to that of the “columnar” CNN employed by Ciresan et al. [5], except that our columns are not independent (see Figure 2). This scheme reduces our top-1 and top-5 error rates by 1.7% and 1.2%, respectively, as compared with a net with half as many kernels in each convolutional layer trained on one GPU. The two-GPU net takes slightly less time to train than the one-GPU net.

最終的架構有點類似於Ciresan等人[5]採用的“columnar” CNN,區別在於我們的列不是獨立的(見圖2)。與每個卷積層只有一半數量的核、並在單個GPU上訓練的網路相比,這個方案分別降低了1.7%的top-1錯誤率和1.2%的top-5錯誤率。雙GPU網路的訓練時間比單GPU網路略短。

Figure 2

Figure 2: An illustration of the architecture of our CNN, explicitly showing the delineation of responsibilities between the two GPUs. One GPU runs the layer-parts at the top of the figure while the other runs the layer-parts at the bottom. The GPUs communicate only at certain layers. The network’s input is 150,528-dimensional, and the number of neurons in the network’s remaining layers is given by 253,440–186,624–64,896–64,896–43,264– 4096–4096–1000.

圖2:我們CNN架構的圖解,明確顯示了兩個GPU之間的職責劃分。一個GPU執行圖頂部的層的部分,另一個GPU執行圖底部的層的部分。兩個GPU只在特定的層進行通訊。網路的輸入是150,528維的,網路其餘各層的神經元數目分別為253,440–186,624–64,896–64,896–43,264–4096–4096–1000。
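To make the layer dimensions concrete, here is a minimal PyTorch sketch of an eight-layer network in the spirit of Figure 2. It is an approximate reconstruction, not the released model: the two-GPU split, local response normalization, and dropout are omitted, and a 227×227 input is used so that the stride-4 first convolution yields 55×55 maps (the figure's 150,528-dimensional input corresponds to 224×224×3).

```python
import torch
import torch.nn as nn

class AlexNetSketch(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),   # scores for the final 1000-way softmax
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

print(AlexNetSketch()(torch.randn(1, 3, 227, 227)).shape)  # torch.Size([1, 1000])
```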

3.3 Local Response Normalization

ReLUs have the desirable property that they do not require input normalization to prevent them from saturating. If at least some training examples produce a positive input to a ReLU, learning will happen in that neuron. However, we still find that the following local normalization scheme aids generalization. Denoting by $a^{i}_{x,y}$ the activity of a neuron computed by applying kernel i at position (x, y) and then applying the ReLU nonlinearity, the response-normalized activity $b^{i}_{x,y}$ is given by the expression

$$b^{i}_{x,y} = a^{i}_{x,y} \Bigg/ \left(k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left(a^{j}_{x,y}\right)^{2}\right)^{\beta}$$

where the sum runs over n “adjacent” kernel maps at the same spatial position, and N is the total number of kernels in the layer. The ordering of the kernel maps is of course arbitrary and determined before training begins. This sort of response normalization implements a form of lateral inhibition inspired by the type found in real neurons, creating competition for big activities amongst neuron outputs computed using different kernels. The constants k, n, α, and β are hyper-parameters whose values are determined using a validation set; we used k = 2, n = 5, α = 0.0001, and β = 0.75. We applied this normalization after applying the ReLU nonlinearity in certain layers (see Section 3.5).

3.3 區域性響應歸一化

ReLU具有一個讓人滿意的特性:它不需要通過輸入歸一化來防止飽和。只要某些訓練樣本對ReLU產生了正輸入,那個神經元就會進行學習。然而,我們仍然發現下面的區域性歸一化方案有助於泛化。用$a^{i}_{x,y}$表示在位置(x, y)應用核i、然後應用ReLU非線性計算得到的神經元啟用,響應歸一化後的啟用$b^{i}_{x,y}$由下式給出:

$$b^{i}_{x,y} = a^{i}_{x,y} \Bigg/ \left(k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left(a^{j}_{x,y}\right)^{2}\right)^{\beta}$$

其中求和遍歷同一空間位置上n個“相鄰的”核對映,N是該層中核的總數。核對映的順序當然是任意的,並在訓練開始前確定。這種響應歸一化實現了一種側抑制形式,其靈感來自於真實神經元中發現的型別,在使用不同核計算的神經元輸出之間,為較大的啟用值創造競爭。常量k、n、α和β是超引數,它們的值通過驗證集確定;我們使用k=2,n=5,α=0.0001,β=0.75。我們在特定層應用ReLU非線性之後再應用這種歸一化(見3.5小節)。
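A direct NumPy transcription of this formula, using the paper's hyper-parameters k=2, n=5, α=0.0001, β=0.75, might look like the following; the channels-first layout of the activation array is an assumption made for illustration.

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """a: ReLU activations of shape (N_kernels, H, W); returns b of the same shape."""
    N = a.shape[0]
    b = np.empty_like(a)
    for i in range(N):
        # Sum of squares over the n "adjacent" kernel maps at the same spatial position.
        lo, hi = max(0, i - n // 2), min(N - 1, i + n // 2)
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b

a = np.random.rand(96, 55, 55).astype(np.float32)  # e.g. outputs of the first conv layer
print(local_response_norm(a).shape)                # (96, 55, 55)
```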

This scheme bears some resemblance to the local contrast normalization scheme of Jarrett et al. [11], but ours would be more correctly termed “brightness normalization”, since we do not subtract the mean activity. Response normalization reduces our top-1 and top-5 error rates by 1.4% and 1.2%, respectively. We also verified the effectiveness of this scheme on the CIFAR-10 dataset: a four-layer CNN achieved a 13% test error rate without normalization and 11% with normalization.

這個方案與Jarrett等人[11]的區域性對比度歸一化方案有一定的相似性,但我們的方案更恰當的叫法是“亮度歸一化”,因為我們沒有減去平均活躍度。響應歸一化分別降低了1.4%的top-1錯誤率和1.2%的top-5錯誤率。我們也在CIFAR-10資料集上驗證了這個方案的有效性:一個未使用歸一化的四層CNN取得了13%的測試錯誤率,而使用歸一化後取得了11%。

3.4 Overlapping Pooling

Pooling layers in CNNs summarize the outputs of neighboring groups of neurons in the same kernel map. Tr