
Going Deeper with Convolutions: Reading Notes

    Paper link: Going deeper with convolutions

  Code download:

  • Abstract
We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network.
# We propose a deep convolutional neural network architecture, codenamed Inception, that achieves a new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main feature of this architecture is its improved utilization of the computing resources inside the network.
By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.
# Through careful manual design, we increased the depth and width of the network without increasing its computational budget. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. The most representative incarnation in our ILSVRC14 submission is called GoogLeNet, a 22-layer deep network, whose quality is assessed on the classification and detection tasks.
  • Introduction
In the last three years, our object classification and detection capabilities have dramatically improved due to advances in deep learning and convolutional networks [10]. One encouraging news is that most of this progress is not just the result of more powerful hardware, larger datasets and bigger models, but mainly a consequence of new ideas, algorithms and improved network architectures.
# Over the past three years, object classification and detection capabilities have improved dramatically thanks to advances in deep learning and convolutional networks [10]. Encouragingly, most of this progress is not merely the result of more powerful hardware, larger datasets, and bigger models, but mainly a consequence of new ideas, algorithms, and improved network architectures.
No new data sources were used, for example, by the top entries in the ILSVRC 2014 competition besides the classification dataset of the same competition for detection purposes. Our GoogLeNet submission to ILSVRC 2014 actually uses 12 times fewer parameters than the winning architecture of Krizhevsky et al [9] from two years ago, while being significantly more accurate.
# For example, the top entries in the ILSVRC 2014 competition used no new data sources beyond the competition's own classification dataset, even for detection purposes. Our GoogLeNet submission to ILSVRC 2014 actually uses 12 times fewer parameters than the winning architecture of Krizhevsky et al. [9] from two years ago, while being significantly more accurate.
On the object detection front, the biggest gains have not come from naive application of bigger and bigger deep networks, but from the synergy of deep architectures and classical computer vision, like the R-CNN algorithm by Girshick et al [6].
# On the object detection front, the biggest gains have come not from naively applying bigger and bigger deep networks, but from the synergy of deep architectures and classical computer vision, such as the R-CNN algorithm of Girshick et al. [6].
Another notable factor is that with the ongoing traction of mobile and embedded computing, the efficiency of our algorithms – especially their power and memory use – gains importance. It is noteworthy that the considerations leading to the design of the deep architecture presented in this paper included this factor rather than having a sheer fixation on accuracy numbers.
# Another notable factor is that with the ongoing rise of mobile and embedded computing, the efficiency of our algorithms, especially their power and memory use, gains importance. Notably, the design considerations behind the deep architecture presented in this paper included this factor, rather than fixating purely on accuracy numbers.
For most of the experiments, the models were designed to keep a computational budget of 1.5 billion multiply-adds at inference time, so that they do not end up to be a purely academic curiosity, but could be put to real world use, even on large datasets, at a reasonable cost.
# For most of the experiments, the models were designed to keep an inference-time budget of 1.5 billion multiply-adds, so that they do not end up as a purely academic curiosity, but can be put to real-world use at a reasonable cost, even on large datasets.
In this paper, we will focus on an efficient deep neural network architecture for computer vision, codenamed Inception, which derives its name from the Network in Network paper by Lin et al [12] in conjunction with the famous "we need to go deeper" internet meme [1]. In our case, the word "deep" is used in two different meanings: first of all, in the sense that we introduce a new level of organization in the form of the "Inception module" and also in the more direct sense of increased network depth.
# In this paper we focus on an efficient deep neural network architecture for computer vision, codenamed Inception. The name derives from the Network in Network paper by Lin et al. [12], together with the famous "we need to go deeper" internet meme [1]. In our case, the word "deep" carries two meanings: first, that we introduce a new level of organization in the form of the "Inception module", and second, in the more direct sense of increased network depth.
In general, one can view the Inception model as a logical culmination of [12] while taking inspiration and guidance from the theoretical work by Arora et al [2]. The benefits of the architecture are experimentally verified on the ILSVRC 2014 classification and detection challenges, where it significantly outperforms the current state of the art.
# In general, one can view the Inception model as a logical culmination of [12], taking inspiration and guidance from the theoretical work of Arora et al. [2]. The benefits of the architecture were verified experimentally on the ILSVRC 2014 classification and detection challenges, where it significantly outperformed the state of the art at the time.
  • Related Work
Starting with LeNet-5 [10], convolutional neural networks (CNN) have typically had a standard structure: stacked convolutional layers (optionally followed by contrast normalization and max-pooling) are followed by one or more fully-connected layers. Variants of this basic design are prevalent in the image classification literature and have yielded the best results to-date on MNIST, CIFAR and most notably on the ImageNet classification challenge [9, 21].
# Starting with LeNet-5 [10], convolutional neural networks (CNNs) have typically followed a standard structure: stacked convolutional layers (optionally followed by contrast normalization and max-pooling), followed by one or more fully-connected layers. Variants of this basic design are prevalent in the image classification literature and have yielded the best results to date on MNIST, CIFAR, and most notably on the ImageNet classification challenge [9, 21].
For larger datasets such as Imagenet, the recent trend has been to increase the number of layers [12] and layer size [21, 14], while using dropout [7] to address the problem of overfitting.
# For larger datasets such as ImageNet, the recent trend has been to increase the number of layers [12] and the layer size [21, 14], while using dropout [7] to address the problem of overfitting.
Despite concerns that max-pooling layers result in loss of accurate spatial information, the same convolutional network architecture as [9] has also been successfully employed for localization [9, 14], object detection [6, 14, 18, 5] and human pose estimation [19].
# Despite concerns that max-pooling layers result in a loss of accurate spatial information, the same convolutional network architecture as [9] has also been employed successfully for localization [9, 14], object detection [6, 14, 18, 5], and human pose estimation [19].
Inspired by a neuroscience model of the primate visual cortex, Serre et al. [15] used a series of fixed Gabor filters of different sizes to handle multiple scales. We use a similar strategy here.
# Inspired by a neuroscience model of the primate visual cortex, Serre et al. [15] used a series of fixed Gabor filters of different sizes to handle multiple scales. We use a similar strategy here.
However, contrary to the fixed 2-layer deep model of [15], all filters in the Inception architecture are learned. Furthermore, Inception layers are repeated many times, leading to a 22-layer deep model in the case of the GoogLeNet model.
# However, in contrast to the fixed two-layer deep model of [15], all filters in the Inception architecture are learned. Furthermore, Inception layers are repeated many times, leading to a 22-layer deep model in the case of GoogLeNet.
Network-in-Network is an approach proposed by Lin et al. [12] in order to increase the representational power of neural networks. In their model, additional 1×1 convolutional layers are added to the network, increasing its depth. We use this approach heavily in our architecture.
# Network-in-Network is an approach proposed by Lin et al. [12] to increase the representational power of neural networks. In their model, additional 1×1 convolutional layers are added to the network, increasing its depth. We use this approach heavily in our architecture.
However, in our setting, 1×1 convolutions have dual purpose: most critically, they are used mainly as dimension reduction modules to remove computational bottlenecks, that would otherwise limit the size of our networks. This allows for not just increasing the depth, but also the width of our networks without a significant performance penalty.
# However, in our setting, 1×1 convolutions serve a dual purpose: most critically, they act as dimension-reduction modules that remove computational bottlenecks which would otherwise limit the size of our networks. This allows us to increase not only the depth but also the width of our networks without a significant performance penalty.
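To make the bottleneck effect concrete, here is a minimal back-of-the-envelope sketch (plain Python) comparing the multiply-adds of a direct 5×5 convolution against a 1×1 reduction followed by the same 5×5 convolution. The channel counts are illustrative; they happen to match what Table 1 later reports for the 5×5 branch of the inception (3a) module:

```python
# Multiply-adds of a k x k convolution over an H x W feature map:
# H * W * k * k * C_in * C_out
def conv_madds(h, w, k, c_in, c_out):
    return h * w * k * k * c_in * c_out

H = W = 28  # spatial size of the feature map (illustrative)
direct = conv_madds(H, W, 5, 192, 32)                                  # 5x5 straight on 192 channels
reduced = conv_madds(H, W, 1, 192, 16) + conv_madds(H, W, 5, 16, 32)   # 1x1 bottleneck to 16 channels first
print(f"{direct:,} vs {reduced:,}")  # 120,422,400 vs 12,443,648 -- roughly 10x cheaper
```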
Finally, the current state of the art for object detection is the Regions with Convolutional Neural Networks (R-CNN) method by Girshick et al. [6]. R-CNN decomposes the overall detection problem into two subproblems: utilizing low-level cues such as color and texture in order to generate object location proposals in a category-agnostic fashion and using CNN classifiers to identify object categories at those locations.
# Finally, the current state of the art for object detection is the Regions with Convolutional Neural Networks (R-CNN) method of Girshick et al. [6]. R-CNN decomposes the overall detection problem into two subproblems: first using low-level cues such as color and texture to generate object location proposals in a category-agnostic fashion, then using CNN classifiers to identify the object categories at those locations.
Such a two stage approach leverages the accuracy of bounding box segmentation with low-level cues, as well as the highly powerful classification power of state-of-the-art CNNs. We adopted a similar pipeline in our detection submissions, but have explored enhancements in both stages, such as multi-box [5] prediction for higher object bounding box recall, and ensemble approaches for better categorization of bounding box proposals.
# Such a two-stage approach leverages the accuracy of bounding-box segmentation with low-level cues, as well as the highly powerful classification ability of state-of-the-art CNNs. We adopted a similar pipeline in our detection submission, but explored enhancements in both stages, such as multi-box [5] prediction for higher object bounding-box recall, and ensemble approaches for better categorization of bounding-box proposals.
  • Motivation and High Level Considerations
The most straightforward way of improving the performance of deep neural networks is by increasing their size. This includes both increasing the depth – the number of network levels – as well as its width: the number of units at each level. This is an easy and safe way of training higher quality models, especially given the availability of a large amount of labeled training data. However, this simple solution comes with two major drawbacks.
# The most straightforward way of improving the performance of deep neural networks is to increase their size. This includes both increasing the depth (the number of network levels) and the width (the number of units at each level). This is an easy and safe way of training higher-quality models, especially given the availability of a large amount of labeled training data. However, this simple solution comes with two major drawbacks.

Figure 1: Two distinct classes from the 1000 classes of the ILSVRC 2014 classification challenge. Domain knowledge is required to distinguish between these classes.
Bigger size typically means a larger number of parameters, which makes the enlarged network more prone to overfitting, especially if the number of labeled examples in the training set is limited. This is a major bottleneck as strongly labeled datasets are laborious and expensive to obtain, often requiring expert human raters to distinguish between various fine-grained visual categories such as those in ImageNet (even in the 1000-class ILSVRC subset) as shown in Figure 1.
# Bigger size typically means a larger number of parameters, which makes the enlarged network more prone to overfitting, especially when the number of labeled examples in the training set is limited. This is a major bottleneck, because strongly labeled datasets are laborious and expensive to obtain, often requiring expert human raters to distinguish between fine-grained visual categories such as those in ImageNet (even in the 1000-class ILSVRC subset), as shown in Figure 1.
The other drawback of uniformly increased network size is the dramatically increased use of computational resources. For example, in a deep vision network, if two convolutional layers are chained, any uniform increase in the number of their filters results in a quadratic increase of computation. 
# The other drawback of uniformly increasing network size is the dramatic growth in the use of computational resources. For example, in a deep vision network, if two convolutional layers are chained, any uniform increase in the number of their filters results in a quadratic increase in computation.
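A quick sketch of why the growth is quadratic: the cost of the convolution joining two chained layers is proportional to C_in × C_out, so scaling both filter counts by k multiplies the cost by k². (Plain Python; the channel counts are illustrative.)

```python
def chained_cost(c1, c2, h=56, w=56, k=3):
    # Multiply-adds of the k x k convolution joining two chained layers,
    # proportional to c1 * c2 once the spatial size and kernel are fixed.
    return h * w * k * k * c1 * c2

print(chained_cost(128, 128) / chained_cost(64, 64))  # 4.0: doubling both widths quadruples the cost
```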
If the added capacity is used inefficiently (for example, if most weights end up to be close to zero), then much of the computation is wasted. As the computational budget is always finite, an efficient distribution of computing resources is preferred to an indiscriminate increase of size, even when the main objective is to increase the quality of performance.
# If the added capacity is used inefficiently (for example, if most weights end up close to zero), then much of the computation is wasted. Since the computational budget is always finite, an efficient distribution of computing resources is preferable to an indiscriminate increase in size, even when the main objective is to increase the quality of performance.
A fundamental way of solving both of these issues would be to introduce sparsity and replace the fully connected layers by the sparse ones, even inside the convolutions. Besides mimicking biological systems, this would also have the advantage of firmer theoretical underpinnings due to the groundbreaking work of Arora et al. [2].
# A fundamental way of solving both of these issues would be to introduce sparsity and replace the fully connected layers with sparse ones, even inside the convolutions. Besides mimicking biological systems, this would also have the advantage of firmer theoretical underpinnings, thanks to the groundbreaking work of Arora et al. [2].
Their main result states that if the probability distribution of the dataset is representable by a large, very sparse deep neural network, then the optimal network topology can be constructed layer after layer by analyzing the correlation statistics of the preceding layer activations and clustering neurons with highly correlated outputs. Although the strict mathematical proof requires very strong conditions, the fact that this statement resonates with the well known Hebbian principle – neurons that fire together, wire together – suggests that the underlying idea is applicable even under less strict conditions, in practice.
# Their main result states that if the probability distribution of the dataset is representable by a large, very sparse deep neural network, then the optimal network topology can be constructed layer by layer by analyzing the correlation statistics of the preceding layer's activations and clustering neurons with highly correlated outputs. Although the strict mathematical proof requires very strong conditions, the fact that this statement resonates with the well-known Hebbian principle (neurons that fire together, wire together) suggests that the underlying idea is applicable in practice even under less strict conditions.
Unfortunately, today’s computing infrastructures are very inefficient when it comes to numerical calculation on non-uniform sparse data structures. Even if the number of arithmetic operations is reduced by 100×, the overhead of lookups and cache misses would dominate: switching to sparse matrices might not pay off.
# Unfortunately, today's computing infrastructure is very inefficient when it comes to numerical calculation on non-uniform sparse data structures. Even if the number of arithmetic operations were reduced by 100×, the overhead of lookups and cache misses would dominate: switching to sparse matrices might not pay off.
The gap is widened yet further by the use of steadily improving and highly tuned numerical libraries that allow for extremely fast dense matrix multiplication, exploiting the minute details of the underlying CPU or GPU hardware [16, 9]. Also, non-uniform sparse models require more sophisticated engineering and computing infrastructure.
# The gap is widened yet further by the use of steadily improving, highly tuned numerical libraries that allow extremely fast dense matrix multiplication, exploiting the minute details of the underlying CPU or GPU hardware [16, 9]. Also, non-uniform sparse models require more sophisticated engineering and computing infrastructure.
Most current vision oriented machine learning systems utilize sparsity in the spatial domain just by the virtue of employing convolutions. However, convolutions are implemented as collections of dense connections to the patches in the earlier layer.
# Most current vision-oriented machine learning systems utilize sparsity in the spatial domain merely by virtue of employing convolutions. However, convolutions are implemented as collections of dense connections to the patches of the earlier layer.
ConvNets have traditionally used random and sparse connection tables in the feature dimensions since [11] in order to break the symmetry and improve learning, yet the trend changed back to full connections with [9] in order to further optimize parallel computation. Current state-of-the-art architectures for computer vision have uniform structure. The large number of filters and greater batch size allows for the efficient use of dense computation.
# Since [11], ConvNets have traditionally used random, sparse connection tables in the feature dimension in order to break symmetry and improve learning, yet the trend changed back to full connections with [9] in order to further optimize parallel computation. Current state-of-the-art architectures for computer vision have uniform structure; the large number of filters and greater batch size allow dense computation to be used efficiently.
This raises the question of whether there is any hope for a next, intermediate step: an architecture that makes use of filter-level sparsity, as suggested by the theory, but exploits our current hardware by utilizing computations on dense matrices. The vast literature on sparse matrix computations (e.g. [3]) suggests that clustering sparse matrices into relatively dense submatrices tends to give competitive performance for sparse matrix multiplication. It does not seem far-fetched to think that similar methods would be utilized for the automated construction of non-uniform deep-learning architectures in the near future.
# This raises the question of whether there is any hope for a next, intermediate step: an architecture that makes use of filter-level sparsity, as suggested by the theory, but exploits our current hardware by utilizing computations on dense matrices. The vast literature on sparse matrix computation (e.g. [3]) suggests that clustering sparse matrices into relatively dense submatrices tends to give competitive performance for sparse matrix multiplication. It does not seem far-fetched to think that similar methods will be used for the automated construction of non-uniform deep-learning architectures in the near future.
The Inception architecture started out as a case study for assessing the hypothetical output of a sophisticated network topology construction algorithm that tries to approximate a sparse structure implied by [2] for vision networks and covering the hypothesized outcome by dense, readily available components. Despite being a highly speculative undertaking, modest gains were observed early on when compared with reference networks based on [12].
# The Inception architecture started out as a case study for assessing the hypothetical output of a sophisticated network-topology construction algorithm that tries to approximate the sparse structure implied by [2] for vision networks, covering the hypothesized outcome with dense, readily available components. Despite being a highly speculative undertaking, modest gains were observed early on when compared with reference networks based on [12].
With a bit of tuning the gap widened and Inception proved to be especially useful in the context of localization and object detection as the base network for [6] and [5]. Interestingly, while most of the original architectural choices have been questioned and tested thoroughly in separation, they turned out to be close to optimal locally.
# With a bit of tuning the gap widened, and Inception proved especially useful as the base network for localization and object detection in [6] and [5]. Interestingly, while most of the original architectural choices have been questioned and tested thoroughly in isolation, they turned out to be close to locally optimal.
One must be cautious though: although the Inception architecture has become a success for computer vision, it is still questionable whether this can be attributed to the guiding principles that have led to its construction. Making sure of this would require a much more thorough analysis and verification.
# One must remain cautious, though: although the Inception architecture has become a success for computer vision, it is still questionable whether this can be attributed to the guiding principles that led to its construction. Making sure of this would require much more thorough analysis and verification.
  • Architectural Details
The main idea of the Inception architecture is to consider how an optimal local sparse structure of a convolutional vision network can be approximated and covered by readily available dense components. Note that assuming translation invariance means that our network will be built from convolutional building blocks.
# The main idea of the Inception architecture is to consider how an optimal local sparse structure of a convolutional vision network can be approximated and covered by readily available dense components. Note that assuming translation invariance means our network will be built from convolutional building blocks.
All we need is to find the optimal local construction and to repeat it spatially. Arora et al. [2] suggests a layer-by-layer construction where one should analyze the correlation statistics of the last layer and cluster them into groups of units with high correlation.
# All we need is to find the optimal local construction and repeat it spatially. Arora et al. [2] suggest a layer-by-layer construction in which one analyzes the correlation statistics of the last layer and clusters units with high correlation into groups.
These clusters form the units of the next layer and are connected to the units in the previous layer. We assume that each unit from an earlier layer corresponds to some region of the input image and these units are grouped into filter banks. In the lower layers (the ones close to the input) correlated units would concentrate in local regions. Thus, we would end up with a lot of clusters concentrated in a single region and they can be covered by a layer of 1×1 convolutions in the next layer, as suggested in [12]. However, one can also expect that there will be a smaller number of more spatially spread out clusters that can be covered by convolutions over larger patches, and there will be a decreasing number of patches over larger and larger regions. In order to avoid patch-alignment issues, current incarnations of the Inception architecture are restricted to filter sizes 1×1, 3×3 and 5×5; this decision was based more on convenience rather than necessity. It also means that the suggested architecture is a combination of all those layers with their output filter banks concatenated into a single output vector forming the input of the next stage. Additionally, since pooling operations have been essential for the success of current convolutional networks, it suggests that adding an alternative parallel pooling path in each such stage should have additional beneficial effect, too (see Figure 2(a)).
# These clusters form the units of the next layer and are connected to the units of the previous layer. We assume that each unit of an earlier layer corresponds to some region of the input image, and these units are grouped into filter banks. In the lower layers (those close to the input), correlated units concentrate in local regions; we thus end up with many clusters concentrated in a single region, which can be covered by a layer of 1×1 convolutions in the next layer, as suggested in [12]. However, one can also expect a smaller number of more spatially spread-out clusters that can be covered by convolutions over larger patches, and a decreasing number of patches over larger and larger regions. To avoid patch-alignment issues, current incarnations of the Inception architecture are restricted to filter sizes 1×1, 3×3 and 5×5; this decision was based more on convenience than necessity. It also means the suggested architecture is a combination of all those layers, with their output filter banks concatenated into a single output vector that forms the input of the next stage. Additionally, since pooling operations have been essential to the success of current convolutional networks, adding an alternative parallel pooling path in each such stage should have an additional beneficial effect as well (see Figure 2(a)).
As these “Inception modules” are stacked on top of each other, their output correlation statistics are bound to vary: as features of higher abstraction are captured by higher layers, their spatial concentration is expected to decrease. This suggests that the ratio of 3×3 and 5×5 convolutions should increase as we move to higher layers.
#由於這些"Inception"模組在彼此的頂部堆疊,它們的輸出相關統計必然會有變化:由於較高層將捕捉較高層的抽象特徵,其空間聚集度預計會減少。這表明隨著轉移到較高層,3X3和5X5卷積的比例將會增加。
        (a) Inception module, naïve version
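As a concrete reading aid, here is a minimal PyTorch sketch of the naïve module of Figure 2(a): four parallel branches (1×1, 3×3, and 5×5 convolutions plus 3×3 max-pooling) whose outputs are concatenated along the channel dimension. The channel counts are free parameters here, not the paper's values:

```python
import torch
import torch.nn as nn

class NaiveInception(nn.Module):
    """Figure 2(a): parallel 1x1, 3x3, 5x5 convolutions and 3x3 max-pooling,
    concatenated along the channel axis. Padding keeps the spatial size fixed."""
    def __init__(self, in_ch, c1, c3, c5):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, c1, kernel_size=1)
        self.b3 = nn.Conv2d(in_ch, c3, kernel_size=3, padding=1)
        self.b5 = nn.Conv2d(in_ch, c5, kernel_size=5, padding=2)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        outs = [self.relu(self.b1(x)), self.relu(self.b3(x)),
                self.relu(self.b5(x)), self.pool(x)]
        # Output width is c1 + c3 + c5 + in_ch: the pooling branch passes all
        # input channels through, which is exactly the problem discussed next.
        return torch.cat(outs, dim=1)
```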
One big problem with the above modules, at least in this naïve form, is that even a modest number of 5×5 convolutions can be prohibitively expensive on top of a convolutional layer with a large number of filters. This problem becomes even more pronounced once pooling units are added to the mix: the number of output filters equals the number of filters in the previous stage.
# One big problem with the above module, at least in this naïve form, is that even a modest number of 5×5 convolutions can be prohibitively expensive on top of a convolutional layer with a large number of filters. The problem becomes even more pronounced once pooling units are added to the mix: their number of output filters equals the number of filters of the previous stage.
The merging of output of the pooling layer with outputs of the convolutional layers would lead to an inevitable increase in the number of outputs from stage to stage. While this architecture might cover the optimal sparse structure, it would do it very inefficiently, leading to a computational blow up within a few stages.
# Merging the output of the pooling layer with the outputs of the convolutional layers would lead to an inevitable increase in the number of outputs from stage to stage. While this architecture might cover the optimal sparse structure, it would do so very inefficiently, leading to a computational blow-up within a few stages.
This leads to the second idea of the Inception architecture: judiciously reducing dimension wherever the computational requirements would increase too much otherwise. This is based on the success of embeddings: even low dimensional embeddings might contain a lot of information about a relatively large image patch. However, embeddings represent information in a dense, compressed form and compressed information is harder to process. The representation should be kept sparse at most places (as required by the conditions of [2]) and compress the signals only whenever they have to be aggregated en masse.
# This leads to the second idea of the Inception architecture: judiciously reducing dimensionality wherever the computational requirements would otherwise grow too much. This is based on the success of embeddings: even low-dimensional embeddings may contain a lot of information about a relatively large image patch. However, embeddings represent information in a dense, compressed form, and compressed information is harder to process. The representation should be kept sparse at most places (as required by the conditions of [2]), compressing the signals only when they must be aggregated en masse.
That is, 1×1 convolutions are used to compute reductions before the expensive 3×3 and 5×5 convolutions. Besides being used as reductions, they also include the use of rectified linear activation making them dual-purpose. The final result is depicted in Figure 2(b).
# That is, 1×1 convolutions are used to compute reductions before the expensive 3×3 and 5×5 convolutions. Besides being used as reductions, they also include rectified linear activation, making them dual-purpose. The final result is depicted in Figure 2(b).
In general, an Inception network is a network consisting of modules of the above type stacked upon each other,with occasional max-pooling layers with stride 2 to halve the resolution of the grid. For technical reasons (memory efficiency during training), it seemed beneficial to start using Inception modules only at higher layers while keeping the lower layers in traditional convolutional fashion. This is not strictly necessary, simply reflecting some infrastructural inefficiencies in our current implementation.
# In general, an Inception network is a network consisting of modules of the above type stacked upon each other, with occasional max-pooling layers of stride 2 to halve the resolution of the grid. For technical reasons (memory efficiency during training), it seemed beneficial to start using Inception modules only at higher layers while keeping the lower layers in traditional convolutional fashion. This is not strictly necessary; it simply reflects some infrastructural inefficiencies in our current implementation.
    (b) Inception module with dimensionality reduction
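Extending the sketch above to Figure 2(b): 1×1 reduction convolutions are placed before the 3×3 and 5×5 branches, and a 1×1 projection follows the pooling branch, so the concatenated width stays under control. The channel counts in the usage line follow what the paper's Table 1 lists for the (3a) module; the rest is an illustrative reconstruction, not the reference implementation:

```python
import torch
import torch.nn as nn

def conv_relu(in_ch, out_ch, **kw):
    # Every convolution in the paper uses rectified linear activation.
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, **kw), nn.ReLU(inplace=True))

class Inception(nn.Module):
    """Figure 2(b): Inception module with dimensionality reduction."""
    def __init__(self, in_ch, c1, c3r, c3, c5r, c5, pool_proj):
        super().__init__()
        self.b1 = conv_relu(in_ch, c1, kernel_size=1)
        self.b3 = nn.Sequential(conv_relu(in_ch, c3r, kernel_size=1),  # "#3x3 reduce"
                                conv_relu(c3r, c3, kernel_size=3, padding=1))
        self.b5 = nn.Sequential(conv_relu(in_ch, c5r, kernel_size=1),  # "#5x5 reduce"
                                conv_relu(c5r, c5, kernel_size=5, padding=2))
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                conv_relu(in_ch, pool_proj, kernel_size=1))  # "pool proj"

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

# Inception (3a) as listed in Table 1: 192 -> 64 + 128 + 32 + 32 = 256 channels.
m = Inception(192, 64, 96, 128, 16, 32, 32)
print(m(torch.randn(1, 192, 28, 28)).shape)  # torch.Size([1, 256, 28, 28])
```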
A useful aspect of this architecture is that it allows for increasing the number of units at each stage significantly without an uncontrolled blow-up in computational complexity at later stages. This is achieved by the ubiquitous use of dimensionality reduction prior to expensive convolutions with larger patch sizes. 
# A useful aspect of this architecture is that it allows the number of units at each stage to be increased significantly without an uncontrolled blow-up in computational complexity at later stages. This is achieved by the ubiquitous use of dimensionality reduction before the expensive convolutions with larger patch sizes.
Furthermore, the design follows the practical intuition that visual information should be processed at various scales and then aggregated so that the next stage can abstract features from the different scales simultaneously.
# Furthermore, the design follows the practical intuition that visual information should be processed at various scales and then aggregated, so that the next stage can abstract features from the different scales simultaneously.
The improved use of computational resources allows for increasing both the width of each stage as well as the number of stages without getting into computational difficulties. One can utilize the Inception architecture to create slightly inferior, but computationally cheaper versions of it.
# The improved use of computational resources allows both the width of each stage and the number of stages to be increased without getting into computational difficulties. One can also utilize the Inception architecture to create slightly inferior, but computationally cheaper, versions of it.
We have found that all the available knobs and levers allow for a controlled balancing of computational resources resulting in networks that are 3-10× faster than similarly performing networks with non-Inception architecture, however this requires careful manual design at this point.
# We have found that all the available knobs and levers allow for a controlled balancing of computational resources, resulting in networks that are 3-10× faster than similarly performing networks with non-Inception architectures, although this requires careful manual design at this point.
  • GoogLeNet
By the “GoogLeNet” name we refer to the particular incarnation of the Inception architecture used in our submission for the ILSVRC 2014 competition. We also used one deeper and wider Inception network with slightly superior quality, but adding it to the ensemble seemed to improve the results only marginally.
# By the name "GoogLeNet" we refer to the particular incarnation of the Inception architecture used in our submission for the ILSVRC 2014 competition. We also used a deeper and wider Inception network of slightly superior quality, but adding it to the ensemble seemed to improve the results only marginally.
We omit the details of that network, as empirical evidence suggests that the influence of the exact architectural parameters is relatively minor. Table 1 illustrates the most common instance of Inception used in the competition. This network (trained with different image patch sampling methods) was used for 6 out of the 7 models in our ensemble.
# We omit the details of that network, since empirical evidence suggests that the influence of the exact architectural parameters is relatively minor. Table 1 illustrates the most common instance of Inception used in the competition. This network (trained with different image-patch sampling methods) was used for 6 of the 7 models in our ensemble.
                    Table 1: GoogLeNet incarnation of the Inception architecture
All the convolutions, including those inside the Inception modules, use rectified linear activation. The size of the receptive field in our network is 224×224 in the RGB color space with zero mean. “#3×3 reduce” and “#5×5 reduce” stands for the number of 1×1 filters in the reduction layer used before the 3×3 and 5×5 convolutions. One can see the number of 1×1 filters in the projection layer after the built-in max-pooling in the pool proj column. All these reduction/projection layers use rectified linear activation as well.
# All the convolutions, including those inside the Inception modules, use rectified linear (ReLU) activation. The size of the receptive field of our network is 224×224 in the RGB color space with zero mean. "#3×3 reduce" and "#5×5 reduce" stand for the number of 1×1 filters in the reduction layers used before the 3×3 and 5×5 convolutions; the "pool proj" column gives the number of 1×1 filters in the projection layer after the built-in max-pooling. All these reduction/projection layers also use ReLU activation.
The network was designed with computational efficiency and practicality in mind, so that inference can be run on individual devices including even those with limited computational resources, especially with low-memory footprint. The network is 22 layers deep when counting only layers with parameters (or 27 layers if we also count pooling). The overall number of layers (independent building blocks) used for the construction of the network is about 100. The exact number depends on how layers are counted by the machine learning infrastructure.
# The network was designed with computational efficiency and practicality in mind, so that inference can run on individual devices, including those with limited computational resources, especially those with a small memory footprint. Counting only layers with parameters, the network is 22 layers deep (or 27 if pooling layers are also counted). The overall number of layers (independent building blocks) used to construct the network is about 100; the exact number depends on how layers are counted by the machine learning infrastructure.
The use of average pooling before the classifier is based on [12], although our implementation has an additional linear layer. The linear layer enables us to easily adapt our networks to other label sets, however it is used mostly for convenience and we do not expect it to have a major effect. We found that a move from fully connected layers to average pooling improved the top-1 accuracy by about 0.6%, however the use of dropout remained essential even after removing the fully connected layers.
# The use of average pooling before the classifier is based on [12], although our implementation has an additional linear layer. The linear layer makes it easy to adapt the network to other label sets; this is mostly a matter of convenience, and we do not expect it to have a major effect. We found that moving from fully connected layers to average pooling improved top-1 accuracy by about 0.6%, although the use of dropout remained essential even after removing the fully connected layers.
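A minimal sketch of that classifier head, under stated assumptions: with a 224×224 input, the final feature map is 7×7×1024, and the paper's Figure 3 lists a 40% dropout rate before the main classifier's linear layer:

```python
import torch.nn as nn

# Head of the main network: global average pooling replaces the usual stack
# of fully connected layers, followed by dropout and one linear classifier.
head = nn.Sequential(
    nn.AvgPool2d(kernel_size=7),  # 7x7x1024 feature map -> 1x1x1024
    nn.Flatten(),
    nn.Dropout(p=0.4),            # dropout stays essential (40%, per Figure 3)
    nn.Linear(1024, 1000),        # the extra linear layer; 1000 ILSVRC classes
)
```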
Given relatively large depth of the network, the ability to propagate gradients back through all the layers in an effective manner was a concern. The strong performance of shallower networks on this task suggests that the features produced by the layers in the middle of the network should be very discriminative. By adding auxiliary classifiers connected to these intermediate layers, discrimination in the lower stages in the classifier was expected.
# Given the relatively large depth of the network, the ability to propagate gradients back through all the layers in an effective manner was a concern. The strong performance of shallower networks on this task suggests that the features produced by the layers in the middle of the network should be very discriminative. By adding auxiliary classifiers connected to these intermediate layers, discrimination in the lower stages of the classifier was expected to be encouraged.
This was thought to combat the vanishing gradient problem while providing regularization. These classifiers take the form of smaller convolutional networks put on top of the output of the Inception (4a) and (4d) modules. During training, their loss gets added to the total loss of the network with a discount weight (the losses of the auxiliary classifiers were weighted by 0.3).
# This was thought to combat the vanishing-gradient problem while providing regularization. These classifiers take the form of smaller convolutional networks put on top of the outputs of the Inception (4a) and (4d) modules. During training, their losses are added to the total loss of the network with a discount weight (the losses of the auxiliary classifiers were weighted by 0.3).
At inference time, these auxiliary networks are discarded. Later control experiments have shown that the effect of the auxiliary networks is relatively minor (around 0.5%) and that it required only one of them to achieve the same effect.
# At inference time, these auxiliary networks are discarded. Later control experiments showed that the effect of the auxiliary networks is relatively minor (around 0.5%) and that only one of them is needed to achieve the same effect.
The exact structure of the extra network on the side, including the auxiliary classifier, is as follows:
# The exact structure of the extra network on the side, including the auxiliary classifier, is as follows:
1. An average pooling layer with 5×5 filter size and stride 3, resulting in a 4×4×512 output for the (4a) stage, and 4×4×528 for the (4d) stage.
# 1. An average pooling layer with 5×5 filters and stride 3, producing a 4×4×512 output at (4a) and 4×4×528 at (4d).
2. A 1×1 convolution with 128 filters for dimension reduction and rectified linear activation.
# 2. A 1×1 convolution with 128 filters for dimension reduction, followed by ReLU activation.
3. A fully connected layer with 1024 units and rectified linear activation.
# 3. A fully connected layer with 1024 units and ReLU activation.
4. A dropout layer with 70% ratio of dropped outputs.
# 4. A dropout layer dropping 70% of its outputs.
5. A linear layer with softmax loss as the classifier (predicting the same 1000 classes as the main classifier, but removed at inference time).
# 5. A linear layer with softmax loss as the classifier, predicting the same 1000 classes as the main classifier but removed at inference time.
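Taken together, the five steps above translate into the following PyTorch sketch. The 4×4 spatial size follows from applying a 5×5/stride-3 average pool to the 14×14 feature maps at (4a)/(4d), and in_ch is 512 for (4a) or 528 for (4d); this is a reconstruction from the list, not the authors' code:

```python
import torch.nn as nn

def aux_classifier(in_ch):
    """Auxiliary head attached to Inception (4a) (in_ch=512) or (4d) (in_ch=528)."""
    return nn.Sequential(
        nn.AvgPool2d(kernel_size=5, stride=3),  # 14x14 -> 4x4
        nn.Conv2d(in_ch, 128, kernel_size=1),   # 1x1 dimension reduction
        nn.ReLU(inplace=True),
        nn.Flatten(),
        nn.Linear(128 * 4 * 4, 1024),
        nn.ReLU(inplace=True),
        nn.Dropout(p=0.7),                      # 70% of outputs dropped
        nn.Linear(1024, 1000),                  # softmax loss applied during training
    )
```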
A schematic view of the resulting network is depicted in Figure 3.
# A schematic view of the resulting network is depicted in Figure 3.
  • Training Methodology
GoogLeNet networks were trained using the DistBelief [4] distributed machine learning system using modest amounts of model and data-parallelism. Although we used a CPU based implementation only, a rough estimate suggests that the GoogLeNet network could be trained to convergence using few high-end GPUs within a week, the main limitation being the memory usage.
# GoogLeNet networks were trained using the DistBelief [4] distributed machine learning system, with modest amounts of model and data parallelism. Although we used a CPU-based implementation only, a rough estimate suggests that the GoogLeNet network could be trained to convergence within a week using a few high-end GPUs, the main limitation being memory usage.
Our training used asynchronous stochastic gradient descent with 0.9 momentum [17], fixed learning rate schedule (decreasing the learning rate by 4% every 8 epochs). Polyak averaging [13] was used to create the final model used at inference time.
# Our training used asynchronous stochastic gradient descent with 0.9 momentum [17] and a fixed learning rate schedule (decreasing the learning rate by 4% every 8 epochs). Polyak averaging [13] was used to create the final model used at inference time.
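The fixed schedule fits in one line; here is a sketch with an assumed base learning rate, since the paper does not report one:

```python
BASE_LR = 0.01  # assumption for illustration only; the paper does not give the base rate

def learning_rate(epoch):
    # Decrease the learning rate by 4% every 8 epochs.
    return BASE_LR * (0.96 ** (epoch // 8))

print([round(learning_rate(e), 5) for e in (0, 8, 16, 24)])
# [0.01, 0.0096, 0.00922, 0.00885]
```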
Image sampling methods have changed substantially over the months leading to the competition, and already converged models were trained on with other options, sometimes in conjunction with changed hyperparameters, such as dropout and the learning rate. Therefore, it is hard to give a definitive guidance to the most effective single way to train these networks. To complicate matters further, some of the models were mainly trained on smaller relative crops, others on larger ones, inspired by [8].
# Image sampling methods changed substantially over the months leading up to the competition, and already-converged models were trained further with other options, sometimes in conjunction with changed hyperparameters such as dropout and the learning rate. It is therefore hard to give definitive guidance on the single most effective way to train these networks. To complicate matters further, inspired by [8], some of the models were mainly trained on smaller relative crops and others on larger ones.
Still, one prescription that was verified to work very well after the competition includes sampling of various sized patches of the image whose size is distributed evenly between 8% and 100% of the image area with aspect ratio constrained to the interval [3/4, 4/3]. Also, we found that the photometric distortions of Andrew Howard [8] were useful to combat overfitting to the imaging conditions of training data.
# Still, one prescription that was verified to work very well after the competition consists of sampling patches of various sizes, distributed evenly between 8% and 100% of the image area, with aspect ratio constrained to the interval [3/4, 4/3]. We also found the photometric distortions of Andrew Howard [8] useful for combating overfitting to the imaging conditions of the training data.
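This sampling recipe matches the semantics of torchvision's RandomResizedCrop, so a present-day approximation of the augmentation pipeline might look like the following; the ColorJitter values are illustrative stand-ins for Howard's photometric distortions [8], not values from the paper:

```python
from torchvision import transforms

train_transform = transforms.Compose([
    # Area sampled evenly in [8%, 100%], aspect ratio in [3/4, 4/3], as in the paper.
    transforms.RandomResizedCrop(224, scale=(0.08, 1.0), ratio=(3/4, 4/3)),
    transforms.RandomHorizontalFlip(),
    # Stand-in for the photometric distortions of [8]; parameters are illustrative.
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.ToTensor(),
])
```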
  • ILSVRC 2014 Classification Challenge Setup and Results
The ILSVRC 2014 classification challenge involves the task of classifying the image into one of 1000 leaf-node categories in the Imagenet hierarchy. There are about 1.2 million images for training, 50,000 for validation and 100,000 images for testing. Each image is associated with one ground truth category, and performance is measured based on the highest scoring classifier predictions. 
# The ILSVRC 2014 classification challenge involves the task of classifying an image into one of 1000 leaf-node categories of the ImageNet hierarchy. There are about 1.2 million images for training, 50,000 for validation, and 100,000 for testing. Each image is associated with one ground-truth category, and performance is measured based on the highest-scoring classifier predictions.
Two numbers are usually reported: the top-1 accuracy rate, which compares the ground truth against the first predicted class, and the top-5 error rate, which compares the ground truth against the first 5 predicted classes: an image is deemed correctly classified if the ground truth is among the top-5, regardless of its rank in them. The challenge uses the top-5 error rate for ranking purposes.
# Two numbers are usually reported: the top-1 accuracy, which compares the ground truth against the first predicted class, and the top-5 error rate, which compares the ground truth against the first five predicted classes: an image is deemed correctly classified if the ground truth is among the top 5, regardless of its rank among them. The challenge uses the top-5 error rate for ranking.
We participated in the challenge with no external data used for training. In addition to the training techniques aforementioned in this paper, we adopted a set of techniques during testing to obtain a higher performance, which we describe next.
# We participated in the challenge with no external data used for training. In addition to the training techniques mentioned above, we adopted a set of techniques during testing to obtain higher performance, described next.
1. We independently trained 7 versions of the same GoogLeNet model (including one wider version), and performed ensemble prediction with them. These models were trained with the same initialization (even with the same initial weights, due to an oversight) and learning rate policies. They differed only in sampling methodologies and the randomized input image order.
# 1. We independently trained 7 versions of the same GoogLeNet model (including one wider version) and ensembled their predictions. These models were trained with the same initialization (even the same initial weights, due to an oversight) and the same learning rate policy; they differed only in the sampling methodology and the random order of the input images.
2. During testing, we adopted a more aggressive cropping approach than that of Krizhevsky et al. [9]. Specifically, we resized the image to 4 scales where the shorter dimension (height or width) is 256, 288, 320 and 352 respectively, take the left, center and right square of these resized images (in the case of portrait images, we take the top, center and bottom squares). For each square, we then take the 4 corners and the center 224×224 crop as well as the square resized to 224×224, and their mirrored versions.
# 2. During testing, we adopted a more aggressive cropping approach than that of Krizhevsky et al. [9]. Specifically, we resized each image to 4 scales where the shorter dimension (height or width) is 256, 288, 320, and 352 respectively, and took the left, center, and right squares of these resized images (for portrait images, the top, center, and bottom squares). For each square, we then took the 4 corner and center 224×224 crops, as well as the square itself resized to 224×224, plus the mirrored versions of all of these.
This leads to 4×3×6×2 = 144 crops per image. A similar approach was used by Andrew Howard [8] in the previous year’s entry, which we empirically verified to perform slightly worse than the proposed scheme. We note that such aggressive cropping may not be necessary in real applications, as the benefit of more crops becomes marginal after a reasonable number of crops are present (as we will show later on)
# This results in 4×3×6×2 = 144 crops per image. A similar approach was used by Andrew Howard [8] in the previous year's entry, which we empirically verified to perform slightly worse than the proposed scheme. We note that such aggressive cropping may not be necessary in real applications, as the benefit of more crops becomes marginal once a reasonable number of crops is present (as we show later). (A quick sanity check of this count appears in the sketch after this list.)
3. The softmax probabilities are averaged over multiple crops and over all the individual classifiers to obtain the final prediction. In our experiments we analyzed alternative approaches on the validation data, such as max pooling over crops and averaging over classifiers, but they lead to inferior performance than the simple averaging.
# 3. The softmax probabilities are averaged over the multiple crops and over all the individual classifiers to obtain the final prediction. In our experiments we analyzed alternative approaches on the validation data, such as max-pooling over crops and averaging over classifiers, but they led to inferior performance compared with simple averaging.
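As a reading aid, a small numpy sketch that sanity-checks the 4×3×6×2 = 144 crop count from technique 2 and then performs the averaging of technique 3; the shapes and the random probabilities are purely illustrative:

```python
import numpy as np

# Technique 2: 4 scales x 3 squares x 6 views x 2 mirrors = 144 crops per image.
n_crops = len([256, 288, 320, 352]) * 3 * 6 * 2
assert n_crops == 144

# Technique 3: average softmax outputs over crops and over the 7-model ensemble.
# Hypothetical per-image predictions, shape (n_models, n_crops, n_classes).
probs = np.random.dirichlet(np.ones(1000), size=(7, n_crops))
mean_probs = probs.mean(axis=(0, 1))      # shape: (1000,)
top5 = np.argsort(mean_probs)[::-1][:5]   # top-5 class indices, best first
print(top5)
```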
In the remainder of this paper, we analyze the multiple factors that contribute to the overall performance of the final submission.

Our final submission to the challenge obtains a top-5 error of 6.67% on both the validation and testing data, ranking the first among other participants. This is a 56.5% relative reduction compared to the SuperVision approach in 2012, and about 40% relative reduction compared to the previous year's best approach (Clarifai), both of which used external data for training the classifiers. Table 2 shows the statistics of some of the top-performing approaches over the past 3 years.

          Table 2: Classification performance.

      Table 3: GoogLeNet classification performance break down.

We also analyze and report the performance of multiple testing choices, by varying the number of models and the number of crops used when predicting an image in Table 3. When we use one model, we chose the one with the lowest top-1 error rate on the validation data. All numbers are reported on the validation dataset in order not to overfit to the testing data statistics.

  • ILSVRC 2014 Detection Challenge Setup and Results
The ILSVRC detection task is to produce bounding boxes around objects in images among 200 possible classes. Detected objects count as correct if they match the class of the ground truth and their bounding boxes overlap by at least 50% (using the Jaccard index). Extraneous detections count as false positives and are penalized. Contrary to the classification task, each image may contain many objects or none, and their scale may vary. Results are reported using the mean average precision (mAP).
# The ILSVRC detection task is to produce bounding boxes around objects in images, among 200 possible classes. Detected objects count as correct if they match the ground-truth class and their bounding boxes overlap the true location by at least 50% (using the Jaccard index). Extraneous detections count as false positives and are penalized. In contrast to the classification task, each image may contain many objects or none, and their scale may vary. Results are reported using mean average precision (mAP).
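For reference, the 50% overlap criterion is the Jaccard index (intersection over union) between the predicted and ground-truth boxes; a minimal sketch:

```python
def jaccard(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A detection counts as correct if the class matches and the overlap is >= 50%.
print(jaccard((0, 0, 10, 10), (5, 0, 15, 10)) >= 0.5)  # IoU = 1/3 -> False
```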
The approach taken by GoogLeNet for detection is similar to the R-CNN by [6], but is augmented with the Inception model as the region classifier. Additionally, the region proposal step is improved by combining the selective search [20] approach with multibox [5] predictions for higher object bounding box recall. In order to reduce the number of false positives, the superpixel size was increased by 2×. This halves the proposals coming from the selective search algorithm.
# The approach taken by GoogLeNet for detection is similar to the R-CNN of [6], but is augmented with the Inception model as the region classifier. Additionally, the region proposal step is improved by combining the selective search [20] approach with multibox [5] predictions for higher object bounding-box recall. In order to reduce the number of false positives, the superpixel size was increased 2×, which halves the number of proposals coming from the selective search algorithm.
We added back 200 region proposals coming from multi-box [5] resulting, in total, in about 60% of the proposals used by [6], while increasing the coverage from 92% to 93%. The overall effect of cutting the number of proposals with increased coverage is a 1% improvement of the mean average precision for the single model case. Finally, we use an ensemble of 6 GoogLeNets when classifying each region. This leads to an increase in accuracy from 40% to 43.9%. Note that contrary to R-CNN, we did not use bounding box regression due to lack of time.
# We added back the 200 region proposals coming from multi-box [5], resulting in total in about 60% of the proposals used by [6], while increasing the coverage from 92% to 93%. The overall effect of cutting the number of proposals while increasing coverage was a 1% improvement in mean average precision for the single-model case. Finally, we used an ensemble of 6 GoogLeNets when classifying each region, which increased the accuracy from 40% to 43.9%. Note that, contrary to R-CNN, we did not use bounding-box regression due to lack of time.
We first report the top detection results and show the progress since the first edition of the detection task. Compared to the 2013 result, the accuracy has almost doubled. The top performing teams all use convolutional networks. We report the official scores in Table 4 and common strategies for each team: the use of external data, ensemble models or contextual models.
# We first report the top detection results and show the progress since the first edition of the detection task. Compared with the 2013 result, accuracy has almost doubled. The top-performing teams all use convolutional networks. We report the official scores in Table 4, together with the common strategies of each team: the use of external data, ensemble models, or contextual models.
      Table 4: Comparison of detection performances. Unreported values are noted with question marks

The external data is typically the ILSVRC12 classification data for pre-training a model that is later refined on the detection data. Some teams also mention the use of the localization data. Since a good portion of the localization task bounding boxes are not included in the detection dataset, one can pre-train a general bounding box regressor with this data the same way classification is used for pre-training. The GoogLeNet entry did not use the localization data for pretraining.
# The external data is typically the ILSVRC12 classification dataset, used for pre-training a model that is later refined on the detection data. Some teams also mentioned using the localization data. Since a good portion of the localization-task bounding boxes are not included in the detection dataset, one can pre-train a general bounding-box regressor with this data, in the same way the classification data is used for pre-training. The GoogLeNet entry did not use the localization data for pre-training.
In Table 5, we compare results using a single model only. The top performing model is by Deep Insight and surprisingly only improves by 0.3 points with an ensemble of 3 models while the GoogLeNet obtains significantly stronger results with the ensemble.
# In Table 5 we compare results using a single model only. The top-performing model is by Deep Insight; surprisingly, it improves by only 0.3 points with an ensemble of 3 models, while GoogLeNet obtains significantly stronger results with its ensemble.

      Table 5: Single model performance for detection.
  • Conclusions
Our results yield a solid evidence that approximating the expected optimal sparse structure by readily available dense building blocks is a viable method for improving neural networks for computer vision.