TensorFlow in Action: Chapter 5 (CNN-3: Classic Convolutional Neural Networks (GoogleNet))

阿新 • Published: 2019-02-12

GoogleNet

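Introduction to GoogleNet

This section covers GoogleNet. The "Google" in the name naturally refers to Google, the heavyweight of the tech industry.

Google Inception Net first appeared in the ILSVRC 2014 competition (the same year as VGGNet) and won first place by a wide margin. That year's GoogleNet is usually called Inception V1. Its hallmark is that it keeps both the computational cost and the parameter count under control while achieving very good performance: a top-5 error rate of 6.67%. This is mainly credited to a new network structure introduced in GoogleNet, the Inception module, which is why GoogleNet is also called Inception V1 (improved versions V2, V3, and V4 followed). The architecture is 22 layers deep, deeper than both VGGNet and AlexNet, yet it has only about 5 million parameters and needs only about 1.5 billion floating-point operations. It lowers the parameter count and the computational cost while preserving accuracy, which makes it an excellent and practical model.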

The GoogleNet family

Google Inception Net is a large family that includes:

Inception V1, proposed in "Going deeper with convolutions" (September 2014)
Inception V2, proposed in "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift" (February 2015)
Inception V3, proposed in "Rethinking the Inception Architecture for Computer Vision" (December 2015)
Inception V4, proposed in "Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning" (February 2016)

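Evolution of GoogleNet

Inception V1:

The carefully designed Inception Module in Inception V1 improves parameter utilization. Inception V1 also removes the fully connected layers at the end of the model and replaces them with a global average pooling layer (which reduces the feature map size to 1x1). In earlier networks the fully connected layers held most of the parameters and were prone to overfitting (see the paper analysis below for details).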
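Global average pooling can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `global_average_pool` and the tiny 2x2 feature map with 2 channels are made up for the example and do not come from the GoogleNet code.

```python
def global_average_pool(feature_map):
    """Average every channel over all spatial positions.

    feature_map is a nested list of shape H x W x C (hypothetical layout).
    The result has one value per channel, i.e. a 1x1xC output.
    """
    height = len(feature_map)
    width = len(feature_map[0])
    channels = len(feature_map[0][0])
    return [
        sum(feature_map[y][x][k] for y in range(height) for x in range(width))
        / (height * width)
        for k in range(channels)
    ]


# A toy 2 x 2 feature map with 2 channels.
fmap = [[[1.0, 10.0], [2.0, 20.0]],
        [[3.0, 30.0], [4.0, 40.0]]]
pooled = global_average_pool(fmap)
print(pooled)
```

The layer has no weights at all, which is where the parameter saving over a fully connected layer comes from.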

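Inception V2:

Inception V2 borrows from VGGNet and replaces the large 5*5 convolution with two 3*3 convolutions (reducing the parameter count and easing overfitting). It also introduces the well-known Batch Normalization (BN) method. BN is a very effective regularization method that can speed up the training of large convolutional networks many times over, and the classification accuracy after convergence also improves substantially.

When applied to a layer of a neural network, BN standardizes the data within each mini-batch so that the output is normalized toward an N(0,1) normal distribution (zero mean, unit variance), which reduces Internal Covariate Shift (changes in the distribution of internal activations). The BN paper points out that when a traditional deep network is trained, the input distribution of every layer keeps changing, which makes training difficult, and the usual remedy is a very small learning rate. With BN applied to each layer this problem is handled effectively: the learning rate can be increased many times over, reaching the previous accuracy takes only 1/14 of the iterations, training time drops sharply, and training can continue after that accuracy is reached. Because BN also acts as a regularizer to some extent, Dropout can be reduced or removed, which simplifies the network structure.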
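The per-mini-batch standardization described above can be sketched as follows. This is a minimal illustration for a single feature, assuming the training-time formula from the BN paper (subtract the batch mean, divide by the batch standard deviation, then apply a learnable scale `gamma` and shift `beta`). The function name `batch_norm` and the sample values are hypothetical, and the moving averages used at inference time are left out.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-3):
    """Standardize one feature over a mini-batch, then scale and shift.

    batch: list of activations of one feature across the mini-batch.
    gamma, beta: scale and shift (learned in a real network, fixed here).
    eps: small constant that keeps the division numerically stable.
    """
    n = len(batch)
    mean = sum(batch) / n
    variance = sum((x - mean) ** 2 for x in batch) / n  # biased batch variance
    return [gamma * (x - mean) / math.sqrt(variance + eps) + beta for x in batch]


activations = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(activations)
norm_mean = sum(normalized) / len(normalized)
norm_var = sum((x - norm_mean) ** 2 for x in normalized) / len(normalized)
print(normalized, norm_mean, norm_var)
```

After normalization the batch has mean close to 0 and variance close to 1, whatever the scale of the incoming activations was.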

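Of course, using BN calls for some adjustments:

Increase the learning rate and speed up learning-rate decay to suit the BN-normalized data
Remove Dropout and reduce L2 regularization (BN already acts as a regularizer)
Remove LRN
Shuffle the training samples more thoroughly
Reduce photometric distortions during data augmentation (BN trains faster, so each sample is seen fewer times, and real samples therefore help training more)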

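Inception V3:

Inception V3 makes changes in two main areas:

First, it introduces the idea of Factorization into small convolutions: a larger 2-D convolution is split into two smaller 1-D convolutions, for example a 7*7 convolution into a 1*7 convolution followed by a 7*1 convolution (the figure below shows a 3*3 convolution split into 1*3 and 3*1). This saves a large number of parameters, speeds up computation, and reduces overfitting, and it also adds one more nonlinearity, which increases the model's expressive power. The paper notes that this asymmetric factorization works better than splitting symmetrically into several identical small kernels: it can handle more and richer spatial features and increases feature diversity.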

[Figure: a 3*3 convolution kernel factorized into a 1*3 convolution followed by a 3*1 convolution]
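The parameter saving from factorization is simple arithmetic, sketched below. The helper `conv_weights` and the channel count of 192 are assumptions chosen for illustration; biases are ignored and the input and output channel counts are kept equal so that only the kernel shapes differ.

```python
def conv_weights(kernel_h, kernel_w, in_channels, out_channels):
    """Number of weights in one convolution layer (biases ignored)."""
    return kernel_h * kernel_w * in_channels * out_channels


C = 192  # hypothetical channel count, kept constant through every layer

# Inception V2 idea: two stacked 3*3 convolutions instead of one 5*5.
one_5x5 = conv_weights(5, 5, C, C)
two_3x3 = 2 * conv_weights(3, 3, C, C)

# Inception V3 idea: a 1*7 followed by a 7*1 instead of one 7*7.
one_7x7 = conv_weights(7, 7, C, C)
asymmetric = conv_weights(1, 7, C, C) + conv_weights(7, 1, C, C)

print(two_3x3 / one_5x5)      # fraction of weights kept by the 3*3 stack
print(asymmetric / one_7x7)   # fraction of weights kept by the 1*7 + 7*1 pair
```

Under these assumptions the 3*3 stack keeps 18/25 of the weights and the asymmetric pair keeps 14/49, while each replacement covers the same receptive field as the kernel it replaces.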

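Second, Inception V3 refines the structure of the Inception Module. There are now three different kinds of Inception Module, operating on 35*35, 17*17, and 8*8 feature maps, as shown in the figure below. These Inception Modules appear only in the later part of the network; the front part still consists of ordinary convolutional layers. In addition, branches are used inside the branches of the Inception Module.

[Figure: the three Inception Module structures in Inception V3]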

Inception V4:

Compared with V3, Inception V4 mainly incorporates Microsoft's ResNet. Interested readers can consult the paper "Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning".

GoogleNet paper analysis

This part analyzes Inception V1, proposed in "Going deeper with convolutions" (September 2014).

Abstract

Original	Description
The main hallmark of this architecture is the improved utilization of the computing resources inside the network.	The paper introduces a new network structure, Inception Modules, which improves the utilization of computing resources inside the network
To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing	The architectural decisions are based on the Hebbian principle and the intuition of multi-scale processing

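Detailed explanation

Applying the Hebbian principle to neural networks: if the probability distribution of a dataset can be represented by a very large, very sparse neural network, then the best way to build that network is layer by layer, clustering the highly correlated nodes of the previous layer and connecting the units of each resulting cluster together.

What is the Hebbian principle?
Sustained, repeated neural reflex activity leads to a lasting increase in the stability of the connections between neurons. When two neurons A and B are close together and A repeatedly and persistently takes part in exciting B, metabolic changes make A more effective at exciting B. In short: "neurons that fire together wire together"; stimulation during learning strengthens the synapses between neurons.
Why do we need sparse neural networks, and what does sparsity mean here?
Connections between neurons in the human brain are sparse, so researchers believe that a sensible connection pattern for large neural networks should also be sparse. A sparse structure suits neural networks very well, especially very large and very deep ones, because it reduces overfitting and lowers the computational cost. A CNN, for example, is sparsely connected.
Why is a CNN sparsely connected?
Following the Hebbian principle, we should connect clusters of highly correlated neurons. In an ordinary dataset this may require clustering the neurons, but in image data neighboring regions are naturally highly correlated, so adjacent pixels are connected by the convolution operation (consistent with the Hebbian principle), and the convolution operation is itself a sparse connection.
How do we build a network that satisfies the Hebbian principle?
A CNN layer usually has many convolution kernels, and the outputs of kernels at the same spatial position but in different channels are highly correlated. A 1*1 convolution is a natural way to connect these highly correlated features that share a spatial position but lie in different channels.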
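To make the last point concrete, the sketch below shows what a 1*1 convolution computes: at every spatial position it takes a weighted sum over the input channels, so it mixes channels without looking at neighboring pixels. This is a plain-Python illustration, not TensorFlow code; the function name `conv1x1`, the toy feature map, and the weights are all made up for the example, and bias and activation are omitted.

```python
def conv1x1(feature_map, weights):
    """Apply a 1*1 convolution to an H x W x C_in nested list.

    weights has shape C_in x C_out. Each output channel is a linear
    combination of the input channels at the same spatial position.
    """
    out_channels = len(weights[0])
    return [
        [
            [sum(pixel[i] * weights[i][o] for i in range(len(pixel)))
             for o in range(out_channels)]
            for pixel in row
        ]
        for row in feature_map
    ]


# Toy 2 x 2 feature map with 3 channels, reduced to 2 channels.
fmap = [[[1, 2, 3], [4, 5, 6]],
        [[7, 8, 9], [0, 1, 0]]]
w = [[1, 0],
     [0, 1],
     [1, 1]]
out = conv1x1(fmap, w)
print(out)
```

The spatial size stays 2 x 2 while the channel count drops from 3 to 2, which is the same mechanism the Inception module uses for dimension reduction.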

1. Introduction

Original	Description
The biggest gains in object-detection have not come from the utilization of deep networks alone or bigger models, but from the synergy of deep architectures and classical computer vision	The biggest gains in object detection have come not from deep networks alone or from bigger models, but from the synergy of deep architectures and classical computer vision
especially their power and memory use	When building a model, the power and memory it needs for computation deserve attention
the word “deep” is used in two different meanings: first of all, in the sense that we introduce a new level of organization in the form of the “Inception module” and also in the more direct sense of increased network depth	The "deeper" idea from the NIN paper has two meanings in this paper: 1. the introduction of the "Inception module"; 2. greater network depth

2. Related work

Original	Description
We use this approach heavily in our architecture. However, in our setting, 1 * 1 convolutions have dual purpose: most critically, they are used mainly as dimension reduction modules to remove computational bottlenecks, that would otherwise limit the size of our networks. This allows for not just increasing the depth, but also the width of our networks without significant performance penalty	1*1 convolution kernels are used heavily in the network. Their main purpose is to reduce dimensionality and so remove computational bottlenecks that would otherwise limit the size of the network. With 1*1 kernels, both the depth and the width of the network can be increased without a significant performance penalty (power and memory use do not grow much)

3. Motivation and high-level considerations

Original	Description
improving the performance of deep neural networks is by increasing their size: 1. increasing the depth 2. its width	Generally, the most direct way to improve network performance is to increase the size of the network: 1. increase its depth; 2. increase its width
However this simple solution comes with two major drawbacks. 1.which makes the enlarged network more prone to overfitting, 2.Another drawback of uniformly increased network size is the dramatically increased use of computational resources	This simple solution has two major drawbacks: 1. with more parameters the network overfits more easily, which calls for a large amount of training data, and for fine-grained classification high-quality training data is too expensive; 2. simply enlarging the network increases the computation, and when the added computation is not used effectively, computing resources are wasted
The fundamental way of solving both issues would be by ultimately moving from fully connected to sparsely connected architectures, even inside the convolutions. Their main result states that if the probability distribution of the data-set is representable by a large, very sparse deep neural network, then the optimal network topology can be constructed layer by layer by analyzing the correlation statistics of the activations of the last layer and clustering neurons with highly correlated outputs	The way to address both drawbacks is to move from fully connected to sparsely connected structures (even inside the convolutions). If the probability distribution of the dataset can be represented by a large, sparse deep neural network, the optimal network can be built layer by layer by analyzing the correlations of the previous layer's outputs and clustering the correlated outputs
today's computing infrastructures are very inefficient when it comes to numerical calculation on non-uniform sparse data structures	Computation on non-uniform sparse data structures is very inefficient, which calls for optimization in the low-level computing libraries. Non-uniform sparse models require sophisticated engineering and computing infrastructure, and most vision-oriented machine learning systems exploit sparsity only in the spatial domain, through convolutions

Original	Description
The vast literature on sparse matrix computations (e.g. [3]) suggests that clustering sparse matrices into relatively dense submatrices tends to give state of the art practical performance for sparse matrix multiplication	A good practical approach to sparse matrix multiplication is to cluster the sparse matrix into relatively dense submatrices
tries to approximate a sparse structure implied by [2] for vision networks and covering the hypothesized outcome by dense, readily available components	The Inception architecture tries to approximate the sparse structure implied for vision networks and to realize that hypothesized structure with dense, readily available components
although the proposed architecture has become a success for computer vision, it is still questionable whether its quality can be attributed to the guiding principles that have lead to its construction	Although the Inception architecture has been applied successfully in computer vision, a question remains: whether its quality can be attributed to the guiding principles behind its construction still needs to be examined and verified

4. Architectural details

Original	Description
The main idea of the Inception architecture is based on finding out how an optimal local sparse structure in a convolutional vision network can be approximated and covered by readily available dense components	The main idea of the Inception architecture is to find out how an optimal local sparse structure in a convolutional vision network can be approximated and realized with readily available dense modules
one can also expect that there will be a smaller number of more spatially spread out clusters that can be covered by convolutions over larger patches, and there will be a decreasing number of patches over larger and larger regions. In order to avoid patch-alignment issues, current incarnations of the Inception architecture are restricted to filter sizes 1*1, 3*3 and 5*5, however this decision was based more on convenience rather than necessity	Larger convolution kernels cover regions that are more spread out in space, and there are fewer such clusters: the number of clusters decreases as the kernel size grows. To avoid patch-alignment issues, the Inception architecture currently uses only 1*1, 3*3, and 5*5 filter sizes; this choice is more a matter of convenience than necessity

Original	Description
This problem becomes even more pronounced once pooling units are added to the mix: their number of output filters equals to the number of filters in the previous stage. The merging of the output of the pooling layer with the outputs of convolutional layers would lead to an inevitable increase in the number of outputs from stage to stage	Adding pooling units makes the problem even more pronounced: their number of output filters equals the number of filters in the previous stage, so merging the pooling output with the convolution outputs inevitably increases the number of outputs from stage to stage
judiciously applying dimension reductions and projections wherever the computational requirements would increase too much otherwise	Dimension reductions and projections are applied judiciously wherever the computational requirements would otherwise grow too much
1*1 convolutions are used to compute reductions before the expensive 3*3 and 5*5 convolutions. Besides being used as reductions, they also include the use of rectified linear activation which makes them dual-purpose	1*1 convolutions compute reductions before the expensive 3*3 and 5*5 convolutions. In addition, they are followed by a rectified linear activation (ReLU), which makes them dual-purpose

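Detailed explanation

Let us focus on the basic structure of the Inception Module.

The goal of the Inception Module is to find an easily implemented module that approximates the optimal local sparse structure. Such a sparse structure requires clustering the highly correlated outputs of a layer; these clusters form the units of the next layer and are connected to the units of the previous layer.

Back in convolutional networks, assume that each unit of an earlier layer corresponds to some region of the input image. The convolution operation is then a good way to cluster. In the lower layers close to the input, correlated units concentrate in local regions, as in the gray area of the figure below:

[Figure: correlated units concentrated in a local region (gray area)]

Using 1*1 convolution kernels:
[Figure: covering the local clusters with 1*1 convolutions]

Using larger convolution kernels (3*3, 5*5): the clusters covered by larger kernels are more spread out in space and fewer in number

[Figure: covering larger regions with 3*3 and 5*5 convolutions]

To avoid patch-alignment issues, the filter sizes are currently restricted to 1*1, 3*3, and 5*5, mainly for convenience. Concatenating the outputs of these different convolutions over the same region gives the following Inception module:

[Figure: Inception module combining 1*1, 3*3, and 5*5 convolutions]

In addition, an extra parallel pooling path is added for further benefit

[Figure: Inception module with the parallel pooling path]

This is the naive form of the Inception Module.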

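This structure has one problem:
On top of a convolutional layer with many filters, even a modest number of 5*5 convolutions is very expensive, and the problem becomes more pronounced once pooling units are added: the number of output filters of the pooling path equals the number of filters in the previous stage. Merging the pooling output with the convolution outputs makes the number of outputs grow from stage to stage. Even if this architecture covers the optimal sparse structure, it does so very inefficiently, and the computation blows up within a few stages.

Hence the improved Inception Module architecture: many 1*1 convolutions are added for dimension reduction, which also increases the network's expressive power. A 1*1 convolution is applied before the 3*3 and 5*5 convolutions to reduce the dimension; this not only cuts computation but also adds a rectified linear activation (ReLU). See the figure below.

[Figure: Inception module with 1*1 dimension-reduction convolutions]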
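The effect of the 1*1 reduction on computation can be estimated with a rough multiply-add count. The numbers below (a 28*28 feature map with 192 input channels, 32 filters of size 5*5, and a reduction to 16 channels) are assumptions chosen for illustration, and the helper `conv_mult_adds` is hypothetical. The count assumes stride 1 with padding that keeps the spatial size and ignores biases and activations.

```python
def conv_mult_adds(height, width, kernel, in_channels, out_channels):
    """Multiply-adds of a stride-1, size-preserving convolution."""
    return height * width * kernel * kernel * in_channels * out_channels


H = W = 28      # hypothetical feature-map size
C_IN = 192      # hypothetical number of input channels
C_REDUCED = 16  # hypothetical width of the 1*1 reduction
C_OUT = 32      # hypothetical number of 5*5 filters

# 5*5 convolution applied directly to all input channels.
direct = conv_mult_adds(H, W, 5, C_IN, C_OUT)

# 1*1 reduction first, then the 5*5 convolution on the reduced tensor.
reduced = (conv_mult_adds(H, W, 1, C_IN, C_REDUCED)
           + conv_mult_adds(H, W, 5, C_REDUCED, C_OUT))

print(direct, reduced, reduced / direct)
```

With these numbers the reduced path needs roughly a tenth of the multiply-adds of the direct 5*5 convolution while producing an output of the same shape.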

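The figure above has 4 branches. The first branch applies a 1*1 convolution to the input. The 1*1 convolution is an excellent structure: it organizes information across channels, increases the network's expressive power, and can raise or lower the number of output channels. All 4 branches of the Inception Module use 1*1 convolutions to perform low-cost cross-channel feature transformations (far cheaper than 3*3 convolutions).

The 4 branches of the Inception Module are merged at the end by a concatenation (along the output-channel dimension), producing an efficient sparse structure consistent with the Hebbian principle. The Inception Module contains convolutions of three different sizes plus one max pooling, which increases the network's adaptability to different scales, similar to the Multi-Scale idea. Overall, the Inception Module lets the depth and width of the network grow efficiently, improving accuracy without overfitting.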
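The merge along the output-channel dimension can be sketched in plain Python. The branch widths 64, 64, 96, and 32 are the ones used by the first Inception module in the TensorFlow listing later in this article; the helper names `make_map` and `concat_channels` and the constant fill values are made up for the example.

```python
def make_map(height, width, channels, value=0.0):
    """Build an H x W x C nested list filled with a constant."""
    return [[[value] * channels for _ in range(width)] for _ in range(height)]


def concat_channels(*branches):
    """Concatenate feature maps with equal H and W along the channel axis."""
    height, width = len(branches[0]), len(branches[0][0])
    for branch in branches:
        assert len(branch) == height and len(branch[0]) == width
    return [
        [sum((branch[y][x] for branch in branches), []) for x in range(width)]
        for y in range(height)
    ]


# Four branch outputs on a 35 x 35 feature map.
branch_0 = make_map(35, 35, 64, 0.0)
branch_1 = make_map(35, 35, 64, 1.0)
branch_2 = make_map(35, 35, 96, 2.0)
branch_3 = make_map(35, 35, 32, 3.0)

merged = concat_channels(branch_0, branch_1, branch_2, branch_3)
print(len(merged), len(merged[0]), len(merged[0][0]))
```

The spatial size is unchanged and the channel counts simply add up, 64 + 64 + 96 + 32 = 256, which is why every branch must keep the same height and width.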

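Within an Inception Module, the 1*1 convolutions usually have the largest share (of output channels), with the 3*3 and 5*5 convolutions somewhat lower. The full model stacks many Inception Modules. We want later Inception Modules to capture higher-level abstract features, so the spatial concentration of their convolutions should gradually decrease, allowing them to capture features over larger areas. Therefore, the later an Inception Module is, the larger the share (of output channels) of the 3*3 and 5*5 kernels should be.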

5. GoogLeNet

Original	Description
It was found that a move from fully connected layers to average pooling improved the top-1 accuracy by about 0.6%, however the use of dropout remained essential even after removing the fully connected layers.	Replacing the FC layers with average pooling improves top-1 accuracy by about 0.6%. Note that dropout is still needed even after the FC layers are removed
that the features produced by the layers in the middle of the network should be very discriminative	A useful insight: the features from the middle layers of the network should be very discriminative (given that the network is 22 layers deep and that some relatively shallow models perform well)
By adding auxiliary classifiers connected to these intermediate layers, we would expect to encourage discrimination in the lower stages in the classifier, increase the gradient signal that gets propagated back, and provide additional regularization	We want the classifier to discriminate already at the lower stages, so auxiliary classifiers are attached to intermediate layers. This strengthens the gradient signal propagated back in BP and also provides additional regularization
During training, their loss gets added to the total loss of the network with a discount weight (the losses of the auxiliary classifiers were weighted by 0.3). At inference time, these auxiliary networks are discarded.	During training, the losses of the auxiliary classifiers are added to the total loss with a discount weight (0.3). At inference time the auxiliary classifiers are not used

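Detailed explanation

The overall GoogleNet architecture is shown in the figure below:
[Figure: overall GoogleNet network architecture]

About the auxiliary classifiers:

Inception Net is 22 layers deep. Besides the output of the final layer, the intermediate nodes also classify well, so Inception Net also uses auxiliary classifiers: the output of an intermediate layer is used for classification, and its loss is added to the total loss with a small weight (0.3). This amounts to a form of model ensembling, adds gradient signal for BP, and provides extra regularization.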
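The weighting of the auxiliary losses can be written as a one-line formula. The sketch below assumes two auxiliary classifiers, as in the paper, with each auxiliary loss discounted by 0.3; the function name `total_loss` and the loss values are hypothetical.

```python
def total_loss(main_loss, aux_losses, aux_weight=0.3):
    """Training loss: main loss plus discounted auxiliary losses."""
    return main_loss + aux_weight * sum(aux_losses)


# Hypothetical loss values for one training step.
main = 1.0
auxiliary = [2.0, 3.0]  # one value per auxiliary classifier
loss = total_loss(main, auxiliary)
print(loss)
```

At inference time only the main classifier is evaluated, so the auxiliary terms affect training alone.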

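Structure of the auxiliary classifier:

An average pooling layer with a 5*5 filter and stride 3, giving a 4*4*512 output for the (4a) stage and a 4*4*528 output for the (4d) stage;
A 1*1 convolution with 128 filters for dimension reduction, followed by rectified linear activation;
A fully connected layer with 1024 units and rectified linear activation;
A dropout layer with a 70% drop ratio;
A linear layer with softmax loss as the classifier (predicting the same 1000 classes as the main classifier, but removed at inference time).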

6. Training methodology

Original	Description
Our training used asynchronous stochastic gradient descent with 0.9 momentum [17], fixed learning rate schedule (decreasing the learning rate by 4% every 8 epochs). Polyak averaging [13] was used to create the final model used at inference time	Training uses asynchronous stochastic gradient descent with 0.9 momentum and a fixed learning rate schedule (the learning rate drops by 4% every 8 epochs). Polyak averaging is used to create the final model used at inference time
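The fixed schedule quoted above is a step decay, sketched below. The paper excerpt does not give the starting learning rate, so `base_lr=0.01` is purely an assumed value for illustration, and the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, base_lr=0.01, drop=0.04, every=8):
    """Step decay: multiply the rate by (1 - drop) once per `every` epochs.

    base_lr is an assumed starting value, not taken from the paper.
    """
    return base_lr * (1.0 - drop) ** (epoch // every)


for epoch in (0, 7, 8, 16, 80):
    print(epoch, learning_rate(epoch))
```

The rate is constant inside each 8-epoch window and shrinks by 4% at every window boundary.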

7. ILSVRC 2014 Classification Challenge Setup and Results

Original	Description
We independently trained 7 versions of the same GoogLeNet model and they only differ in sampling methodologies and the random order in which they see input images.	Seven models were trained independently with the same initialization (the same initial weights). They differ only in their sampling methods and in the random order of the input images
we resize the image to 4 scales where the shorter dimension (height or width) is 256, 288, 320 and 352 respectively, take the left, center and right square of these resized images (in the case of portrait images, we take the top, center and bottom squares). For each square, we then take the 4 corners and the center 224*224 crop as well as the square resized to 224*224, and their mirrored versions. This results in 4*3*6*2 = 144 crops per image	At test time, the shorter side of the image is resized to 4 scales (256, 288, 320, 352). From each resized image the left, center, and right squares are taken (top, center, and bottom for portrait images). From each square, the 4 corner crops and the center 224*224 crop are taken, plus the whole square resized to 224*224, which gives 6 images. Each is also mirrored horizontally, for a total of 4*3*6*2 = 144 test crops per image
The softmax probabilities are averaged over multiple crops and over all the individual classifiers to obtain the final prediction.	The softmax probabilities are averaged over all crops and all individual classifiers to obtain the final prediction
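The crop count and the final averaging can be checked with a short enumeration. The labels for squares and views below are descriptive names invented for this sketch, as is the helper `average_probabilities`; no image processing is performed, only the bookkeeping.

```python
from itertools import product

scales = [256, 288, 320, 352]                       # shorter side after resizing
squares = ["left/top", "center", "right/bottom"]    # three squares per scale
views = ["top-left", "top-right", "bottom-left",
         "bottom-right", "center", "resized-square"]  # six 224*224 views
mirrored = [False, True]                            # original and mirror image

crops = list(product(scales, squares, views, mirrored))
print(len(crops))


def average_probabilities(prob_lists):
    """Average class probabilities element-wise over crops or models."""
    count = len(prob_lists)
    return [sum(column) / count for column in zip(*prob_lists)]


# Two hypothetical softmax outputs over two classes.
averaged = average_probabilities([[0.2, 0.8], [0.6, 0.4]])
print(averaged)
```

The enumeration confirms 4*3*6*2 = 144 crops, and the averaged vector is still a probability distribution because every input vector sums to 1.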

8. ILSVRC 2014 Detection Challenge Setup and Results

(omitted)

9. Conclusions

(omitted)

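Implementing GoogleNet in TensorFlow

Because Google Inception Net is relatively complex, we follow the source code of the Inception-V3 model open-sourced in TensorFlow's GitHub repository. The Inception modules used in V3 have a more complex and varied structure than those in V1 and V2.

The V3 model has 46 layers in total and three groups of Inception modules (with three, five, and three modules), with 96 convolutional layers overall. As you can imagine, the V3 code is long, so tf.contrib.slim is used here to help design the network and keep the code short.

The V3 network architecture is shown below:

[Figure: Inception V3 network architecture]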

Network overview:

Type	Kernel size / stride (or note)	Output size
conv0	3 * 3 / 2	149 * 149 * 32
conv1	3 * 3 / 1	147 * 147 * 32
conv2	3 * 3 / 1	147 * 147 * 64
pool1	3 * 3 / 2	73 * 73 * 64
conv3	1 * 1 / 1	73 * 73 * 80
conv4	3 * 3 / 1	71 * 71 * 192
pool2	3 * 3 / 2	35 * 35 * 192
Inception module group	mixed_b mixed_c mixed_d	35 * 35 * 256 35 * 35 * 288 35 * 35 * 288
Inception module group	mixed_a mixed_b mixed_c mixed_d mixed_e	17 * 17 * 768 17 * 17 * 768 17 * 17 * 768 17 * 17 * 768 17 * 17 * 768
Inception module group	mixed_a mixed_b mixed_c	8 * 8 * 1280 8 * 8 * 2048 8 * 8 * 2048
Pooling	8 * 8	1 * 1 * 2048
logits	logits	1 * 1 * 1000
Softmax	classification output	1 * 1 * 1000
Auxiliary classifier
avg_pool	5 * 5 / 3	5 * 5 * 768
conv0	1 * 1 / 1	5 * 5 * 128
conv1	5 * 5 / 1	1 * 1 * 768
logits	logits	1 * 1 * 1000
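The spatial sizes in the overview can be reproduced with the usual output-size rules for 'VALID' and 'SAME' padding. The sketch below is plain Python rather than TensorFlow; the helper `out_size` is hypothetical, and the layer list mirrors the kernel sizes, strides, and padding modes of the stem in the listing that follows (the layer producing 73 * 73 * 80 is taken as a 1*1 convolution, as its scope name Conv2d_3b_1x1 indicates).

```python
def out_size(size, kernel, stride, padding):
    """Output length along one spatial axis of a convolution or pooling op."""
    if padding == "SAME":
        return -(-size // stride)           # ceil(size / stride)
    return (size - kernel) // stride + 1    # 'VALID': no padding


# (name, kernel, stride, padding) for the layers before the Inception groups.
stem = [
    ("Conv2d_1a_3x3", 3, 2, "VALID"),
    ("Conv2d_2a_3x3", 3, 1, "VALID"),
    ("Conv2d_2b_3x3", 3, 1, "SAME"),
    ("MaxPool_3a_3x3", 3, 2, "VALID"),
    ("Conv2d_3b_1x1", 1, 1, "VALID"),
    ("Conv2d_4a_3x3", 3, 1, "VALID"),
    ("MaxPool_5a_3x3", 3, 2, "VALID"),
]

size = 299  # input images are 299 x 299 x 3
sizes = []
for name, kernel, stride, padding in stem:
    size = out_size(size, kernel, stride, padding)
    sizes.append(size)
    print(name, size)

# Auxiliary classifier: 5*5 average pooling with stride 3 on a 17 x 17 map.
aux_pool = out_size(17, 5, 3, "VALID")
print(aux_pool)
```

The loop walks from 299 down to 35, matching the size comments in the code, and the auxiliary pooling turns a 17 x 17 map into 5 x 5.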

The code is as follows (see the comments for all design details and sizes):

# coding:UTF-8
import tensorflow as tf
from datetime import datetime
import math
import time

slim = tf.contrib.slim
trunc_normal = lambda stddev: tf.truncated_normal_initializer(0.0, stddev)

# Builds the default arguments for functions used throughout the network:
# activation function for convolutions, weight initializer, normalizer, etc.
def inception_v3_arg_scope(weight_decay=0.00004,    # weight_decay for L2 regularization
                           stddev=0.1,  # standard deviation 0.1
                           batch_norm_var_collection='moving_vars'):

  batch_norm_params = {  # parameter dictionary for batch normalization
      'decay': 0.9997,  # decay coefficient
      'epsilon': 0.001,
      'updates_collections': tf.GraphKeys.UPDATE_OPS,
      'variables_collections': {
          'beta': None,
          'gamma': None,
          'moving_mean': [batch_norm_var_collection],
          'moving_variance': [batch_norm_var_collection],
      }
  }

  # slim.arg_scope assigns default values to function arguments automatically.
  # Here it sets arguments for the two functions [slim.conv2d, slim.fully_connected].
  # With slim.arg_scope the arguments need not be repeated on every call,
  # only set where they differ from the defaults.
  with slim.arg_scope([slim.conv2d, slim.fully_connected],
                      weights_regularizer=slim.l2_regularizer(weight_decay)): # defaults for [slim.conv2d, slim.fully_connected]

    # A nested slim.arg_scope gives defaults to several arguments of slim.conv2d
    with slim.arg_scope(
        [slim.conv2d],
        weights_initializer=trunc_normal(stddev), # weight initializer
        activation_fn=tf.nn.relu, # activation function
        normalizer_fn=slim.batch_norm, # normalizer
        normalizer_params=batch_norm_params) as sc: # normalizer arguments: the batch_norm_params defined above
      return sc # return the finished scope




# Builds the convolutional part of the V3 network
def inception_v3_base(inputs, scope=None):
  '''
  Args:
  inputs: the input tensor
  scope: the environment holding the default function arguments
  '''
  end_points = {} # dictionary that stores key nodes for later use

  with tf.variable_scope(scope, 'InceptionV3', [inputs]):
    with slim.arg_scope([slim.conv2d, slim.max_pool2d, slim.avg_pool2d], # defaults for these three functions
                        stride=1, padding='VALID'):

      # Thanks to slim and slim.arg_scope, one line of code defines a convolutional layer.
      # This is more convenient than AlexNet, where a layer took several lines,
      # or VGGNet, where a dedicated helper function defined each layer.
      #
      # The Inception V3 structure starts here with the convolutional layers that precede the Inception Modules.
      # slim.conv2d takes the input tensor, the number of output channels, the kernel size, the stride, and the padding mode.

      # 5 convolutional layers and 2 pooling layers shrink the image and abstract its features
      # 299 x 299 x 3
      net = slim.conv2d(inputs, 32, [3, 3],
                        stride=2, scope='Conv2d_1a_3x3')    # 149 x 149 x 32

      net = slim.conv2d(net, 32, [3, 3],
                        scope='Conv2d_2a_3x3')      # 147 x 147 x 32

      net = slim.conv2d(net, 64, [3, 3], padding='SAME',
                        scope='Conv2d_2b_3x3')  # 147 x 147 x 64

      net = slim.max_pool2d(net, [3, 3], stride=2,
                            scope='MaxPool_3a_3x3')   # 73 x 73 x 64

      net = slim.conv2d(net, 80, [1, 1],
                        scope='Conv2d_3b_1x1')  # 73 x 73 x 80

      net = slim.conv2d(net, 192, [3, 3],
                        scope='Conv2d_4a_3x3')  # 71 x 71 x 192

      net = slim.max_pool2d(net, [3, 3], stride=2,
                            scope='MaxPool_5a_3x3') # 35 x 35 x 192


    # Three consecutive groups of Inception modules follow, each containing several
    # Inception Modules. This part is the core of Inception V3. The Inception Modules
    # within one group are very similar in structure but differ in some details.

    # Inception blocks
    with slim.arg_scope([slim.conv2d, slim.max_pool2d, slim.avg_pool2d], # defaults for all module groups
                        stride=1, padding='SAME'): # stride 1 for all convolution, max pooling, and average pooling layers
      # The first group contains three Inception Modules with similar structure
      # --------------------------------------------------------
      # First Inception group: three Inception modules in total
      with tf.variable_scope('Mixed_5b'): # name of the first Inception Module, which has four branches
        # Branch 1: a 64-channel 1*1 convolution
        with tf.variable_scope('Branch_0'):
          branch_0 = slim.conv2d(net, 64, [1, 1], scope='Conv2d_0a_1x1') # 35x35x64
        # Branch 2: a 48-channel 1*1 convolution followed by a 64-channel 5*5 convolution
        with tf.variable_scope('Branch_1'):
          branch_1 = slim.conv2d(net, 48, [1, 1], scope='Conv2d_0a_1x1') # 35x35x48
          branch_1 = slim.conv2d(branch_1, 64, [5, 5], scope='Conv2d_0b_5x5') # 35x35x64
        # Branch 3: a 64-channel 1*1 convolution, a 96-channel 3*3, then another 3*3
        with tf.variable_scope('Branch_2'):
          branch_2 = slim.conv2d(net, 64, [1, 1], scope='Conv2d_0a_1x1')
          branch_2 = slim.conv2d(branch_2, 96, [3, 3], scope='Conv2d_0b_3x3')
          branch_2 = slim.conv2d(branch_2, 96, [3, 3], scope='Conv2d_0c_3x3') # 35x35x96
        # Branch 4: 3*3 average pooling followed by a 32-channel 1*1 convolution
        with tf.variable_scope('Branch_3'):
          branch_3 = slim.avg_pool2d(net, [3, 3], scope='AvgPool_0a_3x3')
          branch_3 = slim.conv2d(branch_3, 32, [1, 1], scope='Conv2d_0b_1x1') # 35x35x32
        # Merge the four branch outputs along dimension 3, the output channels
        net = tf.concat([branch_0, branch_1, branch_2, branch_3], 3)
        # 64+64+96+32 = 256
        # Mixed_5b output: 35 x 35 x 256

      # Every layer here has stride 1 and padding SAME, so the spatial size does not
      # shrink, but the number of channels grows. The four branches add up to
      # 64+64+96+32=256 channels, so the output tensor is 35*35*256.
      with tf.variable_scope('Mixed_5c'):
        with tf.variable_scope('Branch_0'):
          branch_0 = slim.conv2d(net, 64, [1, 1], scope='Conv2d_0a_1x1')
        with tf.variable_scope('Branch_1'):
          branch_1 = slim.conv2d(net, 48, [1, 1], scope='Conv2d_0b_1x1')
          branch_1 = slim.conv2d(branch_1, 64, [5, 5], scope='Conv_1_0c_5x5')
        with tf.variable_scope('Branch_2'):
          branch_2 = slim.conv2d(net, 64, [1, 1], scope='Conv2d_0a_1x1')
          branch_2 = slim.conv2d(branch_2, 96, [3, 3], scope='Conv2d_0b_3x3')
          branch_2 = slim.conv2d(branch_2, 96, [3, 3], scope='Conv2d_0c_3x3')
        with tf.variable_scope('Branch_3'):
          branch_3 = slim.avg_pool2d(net, [3, 3], scope='AvgPool_0a_3x3')
          branch_3 = slim.conv2d(branch_3, 64, [1, 1], scope='Conv2d_0b_1x1')
        net = tf.concat([branch_0, branch_1, branch_2, branch_3], 3)
        # 64+64+96+64 = 288
        # Mixed_5c output: 35 x 35 x 288

      with tf.variable_scope('Mixed_5d'):
        with tf.variable_scope('Branch_0'):
          branch_0 = slim.conv2d(net, 64, [1, 1], scope='Conv2d_0a_1x1')
        with tf.variable_scope('Branch_1'):
          branch_1 = slim.conv2d(net, 48, [1, 1], scope='Conv2d_0a_1x1')
          branch_1 = slim.conv2d(branch_1, 64, [5, 5], scope='Conv2d_0b_5x5')
        with tf.variable_scope('Branch_2'):
          branch_2 = slim.conv2d(net, 64, [1, 1], scope='Conv2d_0a_1x1')
          branch_2 = slim.conv2d(branch_2, 96, [3, 3], scope='Conv2d_0b_3x3')
          branch_2 = slim.conv2d(branch_2, 96, [3, 3], scope='Conv2d_0c_3x3')
        with tf.variable_scope('Branch_3'):
          branch_3 = slim.avg_pool2d(net, [3, 3], scope='AvgPool_0a_3x3')
          branch_3 = slim.conv2d(branch_3, 64, [1, 1], scope='Conv2d_0b_1x1')
        net = tf.concat([branch_0, branch_1, branch_2, branch_3], 3)
        # 64+64+96+64 = 288
        # Mixed_5d output: 35 x 35 x 288

      # End of the first Inception group (three modules); output: 35*35*288
      # ----------------------------------------------------------------------
      # Second Inception group: five Inception modules in total
      with tf.variable_scope('Mixed_6a'):
        with tf.variable_scope('Branch_0'):
          branch_0 = slim.conv2d(net, 384, [3, 3], stride=2,
                                 padding='VALID', scope='Conv2d_1a_1x1') # 17x17x384
        with tf.variable_scope('Branch_1'):
          branch_1 = slim.conv2d(net, 64, [1, 1], scope='Conv2d_0a_1x1') # 35x35x64
          branch_1 = slim.conv2d(branch_1, 96, [3, 3], scope='Conv2d_0b_3x3') # 35x35x96
          branch_1 = slim.conv2d(branch_1, 96, [3, 3], stride=2,
                                 padding='VALID', scope='Conv2d_1a_1x1') # 17x17x96
        with tf.variable_scope('Branch_2'):
          branch_2 = slim.max_pool2d(net, [3, 3], stride=2, padding='VALID',
                                     scope='MaxPool_1a_3x3') # 17x17x288
        net = tf.concat([branch_0, branch_1, branch_2], 3)
        # The output size settles at 17 x 17 x 768
        # 384+96+288 = 768
        # Mixed_6a output: 17 x 17 x 768

      with tf.variable_scope('Mixed_6b'):
        with tf.variable_scope('Branch_0'):
          branch_0 = slim.conv2d(net, 192, [1, 1], scope='Conv2d_0a_1x1')
        with tf.variable_scope('Branch_1'):
          branch_1 = slim.conv2d(net, 128, [1, 1], scope='Conv2d_0a_1x1')
          branch_1 = slim.conv2d(branch_1, 128, [1, 7],
                                 scope='Conv2d_0b_1x7')
          # Chaining a 1*7 and a 7*1 convolution stands in for a 7*7 convolution,
          # with fewer parameters and less overfitting
          branch_1 = slim.conv2d(branch_1, 192, [7, 1], scope='Conv2d_0c_7x1')
        with tf.variable_scope('Branch_2'):
          branch_2 = slim.conv2d(net, 128, [1, 1], scope='Conv2d_0a_1x1')
          # Factorize the 7*7 convolution repeatedly
          branch_2 = slim.conv2d(branch_2, 128, [7, 1], scope='Conv2d_0b_7x1')
          branch_2 = slim.conv2d(branch_2, 128, [1, 7], scope='Conv2d_0c_1x7')
          branch_2 = slim.conv2d(branch_2, 128, [7, 1], scope='Conv2d_0d_7x1')
          branch_2 = slim.conv2d(branch_2, 192, [1, 7], scope='Conv2d_0e_1x7')
        with tf.variable_scope('Branch_3'):
          branch_3 = slim.avg_pool2d(net, [3, 3], scope='AvgPool_0a_3x3')
          branch_3 = slim.conv2d(branch_3, 192, [1, 1], scope='Conv2d_0b_1x1')
        net = tf.concat([branch_0, branch_1, branch_2, branch_3], 3)
        # 192+192+192+192 = 768
        # Mixed_6b output: 17 x 17 x 768

      with tf.variable_scope('Mixed_6c'):
        with tf.variable_scope('Branch_0'):
          # Each time the network passes through an Inception Module, the features are
          # effectively refined once more even if the output size stays the same. The
          # rich set of convolutions and nonlinearities helps network performance a lot.
          branch_0 = slim.conv2d(net, 192, [1, 1], scope='Conv2d_0a_1x1')
        with tf.variable_scope('Branch_1'):
          branch_1 = slim.conv2d(net, 160, [1, 1], scope='Conv2d_0a_1x1')
          branch_1 = slim.conv2d(branch_1, 160, [1, 7], scope='Conv2d_0b_1x7')
