
TensorFlow實戰: Chapter 4 (CNN-2: Classic Convolutional Neural Networks: AlexNet and VGGNet)

Introduction

In the previous chapter we reviewed the history and application areas of CNN models, dissected their structure and principles in detail, and briefly covered the CNN-related APIs in TensorFlow. Finally, we verified the strong feature-extraction ability of CNNs on the MNIST and CIFAR-10 datasets.

This chapter studies the classic CNN models. Each of them shone at the ImageNet competition, and together they trace the development of CNN models over the past few years.

This section focuses on AlexNet and VGGNet. The table below gives a brief comparison of the two networks:

Paper contribution
  AlexNet: Introduced a complete CNN architecture (many CNN models of recent years are variants of it) together with a number of training tricks; the landmark work that revived CNN models; used GPU acceleration to make training such a CNN feasible.
  VGGNet: Studied how performance changes as more layers of small convolution kernels are stacked; also examined the effect of multi-crop and dense evaluation on performance.

Network structure
  AlexNet: 1. Its architecture became the classic template for large CNN models; 2. successfully used the ReLU activation; 3. successfully applied (overlapping) max pooling.
  VGGNet: 1. Stacks of layers with small convolution kernels (which transfer well to other tasks); 2. uses 1*1 convolutions to strengthen the model's discriminative power.

Training tricks
  AlexNet: 1. Data augmentation: subtract the RGB mean of the training data, take random fixed-size crops, and apply horizontal flips to enlarge the training set; 2. dropout to improve robustness and generalization; 3. at test time, augment the test image as well and average the softmax outputs; 4. weight decay with momentum; 5. divide the learning rate by 10 whenever the validation error stops improving.
  VGGNet: Train a simple model A first and use its weights to initialize the more complex model B (transfer learning); data augmentation: subtract the RGB mean and apply horizontal flips to enlarge the training set; dropout; at test time, augment the test image and average at the softmax layer; weight decay with momentum; divide the learning rate by 10 whenever the validation error stops improving.

Other observations
  AlexNet: LRN layers help the model generalize.
  VGGNet: LRN has little effect and adds considerable computation.



AlexNet

Introduction to AlexNet

In 2012, Hinton's student Alex Krizhevsky proposed the CNN model AlexNet, which can be viewed as a deeper and wider version of LeNet. That year AlexNet won the ImageNet competition by a clear margin, bringing the top-5 error rate down to 16.4%, a huge improvement over the runner-up's 26.2%, while using less than half as many parameters as the second-place model. AlexNet can be seen as the first time neural networks flexed their muscles after their long trough: it established deep learning's dominance in computer vision and also pushed deep learning into speech recognition, natural language processing, reinforcement learning, and other fields.

Features of AlexNet

  1. Network architecture:
    • Successfully used ReLU as the activation function and verified that it outperforms Sigmoid in deeper networks.
    • Used LRN layers, which set up a competition among the activities of local neurons: relatively large responses become even larger while weaker neurons are suppressed, improving the model's generalization.
    • Used overlapping max pooling: the paper makes the stride smaller than the pooling kernel, so adjacent pooling outputs overlap, which enriches the extracted features.
  2. Against overfitting:
    • Data augmentation: random crops of the input size are taken from the original images (together with horizontal flips), which greatly reduces overfitting and improves generalization. The paper also applies PCA to the RGB values of the training images and perturbs the principal components with Gaussian noise of standard deviation 0.1.
    • Dropout randomly ignores a subset of neurons to avoid overfitting.
  3. Training speed:
    • GPU computation to speed up training.

Analysis of the AlexNet paper

Abstract


Original text:
"The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax."

Notes: Describes the AlexNet architecture: five convolutional layers (some followed by max-pooling layers) and three fully connected layers, the last of which is a 1000-way softmax; about 60 million parameters and 650,000 neurons.

Original text:
"To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective."

Notes: To reduce overfitting in the fully connected layers, dropout is used (explained in detail later).

1. Introduction


Original text:
"datasets of labeled images were relatively small… especially if they are augmented with label-preserving transformations … But objects in realistic settings exhibit considerable variability…"

Notes: Points out the limitations of earlier datasets and models: when datasets are small, models aided by label-preserving data augmentation can do well on simple recognition tasks, but objects in realistic settings vary far more, so such models fall short there. This motivates the ImageNet dataset.

Original text:
"so our model should also have lots of prior knowledge … Their capacity can be controlled by varying their depth and breadth … and they also make strong and mostly correct assumptions about the nature of images … CNNs have much fewer connections and parameters and so they are easier to train, while their theoretically-best performance is likely to be only slightly worse."

Notes: Why use a CNN, and what are its advantages? The model needs a lot of prior knowledge; a CNN's capacity can be controlled by varying its depth and breadth; a CNN has comparatively few connections and parameters, so it is practical to train; and its performance remains strong.


Original text:
"current GPUs … The size of our network made overfitting a significant problem … we found that removing any convolutional layer resulted in inferior performance. … our results can be improved simply by waiting for faster GPUs and bigger datasets to become available."

Notes: Strengths and limitations of the model: thanks to current GPUs, training a CNN of this size is now feasible; with so many parameters, overfitting remains a serious problem; depth matters, since removing any one convolutional layer noticeably hurts performance; and the results can be improved simply by waiting for faster GPUs and larger datasets.

2. The dataset


Original text:
"we down-sampled the images to a fixed resolution of 256 x 256. … except for subtracting the mean activity over the training set from each pixel."

Notes: Dataset preprocessing: images are rescaled and down-sampled to a fixed 256x256 resolution, and the mean activity over the training set is subtracted from each pixel of the input images (more on this later).
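As a rough illustration (not the book's code), the preprocessing described above might look like this in TensorFlow 1.x; the tensors `image` and `mean_image` are assumptions made for this sketch:

import tensorflow as tf

def preprocess(image, mean_image):
  # Rescale/down-sample to a fixed 256x256 resolution, then subtract the
  # per-pixel mean computed over the training set.
  # `image` is an HxWx3 uint8 tensor and `mean_image` a 256x256x3 float32
  # tensor; both are hypothetical inputs for this sketch.
  image = tf.image.resize_images(tf.to_float(image), [256, 256])
  return image - mean_image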

3. Network architecture


Original text:
"In terms of training time with gradient descent, these saturating nonlinearities are much slower than the non-saturating nonlinearity f(x) = max(0, x). … but networks with ReLUs consistently learn several times faster than equivalents with saturating neurons. … on this dataset the primary concern is preventing overfitting"

Notes: On the ReLU activation: traditional activations such as sigmoid or tanh saturate and suffer from vanishing gradients, which slows training down.

Earlier papers had paired ReLU with average pooling mainly to prevent overfitting; here ReLU is adopted because it makes training several times faster.

Notes

As the figure shows, sigmoid and tanh have saturation regions: when the input is very large or very small, the output saturates. In those regions the gradient is small, and in a deep network this causes vanishing gradients and slows down convergence.


The ReLU, by contrast, has a constant gradient over most of its input range, which helps back-propagate the error.

There is another property worth noting (my own view): when the input is less than or equal to zero, ReLU outputs zero, which is somewhat reminiscent of dropout (explained below) and can improve the model's robustness.

Of course, this does not mean ReLU is always better than sigmoid or tanh: ReLU is not differentiable everywhere (although its derivative is much cheaper to compute than sigmoid's or tanh's), and its output range is unbounded.
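A tiny numerical illustration of the saturation argument above (my own sketch, not from the paper): the sigmoid gradient nearly vanishes for large |x|, while the ReLU gradient stays at exactly 1 for any positive input.

import numpy as np

def sigmoid(x):
  return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
sigmoid_grad = sigmoid(x) * (1.0 - sigmoid(x))   # ~4.5e-5 at |x| = 10 (saturated)
relu_grad = (x > 0).astype(np.float64)           # 1 for every positive input
print(sigmoid_grad)
print(relu_grad)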


A single modern GPU now has enough compute and memory to hold all of AlexNet, so the paper's discussion of multi-GPU training is not covered in detail here.


Original text:
"we still find that the following local normalization scheme aids generalization. … This sort of response normalization implements a form of lateral inhibition inspired by the type found in real neurons, creating competition for big activities amongst neuron outputs computed using different kernels. … Response normalization reduces our top-1 and top-5 error rates by 1.4% and 1.2%, respectively."

Notes: The paper's view of local response normalization (LRN):

Applying LRN after ReLU improves the model's generalization.

LRN resembles the lateral inhibition found in biological neurons: the locally largest responses are amplified while weaker ones are suppressed (my reading is that this emphasizes locally salient features and, again, improves robustness).

Adding LRN layers to AlexNet lowers the error rate (note that the VGGNet paper, discussed next, found experimentally that LRN does not reduce the error rate and only adds computation, so where to apply LRN still needs to be weighed).
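For reference, a minimal sketch of an LRN layer with the hyper-parameters reported in the paper (n = 5, k = 2, alpha = 1e-4, beta = 0.75), using the TensorFlow 1.x op; `depth_radius` is half of the paper's window size n:

import tensorflow as tf

conv1_out = tf.random_normal([1, 55, 55, 96])   # dummy post-ReLU activation map
# b_i = a_i / (k + alpha * sum_j a_j^2)^beta, summed over n neighbouring channels
lrn1 = tf.nn.lrn(conv1_out, depth_radius=2, bias=2.0, alpha=1e-4, beta=0.75)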


Original text:
"each summarizing a neighborhood of size z * z centered at the location of the pooling unit. If we set s < z, we obtain overlapping pooling … We generally observe during training that models with overlapping pooling find it slightly more difficult to overfit."

Notes: Overlapping max pooling: each pooling unit summarizes a z * z neighborhood; if the stride s < z, the pooled regions overlap.

Overlapping pooling strengthens the features the model extracts and makes it slightly harder to overfit.
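A minimal sketch of the overlapping pooling described above (assuming NHWC layout), with a 3x3 window and stride 2, i.e. s = 2 < z = 3:

import tensorflow as tf

feature_map = tf.random_normal([1, 55, 55, 96])   # dummy conv1 output
pool = tf.nn.max_pool(feature_map,
                      ksize=[1, 3, 3, 1],         # z = 3
                      strides=[1, 2, 2, 1],       # s = 2 < z, so neighbouring windows overlap
                      padding='VALID')            # output: 27x27x96, since (55-3)/2 + 1 = 27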


Original text:
"Our network maximizes the multinomial logistic regression objective, which is equivalent to maximizing the average across training cases of the log-probability of the correct label under the prediction distribution. … The neurons in the fully connected layers are connected to all neurons in the previous layer. Response-normalization layers follow the first and second convolutional layers. Max-pooling layers follow both response-normalization layers as well as the fifth convolutional layer. The ReLU non-linearity is applied to the output of every convolutional and fully-connected layer. … The third, fourth, and fifth convolutional layers are connected to one another without any intervening pooling or normalization layers."

Notes: A more detailed description of the architecture:

The network maximizes the multinomial logistic regression objective, which is equivalent to maximizing, over the training cases, the average log-probability of the correct label under the predicted distribution (the usual softmax objective).

Each fully connected layer is fully connected to the previous layer, including FC1 to the output of the fifth convolutional layer. LRN layers follow the first and second convolutional layers, each followed by a max-pooling layer; a max-pooling layer also follows the fifth convolutional layer.

ReLU is applied after every convolutional and fully connected layer; the LRN layers after conv1 and conv2 operate on the ReLU outputs.

The third, fourth, and fifth convolutional layers are connected to one another directly, with no pooling or normalization in between.
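As a sketch of that objective (not the book's code): maximizing the average log-probability of the correct label is the same as minimizing the average softmax cross-entropy. In TensorFlow 1.x, with hypothetical `logits` and one-hot `labels`:

import tensorflow as tf

logits = tf.random_normal([128, 1000])                        # dummy class scores for a batch
labels = tf.one_hot(tf.zeros([128], dtype=tf.int32), 1000)    # dummy one-hot targets
# mean negative log-probability of the correct class under the softmax distribution
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits))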

Notes

(The structure described below is similar to, but not exactly the same as, the original AlexNet: the original was split across two GPUs, so its connectivity differs.)

The computation flow of AlexNet is shown in the figure below:


Layer                    Computation flow
Conv layer 1             input -> convolution -> ReLU -> LRN -> max-pooling
Conv layer 2             convolution -> ReLU -> LRN -> max-pooling
Conv layer 3             convolution -> ReLU
Conv layer 4             convolution -> ReLU
Conv layer 5             convolution -> ReLU
FC layer 1               matrix multiply -> ReLU -> dropout (introduced below)
FC layer 2               matrix multiply -> ReLU -> dropout
FC layer 3 (softmax)     matrix multiply -> ReLU -> softmax

The overall computation (with ReLU, LRN, and dropout omitted for simplicity) is:

Layer       Output        Operation
INPUT       [227x227x3]   Note: the paper says 224, but the arithmetic works out to 227
CONV1       [55x55x96]    96 11x11 filters at stride 4, pad 0; (227-11)/4 + 1 = 55
MAX POOL1   [27x27x96]    3x3 filters at stride 2; (55-3)/2 + 1 = 27
CONV2       [27x27x256]   256 5x5 filters at stride 1, pad 2; (27-5 + 2*2)/1 + 1 = 27
MAX POOL2   [13x13x256]   3x3 filters at stride 2; (27-3)/2 + 1 = 13
CONV3       [13x13x384]   384 3x3 filters at stride 1, pad 1; (13-3 + 2*1)/1 + 1 = 13
CONV4       [13x13x384]   384 3x3 filters at stride 1, pad 1; (13-3 + 2*1)/1 + 1 = 13
CONV5       [13x13x256]   256 3x3 filters at stride 1, pad 1; (13-3 + 2*1)/1 + 1 = 13
MAX POOL3   [6x6x256]     3x3 filters at stride 2; (13-3)/2 + 1 = 6
FC1         [4096]        4096 neurons
FC2         [4096]        4096 neurons
SOFTMAX     [1000]        1000 neurons
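All of the spatial sizes in the table follow from the usual formula output = (input - kernel + 2*pad)/stride + 1; a quick check in plain Python:

def out_size(size, kernel, stride, pad):
  # output = (input - kernel + 2 * pad) / stride + 1
  return (size - kernel + 2 * pad) // stride + 1

assert out_size(227, 11, 4, 0) == 55   # CONV1
assert out_size(55, 3, 2, 0) == 27     # MAX POOL1
assert out_size(27, 5, 1, 2) == 27     # CONV2
assert out_size(27, 3, 2, 0) == 13     # MAX POOL2
assert out_size(13, 3, 1, 1) == 13     # CONV3 / CONV4 / CONV5
assert out_size(13, 3, 2, 0) == 6      # MAX POOL3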

The resources used by the model are roughly as follows:


Layer                    Parameters    Memory per image (GPU training)
Conv layer 1             34K           500K
Conv layer 2             600K          220K
Conv layer 3             860K          60K
Conv layer 4             1300K         60K
Conv layer 5             870K          49K
FC layer 1               36870K        4K
FC layer 2               16380K        4K
FC layer 3 (softmax)     4000K         1K
Total                    ~60M parameters (60M * 4 bytes = 240 MB)    ~900K activations per image
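The parameter counts above can be reproduced approximately (the table rounds, and reflects the original two-GPU split) as kernel_h * kernel_w * in_channels * out_channels plus biases:

def conv_params(k, c_in, c_out):
  return k * k * c_in * c_out + c_out   # weights + biases

print(conv_params(11, 3, 96))           # conv1 ~ 35K
print(conv_params(5, 96, 256))          # conv2 ~ 614K
print(conv_params(3, 256, 384))         # conv3 ~ 885K
print(conv_params(3, 384, 384))         # conv4 ~ 1.3M
print(conv_params(3, 384, 256))         # conv5 ~ 885K
print(6 * 6 * 256 * 4096 + 4096)        # fc1   ~ 37.8M
print(4096 * 4096 + 4096)               # fc2   ~ 16.8M
print(4096 * 1000 + 1000)               # fc3   ~ 4.1M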

4. Reducing overfitting


Original text:
"The first form of data augmentation consists of generating image translations and horizontal reflections. This increases the size of our training set by a factor of 2048, though the resulting training examples are, of course, highly interdependent. … At test time, the network makes a prediction by extracting five 224 * 224 patches (the four corner patches and the center patch) as well as their horizontal reflections (hence ten patches in all), and averaging the predictions made by the network’s softmax layer on the ten patches. … The second form of data augmentation consists of altering the intensities of the RGB channels in training images. Specifically, we perform PCA on the set of RGB pixel values throughout the ImageNet training set. … To each training image, we add multiples of the found principal components, with magnitudes proportional to the corresponding eigenvalues times a random variable drawn from a Gaussian with mean zero and standard deviation 0.1. … This scheme approximately captures an important property of natural images, namely, that object identity is invariant to changes in the intensity and color of the illumination."

Notes: How overfitting is reduced:

First form: data augmentation. Fixed-size patches are randomly cropped from each input image and horizontally flipped, enlarging the training set and reducing overfitting. The training set grows by a factor of 2048: (256-224)^2 * 2 = 2048.

At test time, ten patches are extracted from each test image (the four corners and the center, plus their horizontal reflections) and fed through the network; the softmax outputs over the ten patches are averaged to give the final prediction.

Second form: altering the RGB channel intensities of the training images. PCA is performed on the RGB pixel values of the whole training set, and multiples of the principal components are added to each training image, with magnitudes proportional to the corresponding eigenvalues times a Gaussian random variable with mean 0 and standard deviation 0.1. (My own reading: this is somewhat like mean removal in signal processing, suppressing the influence of a DC component.)

This PCA-based perturbation approximately captures an important property of natural images, namely that object identity is invariant to changes in the intensity and color of the illumination, and it further reduces overfitting.
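A minimal sketch (not the book's code) of the crop-and-flip part of this augmentation in TensorFlow 1.x, assuming a 256x256x3 training image tensor:

import tensorflow as tf

image = tf.random_normal([256, 256, 3])          # stand-in for one training image
crop = tf.random_crop(image, [224, 224, 3])      # one of the (256-224)^2 = 1024 possible crops
crop = tf.image.random_flip_left_right(crop)     # x2 from the horizontal reflection -> 2048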


Original text:
"a very efficient version of model combination that only costs about a factor of two during training. The recently-introduced technique, called “dropout” [10], consists of setting to zero the output of each hidden neuron with probability 0.5. … every time an input is presented, the neural network samples a different architecture, but all these architectures share weights. This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons. It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons. … we use all the neurons but multiply their outputs by 0.5, which is a reasonable approximation to taking the geometric mean of the predictive distributions produced by the exponentially-many dropout networks."

Notes: Using dropout to reduce overfitting:

Dropout sets the output of each hidden neuron to zero with probability 0.5, so each hidden neuron takes part in forward and backward propagation only about half the time.

Dropout reduces complex co-adaptations between neurons: on every presentation a different subset of hidden neurons participates, so the effective architecture differs each time while all the architectures share weights. This forces the network to learn more robust features and improves the model's robustness.

At test time all neurons are used, but the outputs of the dropout layers are multiplied by 0.5, which is a reasonable approximation to the geometric mean of the predictive distributions of the exponentially many dropout networks.
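A minimal TensorFlow 1.x sketch of a dropout layer; note that tf.nn.dropout implements "inverted" dropout (it rescales the surviving activations by 1/keep_prob during training), so no multiplication by 0.5 is needed at test time, which differs slightly from the paper's description:

import tensorflow as tf

fc1 = tf.random_normal([128, 4096])       # stand-in for a fully connected layer's ReLU output
keep_prob = tf.placeholder(tf.float32)    # feed 0.5 during training, 1.0 at test time
fc1_drop = tf.nn.dropout(fc1, keep_prob)  # zeroes each unit with probability 1 - keep_prob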

5. Training details


Original text:
"We found that this small amount of weight decay was important for the model to learn. In other words, weight decay here is not merely a regularizer: it reduces the model’s training error. … We initialized the neuron biases in the second, fourth, and fifth convolutional layers, as well as in the fully-connected hidden layers, with the constant 1. This initialization accelerates the early stages of learning by providing the ReLUs with positive inputs. We initialized the neuron biases in the remaining layers with the constant 0. … The heuristic which we followed was to divide the learning rate by 10 when the validation error rate stopped improving with the current learning rate"

Notes: Details of the training procedure:

A small amount of weight decay is important for learning; here it is not merely a regularizer, it actually reduces the training error.

The biases of the second, fourth, and fifth convolutional layers and of the fully connected hidden layers are initialized to 1, which provides the ReLUs with positive inputs and speeds up the early stages of learning; the biases of the remaining layers are initialized to 0.

The learning rate starts at 0.01 and is divided by 10 whenever the validation error stops improving at the current rate; it was reduced three times over the course of training.
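The paper trains with SGD on batches of 128, momentum 0.9, and weight decay 0.0005; a minimal sketch of that update in TensorFlow 1.x, where `data_loss` and `weights` are hypothetical stand-ins for the real cross-entropy loss and the network's weight variables:

import tensorflow as tf

weights = [tf.Variable(tf.truncated_normal([11, 11, 3, 96], stddev=0.01))]  # dummy weights
data_loss = tf.nn.l2_loss(weights[0])        # stand-in for the real softmax cross-entropy loss
weight_decay = 0.0005 * tf.add_n([tf.nn.l2_loss(w) for w in weights])
total_loss = data_loss + weight_decay

learning_rate = tf.placeholder(tf.float32)   # 0.01 initially; divide by 10 when the
                                             # validation error stops improving
train_op = tf.train.MomentumOptimizer(learning_rate, momentum=0.9).minimize(total_loss)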


6. Results


Original text:
"The remaining columns show the six training images that produce feature vectors in the last hidden layer with the smallest Euclidean distance from the feature vector for the test image. … If two images produce feature activation vectors with a small Euclidean separation, we can say that the higher levels of the neural network consider them to be similar."

Notes: Analysis of the results: the right-hand side of the figure shows, for each test image, the training images whose last-hidden-layer feature vectors are closest in Euclidean distance to the test image's, and these retrieved images are indeed very similar to the query.

We can conclude that if the Euclidean distance between two feature activation vectors is small, the higher layers of the network consider the two images similar.
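A minimal NumPy sketch of that retrieval idea, assuming `features` holds the last-hidden-layer (4096-d) vectors of the training images and `query` the test image's vector:

import numpy as np

features = np.random.randn(1000, 4096)   # hypothetical training-set feature vectors
query = np.random.randn(4096)            # hypothetical test-image feature vector

dists = np.linalg.norm(features - query, axis=1)   # Euclidean distance to every training image
nearest = np.argsort(dists)[:6]                    # indices of the six most similar images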

7. Discussion


References (omitted…)


Implementing AlexNet in TensorFlow

The official TensorFlow AlexNet implementation

First, here is TensorFlow's official AlexNet implementation. Because the ImageNet dataset is so large, the official code only measures the forward and backward pass times rather than actually training the model. Afterwards, I tested AlexNet's performance on MNIST and CIFAR-10.

Implementation code

The source file is:

    models.tutorials.image.alexnet.alexnet_benchmark.py

The full code is as follows:

# coding:utf8
# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================



"""Timing benchmark for AlexNet inference.

To run, use:
  bazel run -c opt --config=cuda \
      models/tutorials/image/alexnet:alexnet_benchmark

Across 100 steps on batch size = 128.

Forward pass:
Run on Tesla K40c: 145 +/- 1.5 ms / batch
Run on Titan X:     70 +/- 0.1 ms / batch

Forward-backward pass:
Run on Tesla K40c: 480 +/- 48 ms / batch
Run on Titan X:    244 +/- 30 ms / batch
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
from datetime import datetime
import math
import sys
import time

from six.moves import xrange  # pylint: disable=redefined-builtin
import tensorflow as tf

FLAGS = None


def print_activations(t):
  '''
    Print a Tensor's name and shape.
  :param t:
  :return:
  '''
  print(t.op.name, ' ', t.get_shape().as_list())



def inference(images):
  '''
   Build the five convolutional layers of AlexNet (the FC layers are fast to compute and are not benchmarked here).
  :param images: input image Tensor
  :return: the final pool5 layer and the list of parameters
  '''
  parameters = []

  # conv1
  # Using name_scope names the Variables created inside the scope as conv1/xxx,
  # which makes it easy to tell the parameters of different conv layers apart.
  # 64 kernels of size 11*11*3, stride 4, weights initialized from a truncated
  # normal distribution (stddev 0.1).
  with tf.name_scope('conv1') as scope:
    kernel = tf.Variable(tf.truncated_normal([11, 11, 3, 64], dtype=tf.float32,
                                             stddev=1e-1), name='weights')
    conv = tf.nn.conv2d(images, kernel, [1, 4, 4, 1], padding='SAME')
    biases = tf.Variable(tf.constant(0.0, shape=[64], dtype=tf.float32),
                         trainable=True, name='biases')
    bias = tf.nn.bias_add(conv, biases)
    conv1 = tf.nn.relu(bias, name=scope)
    print_activations(conv1)
    parameters += [kernel, biases]

  # lrn1
  # local response normalization after conv1, using the paper's parameters
  # (n=5 -> depth_radius=2, k=2 -> bias=2.0, alpha=1e-4, beta=0.75)
  with tf.name_scope('lrn1') as scope:
    lrn1 = tf.nn.lrn(conv1, alpha=1e-4, beta=0.75, depth_radius=2, bias=2.0)

  # pool1
  # 3*3 pooling kernel with stride 2
  pool1 = tf.nn.max_pool(lrn1, ksize=[1, 3, 3, 1], strides=[1, 2, 2, 1], padding='VALID')
  print_activations(pool1)

  # conv2
  # 192 kernels of size 5*5*64, stride 1, weights initialized from a truncated normal distribution (stddev 0.1)
  with tf.name_scope('conv2') as scope:
    kernel = tf.Variable(tf.truncated_normal([5, 5, 64, 192], dtype=tf.float32,
                                             stddev=1e-1), name='weights')
    conv = tf.nn.conv2d(pool1, kernel, [1, 1, 1, 1], padding='SAME')
    biases = tf.Variable(tf.constant(0.0, shape=[192], dtype=tf.float32),
                         trainable=True, name='biases')
    bias = tf.nn.bias_add(conv, biases)
    conv2 = tf.nn.relu(bias, name=scope)
    parameters += [kernel, biases]
  print_activations(conv2)

  # lrn2
  with tf.name_scope('lrn2') as scope:
    lrn2 = tf.nn.lrn(conv2, alpha=1e-4, beta=0.75, depth_radius=2, bias=2.0)

  # pool2
  pool2 = tf.nn.max_pool(lrn2, ksize=[1, 3, 3, 1], strides=[1, 2, 2, 1], padding='VALID')
  print_activations(pool2)

  # conv3
  # 384 kernels of size 3*3*192, stride 1, weights initialized from a truncated normal distribution (stddev 0.1)
  with tf.name_scope('conv3') as scope:
    kernel = tf.Variable(tf.truncated_normal([3, 3, 192, 384],
                                             dtype=tf.float32,
                                             stddev=1e-1), name='weights')
    conv = tf.nn.conv2d(pool2, kernel, [1, 1, 1, 1], padding='SAME')
    biases = tf.Variable(tf.constant(0.0, shape=[384], dtype=tf.float32),
                         trainable=True, name='biases')
    bias = tf.nn.bias_add(conv, biases)
    conv3 = tf.nn.relu(bias, name=scope)
    parameters += [kernel, biases]
    print_activations(conv3)

  # conv4
  # 256 kernels of size 3*3*384, stride 1, weights initialized from a truncated normal distribution (stddev 0.1)
  with tf.name_scope('conv4') as scope:
    kernel = tf.Variable(tf.truncated_normal([3, 3, 384, 256],
                                             dtype=tf.float32,
                                             stddev=1e-1), name='weights')
    conv = tf.nn.conv2d(conv3, kernel, [1, 1, 1, 1], padding='SAME')
    biases = tf.Variable(tf.constant(0.0, shape=[256], dtype=tf.float32),
                         trainable=True, name='biases')
    bias = tf.nn.bias_add(conv, biases)
    conv4 = tf.nn.relu(bias, name=scope)
    parameters += [kernel, biases]
    print_activations(conv4)

  # conv5
  # 256 kernels of size 3*3*256, stride 1, weights initialized from a truncated normal distribution (stddev 0.1)
  with tf.name_scope('conv5') as scope:
    kernel = tf.Variable(tf.truncated_normal([3, 3, 256, 256],
                                             dtype=tf.float32,
                                             stddev=1e-1), name='weights')
    conv = tf.nn.conv2d(conv4, kernel, [1, 1, 1, 1], padding='SAME')
    biases = tf.Variable(tf.constant(0.0, shape=[256], dtype=tf.float32),
                         trainable=True, name='biases')
    bias = tf.nn.bias_add(conv, biases)
    conv5 = tf.nn.relu(bias, name=scope)
    parameters += [kernel, biases]
    print_activations(conv5)

  # pool5
  pool5 = tf.nn.max_pool(conv5, ksize=[1, 3, 3, 1], strides=[1, 2, 2, 1], padding='VALID')
  print_activations(pool5)

  return pool5, parameters


def time_tensorflow_run(session, target, info_string):
  '''
    Measure the per-batch computation time of AlexNet.
  :param session:
  :param target:
  :param info_string:
  :return:
  '''
  num_steps_burn_in = 10  # warm-up steps, skipped to avoid memory-loading / cache effects
  total_duration = 0.0      # total elapsed time
  total_duration_squared = 0.0  # used to compute the variance

  for i in xrange(FLAGS.num_batches + num_steps_burn_in):
    start_time = time.time()
    _ = session.run(target)
    duration = time.time() - start_time
    if i >= num_steps_burn_in:
      if not i % 10:
        print ('%s: step %d, duration = %.3f' %
               (datetime.now(), i - num_steps_burn_in, duration))
      total_duration += duration
      total_duration_squared += duration * duration

  mn = total_duration / FLAGS.num_batches
  vr = total_duration_squared / FLAGS.num_batches - mn * mn
  sd = math.sqrt(vr)
  print ('%s: %s across %d steps, %.3f +/- %.3f sec / batch' %
         (datetime.now(), info_string, FLAGS.num_batches, mn, sd))



def run_benchmark():
  '''
    Generate a batch of random dummy images and benchmark the forward and forward-backward passes.
  :return:
  '''
  with tf.Graph().as_default():
    # Generate some dummy images.
    image_size = 224
    # Note that our padding definition is slightly different from the cuda-convnet.
    # In order to force the model to start with the same activations sizes,
    # we add 3 to the image_size and employ VALID padding above.
    images = tf.Variable(tf.random_normal([FLAGS.batch_size,
                                           image_size,
                                           image_size, 3],
                                          dtype=tf.float32,
                                          stddev=1e-1))

    # Build a Graph that computes the logits predictions from the
    # inference model.
    pool5, parameters = inference(images)

    # Build an initialization operation.
    init = tf.global_variables_initializer()

    # Start running operations on the Graph.
    config = tf.ConfigProto()
    config.gpu_options.allocator_type = 'BFC'
    sess = tf.Session(config=config)
    sess.run(init)

    # Run the forward benchmark.
    time_tensorflow_run(sess, pool5, "Forward")

    # Add a simple objective so we can calculate the backward pass.
    objective = tf.nn.l2_loss(pool5)
    # Compute the gradient with respect to all the parameters.
    grad = tf.gradients(objective, parameters)  # gradients of the objective with respect to all parameters
    # Run the backward benchmark.
    time_tensorflow_run(sess, grad, "Forward-backward")


def main(_):
  run_benchmark()


if __name__ == '__main__':
  parser = argparse.ArgumentParser()
  parser.add_argument(
      '--batch_size',
      type=int,
      default=128,
      help='Batch size.'
  )
  parser.add_argument(
      '--num_batches',
      type=int,
      default=100,
      help='Number of batches to run.'
  )
  FLAGS, unparsed = parser.parse_known_args()
  tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)

    