
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

We present a class of efficient models called MobileNets for mobile and embedded vision applications.

MobileNets are based on a streamlined architecture that uses depthwise separable convolutions to build light weight deep neural networks.

We introduce two simple global hyperparameters that efficiently trade off between latency and accuracy. These hyperparameters allow the model builder to choose the right sized model for their application based on the constraints of the problem. We present extensive experiments on resource and accuracy tradeoffs and show strong performance compared to other popular models on ImageNet classification. We then demonstrate the effectiveness of MobileNets across a wide range of applications and use cases including object detection, fine-grained classification, face attributes and large scale geo-localization.


1. Introduction
Convolutional neural networks have become ubiquitous in computer vision ever since AlexNet [19] popularized deep convolutional neural networks by winning the ImageNet Challenge: ILSVRC 2012 [24]. The general trend has been to make deeper and more complicated networks in order to achieve higher accuracy [27, 31, 29, 8]. However, these advances to improve accuracy are not necessarily making networks more efficient with respect to size and speed. In many real world applications such as robotics, self-driving car and augmented reality, the recognition tasks need to be carried out in a timely fashion on a computationally limited platform.

This paper describes an efficient network architecture and a set of two hyper-parameters in order to build very small, low latency models that can be easily matched to the design requirements for mobile and embedded vision applications. Section 2 reviews prior work in building small models. Section 3 describes the MobileNet architecture and two hyper-parameters, width multiplier and resolution multiplier, to define smaller and more efficient MobileNets. Section 4 describes experiments on ImageNet as well as a variety of different applications and use cases. Section 5 closes with a summary and conclusion.
2. Prior Work
There has been rising interest in building small and efficient neural networks in the recent literature, e.g. [16, 34, 12, 36, 22]. Many different approaches can be generally categorized into either compressing pretrained networks or training small networks directly. This paper proposes a class of network architectures that allows a model developer to specifically choose a small network that matches the resource restrictions (latency, size) for their application. MobileNets primarily focus on optimizing for latency but also yield small networks. Many papers on small networks focus only on size but do not consider speed.
MobileNets are built primarily from depthwise separable convolutions initially introduced in [26] and subsequently used in Inception models [13] to reduce the computation in the first few layers. Flattened networks [16] build a network out of fully factorized convolutions and showed the potential of extremely factorized networks. Independent of this current paper, Factorized Networks [34] introduces a similar factorized convolution as well as the use of topological connections. Subsequently, the Xception network [3] demonstrated how to scale up depthwise separable filters to outperform Inception V3 networks. Another small network is Squeezenet [12] which uses a bottleneck approach to design a very small network. Other reduced computation networks include structured transform networks [28] and deep fried convnets [37].

A different approach for obtaining small networks is shrinking, factorizing or compressing pretrained networks. Compression based on product quantization [36], hashing [2], and pruning, vector quantization and Huffman coding [5] have been proposed in the literature. Additionally various factorizations have been proposed to speed up pretrained networks [14, 20]. Another method for training small networks is distillation [9] which uses a larger network to teach a smaller network. It is complementary to our approach and is covered in some of our use cases in section 4. Another emerging approach is low bit networks [4, 22, 11].


3. MobileNet Architecture
In this section we first describe the core layers that MobileNet is built on, which are depthwise separable filters. We then describe the MobileNet network structure and conclude with descriptions of the two model shrinking hyper-parameters, width multiplier and resolution multiplier.
3.1. Depthwise Separable Convolution
The MobileNet model is based on depthwise separable convolutions, which is a form of factorized convolutions that factorize a standard convolution into a depthwise convolution and a 1 × 1 convolution called a pointwise convolution. For MobileNets the depthwise convolution applies a single filter to each input channel. The pointwise convolution then applies a 1 × 1 convolution to combine the outputs of the depthwise convolution. A standard convolution both filters and combines inputs into a new set of outputs in one step. The depthwise separable convolution splits this into two layers, a separate layer for filtering and a separate layer for combining. This factorization has the effect of drastically reducing computation and model size. Figure 2 shows how a standard convolution 2(a) is factorized into a depthwise convolution 2(b) and a 1 × 1 pointwise convolution 2(c).
A standard convolutional layer takes as input a D_F × D_F × M feature map F and produces a D_F × D_F × N feature map G, where D_F is the spatial width and height of a square input feature map, M is the number of input channels (input depth), D_G is the spatial width and height of a square output feature map and N is the number of output channels (output depth).


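As a concrete illustration of the savings from this factorization (a sketch added here, not taken from the paper), the snippet below computes the Mult-Adds of a standard convolution, D_K · D_K · M · N · D_F · D_F (where D_K is the spatial dimension of the kernel, per the paper's cost analysis), against the depthwise separable form, D_K · D_K · M · D_F · D_F + M · N · D_F · D_F; the example layer shape is an assumption chosen for illustration.

```python
def standard_conv_cost(dk, df, m, n):
    # D_K * D_K * M * N * D_F * D_F multiply-accumulates
    return dk * dk * m * n * df * df

def depthwise_separable_cost(dk, df, m, n):
    # depthwise step: D_K * D_K * M * D_F * D_F
    # pointwise step: M * N * D_F * D_F
    return dk * dk * m * df * df + m * n * df * df

# Assumed example layer: 3x3 kernel, 14x14 feature map, 512 -> 512 channels.
dk, df, m, n = 3, 14, 512, 512
std = standard_conv_cost(dk, df, m, n)
sep = depthwise_separable_cost(dk, df, m, n)
print(f"standard:  {std:,} Mult-Adds")
print(f"separable: {sep:,} Mult-Adds ({std / sep:.1f}x fewer)")
```

With a 3 × 3 kernel the reduction factor works out to roughly 1/N + 1/D_K², i.e. about 8 to 9 times less computation, matching the paper's claim for 3 × 3 depthwise separable convolutions.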
3.2. Network Structure and Training
The MobileNet structure is built on depthwise separable convolutions as mentioned in the previous section except for the first layer which is a full convolution. By defining the network in such simple terms we are able to easily explore network topologies to find a good network. The MobileNet architecture is defined in Table 1. All layers are followed by a batchnorm [13] and ReLU nonlinearity with the exception of the final fully connected layer which has no nonlinearity and feeds into a softmax layer for classification. 

Figure 3 contrasts a layer with regular convolutions, batchnorm and ReLU nonlinearity to the factorized layer with depthwise convolution, 1 × 1 pointwise convolution as well as batchnorm and ReLU after each convolutional layer.  Down sampling is handled with strided convolution in the depthwise convolutions as well as in the first layer. 

A final average pooling reduces the spatial resolution to 1 before the fully connected layer. 

Counting depthwise and pointwise convolutions as separate layers, MobileNet has 28 layers.
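The per-layer pattern from Figure 3 (depthwise convolution, then 1 × 1 pointwise convolution, each followed by batchnorm and ReLU) can be sketched in tf.keras roughly as follows; this is a minimal illustration under assumed arguments, not the released implementation.

```python
import tensorflow as tf

def depthwise_separable_block(x, pointwise_filters, stride=1):
    """One MobileNet block: 3x3 depthwise conv + 1x1 pointwise conv,
    each followed by batchnorm and ReLU (cf. Figure 3). Downsampling,
    when needed, uses a stride in the depthwise convolution."""
    x = tf.keras.layers.DepthwiseConv2D(3, strides=stride, padding="same",
                                        use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    x = tf.keras.layers.Conv2D(pointwise_filters, 1, padding="same",
                               use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    return x
```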

It is not enough to simply define networks in terms of a small number of Mult-Adds. 

It is also important to make sure these operations can be efficiently implemented.


For instance, unstructured sparse matrix operations are not typically faster than dense matrix operations until a very high level of sparsity. Our model structure puts nearly all of the computation into dense 1 × 1 convolutions. This can be implemented with highly optimized general matrix multiply (GEMM) functions. Often convolutions are implemented by a GEMM but require an initial reordering in memory called im2col in order to map it to a GEMM. For instance, this approach is used in the popular Caffe package [15]. 1 × 1 convolutions do not require this reordering in memory and can be implemented directly with GEMM, which is one of the most optimized numerical linear algebra algorithms. MobileNet spends 95% of its computation time in 1 × 1 convolutions which also have 75% of the parameters as can be seen in Table 2. Nearly all of the additional parameters are in the fully connected layer.
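To see why a 1 × 1 convolution maps onto a GEMM with no im2col step, consider this numpy sketch (an added illustration with assumed tensor shapes): the spatial positions of the feature map simply become the rows of the matrix product.

```python
import numpy as np

# Assumed shapes for illustration: 14x14 feature map, 512 in/out channels.
H, W, M, N = 14, 14, 512, 512
feature_map = np.random.randn(H, W, M)
pointwise_kernel = np.random.randn(M, N)  # a 1x1 conv is just an M x N matrix

# No im2col reordering needed: flatten spatial positions into rows
# and issue a single GEMM.
rows = feature_map.reshape(H * W, M)   # (H*W, M)
out = rows @ pointwise_kernel          # one GEMM -> (H*W, N)
out = out.reshape(H, W, N)             # back to a feature map
```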
MobileNet models were trained in TensorFlow [1] using RMSprop [33] with asynchronous gradient descent similar to Inception V3 [31]. However, contrary to training large models we use less regularization and data augmentation techniques because small models have less trouble with overfitting. When training MobileNets we do not use side heads or label smoothing, and additionally reduce the amount of image distortions by limiting the size of small crops that are used in large Inception training [31]. Additionally, we found that it was important to put very little or no weight decay (l2 regularization) on the depthwise filters since there are so few parameters in them. For the ImageNet benchmarks in the next section all models were trained with the same training parameters regardless of the size of the model.

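That regularization choice can be expressed in tf.keras roughly as below (a hedged sketch; the decay and learning-rate values are assumptions for illustration, since the paper does not publish exact settings): l2 weight decay is attached to the ordinary convolutions while the depthwise kernels are left unregularized.

```python
import tensorflow as tf

l2 = tf.keras.regularizers.l2(4e-5)  # assumed decay value

# Depthwise filters: little or no weight decay, since they hold few parameters.
depthwise = tf.keras.layers.DepthwiseConv2D(
    3, padding="same", use_bias=False, depthwise_regularizer=None)

# Pointwise 1x1 convolution: normal l2 weight decay.
pointwise = tf.keras.layers.Conv2D(
    512, 1, padding="same", use_bias=False, kernel_regularizer=l2)

optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.045)  # assumed value
```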
3.3. Width Multiplier: Thinner Models
Although the base MobileNet architecture is already small and low latency, many times a specific use case or application may require the model to be smaller and faster. In order to construct these smaller and less computationally expensive models we introduce a very simple parameter α called width multiplier. The role of the width multiplier α is to thin a network uniformly at each layer. For a given layer and width multiplier α, the number of input channels M becomes αM and the number of output channels N becomes αN.
The computational cost of a depthwise separable convolution with width multiplier α is:

D_K · D_K · αM · D_F · D_F + αM · αN · D_F · D_F    (6)

where α ∈ (0, 1] with typical settings of 1, 0.75, 0.5 and 0.25. α = 1 is the baseline MobileNet and α < 1 are reduced MobileNets. Width multiplier has the effect of reducing computational cost and the number of parameters quadratically by roughly α². Width multiplier can be applied to any model structure to define a new smaller model with a reasonable accuracy, latency and size trade off. It is used to define a new reduced structure that needs to be trained from scratch.
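A small sketch of equation (6), showing the roughly quadratic cost reduction as α shrinks (the layer shape is assumed for illustration):

```python
def separable_cost(dk, df, m, n, alpha=1.0):
    """Mult-Adds of a depthwise separable layer thinned by width multiplier
    alpha (equation 6): both channel counts scale by alpha, so the cost
    falls roughly quadratically in alpha."""
    m, n = alpha * m, alpha * n
    return dk * dk * m * df * df + m * n * df * df

# Assumed layer: 3x3 kernel, 14x14 feature map, 512 -> 512 channels.
baseline = separable_cost(3, 14, 512, 512)
for alpha in (1.0, 0.75, 0.5, 0.25):
    cost = separable_cost(3, 14, 512, 512, alpha)
    print(f"alpha={alpha}: {cost:,.0f} Mult-Adds ({cost / baseline:.2f}x)")
```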


3.4. Resolution Multiplier: Reduced Representation
The second hyper-parameter to reduce the computational cost of a neural network is a resolution multiplier ρ. We apply this to the input image and the internal representation of every layer is subsequently reduced by the same multiplier. In practice we implicitly set ρ by setting the input resolution.
We can now express the computational cost for the core layers of our network as depthwise separable convolutions with width multiplier α and resolution multiplier ρ:
D_K · D_K · αM · ρD_F · ρD_F + αM · αN · ρD_F · ρD_F    (7)

where ρ ∈ (0, 1], which is typically set implicitly so that the input resolution of the network is 224, 192, 160 or 128. ρ = 1 is the baseline MobileNet and ρ < 1 are reduced computation MobileNets. Resolution multiplier has the effect of reducing computational cost by ρ².
As an example we can look at a typical layer in MobileNet and see how depthwise separable convolutions, width multiplier and resolution multiplier reduce the cost and parameters. Table 3 shows the computation and number of parameters for a layer as architecture shrinking methods are sequentially applied to the layer. The first row shows the Mult-Adds and parameters for a full convolutional layer with an input feature map of size 14 × 14 × 512 with a kernel K of size 3 × 3 × 512 × 512. We will look in detail in the next section at the trade offs between resources and accuracy.
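That sequential shrinking can be reproduced with a short sketch (an added illustration; the shapes come from the text above, and ρ = 0.714, roughly 160/224, is an assumed example value):

```python
def layer_cost(dk, df, m, n, alpha=1.0, rho=1.0, separable=True):
    # Mult-Adds for one layer under width multiplier alpha and resolution
    # multiplier rho (equations 6 and 7).
    m, n, df = alpha * m, alpha * n, rho * df
    if separable:
        return dk * dk * m * df * df + m * n * df * df
    return dk * dk * m * n * df * df  # standard convolution

# 14x14x512 input, 3x3x512x512 kernel, shrinking steps applied in turn.
print(f"full conv:                 {layer_cost(3, 14, 512, 512, separable=False):,.0f}")
print(f"depthwise separable:       {layer_cost(3, 14, 512, 512):,.0f}")
print(f"alpha = 0.75:              {layer_cost(3, 14, 512, 512, alpha=0.75):,.0f}")
print(f"alpha = 0.75, rho = 0.714: {layer_cost(3, 14, 512, 512, alpha=0.75, rho=0.714):,.0f}")
```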


4. Experiments
In this section we first investigate the effects of depthwise convolutions as well as the choice of shrinking by reducing the width of the network rather than the number of layers.

We then show the trade offs of reducing the network based on the two hyper-parameters: width multiplier and resolution multiplier, and compare results to a number of popular models.

We then investigate MobileNets applied to a number of different applications.

4.1. Model Choices
First we show results for MobileNet with depthwise separable convolutions compared to a model built with full convolutions.

In Table 4 we see that using depthwise separable convolutions compared to full convolutions only reduces accuracy by 1% on ImageNet while saving tremendously on mult-adds and parameters.

We next show results comparing thinner models with width multiplier to shallower models using fewer layers.

To make MobileNet shallower, the 5 layers of separable filters with feature size 14 × 14 × 512 in Table 1 are removed.

Table 5 shows that at similar computation and number of parameters, making MobileNets thinner is 3% better than making them shallower.

4.2. Model Shrinking Hyperparameters

Table 6 shows the accuracy, computation and size trade offs of shrinking the MobileNet architecture with the width multiplier α.

Accuracy drops off smoothly until the architecture is made too small at α = 0.25.

Table 7 shows the accuracy, computation and size trade offs for different resolution multipliers by training MobileNets with reduced input resolutions. Accuracy drops off smoothly across resolution.

Figure 4 shows the trade off between ImageNet accuracy and computation for the 16 models made from the cross product of width multiplier α ∈ {1, 0.75, 0.5, 0.25} and resolutions {224, 192, 160, 128}.

Results are log linear with a jump when models get very small at α = 0.25.

Figure 5 shows the trade off between ImageNet Accuracy and number of parameters for the 16 models made from the cross product of width multiplier α ∈ {1, 0.75, 0.5, 0.25} and resolutions {224, 192, 160, 128}.

Table 8 compares full MobileNet to the original GoogleNet [30] and VGG16 [27]. MobileNet is nearly as accurate as VGG16 while being 32 times smaller and 27 times less compute intensive.

It is more accurate than GoogleNet while being smaller and more than 2.5 times less computation.

Table 9 compares a reduced MobileNet with width multiplier α = 0.5 and reduced resolution 160 × 160. Reduced MobileNet is 4% better than AlexNet [19] while being 45× smaller and 9.4× less compute than AlexNet.

It is also 4% better than Squeezenet [12] at about the same size and 22× less computation.

4.3. Fine Grained Recognition

We train MobileNet for fine grained recognition on the Stanford Dogs dataset [17].

We extend the approach of [18] and collect an even larger but noisy training set than [18] from the web.

We use the noisy web data to pretrain a fine grained dog recognition model and then fine tune the model on the Stanford Dogs training set.

Results on Stanford Dogs test set are in Table 10.

MobileNet can almost achieve the state of the art results from [18] at greatly reduced computation and size.

4.4. Large Scale Geolocalization

PlaNet [35] casts the task of determining where on earth a photo was taken as a classification problem.

The approach divides the earth into a grid of geographic cells that serve as the target classes and trains a convolutional neural network on millions of geo-tagged photos.

PlaNet has been shown to successfully localize a large variety of photos and to outperform Im2GPS [6, 7] that addresses the same task.

We re-train PlaNet using the MobileNet architecture on the same data.

While the full PlaNet model based on the Inception V3 architecture [31] has 52 million parameters and 5.74 billion mult-adds, the MobileNet model has only 13 million parameters (the usual 3 million for the body and 10 million for the final layer) and 0.58 billion mult-adds.

As shown in Tab. 11, the MobileNet version delivers only slightly decreased performance compared to PlaNet despite being much more compact. Moreover, it still outperforms Im2GPS by a large margin.

4.5. Face Attributes

Another use-case for MobileNet is compressing large systems with unknown or esoteric training procedures.

In a face attribute classification task, we demonstrate a synergistic relationship between MobileNet and distillation [9], a knowledge transfer technique for deep networks.

We seek to reduce a large face attribute classifier with 75 million parameters and 1600 million Mult-Adds.

The classifier is trained on a multi-attribute dataset similar to YFCC100M [32].

We distill a face attribute classifier using the MobileNet architecture.

Distillation [9] works by training the classifier to emulate the outputs of a larger model instead of the ground-truth labels, hence enabling training from large (and potentially infinite) unlabeled datasets.

Marrying the scalability of distillation training and the parsimonious parameterization of MobileNet, the end system not only requires no regularization (e.g. weight-decay and early-stopping), but also demonstrates enhanced performances.

It is evident from Tab. 12 that the MobileNet-based classifier is resilient to aggressive model shrinking: it achieves a similar mean average precision across attributes (mean AP) as the in-house baseline while consuming only 1% of the Mult-Adds.

4.6. Object Detection

MobileNet can also be deployed as an effective base network in modern object detection systems.

We report results for MobileNet trained for object detection on COCO data based on the recent work that won the 2016 COCO challenge [10].

In table 13, MobileNet is compared to VGG and Inception V2 [13] under both Faster-RCNN [23] and SSD [21] framework.

In our experiments, SSD is evaluated with 300 input resolution (SSD 300) and Faster-RCNN is compared with both 300 and 600 input resolution (Faster- RCNN 300, Faster-RCNN 600).

The Faster-RCNN model evaluates 300 RPN proposal boxes per image.

The models are trained on COCO train+val excluding 8k minival images and evaluated on minival.

For both frameworks, MobileNet achieves comparable results to other networks with only a fraction of computational complexity and model size.

4.7. Face Embeddings

The FaceNet model is a state of the art face recognition model [25].

It builds face embeddings based on the triplet loss.

To build a mobile FaceNet model we use distillation to train by minimizing the squared differences of the output of FaceNet and MobileNet on the training data.
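A minimal sketch of that distillation objective (an added illustration with assumed embedding tensors, not the actual FaceNet training code): the MobileNet student is trained to minimize the squared difference between its outputs and the teacher's on the training data.

```python
import tensorflow as tf

def distillation_loss(teacher_embeddings, student_embeddings):
    """Squared-difference distillation: the student mimics the teacher's
    outputs, so no ground-truth labels are required."""
    return tf.reduce_mean(tf.square(teacher_embeddings - student_embeddings))

# Illustrative usage with an assumed batch of 128-d embeddings.
teacher = tf.random.normal([32, 128])  # would come from the FaceNet teacher
student = tf.random.normal([32, 128])  # would come from the MobileNet student
loss = distillation_loss(teacher, student)
```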

Results for very small MobileNet models can be found in table 14.

5. Conclusion

We proposed a new model architecture called MobileNets based on depthwise separable convolutions.

We investigated some of the important design decisions leading to an efficient model.

We then demonstrated how to build smaller and faster MobileNets using width multiplier and resolution multiplier by trading off a reasonable amount of accuracy to reduce size and latency.

We then compared different MobileNets to popular models demonstrating superior size, speed and accuracy characteristics.

We concluded by demonstrating MobileNet’s effectiveness when applied to a wide variety of tasks.

As a next step to help adoption and exploration of MobileNets, we plan on releasing models in TensorFlow.