
Squeeze-and-Excitation Networks論文翻譯——中英文對照

文章作者:Tyan 部落格:noahsnail.com  |  CSDN  |  簡書

宣告:作者翻譯論文僅為學習,如有侵權請聯絡作者刪除博文,謝謝!

Squeeze-and-Excitation Networks

Abstract

Convolutional neural networks are built upon the convolution operation, which extracts informative features by fusing spatial and channel-wise information together within local receptive fields. In order to boost the representational power of a network, much existing work has shown the benefits of enhancing spatial encoding. In this work, we focus on channels and propose a novel architectural unit, which we term the “Squeeze-and-Excitation” (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels. We demonstrate that by stacking these blocks together, we can construct SENet architectures that generalise extremely well across challenging datasets. Crucially, we find that SE blocks produce significant performance improvements for existing state-of-the-art deep architectures at slight computational cost. SENets formed the foundation of our ILSVRC 2017 classification submission which won first place and significantly reduced the top-5 error to 2.251%, achieving a ∼25% relative improvement over the winning entry of 2016.

摘要

卷積神經網路建立在卷積運算的基礎上,通過融合區域性感受野內的空間資訊和通道資訊來提取資訊特徵。為了提高網路的表示能力,許多現有的工作已經顯示出增強空間編碼的好處。在這項工作中,我們專注於通道,並提出了一種新穎的架構單元,我們稱之為“Squeeze-and-Excitation”(SE)塊,通過顯式地建模通道之間的相互依賴關係,自適應地重新校準通道式的特徵響應。我們證明,通過將這些塊堆疊在一起,可以構建在具有挑戰性的資料集上泛化得非常好的SENet架構。關鍵的是,我們發現SE塊以微小的計算成本為現有的最先進的深層架構帶來了顯著的效能改進。SENets是我們ILSVRC 2017分類提交的基礎,它贏得了第一名,並將top-5錯誤率顯著降低到2.251%,相對於2016年的獲勝成績取得了∼25%的相對改進。

1. Introduction

Convolutional neural networks (CNNs) have proven to be effective models for tackling a variety of visual tasks [19, 23, 29, 41]. For each convolutional layer, a set of filters is learned to express local spatial connectivity patterns along input channels. In other words, convolutional filters are expected to be informative combinations by fusing spatial and channel-wise information together, while restricted in local receptive fields. By stacking a series of convolutional layers interleaved with non-linearities and downsampling, CNNs are capable of capturing hierarchical patterns with global receptive fields as powerful image descriptions. Recent work has demonstrated that the performance of networks can be improved by explicitly embedding learning mechanisms that help capture spatial correlations without requiring additional supervision. One such approach was popularised by the Inception architectures [14, 39], which showed that the network can achieve competitive accuracy by embedding multi-scale processes in its modules. More recent work has sought to better model spatial dependence [1, 27] and incorporate spatial attention [17].

1. 引言

卷積神經網路(CNNs)已被證明是解決各種視覺任務的有效模型[19,23,29,41]。對於每個卷積層,沿著輸入通道學習一組濾波器來表達區域性空間連線模式。換句話說,卷積濾波器被期望在區域性感受野內通過融合空間資訊和通道資訊來形成有資訊量的組合。通過疊加一系列與非線性和下采樣交織的卷積層,CNN能夠捕獲具有全域性感受野的分層模式,作為強大的影象描述。最近的工作已經證明,網路的效能可以通過顯式地嵌入學習機制來改善,這種學習機制有助於捕捉空間相關性而不需要額外的監督。Inception架構[14,39]推廣了一種這樣的方法,表明網路可以通過在其模組中嵌入多尺度處理來取得有競爭力的準確度。最近的工作則尋求更好地建模空間依賴[1,27],並結合空間注意力[17]。

In contrast to these methods, we investigate a different aspect of architectural design, the channel relationship, by introducing a new architectural unit, which we term the “Squeeze-and-Excitation” (SE) block. Our goal is to improve the representational power of a network by explicitly modelling the interdependencies between the channels of its convolutional features. To achieve this, we propose a mechanism that allows the network to perform feature recalibration, through which it can learn to use global information to selectively emphasise informative features and suppress less useful ones.

與這些方法相反,通過引入新的架構單元,我們稱之為“Squeeze-and-Excitation” (SE)塊,我們研究了架構設計的一個不同方向——通道關係。我們的目標是通過顯式地建模卷積特徵通道之間的相互依賴性來提高網路的表示能力。為了達到這個目的,我們提出了一種機制,使網路能夠執行特徵重新校準,通過這種機制可以學習使用全域性資訊來選擇性地強調資訊特徵並抑制不太有用的特徵。

The basic structure of the SE building block is illustrated in Fig.1. For any given transformation $F_{tr}: X \rightarrow U$, $X \in \mathbb{R}^{W' \times H' \times C'}$, $U \in \mathbb{R}^{W \times H \times C}$ (e.g. a convolution or a set of convolutions), we can construct a corresponding SE block to perform feature recalibration as follows. The features $U$ are first passed through a squeeze operation, which aggregates the feature maps across spatial dimensions $W \times H$ to produce a channel descriptor. This descriptor embeds the global distribution of channel-wise feature responses, enabling information from the global receptive field of the network to be leveraged by its lower layers. This is followed by an excitation operation, in which sample-specific activations, learned for each channel by a self-gating mechanism based on channel dependence, govern the excitation of each channel. The feature maps $U$ are then reweighted to generate the output of the SE block which can then be fed directly into subsequent layers.


Figure 1. A Squeeze-and-Excitation block.

SE構建塊的基本結構如圖1所示。對於任何給定的變換$F_{tr}: X \rightarrow U$,$X \in \mathbb{R}^{W' \times H' \times C'}$,$U \in \mathbb{R}^{W \times H \times C}$(例如卷積或一組卷積),我們可以構造一個相應的SE塊來執行特徵重新校準,如下所示。特徵$U$首先通過squeeze操作,該操作跨越空間維度$W \times H$聚合特徵對映來產生通道描述符。這個描述符嵌入了通道特徵響應的全域性分佈,使來自網路全域性感受野的資訊能夠被其較低層利用。這之後是一個excitation操作,其中通過基於通道依賴性的自門機制為每個通道學習到的樣本特定的啟用,控制每個通道的激勵。然後特徵對映$U$被重新加權以生成SE塊的輸出,並可以將其直接輸入到隨後的層中。


圖1. Squeeze-and-Excitation塊
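The flow just described (squeeze, excitation, reweighting) is compact enough to express in a few lines. The following is a minimal PyTorch-style sketch of an SE block, not the authors' original implementation; the class name, the use of plain `Linear` layers and the default reduction ratio of 16 are assumptions based on the description above and Sec. 3.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Minimal sketch of a Squeeze-and-Excitation block (hypothetical, not the authors' code)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)  # dimensionality reduction
        self.fc2 = nn.Linear(channels // reduction, channels)  # dimensionality increase

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = u.shape
        z = u.mean(dim=(2, 3))                                # squeeze: global average pooling
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))  # excitation: self-gating
        return u * s.view(n, c, 1, 1)                         # reweight the feature maps U

# Usage: recalibrate a 64-channel feature map.
se = SEBlock(64)
y = se(torch.randn(2, 64, 56, 56))  # output shape: (2, 64, 56, 56)
```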

An SE network can be generated by simply stacking a collection of SE building blocks. SE blocks can also be used as a drop-in replacement for the original block at any depth in the architecture. However, while the template for the building block is generic, as we show in Sec. 6.3, the role it performs at different depths adapts to the needs of the network. In the early layers, it learns to excite informative features in a class agnostic manner, bolstering the quality of the shared lower level representations. In later layers, the SE block becomes increasingly specialised, and responds to different inputs in a highly class-specific manner. Consequently, the benefits of feature recalibration conducted by SE blocks can be accumulated through the entire network.

SE網路可以通過簡單地堆疊SE構建塊的集合來生成。SE塊也可以用作架構中任意深度的原始塊的直接替換。然而,雖然構建塊的模板是通用的,正如我們6.3節中展示的那樣,但它在不同深度的作用適應於網路的需求。在前面的層中,它學習以類不可知的方式激發資訊特徵,增強共享的較低層表示的質量。在後面的層中,SE塊越來越專業化,並以高度類特定的方式響應不同的輸入。因此,SE塊進行特徵重新校準的好處可以通過整個網路進行累積。

The development of new CNN architectures is a challenging engineering task, typically involving the selection of many new hyperparameters and layer configurations. By contrast, the design of the SE block outlined above is simple, and can be used directly with existing state-of-the-art architectures whose convolutional layers can be strengthened by direct replacement with their SE counterparts. Moreover, as shown in Sec. 4, SE blocks are computationally lightweight and impose only a slight increase in model complexity and computational burden. To support these claims, we develop several SENets, namely SE-ResNet, SE-Inception, SE-ResNeXt and SE-Inception-ResNet and provide an extensive evaluation of SENets on the ImageNet 2012 dataset [30]. Further, to demonstrate the general applicability of SE blocks, we also present results beyond ImageNet, indicating that the proposed approach is not restricted to a specific dataset or a task.

新CNN架構的開發是一項具有挑戰性的工程任務,通常涉及許多新的超引數和層配置的選擇。相比之下,上面概述的SE塊的設計是簡單的,並且可以直接與現有的最先進架構一起使用,其卷積層可以通過直接替換為對應的SE塊來加強。另外,如第4節所示,SE塊在計算上是輕量級的,僅略微增加模型複雜度和計算負擔。為了支援這些論斷,我們開發了一些SENets,即SE-ResNet、SE-Inception、SE-ResNeXt和SE-Inception-ResNet,並在ImageNet 2012資料集[30]上對SENets進行了廣泛的評估。此外,為了證明SE塊的一般適用性,我們還呈現了ImageNet之外的結果,表明所提出的方法不受限於特定的資料集或任務。

Using SENets, we won first place in the ILSVRC 2017 classification competition. Our top performing model ensemble achieves a 2.251% top-5 error on the test set. This represents a ∼25% relative improvement in comparison to the winning entry of the previous year (with a top-5 error of 2.991%). Our models and related materials have been made available to the research community.

使用SENets,我們贏得了ILSVRC 2017分類競賽的第一名。我們的表現最好的模型集合在測試集上達到了2.251%的top-5錯誤率。與前一年的獲勝者(2.991%的top-5錯誤率)相比,這表示∼25%的相對改進。我們的模型和相關材料已經提供給研究界。

2. Related Work

Deep architectures. A wide range of work has shown that restructuring the architecture of a convolutional neural network in a manner that eases the learning of deep features can yield substantial improvements in performance. VGGNets [35] and Inception models [39] demonstrated the benefits that could be attained with increased depth, significantly outperforming previous approaches on ILSVRC 2014. Batch normalization (BN) [14] improved gradient propagation through deep networks by inserting units to regulate layer inputs, stabilising the learning process and enabling further experimentation at greater depth. He et al. [9, 10] showed that it was effective to train deeper networks by restructuring the architecture to learn residual functions through the use of identity-based skip connections, which ease the flow of information across units. More recently, reformulations of the connections between network layers [5, 12] have been shown to further improve the learning and representational properties of deep networks.

2. 近期工作

深層架構。大量的工作已經表明,以易於學習深度特徵的方式重構卷積神經網路的架構可以大大提高效能。VGGNets[35]和Inception模型[39]證明了增加深度可以獲得的好處,在ILSVRC 2014上明顯超過了之前的方法。批標準化(BN)[14]通過插入單元來調節層輸入、穩定學習過程,改善了梯度在深度網路中的傳播,使得可以在更大的深度上進行進一步的實驗。He等人[9,10]表明,通過重構架構來學習殘差函式,並使用基於恆等對映的跳躍連線來簡化跨單元的資訊流動,可以有效地訓練更深的網路。最近,網路層間連線的重新表述[5,12]已被證明可以進一步改善深度網路的學習和表徵屬性。

An alternative line of research has explored ways to tune the functional form of the modular components of a network. Grouped convolutions can be used to increase cardinality (the size of the set of transformations) [13, 43] to learn richer representations. Multi-branch convolutions can be interpreted as a generalisation of this concept, enabling more flexible compositions of convolutional operators [14, 38, 39, 40]. Cross-channel correlations are typically mapped as new combinations of features, either independently of spatial structure [6, 18] or jointly by using standard convolutional filters [22] with 1×1 convolutions, while much of this work has concentrated on the objective of reducing model and computational complexity. This approach reflects an assumption that channel relationships can be formulated as a composition of instance-agnostic functions with local receptive fields. In contrast, we claim that providing the network with a mechanism to explicitly model dynamic, non-linear dependencies between channels using global information can ease the learning process, and significantly enhance the representational power of the network.

另一種研究方法探索了調整網路模組化元件功能形式的方法。可以用分組卷積來增加基數(變換集合的大小)[13,43]以學習更豐富的表示。多分支卷積可以解釋為這一概念的推廣,使得卷積運算元可以更靈活地組合[14,38,39,40]。跨通道相關性通常被對映為新的特徵組合,或者獨立於空間結構[6,18],或者通過使用帶有1×1卷積的標準卷積濾波器[22]來聯合對映,而這些工作大多集中在降低模型和計算複雜度的目標上。這種方法反映了一個假設,即通道關係可以被表述為具有區域性感受野的、與例項無關的函式的組合。相比之下,我們主張,為網路提供一種利用全域性資訊顯式建模通道之間動態、非線性依賴關係的機制,可以簡化學習過程,並顯著增強網路的表示能力。

Attention and gating mechanisms. Attention can be viewed, broadly, as a tool to bias the allocation of available processing resources towards the most informative components of an input signal. The development and understanding of such mechanisms has been a longstanding area of research in the neuroscience community [15, 16, 28] and has seen significant interest in recent years as a powerful addition to deep neural networks [20, 25]. Attention has been shown to improve performance across a range of tasks, from localisation and understanding in images [3, 17] to sequence-based models [2, 24]. It is typically implemented in combination with a gating function (e.g. a softmax or sigmoid) and sequential techniques [11, 37]. Recent work has shown its applicability to tasks such as image captioning [4, 44] and lip reading [7], in which it is exploited to efficiently aggregate multi-modal data. In these applications, it is typically used on top of one or more layers representing higher-level abstractions for adaptation between modalities. Highway networks [36] employ a gating mechanism to regulate the shortcut connection, enabling the learning of very deep architectures. Wang et al. [42] introduce a powerful trunk-and-mask attention mechanism using an hourglass module [27], inspired by its success in semantic segmentation. This high capacity unit is inserted into deep residual networks between intermediate stages. In contrast, our proposed SE-block is a lightweight gating mechanism, specialised to model channel-wise relationships in a computationally efficient manner and designed to enhance the representational power of modules throughout the network.

注意力和門機制。從廣義上講,可以將注意力視為一種工具,將可用處理資源的分配偏向於輸入訊號中資訊量最大的部分。這種機制的發展和理解一直是神經科學界的一個長期研究領域[15,16,28],並且近年來作為深度神經網路的強大補充引起了極大的興趣[20,25]。注意力已經被證明可以改善一系列任務的效能,從影象的定位和理解[3,17]到基於序列的模型[2,24]。它通常與門函式(例如softmax或sigmoid)和序列技術結合實現[11,37]。最近的研究表明,它適用於影象描述[4,44]和脣讀[7]等任務,其中利用它來有效地聚合多模態資料。在這些應用中,它通常用在表示較高層次抽象的一個或多個層之上,以用於模態之間的適應。Highway網路[36]採用門機制來調節快捷連線,使得可以學習非常深的架構。Wang等人[42]受其在語義分割中的成功啟發,引入了一個使用沙漏模組[27]的強大的trunk-and-mask注意力機制。這個高容量的單元被插入到深度殘差網路的中間階段之間。相比之下,我們提出的SE塊是一個輕量級的門機制,專門用於以計算高效的方式對通道關係進行建模,並設計用於增強整個網路中模組的表示能力。

3. Squeeze-and-Excitation Blocks

The Squeeze-and-Excitation block is a computational unit which can be constructed for any given transformation $F_{tr}: X \rightarrow U$, $X \in \mathbb{R}^{W' \times H' \times C'}$, $U \in \mathbb{R}^{W \times H \times C}$. For simplicity of exposition, in the notation that follows we take $F_{tr}$ to be a standard convolutional operator. Let $V = [v_1, v_2, \ldots, v_C]$ denote the learned set of filter kernels, where $v_c$ refers to the parameters of the $c$-th filter. We can then write the outputs of $F_{tr}$ as $U = [u_1, u_2, \ldots, u_C]$, where

$$u_c = v_c * X = \sum_{s=1}^{C'} v_c^s * x^s.$$

Here $*$ denotes convolution, $v_c = [v_c^1, v_c^2, \ldots, v_c^{C'}]$ and $X = [x^1, x^2, \ldots, x^{C'}]$ (to simplify the notation, bias terms are omitted). Here $v_c^s$ is a 2D spatial kernel, and therefore represents a single channel of $v_c$ which acts on the corresponding channel of $X$. Since the output is produced by a summation through all channels, the channel dependencies are implicitly embedded in $v_c$, but these dependencies are entangled with the spatial correlation captured by the filters. Our goal is to ensure that the network is able to increase its sensitivity to informative features so that they can be exploited by subsequent transformations, and to suppress less useful ones. We propose to achieve this by explicitly modelling channel interdependencies to recalibrate filter responses in two steps, squeeze and excitation, before they are fed into the next transformation. A diagram of an SE building block is shown in Fig.1.

3. Squeeze-and-Excitation塊

Squeeze-and-Excitation塊是一個計算單元,可以為任何給定的變換$F_{tr}: X \rightarrow U$,$X \in \mathbb{R}^{W' \times H' \times C'}$,$U \in \mathbb{R}^{W \times H \times C}$構建。為了簡化說明,在接下來的表示中,我們將$F_{tr}$看作一個標準的卷積運算元。令$V = [v_1, v_2, \ldots, v_C]$表示學習到的一組濾波器核,$v_c$指第$c$個濾波器的引數。然後我們可以將$F_{tr}$的輸出寫作$U = [u_1, u_2, \ldots, u_C]$,其中

$$u_c = v_c * X = \sum_{s=1}^{C'} v_c^s * x^s.$$

這裡$*$表示卷積,$v_c = [v_c^1, v_c^2, \ldots, v_c^{C'}]$,$X = [x^1, x^2, \ldots, x^{C'}]$(為了簡化表示,忽略偏置項)。這裡$v_c^s$是2D空間核,因此表示$v_c$的一個單通道,作用於$X$的對應通道。由於輸出是通過所有通道求和產生的,所以通道依賴性被隱式地嵌入到$v_c$中,但是這些依賴性與濾波器捕獲的空間相關性糾纏在一起。我們的目標是確保網路能夠提高對資訊特徵的敏感度,以便後續變換可以利用這些特徵,並抑制不太有用的特徵。我們提出通過顯式建模通道相互依賴性,在進入下一個變換之前分squeeze和excitation兩步重新校準濾波器響應來實現這一點。SE構建塊的圖如圖1所示。
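Since the channel-wise summation in the equation above is exactly what a standard convolution computes, the claim that channel dependencies are implicitly embedded in $v_c$ can be checked numerically. A small sketch (the tensor shapes are arbitrary, chosen only for illustration):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)   # input X with C' = 3 channels
v = torch.randn(1, 3, 3, 3)   # a single filter v_c: one 2D kernel v_c^s per input channel

u_full = F.conv2d(x, v)       # standard convolution producing one output channel u_c

# u_c = sum_s v_c^s * x^s: per-channel 2D convolutions summed over the input channels
u_sum = sum(F.conv2d(x[:, s:s + 1], v[:, s:s + 1]) for s in range(3))

assert torch.allclose(u_full, u_sum, atol=1e-4)
```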

3.1. Squeeze: Global Information Embedding

In order to tackle the issue of exploiting channel dependencies, we first consider the signal to each channel in the output features. Each of the learned filters operates with a local receptive field and consequently each unit of the transformation output $U$ is unable to exploit contextual information outside of this region. This is an issue that becomes more severe in the lower layers of the network whose receptive field sizes are small.

3.1. Squeeze:全域性資訊嵌入

為了解決利用通道依賴性的問題,我們首先考慮輸出特徵中每個通道的訊號。每個學習到的濾波器都對區域性感受野進行操作,因此變換輸出$U$的每個單元都無法利用該區域之外的上下文資訊。這個問題在感受野尺寸較小的網路較低層中變得更加嚴重。

To mitigate this problem, we propose to squeeze global spatial information into a channel descriptor. This is achieved by using global average pooling to generate channel-wise statistics. Formally, a statistic $z \in \mathbb{R}^C$ is generated by shrinking $U$ through spatial dimensions $W \times H$, where the $c$-th element of $z$ is calculated by:

$$z_c = F_{sq}(u_c) = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} u_c(i, j).$$

為了緩解這個問題,我們提出將全域性空間資訊壓縮成一個通道描述符。這是通過使用全域性平均池化生成通道統計量來實現的。形式上,統計量$z \in \mathbb{R}^C$是通過在空間維度$W \times H$上收縮$U$生成的,其中$z$的第$c$個元素通過下式計算:

$$z_c = F_{sq}(u_c) = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} u_c(i, j).$$

Discussion. The transformation output UU can be interpreted as a collection of the local descriptors whose statistics are expressive for the whole image. Exploiting such information is prevalent in feature engineering work [31, 34, 45]. We opt for the simplest, global average pooling, while more sophisticated aggregation strategies could be employed here as well.

討論。轉換輸出UU可以被解釋為區域性描述子的集合,這些描述子的統計資訊對於整個影象來說是有表現力的。特徵工程工作中[31,34,45]普遍使用這些資訊。我們選擇最簡單的全域性平均池化,同時也可以採用更復雜的匯聚策略。
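In tensor terms, the squeeze is simply a mean over the spatial dimensions. A one-line sketch (the tensor shapes are hypothetical):

```python
import torch

u = torch.randn(1, 256, 56, 56)  # transformation output U: (N, C, H, W)
z = u.mean(dim=(2, 3))           # z_c = (1 / (W*H)) * sum_{i,j} u_c(i, j)
print(z.shape)                   # torch.Size([1, 256]): one statistic per channel
```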

3.2. Excitation: Adaptive Recalibration

To make use of the information aggregated in the squeeze operation, we follow it with a second operation which aims to fully capture channel-wise dependencies. To fulfil this objective, the function must meet two criteria: first, it must be flexible (in particular, it must be capable of learning a nonlinear interaction between channels) and second, it must learn a non-mutually-exclusive relationship, as multiple channels are allowed to be emphasised (as opposed to one-hot activation). To meet these criteria, we opt to employ a simple gating mechanism with a sigmoid activation:

$$s = F_{ex}(z, W) = \sigma(g(z, W)) = \sigma(W_2 \delta(W_1 z))$$

where $\delta$ refers to the ReLU [26] function, $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$. To limit model complexity and aid generalisation, we parameterise the gating mechanism by forming a bottleneck with two fully-connected (FC) layers around the non-linearity, i.e. a dimensionality-reduction layer with parameters $W_1$ with reduction ratio $r$ (we set it to be 16, and this parameter choice is discussed in Sec. 6.3), a ReLU and then a dimensionality-increasing layer with parameters $W_2$. The final output of the block is obtained by rescaling the transformation output $U$ with the activations:

$$\tilde{x}_c = F_{scale}(u_c, s_c) = s_c \cdot u_c$$

where $\tilde{X} = [\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_C]$ and $F_{scale}(u_c, s_c)$ refers to channel-wise multiplication between the feature map $u_c \in \mathbb{R}^{W \times H}$ and the scalar $s_c$.

3.2. Excitation:自適應重新校正

為了利用squeeze操作中匯聚的資訊,我們接下來通過第二個操作來全面捕獲通道依賴性。為了實現這個目標,這個函式必須符合兩個標準:第一,它必須是靈活的(特別是它必須能夠學習通道之間的非線性互動);第二,它必須學習一種非互斥的關係,因為我們希望允許強調多個通道(與獨熱啟用相反)。為了滿足這些標準,我們選擇採用一個簡單的門機制,並使用sigmoid啟用:

$$s = F_{ex}(z, W) = \sigma(g(z, W)) = \sigma(W_2 \delta(W_1 z))$$

其中$\delta$是指ReLU[26]函式,$W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$,$W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$。為了限制模型複雜度和輔助泛化,我們通過在非線性周圍形成兩個全連線(FC)層的瓶頸來引數化門機制,即引數為$W_1$、降維比例為$r$的降維層(我們將其設定為16,這個引數選擇在6.3節中討論),一個ReLU,然後是一個引數為$W_2$的升維層。塊的最終輸出通過用啟用來重新調節變換輸出$U$得到:

$$\tilde{x}_c = F_{scale}(u_c, s_c) = s_c \cdot u_c$$

其中$\tilde{X} = [\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_C]$,$F_{scale}(u_c, s_c)$指的是特徵對映$u_c \in \mathbb{R}^{W \times H}$和標量$s_c$之間的逐通道乘法。

Discussion. The activations act as channel weights adapted to the input-specific descriptor $z$. In this regard, SE blocks intrinsically introduce dynamics conditioned on the input, helping to boost feature discriminability.

討論。啟用作為適應特定輸入描述符$z$的通道權重。在這方面,SE塊本質上引入了以輸入為條件的動態特性,有助於提高特徵辨別力。
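Concretely, for a stage with C = 256 channels and the reduction ratio r = 16 used in the paper, the gating function is a 256→16→256 bottleneck followed by a sigmoid. A sketch of the shapes involved (the `nn.Sequential` packaging is an assumption for brevity):

```python
import torch
import torch.nn as nn

C, r = 256, 16
gate = nn.Sequential(
    nn.Linear(C, C // r),  # W1: (C/r) x C dimensionality-reduction layer
    nn.ReLU(),             # delta
    nn.Linear(C // r, C),  # W2: C x (C/r) dimensionality-increasing layer
    nn.Sigmoid(),          # sigma: non-mutually-exclusive gates in (0, 1)
)

u = torch.randn(8, C, 14, 14)
z = u.mean(dim=(2, 3))             # squeeze descriptor z
s = gate(z)                        # excitation activations s
x_tilde = u * s[:, :, None, None]  # F_scale: channel-wise multiplication
```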

3.3. Exemplars: SE-Inception and SE-ResNet

The flexibility of the SE block means that it can be directly applied to transformations beyond standard convolutions. To illustrate this point, we develop SENets by integrating SE blocks into two popular families of network architectures, Inception and ResNet. SE blocks are constructed for the Inception network by taking the transformation $F_{tr}$ to be an entire Inception module (see Fig.2). By making this change for each such module in the architecture, we construct an SE-Inception network.


Figure 2. The schema of the original Inception module (left) and the SE-Inception module (right).

3.3. 模型:SE-Inception和SE-ResNet

SE塊的靈活性意味著它可以直接應用於標準卷積之外的變換。為了說明這一點,我們通過將SE塊整合到Inception和ResNet這兩個流行的網路架構系列中來開發SENets。通過將變換$F_{tr}$看作一個完整的Inception模組(參見圖2),為Inception網路構建SE塊。通過對架構中的每個這樣的模組進行更改,我們構建了一個SE-Inception網路。


圖2。最初的Inception模組架構(左)和SE-Inception模組架構(右)。
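Because $F_{tr}$ here is a whole Inception module rather than a single convolution, the integration amounts to appending an SE block after each module's output. A hedged sketch, reusing the `SEBlock` class sketched after Fig. 1; `module` stands in for any Inception module:

```python
import torch.nn as nn

def add_se(module: nn.Module, out_channels: int, reduction: int = 16) -> nn.Module:
    """Wrap an arbitrary module (e.g. an Inception module) with an SE block,
    treating the whole module as the transformation F_tr."""
    return nn.Sequential(module, SEBlock(out_channels, reduction))  # SEBlock: sketch after Fig. 1
```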

Residual networks and their variants have been shown to be highly effective at learning deep representations. We develop a series of SE blocks that integrate with ResNet [9], ResNeXt [43] and Inception-ResNet [38] respectively. Fig.3 depicts the schema of an SE-ResNet module. Here, the SE block transformation $F_{tr}$ is taken to be the non-identity branch of a residual module. Squeeze and excitation both act before summation with the identity branch.


Figure 3. The schema of the original Residual module (left) and the SE-ResNet module (right).

殘差網路及其變種已被證明在學習深度表示方面非常有效。我們開發了一系列SE塊,分別與ResNet[9]、ResNeXt[43]和Inception-ResNet[38]整合。圖3描述了SE-ResNet模組的架構。在這裡,SE塊變換$F_{tr}$被取為殘差模組的非恆等分支。squeeze和excitation都在與恆等分支求和之前起作用。


圖3。 最初的Residual模組架構(左)和SE-ResNet模組架構(右)。
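For residual networks the placement matters: squeeze and excitation act on the residual branch output, and the recalibrated features are then added to the identity shortcut. A minimal sketch (the `branch` argument stands in for the block's convolutional layers; `SEBlock` is the class sketched after Fig. 1):

```python
import torch.nn as nn

class SEResidualBlock(nn.Module):
    """Sketch of an SE-ResNet module: recalibration happens before the summation."""

    def __init__(self, channels: int, branch: nn.Module, reduction: int = 16):
        super().__init__()
        self.branch = branch                    # non-identity branch F_tr
        self.se = SEBlock(channels, reduction)  # SEBlock: sketch after Fig. 1

    def forward(self, x):
        u = self.se(self.branch(x))  # squeeze and excitation on the residual branch
        return x + u                 # summation with the identity branch
```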

4. Model and Computational Complexity

An SENet is constructed by stacking a set of SE blocks. In practice, it is generated by replacing each original block (i.e. residual block) with its corresponding SE counterpart (i.e. SE-residual block). We describe the architecture of SE-ResNet-50 and SE-ResNeXt-50 in Table 1.


Table 1. (Left) ResNet-50. (Middle) SE-ResNet-50. (Right) SE-ResNeXt-50 with a 32×4d template. The shapes and operations with specific parameter settings of a residual building block are listed inside the brackets and the number of stacked blocks in a stage is presented outside. The inner brackets following fc indicate the output dimension of the two fully connected layers in an SE module.

4. 模型和計算複雜度

SENet通過堆疊一組SE塊來構建。實際上,它是通過用原始塊的SE對應部分(即SE殘差塊)替換每個原始塊(即殘差塊)而產生的。我們在表1中描述了SE-ResNet-50和SE-ResNeXt-50的架構。


表1。(左)ResNet-50,(中)SE-ResNet-50,(右)具有32×4d模板的SE-ResNeXt-50。括號內列出了殘差構建塊特定引數設定的形狀和操作,括號外給出了一個階段中堆疊塊的數量。fc後面的內括號表示SE模組中兩個全連線層的輸出維度。

For the proposed SE block to be viable in practice, it must provide an acceptable model complexity and computational overhead, which is important for scalability. To illustrate the cost of the module, we take the comparison between ResNet-50 and SE-ResNet-50 as an example, where the accuracy of SE-ResNet-50 is clearly superior to that of ResNet-50 and approaches that of the deeper ResNet-101 network (shown in Table 2). ResNet-50 requires ∼3.86 GFLOPs in a single forward pass for a 224×224 pixel input image. Each SE block makes use of a global average pooling operation in the squeeze phase and two small fully connected layers in the excitation phase, followed by an inexpensive channel-wise scaling operation. In aggregate, SE-ResNet-50 requires ∼3.87 GFLOPs, corresponding to only a 0.26% relative increase over the original ResNet-50.


Table 2. Single-crop error rates (%) on the ImageNet validation set and complexity comparisons. The original column refers to the results reported in the original papers. To enable a fair comparison, we re-train the baseline models and report the scores in the re-implementation column. The SENet column refers to the corresponding architectures in which SE blocks have been added. The numbers in brackets denote the performance improvement over the re-implemented baselines. † indicates that the model has been evaluated on the non-blacklisted subset of the validation set (this is discussed in more detail in [38]), which may slightly improve results.

要使提出的SE塊在實踐中可行,它必須提供可接受的模型複雜度和計算開銷,這對於可伸縮性很重要。為了說明模組的成本,我們以ResNet-50和SE-ResNet-50之間的比較為例,其中SE-ResNet-50的精確度明顯優於ResNet-50,並接近更深的ResNet-101網路(如表2所示)。對於224×224畫素的輸入影象,ResNet-50單次前向傳播需要∼3.86 GFLOPs。每個SE塊在squeeze階段使用一個全域性平均池化操作,在excitation階段使用兩個小的全連線層,接下來是廉價的逐通道縮放操作。總的來說,SE-ResNet-50需要∼3.87 GFLOPs,相對於原始的ResNet-50只增加了0.26%。


表2。ImageNet驗證集上的單裁剪影象錯誤率(%)和複雜度比較。original列是指原始論文中報告的結果。為了進行公平比較,我們重新訓練了基準模型,並在re-implementation列中報告分數。SENet列是指已新增SE塊後對應的架構。括號內的數字表示與重新實現的基準資料相比的效能改善。†表示該模型已經在驗證集的非黑名單子集上進行了評估(在[38]中有更詳細的討論),這可能稍微改善結果。

In practice, with a training mini-batch of 256 images, a single pass forwards and backwards through ResNet-50 takes 190ms, compared to 209ms for SE-ResNet-50 (both timings are performed on a server with 8 NVIDIA Titan X GPUs). We argue that this is a reasonable overhead, as global pooling and small inner-product operations are less optimised in existing GPU libraries. Moreover, due to its importance for embedded device applications, we also benchmark CPU inference time for each model: for a 224×224 pixel input image, ResNet-50 takes 164ms, compared to 167ms for SE-ResNet-50. The small additional computational overhead required by the SE block is justified by its contribution to model performance (discussed in detail in Sec. 6).

在實踐中,訓練的小批量資料為256張影象,ResNet-50的一次前向和反向傳播花費190ms,而SE-ResNet-50花費209ms(兩個時間都在具有8個NVIDIA Titan X GPU的伺服器上測得)。我們認為這是一個合理的開銷,因為在現有的GPU庫中,全域性池化和小型內積操作的優化程度較低。此外,由於其對嵌入式裝置應用的重要性,我們還對每個模型的CPU推斷時間進行了基準測試:對於224×224畫素的輸入影象,ResNet-50花費164ms,相比之下SE-ResNet-50花費167ms。SE塊所需的少量額外計算開銷對於其對模型效能的貢獻來說是合理的(在第6節中詳細討論)。

Next, we consider the additional parameters introduced by the proposed block. All additional parameters are contained in the two fully connected layers of the gating mechanism, which constitute a small fraction of the total network capacity. More precisely, the number of additional parameters introduced is given by:

$$\frac{2}{r} \sum_{s=1}^{S} N_s \cdot C_s^2$$

where $r$ denotes the reduction ratio (we set $r$ to 16 in all our experiments), $S$ refers to the number of stages (where each stage refers to the collection of blocks operating on feature maps of a common spatial dimension), $C_s$ denotes the dimension of the output channels for stage $s$ and $N_s$ refers to the repeated block number. In total, SE-ResNet-50 introduces ∼2.5 million additional parameters beyond the ∼25 million parameters required by ResNet-50, corresponding to a ∼10% increase in the total number of parameters. The majority of these additional parameters come from the last stage of the network, where excitation is performed across the greatest channel dimensions. However, we found that the comparatively expensive final stage of SE blocks could be removed at a marginal cost in performance (<0.1% top-1 error on the ImageNet dataset) to reduce the relative parameter increase to ∼4%, which may prove useful in cases where parameter usage is a key consideration.

接下來,我們考慮所提出的塊引入的附加引數。所有附加引數都包含在門機制的兩個全連線層中,構成網路總容量的一小部分。更確切地說,引入的附加引數的數量由下式給出:

$$\frac{2}{r} \sum_{s=1}^{S} N_s \cdot C_s^2$$

其中$r$表示減少比率(我們在所有實驗中將$r$設定為16),$S$指階段的數量(每個階段是指在共同空間維度的特徵對映上執行的塊的集合),$C_s$表示階段$s$的輸出通道維度,$N_s$表示階段$s$中重複塊的數量。總的來說,SE-ResNet-50在ResNet-50所需的∼2500萬引數之外引入了∼250萬附加引數,引數總量相對增加了∼10%。這些附加引數中的大部分來自網路的最後階段,其中激勵在最大的通道維度上執行。然而,我們發現SE塊相對昂貴的最後階段可以以微小的效能代價(ImageNet資料集上top-1錯誤率增加<0.1%)移除,將相對引數增加降低到∼4%,這在引數使用是關鍵考慮因素的情況下可能很有用。
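The formula can be checked directly against the figures quoted above. Using the standard stage configuration of ResNet-50 (block counts and output channel widths per stage), a short computation reproduces both the ∼2.5M total and the ∼4% figure obtained when the final stage is dropped:

```python
# Stage configuration of ResNet-50: (N_s repeated blocks, C_s output channels).
stages = [(3, 256), (4, 512), (6, 1024), (3, 2048)]
r = 16

extra = (2 / r) * sum(n * c ** 2 for n, c in stages)
print(f"{extra / 1e6:.2f}M extra parameters")  # ~2.51M, i.e. ~10% of ResNet-50's ~25M

# Removing the comparatively expensive final stage:
extra_trimmed = (2 / r) * sum(n * c ** 2 for n, c in stages[:-1])
print(f"{extra_trimmed / 1e6:.2f}M extra parameters")  # ~0.94M, i.e. ~4% relative increase
```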

5. Implementation

During training, we follow standard practice and perform data augmentation with random-size cropping [39] to 224×224 pixels (299×299 for Inception-ResNet-v2 [38] and SE-Inception-ResNet-v2) and random horizontal flipping. Input images are normalised through mean channel subtraction. In addition, we adopt the data balancing strategy described in [32] for mini-batch sampling to compensate for the uneven distribution of classes. The networks are trained on our distributed learning system “ROCS”, which is capable of handling efficient parallel training of large networks. Optimisation is performed using synchronous SGD with momentum 0.9 and a mini-batch size of 1024 (split into sub-batches of 32 images per GPU across 4 servers, each containing 8 GPUs). The initial learning rate is set to 0.6 and decreased by a factor of 10 every 30 epochs. All models are trained for 100 epochs from scratch, using the weight initialisation strategy described in [8].

5. 實現

在訓練過程中,我們遵循標準做法,使用隨機大小裁剪[39]到224×224畫素(對於Inception-ResNet-v2[38]和SE-Inception-ResNet-v2為299×299)和隨機水平翻轉進行資料增強。輸入影象通過減去通道均值進行歸一化。另外,我們採用[32]中描述的資料均衡策略進行小批量取樣,以補償類別分佈的不均勻。網路在我們的分散式學習系統“ROCS”上訓練,該系統能夠進行大型網路的高效並行訓練。使用同步SGD進行優化,動量為0.9,小批量大小為1024(在4個伺服器、每個伺服器8個GPU上,每個GPU分成32張影象的子批次)。初始學習率設為0.6,每30個迭代週期降低10倍。所有模型都使用[8]中描述的權重初始化策略從零開始訓練100個迭代週期。
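The optimisation schedule maps directly onto standard SGD with a step learning-rate decay. A hedged sketch of that schedule (the placeholder model and training loop body are assumptions; the distributed "ROCS" system itself is not reproduced here):

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # placeholder for a SENet
optimiser = torch.optim.SGD(model.parameters(), lr=0.6, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimiser, step_size=30, gamma=0.1)  # /10 every 30 epochs

for epoch in range(100):  # 100 epochs from scratch
    # ... one pass over the training set with an effective mini-batch of 1024 images ...
    scheduler.step()
```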

6. Experiments

In this section we conduct extensive experiments on the ImageNet 2012 dataset [30] for two purposes: first, to explore the impact of the proposed SE block on basic networks of different depths, and second, to investigate its capacity to integrate with current state-of-the-art network architectures, aiming at a fair comparison between SENets and non-SENets rather than pushing the performance. Next, we present the results and details of the models for the ILSVRC 2017 classification task. Furthermore, we perform experiments on the Places365-Challenge scene classification dataset [48] to investigate how well SENets are able to generalise to other datasets. Finally, we investigate the role of excitation and give some analysis based on experimental phenomena.

6. 實驗

在這一部分,我們在ImageNet 2012資料集上進行了大量的實驗[30],其目的是:首先探索提出的SE塊對不同深度基礎網路的影響;其次,調查它與最先進的網路架構整合後的能力,旨在公平比較SENets和非SENets,而不是推動效能。接下來,我們將介紹ILSVRC 2017分類任務模型的結果和詳細資訊。此外,我們在Places365-Challenge場景分類資料集[48]上進行了實驗,以研究SENets是否能夠很好地泛化到其它資料集。最後,我們研究激勵的作用,並根據實驗現象給出了一些分析。

6.1. ImageNet Classification

The ImageNet 2012 dataset is comprised of 1.28 million training images and 50K validation images from 1000 classes. We train networks on the training set and report the top-1 and the top-5 errors using centre crop evaluations on the validation set, where 224×224 pixels are cropped from each image whose shorter edge is first resized to 256 (299×299 from each image whose shorter edge is first resized to 352 for Inception-ResNet-v2 and SE-Inception-ResNet-v2).

6.1. ImageNet分類

ImageNet 2012資料集包含來自1000個類別的128萬張訓練影象和5萬張驗證影象。我們在訓練集上訓練網路,並在驗證集上使用中心裁剪評估來報告top-1和top-5錯誤率:每張影象的短邊首先縮放到256,然後從中裁剪出224×224畫素(對於Inception-ResNet-v2和SE-Inception-ResNet-v2,每張影象的短邊首先縮放到352,然後裁剪出299×299畫素)。
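The centre-crop evaluation protocol translates into a standard preprocessing pipeline. A sketch using torchvision (the normalisation step is omitted since the paper only specifies mean channel subtraction):

```python
from torchvision import transforms

eval_tf = transforms.Compose([
    transforms.Resize(256),      # shorter edge -> 256 (352 for the Inception-ResNet models)
    transforms.CenterCrop(224),  # 224x224 centre crop (299x299 for the Inception-ResNet models)
    transforms.ToTensor(),
])
```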

Network depth. We first compare the SE-ResNet against a collection of standard ResNet architectures. Each ResNet and its corresponding SE-ResNet are trained with identical optimisation schemes. The performance of the different networks on the validation set is shown in Table 2, which shows that SE blocks consistently improve performance across different depths with an extremely small increase in computational complexity.

網路深度。我們首先將SE-ResNet與一系列標準ResNet架構進行比較。每個ResNet及其對應的SE-ResNet都使用相同的優化方案進行訓練。驗證集上不同網路的效能如表2所示,它表明SE塊以極小的計算複雜度增加,在不同深度的網路上持續地提高了效能。

Remarkably, SE-ResNet-50 achieves a single-crop top-5 validation error of 6.62%, exceeding ResNet-50 (7.48%) by 0.86% and approaching the performance achieved by the much deeper ResNet-101 network (6.52% top-5 error) with only half of the computational overhead (3.87 GFLOPs vs. 7.58 GFLOPs). This pattern is repeated at greater depth, where SE-ResNet-101 (6.07% top-5 error) not only matches, but outperforms the deeper ResNet-152 network (6.34% top-5 error) by 0.27%. Fig.4 depicts the training and validation curves of SE-ResNets and ResNets, respectively. While it should be noted that the SE blocks themselves add depth, they do so in an extremely computationally efficient manner and yield good returns even at the point at which extending the depth of the base architecture achieves diminishing returns. Moreover, we see that the performance improvements are consistent through training across a range of different depths, suggesting that the improvements induced by SE blocks can be used in combination with adding more depth to the base architecture.


Figure 4. Training curves on ImageNet. (Left): ResNet-50 and SE-ResNet-50; (Right): ResNet-152 and SE-ResNet-152.

值得注意的是,SE-ResNet-50實現了6.62%的單裁剪top-5驗證錯誤率,比ResNet-50(7.48%)低0.86%,接近更深的ResNet-101網路(6.52%的top-5錯誤率)的效能,而計算開銷只有其一半(3.87 GFLOPs vs. 7.58 GFLOPs)。這種模式在更大的深度上重複出現:SE-ResNet-101(6.07%的top-5錯誤率)不僅可以匹配,而且以0.27%的優勢超過了更深的ResNet-152網路(6.34%的top-5錯誤率)。圖4分別描繪了SE-ResNets和ResNets的訓練和驗證曲線。雖然應該注意到SE塊本身增加了深度,但是它們的計算效率極高,即使在擴充套件基礎架構的深度已收益遞減的點上也能產生良好的回報。而且,我們看到在一系列不同深度的訓練中,效能改進是一致的,這表明SE塊帶來的改進可以與增加基礎架構的深度結合使用。


圖4。ImageNet上的訓練曲線。(左):ResNet-50和SE-ResNet-50;(右):ResNet-152和SE-ResNet-152。

Integration with modern architectures. We next investigate the effect of combining SE blocks with another two state-of-the-art architectures, Inception-ResNet-v2 [38] and ResNeXt [43]. The Inception architecture constructs modules of convolutions as multibranch combinations of factorised filters, reflecting the Inception hypothesis [6] that spatial correlations and cross-channel correlations can be mapped independently. In contrast, the ResNeXt architecture asserts that richer representations can be obtained by aggregating combinations of sparsely connected (in the channel dimension) convolutional features. Both approaches introduce prior-structured correlations in modules. We construct SENet equivalents of these networks, SE-Inception-ResNet-v2 and SE-ResNeXt (the configuration of SE-ResNeXt-50 (32×4d) is given in Table 1). Like previous experiments, the same optimisation scheme is used for both the original networks and their SENet counterparts.

與現代架構整合。接下來我們研究將SE塊與另外兩種最先進的架構Inception-ResNet-v2[38]和ResNeXt[43]相結合的效果。Inception架構將卷積模組構造為分解濾波器的多分支組合,反映了Inception假設[6],即空間相關性和跨通道相關性可以獨立地對映。相比之下,ResNeXt架構斷言,可以通過聚合稀疏連線(在通道維度上)的卷積特徵的組合來獲得更豐富的表示。兩種方法都在模組中引入了先驗結構化的相關性。我們構造了這些網路的SENet等價物,SE-Inception-ResNet-v2和SE-ResNeXt(表1給出了SE-ResNeXt-50(32×4d)的配置)。像之前的實驗一樣,原始網路和它們對應的SENet網路都使用相同的優化方案。

The results given in Table 2 illustrate the significant performance improvement induced by SE blocks when introduced into both architectures. In particular, SE-ResNeXt-50 has a top-5 error of 5.49%, which is superior to both its direct counterpart ResNeXt-50 (5.90% top-5 error) as well as the deeper ResNeXt-101 (5.57% top-5 error), a model with almost double the number of parameters and computational overhead. As for the experiments with Inception-ResNet-v2, we conjecture that the difference in cropping strategy might lead to the gap between their reported result and our re-implemented one, as their original image size has not been clarified in [38], while we crop the 299×299 region from a relatively larger image (where the shorter edge is resized to 352). SE-Inception-ResNet-v2 (4.79% top-5 error) outperforms our reimplemented Inception-ResNet-v2 (5.21% top-5 error) by 0.42% (a relative improvement of 8.1%) as well as the reported result in [38]. The optimisation curves for each network are depicted in Fig. 5, illustrating the consistency of the improvement yielded by SE blocks throughout the training process.


Figure 5. Training curves on ImageNet. (Left): ResNeXt-50 and SE-ResNeXt-50; (Right): Inception-ResNet-v2 and SE-Inception-ResNet-v2.

表2中給出的結果說明,將SE塊引入這兩種架構都會帶來顯著的效能改善。尤其是SE-ResNeXt-50的top-5錯誤率是5.49%,優於它直接對應的ResNeXt-50(5.90%的top-5錯誤率)以及更深的ResNeXt-101(5.57%的top-5錯誤率),而後者幾乎有兩倍的引數和計算開銷。對於Inception-ResNet-v2的實驗,我們猜測可能是裁剪策略的差異導致了其報告結果與我們重新實現的結果之間的差距,因為它們的原始影象大小尚未在[38]中說明,而我們從相對較大的影象(其中較短邊被縮放到352)中裁剪出299×299大小的區域。SE-Inception-ResNet-v2(4.79%的top-5錯誤率)比我們重新實現的Inception-ResNet-v2(5.21%的top-5錯誤率)低0.42%(相對改進了8.1%),也優於[38]中報告的結果。每個網路的優化曲線如圖5所示,說明了在整個訓練過程中SE塊帶來的改進是一致的。


圖5。ImageNet的訓練曲線。(左): ResNeXt-50和SE-ResNeXt-50;(右):Inception-ResNet-v2和SE-Inception-ResNet-v2。

Finally, we assess the effect of SE blocks when operating on a non-residual network by conducting experiments with the BN-Inception architecture [14] which provides good performance at a lower model complexity. The results of the comparison are shown in Table 2 and the training curves are shown in Fig. 6, exhibiting the same phenomena that emerged in the residual architectures. In particular, SE-BN-Inception achieves a lower top-5 error of 7.14% in comparison to BN-Inception whose error rate is 7.89%. These experiments demonstrate that improvements induced by SE blocks can be used in combination with a wide range of architectures. Moreover, this result holds for both residual and non-residual foundations.


Figure 6. Training curves of BN-Inception and SE-BN-Inception on ImageNet.

最後,我們通過在BN-Inception架構[14]上進行實驗來評估SE塊在非殘差網路上的效果,該架構在較低的模型複雜度下提供了良好的效能。比較結果如表2所示,訓練曲線如圖6所示,表現出與殘差架構中相同的現象。尤其是與BN-Inception的7.89%錯誤率相比,SE-BN-Inception獲得了更低的top-5錯誤率7.14%。這些實驗表明,SE塊帶來的改進可以與多種架構結合使用。而且,這個結果對殘差和非殘差基礎都成立。


圖6。BN-Inception和SE-BN-Inception在ImageNet上的訓練曲線。

Results on ILSVRC 2017 Classification Competition. ILSVRC [30] is an annual computer vision competition which has proved to be a fertile ground for model developments in image classification. The training and validation data of the ILSVRC 2017 classification task are drawn from the ImageNet 2012 dataset, while the test set consists of an additional unlabelled 100K images. For the purposes of the competition, the top-5 error metric is used to rank entries.

**ILSVRC 2017分類競賽的結果。**ILSVRC[30]是一個年度計算機視覺競賽,已被證明是影象分類模型發展的沃土。ILSVRC 2017分類任務的訓練和驗證資料來自ImageNet 2012資料集,而測試集包含額外的未標記的10萬張影象。為了競賽的目的,使用top-5錯誤率度量來對參賽條目進行排名。

SENets formed the foundation of our submission to the challenge where we won first place. Our winning entry comprised a small ensemble of SENets that employed a standard multi-scale and multi-crop fusion strategy to obtain a 2.251% top-5 error on the test set. This result represents a ∼25% relative improvement on the winning entry of 2016 (2.99% top-5 error). One of our high-performing networks is constructed by integrating SE blocks with a modified ResNeXt [43] (details of the modifications are provided in Appendix A). We compare the proposed architecture with the state-of-the-art models on the ImageNet validation set in Table 3. Our model achieves a top-1 error of 18.68% and a top-5 error of 4.47% using a 224×224 centre crop evaluation on each image (where the shorter edge is first resized to 256). To enable a fair comparison with previous models, we also provide a 320×320 centre crop evaluation, obtaining the lowest error rate under both the top-1 (17.28%) and the top-5 (3.79%) error metrics.


Table 3. Single-crop error rates of state-of-the-art CNNs on the ImageNet validation set. The size of the test crop is 224×224 and 320×320/299×299 as in [10]. Our proposed model, SENet, shows a significant performance improvement over prior work.

SENets是我們在挑戰賽中贏得第一名的基礎。我們的獲勝方案由一小組SENets組成,採用標準的多尺度和多裁剪影象融合策略,在測試集上獲得了2.251%的top-5錯誤率。這個結果相對於2016年的獲勝方案(2.99%的top-5錯誤率)改進了∼25%。我們的高效能網路之一是將SE塊與修改後的ResNeXt[43]整合在一起構建的(附錄A提供了這些修改的細節)。在表3中,我們將提出的架構與最新的模型在ImageNet驗證集上進行了比較。我們的模型在每張影象上使用224×224中心裁剪評估(短邊首先縮放到256)取得了18.68%的top-1錯誤率和4.47%的top-5錯誤率。為了與以前的模型進行公平比較,我們也提供了320×320的中心裁剪評估,在top-1(17.28%)和top-5(3.79%)錯誤率度量下均獲得了最低的錯誤率。


表3。最新的CNNs在ImageNet驗證集上單裁剪影象的錯誤率。測試的裁剪影象大小是224×224和[10]中的320×320/299×299。與之前的工作相比,我們提出的模型SENet表現出顯著的效能改進。

6.2. Scene Classification

Large portions of the ImageNet dataset consist of images dominated by single objects. To evaluate our proposed model in more diverse scenarios, we also evaluate it on the Places365-Challenge dataset [48] for scene classification. This dataset comprises 8 million training images and 36,500 validation images across 365 categories. Relative to classification, the task of scene understanding can provide a better assessment of the ability of a model to generalise well and handle abstraction, since it requires the capture of more complex data associations and robustness to a greater level of appearance variation.

6.2. 場景分類

ImageNet資料集的大部分由單個物件支配的影象組成。為了在更多不同的場景下評估我們提出的模型,我們還在Places365-Challenge資料集[48]上對場景分類進行評估。該資料集包含800萬張訓練影象和365個類別的36500張驗證影象。相對於分類,場景理解的任務可以更好地評估模型泛化和處理抽象的能力,因為它需要捕獲更復雜的資料關聯以及對更大程度外觀變化的魯棒性。

We use ResNet-152 as a strong baseline to assess the effectiveness of SE blocks and follow the evaluation protocol in [33]. Table 4 shows the results of training a ResNet-152 model and a SE-ResNet-152 for the given task. Specifically, SE-ResNet-152 (11.01% top-5 error) achieves a lower validation error than ResNet-152 (11.61% top-5 error), providing evidence that SE blocks can perform well on different datasets. This SENet surpasses the previous state-of-the-art model Places-365-CNN [33] which has a top-5 error of 11.48% on this task.


Table 4. Single-crop error rates (%) on the Places365 validation set.

我們使用ResNet-152作為強大的基線來評估SE塊的有效性,並遵循[33]中的評估協議。表4顯示了針對給定任務訓練ResNet-152模型和SE-ResNet-152的結果。具體而言,SE-ResNet-152(11.01%的top-5錯誤率)取得了比ResNet-152(11.61%的top-5錯誤率)更低的驗證錯誤率,證明了SE塊可以在不同的資料集上表現良好。這個SENet超過了先前的最先進的模型Places-365-CNN[33],後者在這個任務上有11.48%的top-5錯誤率。


表4。Places365驗證集上的單裁剪影象錯誤率(%)。

6.3. Analysis and Discussion

Reduction ratio. The reduction ratio $r$ introduced in Eqn. (5) is an important hyperparameter which allows us to vary the capacity and computational cost of the SE blocks in the model. To investigate this relationship, we conduct experiments based on the SE-ResNet-50 architecture for a range of different $r$ values. The comparison in Table 5 reveals that performance does not improve monotonically with increased capacity. This is likely to be a result of enabling the SE block to overfit the channel interdependencies of the training set. In particular, we found that setting $r = 16$ achieved a good tradeoff between accuracy and complexity and consequently, we used this value for all experiments.


Table 5. Single-crop error rates (%) on the ImageNet validation set and corresponding model sizes for the SE-ResNet-50 architecture at different reduction ratios $r$. Here original refers to ResNet-50.

6.3. 分析和討論

減少比率。公式(5)中引入的減少比率$r$是一個重要的超引數,它允許我們改變模型中SE塊的容量和計算成本。