
Deformable Convolutional Networks Paper Translation

Author: Tyan
Blog: noahsnail.com  |  CSDN  |  Jianshu

Note: This paper was translated by the author for learning purposes only. In case of any infringement, please contact the author to delete this post. Thank you!

Deformable Convolutional Networks

Abstract

Convolutional neural networks (CNNs) are inherently limited to model geometric transformations due to the fixed geometric structures in their building modules. In this work, we introduce two new modules to enhance the transformation modeling capability of CNNs, namely, deformable convolution and deformable RoI pooling. Both are based on the idea of augmenting the spatial sampling locations in the modules with additional offsets and learning the offsets from the target tasks, without additional supervision. The new modules can readily replace their plain counterparts in existing CNNs and can be easily trained end-to-end by standard back-propagation, giving rise to deformable convolutional networks. Extensive experiments validate the performance of our approach. For the first time, we show that learning dense spatial transformation in deep CNNs is effective for sophisticated vision tasks such as object detection and semantic segmentation. The code is released at https://github.com/msracver/Deformable-ConvNets.


1. Introduction

A key challenge in visual recognition is how to accommodate geometric variations or model geometric transformations in object scale, pose, viewpoint, and part deformation. In general, there are two ways. The first is to build the training datasets with sufficient desired variations. This is usually realized by augmenting the existing data samples, e.g., by affine transformation. Robust representations can be learned from the data, but usually at the cost of expensive training and complex model parameters. The second is to use transformation-invariant features and algorithms. This category subsumes many well known techniques, such as SIFT (scale invariant feature transform) [42] and sliding window based object detection paradigm.


There are two drawbacks in above ways. First, the geometric transformations are assumed fixed and known. Such prior knowledge is used to augment the data, and design the features and algorithms. This assumption prevents generalization to new tasks possessing unknown geometric transformations, which are not properly modeled. Second, hand-crafted design of invariant features and algorithms could be difficult or infeasible for overly complex transformations, even when they are known.


Recently, convolutional neural networks (CNNs) [35] have achieved significant success for visual recognition tasks, such as image classification [31], semantic segmentation [41], and object detection [16]. Nevertheless, they still share the above two drawbacks. Their capability of modeling geometric transformations mostly comes from the extensive data augmentation, the large model capacity, and some simple hand-crafted modules (e.g., max-pooling [1] for small translation-invariance).


In short, CNNs are inherently limited to model large, unknown transformations. The limitation originates from the fixed geometric structures of CNN modules: a convolution unit samples the input feature map at fixed locations; a pooling layer reduces the spatial resolution at a fixed ratio; a RoI (region-of-interest) pooling layer separates a RoI into fixed spatial bins, etc. There is a lack of internal mechanisms to handle the geometric transformations. This causes noticeable problems. For one example, the receptive field sizes of all activation units in the same CNN layer are the same. This is undesirable for high level CNN layers that encode the semantics over spatial locations. Because different locations may correspond to objects with different scales or deformation, adaptive determination of scales or receptive field sizes is desirable for visual recognition with fine localization, e.g., semantic segmentation using fully convolutional networks [41]. For another example, while object detection has seen significant and rapid progress [16, 52, 15, 47, 46, 40, 7] recently, all approaches still rely on the primitive bounding box based feature extraction. This is clearly sub-optimal, especially for non-rigid objects.


In this work, we introduce two new modules that greatly enhance CNNs’ capability of modeling geometric transformations. The first is deformable convolution. It adds 2D offsets to the regular grid sampling locations in the standard convolution. It enables free form deformation of the sampling grid. It is illustrated in Figure 1. The offsets are learned from the preceding feature maps, via additional convolutional layers. Thus, the deformation is conditioned on the input features in a local, dense, and adaptive manner.

Figure 1

Figure 1: Illustration of the sampling locations in 3 × 3 standard and deformable convolutions. (a) regular sampling grid (green points) of standard convolution. (b) deformed sampling locations (dark blue points) with augmented offsets (light blue arrows) in deformable convolution. (c)(d) are special cases of (b), showing that the deformable convolution generalizes various transformations for scale, (anisotropic) aspect ratio and rotation.
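The mechanism described above can be sketched in a few lines: a learned (dr, dc) offset is added to each point of the regular 3×3 grid, and the resulting fractional locations are read off the feature map with bilinear interpolation. The sketch below is a minimal single-channel, single-location illustration with hypothetical function names, not the paper's implementation; in the actual module the offsets are produced by an additional convolutional layer over the same input rather than passed in directly.

```python
import numpy as np

def bilinear_sample(feat, r, c):
    """Sample feat at a fractional location (r, c) with bilinear interpolation."""
    H, W = feat.shape
    r = min(max(r, 0.0), H - 1.0)
    c = min(max(c, 0.0), W - 1.0)
    r0, c0 = int(np.floor(r)), int(np.floor(c))
    r1, c1 = min(r0 + 1, H - 1), min(c0 + 1, W - 1)
    dr, dc = r - r0, c - c0
    top = (1 - dc) * feat[r0, c0] + dc * feat[r0, c1]
    bot = (1 - dc) * feat[r1, c0] + dc * feat[r1, c1]
    return (1 - dr) * top + dr * bot

def deformable_conv_at(feat, w, offsets, p0):
    """One output value of a 3x3 deformable convolution at location p0.

    offsets has shape (9, 2): a learned (dr, dc) for each grid point p_n.
    w[i + 1, j + 1] holds the kernel weight for grid point (i, j)."""
    grid = [(i, j) for i in (-1, 0, 1) for j in (-1, 0, 1)]  # regular grid R
    out = 0.0
    for k, (i, j) in enumerate(grid):
        r = p0[0] + i + offsets[k, 0]  # deformed sampling row
        c = p0[1] + j + offsets[k, 1]  # deformed sampling column
        out += w[i + 1, j + 1] * bilinear_sample(feat, r, c)
    return out
```

With all offsets set to zero this reduces exactly to standard convolution, which is why the module can replace its plain counterpart without changing behavior at initialization.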


The second is deformable RoI pooling. It adds an offset to each bin position in the regular bin partition of the previous RoI pooling [15, 7]. Similarly, the offsets are learned from the preceding feature maps and the RoIs, enabling adaptive part localization for objects with different shapes.
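A per-bin version of the same idea can be sketched as follows. The offsets are passed in directly (in the paper they are predicted from the features and the RoI), pooling is a plain average, and integer nearest sampling is used inside each bin for brevity; all names here are illustrative, not the paper's API.

```python
import numpy as np

def deformable_roi_pool(feat, roi, k, offsets):
    """Pool an RoI into a k x k grid of bins, each bin shifted by its own
    learned offset (shape (k, k, 2)). Average pooling per bin."""
    r0, c0, r1, c1 = roi
    bin_h = (r1 - r0) / k
    bin_w = (c1 - c0) / k
    H, W = feat.shape
    out = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            # top-left corner of bin (i, j), shifted by its offset
            rs = r0 + i * bin_h + offsets[i, j, 0]
            cs = c0 + j * bin_w + offsets[i, j, 1]
            rr = np.clip(np.arange(int(rs), int(np.ceil(rs + bin_h))), 0, H - 1)
            cc = np.clip(np.arange(int(cs), int(np.ceil(cs + bin_w))), 0, W - 1)
            out[i, j] = feat[np.ix_(rr, cc)].mean()
    return out
```

With zero offsets this is ordinary RoI pooling; non-zero offsets let each bin slide toward the object part it should cover.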


Both modules are light weight. They add small amount of parameters and computation for the offset learning. They can readily replace their plain counterparts in deep CNNs and can be easily trained end-to-end with standard back-propagation. The resulting CNNs are called deformable convolutional networks, or deformable ConvNets.


Our approach shares similar high level spirit with spatial transform networks [26] and deformable part models [11]. They all have internal transformation parameters and learn such parameters purely from data. A key difference in deformable ConvNets is that they deal with dense spatial transformations in a simple, efficient, deep and end-to-end manner. In Section 3.1, we discuss in details the relation of our work to previous works and analyze the superiority of deformable ConvNets.


2. Deformable Convolutional Networks

The feature maps and convolution in CNNs are 3D. Both deformable convolution and RoI pooling modules operate on the 2D spatial domain. The operation remains the same across the channel dimension. Without loss of generality, the modules are described in 2D here for notation clarity. Extension to 3D is straightforward.


2.1. Deformable Convolution

The 2D convolution consists of two steps: 1) sampling using a regular grid R over the input feature map x; 2) summation of sampled values weighted by w. The grid R defines the receptive field size and dilation. For example,

R = {(-1, -1), (-1, 0), ..., (0, 1), (1, 1)} defines a 3×3 kernel with dilation 1.
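The grid R can be built programmatically for any square kernel and dilation; the helper name below is an illustrative choice, not from the paper.

```python
def make_grid(kernel=3, dilation=1):
    """Regular sampling grid R for a square kernel, centered at the origin.

    For kernel=3, dilation=1 this yields the nine offsets
    (-1,-1), (-1,0), ..., (1,1) from the text."""
    half = (kernel - 1) // 2 * dilation
    pts = range(-half, half + 1, dilation)
    return [(i, j) for i in pts for j in pts]
```

Increasing the dilation stretches the same nine-point pattern, e.g. make_grid(3, 2) spans offsets from (-2, -2) to (2, 2).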


For each location p0 on the output feature map y, we have

y(p0) = Σ_{pn ∈ R} w(pn) · x(p0 + pn)    (1)

where pn enumerates the locations in R.
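Equation (1) translates directly to code. A minimal single-location sketch in plain NumPy (the function name is illustrative), with the kernel stored so that w[i + 1, j + 1] is the weight for pn = (i, j):

```python
import numpy as np

# the regular 3x3 grid R with dilation 1
R = [(i, j) for i in (-1, 0, 1) for j in (-1, 0, 1)]

def conv_at(x, w, p0):
    """Eq. (1): y(p0) = sum over pn in R of w(pn) * x(p0 + pn)."""
    return sum(w[i + 1, j + 1] * x[p0[0] + i, p0[1] + j] for (i, j) in R)
```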
