We introduce a generic framework that reduces the computational cost of object detection while retaining accuracy for scenarios where objects with varied sizes appear in high resolution images. Detection progresses in a coarse-to-fine manner, first on a down-sampled version of the image and then on a sequence of higher resolution regions identified as likely to improve the detection accuracy. Built upon reinforcement learning, our approach consists of a model (R-net) that uses coarse detection results to predict the potential accuracy gain for analyzing a region at a higher resolution and another model (Q-net) that sequentially selects regions to zoom in on. Experiments on the Caltech Pedestrians dataset show that our approach reduces the number of processed pixels by over 50% without a drop in detection accuracy. The merits of our approach become more significant on a high resolution test set collected from the YFCC100M dataset, where our approach maintains high detection performance while reducing the number of processed pixels by about 70% and the detection time by over 50%.

Most recent convolutional neural network (CNN) detectors are applied to images with relatively low resolution, e.g., VOC2007/2012 (about 500×400) [12, 13] and MS COCO (about 600×400) [26]. At such low resolutions, the computational cost of convolution is low. However, the resolution of everyday devices has quickly outpaced standard computer vision datasets. The camera of a 4K smartphone, for instance, has a resolution of 2,160×3,840 pixels and a DSLR camera can reach 6,000×4,000 pixels. Applying state-of-the-art CNN detectors directly to those high resolution images requires a large amount of processing time. Additionally, the convolution output maps are too large for the memory of current GPUs.
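To make the gap concrete, the following minimal sketch compares the pixel counts of the resolutions quoted above. It assumes that pixel count is a rough proxy for convolution cost; the names `RESOLUTIONS`, `pixel_count`, and `relative_cost` are illustrative only.

```python
# Approximate resolutions quoted in the text, as (width, height).
RESOLUTIONS = {
    "VOC (approx.)": (500, 400),
    "MS COCO (approx.)": (600, 400),
    "4K smartphone": (3840, 2160),
    "DSLR": (6000, 4000),
}


def pixel_count(size):
    """Number of pixels in an image of the given (width, height)."""
    width, height = size
    return width * height


def relative_cost(size, reference):
    """Ratio of pixel counts, used here as a rough proxy for convolution cost."""
    return pixel_count(size) / pixel_count(reference)


if __name__ == "__main__":
    ref = RESOLUTIONS["VOC (approx.)"]
    for name, size in RESOLUTIONS.items():
        print(f"{name}: {pixel_count(size):,} px, {relative_cost(size, ref):.1f}x VOC")
```

Under this proxy, a 4K frame holds roughly 41 times as many pixels as a VOC-sized image, and a 6,000×4,000 DSLR frame 120 times as many.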

Prior works address some of these issues by simplifying the network architecture [14, 41, 9, 23, 38] to speed up detection and reduce GPU memory consumption. However, these models are tailored to particular network structures and may not generalize well to new architectures. A more general direction is to treat the detector as a black box that is judiciously applied to optimize accuracy and efficiency. For example, one could partition an image into sub-images that satisfy memory constraints and apply the CNN to each sub-image. However, this solution is still computationally burdensome. One could also speed up the detection process and reduce memory requirements by running existing detectors on down-sampled images. However, the smallest objects may become too small to detect in the down-sampled images. Object proposal methods are the basis for most CNN detectors, restricting expensive analysis to regions that are likely to contain objects of interest [11, 35, 44, 43]. However, the number of object proposals needed to achieve good recall for small objects in large images is prohibitively high, which leads to a huge computational cost.

Our approach is illustrated in Fig. 1. We speed up object detection by first performing coarse detection on a downsampled version of the image and then sequentially selecting promising regions to be analyzed at a higher resolution. We employ reinforcement learning to model long-term reward in terms of detection accuracy and computational cost and dynamically select a sequence of regions to analyze at higher resolution. Our approach consists of two networks: a zoom-in accuracy gain regression network (R-net) learns correlations between coarse and fine detections and predicts the accuracy gain for zooming in on a region; a zoom-in Q function network (Q-net) learns to sequentially select the optimal zoom locations and scales by analyzing the output of the R-net and the history of previously analyzed regions.

Experiments demonstrate that, with a negligible drop in detection accuracy, our method reduces processed pixels by over 50% and average detection time by 25% on the Caltech Pedestrian Detection dataset [10], and reduces processed pixels by about 70% and average detection time by over 50% on a high resolution dataset collected from YFCC100M [21] that has pedestrians of varied sizes. We also compare our method to recent single-shot detectors [32, 27] to show our advantage when handling large images.

CNN detectors. One way to analyze high resolution images efficiently is to improve the underlying detector. Girshick [16] sped up the region-proposal-based CNN [17] by sharing convolutional features between proposals. Ren et al. proposed Faster R-CNN [33], a fully end-to-end pipeline that shares features between proposal generation and object detection, improving both accuracy and computational efficiency. Recently, single-shot detectors [27, 31, 32] have received much attention for real-time performance. These methods remove the proposal generation stage and formulate detection as a regression problem. Although these detectors perform well on the PASCAL VOC [12, 13] and MS COCO [26] datasets, which generally contain large objects in images with relatively low resolution, they do not generalize as well to large images with objects of variable sizes. Also, their processing cost increases dramatically with image size due to the large number of convolution operations.

Sequential search. Another strategy to handle large image sizes is to avoid processing the entire image and instead investigate small regions sequentially. However, most existing works focus on mining informative regions to improve detection accuracy without considering computational cost. Lu et al. [28] improve localization by adaptively focusing on subregions likely to contain objects. Alexe et al. [1] sequentially investigated locations based on what has been seen to improve detection accuracy. However, the proposed approach introduces a large overhead leading to long detection time (about 5s per object class per image). Zhang et al. [42] improved the detection accuracy by penalizing the inaccurate location of the initial object proposals, which introduced more than 15% overhead to detection time.

A sequential search process can also make use of contextual cues from sources such as scene segmentation. Existing approaches have explored this idea for various object localization tasks [8, 37, 30]. Such cues can also be incorporated within our framework (e.g., as input for predicting the zoom-in reward). However, we focus on using only coarse detections as a guide for sequential search and leave additional contextual information for future work. Other previous work [25] utilizes a coarse-to-fine strategy to speed up detection, but this work does not select promising regions sequentially.

Reinforcement learning (RL). RL is a popular mechanism for learning sequential search policies, as it allows models to consider the effect of a sequence of actions rather than individual ones. Ba et al. [3] use RL to train an attention-based model to sequentially select the most relevant regions for object recognition, and Jie et al. [20] select regions for localization in a top-down search fashion. However, these methods require a large number of selection steps and may lead to long running times. Caicedo et al. [7] designed an active detection model for object localization, which utilizes Deep Q Networks (DQN) [29] to learn a long-term reward function to transform an initial bounding box sequentially until it converges to an object. However, as reported in [7], the box transformation takes about 1.5s of detection time on a typical Pascal VOC image, which is much slower than recent detectors [33, 27, 32]. In addition, [7] does not explicitly consider selection cost. Although RL implicitly forces the algorithm to take a minimum number of steps, we need to explicitly penalize cost since each step can incur a high cost. For example, if we do not penalize cost, the algorithm will tend to zoom in on the whole image. Existing works have proposed methods to apply RL in cost-sensitive settings [18, 22]. We follow the approach of [18] and treat the reward function as a linear combination of accuracy and cost.
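The effect of penalizing cost can be illustrated with a simplified reward of the form gain minus λ times cost. This is only a sketch of the linear trade-off described above, not the exact reward used in the paper; the function names and the candidate numbers below are made up for illustration.

```python
def zoom_reward(accuracy_gain, cost, lam):
    """Hypothetical cost-sensitive reward: accuracy gain minus a weighted cost."""
    return accuracy_gain - lam * cost


def best_action(candidates, lam):
    """Return the (name, gain, cost) candidate with the highest reward."""
    return max(candidates, key=lambda c: zoom_reward(c[1], c[2], lam))


# Made-up candidates: (name, accuracy gain, cost as a fraction of the image area).
EXAMPLE_CANDIDATES = [
    ("whole image", 1.0, 1.0),
    ("small window", 0.6, 0.25),
]

if __name__ == "__main__":
    for lam in (0.0, 1.0):
        name, _, _ = best_action(EXAMPLE_CANDIDATES, lam)
        print(f"lambda={lam}: choose {name}")
```

With these illustrative numbers, λ = 0 selects the whole image, since nothing discourages the most expensive action, while λ = 1 selects the small window.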

Dynamic zoom-in network
Our work employs a coarse-to-fine strategy, applying a coarse detector at low resolution and using the outputs of this detector to guide an in-depth search for objects at high resolution. The intuition is that, while the coarse detector will not be as accurate as the fine detector, it will identify image regions that need to be further analyzed, incurring the cost of high resolution detection only in promising regions. We make use of two major components: 1) a mechanism for learning the statistical relationship between the coarse and fine detectors, so that we can predict which regions need to be zoomed in given the coarse detector output; and 2) a mechanism for selecting a sequence of regions to analyze at high resolution, given the coarse detector output and the regions that have already been analyzed by the fine detector. Our pipeline is illustrated in Fig. 2. We learn a strategy that models the long-term goal of maximizing the overall detection accuracy with limited cost.
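The control flow described above can be sketched as a short loop. This is a minimal outline under stated assumptions rather than the authors' implementation: the four callables stand in for the coarse detector, the R-net, the Q-net, and the fine detector, and stopping after `max_steps` selections (or when no region is returned) is an assumed stopping rule.

```python
def dynamic_zoom_in(image, coarse_detect, predict_gain, select_region,
                    fine_detect, max_steps=3):
    """Coarse-to-fine detection loop with hypothetical, caller-supplied components."""
    coarse = coarse_detect(image)      # detections on the down-sampled image
    gain_map = predict_gain(coarse)    # role of the R-net: predicted accuracy gain
    history = []                       # regions already analyzed at high resolution
    fine = []
    for _ in range(max_steps):
        # Role of the Q-net: pick the next region given the gain map and history.
        region = select_region(gain_map, history)
        if region is None:
            break
        fine.extend(fine_detect(image, region))  # fine detection only in this region
        history.append(region)
    return coarse, fine, history


if __name__ == "__main__":
    # Toy stand-ins, for illustration only.
    def pick(gain, hist):
        remaining = [r for r in sorted(gain, key=gain.get, reverse=True) if r not in hist]
        return remaining[0] if remaining else None

    result = dynamic_zoom_in(
        "image",
        lambda im: ["coarse-box"],
        lambda dets: {"left": 0.9, "right": 0.2},
        pick,
        lambda im, region: [f"fine-box-in-{region}"],
    )
    print(result)
```

Passing the components in as callables keeps the fine detector a black box, in line with the goal of not modifying the underlying detector.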

Baseline methods
We compare to the following baseline algorithms:
Fine-detection-all. This baseline directly applies the fine detector to the high resolution version of the image. This method leads to high detection accuracy at a high computational cost. All of the other approaches seek to maintain this detection accuracy with less computation.

Coarse-detection-all. This baseline applies the coarse detector on down-sampled images with no zooming.

GS+Rnet. Given the initial state representation generated by the R-net, we use a greedy search strategy (GS) to densely search for the best window every time based on the current state without considering the long-term reward.
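A greedy search of this kind might look like the following sketch. It assumes the state is a 2D grid of predicted accuracy gains, that a window is scored by the sum of the gains it covers, and that covered cells are zeroed after selection; these details are assumptions made for illustration, not taken from the paper.

```python
def window_sum(gain_map, top, left, win_h, win_w):
    """Sum of the gains covered by a window with the given top-left corner."""
    return sum(gain_map[r][c]
               for r in range(top, top + win_h)
               for c in range(left, left + win_w))


def greedy_select(gain_map, win_h, win_w):
    """Densely scan all window positions; return (score, top, left) of the best one."""
    rows, cols = len(gain_map), len(gain_map[0])
    best = None
    for top in range(rows - win_h + 1):
        for left in range(cols - win_w + 1):
            score = window_sum(gain_map, top, left, win_h, win_w)
            if best is None or score > best[0]:
                best = (score, top, left)
    return best


def greedy_search(gain_map, win_h, win_w, steps):
    """Repeatedly take the best window for the current state, ignoring future reward."""
    state = [row[:] for row in gain_map]  # copy so the input is left untouched
    chosen = []
    for _ in range(steps):
        best = greedy_select(state, win_h, win_w)
        if best is None or best[0] <= 0:
            break
        _, top, left = best
        chosen.append((top, left))
        # Assumed state update: the gain inside a zoomed region is used up.
        for r in range(top, top + win_h):
            for c in range(left, left + win_w):
                state[r][c] = 0
    return chosen


if __name__ == "__main__":
    toy_map = [
        [0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 5],
    ]
    print(greedy_search(toy_map, 2, 2, steps=2))
```

Each step looks only at the immediate score, which is exactly what distinguishes this baseline from a policy trained on long-term reward.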

ER+Qnet. The entropy of the detector output (object vs no object) is another way to measure the quality of a coarse detection. [2] used entropy to measure the quality of a region for a classification task. Higher entropy implies lower quality of a coarse detection. So, if we ignore the correlation between fine and coarse detections, the accuracy gain of a region can also be computed as

$$-p_i^l \log(p_i^l) - (1 - p_i^l)\log(1 - p_i^l)$$

where $p_i^l$ indicates the score of the coarse detection. For fair comparison, we fix all parameters of the pipeline and only replace the R-net output of a proposal with its entropy.
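As a small worked example, the entropy of a single coarse detection score can be computed as below. The natural logarithm and the clamping constant `eps` are assumptions of this sketch; the function name is illustrative.

```python
import math


def detection_entropy(p, eps=1e-12):
    """Binary entropy of a detection score p in [0, 1], using the natural log."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


if __name__ == "__main__":
    for score in (0.5, 0.6, 0.99):
        print(score, detection_entropy(score))
```

The entropy peaks at a score of 0.5, where the coarse detector is least certain, and approaches zero as the score nears 0 or 1, so uncertain detections are treated as having the most to gain from zooming in.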

SSD and YOLOv2. We also compare our method with off-the-shelf SSD [27] and YOLOv2 [32] detectors trained on the Caltech Pedestrian Detection dataset (CPD), to show the advantage of our method on large images.

Variants of our framework
We use Qnet-CNN to represent the Q-net developed using a fully convolutional network (see Fig. 2). To analyze the contributions of different components to the performance gain, we evaluate three variants of our framework: Qnet*, Qnet-FC and Rnet*.

Qnet*. This method uses a Q-net with refinement to locally adjust the zoom-in window selected by Q-net.

Qnet-FC. Following [7], we develop this variant with two fully connected (FC) layers for the Q-net. For Qnet-FC, the state representation is resized to a vector of length 1,200 as the input. The first layer has 128 units and the second layer has 34 units (9+25). Each output unit represents a sampled window on an image. We uniformly sample 25 windows of size 320 × 240 and 9 windows of size 214 × 160 on the CPD dataset. Since the number of outputs of Qnet-FC cannot be changed, window sizes are proportionally increased when Qnet-FC is applied to the WP dataset.
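The sizes quoted above can be checked with a short sketch. It assumes 640×480 input frames, a 5×5 grid for the larger windows and a 3×3 grid for the smaller ones, and one bias per unit; the text only says the windows are sampled uniformly, so the grid layout and all names below are illustrative assumptions.

```python
def layer_params(n_in, n_out):
    """Parameters of a fully connected layer: weights plus one bias per unit."""
    return n_in * n_out + n_out


def uniform_windows(img_w, img_h, win_w, win_h, n_x, n_y):
    """(x, y, w, h) windows on an n_x by n_y grid spread evenly over the image."""
    xs = [round(i * (img_w - win_w) / (n_x - 1)) for i in range(n_x)]
    ys = [round(j * (img_h - win_h) / (n_y - 1)) for j in range(n_y)]
    return [(x, y, win_w, win_h) for y in ys for x in xs]


def qnet_fc_windows(img_w=640, img_h=480):
    """34 candidate windows: 25 of size 320x240 and 9 of size 214x160 (assumed grids)."""
    large = uniform_windows(img_w, img_h, 320, 240, 5, 5)
    small = uniform_windows(img_w, img_h, 214, 160, 3, 3)
    return large + small


if __name__ == "__main__":
    windows = qnet_fc_windows()
    total = layer_params(1200, 128) + layer_params(128, 34)
    print(len(windows), "output windows;", total, "FC parameters")
```

Because every output unit is tied to one fixed window, the set of candidate regions is baked into the layer size, which is why the windows have to be rescaled rather than resampled on a dataset with larger images.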

Rnet*. This is an R-net learned using a reward function that does not explicitly encode cost (λ = 0 in Eq. 1).

We propose a dynamic zoom-in network to speed up object detection in large images without manipulating the underlying detector's structure. Images are first down-sampled and processed by the R-net to predict the accuracy gain of zooming in on a region. Then, the Q-net sequentially selects regions with high zoom-in reward to conduct fine detection. The experiments show that our method is effective on both the Caltech Pedestrian Detection dataset and a high resolution pedestrian dataset.

Figure 1: Illustration of our approach. The input is a down-sampled version of the image to which a coarse detector is applied. The R-net uses the initial coarse detection results to predict the utility of zooming in on a region to perform detection at higher resolution. The Q-net then uses the computed accuracy gain map and a history of previous zooms to determine the next zoom that is most likely to improve detection with limited computational cost.

Figure 2: Given a down-sampled image as input, the R-net generates an initial accuracy gain (AG) map indicating the potential zoom-in accuracy gain of different regions (initial state). The Q-net is applied iteratively on the AG map to select regions. Once a region is selected, the AG map will be updated to reflect the history of actions. For the Q-net, two parallel pipelines are used, each of which outputs an action-reward map that corresponds to selecting zoom-in regions with a specific size. The value of the map indicates the likelihood that the action will increase accuracy at low cost. Action rewards from all maps are considered to select the optimal zoom-in region at each iteration. The notation 128×15×20:(7,10) means 128 convolution kernels with size 15×20, and stride of 7/10 in height/width. Each grid cell in the output maps is given a unique color, and a bounding box of the same color is drawn on the image to denote the corresponding zoom region size and location.
