We introduce a generic framework that reduces the computational cost of object detection while retaining accuracy for scenarios where objects with varied sizes appear in high resolution images. Detection progresses in a coarse-to-fine manner, first on a down-sampled version of the image and then on a sequence of higher resolution regions identified as likely to improve the detection accuracy. Built upon reinforcement learning, our approach consists of a model (R-net) that uses coarse detection results to predict the potential accuracy gain of analyzing a region at a higher resolution and another model (Q-net) that sequentially selects regions to zoom in on. Experiments on the Caltech Pedestrians dataset show that our approach reduces the number of processed pixels by over 50% without a drop in detection accuracy. The merits of our approach become more significant on a high resolution test set collected from the YFCC100M dataset, where our approach maintains high detection performance while reducing the number of processed pixels by about 70% and the detection time by over 50%.

Most recent convolutional neural network (CNN) detectors are applied to images with relatively low resolution, e.g., VOC2007/2012 (about 500×400) [12, 13] and MS COCO (about 600×400) [26]. At such low resolutions, the computational cost of convolution is low. However, the resolution of everyday devices has quickly outpaced standard computer vision datasets. The camera of a 4K smartphone, for instance, has a resolution of 2,160×3,840 pixels and a DSLR camera can reach 6,000×4,000 pixels. Applying state-of-the-art CNN detectors directly to those high resolution images requires a large amount of processing time. Additionally, the convolution output maps are too large for the memory of current GPUs.

Prior works address some of these issues by simplifying the network architecture [14, 41, 9, 23, 38] to speed up detection and reduce GPU memory consumption. However, these models are tailored to particular network structures and may not generalize well to new architectures. A more general direction is to treat the detector as a black box that is judiciously applied to optimize accuracy and efficiency. For example, one could partition an image into sub-images that satisfy memory constraints and apply the CNN to each sub-image. However, this solution is still computationally burdensome. One could also speed up the detection process and reduce memory requirements by running existing detectors on down-sampled images. However, the smallest objects may become too small to detect in the down-sampled images. Object proposal methods are the basis for most CNN detectors, restricting expensive analysis to regions that are likely to contain objects of interest [11, 35, 44, 43]. However, the number of object proposals needed to achieve good recall for small objects in large images is prohibitively high, which leads to a huge computational cost.

Our approach is illustrated in Fig. 1. We speed up object detection by first performing coarse detection on a downsampled version of the image and then sequentially selecting promising regions to be analyzed at a higher resolution. We employ reinforcement learning to model long-term reward in terms of detection accuracy and computational cost and dynamically select a sequence of regions to analyze at higher resolution. Our approach consists of two networks: a zoom-in accuracy gain regression network (R-net) learns correlations between coarse and fine detections and predicts the accuracy gain for zooming in on a region; a zoom-in Q function network (Q-net) learns to sequentially select the optimal zoom locations and scales by analyzing the output of the R-net and the history of previously analyzed regions.

Experiments demonstrate that, with a negligible drop in detection accuracy, our method reduces processed pixels by over 50% and average detection time by 25% on the Caltech Pedestrian Detection dataset [10], and reduces processed pixels by about 70% and average detection time by over 50% on a high resolution dataset collected from YFCC100M [21] that has pedestrians of varied sizes. We also compare our method to recent single-shot detectors [32, 27] to show our advantage when handling large images.

CNN detectors. One way to analyze high resolution images efficiently is to improve the underlying detector. Girshick [16] sped up the region-proposal-based CNN [17] by sharing convolutional features between proposals. Ren et al. proposed Faster R-CNN [33], a fully end-to-end pipeline that shares features between proposal generation and object detection, improving both accuracy and computational efficiency. Recently, single-shot detectors [27, 31, 32] have received much attention for real-time performance. These methods remove the proposal generation stage and formulate detection as a regression problem. Although these detectors perform well on the PASCAL VOC [12, 13] and MS COCO [26] datasets, which generally contain large objects in images with relatively low resolution, they do not generalize as well to large images with objects of variable sizes. Also, their processing cost increases dramatically with image size due to the large number of convolution operations.

Sequential search. Another strategy to handle large image sizes is to avoid processing the entire image and instead investigate small regions sequentially. However, most existing works focus on mining informative regions to improve detection accuracy without considering computational cost. Lu et al. [28] improve localization by adaptively focusing on subregions likely to contain objects. Alexe et al. [1] sequentially investigate locations based on what has already been seen in order to improve detection accuracy. However, their approach introduces a large overhead, leading to long detection times (about 5s per object class per image). Zhang et al. [42] improve detection accuracy by penalizing the inaccurate location of the initial object proposals, which introduces more than 15% overhead to detection time.

A sequential search process can also make use of contextual cues from other sources, such as scene segmentation. Existing approaches have explored this idea for various object localization tasks [8, 37, 30]. Such cues can also be incorporated within our framework (e.g., as input for predicting the zoom-in reward). However, we focus on using only coarse detections as a guide for sequential search and leave additional contextual information for future work. Other previous work [25] utilizes a coarse-to-fine strategy to speed up detection, but it does not select promising regions sequentially.

Reinforcement learning (RL). RL is a popular mechanism for learning sequential search policies, as it allows models to consider the effect of a sequence of actions rather than individual ones. Ba et al. [3] use RL to train an attention-based model that sequentially selects the most relevant regions for object recognition, and Jie et al. [20] select regions for localization in a top-down search fashion. However, these methods require a large number of selection steps and may lead to long running times. Caicedo et al. [7] designed an active detection model for object localization, which utilizes Deep Q Networks (DQN) [29] to learn a long-term reward function to transform an initial bounding box sequentially until it converges to an object. However, as reported in [7], the box transformation takes about 1.5s of detection time on a typical Pascal VOC image, which is much slower than recent detectors [33, 27, 32]. In addition, [7] does not explicitly consider selection cost. Although RL implicitly forces the algorithm to take a minimum number of steps, we need to explicitly penalize cost since each step can yield a high cost. For example, if we do not penalize cost, the algorithm will tend to zoom in on the whole image. Existing works have proposed methods to apply RL in cost-sensitive settings [18, 22]. We follow the approach of [18] and treat the reward function as a linear combination of accuracy and cost.
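
To illustrate why an explicit cost penalty matters, the following minimal sketch scores two hypothetical zoom-in actions with a reward of the form gain minus a weight times cost. The exact reward used in this work is defined elsewhere (Eq. 1) and is not reproduced here; the function names, the weight `lam`, the use of the processed-pixel fraction as the cost, and all numbers are assumptions made only for illustration.

```python
def zoom_reward(accuracy_gain: float, cost: float, lam: float) -> float:
    """Assumed linear trade-off: reward grows with gain and shrinks with cost."""
    return accuracy_gain - lam * cost


def best_action(candidates: dict, lam: float) -> str:
    """Return the name of the candidate action with the highest reward."""
    return max(candidates, key=lambda name: zoom_reward(*candidates[name], lam))


# Hypothetical candidates: name -> (accuracy gain, fraction of image pixels processed).
candidates = {
    "whole image": (1.0, 1.0),
    "quarter region": (0.8, 0.25),
}

print(best_action(candidates, lam=0.0))  # whole image
print(best_action(candidates, lam=0.5))  # quarter region
```

With no penalty (`lam = 0`) the whole image always looks best because it can only add accuracy, whereas a positive weight makes the smaller region preferable once its gain per processed pixel is higher.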

Dynamic zoom-in network
Our work employs a coarse-to-fine strategy, applying a coarse detector at low resolution and using the outputs of this detector to guide an in-depth search for objects at high resolution. The intuition is that, while the coarse detector will not be as accurate as the fine detector, it will identify image regions that need to be further analyzed, incurring the cost of high resolution detection only in promising regions. We make use of two major components: 1) a mechanism for learning the statistical relationship between the coarse and fine detectors, so that we can predict which regions need to be zoomed in on, given the coarse detector output; and 2) a mechanism for selecting a sequence of regions to analyze at high resolution, given the coarse detector output and the regions that have already been analyzed by the fine detector. Our pipeline is illustrated in Fig. 2. We learn a strategy that models the long-term goal of maximizing the overall detection accuracy with limited cost.
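
The control flow described above can be sketched as a short loop. This is only an illustration under stated assumptions, not the implementation used in this work: `select_region` stands in for the learned region selector, `fine_detect` for the fine detector, and the rule that fine detections replace coarse detections whose centers fall inside a zoomed region is a simplification introduced here.

```python
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) in full-resolution pixels
Detection = Tuple[Box, float]    # (box, confidence score)


def center_inside(box: Box, region: Box) -> bool:
    """True if the center of `box` lies inside `region`."""
    x, y, w, h = box
    cx, cy = x + w / 2, y + h / 2
    rx, ry, rw, rh = region
    return rx <= cx < rx + rw and ry <= cy < ry + rh


def coarse_to_fine_detect(
    coarse_detections: List[Detection],
    select_region: Callable[[List[Detection], List[Box]], Optional[Box]],
    fine_detect: Callable[[Box], List[Detection]],
    max_zooms: int = 3,
) -> Tuple[List[Detection], List[Box]]:
    """Refine coarse detections by zooming in on a sequence of selected regions."""
    detections = list(coarse_detections)
    history: List[Box] = []
    for _ in range(max_zooms):
        # The selector sees the current detections and the regions already analyzed.
        region = select_region(detections, history)
        if region is None:  # selector decides that no further zoom is worthwhile
            break
        fine = fine_detect(region)
        # Assumed merge rule: fine results replace coarse ones inside the region.
        detections = [d for d in detections if not center_inside(d[0], region)] + fine
        history.append(region)
    return detections, history
```

The fine detector is only ever called on the selected regions, which is where the savings in processed pixels come from.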

Baseline methods
We compare to the following baseline algorithms:
Fine-detection-all. This baseline directly applies the fine detector to the high resolution version of the image. This method leads to high detection accuracy at a high computational cost. All of the other approaches seek to maintain this detection accuracy with less computation.

Coarse-detection-all. This baseline applies the coarse detector on down-sampled images with no zooming.

GS+Rnet. Given the initial state representation generated by the R-net, we use a greedy search strategy (GS) to densely search for the best window every time based on the current state without considering the long-term reward.
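
A greedy strategy of this kind can be sketched as follows. The sketch assumes that the state is a 2D map of predicted accuracy gains, that a window is scored by the sum of the gains it covers, and that the selected window is zeroed out after each step. These are simplifications chosen for illustration, and the function names are hypothetical.

```python
def best_window(gain_map, win_h, win_w):
    """Densely scan all window positions and return ((top, left), score) of the best one."""
    rows, cols = len(gain_map), len(gain_map[0])
    if win_h > rows or win_w > cols:
        raise ValueError("window does not fit inside the gain map")
    best_pos, best_score = None, float("-inf")
    for top in range(rows - win_h + 1):
        for left in range(cols - win_w + 1):
            score = sum(
                gain_map[r][c]
                for r in range(top, top + win_h)
                for c in range(left, left + win_w)
            )
            if score > best_score:
                best_pos, best_score = (top, left), score
    return best_pos, best_score


def greedy_search(gain_map, win_h, win_w, steps):
    """Pick the best window for the current state at each step, ignoring future steps."""
    state = [row[:] for row in gain_map]  # copy so the caller's map is not modified
    chosen = []
    for _ in range(steps):
        (top, left), score = best_window(state, win_h, win_w)
        if score <= 0:  # nothing left that is predicted to help
            break
        chosen.append((top, left))
        # Assumed state update: an analyzed region offers no further gain.
        for r in range(top, top + win_h):
            for c in range(left, left + win_w):
                state[r][c] = 0.0
    return chosen
```

Because each step maximizes only the immediate score, the search cannot trade a slightly worse window now for a better overall sequence, which is the behavior a long-term reward is meant to capture.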

ER+Qnet. The entropy of the detector output (object vs no object) is another way to measure the quality of a coarse detection. [2] used entropy to measure the quality of a region for a classification task. Higher entropy implies lower quality of a coarse detection. So, if we ignore the correlation between fine and coarse detections, the accuracy gain of a region can also be computed as

$-p_i^l \log\left(p_i^l\right) - \left(1 - p_i^l\right) \log\left(1 - p_i^l\right)$

where $p_i^l$ indicates the score of the coarse detection. For a fair comparison, we keep all parameters of the pipeline fixed and only replace the R-net output of a proposal with its entropy.
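
As a small numerical illustration of this entropy score, the sketch below evaluates it for a single coarse detection score. The natural logarithm and the convention that the entropy is zero at scores of exactly 0 or 1 are assumptions, and the function name is hypothetical.

```python
import math


def detection_entropy(p: float) -> float:
    """Binary entropy of a coarse detection score p in [0, 1] (natural log assumed)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # limit of -p*log(p) as p approaches 0 or 1
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


for score in (0.5, 0.6, 0.9, 1.0):
    print(score, detection_entropy(score))
```

The score is largest for an uncertain detection (a score of 0.5) and falls to zero as the coarse detector becomes confident in either direction, so under this baseline confident coarse detections are treated as not worth zooming in on.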

SSD and YOLOv2. We also compare our method with off-the-shelf SSD [27] and YOLOv2 [32] trained on CPD, to show the advantage of our method on large images.

Variants of our framework
We use Qnet-CNN to represent the Q-net developed using a fully convolutional network (see Fig. 2). To analyze the contributions of different components to the performance gain, we evaluate three variants of our framework: Qnet*, Qnet-FC and Rnet*.

Qnet*. This variant adds a refinement step to the Q-net that locally adjusts the zoom-in window selected by the Q-net.

Qnet-FC. Following [7], we develop this variant with two fully connected (FC) layers for the Q-net. For Qnet-FC, the state representation is resized to a vector of length 1,200 as the input. The first layer has 128 units and the second layer has 34 units (9+25). Each output unit represents a sampled window on an image. We uniformly sample 25 windows of size 320 × 240 and 9 windows of size 214 × 160 on the CPD dataset. Since the number of outputs of Qnet-FC cannot be changed, window sizes are proportionally increased when Qnet-FC is applied to the WP dataset.
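
The fixed set of 34 candidate windows can be sketched as two regular grids. The sketch assumes a 640 × 480 image and evenly spaced top-left corners (a 5 × 5 grid for the larger windows and a 3 × 3 grid for the smaller ones). The exact sampling layout and the ordering of the output units are not specified above, so these choices and all names are illustrative.

```python
def uniform_windows(img_w, img_h, win_w, win_h, per_side):
    """Return per_side**2 windows (x, y, w, h) with evenly spaced top-left corners."""
    xs = [round(i * (img_w - win_w) / (per_side - 1)) for i in range(per_side)]
    ys = [round(j * (img_h - win_h) / (per_side - 1)) for j in range(per_side)]
    return [(x, y, win_w, win_h) for y in ys for x in xs]


IMG_W, IMG_H = 640, 480  # assumed image size
windows = (
    uniform_windows(IMG_W, IMG_H, 320, 240, per_side=5)    # 25 large windows
    + uniform_windows(IMG_W, IMG_H, 214, 160, per_side=3)  # 9 small windows
)


def window_for_action(q_values):
    """Map the highest-scoring output unit to its zoom-in window."""
    if len(q_values) != len(windows):
        raise ValueError("expected one Q-value per sampled window")
    best = max(range(len(q_values)), key=lambda k: q_values[k])
    return windows[best]


print(len(windows))  # 34
```

Because each output unit is tied to one window, the action set is fixed by the layer width, which is why the window sizes have to be scaled rather than resampled when the image size changes.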

Rnet*. This is an R-net learned using a reward function that does not explicitly encode cost (λ = 0 in Eq. 1).

We propose a dynamic zoom-in network to speed up object detection in large images without manipulating the underlying detector's structure. Images are first downsampled and processed by the R-net to predict the accuracy gain of zooming in on a region. Then, the Q-net sequentially selects regions with high zoom-in reward on which to conduct fine detection. The experiments show that our method is effective on both the Caltech Pedestrian Detection dataset and a high resolution pedestrian dataset.

Figure 1: Illustration of our approach. The input is a downsampled version of the image to which a coarse detector is applied. The R-net uses the initial coarse detection results to predict the utility of zooming in on a region to perform detection at higher resolution. The Q-net then uses the computed accuracy gain map and a history of previous zooms to determine the next zoom that is most likely to improve detection with limited computational cost.

Figure 2: Given a down-sampled image as input, the R-net generates an initial accuracy gain (AG) map indicating the potential zoom-in accuracy gain of different regions (initial state). The Q-net is applied iteratively on the AG map to select regions. Once a region is selected, the AG map will be updated to reflect the history of actions. For the Q-net, two parallel pipelines are used, each of which outputs an action-reward map that corresponds to selecting zoom-in regions with a specific size. The value of the map indicates the likelihood that the action will increase accuracy at low cost. Action rewards from all maps are considered to select the optimal zoom-in region at each iteration. The notation 128×15×20:(7,10) means 128 convolution kernels with size 15×20, and stride of 7/10 in height/width. Each grid cell in the output maps is given a unique color, and a bounding box of the same color is drawn on the image to denote the corresponding zoom region size and location.