【YOLT】《You Only Look Twice: Rapid Multi-Scale Object Detection In Satellite Imagery》

阿新 • Published: 2019-01-04

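This post walks through an efficient object detection algorithm for satellite imagery that builds on YOLO, optimized mainly for high-resolution inputs and dense small objects.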


1 Motivation

Detection of small objects in large swaths of imagery is one of the primary problems in satellite imagery analytics.

While object detection in ground-based imagery has benefited from research into new deep learning approaches, transitioning such technology to overhead imagery is nontrivial.

1.1 Challenges

sheer number of pixels per image: over 250 million pixels
geographic extent per image: > $64\ \mathrm{km}^{2}$
objects of interest are minuscule: about 10 pixels

1.2 Impressive framework

Faster R-CNN: typically ingests 1000 × 600 pixel images
SSD: 300 × 300 or 512 × 512
YOLO: 416 × 416 or 544 × 544

However, none can come remotely close to ingesting the ~16,000 × 16,000 input sizes typical of satellite imagery.

Due to the speed, accuracy, and flexibility of YOLO, the authors base their framework on YOLO.

Excluding implementation details, algorithms must adjust for:

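Small spatial extent (objects are tiny): small and densely clustered. In satellite imagery, objects of interest are relatively small and often clustered together, unlike the large, prominent objects in the ImageNet dataset. Object resolution is determined mainly by the ground sample distance (GSD), which defines the physical length that each pixel covers. Satellites typically orbit at an altitude of around 350 km; the sharpest commercial satellite imagery reaches 30 cm GSD (each pixel covers 30 cm), while ordinary digital satellite imagery only reaches 3-4 m resolution. Small objects such as vehicles and boats may therefore be described by only ten or so pixels;
Complete rotation invariance (rotation invariance is required): objects in satellite imagery can face in any direction, whereas objects in ImageNet are mostly upright, so the detector needs to be rotation invariant;
Training example frequency (few training samples): high-quality training data for satellite imagery is scarce; SpaceNet has done a series of useful work, but further improvement is still needed;
Ultra high resolution (images are huge): unlike the small images usually fed to detectors, satellite images easily run to hundreds of millions of pixels, so simply downsampling the image does not work for satellite imagery.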

The paper's contribution is that it addresses each of these issues separately.

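Terminology:

Ground sample distance (GSD)
The real-world size that one pixel of a satellite image represents. For example, 30 cm GSD means that one pixel in the image corresponds to 30 cm in the real world.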

Commercially available imagery varies from 30 cm GSD for the sharpest DigitalGlobe imagery, to 3-4 meter GSD for Planet imagery.

That is to say, for small objects such as cars, each object will be only ~15 pixels in extent even at the highest resolution.
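The relationship between GSD and object size in pixels is a simple division. The sketch below illustrates it; the function name `extent_in_pixels` and the 4.5 m car length are assumptions chosen for illustration and do not come from the paper.

```python
def extent_in_pixels(size_m: float, gsd_m: float) -> float:
    """Pixels spanned by an object of physical size `size_m` at a GSD of `gsd_m` meters per pixel."""
    if gsd_m <= 0:
        raise ValueError("GSD must be positive")
    return size_m / gsd_m


CAR_LENGTH_M = 4.5  # assumed typical car length (hypothetical, not from the paper)

for gsd in (0.3, 3.0):
    px = extent_in_pixels(CAR_LENGTH_M, gsd)
    print(f"GSD {gsd} m -> a car spans about {px:.1f} px")
```

With these assumed numbers a car covers roughly 15 pixels at 30 cm GSD and less than 2 pixels at 3 m GSD, which is in line with the ~15 pixel figure quoted above.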

3 Advantage

The proposed approach can rapidly detect objects of vastly different scales with relatively little training data over multiple sensors.

4 Method

Left: Model applied to a large 4000 × 4000 pixel test image downsampled to a size of 416 × 416 (the small objects vanish); none of the 1142 cars in this image are detected.

Right: Model applied to a small 416 × 416 pixel cutout; the excessive false negative rate is due to the high density of cars that cannot be differentiated by the 13 × 13 grid.
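A quick back-of-the-envelope calculation suggests why the downsampled image loses every car. The sketch below assumes a car of about 15 pixels in the original image (the figure quoted earlier for 30 cm GSD); `size_after_resize` is a hypothetical helper name.

```python
def size_after_resize(object_px: float, src_side: int, dst_side: int) -> float:
    """Object extent in pixels after resizing a square image from src_side to dst_side."""
    return object_px * dst_side / src_side


car_px = 15  # assumed car extent in the full-resolution image
shrunk = size_after_resize(car_px, src_side=4000, dst_side=416)
print(f"A {car_px} px car shrinks to about {shrunk:.2f} px")
```

Under that assumption a car occupies well under two pixels after resizing, so there is very little signal left for the detector.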

In summary, the authors' method is:
Data augmentation + pre- and post-processing + an improved YOLOv1

4.1 Improving YOLOv1

Input: 416 × 416
Consider the default YOLO network architecture, which downsamples by a factor of 32 and returns a 13 × 13 prediction grid;
as a result, objects smaller than about 32 pixels cannot be detected reliably.

The authors' approach

Reduce the downsampling factor and adjust the number of network layers:
we implement a network architecture that uses 22 layers and downsamples by a factor of 16. Thus, a 416 × 416 pixel input image yields a 26 × 26 prediction grid.
Use Leaky ReLU as the activation function
Add a passthrough layer
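The effect of the downsampling factor on the prediction grid can be checked with simple arithmetic. The sketch below is only an illustration; `prediction_grid` is a hypothetical helper name.

```python
def prediction_grid(input_side: int, downsample: int):
    """Return (grid cells per side, input pixels covered by each cell)."""
    if input_side % downsample:
        raise ValueError("input side should be a multiple of the downsampling factor")
    return input_side // downsample, downsample


for factor in (32, 16):
    cells, cell_px = prediction_grid(416, factor)
    print(f"downsample x{factor}: {cells} x {cells} grid, each cell covers {cell_px} px")
```

Halving the downsampling factor from 32 to 16 doubles the grid from 13 × 13 to 26 × 26, so each cell covers 16 input pixels instead of 32 and neighbouring small objects are less likely to share a cell.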

The paper summarizes the authors' changes to YOLOv1 as follows.

4.1.1 Leaky ReLUs

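For reference, a Leaky ReLU passes positive inputs through unchanged and scales negative inputs by a small slope. The sketch below is a minimal scalar version; the slope of 0.1 is an assumption (a commonly used default), not a value taken from the post.

```python
def leaky_relu(x: float, negative_slope: float = 0.1) -> float:
    """Leaky ReLU: identity for x > 0, a small linear slope otherwise.

    negative_slope=0.1 is an assumed default, not confirmed by the source.
    """
    return x if x > 0 else negative_slope * x


for value in (-2.0, 0.0, 3.0):
    print(value, "->", leaky_relu(value))
```

Unlike a plain ReLU, the negative side keeps a non-zero gradient, so units are less likely to stop updating during training.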

4.1.2 passthrough layer

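This layer takes the adjacent pixels of the previous layer's feature map and rearranges them into separate channels. For example, a 26*26*512 feature map becomes a 13*13*2048 feature map. (The exact implementation is in the authors' source code, but the transformation can be understood as follows: the pixels of one 26*26 map are distributed over four 13*13 maps by taking every second pixel horizontally and every second pixel vertically, which gives 2*2=4 maps per channel and 512*4=2048 channels in total.) The number of feature maps therefore increases by a factor of 4, and the fine-grained features of the 26*26 map are preserved at 13*13 resolution, which helps with detecting small objects.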
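The rearrangement described above is often called space-to-depth. The sketch below implements it on plain nested lists laid out as `fmap[channel][row][col]`; the function name and the channel ordering are assumptions for illustration, and the actual ordering in the authors' code may differ.

```python
def space_to_depth(fmap, block=2):
    """Rearrange each block x block spatial neighbourhood into separate channels.

    fmap is indexed as fmap[channel][row][col]. The output channel order
    (offset-major, then input channel) is one possible choice, assumed here.
    """
    channels, height, width = len(fmap), len(fmap[0]), len(fmap[0][0])
    if height % block or width % block:
        raise ValueError("height and width must be divisible by block")
    out = []
    for dy in range(block):          # vertical offset inside each block
        for dx in range(block):      # horizontal offset inside each block
            for c in range(channels):
                out.append([
                    [fmap[c][y + dy][x + dx] for x in range(0, width, block)]
                    for y in range(0, height, block)
                ])
    return out


# One 4 x 4 channel whose values are 0..15 in row-major order.
fmap = [[[row * 4 + col for col in range(4)] for row in range(4)]]
out = space_to_depth(fmap)
print(len(out), len(out[0]), len(out[0][0]))  # channels, rows, cols
print(out[0])  # pixels at even rows and even columns
```

Applied to a 26 × 26 map with 512 channels, the same function would return 2048 channels of size 13 × 13, matching the shapes quoted above.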

The network architecture is shown in the figure below.

The red line marks the passthrough layer.

$N_{f} = N_{boxes} \times (N_{class} + 5)$

$N_{boxes}$ is the number of boxes per grid cell (default is 5).
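The formula above can be evaluated directly. In the sketch below, `final_layer_filters` is a hypothetical helper name, and the class counts are illustrative values rather than the paper's configuration.

```python
def final_layer_filters(n_classes: int, n_boxes: int = 5) -> int:
    """N_f = N_boxes * (N_class + 5), with 5 boxes per grid cell by default."""
    return n_boxes * (n_classes + 5)


for n_classes in (1, 5):  # illustrative class counts, not from the paper
    print(f"{n_classes} class(es) -> {final_layer_filters(n_classes)} filters")
```

For example, a single-class model with the default 5 boxes needs 5 × (1 + 5) = 30 filters in the final layer.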

4.2 pre-processing and post-processing

Pre-processing: at training time, the image is split into many cutouts with a 15% overlap.
Post-processing: at test time, the detections from the cutouts are merged with NMS (non-maximum suppression).
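The two steps can be sketched as a sliding window followed by a greedy NMS over detections mapped back to global coordinates. This is a minimal illustration and not the authors' implementation; the function names, the `(x1, y1, x2, y2)` box format, and the 0.5 IoU threshold are all assumptions.

```python
def cutout_origins(image_side, cutout_side=416, overlap=0.15):
    """Top-left offsets along one axis for windows that overlap by `overlap`."""
    if image_side <= cutout_side:
        return [0]
    step = int(cutout_side * (1 - overlap))
    origins = list(range(0, image_side - cutout_side, step))
    origins.append(image_side - cutout_side)  # last window sits flush with the edge
    return origins


def to_global(box, x_off, y_off):
    """Shift a cutout-local (x1, y1, x2, y2) box into full-image coordinates."""
    x1, y1, x2, y2 = box
    return (x1 + x_off, y1 + y_off, x2 + x_off, y2 + y_off)


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy NMS over (box, score) pairs; the 0.5 threshold is an assumed value."""
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, kept_box) < iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept


print(cutout_origins(1000))
# The same object seen in two overlapping cutouts, plus one unrelated object.
detections = [
    (to_global((360, 10, 370, 20), 0, 0), 0.9),
    (to_global((8, 11, 18, 21), 353, 0), 0.8),
    (to_global((50, 50, 60, 60), 584, 584), 0.7),
]
print(nms(detections))
```

Because neighbouring cutouts overlap, an object near a seam is typically detected twice; NMS keeps the higher-scoring box and drops the duplicate.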

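5 Dataset

Cars: the COWC dataset, at 15 cm GSD. To match the 30 cm GSD of current commercial satellite imagery, the images were processed with a Gaussian kernel, and each car was labeled with a 3 m bounding box at 30 cm GSD, for 13,303 samples in total;
Building footprints: 221,336 samples labeled at 30 cm GSD, based on SpaceNet data;
Airplanes: 230 samples labeled from eight DigitalGlobe images;
Boats: 556 samples labeled from three DigitalGlobe images;
Airports: 37 images used as training samples, containing airport runways and downsampled by a factor of 4.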

Training uses an initial learning rate of $10^{-3}$, a weight decay of 0.0005, and a momentum of 0.9.
Training takes 2 ~ 3 days on a single NVIDIA Titan X GPU.

6 Experiment

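The authors evaluate the model with the F1 score. Cars are smaller than the other objects, so a lower IoU threshold is used for them. The results are as follows.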
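As a reminder of how an IoU threshold feeds into F1, the sketch below greedily matches predictions to ground-truth boxes and computes F1 from the resulting counts. The matching strategy, the function names, and the 0.25 threshold in the example are assumptions for illustration, not the authors' evaluation code.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_counts(preds, truths, iou_threshold):
    """Greedy one-to-one matching; returns (true pos, false pos, false neg)."""
    unmatched = list(truths)
    tp = 0
    for pred in preds:
        best = max(unmatched, key=lambda t: iou(pred, t), default=None)
        if best is not None and iou(pred, best) >= iou_threshold:
            unmatched.remove(best)  # each ground-truth box is matched at most once
            tp += 1
    return tp, len(preds) - tp, len(unmatched)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 when there are no true positives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


preds = [(0, 0, 10, 10), (100, 100, 110, 110)]
truths = [(1, 1, 11, 11), (200, 200, 210, 210)]
counts = match_counts(preds, truths, iou_threshold=0.25)  # illustrative threshold
print(counts, f1_score(*counts))
```

A lower IoU threshold is more forgiving for small objects, where a shift of a pixel or two already changes the overlap ratio substantially.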

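The authors found that highways are sometimes mistaken for airport runways, and that airports are far larger than the other objects. They therefore use two models: one that detects airports and one that detects everything other than airports.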

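Cutouts at different resolutions are shown below; note that 0.15 m GSD is a higher resolution than 0.30 m GSD.

The authors train at 0.30 m GSD and test at a range of resolutions; the results are as follows.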

The right-hand vertical axis is $F_{c}$, where $F_{c} = N_{predicted} / N_{truth}$.

The authors also tried training and testing at different resolutions. At each of the thirteen resolutions we evaluate test scenes with a unique model trained at that resolution. The results are as follows:

The bottom x-axis is the GSD, and the top x-axis is the size of a car (in pixels) at that GSD. For example, at 0.15 m GSD a car is about 20 pixels in size.
