【Mask RCNN】《Mask R-CNN》
ICCV-2017
1 Motivation
Object detection and semantic segmentation have advanced quickly thanks to strong baseline systems, Faster R-CNN and FCN respectively. Our goal in this work is to develop a comparably enabling (broadly applicable) framework for instance segmentation.
Instance segmentation resembles object detection in that both must distinguish different individuals of the same class.
2 Innovation
Extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. One model serves three purposes: instance segmentation, bounding-box object detection, and person keypoint detection.
Proposes RoIAlign to make up for the pixel-to-pixel alignment between network inputs and outputs that Faster R-CNN lacks and instance segmentation needs.
3 Advantages
Instance segmentation, bounding-box object detection, and person keypoint detection are handled by one framework, and it outperforms the individual winner of each task (COCO 2016).
4 Methods
4.1 Head Architecture
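The structure on the left (the ResNet C4 head) is not ideal, as the R-FCN paper points out at the outset (this creates a deeper RoI-wise subnetwork that improves accuracy, at the cost of lower speed due to the unshared per-RoI computation; it feels like R-CNN, where every proposal is processed separately after it is generated). The authors likewise recommend the structure on the right, the FPN head (we do not recommend using the C4 variant in practice).
Faster R-CNN has two outputs for each candidate object, a class label and a bounding-box offset. The authors add a third branch that makes the network output an object mask. This third branch, however, requires extraction of a much finer spatial layout of an object.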
Mask R-CNN also outputs a binary mask for each RoI
4.2 RoI Align
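Segmentation works at the pixel level, but RoI pooling in Faster R-CNN performs two quantization steps that break alignment: the first when the RoI is mapped onto the feature map, the second when the RoI is divided into bins during RoI pooling.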
Quantization, e.g. rounding a continuous coordinate x to [x/16] for a feature-map stride of 16, works as follows:
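As a small numeric illustration of the first quantization step, here is a minimal sketch. The helper names `quantize_coord` and `misalignment_px` are made up for this post, and the stride of 16 and round-to-nearest behaviour are assumptions taken from the [x/16] description above, not code from the paper.

```python
import math

def quantize_coord(x, stride=16):
    """Map an image coordinate to a feature-map index by rounding (RoIPool style)."""
    return int(math.floor(x / stride + 0.5))

def misalignment_px(x, stride=16):
    """Shift, measured in image pixels, introduced by rounding x / stride."""
    exact = x / stride                      # continuous feature-map coordinate
    return abs(quantize_coord(x, stride) - exact) * stride

# An RoI spanning x = 100 .. 300 in the image:
left, right = quantize_coord(100), quantize_coord(300)   # 6.25 -> 6, 18.75 -> 19
shift_left, shift_right = misalignment_px(100), misalignment_px(300)
```

A quarter of a feature-map cell looks harmless, but at stride 16 it is a 4-pixel shift in the image, which matters for a pixel-level mask.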
The differences between RoI pooling, RoIWarp, and RoIAlign are as follows:
RoIAlign is explained in the middle part of the figure below.
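Take four sampling points in each bin of the unquantized RoI, compute the value at each point by bilinear interpolation (from the four surrounding feature-map pixels), then sum the four values and average them.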
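The procedure above can be sketched in plain Python. This is an illustrative sketch, not the authors' implementation: the function names, the single-channel list-of-lists feature map, and the convention that integer coordinates sit on pixel centers are all assumptions made for the example.

```python
import math

def bilinear(feat, y, x):
    """Bilinearly interpolate a 2-D feature map at a continuous point (y, x)."""
    h, w = len(feat), len(feat[0])
    y = min(max(y, 0.0), h - 1.0)           # clamp to the valid range
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feat[y0][x0] * (1 - dx) + feat[y0][x1] * dx
    bottom = feat[y1][x0] * (1 - dx) + feat[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def roi_align(feat, roi, out_size=2, samples=2):
    """Average-pool an RoI given as (y1, x1, y2, x2) in feature-map
    coordinates; the coordinates are never rounded."""
    y1, x1, y2, x2 = roi
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    out = []
    for by in range(out_size):
        row = []
        for bx in range(out_size):
            values = []
            for sy in range(samples):       # samples x samples = 4 points per bin
                for sx in range(samples):
                    y = y1 + (by + (sy + 0.5) / samples) * bin_h
                    x = x1 + (bx + (sx + 0.5) / samples) * bin_w
                    values.append(bilinear(feat, y, x))
            row.append(sum(values) / len(values))
        out.append(row)
    return out

# A linear ramp makes the result easy to check by hand.
feat = [[x + 10 * y for x in range(5)] for y in range(5)]
pooled = roi_align(feat, (0.5, 0.5, 2.5, 2.5))
```

Because neither the RoI boundaries nor the bin boundaries are rounded, a fractional RoI such as (0.5, 0.5, 2.5, 2.5) is sampled exactly where it lies.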
4.3 Train
Multi-task loss: $L = L_{cls} + L_{box} + L_{mask}$, where $L_{mask}$ is defined only on positive RoIs
The mask branch has a $Km^2$-dimensional output for each RoI: K classes at m×m resolution
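To make the shape of the mask output concrete, here is a minimal sketch of the mask loss for one positive RoI, assuming the paper's per-pixel sigmoid with average binary cross-entropy applied to the ground-truth class's mask only. The function name `mask_loss` and the nested-list layout (K masks of m×m logits) are assumptions for illustration.

```python
import math

def mask_loss(mask_logits, gt_mask, k):
    """Average binary cross-entropy on the k-th of K predicted masks.

    mask_logits: K masks, each an m x m list of raw logits
    gt_mask:     m x m list of 0/1 targets
    k:           ground-truth class index of this positive RoI
    """
    pred = mask_logits[k]                   # the other K - 1 masks are ignored
    m = len(pred)
    total = 0.0
    for i in range(m):
        for j in range(m):
            z, t = pred[i][j], gt_mask[i][j]
            # numerically stable form of -[t*log(s) + (1-t)*log(1-s)], s = sigmoid(z)
            total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return total / (m * m)

K, m = 2, 2
logits = [[[0.0] * m for _ in range(m)] for _ in range(K)]
gt = [[1, 0], [0, 1]]
loss_before = mask_loss(logits, gt, k=0)

# Changing another class's mask leaves the loss for class 0 untouched.
logits[1] = [[50.0] * m for _ in range(m)]
loss_after = mask_loss(logits, gt, k=0)
```

Only the k-th mask contributes, so the masks of different classes do not compete with each other.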
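- An RoI is positive if its IoU with a ground-truth box is at least 0.5, and negative otherwise
- As in Fast R-CNN, training uses image-centric sampling rather than RoI-centric sampling
- RoI-centric sampling: sample uniformly from all RoIs of all images, so each SGD mini-batch contains samples from many different images (used by SPPnet)
- Image-centric sampling (the solution): mini-batches are sampled hierarchically, first images and then RoIs within them, so RoIs from the same image share computation and memory
- Each mini-batch has 2 images per GPU and each image has N sampled RoIs, with positive:negative = 1:3; N = 64 for the C4 backbone and 512 for FPN (see Figure 3)
- RPN anchors span 5 scales and 3 aspect ratios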
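The labelling and sampling rules in the list above can be sketched as follows for a single image. This is a hedged illustration: `box_iou` and `sample_rois` are hypothetical names, boxes are assumed to be (x1, y1, x2, y2) tuples, and filling the batch with extra negatives when positives are scarce is an assumption, not something these notes state.

```python
import random

def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def sample_rois(rois, gt_boxes, n=64, pos_fraction=0.25, seed=0):
    """Sample up to n RoIs from one image with positive:negative = 1:3."""
    rng = random.Random(seed)
    positives, negatives = [], []
    for roi in rois:
        best = max((box_iou(roi, gt) for gt in gt_boxes), default=0.0)
        (positives if best >= 0.5 else negatives).append(roi)
    n_pos = min(len(positives), int(n * pos_fraction))
    n_neg = min(len(negatives), n - n_pos)  # assumption: negatives fill the rest
    return rng.sample(positives, n_pos), rng.sample(negatives, n_neg)
```

With n = 64 and enough candidates of both kinds, this yields 16 positives and 48 negatives per image.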
4.4 Inference
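- The number of proposals is 300 for C4 and 1000 for FPN; they are fed to the box prediction branch, followed by NMS
- The mask branch is applied to the highest-scoring 100 detection boxes; this differs from the parallel computation used in training, but it speeds up inference
- The mask branch can predict K masks per RoI, but only the k-th mask is used, where k is the class predicted by the classification branch
- The mask is resized to the RoI size and binarized at a threshold of 0.5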
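The last two steps (pick the k-th mask, resize it to the RoI, threshold at 0.5) can be sketched like this. The names `resize_bilinear` and `predict_binary_mask` are hypothetical, the masks are assumed to already hold sigmoid probabilities, and the use of bilinear resizing is an assumption since these notes do not say which interpolation is used.

```python
def resize_bilinear(mask, out_h, out_w):
    """Resize a 2-D list of probabilities to out_h x out_w."""
    in_h, in_w = len(mask), len(mask[0])
    out = []
    for i in range(out_h):
        # map the output pixel center back into input coordinates
        y = min(max((i + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1.0)
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        dy = y - y0
        row = []
        for j in range(out_w):
            x = min(max((j + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1.0)
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            dx = x - x0
            top = mask[y0][x0] * (1 - dx) + mask[y0][x1] * dx
            bottom = mask[y1][x0] * (1 - dx) + mask[y1][x1] * dx
            row.append(top * (1 - dy) + bottom * dy)
        out.append(row)
    return out

def predict_binary_mask(mask_probs, k, roi_h, roi_w, threshold=0.5):
    """Keep only the mask of predicted class k, resize to the RoI, binarize."""
    resized = resize_bilinear(mask_probs[k], roi_h, roi_w)
    return [[1 if p >= threshold else 0 for p in row] for row in resized]

# K = 2 classes, m = 2; the classification branch predicted class 1.
mask_probs = [
    [[0.9, 0.9], [0.9, 0.9]],   # class 0 mask (ignored here)
    [[0.9, 0.1], [0.9, 0.1]],   # class 1 mask
]
binary = predict_binary_mask(mask_probs, k=1, roi_h=2, roi_w=4)
```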
5 Experiments: Instance Segmentation
AP is evaluated using mask IoU
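Mask IoU is the same ratio as box IoU, only counted over pixels. A minimal sketch for two binary masks of equal size follows; the function name `mask_iou` and the nested-list representation are choices made for this example.

```python
def mask_iou(mask_a, mask_b):
    """Intersection over union of two equally sized binary masks."""
    inter = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    return inter / union if union else 0.0

iou = mask_iou([[1, 1], [0, 0]], [[1, 0], [1, 0]])   # 1 shared pixel out of 3
```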
5.1 Main Results
Outperforms the winners of the COCO 2015 and 2016 instance segmentation challenges
Mask R-CNN vs. FCIS: FCIS exhibits systematic artifacts on overlapping instances, whereas Mask R-CNN does not.
5.2 Ablation Experiments
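- Backbone: benefits from depth (50 vs. 101), FPN, and ResNeXt (Table 2a)
- Multinomial vs. Independent Masks: simply put, softmax vs. sigmoid. The sigmoid is applied per class, so each class gets its own independent binary mask and classes do not compete; the per-pixel softmax with multinomial logistic loss makes the classes compete at every pixel, which couples mask prediction to classification (Table 2b, c)
- RoIAlign: insensitive to max vs. average pooling, so the authors use average pooling throughout; it improves clearly over RoI pooling (Table 2c, d). The backbone in (c) is ResNet-50-C4 with stride 16, while (d) uses ResNet-50-C5 with stride 32; (d) beats (c), AP 30.9 vs. 30.3, and FPN improves the results further.
- Mask branch: FCN is better than MLP (fully connected layers)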
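The difference between independent and multinomial masks shows up at a single pixel. The sketch below uses made-up logits for K = 2 classes and hypothetical helper names; it only illustrates the competition effect and does not reproduce the ablation.

```python
import math

def per_class_sigmoid(logits):
    """Independent masks: one binary probability per class at this pixel."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

def per_pixel_softmax(logits):
    """Multinomial masks: the classes share one distribution at this pixel."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

sig_a = per_class_sigmoid([2.0, 2.0])
soft_a = per_pixel_softmax([2.0, 2.0])

# Raise only the second class's logit.
sig_b = per_class_sigmoid([2.0, 6.0])
soft_b = per_pixel_softmax([2.0, 6.0])
```

With the sigmoid, the first class keeps the same probability in both cases; with the softmax, its probability drops as soon as the other class's logit rises.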
5.3 Bounding Box Detection Results
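Note that the box AP gap between the model trained without the mask branch and the full Mask R-CNN is due solely to the benefits of multi-task training.
In Table 1, the instance segmentation (mask) AP is 37.1, against a box AP of 39.8 in Table 3, a gap of 2.7 points.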
This indicates that our approach largely closes the gap between object detection and the more challenging instance segmentation task.
5.4 Timing
our design is not optimized for speed
This post does not cover "Mask R-CNN for Human Pose Estimation" or "Experiments on Cityscapes" (instance segmentation); interested readers can consult the original paper.