
Stereo Matching with CNNs, Part 1: [LW-CNN] Look Wider to Match Image Patches with Convolutional Neural Networks

Abstract

Published in IEEE Signal Processing Letters, 2016; at the time of writing, ranked second on the Middlebury benchmark.

  • Proposes a new CNN module that learns a matching cost from a large window. Unlike conventional (strided) pooling layers, the proposed per-pixel pyramid-pooling layer can cover a larger area without loss of resolution or detail, so the learned cost function can exploit information from a wider region while avoiding the fattening effect.
  • Novelty: similar to SPP, pooling is applied at multiple scales and the results are fused, yielding feature maps that do not lose fine detail.
  • Improvement over MC-CNN: with the 4P module, a much larger window is covered, which helps in weakly textured regions.

1 Introduction

One way to handle the unreliability of window-based matching near disparity discontinuities is to make the matcher adaptive to its input patterns [10], [11], [12]: the shape of the matching template is made adaptive so that it can discard information from pixels irrelevant to the target pixel.
However, knowing which pixels belong to the background before the actual matching is difficult.
Existing methods are based on AlexNet or VGG, architectures designed for recognition rather than for matching. The difficulty with such CNNs lies in enlarging the patch size.
The effective patch size is tied directly to the spatial extent of the receptive field, which can be enlarged by:
1) including a few strided pooling/convolution layers;
2) using larger convolution kernels in each layer;
3) adding more layers.
However, strided pooling or strided convolution downsamples the result and discards detail. Although the resolution can be recovered by applying fractionally strided convolution [17], small or thin structures are hard to reconstruct once they have been lost to downsampling.
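The trade-off among the three options above can be made concrete with a small receptive-field calculator (an illustrative sketch, not code from the paper):

```python
def receptive_field(layers):
    """Receptive field of a stack of conv/pool layers.

    `layers` is a list of (kernel_size, stride) pairs, applied in order.
    `jump` tracks the spacing (in input pixels) between adjacent output
    pixels; each kernel adds (k - 1) * jump input pixels to the field.
    """
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Three 3x3 stride-1 convolutions: the receptive field grows slowly.
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7
# Inserting one stride-2 pooling layer doubles the growth afterwards,
# but it also halves the output resolution.
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8
```

This is why option 1) is cheap but lossy: a stride multiplies the jump, so later kernels see a wider area at the price of a downsampled output.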

2 Related work

On learning the matching cost: [13], [14], [22].
[13] MC-CNN: an 11×11 window with no pooling; the resulting cost is fairly noisy, so cross-based cost aggregation and SGM are applied afterward.
[14] "Learning to compare image patches": uses multiple pooling layers and spatial pyramid pooling (SPP) [24] to process larger patches,
but this introduces the fattening effect, caused by the information lost during pooling.
This paper's contribution: a new pooling scheme that covers a larger receptive field without discarding detail.
Similar attempts exist in semantic segmentation [25], [26], [27]: these methods combine high-level and low-level information so that object-level cues can be localized to pixel level.
They work well on large objects but fail on small ones.
FlowNet [28] upsamples the low-resolution flow back to the original size.
The closest work is [24] (He et al.'s SPP).
SPP removes the pooling layers between convolution layers and instead pools the output of the cascaded convolution layers, so high-level and mid-level information is used to compute highly nonlinear feature maps.
Although [14] also uses SPP, it keeps pooling layers sandwiched between convolution layers and therefore still loses information.

3 Method

Input: two patches
Output: a matching cost
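The paper trains a CNN to produce this cost; as a minimal stand-in that only illustrates the two-patches-in, one-scalar-out contract, a classical hand-crafted cost such as negative normalized cross-correlation (not the paper's method) can be used:

```python
import numpy as np

def matching_cost(patch_left, patch_right):
    """Toy matching cost: negative normalized cross-correlation (NCC).

    Lower is better; identical patches give -1. The paper replaces this
    hand-crafted function with a learned CNN, but the interface is the
    same: two equal-size patches in, one scalar cost out.
    """
    a = patch_left - patch_left.mean()
    b = patch_right - patch_right.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum()) + 1e-12
    return -float((a * b).sum() / denom)
```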

A. Per-pixel Pyramid Pooling (4P)

The role of pooling layers is well known: they shrink the map size exponentially. The drawback is that detail is lost in exchange for a larger receptive field.

Instead, a large pooling window replaces a small strided one, achieving the same receptive field without downsampling.
Pooling is performed at several window sizes, and the outputs are concatenated into new feature maps.
Note that this multi-scale pooling operation is applied at every pixel, with stride = 1!
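The 4P layer can be sketched in NumPy as follows (a naive loop-based illustration; the −inf border padding is an assumption of this sketch, not a detail from the paper):

```python
import numpy as np

def pool_stride1(fmap, s):
    """Max-pool a (H, W) map with an s x s window and stride 1.

    Borders are padded with -inf so the output keeps the input
    resolution (the padding choice is an assumption of this sketch).
    """
    h, w = fmap.shape
    p = s // 2  # s is assumed odd, as in the paper's sizes [27, 9, 3, 1]
    padded = np.pad(fmap, p, mode="constant", constant_values=-np.inf)
    out = np.empty_like(fmap)
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + s, j:j + s].max()
    return out

def per_pixel_pyramid_pool(fmap, sizes=(27, 9, 3, 1)):
    """4P layer sketch: pool at several window sizes, all with stride 1,
    and stack the results as channels -- a larger receptive field with
    no loss of resolution."""
    return np.stack([pool_stride1(fmap, s) for s in sizes], axis=0)
```

Because every scale keeps stride 1, each output pixel still corresponds to exactly one input pixel; only the context it summarizes grows with the window size.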


B. Proposed model

Pooling window sizes: s = [27, 9, 3, 1].
The baseline chosen for comparison is MC-CNN.

4 Experiments

The MC-CNN pipeline is used, with the following differences:
1) patch size: 37×37;
2) only the last three 1×1 convolution layers are fine-tuned, which works better than random initialization;
3) learning rate: 0.003 → 0.0003;
4) post-processing is identical to MC-CNN.

5 Open questions

How to keep decisions accurate near disparity discontinuities while using a large window. (To be revisited.)

References

[10] K. Wang, “Adaptive stereo matching algorithm based on edge detection,” in ICIP, vol. 2. IEEE, 2004, pp. 1345–1348.
[11] K.-J. Yoon and I. S. Kweon, “Adaptive support-weight approach for correspondence search,” PAMI, vol. 28, no. 4, pp. 650–656, 2006.
[12] F. Tombari, S. Mattoccia, L. D. Stefano, and E. Addimanda, “Classification and evaluation of cost aggregation methods for stereo correspondence,” in CVPR. IEEE, 2008, pp. 1–8.
[13] J. Žbontar and Y. LeCun, “Stereo matching by training a convolutional neural network to compare image patches,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 2287–2318, 2016.
[14] S. Zagoruyko and N. Komodakis, “Learning to compare image patches via convolutional neural networks,” in CVPR, June 2015, pp. 4353–4361.
[17] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.
[18] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Object detectors emerge in deep scene cnns,” arXiv preprint arXiv:1412.6856, 2014.
[22] L. Ladický, C. Häne, and M. Pollefeys, “Learning the matching function,” arXiv preprint arXiv:1502.00652, 2015.
[24] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” in ECCV. Springer, 2014, pp. 346–361.
[25] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in CVPR, 2015, pp. 3431–3440.
[26] B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik, “Hypercolumns for object segmentation and fine-grained localization,” in CVPR, June 2015, pp. 447–456.
[27] H. Noh, S. Hong, and B. Han, “Learning deconvolution network for semantic segmentation,” arXiv preprint arXiv:1505.04366, 2015.