
Paper Study: Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation

On March 13, 2018, Google open-sourced the semantic image segmentation model DeepLab-v3+.

GitHub repository: https://github.com/tensorflow/models/tree/master/research/deeplab

Paper link: https://arxiv.org/abs/1802.02611

===============================================================

Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation


Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam
Google Inc.
{lcchen, yukun, gpapan, fschroff, hadam}@google.com

Abstract. Spatial pyramid pooling modules and encoder-decoder structures are both used in deep neural networks for semantic segmentation. The former networks encode multi-scale contextual information by probing the incoming features with filters or pooling operations at multiple rates and multiple effective fields-of-view, while the latter networks capture sharper object boundaries by gradually recovering the spatial information. In this work, we propose to combine the advantages of both methods. Specifically, our proposed model, DeepLabv3+, extends DeepLabv3 by adding a simple yet effective decoder module to refine the segmentation results, especially along object boundaries. We further explore the Xception model and apply the depthwise separable convolution to both the Atrous Spatial Pyramid Pooling and decoder modules, resulting in a faster and stronger encoder-decoder network. We demonstrate the effectiveness of the proposed model on the PASCAL VOC 2012 and Cityscapes datasets, achieving test set performance of 89.0% and 82.1% respectively without any post-processing. Our paper is accompanied by a publicly available reference implementation of the proposed models in TensorFlow at https://github.com/tensorflow/models/tree/master/research/deeplab.

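The building block named in the title, atrous separable convolution, factors a convolution into a depthwise step (one filter per input channel, with an adjustable atrous/dilation rate) followed by a pointwise 1x1 step that mixes channels. Below is a minimal sketch of the idea in tf.keras; the helper name `atrous_separable_conv` and the exact BN/ReLU arrangement are illustrative assumptions, not the repository's actual code.

```python
import tensorflow as tf

def atrous_separable_conv(x, filters, kernel_size=3, rate=2):
    # Depthwise step: one k x k filter per input channel, with "holes"
    # between filter taps controlled by the dilation rate.
    x = tf.keras.layers.DepthwiseConv2D(
        kernel_size, dilation_rate=rate, padding="same", use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    # Pointwise step: 1x1 convolution combining the per-channel outputs.
    x = tf.keras.layers.Conv2D(filters, 1, use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    return tf.keras.layers.ReLU()(x)

# A k x k kernel at rate r has an effective field of view of
# k + (k - 1) * (r - 1), so a 3x3 kernel at rate 2 covers a 5x5 window
# at the compute cost of a 3x3 kernel.
inputs = tf.keras.Input(shape=(64, 64, 256))
outputs = atrous_separable_conv(inputs, filters=256, rate=2)
```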

 

1    Introduction

Semantic segmentation, with the goal of assigning semantic labels to every pixel in an image [1,2,3,4,5], is one of the fundamental topics in computer vision. Deep convolutional neural networks [6,7,8,9,10] based on the Fully Convolutional Neural Network [8,11] show striking improvement over systems relying on hand-crafted features [12,13,14,15,16,17] on benchmark tasks. In this work, we consider two types of neural networks that use a spatial pyramid pooling module [18,19,20] or an encoder-decoder structure [21,22] for semantic segmentation, where the former captures rich contextual information by pooling features at different resolutions while the latter is able to obtain sharp object boundaries, as sketched below.

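To make the second family concrete: a decoder typically upsamples the coarse, semantically rich encoder output and fuses it with an earlier high-resolution feature map before predicting per-pixel labels. The following is a generic sketch of such a refinement step in tf.keras; the function name `decoder_refine`, the channel widths, and the fixed 4x upsampling factor are illustrative assumptions, not the paper's exact decoder.

```python
import tensorflow as tf

def decoder_refine(coarse, low_level, num_classes=21):
    # Compress the low-level features so they do not overwhelm the
    # semantically rich (but spatially coarse) encoder output.
    low = tf.keras.layers.Conv2D(48, 1, padding="same",
                                 activation="relu")(low_level)
    # Bilinearly upsample the coarse features to the low-level resolution
    # (assumes low_level is exactly 4x larger spatially).
    up = tf.keras.layers.UpSampling2D(size=4, interpolation="bilinear")(coarse)
    x = tf.keras.layers.Concatenate()([up, low])
    # Refine the fused features, then predict per-pixel class logits.
    x = tf.keras.layers.Conv2D(256, 3, padding="same", activation="relu")(x)
    return tf.keras.layers.Conv2D(num_classes, 1, padding="same")(x)

# Example shapes: encoder output at output stride 16, low-level
# features at output stride 4, for a 512x512 input.
coarse = tf.keras.Input(shape=(32, 32, 256))
low_level = tf.keras.Input(shape=(128, 128, 256))
logits = decoder_refine(coarse, low_level)
```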

In order to capture contextual information at multiple scales, DeepLabv3 [23] applies several parallel atrous convolutions with different rates (called Atrous Spatial Pyramid Pooling, or ASPP), while PSPNet [24] performs pooling operations at different grid scales. Even though rich semantic information is encoded in the last feature map, detailed information related to object boundaries is missing due to the pooling or strided convolutions within the network backbone. This can be alleviated by applying atrous convolution to extract denser feature maps. However, given the design of state-of-the-art neural networks [7,9,10,25,26] and limited GPU memory, it is computationally prohibitive to extract output feature maps that are 8, or even 4, times smaller than the input resolution. Taking ResNet-101 [25] for example, when applying atrous convolution to extract output features that are 16 times smaller than the input resolution, features within the last 3 residual blocks (9 layers) have to be dilated. Even worse, 26 residual blocks (78 layers!) will be affected if output features 8 times smaller than the input are desired. Thus, it is computationally intensive to extract denser output features for this type of model. On the other hand, encoder-decoder models [21,22] lend themselves to faster computation in the encoder path (since no features are dilated) and gradually recover sharp object boundaries in the decoder path. Attempting to combine the advantages of both methods, we propose to enrich the encoder module in encoder-decoder networks by incorporating multi-scale contextual information.

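As a concrete picture of "several parallel atrous convolutions with different rates", below is a minimal ASPP-style module sketch in tf.keras: a 1x1 branch, parallel 3x3 atrous branches, and image-level pooling, concatenated and projected. The helper name `aspp` and layer details are illustrative assumptions; the rates (6, 12, 18) follow the setting commonly used at output stride 16, not necessarily the released code.

```python
import tensorflow as tf

def aspp(x, filters=256, rates=(6, 12, 18)):
    size = tf.shape(x)[1:3]  # spatial size, for upsampling the pooled branch
    # 1x1 branch plus one 3x3 atrous branch per rate.
    branches = [tf.keras.layers.Conv2D(filters, 1, padding="same",
                                       activation="relu")(x)]
    for r in rates:
        branches.append(tf.keras.layers.Conv2D(
            filters, 3, padding="same", dilation_rate=r, activation="relu")(x))
    # Image-level features: global average pool -> 1x1 conv -> resize back.
    pooled = tf.reduce_mean(x, axis=[1, 2], keepdims=True)
    pooled = tf.keras.layers.Conv2D(filters, 1, activation="relu")(pooled)
    pooled = tf.image.resize(pooled, size)  # bilinear by default
    branches.append(pooled)
    # Concatenate all branches and project back to `filters` channels.
    x = tf.keras.layers.Concatenate()(branches)
    return tf.keras.layers.Conv2D(filters, 1, padding="same",
                                  activation="relu")(x)

# Example: backbone features at output stride 16 for a 512x512 input
# are 32x32, so each atrous branch sees context at several scales
# without any further downsampling.
features = tf.keras.Input(shape=(32, 32, 2048))
context = aspp(features)
```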