
【論文翻譯】中英對照翻譯--(Attentive Generative Adversarial Network for Raindrop Removal from A Single Image)

【開始時間】2018.10.08

【完成時間】2018.10.09

【論文翻譯】Attentive GAN論文中英對照翻譯--(Attentive Generative Adversarial Network for Raindrop Removal from A Single Image)

【中文譯名】單幅影象中雨滴去除的專注生成對抗性網路

【論文連結】https://arxiv.org/abs/1711.10098

 

 

【補充】

1)論文的發表時間是:6 May 2018,是在CVPR2018上發表的論文

2)文章解讀可參考:https://blog.csdn.net/gentlelu/article/details/80672490

【宣告】本文是本人根據原論文進行翻譯,有些地方加上了自己的理解,有些專有名詞用了最常用的譯法,時間匆忙,如有遺漏及錯誤,望各位包涵

                                                         題目:單幅影象中雨滴去除的專注生成對抗性網路

Abstract(摘要)

        Raindrops adhered to a glass window or camera lens can severely hamper the visibility of a background scene and degrade an image considerably. In this paper, we address the problem by visually removing raindrops, and thus transforming a raindrop degraded image into a clean one. The problem is intractable, since first the regions occluded by raindrops are not given. Second, the information about the background scene of the occluded regions is completely lost for most part. To resolve the problem, we apply an attentive generative network using adversarial training. Our main idea is to inject visual attention into both the generative and discriminative networks. During the training, our visual attention learns about raindrop regions and their surroundings. Hence, by injecting this information, the generative network will pay more attention to the raindrop regions and the surrounding structures, and the discriminative network will be able to assess the local consistency of the restored regions. This injection of visual attention to both generative and discriminative networks is the main contribution of this paper. Our experiments show the effectiveness of our approach, which outperforms the state of the art methods quantitatively and qualitatively.

      雨滴粘附在玻璃窗或相機鏡頭上會嚴重影響背景影象的可見性,並大大降低影象的質量。在本文中,我們通過視覺上去除雨滴解決了這個問題,將雨滴退化影象轉化為清晰的影象。這個問題很難解決,因為第一沒有給出雨滴遮擋的區域。第二,遮擋區域的背景場景資訊大部分是完全丟失的。為了解決這一問題,我們利用對抗性訓練建立了一個專注生成網路( an attentive generative network)。我們的主要想法是將視覺注意力(visual attention)注入到生成性和判別性的網路中。在訓練中,我們的視覺注意力學習雨滴區域及其周圍環境。因此,通過注入這些資訊,生成網路將更多地關注雨滴區域和周圍的結構,而判別網路將能夠評估恢復區域的區域性一致性(the local consistency)。將視覺注意力注入生成網路和區分網路是本文的主要貢獻。實驗證明了該方法的有效性,在數量和質量上都優於先進的方法。

 

1. Introduction(介紹)

     Raindrops attached to a glass window, windscreen or lens can hamper the visibility of a background scene and degrade an image. Principally, the degradation occurs because raindrop regions contain different imageries from those without raindrops. Unlike non-raindrop regions, raindrop regions are formed by rays of reflected light from a wider environment, due to the shape of raindrops, which is similar to that of a fish-eye lens. Moreover, in most cases, the focus of the camera is on the background scene, making the appearance of raindrops blur.

    附著在玻璃視窗、擋風玻璃或鏡頭上的雨滴會阻礙背景場景的可見度並降低影象的質量。這主要是由於雨滴區域與沒有雨滴的區域包含不同的影象資訊。與非雨滴區域不同,雨滴區域是由來自更廣闊環境的反射光形成的,因為雨滴的形狀類似於魚眼透鏡的形狀。此外,在大多數情況下,相機的焦點是在背景場景,這使得雨滴的外觀模糊。

 

   In this paper, we address this visibility degradation problem. Given an image impaired by raindrops, our goal is to remove the raindrops and produce a clean background as shown in Fig. 1. Our method is fully automatic. We consider that it will benefit image processing and computer vision applications, particularly for those suffering from raindrops, dirt, or similar artifacts.

   在本文中,我們解決了可見性退化問題。給出一個被雨滴破壞的影象,我們的目標是移除雨滴,併產生一個乾淨的背景,如圖1所示。我們的方法是全自動的。我們認為,這將有利於影象處理和計算機視覺應用,特別是對於那些遭受雨滴,汙垢或類似的人為影響的圖片而言。

圖1.演示我們的雨滴清除方法。左:輸入因雨滴而退化的影象。右:我們的結果,大部分雨滴被移除,結構細節被恢復。放大影象將提供一個更好的恢復質量。

 

    A few methods have been proposed to tackle the raindrop detection and removal problems. Methods such as  [17, 18, 12] are dedicated to detecting raindrops but not removing them. Other methods are introduced to detect and remove raindrops using stereo [20], video [22, 25], or specifically designed optical shutter [6], and thus are not applicable for a single input image taken by a normal camera. A method by Eigen et al. [1] has a similar setup to ours. It attempts to remove raindrops or dirt using a single image via deep learning method. However, it can only handle small raindrops, and produce blurry outputs [25]. In our experimental results (Sec. 6), we will find that the method fails to handle relatively large and dense raindrops.

    針對雨滴檢測和排除問題,之前已經提出了幾種解決方法。諸如[17,18,12]等方法專門用於檢測雨滴,但不去除雨滴。其他方法採用 stereo(立體)[20]、視訊[22、25]或專門設計的光學快門[6]探測和去除雨滴,因此不適用於由普通照相機拍攝的單個輸入影象。艾根等人提出的一種方法[1],與我們的相似。它試圖通過深度學習的方法,用一幅影象去除雨滴或汙垢。然而,它只能處理微小的雨滴,併產生模糊的輸出[25]。在我們的實驗結果中(第6節),我們會發現,這種方法不能處理較大和密集的雨滴。

 

     In contrast to [1], we intend to deal with substantial presence of raindrops, like the ones shown in Fig. 1. Generally, the raindrop-removal problem is intractable, since first the regions which are occluded by raindrops are not given. Second, the information about the background scene of the occluded regions is completely lost for most part. The problem gets worse when the raindrops are relatively large and distributed densely across the input image. To resolve the problem, we use a generative adversarial network, where our generated outputs will be assessed by our discriminative network to ensure that our outputs look like real images. To deal with the complexity of the problem, our generative network first attempts to produce an attention map. This attention map is the most critical part of our network, since it will guide the next process in the generative network to focus on raindrop regions. This map is produced by a recurrent network consisting of deep residual networks (ResNets) [8] combined with a convolutional LSTM [21] and a few standard convolutional layers. We call this attentive-recurrent network.

    與[1]相反,我們打算處理大量的雨滴,如圖1中所示。一般情況下,雨滴清除問題是難以解決的,因為第一沒有給出被雨滴遮擋的區域。第二,關於被遮擋區域的背景場景的資訊大部分是完全丟失的。當雨滴相對較大並且密集分佈在輸入影象上時,問題就變得更嚴重了。為了解決這個問題,我們使用了一個生成的對抗性網路,在該網路中,我們生成的輸出將由我們的區分網路( discriminative network)進行評估,以確保我們的輸出看起來像真實的影象。為了解決這個問題的複雜性,我們的生成網路首先嚐試製作一張注意力圖(attention map)。這張對映圖是我們網路中最關鍵的部分,因為它將引導生成網路的下一個過程集中在雨滴區域。該對映是由深度殘差網路(ResNet)[8]和卷積LSTM[21]以及幾個標準的卷積層組成的遞迴網路生成的。我們稱之為關注-迴圈網路( attentive-recurrent network)。

 

    The second part of our generative network is an autoencoder, which takes both the input image and the attention map as the input. To obtain wider contextual information, in the decoder side of the autoencoder, we apply multi-scale losses. Each of these losses compares the difference between the output of the convolutional layers and the cor- responding ground truth that has been downscaled accordingly. The input of the convolutional layers is the features from a decoder layer. Besides these losses, for the final output of the autoencoder, we apply a perceptual loss to obtain a more global similarity to the ground truth. This final output is also the output of our generative network.

    生成網路的第二部分是以輸入影象和注意力圖為輸入的自動編碼器。為了獲得更廣泛的上下文資訊,在自動編碼器的解碼器端,我們採用了多尺度損失(multi-scale losses)。這些損失中的每一個都比較了卷積層的輸出與已被相應縮小的地面真相(ground truth)之間的差異。卷積層的輸入是來自解碼器層的特徵。除了這些損失,對於自動編碼器的最終輸出,我們應用一個感知損失(perceptual loss),以獲得與地面真相更全域性的相似性。這個最終輸出也是我們生成網路的輸出。

    

    Having obtained the generative image output, our discriminative network will check if it is real enough. Like in a few inpainting methods (e.g. [9, 13]), our discriminative network validates the image both globally and locally. However, unlike the case of inpainting, in our problem and particularly in the testing stage, the target raindrop regions are not given. Thus, there is no information on the local regions that the discriminative network can focus on. To address this problem, we utilize our attention map to guide the discriminative network toward local target regions.

    在獲得生成影象輸出後,我們的判別網路將檢查它是否足夠真實。就像一些修復方法(例如,[9,13])一樣,我們的判別網路對影象進行全域性和區域性驗證。然而,與修復的情況不同,在我們的問題中,特別是在測試階段,沒有給出目標雨滴區域。因此,沒有關於區域性區域的資訊可供 判別網路關注。為了解決這一問題,我們利用我們的注意力圖來引導判別網路趨向於區域性目標區域。

 

    Overall, besides introducing a novel method of raindrop removal, our other main contribution is the injection of the attention map into both generative and discriminative networks, which is novel and works effectively in removing raindrops, as shown in our experiments in Sec. 6. We will release our code and dataset.

    總之,除了引入一種新的雨滴去除方法外,我們的另一個主要貢獻是將注意力圖注入生成網路和判別網路,這種做法新穎且能有效地去除雨滴,如第6節的實驗所示。我們將釋出程式碼和資料集。

 

   The rest of the paper is organized as follows. Section 2 discusses the related work in the fields of raindrop detection and removal, and in the fields of the CNN-based image inpainting. Section 3 explains the raindrop model in an image, which is the basis of our method. Section 4 describes our method, which is based on the generative adversarial network. Section 5 discusses how we obtain our synthetic and real images used for training our network. Section 6 shows our evaluations quantitatively and qualitatively. Finally, Section 7 concludes our paper.

      論文的其餘部分組織如下。第二節討論了雨滴檢測和去除領域以及基於CNN的影象修復領域的相關工作。第三節解釋了影象中的雨滴模型,這是我們方法的基礎。第四節介紹了基於生成對抗網路的方法。第五節討論了我們如何獲取用於訓練網路的合成影象和真實影象。第六節從數量和質量兩個方面展示了我們的評價。第七節總結了我們的論文。

 

2. Related Work(相關工作)

      There are a few papers dealing with bad weather visibility enhancement, which mostly tackle haze or fog (e.g. [19, 7, 16]), and rain streaks (e.g. [3, 2, 14, 24]). Unfortunately, we cannot apply these methods directly to raindrop removal, since the image formation and the constraints of raindrops attached to a glass window or lens are different from haze, fog, or rain streaks.

      有幾篇關於惡劣天氣可見性增強的論文,主要處理霧霾或霧(例如[19,7,16])以及雨紋(rain streaks)(例如[3,2,14,24])。不幸的是,我們不能直接應用這些方法來去除雨滴,因為附著在玻璃視窗或鏡片上的雨滴的成像和約束條件不同於霧霾、霧或雨紋。

 

    A number of methods have been proposed to detect raindrops. Kurihata et al.'s [12] learns the shape of raindrops using PCA, and attempts to match a region in the test image, with those of the learned raindrops. However, since raindrops are transparent and have various shapes, it is unclear how large the number of raindrops needs to be learned, how to guarantee that PCA can model the various appearance of raindrops, and how to prevent other regions locally similar to raindrops to be detected as raindrops. Roser and Geiger's [17] proposes a method that compares a synthetically generated raindrop with a patch that potentially has a raindrop. The synthetic raindrops are assumed to be a sphere section, and later assumed to be inclined sphere sections [18]. These assumptions might work in some cases, yet cannot be generalized to handle all raindrops, since raindrops can have various shapes and sizes.

    之前已經提出了許多檢測雨滴的方法。Kurihata等人[12]使用PCA來學習雨滴的形狀,並嘗試將測試影象中的一個區域與所學雨滴的區域相匹配。然而,由於雨滴是透明的,形狀各異,尚不清楚需要學習的雨滴數量有多大,如何保證PCA能夠模擬雨滴的各種外觀,以及如何防止區域性類似雨滴的其他區域被檢測為雨滴。Roser和Geiger[17]提出了一種方法,將一個合成生成的雨滴與可能含有雨滴的影象塊(patch)進行比較。合成的雨滴被假定為球面截面(a sphere section),後來又被假定為傾斜的球面截面[18]。這些假設在某些情況下可能是可行的,但卻不能推廣用於處理所有的雨滴,因為雨滴可以有不同的形狀和大小。

 

    Yamashita et al.’s [23] uses a stereo system to detect and remove raindrops. It detects raindrops by comparing the disparities measured by the stereo with the distance between the stereo cameras and glass surface. It then removes raindrops by replacing the raindrop regions with the textures of the corresponding image regions, assuming the other image does not have raindrops that occlude the same background scene. A similar method using an image sequence, instead of stereo, is proposed in Yamashita et al.’s [22]. Recently, You et al.’s [25] introduces a motion based method for detecting raindrops, and video completion to remove detected raindrops. While these methods work in removing raindrops to some extent, they cannot be applied directly to a single image.

   Yamashita等人[23]使用立體視覺系統(stereo system)來檢測和去除雨滴。它通過比較立體視覺測得的視差(disparities)與立體相機和玻璃表面之間的距離來檢測雨滴。然後,在假設另一幅影象中沒有遮擋相同背景場景的雨滴的前提下,將雨滴區域替換為對應影象區域的紋理,從而去除雨滴。Yamashita等人在[22]中提出了一種類似的方法,用影象序列代替立體視覺。最近,You等人在[25]中介紹了一種基於運動的雨滴檢測方法,並通過視訊補全(video completion)來去除檢測到的雨滴。雖然這些方法在一定程度上起到了去除雨滴的作用,但它們不能直接應用於單幅影象。

 

    Eigen et al.'s [1] tackles single-image raindrop removal, which to our knowledge, is the only method in the literature dedicated to the problem. The basic idea of the method is to train a convolutional neural network with pairs of raindrop-degraded images and the corresponding raindrop-free images. Its CNN consists of 3 layers, where each has 512 neurons. While the method works, particularly for relatively sparse and small droplets as well as dirt, it cannot produce clean results for large and dense raindrops. Moreover, the output images are somehow blur. We suspect that all these are due to the limited capacity of the network and the deficiency in providing enough constraints through its losses. Sec. 6 shows the comparison between our results with this method's.

     艾根(Eigen)等人的[1]解決了單影象雨滴去除問題,據我們所知,這是文獻中唯一專門針對該問題的方法。該方法的基本思想是使用雨滴退化影象和相應的無雨滴影象的影象對來訓練卷積神經網路。其CNN由3層組成,每層有512個神經元。雖然這種方法有效,特別是對於相對稀疏、較小的液滴以及汙垢,但它不能對大而密集的雨滴產生乾淨的結果。此外,輸出影象也有些模糊。我們懷疑,所有這些都是由於網路容量有限,以及通過損失函式提供的約束不足所致。第6節將我們的結果與該方法的結果進行了比較。

 

   In our method, we utilize a GAN [4] as the backbone of our network, which is recently popular in dealing with the image inpainting or completion problem (e.g. [9, 13]). Like in our method, [9] uses global and local assessment in its discriminative network. However, in contrast to our method, in the image inpainting, the target regions are given, so that the local assessment (whether local regions are sufficiently real) can be carried out. Hence, we cannot apply the existing image inpainting methods directly to our problem. Another similar architecture is Pix2Pix [10], which translates one image to another image. It proposes a conditional GAN that not only learns the mapping from input image to output image, but also learns a loss function to the train the mapping. This method is a general mapping, and not proposed specifically to handle raindrop removal. In Sec. 6, we will show some evaluations between our method and Pix2Pix.

    在我們的方法中,我們使用一個GAN[4]作為我們網路的骨幹,這是最近在處理影象修復或補全問題(例如[9,13])中流行的方法。與我們的方法一樣,[9]在其判別網路中使用了全域性和區域性評估。然而,與我們的方法不同,在影象修復中,目標區域是給定的,從而可以進行區域性評估(即區域性區域是否足夠真實)。因此,我們不能將現有的影象修復方法直接應用於我們的問題。另一個類似的架構是Pix2Pix[10],它將一個影象轉換成另一個影象。它提出了一種條件GAN,不僅學習從輸入影象到輸出影象的對映,還學習用於訓練該對映的損失函式。這種方法是一種通用的對映方法,並不是專門為處理雨滴去除而提出的。在第6節中,我們將展示我們的方法和Pix2Pix之間的一些評估。

 

3. Raindrop Image Formation(雨滴影象生成)

     We model a raindrop degraded image as the combination of a background image and effect of the raindrops:

    我們將雨滴退化影象建模為背景影象和雨滴效應的結合:
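【補充】原文此處的公式(1)在本頁未能顯示。以下是依據上下文中 I、M、B、R 與 ⊙ 的定義重構的雨滴成像模型,僅供參考:

$$ \mathbf{I} = (1 - \mathbf{M}) \odot \mathbf{B} + \mathbf{R} \tag{1} $$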

     where I is the colored input image and M is the binary mask. In the mask, M(x) = 1 means the pixel x is part of a raindrop region, and otherwise means it is part of background regions. B is the background image and R is the effect brought by the raindrops, representing the complex mixture of the background information and the light reflected by the environment and passing through the raindrops adhered to a lens or windscreen. Operator ⊙ means element-wise multiplication.

    式中,I是輸入彩色影象;M是二值掩碼(binary mask)。在掩碼中,M(x)=1表示畫素 x 是雨滴區域的一部分,否則認為是背景區域的一部分;B是背景影象,R是雨滴所帶來的影響,代表背景資訊與環境反射光穿過附著在鏡頭或擋風玻璃上的雨滴後所形成的複雜混合。運算子⊙代表按元素相乘。

 

   Raindrops are in fact transparent. However, due to their shapes and refractive index, a pixel in a raindrop region is not only influenced by one point in the real world but by the whole environment [25], making most part of raindrops seem to have their own imagery different from the background scene. Moreover, since our camera is assumed to focus on the background scene, this imagery inside a raindrop region is mostly blur. Some parts of the raindrops, particularly at the periphery and transparent regions, convey some information about the background. We notice that the information can be revealed and used by our network.

   雨滴實際上是透明的。然而,由於雨滴的形狀和折射率( refractive index),雨滴區域的畫素不僅受現實世界中某一點的影響,而且還受整個環境的影響[25],使得大部分雨滴似乎都有不同於背景場景的影象。此外,由於我們的相機被假定聚焦於背景場景,雨滴區域內的影象大多是模糊的。雨滴的某些部分,特別是在邊緣和透明區域,傳達了一些關於背景的資訊。我們注意到我們的網路可以顯示和使用這些資訊。

 

      Based on the model (Eq. (1)), our goal is to obtain the background image B from a given input I. To accomplish this, we create an attention map guided by the binary mask M. Note that, for our training data, as shown in Fig. 5, to obtain the mask we simply subtract the image degraded by raindrops I with its corresponding clean image B. We use a threshold to determine whether a pixel is part of a raindrop region. In practice, we set the threshold to 30 for all images in our training dataset. This simple thresholding is sufficient for our purpose of generating the attention map.

    基於模型(式(1)),我們的目標是從給定的輸入I中獲取背景影象B。為了實現這一點,我們建立了一個由二值掩碼M引導的注意力圖(attention map)。注意,對於我們的訓練資料,如圖5所示,為了獲得掩碼(mask),我們只需將雨滴退化影象I減去與其對應的乾淨影象B。我們使用閾值來確定畫素是否是雨滴區域的一部分。在實踐中,我們對訓練資料集中的所有影象都將閾值設定為30。這種簡單的閾值處理(simple thresholding)對於我們生成注意力圖的目的而言已經足夠。
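【補充】下面給出一個按上述思路生成二值掩碼的示意程式碼(基於 NumPy 的簡化示例,閾值30與原文一致;函式名、逐通道取均值等細節為筆者自擬的假設,僅作說明):

```python
import numpy as np

def make_raindrop_mask(degraded, clean, threshold=30):
    """由雨滴退化影象與對應乾淨影象相減並閾值化,得到二值掩碼 M。

    degraded, clean: uint8 彩色影象,形狀 (H, W, 3),背景場景完全對齊。
    返回: (H, W) 的 0/1 掩碼,1 表示該畫素屬於雨滴區域。
    """
    # 逐畫素求差的絕對值,並在顏色通道上取均值
    diff = np.abs(degraded.astype(np.int16) - clean.astype(np.int16)).mean(axis=2)
    # 差異超過閾值的畫素視為雨滴區域
    return (diff > threshold).astype(np.uint8)
```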

    

4. Raindrop Removal using Attentive GAN(使用注意GAN去除雨滴)

      Fig. 2 shows the overall architecture of our proposed network. Following the idea of generative adversarial networks[4], there are two main parts in our network: the generative and discriminative networks. Given an input image degraded by raindrops, our generative network attempts to produce an image as real as possible and free from raindrops. The discriminative network will validate whether the image produced by the generative network looks real.

       圖2展示了我們提出的網路的總體架構。遵循生成對抗網路的思想[4],在我們的網路中有兩個主要部分:生成網路和判別網路。給定一幅因雨滴而退化的輸入影象,我們的生成網路試圖生成儘可能真實、不含雨滴的影象。判別網路將驗證生成網路產生的影象是否看起來真實。

    Our generative adversarial loss can be expressed as:

    我們的生成性對抗性損失可以表示為:
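【補充】原文此處的公式(2)未能顯示。以下是依據上下文重構的生成對抗損失(標準GAN的min-max形式),僅供參考:

$$ \min_{G}\max_{D}\; \mathbb{E}_{\mathbf{R}\sim p_{\mathrm{clean}}}\big[\log D(\mathbf{R})\big] + \mathbb{E}_{\mathbf{I}\sim p_{\mathrm{raindrop}}}\big[\log\big(1 - D(G(\mathbf{I}))\big)\big] \tag{2} $$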

        where G represents the generative network, and D represents the discriminative network. I is a sample drawn from our pool of images degraded by raindrops, which is the input of our generative network. R is a sample from a pool of clean natural images.

       其中G表示生成網路,D表示判別網路。I是從被雨滴退化的影象庫(images degraded by raindrops)中抽取的樣本,它是生成網路的輸入。R是從乾淨的自然影象庫中抽取的樣本。

圖2.我們提出的專注GAN的結構。生成器由一個注意遞迴網路和一個帶有跳過連線的上下文自動編碼器組成。判別器由一系列卷積層組成,並由注意力圖引導。建議以彩色檢視(Best viewed in color)。

 

4.1. Generative Network(生成網路)

    As shown in Fig. 2, our generative network consists of two sub-networks: an attentive-recurrent network and a contextual autoencoder. The purpose of the attentive-recurrent network is to find regions in the input image that need to get attention. These regions are mainly the raindrop regions and their surrounding structures that are necessary for the contextual autoencoder to focus on, so that it can generate better local image restoration, and for the discriminative network to focus the assessment on.

    如圖2所示。我們的生成網路由兩個子網路組成:注意力遞迴網路( an attentive-recurrent network )和上下文自動編碼器(a contextual autoencoder)。注意-遞迴網路的目的是在輸入影象中尋找需要引起注意的區域。這些區域主要是雨滴區域及其周圍的結構,是上下文自動編碼器必須關注的區域,這樣才能產生更好的區域性影象恢復,並使判別網路集中於評估( assessment )。

 

    Attentive-Recurrent Network. Visual attention models have been applied to localizing targeted regions in an image to capture features of the regions. The idea has been utilized for visual recognition and classification (e.g. [26, 15, 5]). In a similar way, we consider visual attention to be important for generating raindrop-free background images, since it allows the network to know where the removal/restoration should be focused on. As shown in our architecture in Fig. 2, we employ a recurrent network to generate our visual attention. Each block (of each time step) in our recurrent network comprises of five layers of ResNet [8] that help extract features from the input image and the mask of the previous block, a convolutional LSTM unit [21] and convolutional layers for generating the 2D attention maps.

     關注-迴圈網路。視覺注意模型(Visual attention models)已被應用於定點陣圖像中的目標區域,以捕捉區域的特徵。這一思想已被用於視覺識別和分類(例如[26、15、5])。同樣,我們認為視覺注意對於生成無雨滴背景影象是非常重要的,因為它使網路知道移除/恢復應該集中在哪裡。如圖2中我們的架構所示,我們使用一個遞迴網路來產生視覺注意。遞迴網路中的每一塊(即每個時間步)都包含5層ResNet[8](它們幫助從輸入影象和前一塊的掩碼中提取特徵)、一個卷積LSTM單元[21],以及用於生成2D注意力圖的卷積層。

 

    Our attention map, which is learned at each time step, is a matrix ranging from 0 to 1, where the greater the value, the greater attention it suggests, as shown in the visualization in Fig. 3. Unlike the binary mask, M, the attention map is a non-binary map, and represents the increasing attention from non-raindrop regions to raindrop regions, and the values vary even inside raindrop regions. This increasing attention makes sense to have, since the surrounding regions of raindrops also needs the attention, and the transparency of a raindrop area in fact varies (some parts do not totally occlude the background, and thus convey some background information).

    我們的注意力圖,是在每個時間步驟中學習的,是一個從0到1的矩陣,其中值越大,它所表示的注意力就越多,如圖3中的視覺化所示。與二值掩碼M不同,注意對映是一種非二值對映,它代表著從非雨滴區域到雨滴區域的注意力的增加,雨滴區域內部的關注度也是不同的。這種注意力的增加是有意義的,因為雨滴周圍的區域也需要注意,而雨滴區域的透明度實際上是不同的(有些部分並不完全遮住背景,從而傳達了一些背景資訊)。

圖3.注意力圖學習過程的視覺化。圖中所示為最終的注意力圖 A_N。可以看到,我們的專注-迴圈網路在訓練過程中越來越關注雨滴區域和相關結構。

 

   Our convolution LSTM unit consists of an input gate i_t, a forget gate f_t, an output gate o_t as well as a cell state c_t. The interaction between states and gates along time dimension is defined as follows:

    我們的卷積LSTM單元包括一個輸入門 i_t、一個忘記門 f_t、一個輸出門 o_t 以及一個單元狀態 c_t。狀態與門隨時間維度的相互作用定義如下:
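【補充】原文此處的公式未能顯示。以下是依據卷積LSTM[21]的標準定義重構的狀態與門的更新公式,僅供參考:

$$
\begin{aligned}
i_t &= \sigma(W_{xi} * X_t + W_{hi} * H_{t-1} + W_{ci} \odot C_{t-1} + b_i)\\
f_t &= \sigma(W_{xf} * X_t + W_{hf} * H_{t-1} + W_{cf} \odot C_{t-1} + b_f)\\
C_t &= f_t \odot C_{t-1} + i_t \odot \tanh(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c)\\
o_t &= \sigma(W_{xo} * X_t + W_{ho} * H_{t-1} + W_{co} \odot C_t + b_o)\\
H_t &= o_t \odot \tanh(C_t)
\end{aligned}
$$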

 

    where X_t is the features generated by ResNet. C_t encodes the cell state that will be fed to the next LSTM. H_t represents the output features of the LSTM unit. Operator ∗ represents the convolution operation. The LSTM's output feature is then fed into the convolutional layers, which generate a 2D attention map. In the training process, we initialize the values of the attention map to 0.5. In each time step, we concatenate the current attention map with the input image and then feed them into the next block of our recurrent network.

    其中,X_t 是由ResNet生成的特徵;C_t 對將要傳遞到下一個LSTM的單元狀態進行編碼;H_t 代表LSTM單元的輸出特徵;運算子 ∗ 表示卷積運算。LSTM的輸出特徵隨後被送入卷積層,生成一個2D注意力圖。在訓練過程中,我們將注意力圖的值初始化為0.5。在每個時間步中,我們將當前的注意力圖與輸入影象連線起來,然後將它們輸入到遞迴網路的下一個塊中。

   

    In training the generative network, we use pairs of images with and without raindrops that contain exactly the same background scene. The loss function in each recurrent block is defined as the mean squared error (MSE) between the output attention map at time step t, or A t , and the binary mask, M. We apply this process N time steps. The earlier attention maps have smaller values and get larger when approaching the N th time step indicating the increase in confidence. The loss function is expressed as:

    在訓練生成網路時,我們使用包含和不包含雨滴、但背景場景完全相同的影象對。每個迴圈塊中的損失函式定義為時間步 t 的輸出注意力圖 A_t 與二值掩碼 M 之間的均方誤差(MSE)。我們將這一過程應用 N 個時間步。較早的注意力圖值較小,越接近第 N 個時間步值越大,這表明置信度的增加。損失函式表示為:
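【補充】原文此處的公式未能顯示。以下是依據上下文重構的注意力損失,僅供參考:

$$ \mathcal{L}_{ATT}(\{A\}, M) = \sum_{t=1}^{N} \theta^{N-t}\, \mathcal{L}_{MSE}(A_t, M) \tag{4} $$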

    where A_t is the attention map produced by the attentive-recurrent network at time step t. A_t = ATT_t(F_{t−1}, H_{t−1}, C_{t−1}), where F_{t−1} is the concatenation of the input image and the attention map from the previous time step. When t = 1, F_{t−1} is the input image concatenated with an initial attention map with values of 0.5. Function ATT_t represents the attentive-recurrent network at time step t. We set N to 4 and θ to 0.8. We expect a higher N will produce a better attention map, but it also requires larger memory.

    其中 A_t 是注意遞迴網路在時間步 t 生成的注意力圖。A_t = ATT_t(F_{t−1}, H_{t−1}, C_{t−1}),其中 F_{t−1} 是輸入影象與前一時間步注意力圖的連線。當 t=1 時,F_{t−1} 是輸入影象與初始值為0.5的注意力圖的連線。函式 ATT_t 表示時間步 t 處的注意遞迴網路。我們將 N 設為4,θ 設為0.8。我們預期更大的 N 會產生更好的注意力圖,但也需要更大的記憶體。
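【補充】下面用 PyTorch 勾勒注意遞迴網路的前向迴圈與注意力損失的計算流程(N=4、θ=0.8、注意力圖初始化為0.5與原文一致;網路中用少量卷積層代替原文的5層ResNet與完整卷積LSTM,通道數等均為筆者假設,僅示意資料流,並非論文官方實現):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveRecurrentNet(nn.Module):
    """每個時間步:拼接輸入影象與當前注意力圖 -> 特徵提取 -> 卷積LSTM式門控 -> 生成注意力圖。"""

    def __init__(self, ch=32):
        super().__init__()
        self.ch = ch
        self.extract = nn.Sequential(                       # 代替 ResNet 的特徵提取
            nn.Conv2d(4, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())
        self.gate = nn.Conv2d(ch * 2, ch * 4, 3, padding=1) # 簡化的卷積LSTM門控(i, f, o, g)
        self.to_attention = nn.Conv2d(ch, 1, 3, padding=1)  # 生成 2D 注意力圖

    def forward(self, image, n_steps=4):
        b, _, h, w = image.shape
        attn = torch.full((b, 1, h, w), 0.5, device=image.device)  # 注意力圖初始化為 0.5
        h_t = torch.zeros(b, self.ch, h, w, device=image.device)
        c_t = torch.zeros_like(h_t)
        maps = []
        for _ in range(n_steps):
            x = self.extract(torch.cat([image, attn], dim=1))      # 輸入影象與當前注意力圖拼接
            i, f, o, g = torch.chunk(self.gate(torch.cat([x, h_t], dim=1)), 4, dim=1)
            c_t = torch.sigmoid(f) * c_t + torch.sigmoid(i) * torch.tanh(g)
            h_t = torch.sigmoid(o) * torch.tanh(c_t)
            attn = torch.sigmoid(self.to_attention(h_t))
            maps.append(attn)
        return maps

def attention_loss(maps, mask, theta=0.8):
    """L_ATT = sum_t theta^(N-t) * MSE(A_t, M),時間步越靠後權重越大。"""
    n = len(maps)
    return sum(theta ** (n - t) * F.mse_loss(a, mask) for t, a in enumerate(maps, start=1))
```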

 

    Fig. 3 shows some examples of attention maps generated by our network in the training procedure. As can be seen, our network attempts to find not only the raindrop regions but also some structures surrounding the regions. And Fig. 8 shows the effect of the attentive-recurrent network in the testing stage. With the increasing of time step, our network focuses more and more on the raindrop regions and relevant structures.

   圖3給出了在訓練過程中由我們的網路生成的注意力圖的一些例子。可以看到,我們的網路不僅試圖找到雨滴區域,而且還試圖找到圍繞著這些區域的一些結構。圖8還顯示了注意-迴圈網路在測試階段的效果。隨著時間步的推移,我們的網路越來越專注於雨滴區域及其相關結構。

 

     Contextual Autoencoder. The purpose of our contextual autoencoder is to generate an image that is free from raindrops. The input of the autoencoder is the concatenation of the input image and the final attention map from the attentive-recurrent network. Our deep autoencoder has 16 conv-relu blocks, and skip connections are added to prevent blurred outputs. Fig. 4 illustrates the architecture of our contextual autoencoder.

    上下文自動編碼器。我們的上下文自動編碼器的目的是生成一個不受雨滴影響的影象。自動編碼器的輸入是輸入影象與來自注意遞迴網路的最終注意力圖的拼接。我們的深層自動編碼器有16個conv-relu塊,並添加了跳過連線(skip connections)以防止模糊輸出。圖4闡述了上下文自動編碼器的體系結構。

 

    

    圖4.我們的上下文自動編碼器的架構。利用多尺度損失和感知損耗來幫助訓練自動編碼器.

 

   As shown in the figure, there are two loss functions in our autoencoder: multi-scale losses and perceptual loss. For the multi-scale losses, we extract features from different decoder layers to form outputs in different sizes. By adopting this, we intend to capture more contextual information from different scales. This is also the reason why we call it contextual autoencoder.

   如圖4所示,在我們的自動編碼器中有兩個損失函式:多尺度損失(multi-scale losses)和感知損失(perceptual loss)。對於多尺度損失,我們從不同的解碼器層中提取特徵,形成不同大小的輸出。通過採用這種方法,我們打算從不同的尺度上獲取更多的上下文資訊。這也是為什麼我們稱它為上下文自動編碼器。

    We define the loss function as:

    我們將損失函式定義為:
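【補充】原文此處的公式未能顯示。以下是依據上下文重構的多尺度損失,僅供參考:

$$ \mathcal{L}_{M}(\{S\}, \{T\}) = \sum_{i} \lambda_i\, \mathcal{L}_{MSE}(S_i, T_i) \tag{5} $$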

     where S_i indicates the i-th output extracted from the decoder layers, and T_i indicates the ground truth that has the same scale as that of S_i. λ_i are the weights for different scales. We put more weight at the larger scale. To be more specific, the outputs of the last 1st, 3rd and 5th layers are used, whose sizes are 1/4, 1/2 and 1 of the original size, respectively. Smaller layers are not used since the information is insignificant. We set λ's to 0.6, 0.8 and 1.0.

    其中 S_i 表示從解碼器層提取的第 i 個輸出,T_i 表示與 S_i 具有相同尺度的地面真相(ground truth),λ_i 是不同尺度的權重。我們在較大的尺度上放置更大的權重。更具體地說,我們使用倒數第1、第3和第5層的輸出,它們的尺寸分別為原始尺寸的1/4、1/2和1。由於更小的層所含資訊不重要,因此不使用。我們將 λ 分別設定為0.6、0.8、1.0。
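【補充】多尺度損失的一個簡化示意實現(假設解碼器在三個尺度上各輸出一幅影象,權重 0.6/0.8/1.0 與原文一致;函式介面與雙線性下采樣方式為筆者假設):

```python
import torch.nn.functional as F

def multi_scale_loss(outputs, ground_truth, weights=(0.6, 0.8, 1.0)):
    """outputs: 解碼器不同層的輸出列表,尺寸分別約為原圖的 1/4、1/2、1(由小到大)。
    ground_truth: 原始尺寸的乾淨影象;按各輸出的尺寸縮放後再計算 MSE。"""
    loss = 0.0
    for s_i, lam in zip(outputs, weights):
        t_i = F.interpolate(ground_truth, size=s_i.shape[-2:], mode='bilinear',
                            align_corners=False)   # 將地面真相縮放到與該尺度輸出一致
        loss = loss + lam * F.mse_loss(s_i, t_i)
    return loss
```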

 

    Aside from the multi-scale losses, which are based on a pixel-by-pixel operation, we also add a perceptual loss [11] that measures the global discrepancy between the features of the autoencoder’s output and those of the corresponding ground-truth clean image. These features can be extracted from a well-trained CNN, e.g. VGG16 pretrained on Ima- geNet dataset. Our perceptual loss function is expressed as:

    除了基於逐畫素操作的多尺度損失之外,我們還添加了一個感知損失[11],它測量了自動編碼器輸出的特徵與相應的地面真實幹淨影象之間的整體差異(global discrepancy )。這些特徵可以從訓練有素的cnn中提取出來,例如使用ImageNet資料集進行了預訓練的VGG-16。我們的感知損失函式表示為:
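【補充】原文此處的公式未能顯示。以下是依據上下文重構的感知損失,僅供參考:

$$ \mathcal{L}_{P}(O, T) = \mathcal{L}_{MSE}\big(VGG(O),\, VGG(T)\big) \tag{6} $$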

 

    where VGG is a pretrained CNN, and produces features from a given input image. O is the output image of the autoencoder or, in fact, of the whole generative network: O = G(I). T is the ground-truth image that is free from raindrops.

    其中VGG是預先訓練的CNN,可從給定的輸入影象生成特徵。O是自動編碼器的輸出影象,或者說,實際上是整個生成網路的輸出影象:O = G(I)。T是不受雨滴影響的地面真相影象。
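【補充】感知損失的一個示意實現(以較新版本 torchvision 中在 ImageNet 上預訓練的 VGG16 為例;具體取哪一層特徵原文未明確,此處取前16層即到 relu3_3 為止,屬筆者假設):

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class PerceptualLoss(nn.Module):
    def __init__(self):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features[:16]
        for p in vgg.parameters():
            p.requires_grad = False          # 僅作特徵提取器,不參與訓練
        self.vgg = vgg.eval()

    def forward(self, output, target):
        # 比較生成影象與地面真相在 VGG 特徵空間中的差異
        return F.mse_loss(self.vgg(output), self.vgg(target))
```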

 

    Overall, the loss of our generative network can be written as:

    總的說來,我們的生成網路的損失可以寫成:
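【補充】原文此處的公式未能顯示。以下是依據原論文各損失項重構的生成網路總損失(其中 $\mathcal{L}_{GAN}(O)=\log(1-D(O))$;權重 $10^{-2}$ 為筆者依據原論文回憶的設定,屬重構,僅供參考):

$$ \mathcal{L}_{G} = 10^{-2}\,\mathcal{L}_{GAN}(O) + \mathcal{L}_{ATT}(\{A\}, M) + \mathcal{L}_{M}(\{S\}, \{T\}) + \mathcal{L}_{P}(O, T) \tag{7} $$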

4.2. Discriminative Network(判別網路)

    To differentiate fake images from real ones, a few GAN-based methods adopt global and local image-content consistency in the discriminative part (e.g. [9, 13]). The global discriminator looks at the whole image to check if there is any inconsistency, while the local discriminator looks at small specific regions. The strategy of a local discriminator is particularly useful if we know the regions that are likely to be fake (like in the case of image inpainting, where the regions to be restored are given). Unfortunately, in our problem, particularly in our testing stage, we do not know where the regions degraded by raindrops are and the information is not given. Hence, the local discriminator must try to find those regions by itself.

    為了區分假影象和真實影象,一些基於GAN的方法在判別部分(例如[9,13])採用了全域性和區域性影象內容一致性。全域性鑑別器檢視整個影象以檢查是否有任何不一致,而區域性鑑別器則檢視小的特定區域。如果我們知道哪些區域可能是假的(例如影象修復的情形,其中待恢復的區域是給定的),那麼區域性鑑別器的策略就特別有用。不幸的是,在我們的問題中,特別是在測試階段,我們不知道哪些區域因雨滴而退化,這一資訊並沒有給出。因此,區域性鑑別器必須設法自己找到這些區域。

 

    To resolve this problem, our idea is to use an attentive discriminator. For this, we employ the attention map generated by our attentive-recurrent network. Specifically, we extract the features from the interior layers of the discriminator, and feed them to a CNN. We define a loss function based on the CNN’s output and the attention map. Moreover, we use the CNN’s output and multiply it with the original features from the discriminative network, before feeding them into the next layers. Our underlying idea of doing this is to guide our discriminator to focus on regions indicated by the attention map. Finally, at the end layer we use a fully connected layer to decide whether the input image is fake or real. The right part of Fig. 2 illustrates our discriminative architecture.

   為了解決這個問題,我們的想法是使用一個專注的鑑別器(an attentive discriminator)。為此,我們採用了由注意遞迴網路所產生的注意力圖(the attention map)。具體來說,我們從判別器(discriminator)的內部層中提取特徵,並將它們送入一個CNN。我們根據這個CNN的輸出和注意力圖定義了一個損失函式。此外,我們將這個CNN的輸出與判別網路中的原始特徵相乘,然後再將它們輸入下一層。我們這樣做的基本想法是引導判別器專注於由注意力圖所指示的區域。最後,在最後一層,我們使用一個全連線層來判斷輸入影象是假的還是真實的。圖2的右側闡述了我們的判別器結構。

   

   The whole loss function of the discriminator can be expressed as:

   判別器的整個損失函式可以表示為:
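【補充】原文此處的公式未能顯示。以下是依據上下文(含權重 γ 與 L_map)重構的判別器損失,僅供參考:

$$ \mathcal{L}_{D}(O, R, A_N) = -\log\big(D(R)\big) - \log\big(1 - D(O)\big) + \gamma\, \mathcal{L}_{map}(O, R, A_N) \tag{8} $$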

  

        where L_map is the loss between the features extracted from interior layers of the discriminator and the final attention map:

       其中 L_map 是從鑑別器的內部層提取的特徵與最終注意力圖之間的損失:
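【補充】原文此處的公式(9)未能顯示。以下是依據下文描述(第二項以全0對映監督乾淨影象 R)重構的形式,僅供參考:

$$ \mathcal{L}_{map}(O, R, A_N) = \mathcal{L}_{MSE}\big(D_{map}(O), A_N\big) + \mathcal{L}_{MSE}\big(D_{map}(R), \mathbf{0}\big) \tag{9} $$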

        where D_map represents the process of producing a 2D map by the discriminative network. γ is set to 0.05. R is a sample image drawn from a pool of real and clean images. 0 represents a map containing only 0 values. Thus, the second term of Eq. (9) implies that for R, there is no specific region necessary to focus on.

      其中,D_map 表示由判別網路生成二維對映圖的過程。γ 設定為0.05。R 是從真實、乾淨的影象池中抽取的樣本影象。0 表示僅包含0值的對映。因此,式(9)的第二項意味著對於 R,沒有需要特別關注的區域。

 

     Our discriminative network contains 7 convolution layers with the kernel of (3, 3), a fully connected layer of 1024 and a single neuron with a sigmoid activation function. We extract the features from the last third convolution layers and multiply back in element-wise.

     我們的判別網路包含7個核為(3,3)的卷積層、一個1024維的全連線層,以及一個採用sigmoid啟用函式的單個神經元。我們從倒數第三個卷積層提取特徵,並按元素(element-wise)將其乘回。
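【補充】依據上文描述(7個3×3卷積層、由內部特徵生成注意力圖並乘回原特徵、1024維全連線層加sigmoid輸出)給出的判別器結構示意(各層通道數、步長、在第幾層分支等超引數為筆者假設,並非官方實現):

```python
import torch
import torch.nn as nn

class AttentiveDiscriminator(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        # 前4個 3x3 卷積層提取特徵(通道數與步長為假設值)
        self.front = nn.Sequential(
            nn.Conv2d(3, ch, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(ch * 2, ch * 2, 3, stride=1, padding=1), nn.LeakyReLU(0.2))
        # 由內部特徵生成 2D 注意力圖(對應文中的 D_map,受 L_map 監督)
        self.attention = nn.Conv2d(ch * 2, 1, 3, padding=1)
        # 乘回注意力圖後的另外3個卷積層
        self.back = nn.Sequential(
            nn.Conv2d(ch * 2, ch * 4, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(ch * 4, ch * 4, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(ch * 4, ch * 4, 3, stride=1, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(4), nn.Flatten())
        # 1024 維全連線層 + 單個 sigmoid 神經元
        self.fc = nn.Sequential(nn.Linear(ch * 4 * 4 * 4, 1024), nn.LeakyReLU(0.2),
                                nn.Linear(1024, 1), nn.Sigmoid())

    def forward(self, image):
        feat = self.front(image)
        attn_map = torch.sigmoid(self.attention(feat))   # 判別器預測的注意力圖
        feat = feat * attn_map                            # 以注意力圖引導後續判別
        score = self.fc(self.back(feat))                  # 輸入影象為真的機率
        return score, attn_map
```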

 

5. Raindrop Dataset(雨滴資料集)

    Similar to current deep learning methods, our method  requires relatively a large amount of data with groundtruths for training. However, since there is no such dataset for raindrops attached to a glass window or lens, we create our own. For our case, we need a set of image pairs, where each pair contains exactly the same background scene, yet one is degraded by raindrops and the other one is free from raindrops. To obtain this, we use two pieces of exactly the same glass: one sprayed with water, and the other is left clean. Using two pieces of glass allows us to avoid misalignment, as glass has a refractive index that is different from air, and thus refracts light rays. In general, we also need to manage any other causes of misalignment, such as camera motion, when taking the two images; and, ensure that the atmospheric conditions (e.g., sunlight, clouds, etc.) as well as the background objects to be static during the acquisition process.

    與目前的深度學習方法類似,我們的方法需要相對大量帶有地面真相(groundtruth)的資料來進行訓練。然而,由於沒有雨滴附著在玻璃視窗或鏡頭上的這類資料集,我們建立了自己的資料集。對於我們的情況,我們需要一組影象對,其中每對包含完全相同的背景場景,但一幅因雨滴而退化,另一幅沒有雨滴。為了獲得這樣的資料,我們使用了兩塊完全相同的玻璃:一塊被噴上水,另一塊保持乾淨。使用兩塊玻璃可以使我們避免錯位(misalignment),因為玻璃的折射率不同於空氣,會折射光線。一般情況下,在拍攝這兩幅影象時,我們還需要處理任何其他導致錯位的原因,例如攝像機的運動;並確保大氣條件(如陽光、雲層等)以及背景物體在採集過程中保持靜態。

 

     In total, we captured 1119 pairs of images, with various background scenes and raindrops. We used Sony A6000 and Canon EOS 60 for the image acquisition. Our glass slabs have the thickness of 3 mm and attached to the camera lens. We set the distance between the glass and the camera varying from 2 to 5 cm to generate diverse raindrop images, and to minimize the reflection effect of the glass. Fig. 5 shows some samples of our data.

     我們總共拍攝了1119對影象,有各種背景場景和雨滴。我們使用索尼A6000和佳能Eos 60進行影象採集。我們的玻璃板厚度為3毫米,附在照相機鏡頭上。我們設定玻璃和照相機之間的距離從2釐米到5釐米,以產生不同的雨滴影象,並儘量減少玻璃的反射效果。圖5展示了我們的資料樣本。

   圖5.我們資料集的樣本。上圖:影象因雨滴而退化。底部:對應的地面真實影象.

 

6. Experimental Results(實驗結果)

     Quantitative Evaluation. Table 1 shows the quantitative comparisons between our method and other existing methods: Eigen13 [1], Pix2Pix [10]. As shown in the table, compared to these two, our PSNR and SSIM values are higher. This indicates that our method can generate results more similar to the groundtruths.

      定量評價。表1顯示了我們的方法與其他現有方法的定量比較:Eigen13 [1]、Pix2Pix [10]。如表中所示,與這兩種方法相比,我們的PSNR和SSIM值更高。這表明,我們的方法可以產生更接近地面真相的結果。

 

表1.定量評價結果。A是我們單獨的上下文自動編碼器。A+D是自動編碼器加鑑別器。A+AD是自動編碼器加註意力鑑別器。AA+AD是我們的完整架構:注意力自動編碼器加註意力鑑別器。

 

     We also compare our whole attentive GAN with some parts of our own network: A (autoencoder alone without the attention map), A+D (non-attentive autoencoder plus non-attentive discriminator), A+AD (non-attentive autoencoder plus attentive discriminator). Our whole attentive GAN is indicated by AA+AD (attentive autoencoder plus attentive discriminator). As shown in the evaluation table, AA+AD performs better than the other possible configurations. This is the quantitative evidence that the attentive map is needed by both the generative and discriminative networks.

      我們還將我們完整的注意力GAN與我們自己網路的部分配置進行了比較:A(沒有注意力圖的單獨自動編碼器)、A+D(非注意力自動編碼器加非注意力鑑別器)、A+AD(非注意力自動編碼器加註意力鑑別器)。我們完整的注意力GAN用AA+AD(注意力自動編碼器加註意力鑑別器)表示。如評估表所示,AA+AD的效能優於其他可能的配置。這是生成網路和判別網路都需要注意力圖的定量證據。

 

     Qualitative Evaluation. Fig. 6 shows the results of Eigen13 [1] and Pix2Pix [10] in comparison to our results. As can be seen, our method is considerably more effective in removing raindrops compared to Eigen13 and Pix2Pix. In Fig. 7, we also compare our whole network (AA+AD) with other possible configurations from our architectures (A, A+D, A+AD). Although A+D is qualitatively better than A, and A+AD is better than A+D, our overall network is more effective than A+AD.This is the qualitative evidence that, again, the attentive map is needed by both the generative and discriminative networks.

      定性評價。圖6給出了Eigen13 [1]和Pix2Pix [10]的結果,並與我們的結果進行了比較。可以看出,與Eigen13和Pix2Pix相比,我們的方法在去除雨滴方面要有效得多。在圖7中,我們還比較了我們的整個網路(AA+AD)和我們架構的其他可能配置(A、A+D、A+AD)。雖然A+D在質量上優於A,A+AD優於A+D,但我們的整體網路比A+AD更有效。這一定性證據(qualitative evidence)再次表明,生成網路和判別網路都需要注意力圖。

圖6.比較幾種不同方法的結果。從左到右:地面真相(ground truth),雨滴影象(輸入),Eigen13 [1]、Pix2Pix [10]和我們的方法。幾乎所有的雨滴都被我們的方法去除,儘管它們的顏色、形狀和透明度是多種多樣的。

 

  圖7.比較我們網路架構的某些部分。從左到右:輸入, A, A+D, A+AD,我們的完整架構(AA+AD)。

 

圖8.由我們新穎的注意力遞迴網路生成的注意力圖的視覺化。隨著時間的推移,我們的網路越來越關注雨滴區域和相關結構。

 

圖9.仔細觀察一下我們的輸出和 Pix2Pix的輸出之間的比較。我們的輸出有較少的人工痕跡和較好的恢復結構。

 

      

       Application. To provide further evidence that our visibility enhancement could be useful for computer vision applications, we employ Google Vision API (https://cloud.google.com/vision/) to test whether using our outputs can improve the recognition performance. The results are shown in Fig. 10. As can be seen, using our output, the general recognition is better than without our visibility enhancement process. Furthermore, we perform evaluation on our test dataset, and Fig. 11 shows statistically that using our visibility enhancement outputs significantly outperform those without visibility enhancement, both in terms of the average score of identifying the main object in the input image, and the number of object labels recognized.

     應用。為了進一步證明我們的可見性增強對於計算機視覺應用是有用的,我們使用 Google Vision API(https:/Cloud.google.com/vision/)測試使用我們的輸出是否可以提高識別效能。結果如圖10所示。可以看出,使用我們的輸出,一般的識別比沒有使用我們能見度增強過程要好。此外,我們還對測試資料集進行了評估。圖11統計資料顯示,使用我們的可見性增強輸出在識別輸入影象中的主要物件的平均分數和識別物件標籤的數量方面都明顯優於那些沒有可見性增強的輸出。

 

圖10.一個改進Google Vision API結果的示例。我們的方法增加了主目標檢測的分數和識別物件的分數。

 

圖11.基於Google Vision API的改進總結:(A)在輸入影象中識別主要物件的平均得分。(B)已識別的物體標籤的數目。該方法將識別成績提高10%,目標識別率提高100%。

 

7. Conclusion(總結)

    We have proposed a single-image based raindrop removal method. The method utilizes a generative adversarial network, where the generative network produces the attention map via an attentive-recurrent network and applies this map along with the input image to generate a raindrop-free image through a contextual autoencoder. Our discriminative network then assesses the validity of the generated output globally and locally. To be able to validate locally, we inject the attention map into the network. Our novelty lies on the use of the attention map in both generative and discriminative network. We also consider that our method is the first method that can handle relatively severe presence of raindrops, which the state of the art methods in raindrop removal fail to handle.

     我們提出了一種基於單幅影象的雨滴去除方法。該方法利用生成對抗網路,其中生成網路通過注意遞迴網路生成注意力圖,並將該注意力圖與輸入影象一起送入上下文自動編碼器,以生成無雨滴的影象。然後,我們的判別網路在全域性和區域性上評估生成輸出的有效性(validity)。為了能夠進行區域性驗證,我們將注意力圖注入到判別網路中。我們方法的新穎之處在於在生成網路和判別網路中都使用了注意力圖。我們還認為,我們的方法是第一種能夠處理相對嚴重的雨滴的方法,而目前最先進的雨滴去除方法無法處理這種情況。

 

References(參考文獻)

[1] D. Eigen,