GaitGAN: Invariant Gait Feature Extraction Using Generative Adversarial Networks (paper translation and notes)


2. Proposed method

To reduce the effect of variations, a GAN is employed as a regressor to generate invariant gait images: side-view gait images with normal clothing and without carried objects. Gait images at arbitrary views can be converted to the side view, since the side-view data contains more dynamic information. While this is intuitively appealing, a key challenge that must be addressed is to preserve the human identification information in the generated gait images.


The GaitGAN model is trained to generate gait images with normal clothing and without carried objects at the side view, using data from the training set. In the test phase, gait images are sent to the GAN model and invariant gait images that contain human identification information are generated. The difference between the proposed method and most other GAN-related methods is that the generated images here help to improve the discriminant capability, rather than merely looking realistic. The most challenging part of the proposed method is to preserve human identification while generating realistic gait images.

(Note: the supervisory information is strengthened in the loss function.)

2.1. Gait energy image

The gait energy image [6] is a popular gait feature, which is produced by averaging the silhouettes in one gait cycle of a gait sequence, as illustrated in Figure 1. GEI is well known for its robustness to noise and its efficient computation. The pixel values in a GEI can be interpreted as the probability that the corresponding pixel positions are occupied by a human body over one gait cycle. Given the success of GEI in gait recognition, we take GEI as the input and target image of our method. The silhouettes and energy images used in the experiments are produced in the same way as those described in [22].

[Figure 1: gait energy images (GEIs)]
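As a concrete illustration, here is a minimal sketch of GEI computation in Python, assuming the silhouettes of one gait cycle are already segmented, aligned, and size-normalized; `compute_gei` is a hypothetical helper, not code from the paper:

```python
import numpy as np

def compute_gei(silhouettes):
    """Compute a gait energy image (GEI) by averaging the binary
    silhouettes of one gait cycle.

    silhouettes: array of shape (T, H, W) with values in {0, 1},
                 already aligned and size-normalized.
    Returns an (H, W) float array; each pixel is the fraction of
    frames in which that position is occupied by the body.
    """
    silhouettes = np.asarray(silhouettes, dtype=np.float64)
    return silhouettes.mean(axis=0)

# Example: 30 frames of 64x64 silhouettes -> one 64x64 GEI.
cycle = np.random.randint(0, 2, size=(30, 64, 64))
gei = compute_gei(cycle)
```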

2.2. Generative adversarial networks for pixel-level domain transfer

Generative adversarial networks (GAN) [4] are a branch of unsupervised machine learning, implemented by a system of two neural networks competing against each other in a zero-sum game framework. A generative model G captures the data distribution. A discriminative model D then takes either real data from the training set or a fake image generated by model G, and estimates the probability that its input came from the training data set rather than from the generator. In a GAN for image data, the eventual goal of the generator is to map a low-dimensional space z to the pixel-level image space, so that the generator can produce a realistic image given an input random vector z. Both G and D can be non-linear mapping functions. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation.

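For reference, the original GAN of [4] trains G and D with the two-player minimax objective

$$\min_{G}\max_{D} V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_{z}(z)}\left[\log\left(1 - D(G(z))\right)\right],$$

where D is trained to assign the correct real/fake label and G is trained to make D misclassify its samples.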

The input of the generative model can be an image instead of a noise vector. A GAN can realize pixel-level domain transfer between an input image and a target image, as in PixelDTGAN proposed by Yoo et al. [20]. PixelDTGAN can transfer a visual input into a different form, which can then be visualized through the generated pixel-level image. In this way, it simulates the creation of mental images from visual scenes and objects that are perceived by the human eyes. In that work, the authors defined two domains, a source domain and a target domain, connected by a semantic meaning. For instance, the source domain is an image of a dressed person with variations in pose, and the target domain is an image of that person's shirt. PixelDTGAN can thus transfer an image from the source domain, a photo of a dressed person, to a pixel-level target image of the shirt. Meanwhile, the transferred image should look realistic while preserving the semantic meaning. The framework consists of three important parts, as illustrated in Figure 2. While the real/fake discriminator ensures that the generated images are realistic, the domain discriminator ensures that the generated images retain the semantic information.

(Note: this is again the identity information acting as strong supervisory information. Here one generator is followed by two branches, and each branch is a discriminator.)
[Figure 2: the PixelDTGAN framework: a converter, a real/fake discriminator, and a domain discriminator]

The first important component is a pixel-level converter, which is composed of an encoder for semantic embedding of a source image and a decoder to produce a target image. The encoder and decoder are implemented by convolutional neural networks. However, training the converter is not straightforward because the target is not deterministic. Consequently, on top of the converter, some strategy such as a loss function is needed to constrain the target image produced. Therefore, Yoo et al. connected a separate network, named the domain discriminator, on top of the converter. The domain discriminator takes a pair of a source image and a target image as input, and is trained to produce a scalar probability of whether the input pair is associated or not. The loss function $L_{D}^{A}$ in [20] for the domain discriminator $D_{A}$ is defined as:

$$L_{D}^{A}(I_{S}, I) = -t \cdot \log\left[D_{A}(I_{S}, I)\right] - (1 - t) \cdot \log\left[1 - D_{A}(I_{S}, I)\right],$$

$$\text{s.t.}\quad t = 1 \ \text{if} \ I = I_{T}, \qquad t = 0 \ \text{if} \ I = \hat{I}_{T} \ \text{or} \ I = I_{T}^{-},$$
where $I_{S}$ is the source image, $I_{T}$ is the ground-truth target, $I_{T}^{-}$ is an irrelevant target, and $\hat{I}_{T}$ is the image generated by the converter.
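A minimal PyTorch sketch of this loss, assuming `d_a` is a network that maps a (source, target) pair to a probability in (0, 1) and that the three cases are simply summed per batch; all names here are hypothetical:

```python
import torch
import torch.nn.functional as F

def domain_discriminator_loss(d_a, src, tgt_real, tgt_fake, tgt_irrelevant):
    """Binary cross-entropy over the three pair types:
    (src, real target) -> label 1; (src, generated target) -> 0;
    (src, irrelevant target) -> 0.
    """
    ones = torch.ones(src.size(0), 1)
    zeros = torch.zeros(src.size(0), 1)
    loss_real = F.binary_cross_entropy(d_a(src, tgt_real), ones)
    loss_fake = F.binary_cross_entropy(d_a(src, tgt_fake.detach()), zeros)
    loss_irr = F.binary_cross_entropy(d_a(src, tgt_irrelevant), zeros)
    return loss_real + loss_fake + loss_irr
```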

Another component is the real/fake discriminator, which is similar to that of a traditional GAN in that it is supervised by real/fake labels, in order for the entire network to produce realistic images. Here, the discriminator produces a scalar probability indicating whether the input image is real or not. The discriminator's loss function $L_{D}^{R}$, according to [20], takes the form of binary cross entropy:


$$L_{D}^{R}(I) = -t \cdot \log\left[D_{R}(I)\right] - (1 - t) \cdot \log\left[1 - D_{R}(I)\right],$$

$$\text{s.t.}\quad t = 1 \ \text{if} \ I \in \{I^{i}\}, \qquad t = 0 \ \text{if} \ I \in \{\hat{I}^{i}\},$$
where $\{I^{i}\}$ contains real training images and $\{\hat{I}^{i}\}$ contains fake images produced by the generator. Labels are given to the two discriminators, and they supervise the converter to produce images that are realistic while keeping the semantic meaning.

(Note: are these three parts trained together, or is the generator trained first and the discriminators afterwards?)
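For intuition, here is a minimal sketch of one training step in the usual alternating GAN style (discriminators first, then the converter), under the assumption that the three parts are trained jointly with alternating gradient updates; `converter`, `d_r`, `d_a`, and the optimizers are hypothetical, not the paper's exact schedule:

```python
import torch
import torch.nn.functional as F

def train_step(converter, d_r, d_a, opt_c, opt_dr, opt_da,
               src, tgt_real, tgt_irrelevant):
    ones = torch.ones(src.size(0), 1)
    zeros = torch.zeros(src.size(0), 1)
    fake = converter(src)

    # 1) Update the real/fake discriminator on real vs. generated targets.
    loss_dr = (F.binary_cross_entropy(d_r(tgt_real), ones) +
               F.binary_cross_entropy(d_r(fake.detach()), zeros))
    opt_dr.zero_grad(); loss_dr.backward(); opt_dr.step()

    # 2) Update the domain discriminator on the three pair types.
    loss_da = (F.binary_cross_entropy(d_a(src, tgt_real), ones) +
               F.binary_cross_entropy(d_a(src, fake.detach()), zeros) +
               F.binary_cross_entropy(d_a(src, tgt_irrelevant), zeros))
    opt_da.zero_grad(); loss_da.backward(); opt_da.step()

    # 3) Update the converter so its output fools both discriminators.
    loss_c = (F.binary_cross_entropy(d_r(fake), ones) +
              F.binary_cross_entropy(d_a(src, fake), ones))
    opt_c.zero_grad(); loss_c.backward(); opt_c.step()
```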

2.3. GaitGAN: GAN for gait recognition

Inspired by the pixel-level domain transfer in PixelDTGAN, we propose GaitGAN to transform gait data from any view, clothing, and carrying condition to an invariant view: the side view with normal clothing and without carried objects. Additionally, the identification information is preserved. We set the GEIs at all viewpoints, with clothing and carrying variations, as the source, and the GEIs of normal walking at $90^{\circ}$ (side view) as the target, as shown in Figure 3. The converter contains an encoder and a decoder, as shown in Figure 4.

[Figure 3: source GEIs (all views, clothing and carrying variations) and target GEIs (normal walking at $90^{\circ}$)]

[Figure 4: structure of the converter (encoder and decoder)]
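Under the assumption that GEIs are indexed by (subject, condition, sequence, view), the source-to-target training pairs could be assembled as in the following sketch; the data layout is hypothetical:

```python
def build_pairs(geis):
    """geis: dict mapping (subject, condition, seq, view) -> GEI array.
    Pairs every GEI of a subject with that subject's normal-walking
    GEIs at the 90-degree (side) view.
    """
    # Collect each subject's target GEIs: normal walking at 90 degrees.
    targets = {}
    for (subj, cond, seq, view), gei in geis.items():
        if cond == "nm" and view == 90:
            targets.setdefault(subj, []).append(gei)
    # Pair every source GEI with each same-subject target GEI.
    pairs = []
    for (subj, cond, seq, view), src in geis.items():
        for tgt in targets.get(subj, []):
            pairs.append((src, tgt))
    return pairs
```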
There are two discriminators. The first one is the real/fake discriminator, which is trained to predict whether an image is real. If the input GEI comes from real gait data at the $90^{\circ}$ view during normal walking, the discriminator will output 1. Otherwise, it will output 0. The structure of the real/fake discriminator is shown in Figure 5:

[Figure 5: structure of the real/fake discriminator]

With the real/fake discriminator alone, we can only generate side-view GEIs that look good; the identification information of the subjects may be lost. To preserve the identification information, another discriminator, named the identification discriminator, which is similar to the domain discriminator in [20], is involved. The identification discriminator takes a source image and a target image as input, and is trained to produce a scalar probability of whether the input pair belongs to the same person. If the two input images are from the same subject, the output should be 1. If they are source images belonging to two different subjects, the output should be 0. Likewise, if the input is a source image and the target is generated by the converter, the discriminator should output 0. The structure of the identification discriminator is shown in Figure 6.


[Figure 6: structure of the identification discriminator]
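A minimal sketch of the three labeled pair types used to train the identification discriminator, with hypothetical names:

```python
def identification_pairs(src_same, tgt_same, src_other, generated):
    """Builds (source, target, label) triplets:
    same-subject real pair -> 1; different-subject pair -> 0;
    (source, converter output) pair -> 0.
    """
    return [
        (src_same, tgt_same, 1.0),   # same subject, real target
        (src_same, src_other, 0.0),  # two different subjects
        (src_same, generated, 0.0),  # generated target
    ]
```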

3. Experiments and analysis

3.1. Dataset

The CASIA-B gait dataset [22] is one of the largest public gait databases. It was created by the Institute of Automation, Chinese Academy of Sciences in January 2005. It consists of 124 subjects (31 females and 93 males) captured from 11 views. The views range from $0^{\circ}$ to $180^{\circ}$, with an $18^{\circ}$ interval between two nearest views. There are 10 sequences for each subject: 6 sequences of normal walking ("nm"), 2 sequences of walking with a bag ("bg"), and 2 sequences of walking in a coat ("cl").

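The dataset structure can be enumerated as in the sketch below; the "subject-condition-sequence-view" identifier scheme (e.g. "001-nm-01-090") is an assumption for illustration, not guaranteed to match the dataset's file naming:

```python
subjects = [f"{i:03d}" for i in range(1, 125)]    # 124 subjects
conditions = [("nm", 6), ("bg", 2), ("cl", 2)]    # 10 sequences each
views = [f"{v:03d}" for v in range(0, 181, 18)]   # 11 views, 18-degree steps

sequence_ids = [
    f"{s}-{cond}-{i:02d}-{view}"
    for s in subjects
    for cond, n in conditions
    for i in range(1, n + 1)
    for view in views
]
print(len(sequence_ids))  # 124 * 10 * 11 = 13640
```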

3.2. Experimental design

In our experiments, all three types of gait data, "nm", "bg" and "cl", are involved. We put the six normal walking sequences, the two sequences with a coat, and the two sequences with a bag of the first 62 subjects into the training set, and the remaining 62 subjects into the test set. In the test set, the first 4 normal walking sequences of each subject are put into the gallery set and the others into the probe set, as shown in Table 1. There are four probe sets to evaluate different kinds of variations.


[Table 1: gallery and probe sets]
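A minimal sketch of this split, reusing the hypothetical sequence identifiers from the CASIA-B sketch above:

```python
def split_test_set(sequence_ids):
    """Split the second half of CASIA-B (subjects 063-124) into
    gallery and probe sets: nm-01..nm-04 go to the gallery,
    everything else goes to the probes (cf. Table 1)."""
    gallery, probe = [], []
    for sid in sequence_ids:
        subj, cond, num, view = sid.split("-")
        if int(subj) <= 62:
            continue  # the first 62 subjects are used for training
        if cond == "nm" and int(num) <= 4:
            gallery.append(sid)
        else:
            probe.append(sid)
    return gallery, probe
```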

3.3. Model parameters

In the experiments, we used a setup similar to that of [20], as shown in Figure 4. The converter is a unified network that is end-to-end trainable, but we can divide it into two parts, an encoder and a decoder. The encoder part is composed of four convolutional layers that abstract the source into another space, which should capture the personal attributes of the source as well as possible. The resultant feature z is then fed into the decoder to construct a relevant target through four decoding layers. Each decoding layer conducts fractionally-strided convolutions, where the convolution operates in the opposite direction (upsampling rather than downsampling). The details of the encoder and decoder structures are shown in Tables 2 and 3. The structures of the real/fake discriminator and the identification discriminator are similar to the encoder's four convolutional layers. The layers of the discriminators are all convolutional layers.


[Table 2: structure of the encoder]

[Table 3: structure of the decoder]
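A minimal PyTorch sketch of such a converter, assuming 64x64 single-channel GEIs; the channel widths and activations are illustrative, not the paper's exact values (those are in Tables 2 and 3):

```python
import torch
import torch.nn as nn

class Converter(nn.Module):
    """Encoder (4 strided convolutions) -> feature z ->
    decoder (4 fractionally-strided convolutions)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 64, 4, stride=2, padding=1),    nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),  nn.LeakyReLU(0.2),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(256, 512, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),  nn.ReLU(),
            nn.ConvTranspose2d(64, 1, 4, stride=2, padding=1),    nn.Sigmoid(),
        )

    def forward(self, x):        # x: (N, 1, 64, 64)
        z = self.encoder(x)      # z: (N, 512, 4, 4)
        return self.decoder(z)   # output: (N, 1, 64, 64)

out = Converter()(torch.randn(2, 1, 64, 64))
print(out.shape)  # torch.Size([2, 1, 64, 64])
```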

Normally, to achieve good performance with deep-learning-based methods, a large number of training iterations is needed. From Figure 7, we can see that more iterations do result in a higher recognition rate, but the rate peaks at around 450 epochs. So in our experiments, training was stopped after 450 epochs.

[Figure 7: recognition rate versus number of training epochs]