
Multi-Task GANs for View-Specific Feature Learning in Gait Recognition: Paper Translation and Understanding


Today I would like to try translating a paper I have read. The writing is rough for now; I will keep improving it over time.

Abstract

 Abstract— Gait recognition is of great importance in the fields of surveillance and forensics to identify human beings since gait is the unique biometric feature that can be perceived efficiently at a distance. However, the accuracy of gait recognition to some extent suffers from both the variation of view angles and the deficient gait templates. On one hand, the existing cross-view methods focus on transforming gait templates among different views, which may accumulate the transformation error in a large variation of view angles. On the other hand, a commonly used gait energy image template loses temporal information of a gait sequence. To address these problems, this paper proposes multi-task generative adversarial networks (MGANs) for learning view-specific feature representations. In order to preserve more temporal information, we also propose a new multi-channel gait template, called period energy image (PEI). Based on the assumption of view angle manifold, the MGANs can leverage adversarial training to extract more discriminative features from gait sequences. Experiments on OU-ISIR, CASIA-B, and USF benchmark data sets indicate that compared with several recently published approaches, PEI + MGANs achieves competitive performance and is more interpretable to cross-view gait recognition.
Index Terms— Gait recognition, cross-view, generative adversarial networks, surveillance

Translation

Abstract - Since gait is a unique biometric feature that can be perceived efficiently at a distance, gait recognition is of great importance in the fields of surveillance and forensics. However, the accuracy of gait recognition suffers to some extent from both the variation of view angles and deficient gait templates. On one hand, existing cross-view methods focus on transforming gait templates between different views; when the view variation is large, the transformation error keeps accumulating. On the other hand, the commonly used gait energy image template loses the temporal information of a gait sequence. To address these two problems, this paper proposes multi-task generative adversarial networks (MGANs) for learning view-specific feature representations. In order to preserve more temporal information, we also propose a new multi-channel gait template called the Period Energy Image (PEI). Based on the view-angle manifold assumption, MGANs can leverage adversarial training to extract more discriminative features from gait sequences. Experiments on the OU-ISIR, CASIA-B, and USF benchmark datasets show that, compared with several recently published approaches, PEI + MGANs achieves competitive performance and is more interpretable for cross-view gait recognition.

I. INTRODUCTION

 DIFFERENT from other biometric features such as human faces, fingerprints, and irises which are usually obtained at a close distance, gait is the unique biometric feature that can identify humans at a far distance. However, the performance of gait recognition [1] suffers from various exterior factors including clothing [2], walking speed [3], low resolution [4] and so on. Among these factors, the change of view angles greatly influences the generalization ability of gait recognition models. For example, when a person walks across a camera located at a fixed position, the gait appearance of the person may vary along walking directions, making a formidable barricade in recognizing the human under the cross-view case.

Different from biometric features such as faces, fingerprints, and irises, which are usually obtained at a close distance, gait is a unique biometric feature that can identify a person at a far distance. However, the performance of gait recognition [1] is affected by various external factors, including clothing [2], walking speed [3], low resolution [4], and so on. Among these factors, the change of view angle greatly influences the generalization ability of gait recognition models. For example, when a person walks past a camera placed at a fixed position, the person's gait appearance changes with the walking direction, which is a major challenge for gait recognition in the cross-view case.

To solve this problem, some researchers [5]–[8] proposed to learn transformations or projections between different view angles in cross-view gait recognition. Specifically, the View Transform Model (VTM) [9]–[12] transforms gait templates such as Gait Energy Image (GEI) [13] from one view to another. However, VTM requires predicting each pixel value of GEI independently, which is time-consuming and inefficient. To reduce the computational time, an auto-encoder based model [14] is used to reconstruct GEI and extract view-invariant features. In order to achieve view transformations, these two methods reconstruct gait templates via transitional view angles. In this way, however, the reconstruction error may be accumulated if there is a large view variation between two view angles.

To solve this problem, some researchers [5]-[8] proposed learning transformations or projections between different view angles for cross-view gait recognition. Specifically, the View Transformation Model (VTM) [9]-[12] transforms gait templates such as the Gait Energy Image (GEI) [13] from one view to another. However, VTM has to predict each pixel value of the GEI independently, which is time-consuming and inefficient. To reduce the computation time, another approach uses an auto-encoder based model [14] to reconstruct GEIs and extract view-invariant features. To achieve view transformation, these two methods reconstruct gait templates through transitional view angles. However, when there is a large variation between two view angles, this approach accumulates reconstruction error.

 The recently published Generative Adversarial Networks (GANs) interpolate facial poses or age variations along a low-dimensional manifold [15], [16]. It has the ability to model data distribution to improve the performance of different vision tasks such as super-resolution [17] and inpainting [18]. However, original GANs methods generate images from random noise, lacking features that can preserve identity information, which is undesirable for cross-view gait recognition.

The recently published Generative Adversarial Networks (GANs) can interpolate facial poses or age variations along a low-dimensional manifold [15], [16]. By modeling the data distribution, they can improve the performance of different vision tasks such as super-resolution and inpainting. However, the original GANs generate images from random noise and lack features that can preserve identity information, which is undesirable for cross-view gait recognition.

In order to overcome the shortcomings mentioned above, this paper proposes Multi-task Generative Adversarial Networks (MGANs) to learn view-specific features from gait templates. Further, we propose a new multi-channel gait template, named Period Energy Image (PEI), which is a generalization of GEI. The PEI template can maintain more temporal and spatial information compared with other templates such as GEI and Chrono-Gait Image (CGI) [19], [20]. Extensive experiments on three gait benchmark datasets indicate that our MGANs model with PEI achieves competitive performance in cross-view gait recognition compared with several recently published approaches.

To overcome the shortcomings mentioned above, this paper proposes Multi-task Generative Adversarial Networks (MGANs) to learn view-specific features from gait templates. In addition, we propose a new multi-channel gait template called the Period Energy Image (PEI), which is a more general form of GEI. Compared with other templates such as GEI and the Chrono-Gait Image (CGI) [19], [20], the PEI template preserves more temporal and spatial information. Extensive experiments on three gait benchmark datasets show that, compared with several recently published methods, our MGANs model trained with PEI achieves competitive performance in cross-view gait recognition.
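To make the idea of a multi-channel template more concrete, here is a small NumPy sketch of how a GEI is computed from aligned binary silhouettes, together with one hypothetical way a multi-channel template could be built by averaging over phase groups. The grouping shown here is only an illustration of the idea; the actual definition of PEI is given in Section III-B of the paper, which is not reproduced in this post.

```python
import numpy as np

def gei(silhouettes):
    """Gait Energy Image: the pixel-wise mean of aligned binary silhouettes
    (each H x W) over one gait period."""
    return np.mean(np.stack(silhouettes, axis=0), axis=0)

def multi_channel_template(silhouettes, n_channels):
    """Hypothetical multi-channel template: split one gait period into
    n_channels consecutive phase groups and average each group separately,
    so part of the temporal ordering survives (an illustrative stand-in for
    PEI, whose real definition is in Section III-B of the paper)."""
    frames = np.stack(silhouettes, axis=0)                       # (T, H, W)
    groups = np.array_split(frames, n_channels, axis=0)          # n_channels groups of frames
    return np.stack([g.mean(axis=0) for g in groups], axis=-1)   # (H, W, n_channels)
```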

The training structure of the proposed MGANs models is illustrated in Fig. 1. Inspired by the recent success of deep networks for cross-view gait recognition [21], the convolutional neural network is utilized in our model. PEI is first encoded as a view-specific feature in a latent space by the encoder. Then, a view transform layer transforms the feature from one view to another. Finally, a modified GANs structure is trained with both pixel-wise loss and multi-task adversarial loss. In addition, a view-angle classifier is trained with cross-entropy loss to predict the view angle of the PEI in the testing phase.

The training structure of the proposed MGANs model is illustrated in Fig. 1. Inspired by the recent success of deep networks for cross-view gait recognition [21], our model also adopts a convolutional neural network. The PEI is first encoded by the encoder as a view-specific feature in a latent space. Then, a view transform layer transforms the feature from one view to another. Finally, a modified GANs structure is trained with both a pixel-wise loss and a multi-task adversarial loss. In addition, a view-angle classifier is trained with a cross-entropy loss to predict the view angle of the PEI in the testing phase. (Why this PEI view angle is needed in the testing phase will be explained in detail later.)

The rest of this paper is organized as follows. Related work is reviewed in Section II. Section III presents the PEI template and explains the proposed MGANs model. Experimental results are analyzed in Section IV. Discussion and conclusion are given in Sections V and VI, respectively.

The rest of this paper is organized as follows. Related work is reviewed in Section II. Section III presents the PEI template and explains the proposed MGANs model. Experimental results are analyzed in Section IV. Discussion and conclusion are given in Sections V and VI, respectively.

[Fig. 1. The training structure of the proposed MGANs model.]

II. RELATED WORK

A. Cross-View Gait Recognition

The approaches of cross-view gait recognition can be divided into three categories. The first category is devoted to reconstructing the 3D structure of a person through a set of gait images taken from multi-view cameras [22]–[24]. Constrained by the strict environmental requirement and expensive computational cost, however, it is less applicable in practice. The second category is to extract the hand-crafted view-invariant features from gait images to represent a person [25]–[28]. Due to the strong nonlinear relationship between view angles and gait images, extracting such view-invariant features from images is a challenge. As a result, the hand-crafted view-invariant features cannot generalize well in the condition of a large view variation [21].

The approaches to cross-view gait recognition can be divided into three categories. The first category is devoted to reconstructing the 3D structure of a person from a set of gait images taken by multi-view cameras [22]-[24]. However, constrained by strict environmental requirements and expensive computational cost, it is less applicable in practice. The second category extracts hand-crafted view-invariant features from gait images to represent a person's gait [25]-[28]. Because of the strongly nonlinear relationship between view angles and gait images, extracting such view-invariant features from images is a difficult problem. As a result, hand-crafted view-invariant features do not generalize well under large view variations [21].

Most of the state-of-the-art methods belonging to the third category directly learn transformations or projections of gaits in different view angles. For example, Makihara et al. [10] proposed a View Transformation Model (VTM) to transform gait templates from one view to another. In their work, Singular Value Decomposition (SVD) was used to compute the projection matrix and view-invariant features for each GEI. Further, a truncated SVD was proposed to overcome the overfitting problem of the original VTM [11]. Moreover, they refined a VTM-based method to learn a nonlinear transformation between different view angles [9] by employing support vector regression.

Most of the state-of-the-art methods belong to the third category, which directly learns transformations or projections of gait between different view angles. For example, Makihara et al. [10] proposed a View Transformation Model (VTM) to transform gait templates from one view to another. In their work, Singular Value Decomposition (SVD) is used to compute the projection matrix and view-invariant features for each GEI. Further, a truncated SVD was proposed to overcome the overfitting problem of the original VTM [11]. Moreover, they refined the VTM-based method by employing support vector regression to learn a nonlinear transformation between different view angles [9].

 Different from VTM-based methods, Canonical Correlation Analysis (CCA) based approaches project the gait templates from multiple view angles onto a latent space with maximal correlation [5], [8], [12], [29]. For example, Bashir et al. [5] modeled the correlation of gait sequences from different view angles using CCA. Kusakunniran et al. [12] claimed that there might exist some local correlations in GEIs of different view angles. Xing et al. [8] proposed Complete Canonical Correlation Analysis (C3A) to overcome the shortcomings of CCA when directly dealing with two sets of high dimensional features. In their method, the original CCA was decomposed into two stable eigenvalue decomposition problems to avoid inconsistent projection directions between Principle Component Analysis (PCA) and CCA.

Different from VTM-based methods, Canonical Correlation Analysis (CCA) based approaches project gait templates from multiple view angles onto a latent space with maximal correlation [5], [8], [12], [29]. For example, Bashir et al. [5] used CCA to model the correlation of gait sequences from different view angles. Kusakunniran et al. [12] argued that there might exist some local correlations between GEIs of different view angles. Xing et al. [8] proposed Complete Canonical Correlation Analysis (C3A) to overcome the shortcomings of CCA when directly dealing with two sets of high-dimensional features. In their method, the original CCA is decomposed into two stable eigenvalue decomposition problems to avoid inconsistent projection directions between Principal Component Analysis (PCA) and CCA.

 However, CCA-based methods assume that view angles are known in advance. Therefore, Hu et al. [7] proposed an alternative View-invariant Discriminative Projection (ViDP) to project the gait templates onto a latent space without knowing the view angles.

However, CCA-based methods assume that the view angles are known in advance. Therefore, Hu et al. [7] proposed View-invariant Discriminative Projection (ViDP) as an alternative, which projects gait templates onto a latent space without requiring the view angles to be known.

Recently, deep neural networks based gait recognition methods were introduced in [2], [21], and [30]–[32]. The CNN-based method proposed by Wu et al. [21] automatically recognized the discriminative features to predict the similarity given a pair of gait images. The model they used is opaque on how the view variation affects the similarities between different samples. Instead of using silhouette images which are sensitive to the clothing and carrying variations, a pose-based temporal-spatial network [2] is proposed to extract dynamic and static information from the key-point of human bodies [33]. The experimental results show that it may be a challenge for a pose-based method to extract discriminative information from key-points of a human body.

Recently, gait recognition methods based on deep neural networks were introduced in [2], [21], and [30]-[32]. The CNN-based method proposed by Wu et al. [21] automatically learns discriminative features to predict the similarity of a given pair of gait images. However, the model they use is opaque about how view variation affects the similarity between different samples. Instead of using silhouette images, which are sensitive to clothing and carrying variations, a pose-based temporal-spatial network [2] was proposed to extract dynamic and static information from key points of the human body [33]. The experimental results show that extracting discriminative information from human body key points is still a challenging task for pose-based methods.

However, some shortcomings remain in the existing methods. For example, VTM-based methods suffer from error accumulation stemming from large view variations. CCA-based methods and ViDP only model the linear correlation between features. The CNN-based method lacks the interpretability of view variations. In order to overcome these shortcomings, our MGANs model benefits from the feature transformation in a latent space and the nonlinear deep model. Different from directly predicting the similarity given a pair of samples as in [21], our method learns view-specific features by utilizing prior knowledge about the view angles. This greatly facilitates the understanding of how view variation affects the learned features.

However, some shortcomings remain in the existing methods. For example, VTM-based methods accumulate errors when the view variation is large. CCA-based methods and ViDP only model linear correlations between features. The CNN-based method lacks interpretability with respect to view variation. To overcome these shortcomings, our MGANs model benefits from feature transformation in a latent space and a nonlinear deep model. Different from directly predicting the similarity of a pair of samples as in [21], our method learns view-specific features by exploiting prior knowledge about the view angles, which greatly facilitates understanding how view variation affects the learned features.

B. Generative Adversarial Networks

 Recently, Generative Adversarial Networks (GANs) [34] were introduced as a novel way to model data distributions. Specifically, GANs are a pair of neural networks consisting of a generator G and a discriminator D. In the original GANs, the generator G generates fake data from a distribution of Pz. The goal of the discriminator D is to distinguish fake data from real data x. We assume that the distribution of real data is Pdata. Both the generator and discriminator are iteratively optimized against each other in a minimax game as follows [34]:

Recently, Generative Adversarial Networks (GANs) [34] were introduced as a novel way to model data distributions. Specifically, GANs are a pair of neural networks consisting of a generator G and a discriminator D. In the original GANs, the generator G generates fake data from a distribution Pz. The goal of the discriminator D is to distinguish the fake data from real data x. We assume that the distribution of the real data is Pdata. The generator and the discriminator are iteratively optimized against each other in a minimax game as follows [34]:
$$\min_{\theta_G}\max_{\theta_D}\; \mathbb{E}_{x\sim P_{data}}\left[\log D(x)\right] + \mathbb{E}_{z\sim P_{z}}\left[\log\left(1 - D(G(z))\right)\right]$$

 where θG and θD are the parameters of G and D, respectively. However, the training of original GANs suffers from the problems of low quality, instability and mode collapse. Several variants of GANs were thus introduced to solve these problems. For example, WGANs [35], [36] and DCGANs [15] were proposed to improve the stability of training and to alleviate mode collapse.

where $\theta_{G}$ and $\theta_{D}$ are the parameters of G and D, respectively. However, the training of the original GANs suffers from low quality, instability, and mode collapse. Several GAN variants were therefore introduced to solve these problems. For example, WGANs [35], [36] and DCGANs [15] were proposed to improve training stability and to alleviate mode collapse. (I am not sure how best to translate "mode collapse.")
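As a rough illustration of the minimax objective above, the following PyTorch-style sketch alternates one discriminator update and one generator update, assuming D outputs a probability in (0, 1) of shape (batch, 1). It uses the common non-saturating generator loss rather than literally minimizing log(1 - D(G(z))); this is only the generic original-GAN training loop from [34], not the MGANs training procedure.

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_G, opt_D, real_x, z_dim=100):
    """One alternating update of the original GAN minimax game:
    D is trained to tell real_x apart from G(z); G is trained to fool D."""
    batch = real_x.size(0)

    # --- update the discriminator D (ascend on log D(x) + log(1 - D(G(z)))) ---
    z = torch.randn(batch, z_dim)
    fake_x = G(z).detach()                      # stop gradients into G
    d_loss = F.binary_cross_entropy(D(real_x), torch.ones(batch, 1)) + \
             F.binary_cross_entropy(D(fake_x), torch.zeros(batch, 1))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # --- update the generator G (fool D into predicting "real") ---
    z = torch.randn(batch, z_dim)
    g_loss = F.binary_cross_entropy(D(G(z)), torch.ones(batch, 1))
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()

    return d_loss.item(), g_loss.item()
```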

 Research on original GANs has also focused on utilizing supervised information. For example, conditional GANs [37] was proposed to generate samples by providing label information. Various vision problems such as super-resolution [17] and inpainting [18] were advanced based on conditional GANs.

Research on the original GANs has also focused on utilizing supervised information. For example, conditional GANs [37] were proposed to generate samples conditioned on label information. Based on conditional GANs, the performance of various vision problems such as super-resolution [17] and inpainting [18] has been effectively improved.

 Recent researches on GANs are capable of interpolating facial poses or age variations along a low-dimensional manifold [15], [16]. In order to capture the manifold of view angles and model the distribution of gait images, we also introduce the GANs into our model. The structure of GANs in our proposed model is composed of one generator and several subdiscriminators. Each sub-discriminator is responsible to ensure that the generated gait images belong to a certain domain such as a view angle domain, an identity domain, or a channel domain of gait images.

Recent research on GANs is able to interpolate facial poses or age variations along a low-dimensional manifold [15], [16]. In order to capture the view-angle manifold and model the distribution of gait images, we also introduce GANs into our model. The GANs structure in our proposed model consists of one generator and several sub-discriminators. Each sub-discriminator is responsible for ensuring that the generated gait images belong to a certain domain, such as a view-angle domain, an identity domain, or a channel domain of gait images.

 Note that the recently published GaitGAN [38] also introduced the GANs to learn view-invariant features. The proposed two discriminators were used to ensure that the generated gait images are realistic and the identity can be preserved. There are two main differences between their work and ours. The first is that their work directly transformed the gait template from arbitrary view angles to the side view angle without utilizing the assumption of view angle manifold. The second is that the two discriminators proposed in their work are mutually independent, whereas different discriminators will share the weights of the network in our method.

Note that the recently published GaitGAN [38] also introduced GANs to learn view-invariant features. Its two discriminators are used to ensure that the generated gait images are realistic and that the identity information is preserved. There are two main differences between their work and ours. First, their work directly transforms gait templates from arbitrary view angles to the side view without utilizing the view-angle manifold assumption. Second, the two discriminators in their work are mutually independent, whereas in our method the different discriminators share network weights.
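To illustrate the weight-sharing design mentioned here, the sketch below builds a discriminator with a single shared convolutional trunk and three sub-discriminator heads for the view-angle, identity, and channel domains. The layer counts, kernel sizes, and head dimensions are my own assumptions for illustration; the paper only states that the sub-discriminators share network weights.

```python
import torch.nn as nn

class MultiTaskDiscriminator(nn.Module):
    """Sketch of a discriminator whose sub-discriminators share one trunk,
    in contrast to the two independent discriminators in GaitGAN."""
    def __init__(self, n_views, n_ids, n_channels):
        super().__init__()
        self.trunk = nn.Sequential(                               # shared weights
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.01),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.01),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.01),
            nn.Flatten(),
        )
        feat_dim = 128 * 8 * 8                                    # for 64 x 64 inputs
        self.view_head = nn.Linear(feat_dim, n_views)             # view-angle domain scores
        self.id_head = nn.Linear(feat_dim, n_ids)                 # identity domain scores
        self.chan_head = nn.Linear(feat_dim, n_channels)          # PEI channel domain scores

    def forward(self, x):
        h = self.trunk(x)
        return self.view_head(h), self.id_head(h), self.chan_head(h)
```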

III. METHOD

 In this section, we first give an overview of our method for cross-view gait recognition. Then, we describe a novel gait template called Period Energy Image (PEI). We also formulate the model of our proposed Multi-task Generative Adversarial Networks (MGANs). Finally, we introduce the objective functions of our methods.

In this section, we first give an overview of our cross-view gait recognition method. Then, we describe a novel gait template called the Period Energy Image (PEI). We also formulate our proposed Multi-task Generative Adversarial Networks (MGANs) model. Finally, we introduce the objective functions of our method.


A. Method Overview

[Fig. 2. The recognition pipeline.]

The recognition pipeline is shown in Fig. 2. Given a probe template $x^{p}$ at view angle $p$ and $n$ gallery templates $\{x_{1}^{g}, x_{2}^{g}, \ldots, x_{n}^{g}\}$ at view angle $g$, whose $n$ identities are denoted by $\{y_{1}^{g}, y_{2}^{g}, \ldots, y_{n}^{g}\}$, the goal of cross-view recognition is to identify the identity of $x^{p}$. The gait template $x^{p}$ is first encoded as a view-specific feature $z^{p} = E(x^{p})$ in a latent space, where $E$ is an encoder. Then the view-angle classifier predicts the view angles of the probe and gallery templates. After that, the view transform layer $V$ transforms $z^{p}$ from view $p$ to view $g$ as $z^{g} = V(z^{p}, p, g)$. Each gallery template $x_{i}^{g}$ is also encoded as a feature $z_{i}^{g} = E(x_{i}^{g})$ in the latent space, where $i \in \{1, 2, \ldots, n\}$. The identity of $x^{p}$ is assigned by a nearest-neighbor classifier as $y_{i^{*}}^{g}$, where
$$i^{*} = \arg\min_{i} \left\| z^{g} - z_{i}^{g} \right\|_{2},$$
and $\left\| \cdot \right\|_{2}$ denotes the 2-norm.
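The following sketch traces the recognition pipeline just described: encode the probe, predict the view angles, transform the probe feature into the gallery view with the view transform layer, and assign the identity of the nearest gallery feature under the 2-norm. The interfaces of E, V, and the view-angle classifier are assumptions made for illustration.

```python
import torch

def recognize(E, V, view_clf, x_p, gallery_x, gallery_y):
    """Sketch of the recognition pipeline of Fig. 2 (interfaces assumed)."""
    z_p = E(x_p)                                     # view-specific probe feature
    p = view_clf(z_p).argmax(dim=-1)                 # predicted probe view angle
    z_g = torch.stack([E(x) for x in gallery_x])     # gallery features, shape (n, d)
    g = view_clf(z_g).argmax(dim=-1)[0]              # predicted gallery view (assumed shared)
    z_pg = V(z_p, p, g)                              # probe feature transformed to view g
    dists = torch.norm(z_g - z_pg, dim=-1)           # 2-norm distance to each gallery feature
    return gallery_y[dists.argmin().item()]          # nearest-neighbor identity
```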

B. Period Energy Image


C. Multi-Task Generative Adversarial Network

 Our proposed MGANs model consists of five components: an encoder that encodes the gait templates as view-specific features in a latent space, a view-angle classifier that predicts the view angles of the view-specific features, a view transform layer that transforms the view-specific features from one view to another, a generator that generates the gait images from the view-specific features, and a discriminator that discriminates whether the generated gait images belong to certain domains or distributions. In this subsection, we detail each component as follows.

Our proposed MGANs model consists of five components: an encoder that encodes gait templates as view-specific features in a latent space; a view-angle classifier that predicts the view angles of the view-specific features; a view transform layer that transforms a view-specific feature from one view to another; a generator that generates gait images from the view-specific features; and a discriminator that discriminates whether the generated gait images belong to certain domains or distributions. In this subsection, we describe each component in detail.
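A minimal sketch of how data might flow through these five components during training is given below. The pixel-wise loss is written as an L1 loss and the view labels as integer class-index tensors; both are assumptions, since this excerpt does not spell out the exact loss forms or the discriminator targets.

```python
import torch
import torch.nn.functional as F

def mgans_forward(E, V, G, D, C, x_u, x_v, u, v):
    """Data flow through encoder E, view transform layer V, generator G,
    discriminator D, and view-angle classifier C, for a pair of templates
    of the same subject: x_u at view u (input) and x_v at view v (target)."""
    z_u = E(x_u)                          # encoder: view-specific feature at view u
    z_v = V(z_u, u, v)                    # view transform layer: feature moved to view v
    x_hat = G(z_v)                        # generator: gait image in the target view
    pixel_loss = F.l1_loss(x_hat, x_v)    # pixel-wise loss against the real view-v template (L1 assumed)
    domain_scores = D(x_hat)              # multi-task discriminator: domain scores for the adversarial loss
    view_logits = C(z_u)                  # view-angle classifier on the latent feature
    view_loss = F.cross_entropy(view_logits, u)
    return pixel_loss, domain_scores, view_loss
```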

1) Encoder:

In order to obtain a view-specific feature for recognition, a convolutional neural network is adopted as the encoder in our model. The structure of the encoder is shown in Fig. 1. The input $x^{u}$ to the encoder is our proposed PEI template in view angle $u$. The size of $x^{u}$ is $64 \times 64 \times n_{c}$, where $n_{c}$ is the number of channels in PEI. We use temporal pooling to aggregate temporal information in gait templates. Temporal pooling is commonly used to summarize several video frames into one feature vector in previous literature [41]. In our method, we treat each channel of PEI as one frame and aggregate the temporal information across all channels. Therefore, each channel of PEI is independently fed to the encoder. Mean-pooling is chosen as the implementation of temporal pooling in our method, which is the same as in [41]. We use four convolutional layers followed by batch-normalization layers to build the encoder. Each component in MGANs uses LeakyReLU as the nonlinear activation function. The negative slope of LeakyReLU is set as 0.01. Instead of using the max pooling layer, convolutional layers with a stride size of 2 are adopted. The number of filters is increased by a factor of 2 from 32 to 256.

In order to obtain a view-specific feature for recognition, we adopt a convolutional neural network as the encoder in our model. The structure of the encoder is shown in Fig. 1. The input $x^{u}$ to the encoder is our proposed PEI template at view angle $u$. The size of $x^{u}$ is $64 \times 64 \times n_{c}$, where $n_{c}$ is the number of channels in PEI.
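Based on the description above, the following PyTorch sketch shows one plausible form of the encoder: four stride-2 convolutions with batch normalization and LeakyReLU (negative slope 0.01), filters growing from 32 to 256, each PEI channel fed through the network independently, and the per-channel features aggregated by mean (temporal) pooling. Kernel sizes and the resulting feature dimensionality are assumptions.

```python
import torch.nn as nn

class Encoder(nn.Module):
    """Sketch of the encoder: stride-2 convolutions replace max pooling,
    batch normalization and LeakyReLU(0.01) follow each convolution, and
    temporal mean-pooling aggregates the per-channel features of a PEI."""
    def __init__(self):
        super().__init__()
        layers, in_c = [], 1
        for out_c in (32, 64, 128, 256):                 # filters doubled from 32 to 256
            layers += [nn.Conv2d(in_c, out_c, 4, stride=2, padding=1),  # stride 2 instead of max pooling
                       nn.BatchNorm2d(out_c),
                       nn.LeakyReLU(0.01)]
            in_c = out_c
        self.conv = nn.Sequential(*layers, nn.Flatten())

    def forward(self, x):                                # x: (B, n_c, 64, 64) PEI template
        b, n_c, h, w = x.shape
        frames = x.reshape(b * n_c, 1, h, w)             # each PEI channel fed as one frame
        feats = self.conv(frames).reshape(b, n_c, -1)    # per-channel feature vectors
        return feats.mean(dim=1)                         # temporal (mean) pooling over channels
```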