
ReID: Harmonious Attention Network for Person Re-Identification (Paper Notes)

Problem

  • Existing person re-identification (re-id) methods either assume the availability of well-aligned person bounding box images as model input, or rely on constrained attention selection mechanisms to calibrate misaligned images.
  • They are therefore sub-optimal for re-id matching on arbitrarily aligned person images, which potentially contain large human pose variations and unconstrained auto-detection errors.
    • Auto-detection errors include misalignment with background clutter, occlusion, and missing body parts.
  • A small number of attention deep learning models for re-id have recently been developed to reduce the negative effect of poor detection and human pose change.
  • Nevertheless, these deep methods implicitly assume the availability of large labelled training data, simply adopting existing deep architectures with high complexity in model design. Additionally, they often consider only coarse region-level attention whilst ignoring fine-grained pixel-level saliency.
  • Hence, these techniques are ineffective when only a small set of labelled data is available for model training, whilst also facing noisy person images with arbitrary misalignment and background clutter.

In short, this paper tackles the classic ReID problem.

Motivation

  • Existing works:

    • simply adopt a standard deep CNN network, typically with a large number of model parameters and high computational cost in model deployment
    • consider only coarse region-level attention whilst ignoring fine-grained pixel-level saliency
  • This work:

    • We design a lightweight yet deep CNN architecture by devising a holistic attention mechanism for locating the most discriminative pixels and regions, in order to identify optimal visual patterns for re-id.
    • The proposed HA-CNN model is designed particularly to address the weaknesses of existing deep methods above, by formulating a joint learning scheme for modelling both soft and hard attention in a single re-id deep model.
  • Problem 1: most existing methods adopt a standard deep CNN, which brings too many parameters and too high a computational cost.

So the authors propose HA-CNN, a network that is lightweight (few parameters) while still deep (deep enough).

  • Problem 2: existing methods do consider hard region-level attention, but pixel-level attention is neglected.

So the proposed HA-CNN adopts a scheme that jointly learns hard and soft attention, taking both fully into account.

Contribution

  • (I) We formulate a novel idea of jointly learning multi-granularity attention selection and feature representation for optimizing person re-id in deep learning.
  • Contribution 1: joint learning of attention selection and feature representation (global and local features).
  • (II) We propose a Harmonious Attention Convolution Neural Network (HA-CNN) to simultaneously learn hard region-level and soft pixel-level attention within arbitrary person bounding boxes, along with re-id feature representations, for maximizing the correlated complementary information between attention selection and feature discrimination.
  • Contribution 2: the HA-CNN model.
  • (III) We introduce a cross-attention interaction learning scheme for further enhancing the compatibility between attention selection and feature representation given re-id discriminative constraints.
  • Contribution 3: the cross-attention interaction learning scheme.

I personally think these three points boil down to one fairly novel architecture, HA-CNN. The rest of this post walks through the network in detail.

HA-CNN

[image: HA-CNN architecture]

I would summarise four key properties of this network:
1. Lightweight (fewer parameters);
2. Joint learning of global and local features;
3. Joint learning of soft and hard attention;
4. A cross-attention interaction learning scheme between attention selection and feature representation.

The network is multi-branch: a global branch that extracts global features, and local branches that extract local features. The basic building block of every branch is the Inception-A/B unit (one structure among many, alongside ResNet, VGG, AlexNet; think of these as a toolbox and pick whatever works).

The global branch consists of 3 Inception-A units (dark) and 3 Inception-B units (light), plus 3 Harmonious Attention modules (red), 1 global average pooling layer (green) and 1 fully-connected layer (grey), finally producing a 512-dim global feature.

There are multiple local branches (T of them), each consisting of 3 Inception-B units (light) and 1 global average pooling layer; the outputs of all the branches are gathered and passed through a fully-connected layer to produce a 512-dim local feature.

Note: there is exactly one global branch and T local branches; each local branch handles one region, and each bounding box yields T regions.

The global feature and the local feature are then concatenated into a 1024-dim feature, which is the output of HA-CNN.

The dashed lines and red arrows in the figure will be explained later together with the HA module. As a preview: the global features are extracted from the whole image, while the local features come from regions of the bounding box, and those regions are supplied by HA. That is, the dashed lines show HA sending the regions back to earlier nodes, and the red lines then distribute those regions to the individual local branches.

With the structure clear, we can explain the network's first property, being lightweight:
1. The branched design turns the parameter count from a product into a sum;
2. the global branch and the local branches share the parameters of the first conv layer;
3. the local branches share the d1, d2, d3 parameters.

The network learns global and local features at the same time, which is its second property: joint learning of global and local features.

A note on the annotations in the figure:
1. di denotes the number of filters, i.e. the number of channels;
2. the first conv layer {32, 3×3, 2} means 32 filters, a 3×3 kernel and stride 2.
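As a quick sanity check on the {32, 3×3, 2} notation, the parameter count of that first shared conv layer works out as follows (a sketch assuming a 3-channel RGB input and one bias per filter; the paper's exact layer configuration may differ):

```python
# Parameter count of a conv layer {32 filters, 3x3 kernel, stride 2},
# assuming a 3-channel RGB input and a bias per filter (assumptions,
# not stated explicitly in the post).
in_ch, out_ch, k = 3, 32, 3
weights = k * k * in_ch * out_ch   # 3*3*3*32 = 864
biases = out_ch                    # 32
params = weights + biases
print(params)  # 896
```

Note that the stride affects only the output size, not the parameter count, which is why sharing this layer across branches saves the same 896 parameters per extra branch.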

Before diving into the HA structure, we need a quick look at the attention mechanism itself.

What is attention? To me it is a weight that measures the value of information, used to narrow down the search scope. Say I want to find a particular person's face in an image: the parts of the image with the highest value weights are the regions containing the face; those regions are our attention, i.e. our search scope. Another example: given a sentence of 10 words, I assign each word a weight measuring its value within the sentence; the larger the weight, the higher the value. My attention is then naturally a 10-dim vector, and that is its essence.
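The 10-word example can be made concrete: turn per-word relevance scores into a normalised weight vector. This is a generic sketch, not the paper's mechanism; the scores are made up, and softmax is just one common way to normalise them:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax: turns raw scores into weights summing to 1."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hypothetical relevance scores for a 10-word sentence
scores = np.array([0.1, 2.0, 0.3, 0.1, 1.5, 0.2, 0.1, 0.1, 0.4, 0.2])
attention = softmax(scores)  # a 10-dim attention vector
# The word with the largest weight (index 1 here) is the most "valuable" one.
```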

Attention comes mainly in two kinds: hard attention and soft attention. Put simply, hard attention works at the region level while soft attention works at the pixel level. An example: take a group photo from a party, with leftover food, bottles and so on in the background. You can still quickly spot anyone you know in it (assuming you know someone there). That is hard attention: you locate familiar people against a very cluttered background without much interference, which is indeed well suited to misaligned images. Now a reading-comprehension example: you first read the question, extract its key tokens, then go back to the passage and search for them. The tokens you search for are soft attention at work.

A fairly vivid explanation of attention from Stack Overflow:

[image: Stack Overflow illustration of attention]

The HA structure contains four boxes: red, yellow, green and black. The red box is soft attention learning and the black box is hard attention learning; inside the red box, the green box is soft spatial attention and the yellow box is soft channel attention.

Each box is explained below; following along with the formulas should make it easier.

First, the red box. (1) The outputs of the green box and the yellow box are multiplied together; (2) the result passes through a conv layer and then (3) through a sigmoid to give the red box's output ("we use the sigmoid operation to normalise the full soft attention into the range between 0.5 and 1"). Eq. (1) describes step (1).
[image: Eq. (1)]
Note: multiplying the yellow-box and green-box outputs yields the soft attention; the conv layer that follows helps combine the two kinds of soft attention; the final sigmoid layer keeps every component of the output within the 0.5–1 range.
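A numpy sketch of the red box's multiply-then-sigmoid step (the combining conv layer is omitted and the tensor sizes are illustrative, not the paper's). Because the spatial and channel attentions are non-negative, the sigmoid output indeed lands in the [0.5, 1) range:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

h, w, c = 6, 7, 4
S = np.random.rand(h, w, 1)   # soft spatial attention (green box output)
C = np.random.rand(1, 1, c)   # soft channel attention (yellow box output)
A = S * C                     # broadcast multiply -> full soft attention (h, w, c)
A = sigmoid(A)                # non-negative input, so every value is in [0.5, 1)
```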

Next, the green box. (1) The HA input goes through a Reduce layer (a global cross-channel averaging pooling layer); (2) the result passes through a conv layer, (3) then a Resize layer (bilinear interpolation), and finally (4) one more conv layer to give the soft spatial attention. Eq. (2) describes the Reduce layer of step (1), which is essentially an average over the channels.
[image: Eq. (2)]
Note: the Reduce layer averages over the channels, turning the 3-D tensor into a spatial tensor. The conv layer right after it extracts features for the spatial attention. The Resize layer then restores the h×w size using bilinear interpolation. The final conv layer is presumably there to add non-linear expressive power.
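The Reduce layer of Eq. (2) is just a per-pixel mean over the channels; a minimal sketch (the conv and bilinear-resize steps are omitted, and the feature-map size is illustrative):

```python
import numpy as np

def reduce_layer(x):
    """Global cross-channel averaging pooling: (h, w, c) -> (h, w, 1)."""
    return x.mean(axis=2, keepdims=True)

x = np.random.rand(12, 14, 32)   # a stand-in HA input feature map
s = reduce_layer(x)              # spatial tensor: one value per pixel
```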

Then the yellow box. (1) The HA input goes through a global average pooling layer, which performs a squeeze operation; (2) the result then passes through two conv layers. Eq. (3) describes the squeeze function of step (1), and Eq. (4) describes step (2).

[image: Eq. (3)]
[image: Eq. (4)]
Note: the initial squeeze operation turns the 3-D tensor into a channel vector; the two conv layers then extract features.
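The squeeze of Eq. (3) is the dual of the Reduce layer, averaging over the spatial dimensions instead; a minimal sketch (the two subsequent conv layers are omitted):

```python
import numpy as np

def squeeze(x):
    """Global average pooling over spatial dims: (h, w, c) -> (c,)."""
    return x.mean(axis=(0, 1))

x = np.random.rand(12, 14, 32)   # a stand-in HA input feature map
z = squeeze(x)                   # one scalar per channel, a global channel summary
```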

The red box's output is multiplied with the HA input and passed on to the next layer. The red box embodies HA-CNN's third property: joint learning of soft and hard attention.

Finally, the black box. The authors model the hard region attention as a transform matrix, Eq. (5), "which allows for image cropping, translation and isotropic scaling operations by varying two scale factors (sh, sw) and the 2-D spatial position (tx, ty)".
[image: Eq. (5)]

By fixing sh and sw, the authors limit model complexity, so the hard attention model only has to consider two parameters. The input at this point is a c-dim vector, and we extract T regions, so the fully-connected layer (blue) has 2T×c parameters. It is followed by a Tanh, finally giving the 2T-dim output θ.

Note: the authors model hard attention as a transform matrix with four parameters: sh, sw, tx, ty. Of these, sh and sw fix the region size so as to limit model complexity, so hard attention learning only learns the two translation variables. Accordingly, the fully-connected layer outputs 2T values; the Tanh then converts the positions into percentages, making the regions easy to locate, so the output θ is the position information of the T regions.
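The black box can be sketched as follows: the FC layer's 2T outputs pass through tanh to give T (tx, ty) pairs, each completing a 2×3 transform matrix together with the fixed scales (the scale values and the random FC output here are assumptions for illustration):

```python
import numpy as np

T = 4                 # number of regions
sh, sw = 0.5, 0.5     # fixed scale factors (assumed values, not from the paper)
rng = np.random.default_rng(0)
fc_out = rng.standard_normal(2 * T)   # stand-in for the FC layer's 2T-dim output
theta = np.tanh(fc_out)               # 2T positions, each in (-1, 1)
positions = theta.reshape(T, 2)       # one (tx, ty) pair per region
# Eq. (5): one transform matrix per region
mats = [np.array([[sh, 0.0, tx],
                  [0.0, sw, ty]]) for tx, ty in positions]
```

With sh and sw frozen, the only learnable quantities are the T translation pairs, which is exactly the 2T-dim θ described above.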

θ is passed back along the dashed lines to the earlier nodes, then split into T parts and fed along the red lines into the individual local branches for an add op.

Correction: let me now explain HA-CNN's dashed and red lines. For the dashed lines, quoting the paper:

The hard region attention is enforced on that of the corresponding network block to generate T different parts which are subsequently fed into the corresponding streams of the local branch.

This "enforce" and then "generate T parts" is rather cryptic; the paper does not explain it clearly. What the authors presumably mean is that the θ returned from the HA module is the regions' position info; the regions are then taken from the current feature map of the network block it arrives at, resized to 24×28×32, 12×14×d1 and 6×7×d2 respectively, and fed into the local branches for an add op with the local features. This process is described by Eq. (6).
[image: Eq. (6)]

Note: why this addition? The idea is to share what the global branch has learned with the local branches, effectively sharing the global branch's learning capacity; this allows the local branches to have fewer layers and hence fewer parameters.
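At the tensor level, the add op of Eq. (6) is a plain element-wise sum between a local branch's feature map and the matching region cropped and resized from the global branch; a shape-level sketch with stand-in random tensors:

```python
import numpy as np

x_local = np.random.rand(24, 28, 32)          # a local branch feature map
x_global_region = np.random.rand(24, 28, 32)  # region taken from the global
                                              # branch and resized to match
fused = x_local + x_global_region             # element-wise add, as in Eq. (6)
```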

The model's parameters are trained according to Eq. (7).
[image: Eq. (7)]

Eq. (6) and Eq. (7) embody HA-CNN's fourth property: the cross-attention interaction learning scheme between attention selection and feature representation.

HA-CNN on ReID

  • Given a test probe image I^p and a set of test gallery images {I_i^g}:
    (1) We first compute their corresponding 1,024-D feature vectors by forward-feeding the images through a trained HA-CNN model, denoted as x^p = [x_g^p; x_l^p] and {x_i^g = [x_g^g; x_l^g]}.
    (2) We then apply L2 normalisation to the global and local features, respectively.
    (3) Next, we compute the cross-camera matching distances between x^p and each x_i^g by the L2 distance.
    (4) Lastly, we rank all gallery images in ascending order of their L2 distance to the probe image.
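Steps (1)-(4) can be sketched end-to-end; random vectors stand in for the trained HA-CNN's 512-D global and local outputs:

```python
import numpy as np

def l2norm(v):
    return v / (np.linalg.norm(v) + 1e-12)

def descriptor(g, l):
    # step (2): L2-normalise the 512-D global and local parts separately,
    # then concatenate into the final 1,024-D vector
    return np.concatenate([l2norm(g), l2norm(l)])

rng = np.random.default_rng(0)
probe = descriptor(rng.random(512), rng.random(512))              # step (1)
gallery = [descriptor(rng.random(512), rng.random(512)) for _ in range(5)]
dists = np.array([np.linalg.norm(probe - x) for x in gallery])    # step (3)
ranking = np.argsort(dists)   # step (4): ascending order, best match first
```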

Experiment

Implementation details:
1. All person images are resized to 160×64;
2. the widths of the Inception units at the 1st/2nd/3rd levels are set to d1 = 128, d2 = 256 and d3 = 384;
3. T = 4 regions are used for hard attention;
4. in each stream, the sizes of the three levels of hard attention are fixed at 24×28, 12×14 and 6×7;
5. for model optimisation, the ADAM algorithm is used with initial learning rate 5×10−4 and moment terms β1 = 0.9 and β2 = 0.999;
6. batch size is 32, training runs for 150 epochs, and momentum is 0.9;
7. no data augmentation (e.g. scaling, rotation, flipping, colour distortion) is used, nor any model pre-training.

Results:

  • Market-1501
    [results table]

  • DukeMTMC-ReID
    [results table]

  • CUHK03
    This one is a special case: the original protocol uses a 1367/100 training/test split, while the authors adopt 767/700.
    [results table]

  • Attention Evaluation
    [results table]

  • CAIL Evaluation
    [results table]

  • Joint global and local feature Evaluation
    [results table]

  • Model parameter comparison
    [results table]

  • Visualisation
    Soft attention captures strongly discriminative features, like the colourful blobs;
    hard attention can localise body parts, like the four boxes.
    [image]