
ReID: Harmonious Attention Network for Person Re-Identification (Paper Notes)

Problem

  • Existing person re-identification (re-id) methods either assume the availability of well-aligned person bounding box images as model input, or rely on constrained attention selection mechanisms to calibrate misaligned images.
  • They are therefore sub-optimal for re-id matching on arbitrarily aligned person images, which potentially contain large human pose variations and unconstrained auto-detection errors.
    • Auto-detection errors include misalignment with background clutter, occlusion, and missing body parts.
  • A small number of attention deep learning models for re-id have recently been developed to reduce the negative effect of poor detection and human pose change.
  • Nevertheless, these deep methods implicitly assume the availability of large labelled training data, simply adopting existing deep architectures with high complexity in model design. Additionally, they often consider only coarse region-level attention whilst ignoring fine-grained pixel-level saliency.
  • Hence, these techniques are ineffective when only a small set of labelled data is available for model training, whilst also facing noisy person images with arbitrary misalignment and background clutter.

In short, this paper tackles the classic ReID problem.

Motivation

  • Existing works:

    • simply adopt a standard deep CNN network, typically with a large number of model parameters and high computational cost in model deployment
    • consider only coarse region-level attention whilst ignoring fine-grained pixel-level saliency
  • This work:

    • We design a lightweight yet deep CNN architecture by devising a holistic attention mechanism for locating the most discriminative pixels and regions, in order to identify optimal visual patterns for re-id.
    • The proposed HA-CNN model is designed particularly to address the weaknesses of existing deep methods above, by formulating a joint learning scheme for modelling both soft and hard attention in a single re-id deep model.
  • Problem 1: most existing methods adopt a standard deep CNN, which brings too many parameters and too high a computational cost.

So the authors propose HA-CNN, a network that is lightweight (few parameters) while still deep (deep enough).

  • Problem 2: existing methods do consider hard region-level attention, but pixel-level attention is neglected.

So the proposed HA-CNN adopts a scheme that jointly learns hard and soft attention, taking both fully into account.

Contribution

  • (I) We formulate a novel idea of jointly learning multi-granularity attention selection and feature representation for optimizing person re-id in deep learning.
  • Contribution 1: joint learning of attention selection and feature representation (global and local features).
  • (II) We propose a Harmonious Attention Convolution Neural Network (HA-CNN) to simultaneously learn hard region-level and soft pixel-level attention within arbitrary person bounding boxes, along with re-id feature representations, for maximizing the correlated complementary information between attention selection and feature discrimination.
  • Contribution 2: the HA-CNN model.
  • (III) We introduce a cross-attention interaction learning scheme for further enhancing the compatibility between attention selection and feature representation given re-id discriminative constraints.
  • Contribution 3: the cross-attention interaction learning scheme.

I personally think these three points boil down to one fairly novel architecture, HA-CNN. The rest of this post walks through the network in detail.

HA-CNN

[image: HA-CNN architecture]

I would summarise four key properties of this network:
1. Lightweight (fewer parameters);
2. Joint learning of global and local features;
3. Joint learning of soft and hard attention;
4. A cross-attention interaction learning scheme between attention selection and feature representation.

The network is multi-branch: a global branch that extracts global features, and local branches that extract local features. The basic building block of every branch is the Inception-A/B unit (one structure among many, alongside ResNet, VGG, AlexNet; think of these as a toolbox and pick whatever works).

The global branch consists of 3 Inception-A units (dark) and 3 Inception-B units (light), plus 3 Harmonious Attention modules (red), 1 global average pooling layer (green) and 1 fully-connected layer (grey), finally producing a 512-dim global feature.

There are multiple local branches (T of them), each consisting of 3 Inception-B units (light) and 1 global average pooling layer; the outputs of all the branches are gathered and passed through a fully-connected layer to produce a 512-dim local feature.

Note: there is exactly one global branch and T local branches; each local branch handles one region, and each bounding box yields T regions.

The global feature and the local feature are then concatenated into a 1024-dim feature, which is the output of HA-CNN.

The dashed lines and red arrows in the figure will be explained later together with the HA module. As a preview: the global features are extracted from the whole image, while the local features come from regions of the bounding box, and those regions are supplied by HA. That is, the dashed lines show HA sending the regions back to earlier nodes, and the red lines then distribute those regions to the individual local branches.

With the structure clear, we can explain the network's first property, being lightweight:
1. The branched design turns the parameter count from a product into a sum;
2. the global branch and the local branches share the parameters of the first conv layer;
3. the local branches share the d1, d2, d3 parameters.

The network learns global and local features at the same time, which is its second property: joint learning of global and local features.

A note on the annotations in the figure:
1. di denotes the number of filters, i.e. the number of channels;
2. the first conv layer {32, 3×3, 2} means 32 filters, a 3×3 kernel and stride 2.
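As a quick sanity check on the {32, 3×3, 2} notation, the parameter count of that first shared conv layer works out as follows (a sketch assuming a 3-channel RGB input and one bias per filter; the paper's exact layer configuration may differ):

```python
# Parameter count of a conv layer {32 filters, 3x3 kernel, stride 2},
# assuming a 3-channel RGB input and a bias per filter (assumptions,
# not stated explicitly in the post).
in_ch, out_ch, k = 3, 32, 3
weights = k * k * in_ch * out_ch   # 3*3*3*32 = 864
biases = out_ch                    # 32
params = weights + biases
print(params)  # 896
```

Note that the stride affects only the output size, not the parameter count, which is why sharing this layer across branches saves the same 896 parameters per extra branch.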

Before diving into the HA structure, we need a quick look at the attention mechanism itself.

What is attention? To me it is a weight that measures the value of information, used to narrow down the search scope. Say I want to find a particular person's face in an image: the parts of the image with the highest value weights are the regions containing the face; those regions are our attention, i.e. our search scope. Another example: given a sentence of 10 words, I assign each word a weight measuring its value within the sentence; the larger the weight, the higher the value. My attention is then naturally a 10-dim vector, and that is its essence.
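The 10-word example can be made concrete: turn per-word relevance scores into a normalised weight vector. This is a generic sketch, not the paper's mechanism; the scores are made up, and softmax is just one common way to normalise them:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax: turns raw scores into weights summing to 1."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hypothetical relevance scores for a 10-word sentence
scores = np.array([0.1, 2.0, 0.3, 0.1, 1.5, 0.2, 0.1, 0.1, 0.4, 0.2])
attention = softmax(scores)  # a 10-dim attention vector
# The word with the largest weight (index 1 here) is the most "valuable" one.
```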

Attention comes mainly in two kinds: hard attention and soft attention. Put simply, hard attention works at the region level while soft attention works at the pixel level. An example: take a group photo from a party, with leftover food, bottles and so on in the background. You can still quickly spot anyone you know in it (assuming you know someone there). That is hard attention: you locate familiar people against a very cluttered background without much interference, which is indeed well suited to misaligned images. Now a reading-comprehension example: you first read the question, extract its key tokens, then go back to the passage and search for them. The tokens you search for are soft attention at work.

A fairly vivid explanation of attention from Stack Overflow:

[image: Stack Overflow illustration of attention]

The HA structure contains four boxes: red, yellow, green and black. The red box is soft attention learning and the black box is hard attention learning; inside the red box, the green box is soft spatial attention and the yellow box is soft channel attention.

Each box is explained below; following along with the formulas should make it easier.

First, the red box. (1) The outputs of the green box and the yellow box are multiplied together; (2) the result passes through a conv layer and then (3) through a sigmoid to give the red box's output ("we use the sigmoid operation to normalise the full soft attention into the range between 0.5 and 1"). Eq. (1) describes step (1).
[image: Eq. (1)]
Note: multiplying the yellow-box and green-box outputs yields the soft attention; the conv layer that follows helps combine the two kinds of soft attention; the final sigmoid layer keeps every component of the output within the 0.5–1 range.
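A numpy sketch of the red box's multiply-then-sigmoid step (the combining conv layer is omitted and the tensor sizes are illustrative, not the paper's). Because the spatial and channel attentions are non-negative, the sigmoid output indeed lands in the [0.5, 1) range:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

h, w, c = 6, 7, 4
S = np.random.rand(h, w, 1)   # soft spatial attention (green box output)
C = np.random.rand(1, 1, c)   # soft channel attention (yellow box output)
A = S * C                     # broadcast multiply -> full soft attention (h, w, c)
A = sigmoid(A)                # non-negative input, so every value is in [0.5, 1)
```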

Next, the green box. (1) The HA input goes through a Reduce layer (a global cross-channel averaging pooling layer); (2) the result passes through a conv layer, (3) then a Resize layer (bilinear interpolation), and finally (4) one more conv layer to give the soft spatial attention. Eq. (2) describes the Reduce layer of step (1), which is essentially an average over the channels.
[image: Eq. (2)]
Note: the Reduce layer averages over the channels, turning the 3-D tensor into a spatial tensor. The conv layer right after it extracts features for the spatial attention. The Resize layer then restores the h×w size using bilinear interpolation. The final conv layer is presumably there to add non-linear expressive power.
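The Reduce layer of Eq. (2) is just a per-pixel mean over the channels; a minimal sketch (the conv and bilinear-resize steps are omitted, and the feature-map size is illustrative):

```python
import numpy as np

def reduce_layer(x):
    """Global cross-channel averaging pooling: (h, w, c) -> (h, w, 1)."""
    return x.mean(axis=2, keepdims=True)

x = np.random.rand(12, 14, 32)   # a stand-in HA input feature map
s = reduce_layer(x)              # spatial tensor: one value per pixel
```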

Then the yellow box. (1) The HA input goes through a global average pooling layer, which performs a squeeze operation; (2) the result then passes through two conv layers. Eq. (3) describes the squeeze function of step (1), and Eq. (4) describes step (2).

[image: Eq. (3)]
[image: Eq. (4)]
Note: the initial squeeze operation turns the 3-D tensor into a channel vector; the two conv layers then extract features.
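The squeeze of Eq. (3) is the dual of the Reduce layer, averaging over the spatial dimensions instead; a minimal sketch (the two subsequent conv layers are omitted):

```python
import numpy as np

def squeeze(x):
    """Global average pooling over spatial dims: (h, w, c) -> (c,)."""
    return x.mean(axis=(0, 1))

x = np.random.rand(12, 14, 32)   # a stand-in HA input feature map
z = squeeze(x)                   # one scalar per channel, a global channel summary
```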

The red box's output is multiplied with the HA input and passed on to the next layer. The red box embodies HA-CNN's third property: joint learning of soft and hard attention.

Finally, the black box. The authors model the hard region attention as a transform matrix, Eq. (5), "which allows for image cropping, translation and isotropic scaling operations by varying two scale factors (sh, sw) and the 2-D spatial position (tx, ty)".
[image: Eq. (5)]

By fixing sh and sw, the authors limit model complexity, so the hard attention model only has to consider two parameters. The input at this point is a c-dim vector, and we extract T regions, so the fully-connected layer (blue) has 2T×c parameters. It is followed by a Tanh, finally giving the 2T-dim output θ.

Note: the authors model hard attention as a transform matrix with four parameters: sh, sw, tx, ty. Of these, sh and sw fix the region size so as to limit model complexity, so hard attention learning only learns the two translation variables. Accordingly, the fully-connected layer outputs 2T values; the Tanh then converts the positions into percentages, making the regions easy to locate, so the output θ is the position information of the T regions.
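The black box can be sketched as follows: the FC layer's 2T outputs pass through tanh to give T (tx, ty) pairs, each completing a 2×3 transform matrix together with the fixed scales (the scale values and the random FC output here are assumptions for illustration):

```python
import numpy as np

T = 4                 # number of regions
sh, sw = 0.5, 0.5     # fixed scale factors (assumed values, not from the paper)
rng = np.random.default_rng(0)
fc_out = rng.standard_normal(2 * T)   # stand-in for the FC layer's 2T-dim output
theta = np.tanh(fc_out)               # 2T positions, each in (-1, 1)
positions = theta.reshape(T, 2)       # one (tx, ty) pair per region
# Eq. (5): one transform matrix per region
mats = [np.array([[sh, 0.0, tx],
                  [0.0, sw, ty]]) for tx, ty in positions]
```

With sh and sw frozen, the only learnable quantities are the T translation pairs, which is exactly the 2T-dim θ described above.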

θ is passed back along the dashed lines to the earlier nodes, then split into T parts and fed along the red lines into the individual local branches for an add op.

Correction: let me now explain HA-CNN's dashed and red lines. For the dashed lines, quoting the paper:

The hard region attention is enforced on that of the corresponding network block to generate T different parts which are subsequently fed into the corresponding streams of the local branch.

This "enforce" and then "generate T parts" is rather cryptic; the paper does not explain it clearly. What the authors presumably mean is that the θ returned from the HA module is the regions' position info; the regions are then taken from the current feature map of the network block it arrives at, resized to 24×28×32, 12×14×d1 and 6×7×d2 respectively, and fed into the local branches for an add op with the local features. This process is described by Eq. (6).
[image: Eq. (6)]

Note: why this addition? The idea is to share what the global branch has learned with the local branches, effectively sharing the global branch's learning capacity; this allows the local branches to have fewer layers and hence fewer parameters.
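At the tensor level, the add op of Eq. (6) is a plain element-wise sum between a local branch's feature map and the matching region cropped and resized from the global branch; a shape-level sketch with stand-in random tensors:

```python
import numpy as np

x_local = np.random.rand(24, 28, 32)          # a local branch feature map
x_global_region = np.random.rand(24, 28, 32)  # region taken from the global
                                              # branch and resized to match
fused = x_local + x_global_region             # element-wise add, as in Eq. (6)
```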

The model's parameters are trained according to Eq. (7).
[image: Eq. (7)]

Eq. (6) and Eq. (7) embody HA-CNN's fourth property: the cross-attention interaction learning scheme between attention selection and feature representation.

HA-CNN on ReID

  • Given a test probe image I^p and a set of test gallery images {I_i^g}:
    (1) We first compute their corresponding 1,024-D feature vectors by forward-feeding the images through a trained HA-CNN model, denoted as x^p = [x_g^p; x_l^p] and {x_i^g = [x_g^g; x_l^g]}.
    (2) We then apply L2 normalisation to the global and local features, respectively.
    (3) Next, we compute the cross-camera matching distances between x^p and each x_i^g by the L2 distance.
    (4) Lastly, we rank all gallery images in ascending order of their L2 distance to the probe image.
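Steps (1)-(4) can be sketched end-to-end; random vectors stand in for the trained HA-CNN's 512-D global and local outputs:

```python
import numpy as np

def l2norm(v):
    return v / (np.linalg.norm(v) + 1e-12)

def descriptor(g, l):
    # step (2): L2-normalise the 512-D global and local parts separately,
    # then concatenate into the final 1,024-D vector
    return np.concatenate([l2norm(g), l2norm(l)])

rng = np.random.default_rng(0)
probe = descriptor(rng.random(512), rng.random(512))              # step (1)
gallery = [descriptor(rng.random(512), rng.random(512)) for _ in range(5)]
dists = np.array([np.linalg.norm(probe - x) for x in gallery])    # step (3)
ranking = np.argsort(dists)   # step (4): ascending order, best match first
```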

Experiment

Implementation details:
1. All person images are resized to 160×64;
2. the widths of the Inception units at the 1st/2nd/3rd levels are set to d1 = 128, d2 = 256 and d3 = 384;
3. T = 4 regions are used for hard attention;
4. in each stream, the sizes of the three levels of hard attention are fixed at 24×28, 12×14 and 6×7;
5. for model optimisation, the ADAM algorithm is used with initial learning rate 5×10−4 and moment terms β1 = 0.9 and β2 = 0.999;
6. batch size is 32, training runs for 150 epochs, and momentum is 0.9;
7. no data augmentation (e.g. scaling, rotation, flipping, colour distortion) is used, nor any model pre-training.

Results:

  • Market-1501
    [results table]

  • DukeMTMC-ReID
    [results table]

  • CUHK03
    This one is a special case: the original protocol uses a 1367/100 training/test split, while the authors adopt 767/700.
    [results table]

  • Attention Evaluation
    [results table]

  • CAIL Evaluation
    [results table]

  • Joint global and local feature Evaluation
    [results table]

  • Model parameter comparison
    [results table]

  • Visualisation
    Soft attention captures strongly discriminative features, like the colourful blobs;
    hard attention can localise body parts, like the four boxes.
    [image]