[Paper note] Video-based Person Re-identification with Accumulative Motion Context

Highlight

  • Two streams: spatial (appearance) + temporal (optical flow).
  • A motion network pre-trained on optical flow predicts the flow, and is further trained end-to-end with the rest of the model.
  • Fusion of motion and spatial features.
  • Multi-loss: siamese re-id loss + classification loss (see the sketch after this list).
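
As a concrete reference, here is a minimal PyTorch-style sketch of the two-stream + fusion + RNN idea. All layer sizes and module names (`TwoStreamReID`, the stand-in conv stacks) are illustrative assumptions, not the paper's exact architecture; the motion stream here simply consumes stacked frame pairs rather than the paper's flow-predicting network.

```python
import torch
import torch.nn as nn

class TwoStreamReID(nn.Module):
    """Illustrative two-stream skeleton: a spatial CNN on RGB frames, a
    stand-in motion stream on stacked frame pairs, concatenation fusion,
    and an RNN that accumulates motion context over the sequence."""
    def __init__(self, num_ids: int, feat_dim: int = 128):
        super().__init__()
        # Spatial stream (input assumed 3 x 64 x 32, as in the paper's setting)
        self.spatial = nn.Sequential(
            nn.Conv2d(3, 16, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
        )
        # Motion stream: stand-in for the pre-trained motion network
        # (here it just takes two stacked RGB frames, 6 channels)
        self.motion = nn.Sequential(
            nn.Conv2d(6, 16, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.proj = nn.Linear(64 * 16 * 8, feat_dim)  # after concat fusion
        self.rnn = nn.RNN(feat_dim, feat_dim, batch_first=True)
        self.classifier = nn.Linear(feat_dim, num_ids)

    def forward(self, frames, frame_pairs):
        # frames: (B, T, 3, 64, 32); frame_pairs: (B, T, 6, 64, 32)
        B, T = frames.shape[:2]
        s = self.spatial(frames.flatten(0, 1))       # (B*T, 32, 16, 8)
        m = self.motion(frame_pairs.flatten(0, 1))   # (B*T, 32, 16, 8)
        fused = torch.cat([s, m], dim=1).flatten(1)  # concatenation fusion
        feats = self.proj(fused).view(B, T, -1)      # per-frame features
        out, _ = self.rnn(feats)                     # accumulate motion context
        seq_feat = out.mean(dim=1)                   # temporal average pooling
        return seq_feat, self.classifier(seq_feat)   # for siamese + softmax losses
```

Training would then combine a cross-entropy loss on the logits with a siamese (contrastive distance) loss on pairs of `seq_feat` vectors, matching the multi-loss above.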

Model

  • Structure of the whole model: [figure: model]
  • Structure of the motion network (pre-trained on LK or EpicFlow optical flow): [figure: motion network]
  • Structure of the spatial network: [figure: spatial network]
  • Spatial fusion methods: concatenate, sum, max (see the sketch after this list).
  • Spatial fusion position: can be at any layer of the spatial network.
  • Motion context accumulation: via a vanilla RNN (not an LSTM in this paper).
  • Multi-loss: siamese (distance) loss + classification (softmax) loss.
  • Pre-train the motion network on optical flow with a smoothed L1 loss, computed at three resolutions l = 1, 2, 3 of the flow estimate:
    • $L^{(l)}_{\mathrm{motion}}\big(e^{(l)}, g^{(l)}\big) = \sum_{i,j,k} \mathrm{smooth}_{L1}\big(e^{(l)}_{i,j,k} - g^{(l)}_{i,j,k}\big)$
    • $\mathrm{smooth}_{L1}(\theta) = \begin{cases} 0.5\,\theta^2 & \text{if } |\theta| < 1 \\ |\theta| - 0.5 & \text{otherwise} \end{cases}$
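
The smooth-L1 term above transcribes directly into code. A minimal sketch for one resolution l; this is equivalent to PyTorch's built-in `torch.nn.functional.smooth_l1_loss` with `beta=1.0` and `reduction='sum'`:

```python
import torch

def smooth_l1(theta: torch.Tensor) -> torch.Tensor:
    # 0.5 * theta^2 where |theta| < 1, else |theta| - 0.5
    abs_t = theta.abs()
    return torch.where(abs_t < 1, 0.5 * theta ** 2, abs_t - 0.5)

def motion_loss(e: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
    # Sum over all positions (i, j) and flow channels k of the difference
    # between the estimated flow e and the ground-truth (LK/EpicFlow) flow g.
    return smooth_l1(e - g).sum()
```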

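The three spatial fusion methods listed above reduce to simple tensor operations on same-shaped feature maps; a quick sketch (shapes are illustrative):

```python
import torch

s = torch.randn(8, 32, 16, 8)  # spatial-stream feature maps (B, C, H, W)
m = torch.randn(8, 32, 16, 8)  # motion-stream feature maps, same shape

fused_cat = torch.cat([s, m], dim=1)  # concatenate: (8, 64, 16, 8)
fused_sum = s + m                     # element-wise sum: (8, 32, 16, 8)
fused_max = torch.max(s, m)           # element-wise max: (8, 32, 16, 8)
```

Sum and max require matching channel counts, while concatenation doubles the channel count of the following layer; the ablation below finds concatenation at Max-pooling2 works best.
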
Experiment

  • Datasets:
    • iLIDS-VID: 300 IDs, 2 camera views, sequence lengths of 23 to 192 frames
    • PRID-2011: 749 IDs, 2 camera views, sequence lengths of 5 to 675 frames
  • Settings
    • Input to the spatial net: 64 x 32; input to the motion net: 128 x 64
    • Data augmentation in both the training and test phases
    • Experiments repeated 10 times on different training/test splits
    • Training sub-sequences of 16 frames
    • Test sequence length of 128 frames (see the sampling sketch at the end of this section)
  • Ablation study
    • Motion information: compare LK vs. EpicFlow optical flow, used either as direct input or for pre-training followed by end-to-end training. End-to-end training with EpicFlow performs best.
    • Spatial fusion method and location: concatenation fused at Max-pooling2 performs best.
  • Comparison with the state of the art: sets a new state of the art on PRID-2011.
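
The 16-frame training sub-sequences and 128-frame test sequences amount to simple index selection. A sketch under assumptions: how short tracks are padded or looped is not specified in the note, so the modulo looping below is my guess.

```python
import random

def train_indices(seq_len: int, sub_len: int = 16) -> list:
    """Randomly pick a 16-frame training sub-sequence from a track."""
    start = random.randint(0, max(seq_len - sub_len, 0))
    return [i % seq_len for i in range(start, start + sub_len)]

def test_indices(seq_len: int, test_len: int = 128) -> list:
    """Take the first 128 frames at test time (looping short tracks)."""
    return [i % seq_len for i in range(test_len)]
```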