Video Understanding Papers and Datasets

Reposted from: https://github.com/sujiongming/awesome-video-understanding

Awesome Video Understanding

Understanding Video: Perceiving dynamic actions could be a huge advance in how software makes sense of the world. (from MIT Technology Review, December 6, 2017)

A list of resources for video understanding. Most of these papers can be found via scholar.google.com.

This list was last updated on December 13, 2017.

Table of Contents

  • Video Classification
  • Action Recognition
  • Video Captioning: to be added
  • Temporal Action Detection: to be added
  • Video Datasets

Papers

Video Classification

  • image-based methods
    • Zha S, Luisier F, Andrews W, et al. Exploiting Image-trained CNN Architectures for Unconstrained Video Classification[C]//Proceedings of the British Machine Vision Conference. 2015.
    • Sánchez J, Perronnin F, Mensink T, et al. Image Classification with the Fisher Vector: Theory and Practice[J]. International Journal of Computer Vision, 2013, 105: 222-245.
  • CNN-based methods (a minimal 3D-CNN sketch follows this list)
    • Karpathy A, Toderici G, Shetty S, et al. Large-scale video classification with convolutional neural networks[C]//Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 2014: 1725-1732.
    • Tran D, Bourdev L, Fergus R, et al. C3D: generic features for video analysis[J]. arXiv preprint arXiv:1412.0767, 2014.
    • Fernando B, Gould S. Learning end-to-end video classification with rank-pooling[C]//International Conference on Machine Learning. 2016: 1187-1196.
  • RNN-based methods
    • Wu Z, Wang X, Jiang Y G, et al. Modeling spatial-temporal clues in a hybrid deep learning framework for video classification[C]//Proceedings of the 23rd ACM international conference on Multimedia. ACM, 2015: 461-470.
    • Yue-Hei Ng J, Hausknecht M, Vijayanarasimhan S, et al. Beyond short snippets: Deep networks for video classification[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2015: 4694-4702.
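
To ground the CNN-based entries above (e.g., Tran et al.'s C3D), the following is a minimal PyTorch sketch of a 3D-convolutional video classifier. The two-block depth, channel widths, and class count are illustrative assumptions, not the published C3D architecture.

```python
# Minimal 3D-CNN video classifier in the spirit of C3D (Tran et al., 2014).
# Depth, widths, and the 16x112x112 clip size are illustrative assumptions.
import torch
import torch.nn as nn

class Tiny3DConvNet(nn.Module):
    def __init__(self, num_classes=101):
        super().__init__()
        self.features = nn.Sequential(
            # Input: (batch, 3 RGB channels, frames, height, width)
            nn.Conv3d(3, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),  # pool space only, keep time
            nn.Conv3d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=2),          # pool time and space
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),  # global spatio-temporal average
            nn.Flatten(),
            nn.Linear(128, num_classes),
        )

    def forward(self, clip):
        return self.classifier(self.features(clip))

clip = torch.randn(2, 3, 16, 112, 112)  # two 16-frame 112x112 RGB clips
logits = Tiny3DConvNet(num_classes=101)(clip)
print(logits.shape)  # torch.Size([2, 101])
```

The key difference from the image-based entries is the 3D convolution, which slides over time as well as space, so motion is learned jointly with appearance.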

Action Recognition

  • CNN-based methods (see the two-stream sketch at the end of this section)
    • Ji S, Xu W, Yang M, et al. 3D Convolutional Neural Networks for Human Action Recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(1): 221-231.
    • Tran D, Bourdev L, Fergus R, et al. C3D: generic features for video analysis[J]. arXiv preprint arXiv:1412.0767, 2014.
    • Varol G, Laptev I, Schmid C. Long-term temporal convolutions for action recognition[J]. arXiv preprint arXiv:1604.04494, 2016.
    • Sun L, Jia K, Yeung D Y, et al. Human action recognition using factorized spatio-temporal convolutional networks[C]//Proceedings of the IEEE International Conference on Computer Vision. 2015: 4597-4605.
    • Simonyan K, Zisserman A. Two-stream convolutional networks for action recognition in videos[C]//Advances in neural information processing systems. 2014: 568-576.
    • Ye H, Wu Z, Zhao R W, et al. Evaluating two-stream CNN for video classification[C]//Proceedings of the 5th ACM on International Conference on Multimedia Retrieval. ACM, 2015: 435-442.
    • Wang L, Qiao Y, Tang X. Action recognition with trajectory-pooled deep-convolutional descriptors[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2015: 4305-4314.
    • Feichtenhofer C, Pinz A, Zisserman A. Convolutional two-stream network fusion for video action recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 1933-1941.
    • Wang L, Xiong Y, Wang Z, et al. Temporal segment networks: Towards good practices for deep action recognition[C]//European Conference on Computer Vision. Springer International Publishing, 2016: 20-36.
    • Zhang B, Wang L, Wang Z, et al. Real-time action recognition with enhanced motion vector CNNs[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 2718-2726.
    • Wang X, Farhadi A, Gupta A. Actions ~ transformations[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 2658-2667.
    • Zhu W, Hu J, Sun G, et al. A key volume mining deep framework for action recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 1991-1999.
    • Bilen H, Fernando B, Gavves E, et al. Dynamic image networks for action recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 3034-3042.
    • Fernando B, Anderson P, Hutter M, et al. Discriminative hierarchical rank pooling for activity recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 1924-1932.
    • Cherian A, Fernando B, Harandi M, et al. Generalized rank pooling for activity recognition[J]. arXiv preprint arXiv:1704.02112, 2017.
    • Fernando B, Gavves E, Oramas J, et al. Rank pooling for action recognition[J]. IEEE transactions on pattern analysis and machine intelligence, 2017, 39(4): 773-787.
    • Fernando B, Gould S. Discriminatively Learned Hierarchical Rank Pooling Networks[J]. arXiv preprint arXiv:1705.10420, 2017.
  • RNN-based methods
    • Baccouche M, Mamalet F, Wolf C, et al. Sequential deep learning for human action recognition[C]//International Workshop on Human Behavior Understanding. Springer, Berlin, Heidelberg, 2011: 29-39.
    • Donahue J, Anne Hendricks L, Guadarrama S, et al. Long-term recurrent convolutional networks for visual recognition and description[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2015: 2625-2634.
    • Veeriah V, Zhuang N, Qi G J. Differential recurrent neural networks for action recognition[C]//Proceedings of the IEEE International Conference on Computer Vision. 2015: 4041-4049.
    • Li Q, Qiu Z, Yao T, et al. Action recognition by learning deep multi-granular spatio-temporal video representation[C]//Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval. ACM, 2016: 159-166.
    • Wu Z, Jiang Y G, Wang X, et al. Multi-stream multi-class fusion of deep networks for video classification[C]//Proceedings of the 2016 ACM on Multimedia Conference. ACM, 2016: 791-800.
    • Sharma S, Kiros R, Salakhutdinov R. Action recognition using visual attention[J]. arXiv preprint arXiv:1511.04119, 2015.
    • Li Z, Gavves E, Jain M, et al. VideoLSTM convolves, attends and flows for action recognition[J]. arXiv preprint arXiv:1607.01794, 2016.
  • Unsupervised learning methods
    • Taylor G W, Fergus R, LeCun Y, et al. Convolutional learning of spatio-temporal features[C]//European conference on computer vision. Springer, Berlin, Heidelberg, 2010: 140-153.
    • Le Q V, Zou W Y, Yeung S Y, et al. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis[C]//Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011: 3361-3368.
    • Yan X, Chang H, Shan S, et al. Modeling video dynamics with deep dynencoder[C]//European Conference on Computer Vision. Springer, Cham, 2014: 215-230.
    • Srivastava N, Mansimov E, Salakhutdinov R. Unsupervised learning of video representations using LSTMs[C]//International Conference on Machine Learning. 2015: 843-852.
    • Pan Y, Li Y, Yao T, et al. Learning Deep Intrinsic Video Representation by Exploring Temporal Coherence and Graph Structure[C]//IJCAI. 2016: 3832-3838.
    • Ballas N, Yao L, Pal C, et al. Delving deeper into convolutional networks for learning video representations[J]. arXiv preprint arXiv:1511.06432, 2015.
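
Among the CNN-based entries, the two-stream design of Simonyan & Zisserman recurs throughout this section, so a hedged sketch may help: one network sees an RGB frame, another sees stacked optical flow, and their class scores are averaged (late fusion). The tiny backbones below are placeholder assumptions, not the paper's networks.

```python
# Late-fusion two-stream sketch after Simonyan & Zisserman (2014).
# The tiny backbone is a placeholder standing in for each stream's
# much deeper CNN; the fusion logic is the point.
import torch
import torch.nn as nn

def tiny_backbone(in_channels, num_classes):
    # Placeholder per-stream CNN: conv -> global average pool -> linear.
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(32, num_classes),
    )

class TwoStream(nn.Module):
    def __init__(self, num_classes=101, flow_stack=10):
        super().__init__()
        self.spatial = tiny_backbone(3, num_classes)                # one RGB frame
        self.temporal = tiny_backbone(2 * flow_stack, num_classes)  # x/y flow stack

    def forward(self, rgb, flow):
        # Late fusion: average the two streams' class probabilities.
        return (self.spatial(rgb).softmax(dim=1)
                + self.temporal(flow).softmax(dim=1)) / 2

rgb = torch.randn(4, 3, 224, 224)    # appearance input
flow = torch.randn(4, 20, 224, 224)  # 10 stacked optical-flow fields (x and y)
probs = TwoStream()(rgb, flow)
print(probs.shape)  # torch.Size([4, 101])
```

Several later entries (e.g., Feichtenhofer et al.; Wang et al.'s temporal segment networks) largely vary where and how these two streams are fused.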

Video Datasets

  • HMDB51
    • Kuehne H, Jhuang H, Garrote E, et al. HMDB: a large video database for human motion recognition[C]//Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011: 2556-2563.
    • state-of-the-art: 75%
      • Lan Z, Zhu Y, Hauptmann A G. Deep Local Video Feature for Action Recognition[J]. arXiv preprint arXiv:1701.07368, 2017.
  • UCF-101
    • Soomro K, Zamir A R, Shah M. UCF101: A dataset of 101 human actions classes from videos in the wild[J]. arXiv preprint arXiv:1212.0402, 2012.
    • state-of-the-art: 95.6%
      • Diba A, Sharma V, Van Gool L. Deep temporal linear encoding networks[J]. arXiv preprint arXiv:1611.06678, 2016.
  • ActivityNet
    • Caba Heilbron F, Escorcia V, Ghanem B, et al. Activitynet: A large-scale video benchmark for human activity understanding[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 961-970.
    • state-of-the-art: 91.3%
      • Wang L, Xiong Y, Lin D, et al. UntrimmedNets for Weakly Supervised Action Recognition and Detection[J]. arXiv preprint arXiv:1703.03329, 2017.
  • Sports-1M
    • Karpathy A, Toderici G, Shetty S, et al. Large-scale video classification with convolutional neural networks[C]//Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 2014: 1725-1732.
    • state-of-the-art: 67.6%
      • Abu-El-Haija S, Kothari N, Lee J, et al. YouTube-8M: A large-scale video classification benchmark[J]. arXiv preprint arXiv:1609.08675, 2016.
  • YouTube-8M
    • Abu-El-Haija S, Kothari N, Lee J, et al. YouTube-8M: A large-scale video classification benchmark[J]. arXiv preprint arXiv:1609.08675, 2016.
    • state-of-the-art: 84.967%
      • Miech A, Laptev I, Sivic J. Learnable pooling with Context Gating for video classification[J]. arXiv preprint arXiv:1706.06905, 2017.
  • Kinetics
    • Kay W, Carreira J, Simonyan K, et al. The Kinetics Human Action Video Dataset[J]. arXiv preprint arXiv:1705.06950, 2017.
    • state-of-the-art: ?
  • Moments in Time Dataset
    • Monfort M, Zhou B, Bargal S A, et al. Moments in Time Dataset: one million videos for event understanding[R]. Technical report, 2017.
    • state-of-the-art: ?
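
Most of these datasets ship as raw video files, and nearly every method above consumes them as fixed-length frame clips. Below is a hedged OpenCV sketch of that common preprocessing step; the file name, clip length, and frame size are illustrative assumptions.

```python
# Uniformly sample a fixed-length RGB clip from a video file with OpenCV.
# Path, clip length, and frame size below are illustrative assumptions.
import cv2
import numpy as np

def sample_clip(path, num_frames=16, size=(112, 112)):
    """Return a (num_frames, height, width, 3) uint8 RGB array."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, size)
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return np.stack(frames)

# Hypothetical file name, following UCF-101's naming scheme.
clip = sample_clip("v_ApplyEyeMakeup_g01_c01.avi")
print(clip.shape)  # e.g. (16, 112, 112, 3)
```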