PAN: Persistent Appearance Network with an Efficient Motion Cue for Fast Action Recognition

Published: 15 October 2019

Abstract

Despite the remarkable progress in video-based action recognition over the past several years, current state-of-the-art approaches still rely heavily on optical flow as the motion representation. However, computing optical flow in advance is computationally expensive, which prevents action recognition from running in real time. In this paper, we shed light on fast action recognition by lifting the reliance on optical flow. Inspired by the Persistence of Vision in the human visual system, we design a novel motion cue called Persistence of Appearance (PA), which enables the network to distill motion information directly from adjacent RGB frames. Our PA is derived from optical flow and focuses on the small displacements at motion boundaries. Compared with other motion representations, PA enables the network to achieve competitive accuracy on UCF101, while its inference speed reaches 1855 fps, over 120x faster than that of traditional optical-flow-based methods. In addition, we devise a decision strategy called Various-timescale inference Pooling (VIP) that equips the network with long-range temporal modeling across various timescales. We further incorporate the proposed PA and VIP into a unified framework called the Persistent Appearance Network (PAN). Compared with methods using only RGB frames, our carefully designed PAN achieves state-of-the-art results on three benchmark datasets, UCF101, HMDB51, and Kinetics, reaching 96.2%, 74.8%, and 82.5% accuracy respectively at a run-time speed of up to 595 fps. The code for this project is available at: https://github.com/zhang-can/PAN-PyTorch .
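The abstract does not spell out how PA is computed, and the authoritative implementation lives in the linked PAN-PyTorch repository. Purely as an illustration, the sketch below assumes that a PA-style cue can be formed by taking per-pixel L2 distances between shallow convolutional feature maps of adjacent RGB frames, which would naturally respond to the small displacements at motion boundaries described above; the module name, layer sizes, and the simple score-averaging stand-in for VIP at the end are all assumptions rather than the paper's actual design.

```python
import torch
import torch.nn as nn

class PersistenceOfAppearance(nn.Module):
    """Illustrative PA-style motion cue: per-pixel feature distances
    between adjacent RGB frames. The 8-channel 3x3 conv is an assumed
    placeholder, not the configuration used in the paper."""

    def __init__(self, in_channels=3, feat_channels=8):
        super().__init__()
        # One shallow conv layer, shared across all frames, extracts
        # low-level appearance features.
        self.shallow_conv = nn.Conv2d(in_channels, feat_channels,
                                      kernel_size=3, padding=1)

    def forward(self, frames):
        # frames: (B, T, 3, H, W), a short clip of adjacent RGB frames.
        b, t, c, h, w = frames.shape
        feats = self.shallow_conv(frames.reshape(b * t, c, h, w))
        feats = feats.reshape(b, t, -1, h, w)
        # Per-pixel L2 distance between feature maps of neighboring
        # frames responds strongly at motion boundaries.
        diff = feats[:, 1:] - feats[:, :-1]                 # (B, T-1, C', H, W)
        return diff.pow(2).sum(dim=2, keepdim=True).sqrt()  # (B, T-1, 1, H, W)

clip = torch.randn(2, 8, 3, 112, 112)       # two 8-frame clips
pa_maps = PersistenceOfAppearance()(clip)   # -> (2, 7, 1, 112, 112)

# A crude stand-in for VIP-style inference: average class scores from
# clips sampled at several timescales (the real VIP strategy is more
# involved; see the paper and repository).
def multi_timescale_scores(model, clips_by_timescale):
    scores = [model(clip) for clip in clips_by_timescale]
    return torch.stack(scores).mean(dim=0)
```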



    Published In

    MM '19: Proceedings of the 27th ACM International Conference on Multimedia
    October 2019
    2794 pages
    ISBN:9781450368896
    DOI:10.1145/3343031

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 15 October 2019


    Author Tags

    1. fast action recognition
    2. motion representation
    3. persistence of appearance
    4. persistent appearance network

    Qualifiers

    • Research-article

    Funding Sources

    • National Engineering Laboratory for Video Technology - Shenzhen Division
    • Shenzhen Municipal Development and Reform Commission
    • Aoto-PKUSZ Joint Lab

    Conference

    MM '19

    Acceptance Rates

MM '19 Paper Acceptance Rate: 252 of 936 submissions, 27%
Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%

