PAN: Persistent Appearance Network with an Efficient Motion Cue for Fast Action Recognition

Published: 15 October 2019

Abstract

Despite the remarkable progress in video-based action recognition over the past several years, current state-of-the-art approaches still rely heavily on optical flow as the motion representation. However, computing optical flow in advance is computationally expensive, which prevents action recognition from running in real time. In this paper, we shed light on fast action recognition by lifting the reliance on optical flow. Inspired by the Persistence of Vision in the human visual system, we design a novel motion cue called Persistence of Appearance (PA), which enables the network to distill motion information directly from adjacent RGB frames. Our PA is derived from optical flow and focuses on the small displacements at motion boundaries. Compared with other motion representations, PA enables the network to achieve competitive accuracy on UCF101, while its inference speed reaches 1855 fps, over 120x faster than that of traditional optical-flow-based methods. In addition, we devise a decision strategy called Various-timescale inference Pooling (VIP) that equips the network with long-range temporal modeling across various timescales. We further incorporate the proposed PA and VIP into a unified framework called the Persistent Appearance Network (PAN). Compared with methods using only RGB frames, our carefully designed PAN achieves state-of-the-art results on three benchmark datasets, UCF101, HMDB51, and Kinetics, reaching 96.2%, 74.8%, and 82.5% accuracy respectively at a run-time speed of up to 595 fps. The code for this project is available at: https://github.com/zhang-can/PAN-PyTorch .
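The abstract does not spell out how PA is computed, and the authoritative implementation lives in the linked PAN-PyTorch repository. Purely as an illustration, the sketch below assumes that a PA-style cue can be formed by taking per-pixel L2 distances between shallow convolutional feature maps of adjacent RGB frames, which would naturally respond to the small displacements at motion boundaries described above; the module name, layer sizes, and the simple score-averaging stand-in for VIP at the end are all assumptions rather than the paper's actual design.

```python
import torch
import torch.nn as nn

class PersistenceOfAppearance(nn.Module):
    """Illustrative PA-style motion cue: per-pixel feature distances
    between adjacent RGB frames. The 8-channel 3x3 conv is an assumed
    placeholder, not the configuration used in the paper."""

    def __init__(self, in_channels=3, feat_channels=8):
        super().__init__()
        # One shallow conv layer, shared across all frames, extracts
        # low-level appearance features.
        self.shallow_conv = nn.Conv2d(in_channels, feat_channels,
                                      kernel_size=3, padding=1)

    def forward(self, frames):
        # frames: (B, T, 3, H, W), a short clip of adjacent RGB frames.
        b, t, c, h, w = frames.shape
        feats = self.shallow_conv(frames.reshape(b * t, c, h, w))
        feats = feats.reshape(b, t, -1, h, w)
        # Per-pixel L2 distance between feature maps of neighboring
        # frames responds strongly at motion boundaries.
        diff = feats[:, 1:] - feats[:, :-1]                 # (B, T-1, C', H, W)
        return diff.pow(2).sum(dim=2, keepdim=True).sqrt()  # (B, T-1, 1, H, W)

clip = torch.randn(2, 8, 3, 112, 112)       # two 8-frame clips
pa_maps = PersistenceOfAppearance()(clip)   # -> (2, 7, 1, 112, 112)

# A crude stand-in for VIP-style inference: average class scores from
# clips sampled at several timescales (the real VIP strategy is more
# involved; see the paper and repository).
def multi_timescale_scores(model, clips_by_timescale):
    scores = [model(clip) for clip in clips_by_timescale]
    return torch.stack(scores).mean(dim=0)
```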



    Published In

    MM '19: Proceedings of the 27th ACM International Conference on Multimedia
    October 2019
    2794 pages
    ISBN:9781450368896
    DOI:10.1145/3343031

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 15 October 2019


    Author Tags

    1. fast action recognition
    2. motion representation
    3. persistence of appearance
    4. persistent appearance network

    Qualifiers

    • Research-article

    Funding Sources

    • National Engineering Laboratory for Video Technology - Shenzhen Division
    • Shenzhen Municipal Development and Reform Commission
    • Aoto-PKUSZ Joint Lab

    Conference

    MM '19

    Acceptance Rates

MM '19 Paper Acceptance Rate: 252 of 936 submissions, 27%
Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%

