DOI: 10.1145/3371425.3371491

Cross-enhancement transform two-stream 3D ConvNets for action recognition

Published: 19 December 2019

Abstract

Action recognition is an important research topic in computer vision: it is fundamental to visual understanding and has been applied in many fields. Because human actions can vary across environments, it is difficult to infer actions in completely different settings with the same structural model. To address this, we propose a Cross-Enhancement Transform Two-Stream 3D ConvNets algorithm that takes the action distribution characteristics of a specific dataset into account. The better-performing of the two streams serves as a teacher model and assists in training the other stream. The enhanced stream and the teacher stream are then combined to infer actions. We conduct experiments on the video datasets UCF-101, HMDB-51, and Kinetics-400, and the results confirm the effectiveness of our algorithm.
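
The abstract describes the training scheme only at a high level. As a rough illustration (not the authors' implementation), the sketch below shows one plausible reading in PyTorch: the better-performing stream (assumed here to be the optical-flow stream) is frozen as a teacher, the other stream is trained with a cross-entropy loss plus a feature-matching term toward the teacher, and the two streams' scores are averaged at inference. The Stream3D module, the MSE feature-matching loss, and the weight alpha are all illustrative assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Hypothetical stand-in for each 3D ConvNet stream; any clip-level
    # video backbone that exposes both features and logits would do.
    class Stream3D(nn.Module):
        def __init__(self, in_channels, num_classes):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Conv3d(in_channels, 64, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.AdaptiveAvgPool3d(1),
            )
            self.fc = nn.Linear(64, num_classes)

        def forward(self, clip):                    # clip: (B, C, T, H, W)
            feat = self.backbone(clip).flatten(1)   # (B, 64)
            return self.fc(feat), feat

    teacher = Stream3D(in_channels=2, num_classes=101)  # assumed: flow stream
    student = Stream3D(in_channels=3, num_classes=101)  # assumed: RGB stream
    teacher.eval()
    for p in teacher.parameters():                      # teacher stays fixed
        p.requires_grad_(False)

    def enhancement_loss(rgb_clip, flow_clip, labels, alpha=0.5):
        # Supervised loss on the student plus a term pulling the
        # student's features toward the teacher's (assumed MSE form).
        with torch.no_grad():
            _, t_feat = teacher(flow_clip)
        s_logits, s_feat = student(rgb_clip)
        return F.cross_entropy(s_logits, labels) + alpha * F.mse_loss(s_feat, t_feat)

    def predict(rgb_clip, flow_clip):
        # The enhanced student and the teacher are combined at inference
        # by averaging their class probabilities.
        with torch.no_grad():
            t_logits, _ = teacher(flow_clip)
            s_logits, _ = student(rgb_clip)
        return (t_logits.softmax(-1) + s_logits.softmax(-1)) / 2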

Cited By

• (2023) Multi-attention network for pedestrian intention prediction based on spatio-temporal feature fusion. Proceedings of the Institution of Mechanical Engineers, Part D: Journal of Automobile Engineering, 238(13), 4202-4215. DOI: 10.1177/09544070231190522. Online publication date: 2 Aug 2023.
• (2021) Human action recognition based on transfer learning approach. IEEE Access, 9, 82058-82069. DOI: 10.1109/ACCESS.2021.3086668. Online publication date: 2021.

Information

Published In

AIIPCC '19: Proceedings of the International Conference on Artificial Intelligence, Information Processing and Cloud Computing
December 2019
464 pages
ISBN: 9781450376334
DOI: 10.1145/3371425

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

• ASciE: Association for Science and Engineering

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. 3D ConvNets
2. action recognition
3. two-stream

Qualifiers

• Research article

Conference

AIIPCC '19
Sponsor: ASciE

Acceptance Rates

AIIPCC '19 paper acceptance rate: 78 of 211 submissions, 37%
Overall acceptance rate: 78 of 211 submissions, 37%
