DOI: 10.1145/3371425.3371491

Cross-enhancement transform two-stream 3D ConvNets for action recognition

Published: 19 December 2019

Abstract

Action recognition is an important research topic in computer vision: it is fundamental to visual understanding and has been applied in many fields. Because human actions can vary across environments, it is difficult to infer actions in completely different settings with the same structural model. To address this, we propose a Cross-Enhancement Transform Two-Stream 3D ConvNets algorithm that takes the action distribution characteristics of a specific dataset into account. The better-performing of the two streams serves as a teacher model and assists in training the other stream. The enhanced stream and the teacher stream are then combined to infer actions. We conduct experiments on the video datasets UCF-101, HMDB-51, and Kinetics-400, and the results confirm the effectiveness of our algorithm.
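
The abstract describes the training scheme only at a high level. As a rough illustration (not the authors' implementation), the sketch below shows one plausible reading in PyTorch: the better-performing stream (assumed here to be the optical-flow stream) is frozen as a teacher, the other stream is trained with a cross-entropy loss plus a feature-matching term toward the teacher, and the two streams' scores are averaged at inference. The Stream3D module, the MSE feature-matching loss, and the weight alpha are all illustrative assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Hypothetical stand-in for each 3D ConvNet stream; any clip-level
    # video backbone that exposes both features and logits would do.
    class Stream3D(nn.Module):
        def __init__(self, in_channels, num_classes):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Conv3d(in_channels, 64, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.AdaptiveAvgPool3d(1),
            )
            self.fc = nn.Linear(64, num_classes)

        def forward(self, clip):                    # clip: (B, C, T, H, W)
            feat = self.backbone(clip).flatten(1)   # (B, 64)
            return self.fc(feat), feat

    teacher = Stream3D(in_channels=2, num_classes=101)  # assumed: flow stream
    student = Stream3D(in_channels=3, num_classes=101)  # assumed: RGB stream
    teacher.eval()
    for p in teacher.parameters():                      # teacher stays fixed
        p.requires_grad_(False)

    def enhancement_loss(rgb_clip, flow_clip, labels, alpha=0.5):
        # Supervised loss on the student plus a term pulling the
        # student's features toward the teacher's (assumed MSE form).
        with torch.no_grad():
            _, t_feat = teacher(flow_clip)
        s_logits, s_feat = student(rgb_clip)
        return F.cross_entropy(s_logits, labels) + alpha * F.mse_loss(s_feat, t_feat)

    def predict(rgb_clip, flow_clip):
        # The enhanced student and the teacher are combined at inference
        # by averaging their class probabilities.
        with torch.no_grad():
            t_logits, _ = teacher(flow_clip)
            s_logits, _ = student(rgb_clip)
        return (t_logits.softmax(-1) + s_logits.softmax(-1)) / 2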

Cited By

• (2023) Multi-attention network for pedestrian intention prediction based on spatio-temporal feature fusion. Proceedings of the Institution of Mechanical Engineers, Part D: Journal of Automobile Engineering, 238(13), 4202-4215. DOI: 10.1177/09544070231190522. Online publication date: 2 Aug 2023.
• (2021) Human action recognition based on transfer learning approach. IEEE Access, 9, 82058-82069. DOI: 10.1109/ACCESS.2021.3086668. Online publication date: 2021.

Information

Published In

AIIPCC '19: Proceedings of the International Conference on Artificial Intelligence, Information Processing and Cloud Computing
December 2019
464 pages
ISBN: 9781450376334
DOI: 10.1145/3371425

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

• ASciE: Association for Science and Engineering

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. 3D ConvNets
2. action recognition
3. two-stream

Qualifiers

• Research article

Conference

AIIPCC '19
Sponsor: ASciE

Acceptance Rates

AIIPCC '19 paper acceptance rate: 78 of 211 submissions, 37%
Overall acceptance rate: 78 of 211 submissions, 37%
