Online Action Tube Detection via Resolving the Spatio-temporal Context Pattern

Published: 15 October 2018 · DOI: 10.1145/3240508.3240659

Abstract

Spatio-temporal action detection in video remains a challenging problem, given the complexity of the background, the variety of actions, and changes of viewpoint in unconstrained environments. Most current approaches solve the problem in two steps: first detecting actions at each frame, then linking the detections. This neglects the continuity of the action and operates in an offline, batch-processing manner. In this paper, we build an online action detection model that exploits the spatio-temporal coherence among action regions when inferring action categories and localizing positions. Specifically, we represent the spatio-temporal context pattern with an encoder-decoder model based on a convolutional recurrent network. The model accepts a video snippet as input and encodes the dynamic information of the action in the forward pass. During the backward pass, it resolves this information at each time instant for action detection by fusing the current static or motion cue. Additionally, we propose an incremental action tube generation algorithm that accomplishes action bounding-box association, action label determination, and temporal trimming in a single pass. Our model takes appearance, motion, or fused signals as input and is evaluated on two prevailing datasets, UCF-Sports and UCF-101. The experimental results demonstrate the effectiveness of our method, which achieves performance superior or comparable to existing approaches.
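The encoder-decoder formulation above can be made concrete with a short sketch. The following is a minimal illustration, not the authors' implementation: it assumes a ConvGRU-style recurrent cell and PyTorch, and the names (ConvGRUCell, SnippetEncoderDecoder), channel sizes, and snippet length are all hypothetical. The encoder consumes per-frame feature maps in temporal order; the decoder then runs back over the snippet, fusing each frame's static (or motion) cue with the accumulated state to produce a per-frame representation for detection.

```python
# A minimal, illustrative sketch (not the paper's code) of a forward-encode /
# backward-decode pass over a video snippet with a convolutional recurrent net.
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Convolutional GRU cell: gates computed by 3x3 convolutions (assumed design)."""
    def __init__(self, in_ch, hid_ch):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, 3, padding=1)  # update/reset
        self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, 3, padding=1)       # candidate state

    def forward(self, x, h):
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde

class SnippetEncoderDecoder(nn.Module):
    def __init__(self, feat_ch=256, hid_ch=256):
        super().__init__()
        self.hid_ch = hid_ch
        self.encoder = ConvGRUCell(feat_ch, hid_ch)
        self.decoder = ConvGRUCell(feat_ch, hid_ch)

    def forward(self, feats):
        # feats: (T, B, C, H, W) per-frame appearance or motion feature maps.
        T, B, C, H, W = feats.shape
        h = feats.new_zeros(B, self.hid_ch, H, W)
        for t in range(T):                    # forward pass: encode snippet dynamics
            h = self.encoder(feats[t], h)
        outputs = []
        for t in reversed(range(T)):          # backward pass: resolve the state at each
            h = self.decoder(feats[t], h)     # time instant, fusing the current frame cue
            outputs.append(h)
        outputs.reverse()
        # Each outputs[t] would feed per-frame classification / box-regression heads.
        return torch.stack(outputs)           # (T, B, hid_ch, H, W)

# Usage: an 8-frame snippet of 256-channel conv features at 14x14 resolution.
model = SnippetEncoderDecoder()
states = model(torch.randn(8, 2, 256, 14, 14))
print(states.shape)  # torch.Size([8, 2, 256, 14, 14])
```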
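The single-pass tube generation can likewise be sketched. This is an illustrative simplification under stated assumptions, not the paper's exact algorithm: tubes are extended greedily by IoU with their last linked box, per-class scores are summed online so the label can be read off at any time, and a tube that goes unmatched for more than a patience number of frames is closed, which stands in for the temporal trimming. The Tube structure and all thresholds are hypothetical.

```python
# Illustrative single-pass tube building: association, labeling, and trimming
# happen incrementally as each frame's detections arrive.

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

class Tube:
    def __init__(self, t, box, scores):
        self.boxes = [(t, box)]
        self.scores = list(scores)   # running per-class score sums
        self.misses = 0

    def extend(self, t, box, scores):
        self.boxes.append((t, box))
        self.scores = [s + q for s, q in zip(self.scores, scores)]
        self.misses = 0

    def label(self):
        # Online label determination: argmax of the accumulated class scores.
        return max(range(len(self.scores)), key=self.scores.__getitem__)

def update_tubes(active, finished, t, detections, iou_thr=0.3, patience=5):
    """One online step: detections is a list of (box, class_scores) at frame t."""
    unmatched = list(detections)
    for tube in active:
        last_box = tube.boxes[-1][1]
        # Greedy association: best-overlapping remaining detection, if any.
        best = max(unmatched, key=lambda d: iou(last_box, d[0]), default=None)
        if best is not None and iou(last_box, best[0]) >= iou_thr:
            tube.extend(t, best[0], best[1])
            unmatched.remove(best)
        else:
            tube.misses += 1
    for box, scores in unmatched:    # each unmatched detection seeds a new tube
        active.append(Tube(t, box, scores))
    # Temporal trimming stand-in: close tubes unmatched for too many frames.
    for tube in [tb for tb in active if tb.misses > patience]:
        active.remove(tube)
        finished.append(tube)
    return active, finished
```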


Cited By

  • (2024) Online Hierarchical Linking of Action Tubes for Spatio-Temporal Action Detection Based on Multiple Clues. IEEE Access, 12, 54661-54672. DOI: 10.1109/ACCESS.2024.3388532
  • (2023) Improved action proposals using fine-grained proposal features with recurrent attention models. Journal of Visual Communication and Image Representation, 90, 103709. DOI: 10.1016/j.jvcir.2022.103709
  • (2020) CFAD: Coarse-to-Fine Action Detector for Spatiotemporal Action Localization. Computer Vision - ECCV 2020, 510-527. DOI: 10.1007/978-3-030-58517-4_30
  • (2019) Relation Understanding in Videos. Proceedings of the 27th ACM International Conference on Multimedia, 2652-2656. DOI: 10.1145/3343031.3356082

    Published In

    MM '18: Proceedings of the 26th ACM international conference on Multimedia
    October 2018
    2167 pages
    ISBN:9781450356657
    DOI:10.1145/3240508
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 15 October 2018


    Author Tags

    1. encoder-decoder model
    2. online action tube generation
    3. spatio-temporal action detection

    Qualifiers

    • Research-article

    Funding Sources

    • National Science Foundation of China
    • National Engineering Laboratory Shenzhen Division for Video Technology
    • National Natural Science Foundation of China and Guangdong Province Scientific Research on Big Data

    Conference

    MM '18
    Sponsor:
    MM '18: ACM Multimedia Conference
    October 22 - 26, 2018
    Seoul, Republic of Korea

    Acceptance Rates

    MM '18 Paper Acceptance Rate 209 of 757 submissions, 28%;
    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
