Online Action Tube Detection via Resolving the Spatio-temporal Context Pattern

Published: 15 October 2018 · DOI: 10.1145/3240508.3240659

Abstract

Spatio-temporal action detection in video remains a challenging problem, given the complexity of the background, the variety of actions, and changes of viewpoint in unconstrained environments. Most current approaches solve the problem in two steps: first detecting actions at each frame, then linking the detections. This neglects the continuity of the action and operates in an offline, batch-processing manner. In this paper, we build an online action detection model that exploits the spatio-temporal coherence among action regions when inferring action categories and localizing positions. Specifically, we represent the spatio-temporal context pattern with an encoder-decoder model based on a convolutional recurrent network. The model accepts a video snippet as input and encodes the dynamic information of the action in the forward pass. During the backward pass, it resolves this information at each time instant for action detection by fusing the current static or motion cue. Additionally, we propose an incremental action tube generation algorithm that accomplishes action bounding-box association, action label determination, and temporal trimming in a single pass. Our model takes appearance, motion, or fused signals as input and is evaluated on two prevailing datasets, UCF-Sports and UCF-101. The experimental results demonstrate the effectiveness of our method, which achieves performance superior or comparable to existing approaches.
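The encoder-decoder formulation above can be made concrete with a short sketch. The following is a minimal illustration, not the authors' implementation: it assumes a ConvGRU-style recurrent cell and PyTorch, and the names (ConvGRUCell, SnippetEncoderDecoder), channel sizes, and snippet length are all hypothetical. The encoder consumes per-frame feature maps in temporal order; the decoder then runs back over the snippet, fusing each frame's static (or motion) cue with the accumulated state to produce a per-frame representation for detection.

```python
# A minimal, illustrative sketch (not the paper's code) of a forward-encode /
# backward-decode pass over a video snippet with a convolutional recurrent net.
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Convolutional GRU cell: gates computed by 3x3 convolutions (assumed design)."""
    def __init__(self, in_ch, hid_ch):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, 3, padding=1)  # update/reset
        self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, 3, padding=1)       # candidate state

    def forward(self, x, h):
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde

class SnippetEncoderDecoder(nn.Module):
    def __init__(self, feat_ch=256, hid_ch=256):
        super().__init__()
        self.hid_ch = hid_ch
        self.encoder = ConvGRUCell(feat_ch, hid_ch)
        self.decoder = ConvGRUCell(feat_ch, hid_ch)

    def forward(self, feats):
        # feats: (T, B, C, H, W) per-frame appearance or motion feature maps.
        T, B, C, H, W = feats.shape
        h = feats.new_zeros(B, self.hid_ch, H, W)
        for t in range(T):                    # forward pass: encode snippet dynamics
            h = self.encoder(feats[t], h)
        outputs = []
        for t in reversed(range(T)):          # backward pass: resolve the state at each
            h = self.decoder(feats[t], h)     # time instant, fusing the current frame cue
            outputs.append(h)
        outputs.reverse()
        # Each outputs[t] would feed per-frame classification / box-regression heads.
        return torch.stack(outputs)           # (T, B, hid_ch, H, W)

# Usage: an 8-frame snippet of 256-channel conv features at 14x14 resolution.
model = SnippetEncoderDecoder()
states = model(torch.randn(8, 2, 256, 14, 14))
print(states.shape)  # torch.Size([8, 2, 256, 14, 14])
```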
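The single-pass tube generation can likewise be sketched. This is an illustrative simplification under stated assumptions, not the paper's exact algorithm: tubes are extended greedily by IoU with their last linked box, per-class scores are summed online so the label can be read off at any time, and a tube that goes unmatched for more than a patience number of frames is closed, which stands in for the temporal trimming. The Tube structure and all thresholds are hypothetical.

```python
# Illustrative single-pass tube building: association, labeling, and trimming
# happen incrementally as each frame's detections arrive.

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

class Tube:
    def __init__(self, t, box, scores):
        self.boxes = [(t, box)]
        self.scores = list(scores)   # running per-class score sums
        self.misses = 0

    def extend(self, t, box, scores):
        self.boxes.append((t, box))
        self.scores = [s + q for s, q in zip(self.scores, scores)]
        self.misses = 0

    def label(self):
        # Online label determination: argmax of the accumulated class scores.
        return max(range(len(self.scores)), key=self.scores.__getitem__)

def update_tubes(active, finished, t, detections, iou_thr=0.3, patience=5):
    """One online step: detections is a list of (box, class_scores) at frame t."""
    unmatched = list(detections)
    for tube in active:
        last_box = tube.boxes[-1][1]
        # Greedy association: best-overlapping remaining detection, if any.
        best = max(unmatched, key=lambda d: iou(last_box, d[0]), default=None)
        if best is not None and iou(last_box, best[0]) >= iou_thr:
            tube.extend(t, best[0], best[1])
            unmatched.remove(best)
        else:
            tube.misses += 1
    for box, scores in unmatched:    # each unmatched detection seeds a new tube
        active.append(Tube(t, box, scores))
    # Temporal trimming stand-in: close tubes unmatched for too many frames.
    for tube in [tb for tb in active if tb.misses > patience]:
        active.remove(tube)
        finished.append(tube)
    return active, finished
```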


Cited By

  • (2024) Online Hierarchical Linking of Action Tubes for Spatio-Temporal Action Detection Based on Multiple Clues. IEEE Access, 12, 54661-54672. DOI: 10.1109/ACCESS.2024.3388532
  • (2023) Improved action proposals using fine-grained proposal features with recurrent attention models. Journal of Visual Communication and Image Representation, 90, 103709. DOI: 10.1016/j.jvcir.2022.103709
  • (2020) CFAD: Coarse-to-Fine Action Detector for Spatiotemporal Action Localization. Computer Vision - ECCV 2020, 510-527. DOI: 10.1007/978-3-030-58517-4_30
  • (2019) Relation Understanding in Videos. Proceedings of the 27th ACM International Conference on Multimedia, 2652-2656. DOI: 10.1145/3343031.3356082

    Published In

    MM '18: Proceedings of the 26th ACM international conference on Multimedia
    October 2018
    2167 pages
    ISBN:9781450356657
    DOI:10.1145/3240508
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 15 October 2018


    Author Tags

    1. encoder-decoder model
    2. online action tube generation
    3. spatio-temporal action detection

    Qualifiers

    • Research-article

    Funding Sources

    • National Science Foundation of China
    • National Engineering Laboratory Shenzhen Division for Video Technology
    • National Natural Science Foundation of China and Guangdong Province Scientific Research on Big Data

    Conference

    MM '18
    Sponsor:
    MM '18: ACM Multimedia Conference
    October 22 - 26, 2018
    Seoul, Republic of Korea

    Acceptance Rates

    MM '18 Paper Acceptance Rate 209 of 757 submissions, 28%;
    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
