Recognizing 50 human action categories of web videos

Kishore K. Reddy & Mubarak Shah

Abstract

Recognizing human actions across a large number of categories in unconstrained web videos is considerably harder than on datasets such as KTH (6 actions), IXMAS (13 actions), and Weizmann (10 actions). Camera motion, varying viewpoints, large interclass variations, cluttered backgrounds, occlusions, poor illumination, and the low quality of web videos cause most state-of-the-art action recognition approaches to fail. A larger number of categories, some of them easily confused with one another, adds to the difficulty. In this paper, we propose using scene context information obtained from moving and stationary pixels in the key frames, in conjunction with motion features, to solve the action recognition problem on a large (50 actions) dataset of web videos. We combine early and late fusion of multiple features to handle the large number of categories, and we demonstrate that scene context is an important feature for action recognition on large datasets. The proposed method does not require video stabilization, person detection, or tracking and pruning of features. The approach performs well on a large number of action categories: it has been tested on the UCF50 dataset (50 action categories), an extension of the UCF YouTube Action (UCF11) dataset (11 action categories), and, for comparison, on the KTH and HMDB51 datasets.
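
The difference between the early and late fusion mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the L1 normalization, the equal-weight averaging, and the choice of which features are fused at which stage are all assumptions made for illustration. Early fusion here means concatenating feature descriptors before classification; late fusion means combining the per-class scores of separately trained classifiers.

```python
from typing import List, Optional, Sequence


def l1_normalize(hist: Sequence[float]) -> List[float]:
    """Scale a histogram so its entries sum to 1 (left unchanged if all zero)."""
    total = sum(hist)
    return [v / total for v in hist] if total > 0 else list(hist)


def early_fusion(descriptors: Sequence[Sequence[float]]) -> List[float]:
    """Early fusion (assumed form): concatenate normalized descriptors
    into one feature vector that a single classifier would consume."""
    fused: List[float] = []
    for hist in descriptors:
        fused.extend(l1_normalize(hist))
    return fused


def late_fusion(score_lists: Sequence[Sequence[float]],
                weights: Optional[Sequence[float]] = None) -> List[float]:
    """Late fusion (assumed form): weighted average of the per-class
    scores produced by separately trained classifiers."""
    if weights is None:
        weights = [1.0 / len(score_lists)] * len(score_lists)
    num_classes = len(score_lists[0])
    return [sum(w * scores[c] for w, scores in zip(weights, score_lists))
            for c in range(num_classes)]


def predict(scores: Sequence[float]) -> int:
    """Return the index of the highest-scoring class."""
    return max(range(len(scores)), key=lambda c: scores[c])


# Hypothetical toy inputs: two scene-context histograms (moving and
# stationary pixels) and class scores from two hypothetical classifiers.
scene_vector = early_fusion([[2.0, 2.0], [1.0, 3.0]])
combined = late_fusion([[0.1, 0.6, 0.3],   # scores from motion features
                        [0.5, 0.3, 0.2]])  # scores from scene context
label = predict(combined)
```

In this toy setup the two scene histograms become a single four-dimensional vector, while the motion and scene classifiers stay separate until their scores are averaged.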

Author information

Authors and Affiliations

4000 Central Florida Blvd, Orlando, USA
Kishore K. Reddy & Mubarak Shah

Authors

Kishore K. Reddy
Mubarak Shah

Corresponding author

Correspondence to Kishore K. Reddy.

About this article

Cite this article

Reddy, K.K., Shah, M. Recognizing 50 human action categories of web videos. Machine Vision and Applications 24, 971–981 (2013). https://doi.org/10.1007/s00138-012-0450-4

Received: 22 January 2012
Revised: 05 September 2012
Accepted: 07 September 2012
Published: 16 November 2012
Issue Date: July 2013
DOI: https://doi.org/10.1007/s00138-012-0450-4
