Video benchmarks of human action datasets: a review

Published in: Artificial Intelligence Review

Abstract

Vision-based human activity recognition is becoming a popular area of research owing to its wide range of applications, such as security and surveillance, human–computer interaction, patient monitoring, and robotics. Over the past two decades, numerous publicly available human action and activity datasets have been reported, differing in modality, viewpoint, actors, actions, and application. The objective of this survey is to outline the different types of video datasets and to highlight their merits and demerits under practical considerations. Based on the information each dataset provides, we categorise them into RGB (Red, Green, and Blue) and RGB-D (depth) datasets. The most prominent challenges these datasets pose are occlusion, illumination variation, view variation, annotation, and the fusion of modalities. We discuss the key specifications of each dataset, such as resolution, frame rate, number of actions and actors, background, and application domain. We also present, in tabular form, the state-of-the-art algorithms that achieve the best performance on these datasets. Compared with earlier surveys, our work offers a better-organised comparison of datasets, their challenges, and the latest evaluation techniques applied to them.
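The modality-based categorisation and key specifications described above can be sketched as a small dataset registry. This is a minimal illustration only: the dataset names and specification values below are approximate examples from the broader literature, not figures taken from this survey's tables.

```python
from dataclasses import dataclass

@dataclass
class ActionDataset:
    name: str          # dataset name
    modality: str      # "RGB" or "RGB-D"
    resolution: tuple  # (width, height) in pixels
    fps: int           # frame rate
    num_actions: int   # number of action classes

# Illustrative entries; specification values are approximate.
datasets = [
    ActionDataset("KTH", "RGB", (160, 120), 25, 6),
    ActionDataset("MSR Action3D", "RGB-D", (320, 240), 15, 20),
]

def by_modality(registry, modality):
    """Filter the registry by capture modality (RGB vs RGB-D)."""
    return [d for d in registry if d.modality == modality]

print([d.name for d in by_modality(datasets, "RGB")])  # -> ['KTH']
```

A registry of this shape makes the survey's comparison axes (modality, resolution, frame rate, number of actions) directly queryable when choosing a benchmark.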




References

  • Abbasnejad I, Sridharan S, Denman S, Fookes C, Lucey S (2016) Complex event detection using joint max margin and semantic features. In: International conference on digital image computing: techniques and applications, Gold Coast

  • Agahian S, Negin F, Köse C (2018) Improving bag-of-poses with semi-temporal pose descriptors for skeleton-based action recognition. Vis Comput. https://doi.org/10.1007/s00371-018-1489-7

    Google Scholar 

  • Aggarwal JK, Ryoo MS (2011) Human activity analysis: a review. ACM Comput Surv 43(3):1–43

    Article  Google Scholar 

  • Aggarwal H, Vishwakarma DK (2016) Covariate conscious approach for Gait recognition based upon Zernike moment invariants. IEEE Trans Cognit Dev Syst 10(2):397–407

    Article  Google Scholar 

  • Aggarwal J, Xia L (2013) Human activity recognition from 3D data-a review. Pattern Recognit Lett 48:70–80

    Article  Google Scholar 

  • Althloothi S, Mahoor MH, Zhang X, Voyles RM (2014) Human activity recognition using multi-features and multiple kernel learning. Pattern Recogn 47:1800–1812

    Article  Google Scholar 

  • Amin S, Andriluka M, Rohrbach M, Schiele B (2013) Multi-view pictorial structures for 3D human pose estimation. In: British machine vision conference

  • Awwad S, Piccardi M (2016) Local depth patterns for fine-grained activity recognition in-depth videos. In: International conference on image and vision computing New Zealand, Palmerston North

  • Baccouche M, Mamalet F, Wolf C, Garcia C, Baskurt A (2011) Sequential deep learning for human action recognition. In: Proceedings of the second international conference on human behavior understanding

  • Barekatain M, et al. (2017) Okutama-action: an aerial view video dataset for concurrent human action detection. In: IEEE conference on computer vision and pattern recognition workshops, Honolulu

  • Baró X, Gonzalez J, Fabian J, Bautista MA, Oliu M, Escalante HJ, Guyon I (2015) ChaLearn Looking at People 2015 challenges: action spotting and cultural event recognition. In: IEEE conference on computer vision and pattern recognition workshops, Boston, MA

  • Blank M, Gorelick L, Shechtman E, Irani M, Basri R (2005) Actions as space-time shapes. In: Tenth IEEE international conference on computer vision (ICCV’05), Beijing

  • Bloom V, Argyriou V, Makris D (2016) Hierarchical transfer learning for online recognition of compound actions. Comput Vis Image Underst 144:62–72

    Article  Google Scholar 

  • Blunsden B, Fisher RB (2009) The BEHAVE video dataset: ground truthed video for multi-person behavior classification. Ann BMVA 4:4

    Google Scholar 

  • Bobick AF, Davis JW (2001) The recognition of human movement using temporal templates. IEEE Trans Pattern Anal Mach Intell 23(3):257–267

    Article  Google Scholar 

  • Borges PVK, Conci N, Cavallaro A (2013) Video-based human behavior understanding: a survey. IEEE Trans Circuits Syst Video Technol 23(11):1993–2008

    Article  Google Scholar 

  • Bux A, Angelov P, Habib Z (2016) Vision based human activity recognition: a review. Adv Comput Intell Syst 513:341–371

    Article  Google Scholar 

  • Chaquet JM, Carmona EJ, Caballero AF (2013) A survey of video datasets for human action and activity recognition. Comput Vis Image Underst 117:633–659

    Article  Google Scholar 

  • Chaudhry R, Ofli F, Kurillo G, Bajcsy R, Vidal R (2013) Bio-inspired dynamic 3D discriminative skeletal features for human action recognition. In: IEEE conference on computer vision and pattern recognition workshops, Portland

  • Chen L, Wei H, Ferryman J (2014) ReadingAct RGB-D action dataset and human action recognition from local features. Pattern Recogn Lett 50:159–169

    Article  Google Scholar 

  • Chen C, Jafari R, Kehtarnavaz N (2015) UTD-MHAD: a multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor. In: Proceedings of IEEE international conference on image processing, Canada

  • Cherian BF, Harandi M, Gould S (2017) Generalized rank pooling for activity recognition. In CVPR, Hawaii

  • Chéron G, Laptev I, Schmid C (2015) P-CNN: pose-based CNN features for action recognition. In: IEEE international conference on computer vision, Santiago

  • Cippitelli E, Gambi E, Spinsante S, Revuelta FF (2016) Evaluation of a skeleton-based method for human activity recognition on a large-scale RGB-D dataset. In: 2nd IET international conference on technologies for active and assisted living, London

  • Dalal N, Triggs B, Schmid C (2006) Human detection using oriented histograms of flow and appearance. In: Proceedings of European conference on computer vision

  • Das Dawn D, Shaikh SH (2016) A comprehensive survey of human action recognition with spatio-temporal interest point (STIP) detector. Vis Comput 32(3):289–306

    Article  Google Scholar 

  • Dollar P, Rabaud V, Cottrell G, Belongie S (2005) Behavior recognition via sparse spatio-temporal features. In: IEEE international workshop on visual surveillance and performance evaluation of tracking and surveillance

  • Donahue J, Hendricks L, Guadarrama S, Rohrbach MV, Saenko K, Darrell T (2015) Long-term recurrent convolutional networks for visual recognition and description. In: Proceedings of the IEEE conference on computer vision and pattern recognition

  • Du K, Shi Y, Lei B, Chen J, Sun M (2016) A method of human action recognition based on spatio-temporal interest points and PLSA. In: International conference on industrial informatics—computing technology, intelligent technology, industrial information integration, Wuhan

  • Duta IC, Ionescu B, Aizawa K, Sebe N (2017) Spatio-temporal vector of locally max pooled features for action recognition in videos. In: CVPR, Hawaii

  • Edwards M, Deng J, Xie X (2016) From pose to activity: surveying dataset sand introducing CONVERSE. Comput Vis Image Underst 144:73–105

    Article  Google Scholar 

  • Elmadany NED, He Y, Guan L (2016) Human gesture recognition via bag of angles for 3D virtual city planning in CAVE environment. In: IEEE 18th International workshop on multimedia signal processing, Montreal, QC

  • Feichtenhofer C, Pinz A, Zisserman A (2016) Convolutional two-stream network fusion for video action recognition. In: Proceedings of IEEE conference on computer vision and pattern recognition (CVPR), pp 1933–1941

  • Feichtenhofer C, Pinz A, Wildes RP (2017) Spatiotemporal multiplier networks for video action recognition. In: The IEEE conference on computer vision and pattern recognition (CVPR), Hawaii

  • Fernando B, Gould S (2016) Learning end-to-end video classification with rank-pooling. In: ICML

  • Fernando B, Gavves E, Oramas M, Ghodrati A, Tuytelaars T (2015) Modeling video evolution for action recognition. In: Proceedings of IEEE conference on computer vision and pattern recognition (CVPR)

  • Firman M (2016) RGBD datasets: past, present and future. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops

  • Fu L, Zhang J, Huang K (2017) ORGM: occlusion relational graphical model for human pose estimation. IEEE Trans Image Process 26(2):927–941

    Article  MathSciNet  MATH  Google Scholar 

  • Gaglio S, Re GL, Morana M (2015) Human activity recognition process using 3-D posture data. IEEE Trans Hum Mach Syst 45(5):586–597

    Article  Google Scholar 

  • Gaidon A, Harchaoui Z, Schmid C (2011) Actom sequence models for efficient action detection. In: IEEE conference on computer vision and pattern recognition

  • Gao Z, Li S, Zhu Y, Wang C, Zhang H (2017) Collaborative sparse representation learning model for RGBD action recognition. J Vis Commun Image Represent 48:442–452

    Article  Google Scholar 

  • Gkalelis N, Kim H, Hilton A, Nikolaidis N, Pitas I (2009) The i3DPost multi-view and 3D human action/interaction. In: Conference for visual media production, London, UK

  • Goodfellow I, Abadie JP, Mirza M, Xu B, Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: Proceedings of advances in neural information processing systems

  • Gopalan R (2013) Joint sparsity-based representation and analysis of unconstrained activities. In: IEEE conference on computer vision and pattern recognition, Portland

  • Gorban A, Idrees H, Jiang Y-G, Roshan Zamir A, Laptev I, Shah M, Sukthankar R (2015) {THUMOS} challenge: action recognition with a large number of classes. http://www.thumos.info

  • Gorelick L, Blank M, Shechtman E, Irani M, Basri R (2005) Actions as space-time shapes. In: The tenth IEEE international conference on computer vision (ICCV’05)

  • Goyal R, Kahou SE, Michalski V, Materzy´nska J, Westphal S, Kim H, Haenel V, Fruend I, Yianilos P, Freitag MM, Hoppe F, Thurau C, Bax I, Memisevic R (2018) The “something something” video database for learning and evaluating visual common sense. arXiv:1706.04261v2 [cs.CV]

  • Gross OK, Gurovich Y, Hassner T, Wolf L (2012) Motion interchange patterns for action recognition in unconstrained videos. In: ECCV, Firenze, Italy

  • Guha T, Ward RK (2012) Learning sparse representations for human action recognition. IEEE Trans Pattern Anal Mach Intell 34(8):1576–1588

    Article  Google Scholar 

  • Guo H, Wu X, Feng W (2017) Multi-stream deep networks for human action classification with sequential tensor decomposition. Sig Process 140:198–206

    Article  Google Scholar 

  • Hadfield S, Bowden R (2013) Hollywood 3D: recognizing actions in 3D natural scenes. In: IEEE conference on computer vision and pattern recognition, Portland

  • Hadfield S, Lebeda K, Bowden R (2017) Hollywood {3D}: what are the best {3D} features for action recognition? Int J Comput Vision 121(1):95–110

    Article  MathSciNet  Google Scholar 

  • Haija S, Kothari N, Lee J, Natsev P, Toderici G, Varadarajan B, Vijayanarasimhan S (2016) YouTube-8M: a large-scale video classification benchmark. In: CoRR

  • Han F, Reily B, Hoff W, Zhang H (2017) Space–time representation of people based on 3D skeletal data: a review. Comput Vis Image Underst 158:85–105

    Article  Google Scholar 

  • Hao T, Wu D, Wang Q, Sun J-S (2017) Multi-view representation learning for multi-view action recognition. J Vis Commun Image Represent 48:453–460

    Article  Google Scholar 

  • Harris C, Stephens M (1988) A combined corner and edge detector. In: Fourth Alvey vision conference

  • Hassner T (2013) A critical review of action recognition benchmarks. In: IEEE conference on computer vision and pattern recognition workshops, Portland

  • Heilbron FC, Escorcia V, Ghanem B, Niebles JC (2015) ActivityNet: a large-scale video benchmark for human activity understanding. In: IEEE conference on computer vision and pattern recognition (CVPR), Boston, MA

  • Herath S, Harandi M, Porikli F (2017) Going deeper into action recognition: a survey. Image Vis Comput 60:4–21

    Article  Google Scholar 

  • Hongeng S, Nevatia R (2003) Large-scale event detection using semi-hidden Marko models. In: Proceedings of the international conference on computer vision (ICCV)

  • Hu JF, Zheng WS, Lai J, Zhang J (2015) Jointly learning heterogeneous features for RGB-D activity recognition. In: IEEE conference on computer vision and pattern recognition, Boston, MA

  • Hu JF, Zheng WS, Lai JH, Zhang J (2016a) Jointly learning heterogeneous features for RGB-D activity recognition. IEEE Trans Pattern Anal Mach Intell 99:1

    Google Scholar 

  • Hu N, Bestick A, Englebienne G, Bajscy R, Kröse B (2016) Human intent forecasting using intrinsic kinematic constraints. In: IEEE/RSJ international conference on intelligent robots and systems, Daejeon

  • Idrees H, Zamir AR, Jiang Y-G, Gorban A, Laptev I, Sukthankar R, Shah M (2017) The THUMOS challenge on action recognition for videos “in the wild”. Comput Vis Image Underst 155:1–23

    Article  Google Scholar 

  • Imran J, Kumar P (2016) Human action recognition using RGB-D sensor and deep convolutional neural networks. In: International conference on advances in computing, communications and informatics, Jaipur

  • Iosifidis A, Tefas A (2013) Dynamic action recognition based on dynemes and extreme learning machine. Pattern Recogn Lett 34:1890–1898

    Article  Google Scholar 

  • Iosifidis A, Tefas A, Pitas I (2013) Learning sparse representations for view-independent human action recognition based on fuzzy distances. Neurocomputing 121:344–353

    Article  Google Scholar 

  • Iosifidis A, Tefas A, Nikolaidis N, Pitas I (2014) Human action recognition in stereoscopic videos based on a bag of features and disparity pyramids. In: 22nd European signal processing conference, Lisbon

  • Iosifidis A, Tefas A, Pitas I (2014b) Regularized extreme learning machine for multi-view semi-supervised action recognition. Neurocomputing 145:250–262

    Article  Google Scholar 

  • Iosifidis A, Marami E, Tefas A, Pitas I, Lyroudia K (2015) The MOBISERV-AIIA Eating and Drinking multi-view database for vision-based assisted living. J Inf Hiding Multimed Signal Process 6(2):254–273

    Google Scholar 

  • Jain M, Jegou H, Bouthemy P (2013) Better exploiting motion for better action recognition. In: CVPR

  • Jalal A, Kim Y (2014) Dense depth maps-based human pose tracking and recognition in dynamic scenes using ridge data. In: 11th IEEE international conference on advanced video and signal based surveillance

  • Jalal A, Kamal S, Kim D (2014) A depth video sensor-based life-logging human activity recognition system for elderly care in smart indoor environments. Sensors 14(7):11735–11759

    Article  Google Scholar 

  • Ji S, Xu W, Yang M, Yu K (2013) 3D convolutional neural networks for human action recognition. IEEE Trans Pattern Anal Mach Intell 35(1):221–231

    Article  Google Scholar 

  • Ji X, Feng CW, Tao D (2018) Skeleton embedded motion body partition for human action recognition using depth sequences. Sig Process 143:56–68

    Article  Google Scholar 

  • Jiang Y-G, Dai Q, Xue X, Liu W, Ngo C-W (2012) Trajectory-based modeling of human actions with motion reference points. In: Proceedings of the European conference on computer vision (ECCV)

  • Jiang YG, Wu Z, Wang J, Xue X, Chang SF (2017) Exploiting feature and class relationships in video categorization with regularized deep neural networks. IEEE Trans Pattern Anal Mach Intell 99:1

    Google Scholar 

  • Junejo I, Junejo K, Aghbari Z (2014) Silhouette-based human action recognition using SAX-Shapes. Vis Comput 30(3):259–269

    Article  Google Scholar 

  • Kantorov V, Laptev I (2014) Efficient feature extraction, encoding, and classification for action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition

  • Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Fei-Fei L (2014) Large-scale video classification with convolutional neural networks. In: IEEE conference on computer vision and pattern recognition, Columbus, OH

  • Kellokumpu V, Zhao G, Pietikinen M (2008) Human activity recognition using a dynamic texture based method. In: British machine vision conference

  • Kim YJ, Cho NG, Lee SW (2014) Group activity recognition with group interaction zone. In: 22nd International conference on pattern recognition, Stockholm

  • Kläser A, MarszaÅek M, Schmid C (2008) A spatio-temporal descriptor based on 3D-gradients. In BMVC08

  • Kong Y, Jia Y, Fu Y (2012) Learning human interaction by interactive phrases. In: European conference on computer vision

  • Kong Y, Liang W, Dong Z, Jia Y (2014) Recognising human interaction from videos by a discriminative model. IET Comput Vision 8(4):277–286

    Article  Google Scholar 

  • Kuehne H, Jhuang H, Garrote E, Poggio T, Serre T (2011) {HMDB}: a large video database for human motion recognition. In: Proceedings of the international conference on computer vision (ICCV)

  • Lan T, Wang Y, Mori G (2011) Discriminative figure-centric models for joint action localization and recognition. In: International conference on computer vision, Barcelona

  • Laptev I (2005) On space–time interest points. Int J Comput Vision 64(2–3):107–123

    Article  Google Scholar 

  • Laptev I, Lindeberg T (2004) Velocity adaptation of space-time interest points. In: Proceedings of the 17th international conference on pattern recognition

  • Laptev I, Lindeberg T (2004) Local descriptors for spatio-temporal recognition. In: ECCV workshop on spatial coherence for visual motion analysis

  • Laptev I, Marszalek M, Schmid C, Rozenfeld B (2008) Learning realistic human actions from movies. In: IEEE conference on computer vision and pattern recognition, Anchorage, AK

  • Lea C, Flynn MD, Vidal R, Reiter A, Hager GD (2017) Temporal convolutional networks for action segmentation and detection. In: The IEEE conference on computer vision and pattern recognition (CVPR), Hawaii

  • Li W, Zhang Z, Liu Z (2010) Action recognition based on a bag of 3D points. In: IEEE computer society conference on computer vision and pattern recognition, San Francisco

  • Li Y, Ye J, Wang T, Huang S (2015) Augmenting bag-of-words: a robust contextual representation of spatiotemporal interest points for action recognition. Vis Comput 31(10):1383–1394

    Article  Google Scholar 

  • Lin X, Casas J, Pard M (2016) 3D point cloud segmentation oriented to the analysis of interactions. In: The 24th European signal processing conference, Budapest, Hungary

  • Liu J, Luo J, Shah M (2009) Recognizing realistic actions from videos “in the Wild”. In: IEEE international conference on computer vision and pattern recognition (CVPR)

  • Liu L, Shao L, Zhen X, Li X (2013) Learning discriminative key poses for action recognition. IEEE Trans Cybern 43(6):1860–1870

    Article  Google Scholar 

  • Liu Z, Zhou L, Leung H, Shum HPH (2016a) Kinect posture reconstruction based on a local mixture of gaussian process models. IEEE Trans Visual Comput Graph 22(11):2437–2450

    Article  Google Scholar 

  • Liu T, Wang X, Dai X, Luo J (2016) Deep recursive and hierarchical conditional random fields for human action recognition. In: IEEE winter conference on applications of computer vision, Lake Placid, NY

  • Liu C, Hu Y, Li Y, Song S, Liu J (2017) PKU-MMD: a large scale benchmark for continuous multi-modal human action understanding. arXiv preprint arXiv:1703.07475

  • Liu AA, Su YT, Nie WZ, Kankanhalli M (2017b) Hierarchical clustering multi-task learning for joint human action grouping and recognition. IEEE Trans Pattern Anal Mach Intell 39(1):102–114

    Article  Google Scholar 

  • Liu M, Liu H, Chen C (2017c) Enhanced skeleton visualization for view-invariant human action recognition. Pattern Recogn 68:346–361

    Article  Google Scholar 

  • Lopez JA, Calvo MS, Guillo AF, Rodriguez JG, Cazorla M, Pont MTS (2016) Group activity description and recognition based on trajectory analysis and neural networks. In: International joint conference on neural networks, Vancouver, BC

  • Lun R, Zhao W (2015) A survey of applications and human motion recognition with Microsoft Kinect. Int J Pattern Recognit Artif Intell 29(5):1555008

    Article  Google Scholar 

  • Ma S, Sigal L, Sclarof S (2016) Learning activity progression in LSTMs for activity detection and early detection. In: IEEE conference on computer vision and pattern recognition, Las Vegas, NV

  • Mademlis I, Tefas A, Pitas I (2018) A salient dictionary learning framework for activity video summarization via key-frame extraction. Inf Sci 432:319–331

    Article  Google Scholar 

  • Mahjoub AB, Atri M (2016) Human action recognition using RGB data. In: 11th International design & test symposium, Hammamet

  • Marszaek M, Laptev I, Schmid C (2009) Actions in context. In: IEEE conference on computer vision & pattern recognition

  • Mathieu M, Couprie C, LeCun Y (2015) Deep multi-scale video prediction beyond mean square error. In: CoRR

  • Matikainen P, Hebert M, Sukthankar R (2009) Trajectons: action recognition through the motion analysis of tracked features. In: IEEE 12th international conference on computer vision

  • Messing R, Pal C, Kautz H (2009) Activity recognition using the velocity histories of. In: Proceedings of the international conference on computer vision (ICCV)

  • Miech A, Laptev I, Sivic J (2017) Learnable pooling with context gating for video classification. In: CVPR workshop, Hawaii

  • Misra I, Zitnick C, Hebert M (2016) Unsupervised learning using sequential verification for Action Recognition. arXiv preprint arXiv:1603.08561

  • Mo L, Li F, Zhu Y, Huang A (2016) Human physical activity recognition based on computer vision with deep learning model. In: IEEE international instrumentation and measurement technology conference proceedings, Taipei

  • Mygdalis V, Iosifidis A, Tefas A, Pitas I (2016) Graph embedded one-class classifiers for media data classification. Pattern Recogn 60:585–595

    Article  Google Scholar 

  • Negin F, Rodriguez P, Koperski M, Kerboua A, Gonzàlez J, Bourgeois J, Chapoulie E, Robert P, Bremond F (2018) PRAXIS: towards automatic cognitive assessment using gesture recognition. In: Expert systems with applications, vol 106, pp 21–35

  • Ng J-H, Hausknecht M, Vijayanarasimhan S, Vinyals O, Monga R, Toderici G (2015) Beyond short snippets: deep networks for video classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition

  • Ni B, Wang G, Moulin P (2011) RGBD-HuDaAct: a color-depth video database for human daily activity recognition. In: IEEE international conference on computer vision workshops

  • Ni B, Moulin P, Yang X, Yan S (2015) Motion part regularization: improving action recognition via trajectory group selection. In: IEEE conference on computer vision and pattern recognition, Boston

  • Niebles C, Chen W, Fei F (2010) Modeling temporal structure of decomposable motion segments for activity classification. In: 11th European conference on computer vision (ECCV)

  • Norouznezhad E, Harandi M, Bigdeli A, Baktash M, Postula A, Lovell B (2012) Directional space–time oriented gradients for 3D visual pattern analysis. In: Proceedings of the European conference on computer vision (ECCV)

  • Ofli F, Chaudhry R, Kurillo G, Vidal R, Bajcsy R (2013) Berkeley MHAD: a comprehensive multimodal human action database. In: IEEE workshop on applications of computer vision (WACV), Tampa, FL

  • Oreifej O, Liu Z (2013) HON4D: histogram of oriented 4D normals for activity recognition from depth sequences. In: CVPR, Portland, Oregon

  • Pei L, Ye M, Zhao X, Dou Y, Bao J (2016) Action recognition by learning temporal slowness invariant features. Vis Comput 32(11):1395–1404

    Article  Google Scholar 

  • Peng X, Zou C, Qiao Y, Peng Q (2014) Action recognition with stacked fisher vectors. In: ECCV

  • Pieropan A, Salvi G, Pauwels K, Kjellström H (2014) Audio-visual classification and detection of human manipulation actions. In: IEEE/RSJ international conference on intelligent robots and systems, Chicago, IL

  • Rabiner L (1989) A tutorial on hidden Markov models and selected applications in speech recognition. In: Proceedings of the IEEE 77 (2)

  • Rahmani H, Mahmood A, Huynh D, Mian A (2014) HOPC: histogram of oriented principal components of 3D point clouds for action recognition. In: European conference on computer vision (ECCV)

  • Reddy KK, Shah M (2012) Recognizing 50 human action categories of web videos. Mach Vis Appl 24(5):971–981

    Article  Google Scholar 

  • Rodriguez MD, Ahmed J, Shah M (2008) Action MACH: A spatio-temporal maximum average correlation height filter for action recognition. In: IEEE conference on computer vision and pattern recognition, Anchorage, AK

  • Rohrbach M, Amin S, Andriluka M, Schiele B (2012) A database for fine grained activity detection of cooking activities. In: Computer vision and pattern recognition

  • Ryoo MS, Aggarwal JK (2009) Spatio-temporal relationship match: video structure comparison for recognition of complex human activities. In: IEEE international conference on computer vision (ICCV), Kyoto, Japan

  • Ryoo MS, Chen CC, Aggarwal J, Chowdhury AR (2010) An overview of contest on semantic description of human activities. Recognizing patterns in signals, speech, images and videos, vol. 6388

  • Sadanand S, Corso J (2012) Action bank: a high-level representation of activity in video. In IEEE conference on computer vision and pattern recognition

  • Schuldt C, Laptev I, Caputo B (2004) Recognizing human actions: a local SVM approach. In: Proceedings of the 17th international conference on pattern recognition

  • Shahroudy A, Liu J, Ng TT, Wang G (2016) NTU RGB+D: a large scale dataset for 3D human activity analysis. In: IEEE conference on computer vision and pattern recognition, Las Vegas

  • Shan Y, Zhang Z, Yang P, Huang K (2015) Adaptive slice representation for human action classification. IEEE Trans Circuits Syst Video Technol 25(10):1624–1636

    Article  Google Scholar 

  • Shao L, Zhen X, Tao D, Li X (2014) Spatio-temporal Laplacian pyramid coding for action recognition. IEEE Trans Cybern 44(6):817–827

    Article  Google Scholar 

  • Shechtman E, Irani M (2005) Space-time behaviour based correlation. In: IEEE conference on computer vision and pattern analysis, Los Alamitos, CA

  • Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. In: Proceedings of advances in neural information processing systems

  • Singh B, Marks T, Jones M, Tuzel C (2016) A multi-stream bi-directional recurrent neural network for fine-grained action detection. In: IEEE conference on computer vision and pattern recognition (CVPR)

  • Somasundaram G, Cherian A, Morellas V, Papanikolopoulos N (2014) Action recognition using global spatio-temporal features derived from sparse representations. Comput Vis Image Underst 123:1–13

    Article  Google Scholar 

  • Soomro K, Zamir AR (2014) Action recognition in realistic sports videos. In: Computer vision in sports, pp 181–208

  • Soomro K, Zamir AR, Shah M (2012) UCF101: a dataset of 101 human action classes from videos in the wild. In: CoRR

  • Srivastava N, Mansimov E, Salakhutdinov R (2015) Unsupervised learning of video representations using LSTMs. In: CoRR

  • Stein S, McKenna SJ (2013) Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: ACM international joint conference on pervasive and ubiquitous computing, Zurich, Switzerland

  • Sun C, Nevatia R (2013) ACTIVE: activity concept transitions in video event classification. In: Proceedings of the international conference on computer vision (ICCV)

  • Sung J, Ponce C, Selman B, Saxena A (2012) Unstructured human activity detection from RGBD images. In: IEEE international conference on robotics and automation, Saint Paul, MN

  • Tang K, Fei LF, Koller D (2012) Learning latent temporal structure for complex event detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR)

  • Tayyub J, Tavanai A, Gatsoulis Y, Cohn A, Hogg D (2015) Qualitative and quantitative spatiotemporal relations. In: ACCV

  • The TH, Le B-V, Lee S, Yoon Y (2016) Interactive activity recognition using pose-based spatio–temporal relation features and four-level Pachinko Allocation Model. Inform Comput Sci Intell Syst Appl 369:317–333

    MathSciNet  Google Scholar 

  • Tian Y, Cao L, Liu Z, Zhang Z (2012) Hierarchical filtered motion for action recognition in crowded videos. IEEE Trans Syst Man Cybern 42(3):313–323

    Article  Google Scholar 

  • Tran D, Sorokin A (2008) Human activity recognition with metric. In: European conference on computer vision, Marseille, France

  • Tran D, Bourdev L, Fergus R, Torresani L, Paluri M (2015) Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of the international conference on computer vision

  • Vaquette G, Orcesi AL, Achard C (2017) The daily home life activity dataset: a high semantic activity dataset for online recognition. In IEEE international conference on automatic face & gesture recognition (FG 2017), Washington, DC

  • Varol G, Laptev I, Schmid C (2016) Long-term temporal convolutions for action recognition. arXiv:1604.04494

  • Vishwakarma S, Agrawal A (2013) A survey on activity recognition and behavior understanding in video surveillance. Vis Comput 29(10):983–1009

    Article  Google Scholar 

  • Vishwakarma DK, Kapoor R (2015) Hybrid classifier based human activity recognition using the silhouette and cells. Expert Syst Appl 42(20):6957–6965

    Article  Google Scholar 

  • Vishwakarma DK, Singh K (2017) Human activity recognition based on spatial distribution of gradients at sub-levels of average energy silhouette images. IEEE Trans Cognit Dev Syst 9(4):316–327

    Article  Google Scholar 

  • Vishwakarma DK, Kapoor R, Dhiman A (2016a) A proposed framework for the recognition of human activity by exploiting the characteristics of action dynamics. Robot Auton Syst 77:25–38

    Article  Google Scholar 

  • Vishwakarma DK, Kapoor R, Dhiman A (2016b) A unified framework for human activity recognition: an approach using spatial edge distribution and ℜ-transform. Int J Electr Commun 70(3):341–353

    Article  Google Scholar 

  • Wang Y, Mori G (2011) Hidden part models for human action recognition: probabilistic versus max margin. IEEE Trans Pattern Anal Mach Intell 33(7):1310–1323

    Article  Google Scholar 

  • Wang H, Schmid C (2013) Action recognition with improved trajectories. In: Proceedings of the international conference on computer vision (ICCV)

  • Wang Y, Huang K, Tan T (2007) Human activity recognition based on R transform. In IEEE conference on computer vision and pattern recognition, Minneapolis, MN

  • Wang H, Ullah M, Kläser A, Laptev I, Schmid C (2009) Evaluation of local spatio-temporal features for action recognition. In: British machine vision conference

  • Wang J, Liu Z, Wu Y, Yuan J (2012) Mining actionlet ensemble for action recognition with depth cameras. In: IEEE conference on computer vision and pattern recognition

  • Wang H, Klaeser A, Schmid C, Liu C-L (2013) Dense trajectories and motion boundary descriptors for action recognition. In: IJCV

  • Wang J, Nie BX, Xia Y, Wu Y, Zhu S-C (2014) Cross-view action modeling, learning and recognition. In: Computer vision and pattern recognition, Columbus, Ohio

  • Wang P, Li W, Gao Z, Tang C, Zhang J, Ogunbona PO (2015) Convnets-based action recognition from depth maps through virtual cameras and pseudocoloring. In: ACM international conference on multimedia

  • Wang Z, Wang L, Du W, Qiao Y (2015) Exploring fisher vector and deep networks for action spotting. In: CVPR

  • Wang P, Li W, Gao Z, Zhang J, Tang C, Ogunbona PO (2016) Action recognition from depth maps using deep convolutional neural networks. IEEE Trans Hum Mach Syst 46(4):498–509

  • Wang L, Xiong Y, Lin D, Van Gool L (2017) Untrimmed nets for weakly supervised action recognition and detection. In: The IEEE conference on computer vision and pattern recognition (CVPR), Hawaii

  • Wang P, Li W, Ogunbona PO, Escalera S (2017b) RGB-D-based motion recognition with deep learning: a survey. Int J Comput Vis 99:1–34

  • Weinland D, Ronfard R, Boyer E (2006) Free-viewpoint action recognition using motion history volumes. Comput Vis Image Underst 104(2–3):249–257

  • Weinland D, Boyer E, Ronfard R (2007) Action recognition from arbitrary views using 3D exemplars. In: IEEE 11th international conference on computer vision, Rio de Janeiro

  • Willems G, Tuytelaars T, Van Gool L (2008) An efficient dense and scale-invariant spatio-temporal interest point detector. In: Proceedings of the European conference on computer vision (ECCV)

  • Wolf C, Mille J, Lombardi E, Celiktutan O, Jiu M, Dogan E, Eren G, Baccouche M, Dellandrea E, Bichot C-E, Garcia C, Sankur B (2014) Evaluation of video activity localizations integrating quality and quantity measurements. Comput Vis Image Underst 127:14–30

  • Wu Z, Fu Y, Jiang YG, Sigal L (2016) Harnessing object and scene semantics for large-scale video understanding. In: IEEE conference on computer vision and pattern recognition (CVPR), Las Vegas, NV

  • Xu N, Liu A, Nie W, Wong Y, Li F, Su Y (2015) Multi-modal & multi-view & interactive benchmark dataset for human action recognition. In: Proceedings of the 23rd international conference on multimedia, Brisbane, Queensland, Australia

  • Xu Z, Hu J, Deng W (2016) Recurrent convolutional neural network for video classification. In: IEEE international conference on multimedia and expo, Seattle, WA

  • Xu W, Miao Z, Zhang XP, Tian Y (2017) A hierarchical spatio-temporal model for human activity recognition. IEEE Trans Multimed 99:1

  • Yadav GK, Shukla P, Sethi A (2016) Action recognition using interest points capturing differential motion information. In: IEEE international conference on acoustics, speech and signal processing, Shanghai

  • Yan H (2016) Discriminative sparse projections for activity-based person recognition. Neurocomputing 208:183–192

  • Yan X, Chang H, Shan S, Chen X (2014) Modeling video dynamics with deep dynencoder. In: Proceedings of European conference on computer vision

  • Yeung S, Russakovsky O, Mori G, Fei-Fei L (2016) End-to-end learning of action detection from frame glimpses in videos. In: IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas

  • Yilmaz A, Shah M (2005) Actions sketch: a novel action representation. In: IEEE computer society conference on computer vision and pattern recognition

  • Yu G, Yuan J (2015) Fast action proposals for human action detection and search. In: IEEE conference on computer vision and pattern recognition, Boston, MA

  • Yu Y, Choi J, Kim Y, Yoo K, Lee S-H, Kim G (2017) Supervising neural attention models for video captioning by human gaze data. In: The IEEE conference on computer vision and pattern recognition (CVPR), Hawaii

  • Yuan J, Ni B, Yang X, Kassim AA (2016) Temporal action localization with pyramid of score distribution features. In: IEEE conference on computer vision and pattern recognition, Las Vegas, NV

  • Zhang Z, Huang K, Tan T, Wang L (2007) Trajectory series analysis based event rule induction for visual surveillance. In: IEEE conference on computer vision and pattern recognition, Minneapolis, MN

  • Zhang Z, Huang K, Tan T (2008) Multi-thread parsing for recognizing complex events in videos. In: 10th European conference on computer vision: part III, Marseille, France

  • Zhang J, Li W, Ogunbona PO, Wang P, Tang C (2016) RGB-D based action recognition datasets: a survey. Pattern Recognit 60:86–105

  • Zhao G, Pietikainen M (2007) Dynamic texture recognition using local binary patterns with an application to facial expressions. IEEE Trans Pattern Anal Mach Intell 29(6):915–928

  • Zhou Y, Ni B, Hong R, Wang M, Tian Q (2015) Interaction part mining: a mid-level approach for fine-grained action recognition. In: IEEE conference on computer vision and pattern recognition, Boston, MA

  • Zhu Y, Zhao X, Fu Y, Liu Y (2011) Sparse coding on local spatial–temporal volumes for human action recognition. In: Proceedings of the Asian conference on computer vision

  • Zhu G, Zhang L, Shen P, Song J, Zhi L, Yi K (2015) Human action recognition using key poses and atomic motions. In: IEEE international conference on robotics and biomimetics, Zhuhai

  • Zhu F, Shao L, Xie J, Fang Y (2016) From handcrafted to learned representations for human action recognition: a survey. Image Vis Comput 55:42–52


Author information

Corresponding author

Correspondence to Dinesh Kumar Vishwakarma.


About this article

Cite this article

Singh, T., Vishwakarma, D.K. Video benchmarks of human action datasets: a review. Artif Intell Rev 52, 1107–1154 (2019). https://doi.org/10.1007/s10462-018-9651-1
