Abstract
Interest in automatic action and gesture recognition has grown considerably in the last few years, due in part to the large number of application domains for this type of technology. As in many other areas of computer vision, deep learning based methods have quickly become a reference for obtaining state-of-the-art performance on both tasks. This chapter surveys current deep learning based methodologies for action and gesture recognition in sequences of images, covering both fundamental and cutting-edge approaches reported in the last few years. We introduce a taxonomy that summarizes important aspects of deep learning for approaching both tasks, and review details of the proposed architectures, fusion strategies, main datasets, and competitions. We also summarize and discuss the main works proposed so far, paying particular attention to how they treat the temporal dimension of the data, to their distinguishing features, and to opportunities and challenges for future research. To the best of our knowledge, this is the first survey on the topic, and we foresee that it will become a reference in this rapidly evolving field of research.
A reduced version of this chapter appeared as: M. Asadi-Aghbolaghi et al., A survey on deep learning based approaches for action and gesture recognition in image sequences, in Proceedings of the 12th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2017), 2017.
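As a concrete illustration of the kind of spatio-temporal architectures reviewed in the chapter, the sketch below defines a minimal 3D-convolutional clip classifier in PyTorch. It is an assumption-laden toy example only: the Tiny3DConvNet name, the layer sizes, and the input clip shape are arbitrary choices for illustration, not an architecture proposed by the surveyed works.

# Illustrative sketch only: a minimal 3D-CNN clip classifier of the general
# kind surveyed in the chapter (C3D-style models). All names and sizes here
# are arbitrary assumptions, not the authors' method.
import torch
import torch.nn as nn

class Tiny3DConvNet(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        # 3D convolutions filter jointly over (time, height, width), which is
        # one way the surveyed methods model the temporal dimension of video.
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),   # pool spatially only at first
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),               # global spatio-temporal pooling
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, channels, frames, height, width)
        x = self.features(clips)
        return self.classifier(x.flatten(1))

if __name__ == "__main__":
    model = Tiny3DConvNet(num_classes=10)
    dummy_clips = torch.randn(2, 3, 16, 112, 112)  # 2 clips of 16 RGB frames
    print(model(dummy_clips).shape)                # torch.Size([2, 10])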
Editors: Sergio Escalera, Isabelle Guyon, Vassilis Athitsos
Acknowledgements
This work has been partially supported by the Spanish projects TIN2015-66951-C2-2-R and TIN2016-74946-P (MINECO/FEDER, UE) and CERCA Programme / Generalitat de Catalunya. Hugo Jair Escalante was supported by CONACyT under grants CB2014-241306 and PN-215546.
Copyright information
© 2017 Springer International Publishing AG
Cite this chapter
Asadi-Aghbolaghi, M. et al. (2017). Deep Learning for Action and Gesture Recognition in Image Sequences: A Survey. In: Escalera, S., Guyon, I., Athitsos, V. (eds) Gesture Recognition. The Springer Series on Challenges in Machine Learning. Springer, Cham. https://doi.org/10.1007/978-3-319-57021-1_19
DOI: https://doi.org/10.1007/978-3-319-57021-1_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-57020-4
Online ISBN: 978-3-319-57021-1
eBook Packages: Computer Science, Computer Science (R0)