Abstract
Most current action recognition algorithms are based on deep networks that stack multiple convolutional, pooling and fully connected layers. While convolutional and fully connected operations have been widely studied in the literature, the design of pooling operations for action recognition, with its different sources of temporal granularity across action categories, has received comparatively less attention, and existing solutions rely mainly on max or average operations. The latter cannot fully capture the actual temporal granularity of action categories and thereby constitute a bottleneck in classification performance. In this paper, we introduce a novel hierarchical pooling design that captures different levels of temporal granularity in action recognition. Our design principle is coarse-to-fine and is achieved using a tree-structured network; as we traverse this network top-down, pooling operations become less invariant but temporally more resolute and better localized. Learning the combination of operations in this network that best fits a given ground truth is obtained by solving a constrained minimization problem whose solution corresponds to the distribution of weights capturing the contribution of each level (and thereby each temporal granularity) in the global hierarchical pooling process. Besides being principled and well grounded, the proposed hierarchical pooling is also video-length and resolution agnostic. Extensive experiments conducted on the challenging UCF-101, HMDB-51 and JHMDB-21 databases corroborate all these statements.
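The coarse-to-fine idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`temporal_pyramid_pool`, `combine_levels`) are hypothetical, average pooling stands in for the learned per-node operations, and a simple softmax reparameterization stands in for the constrained minimization that yields the simplex-constrained level weights.

```python
import numpy as np

def temporal_pyramid_pool(features, num_levels=3):
    """Coarse-to-fine temporal pooling over per-frame descriptors.

    features: (T, D) array of per-frame features.
    Level l splits the T frames into 2**l contiguous segments and
    average-pools each one; coarser levels are more invariant,
    finer levels are temporally better localized.
    Returns a list of (2**l, D) arrays, one per level.
    """
    T, _ = features.shape
    pooled = []
    for level in range(num_levels):
        n_seg = 2 ** level
        bounds = np.linspace(0, T, n_seg + 1).astype(int)
        segs = np.stack([features[bounds[i]:bounds[i + 1]].mean(axis=0)
                         for i in range(n_seg)])
        pooled.append(segs)
    return pooled

def combine_levels(pooled, alpha_logits):
    """Convex combination of the level representations.

    Softmax keeps the weights nonnegative and summing to one, a crude
    stand-in for the paper's constrained minimization; in practice the
    weights would be learned jointly with the classifier.
    """
    alpha = np.exp(alpha_logits) / np.exp(alpha_logits).sum()
    return np.concatenate([a * p.ravel() for a, p in zip(alpha, pooled)])

video = np.random.rand(32, 8)            # 32 frames, 8-dim features
pooled = temporal_pyramid_pool(video)    # shapes (1,8), (2,8), (4,8)
desc = combine_levels(pooled, np.zeros(3))
```

Uniform logits give equal weight to every granularity; training would instead shift mass toward the levels that best fit the ground truth of each action category.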
Data availability
This manuscript has no associated data; all data used for the experiments are openly accessible on the web.
Notes
Already available/pretrained on ImageNet to capture the appearance.
Whose complexity scales quadratically w.r.t. the size of the training data.
These subclasses of actions are not explicitly defined in a supervised manner but implicitly by allowing enough flexibility in the multiple instances of temporal pyramids in order to capture different (unknown) subclasses of action dynamics.
In order to make training cycles efficient, we only use the skeleton frames.
Training of each lightweight GCN architecture lasts less than an hour on a GeForce GTX 1070 GPU (with 8 GB memory).
As reported in [70], the number of parameters in the HAN-2S and HAN architectures is 940k and 530k, respectively.
References
Bagautdinov T, Alahi A, Fleuret F, Fua P, Savarese S (2017) Social scene understanding: end-to-end multi-person action localization and collective activity recognition. In: IEEE international conference on computer vision and pattern recognition (CVPR)
Shao J, Kang K, Loy CC, Wang X (2015) Deeply learned attributes for crowded scene understanding. In: IEEE international conference on computer vision and pattern recognition (CVPR)
Pantic M, Pentland A, Nijholt A, Huang TS (2007) Human computing and machine understanding of human behavior: a survey. In: Human computing and machine understanding of human behavior
Jiu M, Sahbi H (2016) Laplacian deep kernel learning for image annotation. In: 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 1551–1555
Mabrouk AB, Zagrouba E (2018) Abnormal behavior recognition for intelligent video surveillance systems: a review. Expert Syst Appl 91:480–491
Han Y, Zhang P, Zhuo T, Huang W, Zhang Y (2018) Going deeper with two-stream ConvNets for action recognition in video surveillance. Pattern Recognit Lett 107:83–90
Wang B, Ma L, Zhang W, Liu W (2018) Reconstruction network for video captioning. In: IEEE international conference on computer vision and pattern recognition (CVPR)
Wang J, Jiang W, Ma L, Liu W, Xu Y (2018) Bidirectional attentive fusion with context gating for dense video captioning. In: IEEE international conference on computer vision and pattern recognition (CVPR)
Aafaq N, Akhtar N, Liu W, Gilani SZ, Mian A (2019) Spatio-temporal dynamics and semantic attribute enriched visual encoding for video captioning. In: The IEEE conference on computer vision and pattern recognition (CVPR)
Lu M, Li Z-N, Wang Y, Pan G (2019) Deep attention network for egocentric action recognition. IEEE Trans Image Process 28(8):3703
Mahmud T, Billah M, Hasan M, Roy-Chowdhury AK (2019) Captioning near-future activity sequences. arXiv:1908.00943
Laptev I, Perez P (2007) Retrieving actions in movies. In: International conference on computer vision (ICCV)
Ballan L, Bertini M, Del Bimbo A, Seidenari L, Serra G (2011) Event detection and recognition for semantic annotation of video. Multimed Tools Appl 51(1):279–302
Jaimes A, Omura K, Nagamine T, Hirata K (2004) Memory cues for meeting video retrieval. In: CARPE proceedings of the 1st ACM workshop on continuous archival and retrieval of personal experiences, pp 74–85
Duchenne O, Laptev I, Sivic J, Bach F, Ponce J (2009) Automatic annotation of human actions in video. In: International conference on computational vision (ICCV)
Meng H, Pears N, Bailey C (2007) A human action recognition system for embedded computer vision application. In: IEEE conference on computer vision and pattern recognition (CVPR)
Theodoridis T, Agapitos A, Hu H, Lucas SM (2008) Ubiquitous robotics in physical human action recognition: a comparison between dynamic ANNs and GP. In: IEEE international conference on robotics and automation
Demiris Y (2007) Prediction of intent in robotics and multi-agent systems. Cognit Process 8:151. https://doi.org/10.1007/s10339-007-0168-9
Nan M, Ghiţă AS, Gavril A, Trascau M, Sorici A, Cramariuc B, Florea AM (2019) Human action recognition for social robots. In: International conference on control systems and computer science
Pirsiavash H, Ramanan D (2012) Detecting activities of daily living in first-person camera views. In: IEEE conference on computer vision and pattern recognition (CVPR)
Chen L, Duan L, Xu D (2013) Event recognition in videos by learning from heterogeneous web sources. In: IEEE international conference on computer vision and pattern recognition (CVPR)
Xu D, Chang S-F (2007) Visual event recognition in news video using kernel methods with multi-level temporal alignment. In: IEEE international conference on computer vision and pattern recognition (CVPR)
Wang H, Yuan C, Hu W, Sun C (2012) Supervised class-specific dictionary learning for sparse modeling in action recognition. Pattern Recognit 45(11):3902–3911
Schuldt C, Laptev I, Caputo B (2004) Recognizing human actions: a local SVM approach. In: IEEE international conference on pattern recognition (ICPR)
Wang L, Sahbi H (2014) Bags-of-daglets for action recognition. In: 2014 IEEE international conference on image processing (ICIP). IEEE, pp 1550–1554
Lu W, Little JJ (2006) Simultaneous tracking and action recognition using the PCA-HOG descriptor. In: European conference on Computer vision (ECCV)
Horn BKP, Schunck BG (1981) Determining optical flow. Artif Intell 17(1–3):185–203
Csurka G, Dance CR, Fan L, Willamowski J, Bray C (2004) Visual categorization with bags of keypoints. In: European conference on computer vision (ECCV)
Csurka G, Perronnin F (2010) Fisher vectors: beyond bag-of-visual-words image representations. In: International conference on computer vision, imaging and computer graphics
Wang L, Sahbi H (2013) Directed acyclic graph kernels for action recognition. In: Proceedings of the IEEE international conference on computer vision, pp 3168–3175
Wang L, Sahbi H (2015) Nonlinear cross-view sample enrichment for action recognition. In: Computer vision-ECCV 2014 workshops: Zurich, Switzerland, September 6–7 and 12, 2014, Proceedings, Part III 13. Springer, pp 47–62
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: IEEE international conference on computer vision and pattern recognition (CVPR)
Deng J, Dong W, Socher R, Li L-J, Li K, Fei-Fei L (2009) ImageNet: a large-scale hierarchical image database. In: IEEE international conference on computer vision and pattern recognition (CVPR)
Graves A, Mohamed A, Hinton G (2013) Speech recognition with deep recurrent neural networks. In: IEEE international conference on acoustics, speech and signal processing (ICASSP)
Hinton G, Deng L, Yu D, Dahl G, Mohamed A, Jaitly N, Senior A, Vanhoucke V, Nguyen P, Sainath T, Kingsbury B (2012) Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Process Mag 29:82–97
He K, Zhang X, Ren S, Sun J (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: IEEE international conference on computer vision (ICCV)
Xiao T, Xu Y, Yang K, Zhang J, Peng Y, Zhang Z (2015) The application of two-level attention models in deep convolutional neural network for fine-grained image classification. In: IEEE conference on computer vision and pattern recognition (CVPR)
Mazari A, Sahbi H (2020) Coarse-to-fine aggregation for cross-granularity action recognition. In: 2020 IEEE international conference on image processing (ICIP). IEEE, pp 1541–1545
Sahbi H, Zhan H (2021) FFNB: forgetting-free neural blocks for deep continual learning. In: The British machine vision conference (BMVC)
Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. In: Neural information processing systems (NeurIPS)
Feichtenhofer C, Pinz A, Zisserman A (2016) Convolutional two-stream network fusion for video action recognition. In: IEEE conference on computer vision and pattern recognition (CVPR)
Carreira J, Zisserman A (2017) Quo vadis, action recognition? A new model and the kinetics dataset. In: IEEE conference on computer vision and pattern recognition (CVPR)
Choutas V, Weinzaepfel P, Revaud J, Schmid C (2018) PoTion: Pose MoTion representation for action recognition. In: IEEE international conference on computer vision and pattern recognition (CVPR)
Feichtenhofer C, Pinz A, Wildes R-P (2016) Spatiotemporal residual networks for video action recognition. In: Neural information processing systems (NeurIPS)
Feichtenhofer C, Pinz A, Wildes R-P (2017) Spatiotemporal multiplier networks for video action recognition. In: IEEE international conference on computer vision and pattern recognition (CVPR)
Mazari A, Sahbi H (2019) Deep temporal pyramid design for action recognition. In: IEEE international conference on acoustics, speech and signal processing (ICASSP)
Martin P-E, Benois-Pineau J, Péteri R, Morlier J (2020) Fine grained sport action recognition with twin spatio-temporal convolutional neural networks: application to table tennis. Multimed Tools Appl 79(2020):20429–20447
Ullah A et al (2017) Action recognition in video sequences using deep bi-directional LSTM with CNN features. IEEE Access 6:1155–1166
He K, Zhang X, Ren S, Sun J (2015) Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans Pattern Anal Mach Intell 37(9):1904
Murray N, Perronnin F (2014) Generalized max pooling. In: IEEE conference on computer vision and pattern recognition (CVPR)
Gao Y, Beijbom O, Zhang N, Darrell T (2016) Compact bilinear pooling. In: IEEE conference on computer vision and pattern recognition (CVPR)
Obeso AM, Benois-Pineau J, Vázquez MSG, Acosta AÁR (2019) Forward–backward visual saliency propagation in deep NNS vs internal attentional mechanisms. In: 2019 Ninth international conference on image processing theory, tools and applications (IPTA). IEEE, pp 1–6
Obeso AM, Benois-Pineau J, Vázquez MSG, Acosta AAR (2018) Introduction of explicit visual saliency in training of deep CNNs: application to architectural styles classification. In: 2018 International conference on content-based multimedia indexing (CBMI). IEEE, pp 1–5
Li J, Liu X, Zhang W, Zhang M, Song J, Sebe N (2020) Spatio-temporal attention networks for action recognition and detection. IEEE Trans Multimed 22(11):2990–3001
Piergiovanni AJ, Ryoo MS (2018) Fine-grained activity recognition in baseball videos. In: IEEE conference on computer vision and pattern recognition (CVPR), workshop on computer vision in sports
Soomro K, Zamir AR, Shah M (2012) UCF101: a dataset of 101 human action classes from videos in the wild. In: CRCV-TR-12-01
Shahroudy A, Liu J, Ng T, Wang G (2016) NTU RGB+D: a large scale dataset for 3D human activity analysis. In: IEEE conference on computer vision and pattern recognition (CVPR)
Pramono RRA, Chen Y-T, Fang W-H (2019) Hierarchical self-attention network for action localization in videos. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 61–70
Wang Y, Long M, Wang J, Yu PS (2017) Spatio temporal pyramid network for video action recognition. In: IEEE international conference on computer vision and pattern recognition (CVPR)
Zhu J, Zou W, Zhu Z (2018) End-to-end video level representation learning for action recognition. In: International conference on learning representation (ICLR)
Zheng Z, An G, Wu D, Ruan Q (2019) Spatial-temporal pyramid based Convolutional Neural Network for action recognition. Neurocomputing 358:446–455
Zhang D, Dai X, Wang YF (2018) Dynamic temporal pyramid network: a closer look at multi-scale modeling for activity detection. In: Asian conference on computer vision (ACCV)
Yang K, Li R, Qiao P, Wang Q, Li D, Dou Y (2018) Temporal pyramid relation network for video-based gesture recognition. In: IEEE international conference on image processing (ICIP)
Jin S, Cao Z, Song X (2022) IA-FPN: interactive aggregation feature pyramid network for action detection. In: 2022 4th International conference on intelligent control, measurement and signal processing (ICMSP). IEEE, pp 1063–1068
Cai J, Hu J, Li S, Lin J, Wang J (2020) Combination of temporal-channels correlation information and bilinear feature for action recognition. IET Comput Vis 14(8):634–641
Li H, Huang J, Zhou M, Shi Q, Fei Q (2022) Self-attention pooling-based long-term temporal network for action recognition. IEEE Trans Cognit Dev Syst 15(1):65
Kusumoseniarto RH (2020) Two-stream 3D convolution attentional network for action recognition. In: 2020 Joint 9th international conference on informatics, electronics & vision (ICIEV) and 2020 4th international conference on imaging, vision & pattern recognition (icIVPR). IEEE, pp 1–6
Ha M-H, Chen OT-C (2021) Deep neural networks using residual fast–slow refined highway and global atomic spatial attention for action recognition and detection. IEEE Access 9:164887–164902
Pramono RRA, Fang W-H, Chen Y-T (2021) Relational reasoning for group activity recognition via self-attention augmented conditional random field. IEEE Trans Image Process 30:8184–8199
Liu J, Wang Y, Xiang S, Pan C (2021) Han: an efficient hierarchical self-attention network for skeleton-based gesture recognition. arXiv preprint arXiv:2106.13391
Yihuang J (2017) Pretrained 2D two streams network for action recognition on UCF-101 based on temporal segment network. https://github.com/jeffreyyihuang/two-stream-action-recognition
Wang L, Xiong Y, Wang Z, Qiao Y, Lin D, Tang X, Van Gool L (2016) Temporal segment networks: towards good practices for deep action recognition. In: European conference on computer vision (ECCV)
Shawe-Taylor J, Cristianini N (2004) Kernel methods for pattern analysis. Cambridge University Press, Cambridge
Gönen M, Alpaydın E (2011) Multiple kernel learning algorithms. J Mach Learn Res 12:2211–2268
Kuehne H, Jhuang H, Garrote E, Poggio T, Serre T (2011) HMDB: a large video database for human motion recognition. In: Proceedings of the international conference on computer vision (ICCV)
Tran D, Bourdev L, Fergus R, Torresani L, Paluri M (2015) Learning spatiotemporal features with 3d convolutional networks. In: IEEE international conference on computer vision (ICCV)
Qiu Z, Yao T, Mei T (2017) Learning spatio-temporal representation with pseudo-3d residual networks. In: Proceedings of the IEEE international conference on computer vision, pp 5533–5541
Tran D, Wang H, Torresani L, Ray J, LeCun Y, Paluri M (2018) A closer look at spatiotemporal convolutions for action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6450–6459
Wang L, Li W, Li W, Van Gool L (2018) Appearance-and-relation networks for video classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1430–1439
Zolfaghari M, Singh K, Brox T (2018) Eco: efficient convolutional network for online video understanding. In: Proceedings of the European conference on computer vision (ECCV), pp 695–712
Wu C-Y, Zaheer M, Hu H, Manmatha R, Smola AJ, Krahenbuhl P (2018) Compressed video action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6026–6035
Lin J, Gan C, Han S (2019) TSM: temporal shift module for efficient video understanding. In: Proceedings of the IEEE international conference on computer vision, pp 7083–7093
Li J, Wei P, Zheng N (2021) Nesting spatiotemporal attention networks for action recognition. Neurocomputing 459:338–348
Ohn-Bar E, Trivedi MM (2014) Hand gesture recognition in real time for automotive interfaces: a multimodal vision-based approach and evaluations. IEEE TITS 15(6):2368–2377
Oreifej O, Liu Z (2013) HON4D: histogram of oriented 4D normals for activity recognition from depth sequences. In: CVPR, pp 716–723
Rahmani H, Mian A (2016) 3D Action recognition from novel viewpoints. In: CVPR, pp 1506–1515
Zhu W, Lan C, Xing J, Zeng W, Li Y, Shen L, Xie X (2016) Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks. In: AAAI, vol 2, p 6
Zanfir M, Leordeanu M, Sminchisescu C (2013) The moving pose: an efficient 3D kinematics descriptor for low-latency action recognition and detection. In: ICCV, pp 2752–2759
Vemulapalli R, Arrate F, Chellappa R (2014) Human action recognition by representing 3D skeletons as points in a Lie group. In: IEEE CVPR, pp 588–595
Du Y, Wang W, Wang L (2015) Hierarchical recurrent neural network for skeleton based action recognition. In: IEEE CVPR, pp 1110–1118
Zhang X, Wang Y, Gou M, Sznaier M, Camps O (2016) Efficient temporal sequence comparison and classification using gram matrix embeddings on a Riemannian manifold. In: CVPR, pp 4498–4507
Garcia-Hernando G, Kim T-K (2017) Transition forests: learning discriminative temporal transitions for action recognition. In: CVPR, pp 407–415
Hu J, Zheng W, Lai J, Zhang J (2015) Jointly learning heterogeneous features for RGB-D activity recognition. In: CVPR
Huang Z, Gool LV (2017) A Riemannian network for SPD matrix learning. In: AAAI, pp 2036–2042
Huang Z, Wu J, Gool LV (2018) Building deep networks on Grassmann manifolds. In: AAAI, pp 3279–3286
Garcia-Hernando G, Yuan S, Baek S, Kim TK (2018) First person hand action Benchmark with RGB-D videos and 3D hand pose annotations. In: CVPR
Sahbi H (2021) Learning connectivity with graph convolutional networks. In: 2020 25th International conference on pattern recognition (ICPR). IEEE, pp 9996–10003
Sahbi H (2021) Lightweight connectivity in graph convolutional networks for skeleton-based recognition. In: IEEE international conference on image processing (ICIP), pp 2329–2333
Sahbi H (2022) Topologically-consistent magnitude pruning for very lightweight graph convolutional networks. In: IEEE international conference on image processing (ICIP), pp 3495–3499
Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
No grant was received for conducting this study. The authors declare that they have no competing interests relevant to the content of this article.
Human and/or animal rights
This article does not contain any studies involving human participants or animals performed by any of the authors.
Informed consent
For this type of study, informed consent is not required.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Mazari, A., Sahbi, H. Deep multiple aggregation networks for action recognition. Int J Multimed Info Retr 13, 9 (2024). https://doi.org/10.1007/s13735-023-00317-1