Temporal information oriented motion accumulation and selection network for RGB-based action recognition

Published: 01 September 2023

Highlights

Motion Accumulation Module (MAM) captures comprehensive motion information.
Motion Selection Module (MSM) selects the motion features most relevant to classification.
MAS-Net achieves state-of-the-art results on the Something-Something V1&V2 and Diving48 datasets.
Competitive results on Kinetics-400 with low computational load.

Abstract

Numerous studies have highlighted the crucial role of motion information in accurate action recognition in videos. However, current methods rely heavily on temporal differences of features extracted by convolutional neural networks (CNNs) to represent motion, which has two potential limitations: (1) the difference operation yields an incomplete representation of the moving target's contour, and (2) all extracted motion features are treated equally, regardless of their relevance to the classification task, which may degrade performance. To address these limitations, we propose the Motion Accumulation and Selection Network (MAS-Net). Although MAS-Net also considers spatial attributes, it draws inspiration from the cumulative and selective nature of human visual attention and focuses primarily on capturing the temporal attributes of actions. A motion accumulation module aggregates motion cues over time to capture more comprehensive motion information, and a motion selection module prioritizes relevant temporal features while filtering out irrelevant ones. There is a growing demand for action recognition benchmarks with strong temporal dependencies, as opposed to conventional scene-related datasets such as UCF-101 and HMDB-51. We therefore evaluate MAS-Net on benchmark video datasets that primarily emphasize temporal information: Something-Something V1&V2, Diving48, and Kinetics-400. Our experiments show that MAS-Net achieves state-of-the-art performance on the Something-Something V1&V2 and Diving48 datasets. Compared with other 2D CNN-based models, MAS-Net also obtains competitive results on Kinetics-400 while maintaining computational efficiency. These findings highlight the effectiveness and efficiency of MAS-Net for temporal modeling in video analysis.
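To make the two components in the abstract concrete, the following is a minimal, illustrative PyTorch-style sketch of motion accumulation (retaining motion history beyond a single frame-to-frame difference) followed by motion selection (gating that emphasizes task-relevant motion channels). The class names, tensor shapes, and gating design are assumptions for illustration only, not the authors' published MAS-Net implementation.

```python
# Illustrative sketch only: module names, shapes, and design are assumed,
# not taken from the authors' MAS-Net code.
import torch
import torch.nn as nn


class MotionAccumulation(nn.Module):
    """Accumulates frame-to-frame feature differences over time, so the
    motion representation is not limited to a single difference step."""

    def __init__(self, channels: int):
        super().__init__()
        # depthwise conv to lightly smooth the accumulated motion maps
        self.smooth = nn.Conv2d(channels, channels, kernel_size=3,
                                padding=1, groups=channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels, height, width)
        diff = x[:, 1:] - x[:, :-1]            # per-step motion
        accum = torch.cumsum(diff, dim=1)      # accumulated motion history
        b, t, c, h, w = accum.shape
        accum = self.smooth(accum.reshape(b * t, c, h, w)).reshape(b, t, c, h, w)
        # pad the first frame so the temporal length matches the input
        return torch.cat([torch.zeros_like(x[:, :1]), accum], dim=1)


class MotionSelection(nn.Module):
    """Channel-wise gating that emphasizes motion features relevant to
    classification and suppresses the rest."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, motion: torch.Tensor) -> torch.Tensor:
        # motion: (batch, time, channels, height, width)
        weights = self.gate(motion.mean(dim=(3, 4)))          # (b, t, c)
        return motion * weights.unsqueeze(-1).unsqueeze(-1)


if __name__ == "__main__":
    feats = torch.randn(2, 8, 64, 14, 14)   # 8 frames of CNN features
    selected = MotionSelection(64)(MotionAccumulation(64)(feats))
    print(selected.shape)                    # torch.Size([2, 8, 64, 14, 14])
```

The cumulative sum is one simple way to keep motion evidence from all preceding frames rather than a single difference; the modules proposed in the paper may differ substantially in how accumulation and selection are realized.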

Published In

Image and Vision Computing, Volume 137, Issue C, September 2023 (383 pages)
Publisher: Butterworth-Heinemann, United States

Author Tags

1. Action recognition
2. Motion accumulation
3. Selective network
4. Motion
