Temporal information oriented motion accumulation and selection network for RGB-based action recognition

Published: 01 September 2023

Highlights

Motion Accumulation Module (MAM) captures comprehensive motion information.
Motion Selection Module (MSM) selects the motion features most relevant to classification.
MAS-Net achieves state-of-the-art results on the Something-Something V1&V2 and Diving48 datasets.
Competitive results on Kinetics-400 with low computational load.

Abstract

Numerous studies have highlighted the crucial role of motion information in accurate action recognition in videos. However, current methods rely heavily on temporal differences of features extracted by convolutional neural networks (CNNs) to represent motion, which has two potential limitations: (1) the difference operation yields an incomplete representation of the moving target's contour, and (2) all extracted motion features are treated equally, regardless of their relevance to the classification task, which may degrade performance. To address these limitations, we propose the Motion Accumulation and Selection Network (MAS-Net). Although MAS-Net also considers spatial attributes, it draws inspiration from the cumulative and selective nature of human visual attention and focuses primarily on capturing the temporal attributes of actions. A motion accumulation module aggregates motion cues over time to capture more comprehensive motion information, and a motion selection module prioritizes relevant temporal features while filtering out irrelevant ones. There is a growing demand for action recognition benchmarks with strong temporal dependencies, as opposed to conventional scene-related datasets such as UCF-101 and HMDB-51. We therefore evaluate MAS-Net on benchmark video datasets that primarily emphasize temporal information: Something-Something V1&V2, Diving48, and Kinetics-400. Our experiments show that MAS-Net achieves state-of-the-art performance on the Something-Something V1&V2 and Diving48 datasets. Compared with other 2D CNN-based models, MAS-Net also obtains competitive results on Kinetics-400 while maintaining computational efficiency. These findings highlight the effectiveness and efficiency of MAS-Net for temporal modeling in video analysis.
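To make the two components in the abstract concrete, the following is a minimal, illustrative PyTorch-style sketch of motion accumulation (retaining motion history beyond a single frame-to-frame difference) followed by motion selection (gating that emphasizes task-relevant motion channels). The class names, tensor shapes, and gating design are assumptions for illustration only, not the authors' published MAS-Net implementation.

```python
# Illustrative sketch only: module names, shapes, and design are assumed,
# not taken from the authors' MAS-Net code.
import torch
import torch.nn as nn


class MotionAccumulation(nn.Module):
    """Accumulates frame-to-frame feature differences over time, so the
    motion representation is not limited to a single difference step."""

    def __init__(self, channels: int):
        super().__init__()
        # depthwise conv to lightly smooth the accumulated motion maps
        self.smooth = nn.Conv2d(channels, channels, kernel_size=3,
                                padding=1, groups=channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels, height, width)
        diff = x[:, 1:] - x[:, :-1]            # per-step motion
        accum = torch.cumsum(diff, dim=1)      # accumulated motion history
        b, t, c, h, w = accum.shape
        accum = self.smooth(accum.reshape(b * t, c, h, w)).reshape(b, t, c, h, w)
        # pad the first frame so the temporal length matches the input
        return torch.cat([torch.zeros_like(x[:, :1]), accum], dim=1)


class MotionSelection(nn.Module):
    """Channel-wise gating that emphasizes motion features relevant to
    classification and suppresses the rest."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, motion: torch.Tensor) -> torch.Tensor:
        # motion: (batch, time, channels, height, width)
        weights = self.gate(motion.mean(dim=(3, 4)))          # (b, t, c)
        return motion * weights.unsqueeze(-1).unsqueeze(-1)


if __name__ == "__main__":
    feats = torch.randn(2, 8, 64, 14, 14)   # 8 frames of CNN features
    selected = MotionSelection(64)(MotionAccumulation(64)(feats))
    print(selected.shape)                    # torch.Size([2, 8, 64, 14, 14])
```

The cumulative sum is one simple way to keep motion evidence from all preceding frames rather than a single difference; the modules proposed in the paper may differ substantially in how accumulation and selection are realized.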

Published In

Image and Vision Computing, Volume 137, Issue C, September 2023 (383 pages)
Publisher: Butterworth-Heinemann, United States

Author Tags

1. Action recognition
2. Motion accumulation
3. Selective network
4. Motion
