Abstract
The use of task-agnostic, pre-trained models for knowledge transfer has become more prevalent due to the availability of extensive open-source vision-language models (VLMs) and increased computational power. However, despite their widespread application across various domains, their potential for online action detection has not been fully explored. Current approaches instead rely on features pre-extracted with convolutional neural networks. In this paper, we explore the use of VLMs for online action detection, emphasizing their effectiveness and their capabilities in zero-shot and few-shot learning scenarios. Through empirical demonstrations of their robust performance, our research positions VLMs as a powerful tool for further advancing the state of the art in online action detection.
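To make the zero-shot setting concrete, the following is a minimal sketch (not the pipeline evaluated in this paper) of how a pre-trained VLM such as CLIP can score incoming video frames against textual action prompts; the checkpoint, action labels, and prompt template below are illustrative assumptions.

```python
# Minimal zero-shot sketch, NOT the authors' method: score a streamed video
# frame against textual action prompts with a pre-trained CLIP model.
# The checkpoint, label set, and prompt template are illustrative choices.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

actions = ["pole vault", "diving", "background"]  # hypothetical label set
prompts = [f"a video frame of a person performing {a}" for a in actions]

@torch.no_grad()
def score_frame(frame: Image.Image) -> torch.Tensor:
    """Return per-action probabilities for a single incoming frame."""
    inputs = processor(text=prompts, images=frame,
                       return_tensors="pt", padding=True)
    logits = model(**inputs).logits_per_image  # shape: (1, num_actions)
    return logits.softmax(dim=-1).squeeze(0)

# Usage: a blank image stands in for a decoded frame from the stream.
probs = score_frame(Image.new("RGB", (224, 224)))
print({a: round(float(p), 3) for a, p in zip(actions, probs)})
```

In a true online setting, such per-frame scores would typically be smoothed over a temporal window before a detection is emitted, since single-frame evidence is noisy.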
Acknowledgments
We would like to thank the “A way of making Europe” European Regional Development Fund (ERDF) and MCIN/AEI/10.13039/501100011033 for supporting this work under the “CHAN-TWIN” project (grant TED2021-130890B-C21), HORIZON-MSCA-2021-SE-0 action number 101086387 (REMARKABLE, Rural Environmental Monitoring via ultra wide-ARea networKs And distriButed federated Learning), and the consolidated group project “AI4Health” (CIAICO/2022/132) funded by the Valencian government. This work has also been supported by a Spanish national grant and two regional grants for PhD studies: FPU21/00414, CIACIF/2021/430, and CIACIF/2022/175. Finally, we would like to thank the University Institute for Computer Research at the UA for its support.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Benavent-Lledo, M., Mulero-Pérez, D., Ortiz-Perez, D., Garcia-Rodriguez, J., Orts-Escolano, S. (2024). Exploring Text-Driven Approaches for Online Action Detection. In: Ferrández Vicente, J.M., Val Calvo, M., Adeli, H. (eds) Bioinspired Systems for Translational Applications: From Robotics to Social Engineering. IWINAC 2024. Lecture Notes in Computer Science, vol 14675. Springer, Cham. https://doi.org/10.1007/978-3-031-61137-7_6
Print ISBN: 978-3-031-61136-0
Online ISBN: 978-3-031-61137-7