Exploring Text-Driven Approaches for Online Action Detection

Conference paper in Bioinspired Systems for Translational Applications: From Robotics to Social Engineering (IWINAC 2024)

Abstract

The use of task-agnostic, pre-trained models for knowledge transfer has become more prevalent due to the broad availability of open-source vision-language models (VLMs) and increased computational power. However, despite their widespread application across various domains, their potential for online action detection has not been fully explored: current approaches still rely on features pre-extracted with convolutional neural networks. In this paper, we explore the potential of VLMs for online action detection, emphasizing their effectiveness in zero-shot and few-shot learning scenarios. Through empirical demonstrations of their robust performance, we position VLMs as a powerful tool for further advancing the state of the art in online action detection.
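The abstract contrasts this text-driven approach with pipelines built on pre-extracted CNN features: incoming frames are scored directly against natural-language descriptions of the action classes, which is what enables zero-shot operation. Below is a minimal sketch of that idea (not the authors' implementation), using an off-the-shelf CLIP model through the Hugging Face transformers API; the model checkpoint, prompt template, and action vocabulary are illustrative assumptions.

```python
# Minimal zero-shot, per-frame action scoring with a pre-trained VLM.
# Sketch only: the checkpoint, prompts, and class list are illustrative
# assumptions, not the paper's configuration.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical action vocabulary; a THUMOS-style class list would go here.
actions = ["high jump", "pole vault", "diving", "no action"]
prompts = [f"a video frame of a person performing {a}" for a in actions]

@torch.no_grad()
def score_frame(frame: Image.Image) -> tuple[str, float]:
    """Return the best-matching action label and its probability for one frame."""
    inputs = processor(text=prompts, images=frame,
                       return_tensors="pt", padding=True)
    logits = model(**inputs).logits_per_image[0]  # shape: (len(actions),)
    probs = logits.softmax(dim=-1)
    best = int(probs.argmax())
    return actions[best], float(probs[best])

# Online use: frames are processed as they arrive, with no access to the future.
# for frame in frame_stream:
#     label, confidence = score_frame(frame)
```

In practice the text prompts would be encoded once and cached, since only the image embedding changes from frame to frame; the sketch recomputes them on every call purely for brevity.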

Acknowledgments

We would like to thank the “A way of making Europe” European Regional Development Fund (ERDF) and MCIN/AEI/10.13039/501100011033 for supporting this work under the “CHAN-TWIN” project (grant TED2021-130890B-C21). This work was also supported by the HORIZON-MSCA-2021-SE-0 action REMARKABLE (Rural Environmental Monitoring via ultra wide-ARea networKs And distriButed federated Learning, action number 101086387) and by the consolidated group project “AI4Health” (CIAICO/2022/132) funded by the Valencian government, as well as by one Spanish national and two regional grants for PhD studies: FPU21/00414, CIACIF/2021/430, and CIACIF/2022/175. Finally, we would like to thank the University Institute for Computer Research at the UA for its support.

Author information

Corresponding author

Correspondence to Jose Garcia-Rodriguez.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Benavent-Lledo, M., Mulero-Pérez, D., Ortiz-Perez, D., Garcia-Rodriguez, J., Orts-Escolano, S. (2024). Exploring Text-Driven Approaches for Online Action Detection. In: Ferrández Vicente, J.M., Val Calvo, M., Adeli, H. (eds) Bioinspired Systems for Translational Applications: From Robotics to Social Engineering. IWINAC 2024. Lecture Notes in Computer Science, vol 14675. Springer, Cham. https://doi.org/10.1007/978-3-031-61137-7_6

  • DOI: https://doi.org/10.1007/978-3-031-61137-7_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-61136-0

  • Online ISBN: 978-3-031-61137-7

  • eBook Packages: Computer Science, Computer Science (R0)
