Abstract
The use of task-agnostic, pre-trained models for knowledge transfer has become more prevalent due to the availability of extensive open-source vision-language models (VLMs) and increased computational power. However, despite their widespread application across various domains, their potential for online action detection has not been fully explored. Current approaches instead rely on features pre-extracted with convolutional neural networks. In this paper, we explore the use of VLMs for online action detection, emphasizing their effectiveness and their capabilities in zero-shot and few-shot learning scenarios. Through empirical demonstrations of their robust performance, our research positions VLMs as a powerful tool for further advancing the state of the art in online action detection.
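To make the zero-shot setting concrete, the following is a minimal sketch (not the pipeline evaluated in this paper) of how a pre-trained VLM such as CLIP can score incoming video frames against textual action prompts; the checkpoint, action labels, and prompt template below are illustrative assumptions.

```python
# Minimal zero-shot sketch, NOT the authors' method: score a streamed video
# frame against textual action prompts with a pre-trained CLIP model.
# The checkpoint, label set, and prompt template are illustrative choices.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

actions = ["pole vault", "diving", "background"]  # hypothetical label set
prompts = [f"a video frame of a person performing {a}" for a in actions]

@torch.no_grad()
def score_frame(frame: Image.Image) -> torch.Tensor:
    """Return per-action probabilities for a single incoming frame."""
    inputs = processor(text=prompts, images=frame,
                       return_tensors="pt", padding=True)
    logits = model(**inputs).logits_per_image  # shape: (1, num_actions)
    return logits.softmax(dim=-1).squeeze(0)

# Usage: a blank image stands in for a decoded frame from the stream.
probs = score_frame(Image.new("RGB", (224, 224)))
print({a: round(float(p), 3) for a, p in zip(actions, probs)})
```

In a true online setting, such per-frame scores would typically be smoothed over a temporal window before a detection is emitted, since single-frame evidence is noisy.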
Acknowledgments
We would like to thank the “A way of making Europe” European Regional Development Fund (ERDF) and MCIN/AEI/10.13039/501100011033 for supporting this work under the “CHAN-TWIN” project (grant TED2021-130890B-C21), HORIZON-MSCA-2021-SE-0 action number 101086387 (REMARKABLE, Rural Environmental Monitoring via ultra wide-ARea networKs And distriButed federated Learning), and the consolidated group project “AI4Health” (CIAICO/2022/132) funded by the Valencian government. This work has also been supported by a Spanish national grant and two regional grants for PhD studies: FPU21/00414, CIACIF/2021/430, and CIACIF/2022/175. Finally, we would like to thank the University Institute for Computer Research at the UA for its support.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Benavent-Lledo, M., Mulero-Pérez, D., Ortiz-Perez, D., Garcia-Rodriguez, J., Orts-Escolano, S. (2024). Exploring Text-Driven Approaches for Online Action Detection. In: Ferrández Vicente, J.M., Val Calvo, M., Adeli, H. (eds) Bioinspired Systems for Translational Applications: From Robotics to Social Engineering. IWINAC 2024. Lecture Notes in Computer Science, vol 14675. Springer, Cham. https://doi.org/10.1007/978-3-031-61137-7_6
Print ISBN: 978-3-031-61136-0
Online ISBN: 978-3-031-61137-7