
DOI: 10.1145/3633637.3633643
Research Article
Open Access

FTAN: Exploring Frame-Text Attention for lightweight Video Captioning

Published: 28 February 2024

Abstract

Traditional video captioning approaches employ an LSTM as a lightweight decoder. However, these methods concentrate on fully extracting visual features while paying less attention to textual information, which results in relatively low-quality captions. Recent transformer-based methods achieve more accurate results, but at the cost of excessive computing resources. In this paper, we propose a lightweight model for video captioning named Frame-Text Attention Network (FTAN), which aims to make full use of both visual and textual features to obtain more accurate captions. We develop a novel text attention module in FTAN that uses the hidden state of the LSTM as the query to generate attentive text features. These attentive text features are then merged with the visual features and fed into the LSTM to generate more accurate captions. To the best of our knowledge, we are the first to introduce an attention mechanism that extracts the textual information hidden in the LSTM architecture for video captioning. Extensive experiments demonstrate the effectiveness of FTAN: it outperforms the state-of-the-art LSTM-based method on the MSVD dataset by 0.8 in CIDEr-D while using about one-fourth of the parameters of transformer-based methods.
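To make the described mechanism concrete, below is a minimal sketch of what one frame-text attention decoding step could look like. This is not the authors' implementation: it assumes a PyTorch setup, a single-layer LSTMCell decoder, and additive attention, and all module and variable names (FrameTextAttentionStep, vis_feat, txt_feats) are illustrative placeholders rather than names from the paper.

```python
# Hypothetical sketch of a frame-text attention decoding step (not the paper's code).
# Assumptions: the LSTM hidden state queries the embeddings of previously generated
# words; the attentive text feature is merged with the visual feature and decoded.
import torch
import torch.nn as nn


class FrameTextAttentionStep(nn.Module):
    def __init__(self, vis_dim: int, txt_dim: int, hid_dim: int):
        super().__init__()
        # Text attention: the LSTM hidden state acts as the query over the
        # embeddings of already generated words (the "text features").
        self.query_proj = nn.Linear(hid_dim, hid_dim)
        self.key_proj = nn.Linear(txt_dim, hid_dim)
        self.score = nn.Linear(hid_dim, 1)
        # The decoder consumes visual features merged with attentive text features.
        self.lstm = nn.LSTMCell(vis_dim + txt_dim, hid_dim)

    def forward(self, vis_feat, txt_feats, state):
        # vis_feat:  (B, vis_dim)      pooled visual feature for this step
        # txt_feats: (B, T, txt_dim)   embeddings of the words generated so far
        # state:     (h, c), each of shape (B, hid_dim)
        h, c = state
        # Additive attention scores over the text tokens, queried by the hidden state.
        q = self.query_proj(h).unsqueeze(1)            # (B, 1, hid_dim)
        k = self.key_proj(txt_feats)                   # (B, T, hid_dim)
        scores = self.score(torch.tanh(q + k))         # (B, T, 1)
        weights = torch.softmax(scores, dim=1)         # (B, T, 1)
        attn_txt = (weights * txt_feats).sum(dim=1)    # (B, txt_dim)
        # Merge the attentive text feature with the visual feature and decode one step.
        h, c = self.lstm(torch.cat([vis_feat, attn_txt], dim=-1), (h, c))
        return h, c, weights.squeeze(-1)
```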




    Published In

    ICCPR '23: Proceedings of the 2023 12th International Conference on Computing and Pattern Recognition
    October 2023
    589 pages
    ISBN: 9798400707988
    DOI: 10.1145/3633637
    This work is licensed under a Creative Commons Attribution 4.0 International License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. Attention Mechanism
    2. Textual Information
    3. Video Captioning

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Funding Sources

    • National Key Research and Development Program of China in the 14th Five-Year

    Conference

    ICCPR 2023
