DOI: 10.1145/3394171.3416291

VideoTRM: Pre-training for Video Captioning Challenge 2020

Published: 12 October 2020

Abstract

The Pre-training for Video Captioning Challenge 2020 focuses on developing video captioning systems by pre-training on the newly released large-scale Auto-captions on GIF dataset and then transferring the pre-trained model to the MSR-VTT benchmark. As part of our submission to this challenge, we propose a Transformer-based framework named VideoTRM, which consists of four modules: a textual encoder that encodes the linguistic relationships among words in the input sentence, a visual encoder that captures the temporal dynamics in the input video, a cross-modal encoder that models the interactions between the two modalities (i.e., textual and visual), and a decoder that generates the sentence conditioned on the input video and the previously generated words. Additionally, during fine-tuning we extend the decoder in VideoTRM with mesh-like connections, to take advantage of multi-level visual features, and a gated fusion mechanism in multi-head attention, to bypass less informative attention results. In the evaluation on the test server, VideoTRM achieves superior performance and ultimately ranks second on the leaderboard.
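
Since the abstract only names the decoder extensions, the following minimal PyTorch sketch illustrates one plausible reading of them: a separate cross-attention per visual encoder level (the mesh-like connections over multi-level visual features), fused through learned sigmoid gates so that less informative attention results can be down-weighted. This is an illustrative sketch under our own assumptions, not the authors' implementation; all names in it (e.g., GatedMeshedCrossAttention) are hypothetical.

# Minimal, illustrative sketch (not the authors' code) of mesh-like
# cross-attention over multi-level visual features with gated fusion.
# Module and parameter names are hypothetical.
import torch
import torch.nn as nn

class GatedMeshedCrossAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, n_levels: int):
        super().__init__()
        # One cross-attention per encoder level (mesh-like connections).
        self.attns = nn.ModuleList(
            [nn.MultiheadAttention(d_model, n_heads, batch_first=True)
             for _ in range(n_levels)]
        )
        # One sigmoid gate per level, conditioned on the query state
        # and that level's attention output.
        self.gates = nn.ModuleList(
            [nn.Linear(2 * d_model, d_model) for _ in range(n_levels)]
        )

    def forward(self, queries, visual_levels):
        # queries: (B, T_txt, d_model) decoder states for generated words
        # visual_levels: list of (B, T_vis, d_model), one per encoder level
        fused = torch.zeros_like(queries)
        for attn, gate, mem in zip(self.attns, self.gates, visual_levels):
            out, _ = attn(queries, mem, mem)  # cross-modal attention
            g = torch.sigmoid(gate(torch.cat([queries, out], dim=-1)))
            fused = fused + g * out           # gate can bypass weak results
        return fused / len(self.attns)

if __name__ == "__main__":
    layer = GatedMeshedCrossAttention(d_model=512, n_heads=8, n_levels=3)
    words = torch.randn(2, 10, 512)                      # textual states
    video = [torch.randn(2, 20, 512) for _ in range(3)]  # per-level features
    print(layer(words, video).shape)                     # torch.Size([2, 10, 512])

Under this reading, a gate output near zero effectively bypasses that level's attention result, matching the "bypass less informative attention results" behavior described in the abstract.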

Supplementary Material

MP4 File (3394171.3416291.mp4)
In this video, we present the paper titled "VideoTRM: Pre-training for Video Captioning Challenge 2020". First, we briefly introduce the tasks of video captioning and pre-training to provide background. Then, we describe our proposed VideoTRM in terms of its full architecture, the four proxy tasks used for pre-training, and the learning strategies involved. Next, we present the quantitative results and the corresponding analysis. Finally, we summarize the contributions of the paper.





    Published In

    MM '20: Proceedings of the 28th ACM International Conference on Multimedia
    October 2020
    4889 pages
    ISBN: 9781450379885
    DOI: 10.1145/3394171


    Publisher

    Association for Computing Machinery, New York, NY, United States



    Author Tags

    1. representation learning
    2. video captioning
    3. vision-language pre-training

    Qualifiers

    • Short-paper

    Funding Sources

    • Guangzhou Science and Technology Program China
    • NSF of China

    Conference

    MM '20

    Acceptance Rates

    Overall Acceptance Rate 2,010 of 7,772 submissions, 26%


    Cited By

    • (2023) Video Transformers: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1-20. DOI: 10.1109/TPAMI.2023.3243465. Online publication date: 2023.
    • (2023) Video Captioning Based on Cascaded Attention-Guided Visual Feature Fusion. Neural Processing Letters 55:8, 11509-11526. DOI: 10.1007/s11063-023-11386-y. Online publication date: 25-Aug-2023.
    • (2022) Dual-Scale Alignment-Based Transformer on Linguistic Skeleton Tags for Non-Autoregressive Video Captioning. 2022 IEEE International Conference on Multimedia and Expo (ICME), 1-6. DOI: 10.1109/ICME52920.2022.9859882. Online publication date: 18-Jul-2022.
    • (2022) Visual-Aware Attention Dual-Stream Decoder for Video Captioning. 2022 IEEE International Conference on Multimedia and Expo (ICME), 1-6. DOI: 10.1109/ICME52920.2022.9859743. Online publication date: 18-Jul-2022.
    • (2021) Memory-attended semantic context-aware network for video captioning. Soft Computing. DOI: 10.1007/s00500-021-06360-6. Online publication date: 11-Nov-2021.
    • (2021) Modeling Context-Guided Visual and Linguistic Semantic Feature for Video Captioning. Artificial Neural Networks and Machine Learning – ICANN 2021, 677-689. DOI: 10.1007/978-3-030-86383-8_54. Online publication date: 7-Sep-2021.
