DOI: 10.1145/3394171.3416291

VideoTRM: Pre-training for Video Captioning Challenge 2020

Published: 12 October 2020

Abstract

The Pre-training for Video Captioning Challenge 2020 focuses on developing video captioning systems by pre-training on the newly released large-scale Auto-captions on GIF dataset and then transferring the pre-trained model to the MSR-VTT benchmark. As part of our submission to this challenge, we propose a Transformer-based framework named VideoTRM, which consists of four modules: a textual encoder that encodes the linguistic relationships among words in the input sentence, a visual encoder that captures the temporal dynamics in the input video, a cross-modal encoder that models the interactions between the two modalities (i.e., textual and visual), and a decoder that generates the sentence conditioned on the input video and the previously generated words. Additionally, during fine-tuning we extend the decoder in VideoTRM with mesh-like connections, to take advantage of multi-level visual features, and a gated fusion mechanism in multi-head attention, to bypass less informative attention results. In the evaluation on the test server, VideoTRM achieves superior performance and ultimately ranks second on the leaderboard.
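
Since the abstract only names the decoder extensions, the following minimal PyTorch sketch illustrates one plausible reading of them: a separate cross-attention per visual encoder level (the mesh-like connections over multi-level visual features), fused through learned sigmoid gates so that less informative attention results can be down-weighted. This is an illustrative sketch under our own assumptions, not the authors' implementation; all names in it (e.g., GatedMeshedCrossAttention) are hypothetical.

# Minimal, illustrative sketch (not the authors' code) of mesh-like
# cross-attention over multi-level visual features with gated fusion.
# Module and parameter names are hypothetical.
import torch
import torch.nn as nn

class GatedMeshedCrossAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, n_levels: int):
        super().__init__()
        # One cross-attention per encoder level (mesh-like connections).
        self.attns = nn.ModuleList(
            [nn.MultiheadAttention(d_model, n_heads, batch_first=True)
             for _ in range(n_levels)]
        )
        # One sigmoid gate per level, conditioned on the query state
        # and that level's attention output.
        self.gates = nn.ModuleList(
            [nn.Linear(2 * d_model, d_model) for _ in range(n_levels)]
        )

    def forward(self, queries, visual_levels):
        # queries: (B, T_txt, d_model) decoder states for generated words
        # visual_levels: list of (B, T_vis, d_model), one per encoder level
        fused = torch.zeros_like(queries)
        for attn, gate, mem in zip(self.attns, self.gates, visual_levels):
            out, _ = attn(queries, mem, mem)  # cross-modal attention
            g = torch.sigmoid(gate(torch.cat([queries, out], dim=-1)))
            fused = fused + g * out           # gate can bypass weak results
        return fused / len(self.attns)

if __name__ == "__main__":
    layer = GatedMeshedCrossAttention(d_model=512, n_heads=8, n_levels=3)
    words = torch.randn(2, 10, 512)                      # textual states
    video = [torch.randn(2, 20, 512) for _ in range(3)]  # per-level features
    print(layer(words, video).shape)                     # torch.Size([2, 10, 512])

Under this reading, a gate output near zero effectively bypasses that level's attention result, matching the "bypass less informative attention results" behavior described in the abstract.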

Supplementary Material

MP4 File (3394171.3416291.mp4)
In this video, we present the paper titled "VideoTRM: Pre-training for Video Captioning Challenge 2020". First, we briefly introduce the tasks of video captioning and pre-training to provide background. Then, we describe our proposed VideoTRM in terms of its full architecture, the four proxy tasks used for pre-training, and the learning strategies involved. Next, we present the quantitative results and the corresponding analysis. Finally, we summarize the contributions of the paper.





    Published In

    MM '20: Proceedings of the 28th ACM International Conference on Multimedia
    October 2020
    4889 pages
    ISBN: 9781450379885
    DOI: 10.1145/3394171


    Publisher

    Association for Computing Machinery, New York, NY, United States



    Author Tags

    1. representation learning
    2. video captioning
    3. vision-language pre-training

    Qualifiers

    • Short-paper

    Funding Sources

    • Guangzhou Science and Technology Program China
    • NSF of China

    Conference

    MM '20

    Acceptance Rates

    Overall Acceptance Rate 2,010 of 7,772 submissions, 26%


    Cited By

    • (2023) Video Transformers: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1-20. DOI: 10.1109/TPAMI.2023.3243465. Online publication date: 2023.
    • (2023) Video Captioning Based on Cascaded Attention-Guided Visual Feature Fusion. Neural Processing Letters 55:8, 11509-11526. DOI: 10.1007/s11063-023-11386-y. Online publication date: 25-Aug-2023.
    • (2022) Dual-Scale Alignment-Based Transformer on Linguistic Skeleton Tags for Non-Autoregressive Video Captioning. 2022 IEEE International Conference on Multimedia and Expo (ICME), 1-6. DOI: 10.1109/ICME52920.2022.9859882. Online publication date: 18-Jul-2022.
    • (2022) Visual-Aware Attention Dual-Stream Decoder for Video Captioning. 2022 IEEE International Conference on Multimedia and Expo (ICME), 1-6. DOI: 10.1109/ICME52920.2022.9859743. Online publication date: 18-Jul-2022.
    • (2021) Memory-attended semantic context-aware network for video captioning. Soft Computing. DOI: 10.1007/s00500-021-06360-6. Online publication date: 11-Nov-2021.
    • (2021) Modeling Context-Guided Visual and Linguistic Semantic Feature for Video Captioning. Artificial Neural Networks and Machine Learning – ICANN 2021, 677-689. DOI: 10.1007/978-3-030-86383-8_54. Online publication date: 7-Sep-2021.
