DOI: 10.1145/3123266.3127901

Knowing Yourself: Improving Video Caption via In-depth Recap

Published: 23 October 2017

Abstract

Generating natural language descriptions for videos (a.k.a. video captioning) has attracted much research attention in recent years, and many models have been proposed to improve captioning performance. However, due to rapid progress in dataset expansion and feature representation, newly proposed caption models have been evaluated under different settings, which obscures whether improvements come from features or from models. Therefore, in this work we aim to gain a deep understanding of "where we are" in the current development of video captioning. First, we carry out extensive experiments to identify the contribution of different components in the video captioning task and make a fair comparison among several state-of-the-art video caption models. Second, we discover that these state-of-the-art models are complementary, so that we can benefit from the "wisdom of the crowd" through ensembling and reranking. Finally, we give a preliminary answer to the question "how far are we from human-level performance in general" via a series of carefully designed experiments. In summary, our caption models achieve state-of-the-art performance on the MSR-VTT 2017 challenge, which is comparable with average human-level performance on current caption metrics. However, our analysis also shows that we still have a long way to go, for example in improving the generalization ability of current caption models.
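The ensembling-and-reranking idea mentioned above can be illustrated with a minimal sketch (not the authors' implementation): several caption models each propose candidate sentences for a video, and each candidate is reranked by its average n-gram agreement with the other candidates, a toy stand-in for a consensus metric such as CIDEr. All function names and the example caption pool below are illustrative assumptions.

    from collections import Counter
    from typing import List

    def ngrams(tokens: List[str], n: int) -> Counter:
        # Count all n-grams of length n in the token list.
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def overlap_f1(a: str, b: str, max_n: int = 2) -> float:
        # Symmetric n-gram F1 between two captions (toy consensus measure,
        # a hypothetical stand-in for a metric such as CIDEr).
        ta, tb = a.lower().split(), b.lower().split()
        score = 0.0
        for n in range(1, max_n + 1):
            ga, gb = ngrams(ta, n), ngrams(tb, n)
            matches = sum((ga & gb).values())
            if matches == 0:
                continue
            p = matches / sum(ga.values())
            r = matches / sum(gb.values())
            score += 2 * p * r / (p + r)
        return score / max_n

    def rerank(candidates: List[str]) -> List[str]:
        # Order candidates by their average agreement with all other candidates
        # ("wisdom of the crowd"): captions most models agree on rise to the top.
        def consensus(i: int) -> float:
            others = [c for j, c in enumerate(candidates) if j != i]
            if not others:
                return 0.0
            return sum(overlap_f1(candidates[i], o) for o in others) / len(others)
        order = sorted(range(len(candidates)), key=consensus, reverse=True)
        return [candidates[i] for i in order]

    if __name__ == "__main__":
        # Hypothetical outputs from three different caption models for one video.
        pool = [
            "a man is playing a guitar on stage",
            "a person plays guitar in front of a crowd",
            "a dog is running in the park",
        ]
        print(rerank(pool)[0])  # prints the caption the ensemble agrees on most

In this sketch the outlier caption ("a dog is running in the park") is ranked last because it shares few n-grams with the rest of the pool; the paper's actual reranking operates over stronger captioning models and metrics.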


Published In

MM '17: Proceedings of the 25th ACM international conference on Multimedia
October 2017
2028 pages
ISBN:9781450349062
DOI:10.1145/3123266
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 October 2017

Author Tags

  1. ensemble
  2. reranking
  3. temporal attention
  4. topic-guided model
  5. video captioning

Qualifiers

  • Research-article

Funding Sources

  • National Key Research and Development Plan

Conference

MM '17
MM '17: ACM Multimedia Conference
October 23 - 27, 2017
Mountain View, California, USA

Acceptance Rates

MM '17 Paper Acceptance Rate: 189 of 684 submissions, 28%
Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 4
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 14 Dec 2024

Citations

Cited By

  • (2021) End-to-End Video Question-Answer Generation With Generator-Pretester Network. IEEE Transactions on Circuits and Systems for Video Technology, 31(11):4497-4507. DOI: 10.1109/TCSVT.2021.3051277. Online publication date: Nov-2021.
  • (2019) Impact of Video Compression and Multimodal Embedding on Scene Description. Electronics, 8(9):963. DOI: 10.3390/electronics8090963. Online publication date: 30-Aug-2019.
  • (2019) Hierarchical Global-Local Temporal Modeling for Video Captioning. Proceedings of the 27th ACM International Conference on Multimedia, pages 774-783. DOI: 10.1145/3343031.3351072. Online publication date: 15-Oct-2019.
  • (2019) Deep multimodal embedding for video captioning. Multimedia Tools and Applications. DOI: 10.1007/s11042-019-08011-3. Online publication date: 24-Jul-2019.
