DOI: 10.1145/3343031.3351072

Hierarchical Global-Local Temporal Modeling for Video Captioning

Published: 15 October 2019

Abstract

In this paper, a Hierarchical Temporal Model (HTM) is proposed for video captioning, which exploits both global and local temporal structure to better recognize fine-grained objects and actions. In our HTM, the encoder and decoder are hierarchically aligned according to different levels of features. The encoder applies two LSTM layers to construct temporal structure at both the frame level and the object level, with an attention mechanism used to locate objects of interest, and the decoder uses corresponding LSTM layers to extract pivotal features from global to local through a multi-level attention mechanism. Moreover, the local temporal structure is constructed implicitly from candidate object-oriented features under the guidance of the global temporal-spatial representation, which yields more accurate descriptions when shots switch within a video. Experiments on the widely used Microsoft Video Description corpus (MSVD) and the Charades dataset demonstrate the effectiveness of the proposed approach compared to state-of-the-art methods.
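To make the hierarchy concrete, the sketch below illustrates how such a global-local encoder-decoder might be wired up. It is a minimal PyTorch illustration, not the authors' implementation: the module names, feature dimensions, additive attention form, and the exact way the frame-level state guides object selection are all assumptions made for clarity.

```python
# A minimal PyTorch sketch of a hierarchical global-local encoder-decoder.
# All module names, sizes, and fusion details are illustrative assumptions,
# not the authors' released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Attention(nn.Module):
    """Additive attention: scores a set of features against a query state."""

    def __init__(self, feat_dim, hidden_dim, attn_dim=256):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, attn_dim)
        self.w_query = nn.Linear(hidden_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, feats, query):
        # feats: (B, N, feat_dim); query: (B, hidden_dim)
        scores = self.v(torch.tanh(self.w_feat(feats) + self.w_query(query).unsqueeze(1)))
        weights = F.softmax(scores, dim=1)               # attention over the N features
        return (weights * feats).sum(dim=1)              # (B, feat_dim) context vector


class HierarchicalCaptioner(nn.Module):
    def __init__(self, frame_dim=2048, obj_dim=2048, hid=512, vocab=10000, emb=512):
        super().__init__()
        # Encoder: a frame-level (global) and an object-level (local) LSTM.
        self.frame_lstm = nn.LSTM(frame_dim, hid, batch_first=True)
        self.obj_attn = Attention(obj_dim, hid)          # selects objects per frame
        self.obj_lstm = nn.LSTM(obj_dim, hid, batch_first=True)
        # Decoder: two LSTM layers aligned with the two encoder levels.
        self.embed = nn.Embedding(vocab, emb)
        self.global_attn = Attention(hid, hid)
        self.local_attn = Attention(hid, hid)
        self.dec_global = nn.LSTMCell(emb + hid, hid)
        self.dec_local = nn.LSTMCell(hid + hid, hid)
        self.out = nn.Linear(hid, vocab)

    def encode(self, frames, objects):
        # frames: (B, T, frame_dim); objects: (B, T, K, obj_dim)
        g_feats, _ = self.frame_lstm(frames)             # global temporal structure
        # Attend over each frame's K object proposals, guided by the
        # frame-level hidden state ("global guides local").
        picked = torch.stack(
            [self.obj_attn(objects[:, t], g_feats[:, t]) for t in range(frames.size(1))],
            dim=1)
        l_feats, _ = self.obj_lstm(picked)               # local temporal structure
        return g_feats, l_feats

    def init_state(self, batch, device):
        zeros = lambda: torch.zeros(batch, self.out.in_features, device=device)
        return (zeros(), zeros()), (zeros(), zeros())

    def decode_step(self, word, g_feats, l_feats, state):
        # One decoding step: global-level attention feeds the first LSTM layer,
        # local-level attention feeds the second, which predicts the next word.
        (hg, cg), (hl, cl) = state
        emb = self.embed(word)
        ctx_g = self.global_attn(g_feats, hg)
        hg, cg = self.dec_global(torch.cat([emb, ctx_g], dim=-1), (hg, cg))
        ctx_l = self.local_attn(l_feats, hl)
        hl, cl = self.dec_local(torch.cat([hg, ctx_l], dim=-1), (hl, cl))
        return self.out(hl), ((hg, cg), (hl, cl))


# Usage sketch with random tensors standing in for CNN frame features and
# detector-based object features (hypothetical shapes: 16 frames, 5 objects):
model = HierarchicalCaptioner()
g, l = model.encode(torch.randn(2, 16, 2048), torch.randn(2, 16, 5, 2048))
logits, state = model.decode_step(torch.zeros(2, dtype=torch.long), g, l,
                                  model.init_state(2, g.device))
```

Training such a model would typically use teacher forcing with cross-entropy over the vocabulary, and greedy or beam-search decoding at inference; these details follow common practice for LSTM captioners rather than anything stated in the abstract above.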




    Published In

    MM '19: Proceedings of the 27th ACM International Conference on Multimedia
    October 2019
    2794 pages
ISBN: 9781450368896
DOI: 10.1145/3343031


    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 15 October 2019


    Author Tags

    1. description generation
    2. hierarchical model
    3. video captioning

    Qualifiers

    • Research-article

    Funding Sources

    • National Key R&D Program of China

    Conference

    MM '19

    Acceptance Rates

MM '19 paper acceptance rate: 252 of 936 submissions (27%).
Overall acceptance rate: 995 of 4,171 submissions (24%).


Article Metrics

• Downloads (last 12 months): 24
• Downloads (last 6 weeks): 4

Reflects downloads up to 24 Sep 2024.

    Cited By

• (2024) SPT: Spatial Pyramid Transformer for Image Captioning. IEEE Transactions on Circuits and Systems for Video Technology 34(6):4829–4842. DOI: 10.1109/TCSVT.2023.3336371
• (2024) Mind the Gap: Open Set Domain Adaptation via Mutual-to-Separate Framework. IEEE Transactions on Circuits and Systems for Video Technology 34(6):4159–4174. DOI: 10.1109/TCSVT.2023.3326862
• (2024) CLIP-based Semantic Enhancement and Vocabulary Expansion for Video Captioning Using Reinforcement Learning. 2024 International Joint Conference on Neural Networks (IJCNN), 1–8. DOI: 10.1109/IJCNN60899.2024.10651205
• (2024) Product promotion copywriting from multimodal data: New benchmark and model. Neurocomputing 575:127253. DOI: 10.1016/j.neucom.2024.127253
• (2024) RESTHT: relation-enhanced spatial–temporal hierarchical transformer for video captioning. The Visual Computer. DOI: 10.1007/s00371-024-03350-1
• (2023) Shifted GCN-GAT and Cumulative-Transformer based Social Relation Recognition for Long Videos. Proceedings of the 31st ACM International Conference on Multimedia, 67–76. DOI: 10.1145/3581783.3612175
• (2023) A Decoupled Kernel Prediction Network Guided by Soft Mask for Single Image HDR Reconstruction. ACM Transactions on Multimedia Computing, Communications, and Applications 19(2s):1–23. DOI: 10.1145/3550277
• (2023) Deep Saliency Mapping for 3D Meshes and Applications. ACM Transactions on Multimedia Computing, Communications, and Applications 19(2):1–22. DOI: 10.1145/3550073
• (2023) Learning Video-Text Aligned Representations for Video Captioning. ACM Transactions on Multimedia Computing, Communications, and Applications 19(2):1–21. DOI: 10.1145/3546828
• (2023) Complementarity-Aware Space Learning for Video-Text Retrieval. IEEE Transactions on Circuits and Systems for Video Technology 33(8):4362–4374. DOI: 10.1109/TCSVT.2023.3235523
