
DOI: 10.1145/3474085.3475621

Progressive Semantic Matching for Video-Text Retrieval

Published: 17 October 2021

Abstract

Cross-modal retrieval between texts and videos is important yet challenging. Previous works in this domain typically learn a common space in which to match texts and videos, but matching is difficult because of the semantic gap between the two modalities. Although some methods employ coarse-to-fine or multi-expert networks to encode one or more common spaces for easier matching, they still directly optimize a single matching space, which remains hard given the large semantic gap between modalities. To address this issue, we narrow the semantic gap through a progressive learning process built on a coarse-to-fine architecture, and propose a novel Progressive Semantic Matching (PSM) method. We first construct a multilevel encoding network for videos and texts, and design auxiliary common spaces mapped from the outputs of the encoders at different levels. All common spaces are then jointly trained end to end. In this way, the model effectively encodes videos and texts into a fused common space via a progressive paradigm. Experimental results on three video-text datasets (i.e., MSR-VTT, TGIF and MSVD) demonstrate the advantages of our PSM, which achieves significant performance improvements over state-of-the-art approaches.
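The training scheme the abstract describes — per-level auxiliary common spaces that are all optimized jointly — can be sketched roughly as follows. This is an illustrative toy, not the paper's implementation: the random linear maps stand in for the multilevel video/text encoders, the cosine score stands in for the learned similarity (the paper uses a ranking loss over each space), and all dimensions are made-up.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_levels(x, weights):
    """Return one feature per encoding level (stand-ins for the paper's
    coarse-to-fine encoders); each level refines the previous one."""
    feats, h = [], x
    for W in weights:
        h = np.tanh(h @ W)
        feats.append(h)
    return feats

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

# Hypothetical dimensions: raw features -> three encoder levels -> common dim.
d_in, d_hid, d_common, levels = 32, 24, 16, 3
Wv = [rng.standard_normal((d_in if i == 0 else d_hid, d_hid)) * 0.1
      for i in range(levels)]
Wt = [rng.standard_normal((d_in if i == 0 else d_hid, d_hid)) * 0.1
      for i in range(levels)]
# One projection per level: the shallower spaces act as auxiliary,
# easier-to-match spaces; the deepest is the final matching space.
P = [rng.standard_normal((d_hid, d_common)) * 0.1 for _ in range(levels)]

video, text = rng.standard_normal(d_in), rng.standard_normal(d_in)
v_feats = encode_levels(video, Wv)
t_feats = encode_levels(text, Wt)

# Joint objective: accumulate the matching score (a triplet-ranking loss,
# in the paper) over every common space, so the coarse spaces guide
# optimization of the fine one instead of training it in isolation.
scores = [cosine(v @ p, t @ p) for v, t, p in zip(v_feats, t_feats, P)]
print("similarity per common space:", [round(s, 3) for s in scores])
```

At training time one would backpropagate the summed per-space losses through all encoder levels at once, which is the "jointly trained end to end" step in the abstract.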

Supplementary Material

MP4 File (MM21-fp2483.mp4)
Presentation Video for paper: Progressive Semantic Matching for Video-Text Retrieval.





Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021
5796 pages
ISBN: 9781450386517
DOI: 10.1145/3474085
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. cross-modal matching
  2. dual encoding
  3. representation learning
  4. video-text retrieval

Qualifiers

  • Research-article

Conference

MM '21: ACM Multimedia Conference
October 20-24, 2021
Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


Cited By

  • (2024) Estimating the Semantics via Sector Embedding for Image-Text Retrieval. IEEE Transactions on Multimedia 26, 10342-10353. DOI: 10.1109/TMM.2024.3407664
  • (2024) Cross-Lingual Cross-Modal Retrieval With Noise-Robust Fine-Tuning. IEEE Transactions on Knowledge and Data Engineering 36, 11, 5860-5873. DOI: 10.1109/TKDE.2024.3400060
  • (2024) Text-guided distillation learning to diversify video embeddings for text-video retrieval. Pattern Recognition 156, 110754. DOI: 10.1016/j.patcog.2024.110754
  • (2023) Using Multimodal Contrastive Knowledge Distillation for Video-Text Retrieval. IEEE Transactions on Circuits and Systems for Video Technology 33, 10, 5486-5497. DOI: 10.1109/TCSVT.2023.3257193
  • (2023) Progressive Spatio-Temporal Prototype Matching for Text-Video Retrieval. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 4077-4087. DOI: 10.1109/ICCV51070.2023.00379
  • (2023) Multilateral Semantic Relations Modeling for Image Text Retrieval. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2830-2839. DOI: 10.1109/CVPR52729.2023.00277
  • (2023) Quaternion Representation Learning for cross-modal matching. Knowledge-Based Systems 270, C. DOI: 10.1016/j.knosys.2023.110505
  • (2022) Point to Rectangle Matching for Image Text Retrieval. Proceedings of the 30th ACM International Conference on Multimedia, 4977-4986. DOI: 10.1145/3503161.3548237
  • (2022) Cross-Lingual Cross-Modal Retrieval with Noise-Robust Learning. Proceedings of the 30th ACM International Conference on Multimedia, 422-433. DOI: 10.1145/3503161.3548003
  • (2022) Partially Relevant Video Retrieval. Proceedings of the 30th ACM International Conference on Multimedia, 246-257. DOI: 10.1145/3503161.3547976
