
DOI: 10.1145/3474085.3475621

Progressive Semantic Matching for Video-Text Retrieval

Published: 17 October 2021

Abstract

Cross-modal retrieval between texts and videos is important yet challenging. Previous works in this domain typically learn a common space in which to match texts and videos, but matching is difficult because of the semantic gap between the two modalities. Although some methods employ coarse-to-fine or multi-expert networks to encode one or more common spaces for easier matching, they still directly optimize a single matching space, which remains hard given the large semantic gap between modalities. To address this issue, we narrow the semantic gap through a progressive learning process built on a coarse-to-fine architecture, and propose a novel Progressive Semantic Matching (PSM) method. We first construct a multilevel encoding network for videos and texts, and design auxiliary common spaces mapped from the outputs of the encoders at different levels. All common spaces are then jointly trained end to end. In this way, the model effectively encodes videos and texts into a fused common space via a progressive paradigm. Experimental results on three video-text datasets (i.e., MSR-VTT, TGIF and MSVD) demonstrate the advantages of our PSM, which achieves significant performance improvements over state-of-the-art approaches.
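The training scheme the abstract describes — per-level auxiliary common spaces that are all optimized jointly — can be sketched roughly as follows. This is an illustrative toy, not the paper's implementation: the random linear maps stand in for the multilevel video/text encoders, the cosine score stands in for the learned similarity (the paper uses a ranking loss over each space), and all dimensions are made-up.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_levels(x, weights):
    """Return one feature per encoding level (stand-ins for the paper's
    coarse-to-fine encoders); each level refines the previous one."""
    feats, h = [], x
    for W in weights:
        h = np.tanh(h @ W)
        feats.append(h)
    return feats

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

# Hypothetical dimensions: raw features -> three encoder levels -> common dim.
d_in, d_hid, d_common, levels = 32, 24, 16, 3
Wv = [rng.standard_normal((d_in if i == 0 else d_hid, d_hid)) * 0.1
      for i in range(levels)]
Wt = [rng.standard_normal((d_in if i == 0 else d_hid, d_hid)) * 0.1
      for i in range(levels)]
# One projection per level: the shallower spaces act as auxiliary,
# easier-to-match spaces; the deepest is the final matching space.
P = [rng.standard_normal((d_hid, d_common)) * 0.1 for _ in range(levels)]

video, text = rng.standard_normal(d_in), rng.standard_normal(d_in)
v_feats = encode_levels(video, Wv)
t_feats = encode_levels(text, Wt)

# Joint objective: accumulate the matching score (a triplet-ranking loss,
# in the paper) over every common space, so the coarse spaces guide
# optimization of the fine one instead of training it in isolation.
scores = [cosine(v @ p, t @ p) for v, t, p in zip(v_feats, t_feats, P)]
print("similarity per common space:", [round(s, 3) for s in scores])
```

At training time one would backpropagate the summed per-space losses through all encoder levels at once, which is the "jointly trained end to end" step in the abstract.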

Supplementary Material

MP4 File (MM21-fp2483.mp4)
Presentation Video for paper: Progressive Semantic Matching for Video-Text Retrieval.





Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021
5796 pages
ISBN: 9781450386517
DOI: 10.1145/3474085
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. cross-modal matching
  2. dual encoding
  3. representation learning
  4. video-text retrieval

Qualifiers

  • Research-article

Conference

MM '21: ACM Multimedia Conference
October 20-24, 2021
Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


Cited By

  • (2024) Estimating the Semantics via Sector Embedding for Image-Text Retrieval. IEEE Transactions on Multimedia 26, 10342-10353. DOI: 10.1109/TMM.2024.3407664
  • (2024) Cross-Lingual Cross-Modal Retrieval With Noise-Robust Fine-Tuning. IEEE Transactions on Knowledge and Data Engineering 36, 11, 5860-5873. DOI: 10.1109/TKDE.2024.3400060
  • (2024) Text-guided distillation learning to diversify video embeddings for text-video retrieval. Pattern Recognition 156, 110754. DOI: 10.1016/j.patcog.2024.110754
  • (2023) Using Multimodal Contrastive Knowledge Distillation for Video-Text Retrieval. IEEE Transactions on Circuits and Systems for Video Technology 33, 10, 5486-5497. DOI: 10.1109/TCSVT.2023.3257193
  • (2023) Progressive Spatio-Temporal Prototype Matching for Text-Video Retrieval. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 4077-4087. DOI: 10.1109/ICCV51070.2023.00379
  • (2023) Multilateral Semantic Relations Modeling for Image Text Retrieval. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2830-2839. DOI: 10.1109/CVPR52729.2023.00277
  • (2023) Quaternion Representation Learning for cross-modal matching. Knowledge-Based Systems 270, C. DOI: 10.1016/j.knosys.2023.110505
  • (2022) Point to Rectangle Matching for Image Text Retrieval. Proceedings of the 30th ACM International Conference on Multimedia, 4977-4986. DOI: 10.1145/3503161.3548237
  • (2022) Cross-Lingual Cross-Modal Retrieval with Noise-Robust Learning. Proceedings of the 30th ACM International Conference on Multimedia, 422-433. DOI: 10.1145/3503161.3548003
  • (2022) Partially Relevant Video Retrieval. Proceedings of the 30th ACM International Conference on Multimedia, 246-257. DOI: 10.1145/3503161.3547976
