DOI: 10.1145/3581783.3612006

Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning

Published: 27 October 2023

Abstract

In recent years, the explosion of web videos has made text-video retrieval increasingly essential for video filtering, recommendation, and search. Text-video retrieval aims to rank relevant texts/videos higher than irrelevant ones, so the core of the task is to precisely measure the cross-modal similarity between texts and videos. Recently, contrastive learning methods have shown promising results for text-video retrieval, most of which focus on constructing positive and negative pairs to learn text and video representations. Nevertheless, they pay insufficient attention to hard negative pairs and lack the ability to model different levels of semantic similarity. To address these two issues, this paper improves contrastive learning with two novel techniques. First, to exploit hard examples for robust discriminative power, we propose a Dual-Modal Attention-Enhanced Module (DMAE) that mines hard negative pairs from textual and visual clues. By further introducing a Negative-aware InfoNCE (NegNCE) loss, we adaptively identify these hard negatives and explicitly weight their impact in the training loss. Second, we argue that triplet samples model fine-grained semantic similarity better than pairwise samples. We therefore present a new Triplet Partial Margin Contrastive Learning (TPM-CL) module that constructs partial-order triplet samples by automatically generating fine-grained hard negatives for matched text-video pairs. TPM-CL adopts an adaptive token-masking strategy with cross-modal interaction to model subtle semantic differences. Extensive experiments demonstrate that the proposed approach outperforms existing methods on four widely used text-video retrieval datasets: MSR-VTT, MSVD, DiDeMo, and ActivityNet.
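To make the two loss designs concrete, the sketch below illustrates, under stated assumptions, how a negative-aware InfoNCE term and a triplet margin term of the kind described above can be written in PyTorch. The function names, the fixed temperature and margin values, and the rule used to flag hard negatives (off-diagonal pairs whose similarity exceeds the matched pair's) are illustrative assumptions, not the paper's implementation: in the paper, DMAE mines hard negatives with dual-modal attention, TPM-CL generates them via adaptive token masking, and the margins are adaptive rather than fixed.

```python
# Illustrative sketch only: hypothetical names and fixed hyperparameters,
# standing in for the paper's DMAE-mined hard negatives and adaptive margins.
import torch
import torch.nn.functional as F


def neg_nce_loss(text_emb, video_emb, temperature=0.05, neg_weight=0.5):
    """Symmetric InfoNCE over a batch of matched text-video pairs, plus an
    explicit penalty on hard negatives (here: off-diagonal pairs whose
    similarity exceeds that of the row's matched pair)."""
    t = F.normalize(text_emb, dim=-1)    # (B, D) text embeddings
    v = F.normalize(video_emb, dim=-1)   # (B, D) video embeddings
    sim = t @ v.t() / temperature        # (B, B); diagonal holds positives

    labels = torch.arange(sim.size(0), device=sim.device)
    # Standard bidirectional InfoNCE (text-to-video and video-to-text).
    info_nce = 0.5 * (F.cross_entropy(sim, labels) +
                      F.cross_entropy(sim.t(), labels))

    # Explicitly highlight hard negatives: off-diagonal entries that score
    # above the positive contribute an extra hinge penalty.
    pos = sim.diagonal().unsqueeze(1)    # (B, 1)
    off_diag = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    hard_penalty = F.relu(sim - pos)[off_diag].mean()

    return info_nce + neg_weight * hard_penalty


def triplet_partial_margin_loss(text_emb, video_pos, video_hard_neg, margin=0.2):
    """Triplet margin loss over (anchor text, matched video, generated hard
    negative): the positive must outscore the hard negative by `margin`.
    video_hard_neg stands in for TPM-CL's token-masked fine-grained negative."""
    a = F.normalize(text_emb, dim=-1)
    p = F.normalize(video_pos, dim=-1)
    n = F.normalize(video_hard_neg, dim=-1)
    pos_sim = (a * p).sum(dim=-1)        # cosine similarity to positive
    neg_sim = (a * n).sum(dim=-1)        # cosine similarity to hard negative
    return F.relu(neg_sim - pos_sim + margin).mean()
```

A training step would plausibly sum the two terms, e.g. `loss = neg_nce_loss(t, v) + lam * triplet_partial_margin_loss(t, v, v_masked)`, where `lam` and the construction of `v_masked` are placeholders; the abstract does not specify the actual weighting or the exact negative-generation procedure.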


Cited By

  • (2024) GDPR-compliant Video Search and Retrieval System for Surveillance Data. In Proceedings of the 19th International Conference on Availability, Reliability and Security, 1-6. DOI: 10.1145/3664476.3670472. Online publication date: 30-Jul-2024.
  • (2024) M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2156-2166. DOI: 10.1145/3626772.3657833. Online publication date: 10-Jul-2024.
  • (2024) Video Frame-wise Explanation Driven Contrastive Learning for Procedural Text Generation. Computer Vision and Image Understanding, Vol. 241, C. DOI: 10.1016/j.cviu.2024.103954. Online publication date: 2-Jul-2024.



    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN:9798400701085
    DOI:10.1145/3581783

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 27 October 2023

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. dual-modal attention-enhanced
    2. negative-aware InfoNCE
    3. text-video retrieval
    4. triplet partial margin contrastive learning

    Qualifiers

    • Research-article

    Conference

    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa, ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 995 of 4,171 submissions, 24%



    Article Metrics

    • Downloads (last 12 months): 201
    • Downloads (last 6 weeks): 17

    Reflects downloads up to 22 Sep 2024.
