DOI: 10.1145/3581783.3612006

Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning

Published: 27 October 2023

Abstract

In recent years, the explosion of web videos has made text-video retrieval increasingly essential for video filtering, recommendation, and search. Text-video retrieval aims to rank relevant texts/videos higher than irrelevant ones, so the core of the task is to precisely measure the cross-modal similarity between texts and videos. Recently, contrastive learning methods have shown promising results for text-video retrieval, most of which focus on constructing positive and negative pairs to learn text and video representations. Nevertheless, they pay insufficient attention to hard negative pairs and lack the ability to model different levels of semantic similarity. To address these two issues, this paper improves contrastive learning with two novel techniques. First, to exploit hard examples for robust discriminative power, we propose a Dual-Modal Attention-Enhanced Module (DMAE) that mines hard negative pairs from textual and visual clues. By further introducing a Negative-aware InfoNCE (NegNCE) loss, we adaptively identify these hard negatives and explicitly weight their impact in the training loss. Second, we argue that triplet samples model fine-grained semantic similarity better than pairwise samples. We therefore present a new Triplet Partial Margin Contrastive Learning (TPM-CL) module that constructs partial-order triplet samples by automatically generating fine-grained hard negatives for matched text-video pairs. TPM-CL adopts an adaptive token-masking strategy with cross-modal interaction to model subtle semantic differences. Extensive experiments demonstrate that the proposed approach outperforms existing methods on four widely used text-video retrieval datasets: MSR-VTT, MSVD, DiDeMo, and ActivityNet.
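To make the two loss designs concrete, the sketch below illustrates, under stated assumptions, how a negative-aware InfoNCE term and a triplet margin term of the kind described above can be written in PyTorch. The function names, the fixed temperature and margin values, and the rule used to flag hard negatives (off-diagonal pairs whose similarity exceeds the matched pair's) are illustrative assumptions, not the paper's implementation: in the paper, DMAE mines hard negatives with dual-modal attention, TPM-CL generates them via adaptive token masking, and the margins are adaptive rather than fixed.

```python
# Illustrative sketch only: hypothetical names and fixed hyperparameters,
# standing in for the paper's DMAE-mined hard negatives and adaptive margins.
import torch
import torch.nn.functional as F


def neg_nce_loss(text_emb, video_emb, temperature=0.05, neg_weight=0.5):
    """Symmetric InfoNCE over a batch of matched text-video pairs, plus an
    explicit penalty on hard negatives (here: off-diagonal pairs whose
    similarity exceeds that of the row's matched pair)."""
    t = F.normalize(text_emb, dim=-1)    # (B, D) text embeddings
    v = F.normalize(video_emb, dim=-1)   # (B, D) video embeddings
    sim = t @ v.t() / temperature        # (B, B); diagonal holds positives

    labels = torch.arange(sim.size(0), device=sim.device)
    # Standard bidirectional InfoNCE (text-to-video and video-to-text).
    info_nce = 0.5 * (F.cross_entropy(sim, labels) +
                      F.cross_entropy(sim.t(), labels))

    # Explicitly highlight hard negatives: off-diagonal entries that score
    # above the positive contribute an extra hinge penalty.
    pos = sim.diagonal().unsqueeze(1)    # (B, 1)
    off_diag = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    hard_penalty = F.relu(sim - pos)[off_diag].mean()

    return info_nce + neg_weight * hard_penalty


def triplet_partial_margin_loss(text_emb, video_pos, video_hard_neg, margin=0.2):
    """Triplet margin loss over (anchor text, matched video, generated hard
    negative): the positive must outscore the hard negative by `margin`.
    video_hard_neg stands in for TPM-CL's token-masked fine-grained negative."""
    a = F.normalize(text_emb, dim=-1)
    p = F.normalize(video_pos, dim=-1)
    n = F.normalize(video_hard_neg, dim=-1)
    pos_sim = (a * p).sum(dim=-1)        # cosine similarity to positive
    neg_sim = (a * n).sum(dim=-1)        # cosine similarity to hard negative
    return F.relu(neg_sim - pos_sim + margin).mean()
```

A training step would plausibly sum the two terms, e.g. `loss = neg_nce_loss(t, v) + lam * triplet_partial_margin_loss(t, v, v_masked)`, where `lam` and the construction of `v_masked` are placeholders; the abstract does not specify the actual weighting or the exact negative-generation procedure.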


Cited By

  • (2024) GDPR-compliant Video Search and Retrieval System for Surveillance Data. In Proceedings of the 19th International Conference on Availability, Reliability and Security, 1-6. DOI: 10.1145/3664476.3670472. Online publication date: 30-Jul-2024.
  • (2024) M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2156-2166. DOI: 10.1145/3626772.3657833. Online publication date: 10-Jul-2024.
  • (2024) Video Frame-wise Explanation Driven Contrastive Learning for Procedural Text Generation. Computer Vision and Image Understanding, Vol. 241, C. DOI: 10.1016/j.cviu.2024.103954. Online publication date: 2-Jul-2024.



    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN:9798400701085
    DOI:10.1145/3581783

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 27 October 2023

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. dual-modal attention-enhanced
    2. negative-aware InfoNCE
    3. text-video retrieval
    4. triplet partial margin contrastive learning

    Qualifiers

    • Research-article

    Conference

    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa, ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 995 of 4,171 submissions, 24%



    Article Metrics

    • Downloads (last 12 months): 201
    • Downloads (last 6 weeks): 17

    Reflects downloads up to 22 Sep 2024.
