Transferring Image-CLIP to Video-Text Retrieval via Temporal Relations

Published: 07 December 2022

Abstract

We present a novel network that transfers an image-language pre-trained model to video-text retrieval in an end-to-end manner. Leading approaches in video-and-language learning try to distill spatio-temporal video features and multi-modal interactions between videos and language from large-scale video-text datasets. In contrast, we leverage a pre-trained image-language model and simplify the problem into a two-stage framework: co-learning of images and text, followed by enhancing the temporal relations between video frames and between video and text. Specifically, building on the spatial semantics captured by the Contrastive Language-Image Pre-training (CLIP) model, our model introduces a Temporal Difference Block (TDB) to capture fine-grained motion between video frames, and a Temporal Alignment Block (TAB) to re-align the tokens of video clips and phrases and enhance the cross-modal correlation. These two temporal blocks realize video-language learning efficiently and enable the proposed model to scale well on comparatively small datasets. We conduct extensive experiments, including ablation studies and comparisons with existing state-of-the-art methods, and our approach outperforms them on widely used text-to-video and video-to-text retrieval benchmarks, including MSR-VTT, MSVD, LSMDC, and VATEX.
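
The abstract describes the two temporal blocks only at a high level. The PyTorch sketch below is a minimal illustration of the general idea under stated assumptions, not the authors' implementation: it assumes per-frame embeddings from a frozen image-CLIP encoder and caption token embeddings are already available, and every module name, layer size, and pooling choice, as well as the cosine-similarity readout, is a hypothetical stand-in.

# Minimal PyTorch sketch of the two ideas named in the abstract -- NOT the
# authors' released code. Frame features are assumed to come from a frozen
# image-CLIP encoder; all layer sizes and pooling choices are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalDifferenceBlock(nn.Module):
    """Mixes each frame feature with the difference to its previous frame,
    a simple way to inject short-term motion cues into per-frame embeddings."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, dim) per-frame CLIP embeddings
        diff = frames[:, 1:] - frames[:, :-1]        # (B, T-1, D) frame-to-frame deltas
        diff = F.pad(diff, (0, 0, 1, 0))             # pad the first frame so shapes match
        return self.norm(frames + self.proj(diff))   # residual fusion of motion cues


class TemporalAlignmentBlock(nn.Module):
    """Cross-attends caption tokens over frame tokens so the pooled text
    representation is re-aligned to the video it is compared against."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens: torch.Tensor, frame_tokens: torch.Tensor) -> torch.Tensor:
        # text_tokens: (N, L, D), frame_tokens: (N, T, D)
        aligned, _ = self.attn(query=text_tokens, key=frame_tokens, value=frame_tokens)
        return self.norm(text_tokens + aligned)


def similarity_matrix(frame_feats, text_feats, tdb, tab):
    """Cosine similarities between every caption and every video: (B_text, B_video)."""
    frames = tdb(frame_feats)                        # motion-enhanced frame tokens
    bt, bv = text_feats.size(0), frames.size(0)
    # Align every caption with every candidate video (fine for a sketch; a real
    # system would batch this more carefully).
    txt = text_feats.unsqueeze(1).expand(-1, bv, -1, -1).reshape(bt * bv, *text_feats.shape[1:])
    vid = frames.unsqueeze(0).expand(bt, -1, -1, -1).reshape(bt * bv, *frames.shape[1:])
    text_emb = tab(txt, vid).mean(dim=1)             # (bt*bv, D) video-aware text embedding
    video_emb = vid.mean(dim=1)                      # (bt*bv, D) mean-pooled video embedding
    return F.cosine_similarity(text_emb, video_emb, dim=-1).view(bt, bv)


if __name__ == "__main__":
    B, T, L, D = 4, 12, 16, 512                      # toy batch of clip/caption pairs
    sims = similarity_matrix(torch.randn(B, T, D), torch.randn(B, L, D),
                             TemporalDifferenceBlock(D), TemporalAlignmentBlock(D))
    print(sims.shape)                                # torch.Size([4, 4])

A retrieval head of this kind is typically trained with a symmetric contrastive loss over the resulting text-video similarity matrix; the sketch covers only the forward pass.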

Published In

IEEE Transactions on Multimedia, Volume 25, 2023, 8932 pages

Publisher

IEEE Press

Publication History

Published: 07 December 2022

Qualifiers

• Research-article

Cited By

• (2024) Multi Modal Fusion for Video Retrieval based on CLIP Guide Feature Alignment. Proceedings of the 2024 ACM ICMR Workshop on Multimodal Video Retrieval, pp. 45-50. DOI: 10.1145/3664524.3675369. Online publication date: 10-Jun-2024.
• (2024) Estimating the Semantics via Sector Embedding for Image-Text Retrieval. IEEE Transactions on Multimedia, vol. 26, pp. 10342-10353. DOI: 10.1109/TMM.2024.3407664. Online publication date: 1-Jan-2024.
• (2024) MAC: Masked Contrastive Pre-Training for Efficient Video-Text Retrieval. IEEE Transactions on Multimedia, vol. 26, pp. 9962-9972. DOI: 10.1109/TMM.2024.3402613. Online publication date: 1-Jan-2024.
• (2024) Two-Step Discrete Hashing for Cross-Modal Retrieval. IEEE Transactions on Multimedia, vol. 26, pp. 8730-8741. DOI: 10.1109/TMM.2024.3381828. Online publication date: 1-Apr-2024.
• (2024) Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model. IEEE Transactions on Multimedia, vol. 26, pp. 2056-2068. DOI: 10.1109/TMM.2023.3291588. Online publication date: 1-Jan-2024.
• (2024) Vision-Language Models Can Identify Distracted Driver Behavior From Naturalistic Videos. IEEE Transactions on Intelligent Transportation Systems, vol. 25, no. 9, pp. 11602-11616. DOI: 10.1109/TITS.2024.3381175. Online publication date: 4-Apr-2024.
• (2023) Mask to Reconstruct: Cooperative Semantics Completion for Video-text Retrieval. Proceedings of the 31st ACM International Conference on Multimedia, pp. 3847-3856. DOI: 10.1145/3581783.3611756. Online publication date: 26-Oct-2023.
