Transferring Image-CLIP to Video-Text Retrieval via Temporal Relations

Published: 07 December 2022

Abstract

We present a novel network that transfers an image-language pre-trained model to video-text retrieval in an end-to-end manner. Leading approaches in video-and-language learning try to distill spatio-temporal video features and multi-modal interactions between videos and language from large-scale video-text datasets. In contrast, we leverage a pre-trained image-language model and simplify the problem into a two-stage framework: co-learning of images and text, followed by enhancing the temporal relations between video frames and between video and text. Specifically, building on the spatial semantics captured by the Contrastive Language-Image Pre-training (CLIP) model, our model introduces a Temporal Difference Block (TDB) to capture fine-grained motion between video frames, and a Temporal Alignment Block (TAB) to re-align the tokens of video clips and phrases and enhance the cross-modal correlation. These two temporal blocks realize video-language learning efficiently and enable the proposed model to scale well on comparatively small datasets. We conduct extensive experiments, including ablation studies and comparisons with existing state-of-the-art methods, and our approach outperforms them on widely used text-to-video and video-to-text retrieval benchmarks, including MSR-VTT, MSVD, LSMDC, and VATEX.
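
The abstract describes the two temporal blocks only at a high level. The PyTorch sketch below is a minimal illustration of the general idea under stated assumptions, not the authors' implementation: it assumes per-frame embeddings from a frozen image-CLIP encoder and caption token embeddings are already available, and every module name, layer size, and pooling choice, as well as the cosine-similarity readout, is a hypothetical stand-in.

# Minimal PyTorch sketch of the two ideas named in the abstract -- NOT the
# authors' released code. Frame features are assumed to come from a frozen
# image-CLIP encoder; all layer sizes and pooling choices are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalDifferenceBlock(nn.Module):
    """Mixes each frame feature with the difference to its previous frame,
    a simple way to inject short-term motion cues into per-frame embeddings."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, dim) per-frame CLIP embeddings
        diff = frames[:, 1:] - frames[:, :-1]        # (B, T-1, D) frame-to-frame deltas
        diff = F.pad(diff, (0, 0, 1, 0))             # pad the first frame so shapes match
        return self.norm(frames + self.proj(diff))   # residual fusion of motion cues


class TemporalAlignmentBlock(nn.Module):
    """Cross-attends caption tokens over frame tokens so the pooled text
    representation is re-aligned to the video it is compared against."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens: torch.Tensor, frame_tokens: torch.Tensor) -> torch.Tensor:
        # text_tokens: (N, L, D), frame_tokens: (N, T, D)
        aligned, _ = self.attn(query=text_tokens, key=frame_tokens, value=frame_tokens)
        return self.norm(text_tokens + aligned)


def similarity_matrix(frame_feats, text_feats, tdb, tab):
    """Cosine similarities between every caption and every video: (B_text, B_video)."""
    frames = tdb(frame_feats)                        # motion-enhanced frame tokens
    bt, bv = text_feats.size(0), frames.size(0)
    # Align every caption with every candidate video (fine for a sketch; a real
    # system would batch this more carefully).
    txt = text_feats.unsqueeze(1).expand(-1, bv, -1, -1).reshape(bt * bv, *text_feats.shape[1:])
    vid = frames.unsqueeze(0).expand(bt, -1, -1, -1).reshape(bt * bv, *frames.shape[1:])
    text_emb = tab(txt, vid).mean(dim=1)             # (bt*bv, D) video-aware text embedding
    video_emb = vid.mean(dim=1)                      # (bt*bv, D) mean-pooled video embedding
    return F.cosine_similarity(text_emb, video_emb, dim=-1).view(bt, bv)


if __name__ == "__main__":
    B, T, L, D = 4, 12, 16, 512                      # toy batch of clip/caption pairs
    sims = similarity_matrix(torch.randn(B, T, D), torch.randn(B, L, D),
                             TemporalDifferenceBlock(D), TemporalAlignmentBlock(D))
    print(sims.shape)                                # torch.Size([4, 4])

A retrieval head of this kind is typically trained with a symmetric contrastive loss over the resulting text-video similarity matrix; the sketch covers only the forward pass.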

Published In

IEEE Transactions on Multimedia, Volume 25, 2023, 8932 pages

Publisher

IEEE Press

Publication History

Published: 07 December 2022

Qualifiers

• Research-article

Cited By

• (2024) Multi Modal Fusion for Video Retrieval based on CLIP Guide Feature Alignment. Proceedings of the 2024 ACM ICMR Workshop on Multimodal Video Retrieval, pp. 45-50. DOI: 10.1145/3664524.3675369. Online publication date: 10-Jun-2024.
• (2024) Estimating the Semantics via Sector Embedding for Image-Text Retrieval. IEEE Transactions on Multimedia, vol. 26, pp. 10342-10353. DOI: 10.1109/TMM.2024.3407664. Online publication date: 1-Jan-2024.
• (2024) MAC: Masked Contrastive Pre-Training for Efficient Video-Text Retrieval. IEEE Transactions on Multimedia, vol. 26, pp. 9962-9972. DOI: 10.1109/TMM.2024.3402613. Online publication date: 1-Jan-2024.
• (2024) Two-Step Discrete Hashing for Cross-Modal Retrieval. IEEE Transactions on Multimedia, vol. 26, pp. 8730-8741. DOI: 10.1109/TMM.2024.3381828. Online publication date: 1-Apr-2024.
• (2024) Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model. IEEE Transactions on Multimedia, vol. 26, pp. 2056-2068. DOI: 10.1109/TMM.2023.3291588. Online publication date: 1-Jan-2024.
• (2024) Vision-Language Models Can Identify Distracted Driver Behavior From Naturalistic Videos. IEEE Transactions on Intelligent Transportation Systems, vol. 25, no. 9, pp. 11602-11616. DOI: 10.1109/TITS.2024.3381175. Online publication date: 4-Apr-2024.
• (2023) Mask to Reconstruct: Cooperative Semantics Completion for Video-text Retrieval. Proceedings of the 31st ACM International Conference on Multimedia, pp. 3847-3856. DOI: 10.1145/3581783.3611756. Online publication date: 26-Oct-2023.
