
BiC-Net: Learning Efficient Spatio-temporal Relation for Text-Video Retrieval

Published: 09 December 2023

Abstract

The task of text-video retrieval aims to understand the correspondence between language and vision and has gained increasing attention in recent years. Recent works have demonstrated the superiority of local spatio-temporal relation learning with graph-based models. However, most existing graph-based models are handcrafted and depend heavily on expert knowledge and empirical feedback, which may make them unable to mine high-level fine-grained visual relations effectively. These limitations leave them unable to distinguish videos that share the same visual components but differ in their relations. To solve this problem, we propose a novel cross-modal retrieval framework, the Bi-Branch Complementary Network (BiC-Net), which modifies the Transformer architecture to effectively bridge the text and video modalities in a complementary manner by combining local spatio-temporal relations with global temporal information. Specifically, local video representations are encoded using multiple Transformer blocks and additional residual blocks to learn fine-grained spatio-temporal relations and long-term temporal dependencies; we call this module the Fine-grained Spatio-temporal Transformer (FST). Global video representations are encoded using a multi-layer Transformer block to learn global temporal features. Finally, we align the spatio-temporal relation features and the global temporal features with the text features in two embedding spaces for cross-modal text-video retrieval. Extensive experiments are conducted on the MSR-VTT, MSVD, and YouCook2 datasets, and the results demonstrate the effectiveness of our proposed model. Our code is publicly available at https://github.com/lionel-hing/BiC-Net.
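The two-space alignment described above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the cosine-similarity scoring, the branch-combination weight `alpha`, and the function names are assumptions introduced here for clarity.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Project vectors onto the unit sphere so that dot products
    # between rows equal cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def bi_branch_scores(text_local, text_global, vid_local, vid_global, alpha=0.5):
    """Combine retrieval scores from two embedding spaces.

    text_local / vid_local   : (n_text, d) and (n_vid, d) embeddings in the
                               local spatio-temporal relation space.
    text_global / vid_global : embeddings in the global temporal space.
    alpha                    : illustrative weight balancing the two branches;
                               0.5 treats them as equally important.
    Returns an (n_text, n_vid) matrix of combined similarity scores.
    """
    s_local = l2_normalize(text_local) @ l2_normalize(vid_local).T
    s_global = l2_normalize(text_global) @ l2_normalize(vid_global).T
    return alpha * s_local + (1.0 - alpha) * s_global

# Toy example: 3 text queries against 4 videos in 8-dimensional spaces.
rng = np.random.default_rng(0)
scores = bi_branch_scores(rng.normal(size=(3, 8)), rng.normal(size=(3, 8)),
                          rng.normal(size=(4, 8)), rng.normal(size=(4, 8)))
ranked = np.argsort(-scores, axis=1)  # per-query ranking of videos, best first
```

Because each branch score is a cosine similarity, the combined score is a convex combination in [-1, 1]; at retrieval time, videos are simply ranked by this combined score for each text query.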


Cited By

  • (2024) Efficient Cross-Modal Video Retrieval With Meta-Optimized Frames. IEEE Transactions on Multimedia 26, 10924–10936. DOI: 10.1109/TMM.2024.3416669.
  • (2024) Quantum image encryption algorithm via optimized quantum circuit and parity bit-plane permutation. Journal of Information Security and Applications 81:C. DOI: 10.1016/j.jisa.2024.103698. Online publication date: 25-Jun-2024.
  • (2024) HMTV: hierarchical multimodal transformer for video highlight query on baseball. Multimedia Systems 30:5. DOI: 10.1007/s00530-024-01479-6. Online publication date: 23-Sep-2024.
  • (2023) Progressive Spatio-Temporal Prototype Matching for Text-Video Retrieval. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 4077–4087. DOI: 10.1109/ICCV51070.2023.00379.


    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications  Volume 20, Issue 3
    March 2024
    665 pages
    EISSN:1551-6865
    DOI:10.1145/3613614
Editor: Abdulmotaleb El Saddik

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 09 December 2023
    Online AM: 13 October 2023
    Accepted: 24 September 2023
    Revised: 19 August 2023
    Received: 11 September 2022
    Published in TOMM Volume 20, Issue 3


    Author Tags

    1. Text-video retrieval
    2. spatio-temporal relation
    3. bi-branch complementary network

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
    • Shanghai Pujiang Program

