Heterogeneous Attention Network for Effective and Efficient Cross-modal Retrieval

Published: 11 July 2021
DOI: 10.1145/3404835.3462924

Abstract

Traditionally, the task of cross-modal retrieval is tackled through joint embedding. However, the global matching used in joint-embedding methods often fails to describe the correspondences between local regions of the image and words in the text, so it may not effectively capture the relevance between the two. In this work, we propose a heterogeneous attention network (HAN) for effective and efficient cross-modal retrieval. The proposed HAN represents an image by a set of bounding-box features and a sentence by a set of word features; the relevance between the image and the sentence is determined by set-to-set matching between the two feature sets. To make the matching more effective, the proposed heterogeneous attention layer provides cross-modal context for both the word features and the bounding-box features. To optimize the metric more effectively, we propose a new soft-max triplet loss, which adaptively gives more weight to harder negatives and thus trains the proposed HAN more effectively than the original triplet loss. The proposed HAN is also efficient: its lightweight architecture needs only a single GPU card for training. Extensive experiments on two public benchmarks demonstrate the effectiveness and efficiency of HAN. This work has been deployed in production in Baidu Search Ads and is part of the "PaddleBox" platform.
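The abstract describes two core components: a heterogeneous attention layer that lets each modality attend over the other, and a set-to-set matching score between word features and bounding-box features. The paper page itself carries no code, so the following is a minimal PyTorch-style sketch written under our own assumptions: the names (HeterogeneousAttention, set_to_set_score), the single-head attention, and the max-over-boxes/mean-over-words matching rule are all hypothetical illustrations, not the authors' released implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class HeterogeneousAttention(nn.Module):
        """Cross-modal attention sketch: features of one modality (e.g., words)
        attend over features of the other (e.g., bounding boxes), so each
        feature is contextualized by the other modality."""
        def __init__(self, dim):
            super().__init__()
            self.q = nn.Linear(dim, dim)
            self.k = nn.Linear(dim, dim)
            self.v = nn.Linear(dim, dim)

        def forward(self, x, ctx):
            # x:   (B, n, d) features of one modality
            # ctx: (B, m, d) features of the other modality
            attn = torch.softmax(
                self.q(x) @ self.k(ctx).transpose(1, 2) / x.size(-1) ** 0.5,
                dim=-1)
            return x + attn @ self.v(ctx)  # residual cross-modal context

    def set_to_set_score(words, boxes):
        # One plausible set-to-set matching rule (an assumption): each word
        # takes its best-matching box, then scores are averaged over words.
        words = F.normalize(words, dim=-1)
        boxes = F.normalize(boxes, dim=-1)
        sim = words @ boxes.transpose(1, 2)          # (B, n_words, n_boxes)
        return sim.max(dim=-1).values.mean(dim=-1)   # (B,)

Likewise, the soft-max triplet loss can be read as replacing the hardest-negative selection (or uniform sum) of the standard triplet loss with a softmax weighting over all negatives, so that harder negatives receive exponentially larger weight. The exact formulation is in the paper; the version below is only a hedged sketch of that reading:

    import torch
    import torch.nn.functional as F

    def softmax_triplet_loss(sim, margin=0.2, tau=1.0):
        # sim: (B, B) similarity matrix for a batch of aligned image-text
        # pairs; diagonal entries are positives, off-diagonal are negatives.
        B = sim.size(0)
        eye = torch.eye(B, dtype=torch.bool, device=sim.device)
        hinge = F.relu(margin + sim - sim.diag().unsqueeze(1))  # (B, B)
        # Softmax over each row's negatives: more-similar (harder) negatives
        # get larger weight; positives are masked out of the weighting.
        w = torch.softmax(sim.masked_fill(eye, float('-inf')) / tau, dim=1)
        return (w * hinge).sum(dim=1).mean()

In use, one would assemble the (B, B) matrix by scoring every sentence in the batch against every image with set_to_set_score and then feed it to softmax_triplet_loss; as the temperature tau grows, the weighting approaches the uniform sum of the original triplet loss.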




    Published In

    SIGIR '21: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval
    July 2021
    2998 pages
    ISBN:9781450380379
    DOI:10.1145/3404835
    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. cross-modal retrieval
    2. heterogeneous attention
    3. image retrieval

    Qualifiers

    • Research-article

    Conference

    SIGIR '21

    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%

    Article Metrics

    • Downloads (last 12 months): 81
    • Downloads (last 6 weeks): 12
    Reflects downloads up to 18 Nov 2024

    Cited By
    • (2024) Multi-Grained Similarity Preserving and Updating for Unsupervised Cross-Modal Hashing. Applied Sciences 14(2), 870. https://doi.org/10.3390/app14020870
    • (2024) Learning hierarchical embedding space for image-text matching. Intelligent Data Analysis 28(3), 647-665. https://doi.org/10.3233/IDA-230214
    • (2024) Scaling Vison-Language Foundation Model to 12 Billion Parameters in Baidu Dynamic Image Advertising. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, 5102-5110. https://doi.org/10.1145/3627673.3680014
    • (2024) Multi-Stage Refined Visual Captioning for Baidu Ad Creatives Generation. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, 4198-4202. https://doi.org/10.1145/3627673.3679969
    • (2024) Enhancing Baidu Multimodal Advertisement with Chinese Text-to-Image Generation via Bilingual Alignment and Caption Synthesis. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2855-2859. https://doi.org/10.1145/3626772.3661350
    • (2024) UGNCL: Uncertainty-Guided Noisy Correspondence Learning for Efficient Cross-Modal Matching. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 852-861. https://doi.org/10.1145/3626772.3657806
    • (2024) Embedded Heterogeneous Attention Transformer for Cross-Lingual Image Captioning. IEEE Transactions on Multimedia 26, 9008-9020. https://doi.org/10.1109/TMM.2024.3384678
    • (2024) Multiscale Salient Alignment Learning for Remote-Sensing Image–Text Retrieval. IEEE Transactions on Geoscience and Remote Sensing 62, 1-13. https://doi.org/10.1109/TGRS.2023.3340870
    • (2024) Similarity Graph-correlation Reconstruction Network for unsupervised cross-modal hashing. Expert Systems with Applications 237(PB). https://doi.org/10.1016/j.eswa.2023.121516
    • (2024) Hugs Bring Double Benefits: Unsupervised Cross-Modal Hashing with Multi-granularity Aligned Transformers. International Journal of Computer Vision 132(8), 2765-2797. https://doi.org/10.1007/s11263-024-02009-7