Heterogeneous Attention Network for Effective and Efficient Cross-modal Retrieval

Published: 11 July 2021
DOI: 10.1145/3404835.3462924

Abstract

Traditionally, the task of cross-modal retrieval is tackled through joint embedding. However, the global matching used in joint-embedding methods often fails to describe the correspondences between local regions of the image and words in the text, so it may not effectively capture the relevance between the two. In this work, we propose a heterogeneous attention network (HAN) for effective and efficient cross-modal retrieval. The proposed HAN represents an image by a set of bounding-box features and a sentence by a set of word features; the relevance between the image and the sentence is determined by set-to-set matching between the two feature sets. To make the matching more effective, the proposed heterogeneous attention layer provides cross-modal context for both the word features and the bounding-box features. To optimize the metric more effectively, we propose a new soft-max triplet loss, which adaptively gives more weight to harder negatives and thus trains the proposed HAN more effectively than the original triplet loss. The proposed HAN is also efficient: its lightweight architecture needs only a single GPU card for training. Extensive experiments on two public benchmarks demonstrate the effectiveness and efficiency of HAN. This work has been deployed in production in Baidu Search Ads and is part of the "PaddleBox" platform.
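The abstract describes two core components: a heterogeneous attention layer that lets each modality attend over the other, and a set-to-set matching score between word features and bounding-box features. The paper page itself carries no code, so the following is a minimal PyTorch-style sketch written under our own assumptions: the names (HeterogeneousAttention, set_to_set_score), the single-head attention, and the max-over-boxes/mean-over-words matching rule are all hypothetical illustrations, not the authors' released implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class HeterogeneousAttention(nn.Module):
        """Cross-modal attention sketch: features of one modality (e.g., words)
        attend over features of the other (e.g., bounding boxes), so each
        feature is contextualized by the other modality."""
        def __init__(self, dim):
            super().__init__()
            self.q = nn.Linear(dim, dim)
            self.k = nn.Linear(dim, dim)
            self.v = nn.Linear(dim, dim)

        def forward(self, x, ctx):
            # x:   (B, n, d) features of one modality
            # ctx: (B, m, d) features of the other modality
            attn = torch.softmax(
                self.q(x) @ self.k(ctx).transpose(1, 2) / x.size(-1) ** 0.5,
                dim=-1)
            return x + attn @ self.v(ctx)  # residual cross-modal context

    def set_to_set_score(words, boxes):
        # One plausible set-to-set matching rule (an assumption): each word
        # takes its best-matching box, then scores are averaged over words.
        words = F.normalize(words, dim=-1)
        boxes = F.normalize(boxes, dim=-1)
        sim = words @ boxes.transpose(1, 2)          # (B, n_words, n_boxes)
        return sim.max(dim=-1).values.mean(dim=-1)   # (B,)

Likewise, the soft-max triplet loss can be read as replacing the hardest-negative selection (or uniform sum) of the standard triplet loss with a softmax weighting over all negatives, so that harder negatives receive exponentially larger weight. The exact formulation is in the paper; the version below is only a hedged sketch of that reading:

    import torch
    import torch.nn.functional as F

    def softmax_triplet_loss(sim, margin=0.2, tau=1.0):
        # sim: (B, B) similarity matrix for a batch of aligned image-text
        # pairs; diagonal entries are positives, off-diagonal are negatives.
        B = sim.size(0)
        eye = torch.eye(B, dtype=torch.bool, device=sim.device)
        hinge = F.relu(margin + sim - sim.diag().unsqueeze(1))  # (B, B)
        # Softmax over each row's negatives: more-similar (harder) negatives
        # get larger weight; positives are masked out of the weighting.
        w = torch.softmax(sim.masked_fill(eye, float('-inf')) / tau, dim=1)
        return (w * hinge).sum(dim=1).mean()

In use, one would assemble the (B, B) matrix by scoring every sentence in the batch against every image with set_to_set_score and then feed it to softmax_triplet_loss; as the temperature tau grows, the weighting approaches the uniform sum of the original triplet loss.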




    Published In

    SIGIR '21: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval
    July 2021
    2998 pages
    ISBN:9781450380379
    DOI:10.1145/3404835
    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. cross-modal retrieval
    2. heterogeneous attention
    3. image retrieval

    Qualifiers

    • Research-article

    Conference

    SIGIR '21

    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%

    Article Metrics

    • Downloads (last 12 months): 81
    • Downloads (last 6 weeks): 12
    Reflects downloads up to 18 Nov 2024

    Cited By
    • (2024) Multi-Grained Similarity Preserving and Updating for Unsupervised Cross-Modal Hashing. Applied Sciences 14(2), 870. https://doi.org/10.3390/app14020870
    • (2024) Learning hierarchical embedding space for image-text matching. Intelligent Data Analysis 28(3), 647-665. https://doi.org/10.3233/IDA-230214
    • (2024) Scaling Vison-Language Foundation Model to 12 Billion Parameters in Baidu Dynamic Image Advertising. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, 5102-5110. https://doi.org/10.1145/3627673.3680014
    • (2024) Multi-Stage Refined Visual Captioning for Baidu Ad Creatives Generation. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, 4198-4202. https://doi.org/10.1145/3627673.3679969
    • (2024) Enhancing Baidu Multimodal Advertisement with Chinese Text-to-Image Generation via Bilingual Alignment and Caption Synthesis. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2855-2859. https://doi.org/10.1145/3626772.3661350
    • (2024) UGNCL: Uncertainty-Guided Noisy Correspondence Learning for Efficient Cross-Modal Matching. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 852-861. https://doi.org/10.1145/3626772.3657806
    • (2024) Embedded Heterogeneous Attention Transformer for Cross-Lingual Image Captioning. IEEE Transactions on Multimedia 26, 9008-9020. https://doi.org/10.1109/TMM.2024.3384678
    • (2024) Multiscale Salient Alignment Learning for Remote-Sensing Image–Text Retrieval. IEEE Transactions on Geoscience and Remote Sensing 62, 1-13. https://doi.org/10.1109/TGRS.2023.3340870
    • (2024) Similarity Graph-correlation Reconstruction Network for unsupervised cross-modal hashing. Expert Systems with Applications 237(PB). https://doi.org/10.1016/j.eswa.2023.121516
    • (2024) Hugs Bring Double Benefits: Unsupervised Cross-Modal Hashing with Multi-granularity Aligned Transformers. International Journal of Computer Vision 132(8), 2765-2797. https://doi.org/10.1007/s11263-024-02009-7