

Cross-Modal Attention Preservation with Self-Contrastive Learning for Composed Query-Based Image Retrieval

Published: 08 March 2024

Abstract

In this article, we study a challenging cross-modal image retrieval task, Composed Query-Based Image Retrieval (CQBIR), in which the query is not a single text query but a composed one, i.e., a reference image and a modification text. Compared with the conventional cross-modal image-text retrieval task, CQBIR is more challenging, as it requires properly preserving and modifying specific image regions according to the multi-level semantic information learned from the multi-modal query. Most recent works focus on extracting preserved and modified information and compositing it into a unified representation. However, we observe that the preserved regions learned by existing methods contain redundant modified information, which inevitably degrades overall retrieval performance. To this end, we propose a novel method termed Cross-Modal Attention Preservation (CMAP). Specifically, we first leverage cross-level interaction to fully account for multi-granular semantic information, aiming to supplement high-level semantics for effective image retrieval. Furthermore, unlike conventional contrastive learning, our method introduces self-contrastive learning into the learning of preserved information, preventing the model from confusing the attention for the preserved part with that for the modified part. Extensive experiments on three widely used CQBIR datasets, i.e., FashionIQ, Shoes, and Fashion200k, demonstrate that our proposed CMAP method significantly outperforms current state-of-the-art methods on all three datasets. The implementation code of our CMAP method is available at https://github.com/CFM-MSG/Code_CMAP.
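The self-contrastive idea described in the abstract — pulling the preserved representation toward its retrieval target while pushing it away from the modified representation of the same query — can be sketched as an InfoNCE-style loss. This is a minimal illustrative sketch, not the paper's actual formulation: the function names, the temperature value, and the exact choice of positives and negatives are all assumptions.

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.07):
    """Generic InfoNCE loss for a single anchor vector.

    anchor, positive: (d,) arrays; negatives: (k, d) array.
    Returns -log( exp(sim+/tau) / (exp(sim+/tau) + sum exp(sim-/tau)) ).
    """
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    pos = np.exp(cos(anchor, positive) / tau)
    neg = sum(np.exp(cos(anchor, n) / tau) for n in negatives)
    return float(-np.log(pos / (pos + neg)))

def self_contrastive_loss(preserved, target, modified, tau=0.07):
    """Per-batch sketch of a self-contrastive objective (hypothetical layout).

    For each sample i, the preserved embedding preserved[i] treats its own
    target-image embedding target[i] as the positive, while the same query's
    modified embedding modified[i] (plus other samples' targets) act as
    negatives, discouraging the preserved attention from absorbing
    modification cues.
    """
    batch = len(preserved)
    losses = []
    for i in range(batch):
        negs = np.vstack([modified[i:i + 1]] +
                         [target[j:j + 1] for j in range(batch) if j != i])
        losses.append(info_nce(preserved[i], target[i], negs, tau))
    return float(np.mean(losses))
```

As a sanity check on the design, a preserved embedding that already matches its target should incur a lower loss than a random one, since its positive similarity is maximal while the modified-branch negative stays penalized.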


Cited By

• (2024) "Simple but Effective Raw-Data Level Multimodal Fusion for Composed Image Retrieval." In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 229–239. DOI: 10.1145/3626772.3657727. Online publication date: 10-Jul-2024.
• (2024) "Embracing Unimodal Aleatoric Uncertainty for Robust Multimodal Fusion." In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 26866–26875. DOI: 10.1109/CVPR52733.2024.02538. Online publication date: 16-Jun-2024.


      Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 6
June 2024, 715 pages
EISSN: 1551-6865
DOI: 10.1145/3613638
Editor: Abdulmotaleb El Saddik

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 08 March 2024
      Online AM: 09 January 2024
      Accepted: 17 December 2023
      Revised: 04 November 2023
      Received: 20 May 2023
      Published in TOMM Volume 20, Issue 6


      Author Tags

      1. Composed query-based image retrieval
      2. cross-modal retrieval
      3. cross-level interaction
      4. preserved and modified attentions

      Qualifiers

      • Research-article

      Funding Sources

      • National Natural Science Foundation of China

