
DOI: 10.1145/3474085.3475619 (MM '21 Conference Proceedings, research article)

Image Search with Text Feedback by Deep Hierarchical Attention Mutual Information Maximization

Published: 17 October 2021

Abstract

Image retrieval with text feedback is an emerging research topic whose goal is to integrate inputs from multiple modalities into a single query. In this setting, a query consists of a reference image plus text feedback that describes the modifications between that image and the desired image. Existing work on this task focuses mainly on designing new fusion networks to compose the image and text, while little attention has been paid to the modality gap caused by the inconsistent feature distributions of different modalities, which strongly affects both feature fusion and similarity learning between queries and the desired image. We propose a Distribution-Aligned Text-based Image Retrieval (DATIR) model, consisting of attention mutual information maximization and hierarchical mutual information maximization, which bridges this gap by increasing the non-linear statistical dependence between representations of different modalities. Specifically, attention mutual information maximization narrows the gap between input modalities by maximizing the mutual information between the text representation and its semantically consistent counterpart, which a difference transformer captures from the reference image and the desired image. Hierarchical mutual information maximization aligns the feature distributions of the image modality and the fusion modality by estimating the mutual information between a single-layer representation in the fusion network and multi-level representations in the desired-image encoder. Extensive experiments on three large-scale benchmark datasets demonstrate that DATIR bridges the modality gap between modalities and achieves state-of-the-art retrieval performance.
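Both objectives rely on estimating mutual information between high-dimensional representations, which in practice is done with a neural lower bound (e.g., MINE or InfoNCE) rather than in closed form. As an illustrative sketch only — this is not the authors' exact estimator, and the batched text/image features here are a hypothetical setup — a contrastive InfoNCE lower bound between a batch of text representations and their matching image representations can be computed as follows:

```python
import numpy as np

def info_nce_lower_bound(text_feats, image_feats, temperature=0.07):
    """InfoNCE lower bound on I(text; image) for a batch of aligned
    representations: row i of each matrix is a positive pair, and every
    other row in the batch serves as a negative."""
    # L2-normalise so the dot products below are cosine similarities.
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    v = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    logits = (t @ v.T) / temperature          # (B, B) similarity matrix
    # Row-wise log-softmax; the positive pair sits on the diagonal.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # InfoNCE bound: I(text; image) >= log(B) + E[log-softmax of positives].
    return np.log(len(t)) + np.mean(np.diag(log_probs))
```

Maximizing such a bound with respect to the encoders pulls matched text/image pairs together relative to mismatched ones within the batch, which is the sense in which mutual-information maximization "aligns" the feature distributions of the two modalities.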

Supplementary Material

RAR File (mfp2476aux.rar)
In this supplementary material, we provide the parameter settings for the compared methods, along with qualitative results and analysis of our method on three standard benchmark datasets.




Published In

cover image ACM Conferences
MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021
5796 pages
ISBN: 9781450386517
DOI: 10.1145/3474085

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. attention
  2. image retrieval
  3. mutual information maximization

Qualifiers

  • Research-article


Conference

MM '21: ACM Multimedia Conference
October 20-24, 2021
Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Article Metrics

  • Downloads (last 12 months): 80
  • Downloads (last 6 weeks): 6

Reflects downloads up to 16 Nov 2024

Cited By

  • (2024) Semantic Editing Increment Benefits Zero-Shot Composed Image Retrieval. In Proceedings of the 32nd ACM International Conference on Multimedia, 1245-1254. DOI: 10.1145/3664647.3681649. Online publication date: 28-Oct-2024.
  • (2024) Multi-Grained Representation Aggregating Transformer with Gating Cycle for Change Captioning. ACM Transactions on Multimedia Computing, Communications, and Applications 20(10), 1-23. DOI: 10.1145/3660346. Online publication date: 12-Sep-2024.
  • (2024) Fine-grained Textual Inversion Network for Zero-Shot Composed Image Retrieval. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 240-250. DOI: 10.1145/3626772.3657831. Online publication date: 10-Jul-2024.
  • (2024) LDRE: LLM-based Divergent Reasoning and Ensemble for Zero-Shot Composed Image Retrieval. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 80-90. DOI: 10.1145/3626772.3657740. Online publication date: 10-Jul-2024.
  • (2024) Self-Training Boosted Multi-Factor Matching Network for Composed Image Retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(5), 3665-3678. DOI: 10.1109/TPAMI.2023.3346434. Online publication date: May-2024.
  • (2024) Align and Retrieve: Composition and Decomposition Learning in Image Retrieval With Text Feedback. IEEE Transactions on Multimedia 26, 9936-9948. DOI: 10.1109/TMM.2024.3417694. Online publication date: 2024.
  • (2024) Enhance Composed Image Retrieval via Multi-Level Collaborative Localization and Semantic Activeness Perception. IEEE Transactions on Multimedia 26, 916-928. DOI: 10.1109/TMM.2023.3273466. Online publication date: 1-Jan-2024.
  • (2024) Geometric-Contextual Mutual Infomax Path Aggregation for Relation Reasoning on Knowledge Graph. IEEE Transactions on Knowledge and Data Engineering 36(7), 3076-3090. DOI: 10.1109/TKDE.2024.3360258. Online publication date: Jul-2024.
  • (2024) Multimodal Composition Example Mining for Composed Query Image Retrieval. IEEE Transactions on Image Processing 33, 1149-1161. DOI: 10.1109/TIP.2024.3359062. Online publication date: 1-Feb-2024.
  • (2024) Multi-Level Contrastive Learning For Hybrid Cross-Modal Retrieval. In ICASSP 2024 (IEEE International Conference on Acoustics, Speech and Signal Processing), 6390-6394. DOI: 10.1109/ICASSP48485.2024.10447444. Online publication date: 14-Apr-2024.
