DOI: 10.1145/3404835.3462881
research-article

Conversational Fashion Image Retrieval via Multiturn Natural Language Feedback

Published: 11 July 2021

Abstract

We study the task of conversational fashion image retrieval via multiturn natural language feedback. Most previous studies are based on single-turn settings. Existing models for multiturn conversational fashion image retrieval have limitations, such as relying on traditional models, which leads to ineffective performance. We propose a novel framework that can effectively handle conversational fashion image retrieval with multiturn natural language feedback texts. One characteristic of the framework is that it searches for candidate images by exploiting the encoded reference image and feedback text information together with the conversation history. Furthermore, the image fashion attribute information is leveraged via a mutual attention strategy. Since no existing fashion dataset is suitable for the multiturn setting of our task, we derive a large-scale multiturn fashion dataset via additional manual annotation efforts on an existing single-turn dataset. The experiments show that our proposed model significantly outperforms existing state-of-the-art methods.
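
The retrieval loop the abstract describes can be illustrated with a toy sketch. This is not the paper's model: the learned image and text encoders, the conversation-history encoder, and the mutual attention over fashion attributes are all replaced here by simple stand-ins chosen for illustration. The function names (`compose_turn`, `aggregate_history`, `rank_candidates`), the element-wise sum, the `decay` parameter, and the toy vectors are all assumptions, not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def compose_turn(image_vec, text_vec):
    # Assumption: fuse the reference image and the feedback text of one turn
    # with an element-wise sum (a stand-in for a learned composition module).
    return [a + b for a, b in zip(image_vec, text_vec)]


def aggregate_history(turn_vecs, decay=0.5):
    # Assumption: fold the turns into one query with an exponentially
    # weighted running average (a stand-in for a learned history encoder).
    state = turn_vecs[0]
    for vec in turn_vecs[1:]:
        state = [decay * s + (1 - decay) * v for s, v in zip(state, vec)]
    return state


def rank_candidates(query, candidates):
    # Score every candidate image embedding against the query, best first.
    scored = [(name, cosine(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy conversation: each turn is (reference image embedding, feedback text embedding).
turns = [
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),  # turn 1
    ([1.0, 1.0, 0.0], [0.0, 0.0, 1.0]),  # turn 2
]

# Toy candidate image embeddings (hypothetical names and values).
candidates = {
    "dress_a": [1.0, 1.0, 1.0],
    "dress_b": [1.0, 0.0, 0.0],
    "dress_c": [0.0, 0.0, -1.0],
}

turn_vecs = [compose_turn(img, txt) for img, txt in turns]
query = aggregate_history(turn_vecs)
ranking = rank_candidates(query, candidates)

for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this sketch the query after two turns reflects both turns rather than only the latest one, which is the point of using the conversation history; a real system would learn each of these components rather than fix them by hand.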

Supplementary Material

MP4 File (SIGIR21-fp0377.mp4)
Presentation video.


    Published In

    SIGIR '21: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval
    July 2021
    2998 pages
    ISBN:9781450380379
    DOI:10.1145/3404835
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. cross-modal retrieval
    2. multimodal embedding
    3. multiturn interactive image retrieval
    4. natural language feedback

    Qualifiers

    • Research-article

    Conference

    SIGIR '21

    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%

    Article Metrics

    • Downloads (last 12 months): 84
    • Downloads (last 6 weeks): 15
    Reflects downloads up to 24 Nov 2024

    Citations

    Cited By

    • (2024) Sample Efficiency Matters: Training Multimodal Conversational Recommendation Systems in a Small Data Setting. Proceedings of the 32nd ACM International Conference on Multimedia (2223-2232). DOI: 10.1145/3664647.3681217. Online publication date: 28-Oct-2024
    • (2024) Personalised Multi-modal Interactive Recommendation with Hierarchical State Representations. ACM Transactions on Recommender Systems 2:3 (1-25). DOI: 10.1145/3651169. Online publication date: 5-Jun-2024
    • (2024) Multimodal Dialogue Systems via Capturing Context-aware Dependencies and Ordinal Information of Semantic Elements. ACM Transactions on Intelligent Systems and Technology 15:3 (1-25). DOI: 10.1145/3645099. Online publication date: 15-Apr-2024
    • (2024) Fine-grained Textual Inversion Network for Zero-Shot Composed Image Retrieval. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (240-250). DOI: 10.1145/3626772.3657831. Online publication date: 10-Jul-2024
    • (2024) Asking Multimodal Clarifying Questions in Mixed-Initiative Conversational Search. Proceedings of the ACM Web Conference 2024 (1474-1485). DOI: 10.1145/3589334.3645483. Online publication date: 13-May-2024
    • (2024) Learning Structured Relation Embeddings for Fine-Grained Fashion Attribute Recognition. IEEE Transactions on Multimedia 26 (1652-1664). DOI: 10.1109/TMM.2023.3284593. Online publication date: 1-Jan-2024
    • (2023) A Survey on Fashion Image Retrieval. ACM Computing Surveys 56:6 (1-25). DOI: 10.1145/3636552. Online publication date: 13-Dec-2023
    • (2023) Multimodal Fashion Knowledge Extraction as Captioning. Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region (52-62). DOI: 10.1145/3624918.3625315. Online publication date: 26-Nov-2023
    • (2023) Composed Image Retrieval using Contrastive Learning and Task-oriented CLIP-based Features. ACM Transactions on Multimedia Computing, Communications, and Applications 20:3 (1-24). DOI: 10.1145/3617597. Online publication date: 23-Oct-2023
    • (2023) Goal-Oriented Multi-Modal Interactive Recommendation with Verbal and Non-Verbal Relevance Feedback. Proceedings of the 17th ACM Conference on Recommender Systems (362-373). DOI: 10.1145/3604915.3608775. Online publication date: 14-Sep-2023
