
Dual attention composition network for fashion image retrieval with attribute manipulation

  • Original Article
Neural Computing and Applications

Abstract

Due to practical demands and substantial potential benefits, there is growing interest in fashion image retrieval with attribute manipulation. For example, a user who wants a product similar to a query image but with the attribute “3/4 sleeves” instead of “short sleeves” can modify the query by entering text. Unlike general items, fashion items are rich in categories and attributes, and items that differ in a single attribute may be visually almost indistinguishable. Moreover, the visual appearance of fashion items changes dramatically under different conditions, such as lighting, viewing angle, and occlusion. These factors make fashion retrieval challenging. We therefore learn an attribute-specific embedding space for each attribute to obtain discriminative features. In this paper, we propose a dual attention composition network for image retrieval with attribute manipulation, which addresses two key questions: where to focus and how to modify. The dual attention module captures fine-grained image-text alignment through corresponding spatial and channel attention and then realizes multi-modal composition through corresponding affine transformations. A TIRG-based semantic composition module combines the query image's attention features with the manipulation text's embedding features to produce a synthetic representation close to the target image. Meanwhile, we investigate the semantic hierarchy of attributes and propose a hierarchical encoding method that preserves the associations between attributes for efficient feature learning. Extensive experiments on three multi-modal fashion retrieval datasets demonstrate the superiority of our network.
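The pipeline the abstract describes, text-guided spatial and channel attention followed by a TIRG-style gated-residual composition, can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: all shapes, the dot-product affinity scores, the mean-pooled channel gate, the stand-in residual branch, and the scalar weights `w_g` and `w_r` are assumptions made for compactness (in TIRG and in this paper these parts are learned networks).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
C, H, W = 8, 4, 4            # channels and spatial dims; text dim == C for simplicity

img = rng.standard_normal((C, H, W))   # image feature map from a CNN backbone
txt = rng.standard_normal(C)           # embedding of the manipulation text

# --- spatial attention: the text decides WHERE to focus ---
flat = img.reshape(C, H * W)            # (C, HW)
spatial_scores = txt @ flat             # (HW,) text-location affinity
spatial_att = softmax(spatial_scores)   # weights over the HW locations
img_sp = flat * spatial_att             # reweight each spatial location

# --- channel attention: the text decides WHICH channels matter ---
channel_att = sigmoid(txt * flat.mean(axis=1))  # (C,) per-channel gate
img_att = (img_sp.T * channel_att).T            # apply the channel gate

# pooled, attended image feature
f_img = img_att.sum(axis=1)             # (C,)

# --- TIRG-style gated-residual composition (how to MODIFY) ---
gate = sigmoid(f_img * txt)             # gating feature
res = f_img + txt                       # residual feature (stand-in for an MLP)
w_g, w_r = 1.0, 0.1                     # learnable scalars in the real model
composed = w_g * gate * f_img + w_r * res  # representation matched against targets

print(composed.shape)  # (8,)
```

In training, `composed` would be pushed close to the target image's feature under a metric-learning loss (e.g. a triplet loss); here it simply illustrates how the gate preserves the query image while the residual injects the textual modification.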

[Figures 1–5 appear in the full text.]


Data availability

The data that support the findings of this study are derived from the following public-domain resources:

  • FashionIQ: https://github.com/hongwang600/fashion-iq-metadata
  • Fashion200k: https://github.com/xthan/fashion-200k
  • Shoes: https://github.com/XiaoxiaoGuo/fashion-retrieval/tree/master/dataset


Author information


Corresponding author

Correspondence to Yongquan Wan.

Ethics declarations

Conflict of interest

The authors declare that they have no commercial or associative interests that represent a conflict of interest in connection with the submitted work.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wan, Y., Zou, G., Yan, C. et al. Dual attention composition network for fashion image retrieval with attribute manipulation. Neural Comput & Applic 35, 5889–5902 (2023). https://doi.org/10.1007/s00521-022-07994-9

