short-paper

CLCP: Realtime Text-Image Retrieval for Retailing via Pre-trained Clustering and Priority Queue

Authors:

Yanzhi Song

ICMR '24: Proceedings of the 2024 International Conference on Multimedia Retrieval

Pages 1089 - 1093

https://doi.org/10.1145/3652583.3657608

Published: 07 June 2024

Abstract

Real-time matching between customer demands and product information via text-image retrieval remains a fundamental problem in intelligent retailing. However, the task poses challenges in data quality, multi-modal retrieval strategy, and runtime efficiency. To address these challenges, we propose a cross-modality retrieval pipeline that leverages a contrastive loss and a novel sampling strategy. We also treat text-image retrieval as a two-stage process involving unsupervised clustering and contrastive feature representation. Additionally, we create an image-caption matching dataset by expanding the Grocery Store Dataset using a foundation visual-language model. Our experiments demonstrate the effectiveness of our method on both the new expanded dataset and the well-known cross-modality retrieval benchmark Flickr30k.
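
The two-stage idea in the abstract (cluster first, then rank by feature similarity) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not describe how the clusters are built, how the priority queue named in the title is used, or what the sampling strategy is. The sketch below assumes precomputed cluster centroids, hand-written toy embeddings, cosine similarity as the score, and a bounded min-heap for top-k selection. The function names `cosine` and `two_stage_retrieve` and the file names are hypothetical.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def two_stage_retrieve(query, centroids, clusters, k=2):
    """Toy two-stage retrieval (an assumed design, not the paper's).

    Stage 1: pick the cluster whose centroid is most similar to the query.
    Stage 2: score only that cluster's items, keeping the best k in a
    bounded min-heap so the weakest kept candidate is always at heap[0].
    """
    best = max(range(len(centroids)), key=lambda c: cosine(query, centroids[c]))
    heap = []  # entries are (score, item_id); at most k entries
    for item_id, emb in clusters[best]:
        score = cosine(query, emb)
        if len(heap) < k:
            heapq.heappush(heap, (score, item_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, item_id))
    # Return item ids from highest to lowest score.
    return [item_id for _, item_id in sorted(heap, reverse=True)]


# Made-up 2-D "image embeddings" grouped into two clusters.
centroids = [[1.0, 0.0], [0.0, 1.0]]
clusters = [
    [("apple.jpg", [0.9, 0.1]), ("pear.jpg", [0.8, 0.3]), ("plum.jpg", [0.6, 0.5])],
    [("milk.jpg", [0.1, 0.9]), ("yogurt.jpg", [0.2, 0.8])],
]
query = [1.0, 0.05]  # stands in for an encoded text query
result = two_stage_retrieve(query, centroids, clusters, k=2)
print(result)
```

Restricting the second stage to one cluster means only a fraction of the gallery is scored per query, which is the kind of saving a real-time system would look for; the cost is that a relevant item assigned to a different cluster is never considered.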

Index Terms

1. Information systems
  1. Information retrieval
    1. Specialized information retrieval
      1. Multimedia and multimodal retrieval

Published In

ICMR '24: Proceedings of the 2024 International Conference on Multimedia Retrieval

May 2024

1379 pages

ISBN:9798400706196

DOI:10.1145/3652583

General Chairs: Cathal Gurrin (Dublin City University, Ireland), Rachada Kongkachandra (Thammasat University, Thailand), Klaus Schoeffmann (Klagenfurt University, Austria)

Program Chairs: Duc-Tien Dang-Nguyen (University of Bergen, Norway), Luca Rossetto (University of Zurich, Switzerland), Shin'ichi Satoh (National Institute of Informatics, Japan), Liting Zhou (Dublin City University, Ireland)

Copyright © 2024 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Short-paper

Funding Sources

National Key Research and Development Program of China
Major Project of Science and Technology of Anhui Province
National Science Foundation of China

Conference

ICMR '24

ICMR '24: International Conference on Multimedia Retrieval

June 10 - 14, 2024

Phuket, Thailand

Acceptance Rates

Overall Acceptance Rate 254 of 830 submissions, 31%
