DOI: 10.1145/3652583.3657608
short-paper

CLCP: Realtime Text-Image Retrieval for Retailing via Pre-trained Clustering and Priority Queue

Published: 07 June 2024

Abstract

Real-time matching between customer demands and product information via text-image retrieval remains a fundamental problem in intelligent retailing. However, this process faces challenges in data quality, multi-modal retrieval strategy, and runtime efficiency. To address these challenges, we propose a cross-modality retrieval pipeline that leverages contrastive loss and a novel sampling strategy. We also treat text-image retrieval as a two-stage process involving unsupervised clustering and contrastive feature representation. Additionally, we create an image-caption matching dataset by expanding the Grocery Store Dataset using a foundation vision-language model. Our experiments demonstrate the effectiveness of our method on both the new expanded dataset and the well-known cross-modality retrieval benchmark Flickr30k.
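
The abstract does not spell out the pipeline, so the following is only a minimal sketch of a generic cluster-then-rank retrieval scheme, not the authors' implementation. It assumes that text and image embeddings already live in a shared space (for example, produced by a contrastively trained encoder) and that cluster centroids were computed offline. The function names (`build_index`, `retrieve`), the toy two-dimensional embeddings, and the use of cosine similarity are all assumptions made for illustration.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_centroid(embedding, centroids):
    """Index of the centroid most similar to the embedding."""
    return max(range(len(centroids)), key=lambda c: cosine(embedding, centroids[c]))


def build_index(image_embeddings, centroids):
    """Stage 1 (offline, hypothetical): bucket each image under its nearest centroid."""
    index = {c: [] for c in range(len(centroids))}
    for image_id, embedding in image_embeddings.items():
        index[nearest_centroid(embedding, centroids)].append(image_id)
    return index


def retrieve(query_embedding, image_embeddings, centroids, index, k=2):
    """Stage 2 (online, hypothetical): rank only the images in the query's cluster.

    heapq.nlargest keeps a bounded priority queue of the k best candidates.
    """
    cluster = nearest_centroid(query_embedding, centroids)
    scored = ((cosine(query_embedding, image_embeddings[i]), i) for i in index[cluster])
    return [image_id for _, image_id in heapq.nlargest(k, scored)]


# Toy data: made-up 2-D embeddings standing in for real encoder outputs.
images = {
    "apple.jpg": [0.9, 0.1],
    "pear.jpg": [0.8, 0.3],
    "milk.jpg": [0.1, 0.9],
    "yogurt.jpg": [0.2, 0.8],
}
centroids = [[1.0, 0.0], [0.0, 1.0]]

index = build_index(images, centroids)
result = retrieve([0.95, 0.05], images, centroids, index, k=2)
print(result)
```

In a scheme like this, restricting the second stage to one cluster trades some recall for lower latency, since the query is compared against the centroids and one bucket rather than the whole catalogue.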


    Published In

    ICMR '24: Proceedings of the 2024 International Conference on Multimedia Retrieval
    May 2024
    1379 pages
    ISBN:9798400706196
    DOI:10.1145/3652583
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. contrastive loss
    2. text-image retrieval
    3. vision-language model

    Qualifiers

    • Short-paper

    Conference

    ICMR '24
    Acceptance Rates

    Overall Acceptance Rate 254 of 830 submissions, 31%
