Abstract
The dependence on human relevance judgments limits the development of information retrieval test collections, which are vital for evaluating retrieval systems. Since their launch, large language models (LLMs) have been applied to automate several human tasks. Recently, LLMs have started to be used to provide relevance judgments for document search. In this work, our goal is to assess whether LLMs can replace human annotators in a different setting: product search in eCommerce. We conducted experiments on open and proprietary industrial datasets to measure LLMs' ability to predict relevance judgments. We found that LLM-generated relevance assessments show strong agreement (~82%) with human annotations, indicating that LLMs have an innate ability to perform relevance judgments in an eCommerce setting. We then went further and tested whether LLMs can generate annotation guidelines. We found that relevance assessments obtained with LLM-generated guidelines are as accurate as those obtained from human instructions. (The source code for this work is available at https://github.com/danimtk/chatGPT-goes-shopping.)
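As a rough illustration of what an agreement figure between LLM and human relevance labels can mean, the sketch below computes raw percent agreement and Cohen's kappa on a toy sample. The abstract does not say which agreement measure the authors used, so both functions, the label names, and the toy data are assumptions made for illustration only and are not taken from the paper or its repository.

```python
from collections import Counter


def percent_agreement(human, llm):
    """Fraction of items on which the two label lists match."""
    if len(human) != len(llm) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(h == m for h, m in zip(human, llm))
    return matches / len(human)


def cohen_kappa(human, llm):
    """Chance-corrected agreement between two annotators."""
    n = len(human)
    observed = percent_agreement(human, llm)
    human_counts = Counter(human)
    llm_counts = Counter(llm)
    # Agreement expected if both annotators labelled at random
    # according to their own label frequencies.
    labels = set(human) | set(llm)
    expected = sum(human_counts[l] * llm_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy data: one label per query-product pair.
human_labels = ["relevant", "relevant", "irrelevant", "relevant", "irrelevant"]
llm_labels = ["relevant", "irrelevant", "irrelevant", "relevant", "irrelevant"]

agreement = percent_agreement(human_labels, llm_labels)
kappa = cohen_kappa(human_labels, llm_labels)
print(f"percent agreement: {agreement:.2f}")
print(f"Cohen's kappa: {kappa:.2f}")
```

Raw agreement is easy to read but can look high when one label dominates, which is why a chance-corrected measure such as kappa is often reported alongside it.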
B. Soviero and D. Kuhn—Work conducted during an internship at VTEX.
Acknowledgments
The authors thank Shervin Malmasi for his helpful comments and suggestions. This work has been financed in part by VTEX BRASIL (EMBRAPII PCEE1911.0140), CAPES Finance Code 001, and CNPq/Brazil.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Soviero, B., Kuhn, D., Salle, A., Moreira, V.P. (2024). ChatGPT Goes Shopping: LLMs Can Predict Relevance in eCommerce Search. In: Goharian, N., et al. Advances in Information Retrieval. ECIR 2024. Lecture Notes in Computer Science, vol 14611. Springer, Cham. https://doi.org/10.1007/978-3-031-56066-8_1
DOI: https://doi.org/10.1007/978-3-031-56066-8_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-56065-1
Online ISBN: 978-3-031-56066-8
eBook Packages: Computer Science, Computer Science (R0)