Can we trust LLMs as relevance judges?

  • Luciana Bencke Universidade Federal do Rio Grande do Sul (UFRGS)
  • Felipe S. F. Paula Universidade Federal do Rio Grande do Sul (UFRGS)
  • Bruno G. T. dos Santos Universidade Federal do Rio Grande do Sul (UFRGS)
  • Viviane P. Moreira Universidade Federal do Rio Grande do Sul (UFRGS)

Abstract

Evaluation is key for Information Retrieval (IR) systems and requires test collections consisting of documents, queries, and relevance judgments. Obtaining relevance judgments is the most costly step in creating a test collection because it demands human intervention. A recent trend in the area is to replace human annotators with Large Language Models (LLMs). In this paper, we investigate how reliable LLMs are as a source of relevance judgments. We experimented with different LLMs and test collections in Portuguese. Our results show that LLMs can achieve promising performance that is competitive with human annotations.
Keywords: information retrieval, LLM evaluation, relevance judgments
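The abstract describes the core experimental loop: prompt an LLM for a relevance label on each query-document pair and measure its agreement with the human judgments in the test collection. A minimal Python sketch of that loop follows; the `llm_judge` stub and its prompt wording are illustrative assumptions rather than the authors' actual setup, and Cohen's kappa is shown as one common chance-corrected agreement measure for this kind of comparison.

```python
from collections import Counter

# Hypothetical stand-in for a real LLM call; the prompt wording is an
# illustrative assumption, not the prompt used in the paper.
def llm_judge(query: str, passage: str) -> int:
    prompt = (
        f"Query: {query}\nPassage: {passage}\n"
        "Is the passage relevant to the query? Answer only 0 (no) or 1 (yes)."
    )
    # response = some_llm_client.complete(prompt)  # plug in an actual client
    raise NotImplementedError("connect an LLM client here")

def cohen_kappa(human: list[int], llm: list[int]) -> float:
    """Chance-corrected agreement between two label sequences."""
    assert human and len(human) == len(llm)
    n = len(human)
    observed = sum(h == m for h, m in zip(human, llm)) / n
    human_counts, llm_counts = Counter(human), Counter(llm)
    expected = sum(
        human_counts[c] * llm_counts[c] for c in set(human) | set(llm)
    ) / (n * n)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

# Toy labels only, to show the computation; real qrels come from the
# test collection, and the LLM labels from llm_judge above.
human_qrels = [1, 0, 1, 1, 0, 0, 1, 0]
llm_qrels   = [1, 0, 1, 0, 0, 0, 1, 1]
print(f"Cohen's kappa: {cohen_kappa(human_qrels, llm_qrels):.3f}")  # 0.500
```

On a real collection, an LLM-human kappa approaching the agreement typically observed between human annotators would support the paper's conclusion that LLM labels can stand in for human ones.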

Published
14/10/2024
BENCKE, Luciana; PAULA, Felipe S. F.; DOS SANTOS, Bruno G. T.; P. MOREIRA, Viviane. Can we trust LLMs as relevance judges? In: SIMPÓSIO BRASILEIRO DE BANCO DE DADOS (SBBD), 39., 2024, Florianópolis/SC. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 600-612. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2024.243130.