A Merge Sort Based Ranking System for the Evaluation of Large Language Models

Published: 08 September 2024
DOI: 10.1007/978-3-031-70378-2_15

Abstract

Efficient and accurate evaluation of Large Language Models (LLMs) is essential for progress in natural language processing. To address this, our paper introduces Transitive Merge Sort (TMS), a novel method that harnesses merge sort's efficiency, stability, and parallelizability for model ranking in LLM evaluation. The approach applies a divide-and-conquer strategy to pairwise comparisons, streamlining the evaluation process. Our experiments show that TMS not only improves the accuracy of model rankings compared to methods such as Elo rating and SuperCLUE (compared with GPT-3.5) but also reduces the annotation resources required by up to 70%. Additionally, we present an iterated version of TMS that effectively handles scenarios where initial model rankings are unknown.
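To make the ranking scheme concrete, here is a minimal sketch of what a merge-sort-based ranking over models could look like, assuming a pairwise judge. The `judge` callable below is a hypothetical stand-in for an LLM-adjudicated head-to-head comparison; it is not the paper's actual implementation, which additionally exploits transitivity and offers an iterated variant for unknown initial rankings.

```python
from typing import Callable, List

Model = str
# Returns True if model `a` beats model `b` in a head-to-head evaluation,
# e.g. as decided by a judge LLM over a shared set of prompts (assumption).
Judge = Callable[[Model, Model], bool]


def merge(left: List[Model], right: List[Model], judge: Judge) -> List[Model]:
    """Merge two already-ranked (best-first) lists via pairwise judgments."""
    merged: List[Model] = []
    i = j = 0
    while i < len(left) and j < len(right):
        if judge(left[i], right[j]):
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])   # at most one of these two is non-empty
    merged.extend(right[j:])
    return merged


def merge_sort_rank(models: List[Model], judge: Judge) -> List[Model]:
    """Rank models best-first using O(n log n) pairwise comparisons."""
    if len(models) <= 1:
        return models
    mid = len(models) // 2
    # The two halves are independent, so they can be judged in parallel.
    left = merge_sort_rank(models[:mid], judge)
    right = merge_sort_rank(models[mid:], judge)
    return merge(left, right, judge)
```

Relative to round-robin schemes such as Elo rating, which require on the order of n² head-to-head comparisons for n models, a merge-sort schedule needs only O(n log n), and the independent recursive halves can be evaluated concurrently; this structural saving is consistent with the annotation-cost reduction the abstract reports.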

References

[1]
Berry, K.J., Mielke, P.W., Jr.: Spearman's footrule as a measure of agreement. Psychol. Rep. 80(3), 839–846 (1997)
[2]
Bommasani, R., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021)
[3]
Chang, Y., et al.: A survey on evaluation of large language models. ACM Trans. Intell. Syst. Technol. 15, 1–45 (2023)
[4]
OpenCompass Contributors: OpenCompass: a universal evaluation platform for foundation models (2023). https://github.com/open-compass/opencompass
[5]
Elo, A.E.: The proposed USCF rating system, its development, theory, and applications. Chess Life 22(8), 242–247 (1967)
[6]
Hendrycks, D., et al.: Measuring massive multitask language understanding. In: Proceedings of the International Conference on Learning Representations (ICLR) (2021)
[7]
Huang, Y., et al.: C-eval: a multi-level multi-discipline Chinese evaluation suite for foundation models. In: Advances in Neural Information Processing Systems, vol. 36 (2024)
[8]
Li, R., Patel, T., Du, X.: PRD: peer rank and discussion improve large language model based evaluations. arXiv preprint arXiv:2307.02762 (2023)
[9]
Liang, P., et al.: Holistic evaluation of language models. arXiv preprint arXiv:2211.09110 (2023)
[10]
Lin, C.Y.: ROUGE: a package for automatic evaluation of summaries. In: Text Summarization Branches Out, pp. 74–81 (2004)
[11]
Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318 (2002)
[12]
BIG-bench authors: Beyond the imitation game: quantifying and extrapolating the capabilities of language models. Trans. Mach. Learn. Res. (2023)
[13]
Wang, P., et al.: Large language models are not fair evaluators. arXiv preprint arXiv:2305.17926 (2023)
[14]
Wei, J., et al.: Emergent abilities of large language models. arXiv preprint arXiv:2206.07682 (2022)
[15]
[16]
Xu, C., et al.: WizardLM: empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023)
[17]
Xu, L., et al.: SuperCLUE: a comprehensive Chinese large language model benchmark. arXiv preprint arXiv:2307.15020 (2023)
[18]
Zhang, Y., et al.: LLMEval: a preliminary study on how to evaluate large language models. arXiv preprint arXiv:2312.07398 (2023)
[19]
Zheng, L., et al.: Judging LLM-as-a-judge with MT-bench and chatbot arena. In: Advances in Neural Information Processing Systems, vol. 36 (2024)
[20]
Zhong, W., et al.: AGIEval: a human-centric benchmark for evaluating foundation models. arXiv preprint arXiv:2304.06364 (2023)

    Published In

    Machine Learning and Knowledge Discovery in Databases. Applied Data Science Track: European Conference, ECML PKDD 2024, Vilnius, Lithuania, September 9–13, 2024, Proceedings, Part IX
    Sep 2024
    520 pages
ISBN: 978-3-031-70377-5
DOI: 10.1007/978-3-031-70378-2
Editors: Albert Bifet, Tomas Krilavičius, Ioanna Miliou, Slawomir Nowaczyk

    Publisher

Springer-Verlag, Berlin, Heidelberg


    Author Tags

    1. Merge sort
    2. Pairwise comparison
    3. Model evaluation
