short-paper

Towards Personalized Evaluation of Large Language Models with An Anonymous Crowd-Sourcing Platform

Authors: Mingyue Cheng, Hao Zhang, Jiqian Yang, Qi Liu, Li Li, Xin Huang, Liwei Song, Zhi Li, Zhenya Huang, Enhong Chen

WWW '24: Companion Proceedings of the ACM Web Conference 2024

Pages 1035 - 1038

https://doi.org/10.1145/3589335.3651243

Published: 13 May 2024


Abstract

Evaluation plays a pivotal role in improving the capabilities of large language models (LLMs), and numerous evaluation methods have been proposed. Despite their effectiveness, existing methods mainly assess objective questions and overlook subjective questions, which are extremely common in LLM usage. These methods also rely predominantly on centralized datasets, with question banks concentrated within the evaluation platforms themselves. Moreover, their evaluation processes often overlook personalized factors, ignoring the individual characteristics of both the evaluators and the models being evaluated. To address these limitations, we propose BingJian, a novel anonymous crowd-sourcing evaluation platform for large language models. BingJian employs a competitive scoring mechanism in which users rank models based on their performance. The platform supports centralized evaluations that assess the general capabilities of models, and it also offers an open evaluation gateway through which users can submit their own questions, testing the models on a personalized and potentially broader range of capabilities. Furthermore, the platform introduces personalized evaluation scenarios, leveraging various forms of human-computer interaction to assess large language models in a manner that accounts for individual user preferences and contexts. The demonstration of BingJian can be accessed at https://github.com/Mingyue-Cheng/Bingjian.
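
The abstract does not say how the competitive scoring mechanism is computed. As an illustration only, the sketch below assumes an Elo-style update driven by pairwise user votes, which is one common way to turn such votes into a ranking. The function names, the model names, the initial rating of 1000, and the K-factor of 32 are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def expected_score(r_a, r_b):
    """Elo-model probability that a model rated r_a beats one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(ratings, winner, loser, k=32.0):
    """Apply one pairwise vote in place: the winner gains what the loser loses."""
    e_w = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - e_w)  # larger gain when the win was unexpected
    ratings[winner] += delta
    ratings[loser] -= delta


def rank_models(votes, initial=1000.0, k=32.0):
    """Process (winner, loser) votes in order and return models sorted by rating.

    `initial` and `k` are assumed defaults, not values from the paper.
    """
    ratings = defaultdict(lambda: initial)
    for winner, loser in votes:
        update_elo(ratings, winner, loser, k)
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical anonymous votes: each tuple is (preferred model, other model).
    votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_c", "model_b")]
    for name, rating in rank_models(votes):
        print(f"{name}: {rating:.1f}")
```

Under these assumptions, a single vote between two models with equal ratings moves each rating by half the K-factor, and the sum of all ratings stays constant. A real platform would also need to decide how to treat ties and how to weight personalized or user-submitted questions, which this sketch does not address.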




Index Terms

Computing methodologies → Artificial intelligence → Natural language processing


Published In

WWW '24: Companion Proceedings of the ACM Web Conference 2024

May 2024

1928 pages

ISBN:9798400701726

DOI:10.1145/3589335

General Chairs: Tat-Seng Chua (National University of Singapore), Chong-Wah Ngo (Singapore Management University)

Program Chairs: Ravi Kumar (Google), Hady W. Lauw (Singapore Management University), Roy Ka-Wei Lee (Singapore University of Technology and Design)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States


Conference

WWW '24

Sponsor:

SIGWEB

WWW '24: The ACM Web Conference 2024

May 13 - 17, 2024

Singapore, Singapore

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
