%0 Conference Proceedings %T AlignBench: Benchmarking Chinese Alignment of Large Language Models %A Liu, Xiao %A Lei, Xuanyu %A Wang, Shengyuan %A Huang, Yue %A Feng, Andrew %A Wen, Bosi %A Cheng, Jiale %A Ke, Pei %A Xu, Yifan %A Tam, Weng Lam %A Zhang, Xiaohan %A Sun, Lichao %A Gu, Xiaotao %A Wang, Hongning %A Zhang, Jing %A Huang, Minlie %A Dong, Yuxiao %A Tang, Jie %Y Ku, Lun-Wei %Y Martins, Andre %Y Srikumar, Vivek %S Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) %D 2024 %8 August %I Association for Computational Linguistics %C Bangkok, Thailand %F liu-etal-2024-alignbench %X Alignment has become a critical step for instruction-tuned Large Language Models (LLMs) to become helpful assistants. However, effective evaluation of alignment for emerging Chinese LLMs is still significantly lacking, calling for real-scenario grounded, open-ended, challenging and automatic evaluations tailored for alignment. To fill in this gap, we introduce AlignBench, a comprehensive multi-dimensional benchmark for evaluating LLMs’ alignment in Chinese. We tailor a human-in-the-loop data curation pipeline, containing 8 main categories, 683 real-scenario rooted queries and corresponding human-verified references. To ensure references’ correctness, each knowledge-intensive query is accompanied with evidences collected from reliable webpages (including the url and quotation) by our annotators. For automatic evaluation, our benchmark employs a rule-calibrated multi-dimensional LLM-as-Judge (CITATION) with Chain-of-Thought to generate explanations and final ratings as evaluations, ensuring high reliability and interpretability. All evaluation codes and data are publicly available at https://github.com/THUDM/AlignBench %R 10.18653/v1/2024.acl-long.624 %U https://aclanthology.org/2024.acl-long.624 %U https://doi.org/10.18653/v1/2024.acl-long.624 %P 11621-11640