Assessing the capabilities of large language models (LLMs) is increasingly challenging due to their generality and uneven task performance. Often, we do not know how much of the success or failure on a particular task is due to the ‘loading’ of the linguistic elements of the task, such as narrative understanding, versus other intrinsic (non-linguistic) components, such as domain-specific common sense or reasoning capabilities. Understanding which tasks are most heavily loaded on language and determining the predictability of LLMs on these tasks is crucial for improving benchmarks, designing better LLMs, and ensuring their safe deployment. We present an innovative methodology that uses LLMs to annotate linguistic meta-features, allowing us to predict task difficulty and understand linguistic loadings more accurately than traditional readability scores. Using GPT-4 for automated annotation, we show strong predictability for a variety of tasks and language models (e.g., MMLU with R² from 0.68 to 0.83), but observe limited predictability for other tasks (e.g., LSAT with R² of -0.07).
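The following is a minimal sketch, not the authors' implementation, of the general idea the abstract describes: annotate each benchmark instance with linguistic meta-features (here synthetic placeholders standing in for GPT-4 annotations), fit an assessor model that predicts per-instance success from those features, and report held-out R². Feature names, the choice of regressor, and the data-generating process are all illustrative assumptions.

```python
# Sketch of a meta-feature-based difficulty assessor (assumptions noted above).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Hypothetical per-item meta-features on a 0-1 scale, e.g. narrative load,
# syntactic complexity, domain-specific reasoning demand (placeholders for
# GPT-4-annotated features).
n_items = 2000
X = rng.uniform(0.0, 1.0, size=(n_items, 3))

# Synthetic stand-in for per-item LLM outcomes: success probability drops as
# the combined linguistic and reasoning load grows.
logit = 4.0 * (0.5 * X[:, 0] + 0.3 * X[:, 1] + 0.2 * X[:, 2] - 0.5)
p_success = 1.0 / (1.0 + np.exp(logit))
y = rng.binomial(1, p_success).astype(float)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Assessor: regress per-item success on the annotated meta-features.
assessor = RandomForestRegressor(n_estimators=200, random_state=0)
assessor.fit(X_train, y_train)

# R² on held-out items measures how predictable the model's performance is
# from the meta-features alone (it can be negative when prediction fails).
print("Held-out R^2:", r2_score(y_test, assessor.predict(X_test)))
```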