1 Introduction
Large Language Models (LLMs) have catalyzed the creation of a wide array of novel applications. Composed of billions of parameters and trained on billions of tokens, LLMs can interpret a natural language description of a task (i.e., a prompt) and generate coherent, human-like outputs for diverse purposes [9, 39, 46] (e.g., summarization [69], dialogue [63], story writing [13]). By composing a prompt, developers and researchers (i.e., prompt designers) can guide LLMs to perform novel tasks that satisfy desired requirements and support specific application settings. For example, HCI researchers have leveraged the generative capabilities of LLMs to ideate possible journalistic angles for a given event [50], generate questions to quiz children about information they learned [33], or simplify research papers into plain language [4].
Although prompt designers can easily bootstrap AI-based applications by simply composing a prompt, developing a prototype into a polished application that consistently produces high-quality outputs requires more dedicated effort. As LLMs are non-deterministic and even minor changes in a prompt can significantly influence the generated outputs [37, 41], designers need to iterate on their prompts multiple times to achieve satisfactory results [27, 39, 57, 69, 72, 73]. In this iterative process, designers test their prompt with sample inputs (e.g., paragraphs to summarize), inspect the generated outputs to identify areas for improvement, revise their prompts (e.g., change structure, wording, content), and repeat. When designers adopt LLMs for more open-ended generative tasks, however, evaluating outputs becomes significantly more challenging as no automatic metric can adequately encode and measure the subjective quality of outputs [14]. Due to the lack of suitable automatic metrics, generative tasks are typically evaluated by human annotators or experts [23], but such evaluations can be impractical during early development stages when designers need to quickly iterate on prompts.
To understand how evaluation challenges affect the development of LLM-based applications, we conducted formative interviews with 8 prompt designers (e.g., developers and researchers in HCI and ML) about how they iterate on and evaluate their prompts. Our interviews revealed that designers considered multiple criteria that were unique and specific to their applications when evaluating outputs from their prompts. Due to the novelty of these criteria and the significant cost of recruiting annotators, however, designers had to manually evaluate their prompt outputs themselves. As this manual and multi-faceted evaluation of outputs incurred a significant cognitive load, designers could only evaluate small batches of outputs and only on a subset of their criteria. As a result, when they refined their prompts, designers could not fully verify how their refinements had affected output quality or identify where further refinements were needed.
Based on these findings, we introduce EvalLM to facilitate prompt iterations by supporting the evaluation of outputs on user-defined and application-specific criteria (e.g., measuring “Object Familiarity” in scientific analogies for children). Instead of focusing on the low-level task of assessing generated outputs, EvalLM shifts designers’ focus to the higher-level process of refining prompts and criteria, which represent their plans and requirements. Inspired by recent techniques for LLM-based evaluations [40, 71, 76], EvalLM employs an LLM as both (1) an evaluation assistant, which evaluates outputs on the defined criteria, and (2) a criteria reviewer, which revises the defined criteria. To aid users in revising their prompts and criteria, the evaluation assistant explains its assessments, allowing the user to identify where prompt outputs fell short or where the assistant’s interpretation of the criteria misaligned with their own. Furthermore, the criteria reviewer analyzes the user’s criteria to identify revisions that can lead to evaluations of outputs on more specific and fine-grained dimensions. Through iterations of this collaborative process, designers co-evolve their prompts and criteria: prompts improve to satisfy the criteria, and criteria improve to discern the quality of prompts, ultimately leading to more polished applications.
To understand how prompt designers adopt automatic evaluations during prompt iterations, we conducted a within-subjects study (N=12) where participants improved and evaluated prompts for novel tasks proposed by recent HCI work. In the study, participants used both EvalLM and a baseline where they manually evaluated outputs, emulating designers’ current practice. Our study revealed that EvalLM helped participants “debug” their prompts by allowing them to quickly identify areas for improvement, and the evaluation assistant’s explanations served as feedback that helped participants think about how to make these improvements. As a result, we observed that participants reached satisfactory prompts more efficiently, testing 59% fewer changes than when they did not have evaluation assistance. As EvalLM also facilitated criteria revision, participants felt higher satisfaction with the quality of their criteria, suggesting that these criteria could be valuable during subsequent human evaluations. Overall, these findings suggest that EvalLM can fill the current gap between application development and deployment by assisting designers in iterating on prompts until they have the confidence to commit resources to more robust human evaluations.
This work presents the following contributions:
(1) Qualitative findings from interviews with prompt designers (N = 8) that revealed how the effort of manually evaluating outputs on multiple, task-specific criteria can inhibit designers from making informed decisions during the iteration process.
(2) EvalLM, an interactive system that aids users in revising prompts and verifying the effect of revisions by employing an LLM-based evaluation assistant to assess outputs on user-defined criteria, and a criteria reviewer to refine these criteria to assess more specific and detailed dimensions of outputs.
(3) Findings from a user study (N = 12) that demonstrated how EvalLM can aid designers in debugging their prompts and ideating on strategies to more effectively revise their prompts.
B Prompts
Below, we present the full prompts used in EvalLM, where blue text represents content that is programmatically filled in. Regarding the context length limit of the LLMs, all of the prompts are around 300-500 tokens long, and the criteria created by participants in our study were, on average, approximately 30 tokens long. Considering GPT-4’s context window of 8,000 tokens and an approximate length of 200 tokens for an average paragraph, a relatively large number of criteria and relatively lengthy outputs can be evaluated and reviewed using our prompts.
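As a rough illustration, this budget arithmetic can be sketched as follows; the figures are the approximations quoted above, and the helper function is purely illustrative.

# Back-of-the-envelope context budget for the evaluation prompt,
# using the approximate token counts quoted above.
CONTEXT_WINDOW = 8000       # GPT-4 context window (tokens)
PROMPT_TEMPLATE = 500       # upper bound on the fixed prompt text (300-500 tokens)
TOKENS_PER_CRITERION = 30   # average criterion length in our study
TOKENS_PER_PARAGRAPH = 200  # approximate length of an average paragraph

def remaining_budget(n_criteria: int, n_output_paragraphs: int) -> int:
    """Tokens left for the instruction, input, and the model's own feedback."""
    used = (PROMPT_TEMPLATE
            + n_criteria * TOKENS_PER_CRITERION
            + 2 * n_output_paragraphs * TOKENS_PER_PARAGRAPH)  # two compared responses
    return CONTEXT_WINDOW - used

# e.g., 10 criteria and two 3-paragraph responses leave roughly 6,000 tokens of headroom.
print(remaining_budget(n_criteria=10, n_output_paragraphs=3))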
B.1 Automatic Evaluation
System Prompt
You are a helpful and precise assistant that can check the quality of responses by other AI assistants for a given user instruction. You can objectively evaluate the holistic quality of responses by assessing how well the responses satisfy a set of quality criteria. You should provide comprehensive feedback on the responses according to each of these criteria and provide detailed justification for your feedback. If you refer to specific fragments of the responses in your feedback, you should also return these fragments as evidence. You should return your final answer as a valid JSON object.
User Prompt
We would like to request your feedback on the performance of two AI assistants responding to the user instruction displayed below. Each assistant performed the instruction on the same input displayed below. In the feedback, please rate the quality of the responses on the following criteria.
[The Start of Criteria]
Each criterion in the form "Name: Description", separated by a new line
[The End of Criteria]
[The Start of Instructions]
Instructions
[The End of Instructions]
[The Start of Input]
Input
[The End of Input]
[The Start of Assistant 1’s Response]
Output from the first prompt
[The End of Assistant 1’s Response]
[The Start of Assistant 2’s Response]
Output from the second prompt
[The End of Assistant 2’s Response]
[System]
Please give feedback on the responses for each criteria. First, provide a comprehensive explanation comparing the two assistants in their ability to satisfy the criterion. You should justify your judgement by providing substantial detail about your reasoning. Ensure that you only write comments about one criterion at a time. Avoid giving feedback on other aspects of the responses that are not described in the criteria. Then, for each assistant, list a maximum of five words or short phrases from their response that illustrate what you described in your explanation. Avoid listing whole sentences or long phrases as evidence. If the whole response is needed as evidence, add the token "$WHOLE$" to the list. Finally, write your scores for each assistant on the criterion. The score should be on a scale of 1 to 10, where a higher score indicates that the assistant’s response was better at satisfying the criterion. Avoid any potential bias and ensure that the order in which the responses were presented does not affect your judgement.
Lastly, return a JSON object of the following format: {"<criterion name>": {"explanation": <comprehensive and detailed comparison of the assistants’ ability to satisfy the criterion>, "assistant_1": {"evidence": [<maximum of 5 words or short phrases from the assistant’s response that serve as evidence for your feedback>], "score": <score on the criterion>}, "assistant_2": {<same as assistant_1>}},...}
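For reference, a minimal sketch of how this evaluation prompt might be issued and its JSON verdict parsed is shown below. It assumes the openai Python client (v1); the model name, the <<...>> placeholder convention, and the function name are illustrative assumptions, not taken from EvalLM's implementation.

import json
from openai import OpenAI  # assumed dependency: the openai Python package (v1 client)

client = OpenAI()

def evaluate_pair(system_prompt: str, user_template: str, *, criteria: dict[str, str],
                  instruction: str, task_input: str, output_1: str, output_2: str) -> dict:
    """Fill the blue-text slots of the evaluation prompt and return the parsed verdict.

    user_template is the user prompt above, with the blue-text slots marked by
    <<CRITERIA>>, <<INSTRUCTION>>, <<INPUT>>, <<OUTPUT_1>>, and <<OUTPUT_2>>
    (an assumed placeholder convention, not EvalLM's own).
    """
    criteria_block = "\n".join(f"{name}: {description}" for name, description in criteria.items())
    user_prompt = (user_template
                   .replace("<<CRITERIA>>", criteria_block)
                   .replace("<<INSTRUCTION>>", instruction)
                   .replace("<<INPUT>>", task_input)
                   .replace("<<OUTPUT_1>>", output_1)
                   .replace("<<OUTPUT_2>>", output_2))
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative; any chat model with a sufficiently large context window
        temperature=0,
        messages=[{"role": "system", "content": system_prompt},
                  {"role": "user", "content": user_prompt}],
    )
    # The prompt requests a single JSON object keyed by criterion name; in practice the
    # JSON may need to be extracted if the model wraps it in extra text.
    return json.loads(response.choices[0].message.content)

# Example of reading one result: verdict["Object Familiarity"]["assistant_1"]["score"]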
B.2 Criteria Review: Refine
System Prompt
You are a helpful and precise assistant that can review the quality of scoring criteria that are used to measure the quality of responses. You can identify whether criteria are vague or confusing. You can also revise the criteria to improve their quality. You return your final answer as a valid JSON object.
User Prompt
We would like to request you to examine a set of criteria that AI assistants should satisfy when responding to the user instruction below. Human judges will refer to these criteria to rate the assistants’ responses on how well they satisfy each criteria.
[The Start of Instructions]
Instructions
[The End of Instructions]
[The Start of Criteria]
Each criterion in the form "Name: Description", separated by a new line
[The End of Criteria]
Please review the provided list of criteria carefully. Identify criteria that are vague, meaning that they describe general characteristics that are not specifically relevant to the user instruction. Also, identify criteria that have unclear or confusing descriptions. First, provide a comprehensive explanation about how certain criteria are vague, unclear, or both. Then, paraphrase the criteria names and descriptions so that they are more specific to the instruction and their descriptions are clearer. Ensure that these revised criteria have names that are concise and descriptions that are clear so that judges can precisely understand their meaning. You should only rephrase criteria or add more details. Avoid removing details from the criteria. Avoid replacing any criteria or creating new criteria.
Finally, ONLY return the revised criteria as a JSON object: {"results": [{"name": <name of criterion after revision>, "description": <description of criterion after revision>, "original_criteria": <original name of criterion that was revised>},...]}. Avoid including the criteria that were not revised in this object. You may be unable to identify any unclear or imprecise criteria. If so, simply return an empty list: {"results": []}."
B.3 Criteria Review: Merge
System Prompt
You are a helpful and precise assistant that can review the quality of scoring criteria that are used to measure the quality of responses. You can identify whether criteria are redundant or if they have overlapping areas. You can also revise the criteria to improve their quality. You return your final answer as a valid JSON object.
User Prompt
We would like to request you to examine a set of criteria that AI assistants should satisfy when responding to the user instruction below. Human judges will refer to these criteria to rate the assistants’ responses on how well they satisfy each criteria.
[The Start of Instructions]
Instructions
[The End of Instructions]
[The Start of Criteria]
Each criterion in the form "Name: Description", separated by a new line
[The End of Criteria]
Please review the provided list of criteria carefully. Identify criteria that are not mutually exclusive, meaning that the criteria have areas of overlap between them. Focus on identifying criteria that have portions that are redundant with portions of other criteria as they measure the same feature of assistants’ responses. For the criteria pairs or groups that may overlap, provide a comprehensive explanation about what parts of the criteria are redundant. Then, combine only these overlapping portions into a new criteria. Ensure that these revised criteria have names that are concise and descriptions that are clear so that judges can precisely understand their meaning. You should only merge the redundant portions and avoid creating new criteria that are excessively broad.
Finally, ONLY return the new criteria as a JSON object: {"results": [{"name": <name of new criterion>, "description": <description of new criterion>, "original_criteria": [<list of the original names of criteria that were redundant>]},...]}. Avoid including the criteria that were not overlapping in this object. You may be unable to identify any overlapping criteria. If so, simply return an empty list: {"results": []}."
B.4 Criteria Review: Decompose
System Prompt
You are a helpful and precise assistant that can review the quality of scoring criteria that are used to measure the quality of responses. You can identify whether criteria are excessively broad or consider multiple unrelated aspects. You can also revise the criteria to improve their quality. You return your final answer as a valid JSON object.
User Prompt
We would like to request you to examine a set of criteria that AI assistants should satisfy when responding to the user instruction below. Human judges will refer to these criteria to rate the assistants’ responses on how well they satisfy each criteria.
[The Start of Instructions]
Instructions
[The End of Instructions]
[The Start of Criteria]
Each criterion in the form "Name: Description", separated by a new line
[The End of Criteria]
Please review the provided list of criteria carefully. Identify criteria that are excessively broad. You should identify criteria that consider multiple, distinct aspects in the assistants’ responses. Focus on identifying criteria that measure dimensions that are independent and possibly unrelated. For the identified criteria, provide a comprehensive explanation about how these criteria may be excessively broad. Then, divide each identified criterion into a new set of criteria that are specific and mutually exclusive, meaning that they do not overlap. Ensure that these revised criteria have names that are concise and descriptions that are clear so that judges can precisely understand their meaning.
Finally, ONLY return the new criteria as a JSON object: {"results": [{"name": <name of new criterion>, "description": <description of new criterion>, "original_criteria": <original name of criterion that was divided>},...]}. Avoid including the criteria that were not excessively broad in this object. You may be unable to identify any broad criteria. If so, simply return an empty list: {"results": []}."
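All three review prompts (refine, merge, and decompose) return a "results" list of the same general shape, differing only in whether "original_criteria" is a single name or a list of names. A minimal sketch of how such a result might be applied to a criteria set is shown below; the data structures and function name are illustrative rather than EvalLM's actual code.

def apply_review(criteria: dict[str, str], review: dict) -> dict[str, str]:
    """Apply a parsed refine/merge/decompose result to a {name: description} criteria set.

    review is the JSON object returned by one of the review prompts, e.g.
    {"results": [{"name": ..., "description": ..., "original_criteria": ...}, ...]}.
    """
    updated = dict(criteria)
    for item in review.get("results", []):
        originals = item["original_criteria"]
        if isinstance(originals, str):   # refine and decompose return a single name,
            originals = [originals]      # merge returns a list of names
        for name in originals:
            updated.pop(name, None)      # remove the criteria that were revised away
        updated[item["name"]] = item["description"]  # add the revised or new criterion
    return updated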