Introduction

Large language models (LLMs) can process large volumes of text and produce cogent outputs. This ability extends to highly specialized domains with complicated subject matter and holds particular promise within medicine1,2. For instance, LLMs can perform a number of tasks on clinical text from Electronic Health Records (EHRs), including clinical concept extraction3, identification of social determinants of health4, and classification of breast cancer pathology reports5. They have also been trained on EHR data to serve as all-purpose prediction engines for outcomes such as length of stay, 30-day readmission, comorbidity index, and mortality6. LLMs have also been used to streamline automated machine learning for clinical informatics7 and to predict clinical acuity8 or emergency room admissions9. Researchers have further shown that LLMs encode medical knowledge, as assessed through medical question answering3,10. LLMs can summarize11 complex documents, explain them in simpler terms that patients can understand, such as inpatient discharge summaries12, and can even answer patient messages effectively13.

There is also much potential utility for LLMs beyond interactive, single-turn queries. One particularly exciting avenue is facilitating healthcare operations and clinical workflow14. Beyond individual questions, LLMs can be used to probe hospital populations at scale, such as matching multiple patients to all available clinical trials15. LLMs can aggregate copious amounts of multifaceted health information and provide comprehensive, clear reports. For instance, when there is a shift change for clinical practitioners in busy wards, knowledge of the patient census must be transferred accurately and expediently; patient records could be summarized by LLMs for more efficient handoffs. Hospital administrators could be provided with hourly reports of hospital resource utilization and bed availability. These queries, while complex, can happen asynchronously and do not demand immediate responses, allowing for alternative ways of working with the system.

Considering these compelling examples, the question is no longer whether an LLM can support clinical operations and care delivery but how it can do so within the confines of a health system. Despite these promises, numerous constraints prevent LLMs from being integrated into the clinical environment. The primary concerns revolve around demonstrating their safety and efficacy under appropriate regulatory oversight16. Other important considerations are cost, integration, and model capacity. Many state-of-the-art models, which have been shown to be the best performing, are costly to utilize at scale. As of July 1, 2024, the price for API calls to OpenAI's GPT-4-32k (i.e., a 32,000-token context window) is $60.00 per one million input tokens and $120.00 per one million output tokens. While less complex models are cheaper, recent research has shown that more complex models with longer context lengths are more adept at clinical tasks10. As clinical documentation continues to grow in length over time, possibly a side effect of information duplication17, it is imperative to understand the limits and cost of LLMs in clinical care and operations.

There are several strategies that health systems could utilize to mitigate costs through LLM optimization. Prompting, in the context of LLMs, refers to providing structured input to guide the model's generation of relevant and context-specific outputs. Prompt engineering is the discipline of designing inputs that elicit the most relevant output, which can greatly enhance efficiency. Several prompting techniques and methods have improved LLM performance18,19,20, such as task decomposition21, chain-of-thought22, and question-analysis23. With properly designed prompts, one straightforward option for reducing cost is to group queries: instead of asking one question about one patient or note per query, multiple questions could be asked about one or many patients simultaneously. Concatenating queries in this way would be more cost-effective by circumventing the need to repeatedly input notes for separate questions. This framework, while not optimal for interactive and immediate responses, may suffice for population-scale queries. However, much remains unknown about how this strategy would operate in practice. First, there are technical constraints like context window size, which limits how much text can be loaded at all. Prior research has found that even LLMs with long context windows may not fully process all data in large inputs, focusing on information presented toward either the beginning24 or the end24,25 of the prompt. Other work has revealed further challenges LLMs face with extensive texts, such as the inability to recall facts distributed across documents26. Similarly, researchers have found that the reasoning accuracy of LLMs degrades with increasing input prompt length27. Little is known about how increasing demands on an LLM would impact performance on healthcare-related tasks.

While showing promise, LLMs have been found to propagate racial bias28, struggle with medical coding29, and generate factually inconsistent clinical summaries30. It is unknown how well LLMs withstand stressors (including increased data and tasks) and to what extent concatenating queries impacts their accuracy. What balance of prompt complexity, i.e., task volume and difficulty, maximizes cost efficiency while preserving accuracy?

Our goal was to assess the resilience, scalability, and limits of various LLMs in handling complex medical data under real-world clinical loads, focusing on identifying effective thresholds for a concatenation querying strategy. Using real-world patient data from a large urban health system, we assessed ten LLMs' ability to extract data from EHRs amidst escalating prompt complexities. Specifically, we assessed performance via answers to a series of tailored questions pertaining to clinical notes, using multiple LLMs across a variety of sizes and complexities. These questions ranged in type, including numerical, fact-based, and temporal assessments. We then sequentially increased both the number of notes and the number of questions per note, representing a rising burden on the system. Outcomes included the accuracy of the answers and the ability of the LLM to accurately follow instructions, reflecting functional integrity. We also performed analyses assessing linguistic quality metrics, the additive effects of increasing prompt size under a balanced task load, and the economic implications of different strategies. Finally, we performed a validation experiment using a publicly available medical question-answering dataset.

Methods

Study design and rationale

We sought to examine how different LLMs of varying sizes and complexities perform in extracting and understanding specific data from EHRs, and how they respond to increased burden (i.e., number of notes and tasks). We assessed LLMs' ability to answer three specific types of questions, namely fact-based, temporal, and numerical, about information in a real-world clinical note. The number of questions of each type, as well as the number of notes themselves, varied per experimental scenario. Accordingly, the primary goal of the study was to assess the models' ability to accurately process and extract information as the volume of data and the number of questions increase. Specifically, we conducted stress tests on various LLMs by gradually increasing the number of questions they were asked to answer, testing them with set numbers of clinical notes. The primary outcome of interest was accuracy, overall and by question type, and the secondary outcomes were the ability to properly format and separately number answers according to instructions (Fig. 1).

Fig. 1: Study design and overview.

Figure 1 shows the overall study design. We used real-world notes from the electronic health record. We then used GPT-4-8k to create bespoke question-answer pairs of multiple types, namely fact-based, temporal, and numerical. We then tested 10 other LLMs of various sizes against different burdens of questions and notes and assessed accuracy and proper formatting performance.

Data source and preparation

We utilized clinical notes within EHRs from the Mount Sinai Health System, which comprises seven hospitals within New York City. We selected two hundred random clinical notes from 2023 across note types and providers. The selection criterion was for notes within the range of 400–500 tokens (median of 450 tokens), ensuring a consistent data size for analysis. The study was approved by the Mount Sinai Institutional Review Board and informed consent was waived due to the retrospective nature of the study and minimal risk of harm.

Question–answer database creation and validation

We utilized GPT-4-8k to generate question-answer pairs from each note. Two independent clinicians evaluated the question-answer pairs to ensure quality. The complete set of instructions provided, including the exact prompt structure, is given in the Supplementary Materials. These question-answer pairs served to assess the LLMs' comprehension of clinical scenarios in terms of accuracy in subsequent experiments. A protocol was followed whereby only one pair was generated at a time to prevent data degradation.

To interrogate the LLMs' understanding of the clinical cases, we designed questions of various types reflecting different aspects of information pertinent to clinical presentation. Accordingly, GPT-4-8k was instructed to craft questions falling into three distinct categories (Table 1). This categorization aimed to cover a broad spectrum of typical inquiries found in clinical settings, thereby ensuring the generated question-answer pairs accurately mimic real-world data extraction scenarios. In each prompt, previously generated questions for that note were provided, and GPT-4-8k was instructed to avoid repetition. This resulted in 25 unique question-answer pairs per note and a total of 5000 question-answer pairs across all notes. GPT-4-8k was only utilized for question-answer generation and was not evaluated in the subsequent experiments.
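As a rough illustration of the one-pair-at-a-time protocol (a sketch, not the authors' code: generate_qa_pair is a hypothetical wrapper around the GPT-4-8k call with the prompt given in the Supplementary Materials, and the even cycling across question types is an assumption):

QUESTION_TYPES = ["fact-based", "temporal", "numerical"]
PAIRS_PER_NOTE = 25

def build_qa_database(notes, generate_qa_pair):
    # generate_qa_pair(note, question_type, previous_questions) is assumed to return
    # one new (question, answer) pair, avoiding repetition of previously generated
    # questions for that note.
    qa_database = []
    for note in notes:
        previous_questions = []
        for i in range(PAIRS_PER_NOTE):
            q_type = QUESTION_TYPES[i % len(QUESTION_TYPES)]  # per-type split is an assumption
            question, answer = generate_qa_pair(note, q_type, previous_questions)
            previous_questions.append(question)
            qa_database.append({"note": note, "type": q_type,
                                "question": question, "answer": answer})
    return qa_database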

Table 1 Categories and examples developed by GPT-4–8k for question–answer experimentation

To ensure the quality and clinical relevance of the generated question-answer pairs, a random sample of 250 pairs was selected for evaluation. Two emergency department attending physicians independently reviewed this sample, assessing the factual accuracy and completeness of both the questions and the answers. This step validated the utility of the generated database in simulating realistic data extraction inquiries and responses. The evaluators graded questions as "T" (factually accurate), "TI" (factually accurate yet incomplete), or "F" (false). We retained all question-answer pairs, including those rated as false, as they constituted a minor proportion of the overall dataset and because retaining them ensured an equal number of questions per note.

Experimentation with large language models (LLMs)

For this study, we assessed 10 LLMs of various types and complexities (Table 2). We conducted a series of experiments engaging the LLMs to read a varying number of clinical notes and answer a varying number of associated questions. These experiments were organized into Small, Medium, and Large task sizes reflecting prompt complexity and burden. For instance, models in the Small task were provided two notes (~400-500 tokens each) and asked between 2 and 50 questions. Each experimental run was replicated 50 times to ensure statistical reliability. To enhance reliability and enable fair comparisons across experiments, we used a sampling approach that ensured each note in our dataset was represented across iterations, as sketched below. Specifically, the dataset first underwent random shuffling, then notes were selected sequentially for each iteration without replacement. For larger experiments, once each note had been utilized, we randomly selected with replacement. Due to context window size limitations, smaller models were not assessed at higher loads because they could not process that number of tokens. Table 3 lists the tasks separated by complexity, the models included in each cohort, and the experiments performed.
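A minimal sketch of this sampling scheme as described (illustrative only; the handling of a partially exhausted pool is an assumption not detailed in the paper):

import random

def note_batches(notes, n_notes_per_iteration, seed=0):
    # Yields one batch of notes per iteration: the pool is shuffled once, drawn from
    # sequentially without replacement, and once exhausted, further batches are drawn
    # with replacement.
    rng = random.Random(seed)
    pool = list(notes)
    rng.shuffle(pool)
    while True:
        if len(pool) >= n_notes_per_iteration:
            batch, pool = pool[:n_notes_per_iteration], pool[n_notes_per_iteration:]
        else:
            batch = [rng.choice(notes) for _ in range(n_notes_per_iteration)]
        yield batch

# Example: batches = note_batches(notes, 2); next(batches) gives the 2 notes for one iteration.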

Table 2 Details of Large Language Models utilized in this work
Table 3 Details of experiments conducted in the study according to different task sizes

LLM prompt generation and formatting for answering questions

For the LLM experiments, we concatenated the clinical notes with their respective questions, instructing the models to output answers in a structured JSON format. No pre-processing of the text was conducted beyond the removal of extra spaces. This format included question numbers and their corresponding answers for easy analysis. The general prompt structure was as follows:

{note_1}

{N questions for note_1}

{note_n}

{N questions for note_n}

Instruction to return the answer in JSON format as:

[
  {"question": <question number X>, "answer": <answer to question X>},
  ...
]

The Supplementary Materials section provides the exact prompt utilized for instructions to answer the questions generated.
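As an illustration, the concatenated prompt could be assembled as follows (a minimal sketch following the structure above; the consecutive question numbering and separators are assumptions, and instruction stands in for the answer-format instruction given in the Supplementary Materials):

def build_prompt(notes, questions_per_note, instruction):
    # Concatenate each note with its questions, numbering questions consecutively
    # across notes so the model can reference them in its JSON answers, then append
    # the answer-format instruction.
    parts, q_num = [], 0
    for note, questions in zip(notes, questions_per_note):
        parts.append(note.strip())
        for question in questions:
            q_num += 1
            parts.append(f"Question {q_num}: {question}")
    parts.append(instruction)
    return "\n\n".join(parts)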

Primary outcomes and sub-analyses

The primary outcome pertaining to the question-answer pairs was accuracy, and the secondary outcome was the ability to format answers correctly. This secondary outcome was added not only as a prerequisite for assessing accuracy but also to understand when the LLMs begin to fail at following instructions. To identify whether a model returned its answers in proper JSON format, we applied the json.loads command to the returned string after extracting the JSON array (the bracketed "[…]" segment). Successful JSON loads were recorded and merged with the original question numbers for analysis. Failed JSON loads were excluded from further analysis.
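A minimal sketch of this check, assuming the bracketed segment is located by its first "[" and last "]" (the exact extraction rule is not described in the paper):

import json

def parse_model_output(raw_text):
    # Extract the bracketed JSON array from the raw model output and attempt to load it.
    # Returns (answers, None) on success, or (None, "json_failure") if extraction or
    # loading fails, mirroring how failed loads were recorded and excluded.
    try:
        start, end = raw_text.index("["), raw_text.rindex("]") + 1
        answers = json.loads(raw_text[start:end])
        return answers, None
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None, "json_failure"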

After retrieving the correctly loaded JSON data, we employed three methods to evaluate the accuracy of model responses against the original GPT-4-8k answers. First, a direct comparison assessed correctness by determining whether a generated answer exactly matched the original GPT-4-8k one. Second, for responses with minor textual deviations, fuzzy matching using the FuzzyWuzzy library quantified text similarity. Lastly, a GPT-4-8k-based check compared the clinical semantic similarity of answers, ensuring validity in situations where answers were textually disparate but clinically equivalent. These methods were applied as a triage, and a response was deemed correct if it passed any of them.
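A sketch of this triage, assuming a similarity threshold of 90 and the token_set_ratio scorer (neither is reported in the paper), with gpt4_semantic_check as a hypothetical wrapper around the GPT-4-8k comparison prompt:

from fuzzywuzzy import fuzz

def is_correct(candidate, reference, gpt4_semantic_check, fuzzy_threshold=90):
    # Triage: exact match first, then fuzzy string similarity, and finally a
    # GPT-4-based clinical-semantic comparison; passing any step counts as correct.
    cand, ref = candidate.strip().lower(), reference.strip().lower()
    if cand == ref:
        return True
    if fuzz.token_set_ratio(cand, ref) >= fuzzy_threshold:
        return True
    return gpt4_semantic_check(candidate, reference)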

We also assessed the location of errors in the prompts for the tasks in which all models were evaluated, namely the Small and Medium task sizes. First, we localized each question-answer pair by the quartile in which the question appeared in the prompt. Next, for each LLM and quartile, across all question-answer pairs, we tabulated the percentage with either an Omission failure (an error we define as the model failing to return an answer) or an incorrect response.
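The quartile assignment can be expressed compactly; a minimal sketch of the mapping described here (e.g., question 16 of 20 maps to quartile 4):

import math

def question_quartile(position, total_questions):
    # Map a question's 1-based position within the prompt to its quartile (1-4).
    return min(4, math.ceil(4 * position / total_questions))

# question_quartile(16, 20) -> 4; question_quartile(5, 20) -> 1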

Separately, we also examined the Flesch readability scores of the original notes. The Flesch score measures how easy a piece of text is to read. We utilized this measure to assess how text complexity correlates with model accuracy. The 'Fairly Easy', 'Easy', and 'Very Easy' categories were consolidated under 'Easy' for analysis due to the small number of questions falling into the 'Easy' and 'Very Easy' categories.
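A sketch of the readability grouping, assuming the textstat package (not named in the paper) and the standard Flesch Reading Ease cut-offs:

import textstat

def flesch_category(note_text):
    # Score the note and collapse 'Fairly Easy', 'Easy', and 'Very Easy' into 'Easy',
    # as done in the analysis.
    score = textstat.flesch_reading_ease(note_text)
    if score < 30:
        return "Very Difficult"
    elif score < 50:
        return "Difficult"
    elif score < 60:
        return "Fairly Difficult"
    elif score < 70:
        return "Standard"
    else:
        return "Easy"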

Impact of LLM prompt size on performance

We conducted a final experiment to assess the impact of prompt size on the performance of an LLM. We employed varying configurations, utilizing 2, 5, 10, and 25 clinical notes paired with 25, 10, 5, and 2 questions each, respectively, to maintain a consistent total of fifty questions across each setting (e.g., 2 notes with 25 questions each). A larger number of notes reflects increased prompt size and complexity, with more information needing to be processed. Models with context windows of at least 16k were included, namely GPT-4-turbo-128k, GPT-3.5-turbo-16k, Mixtral-8x22B, and Mixtral-8x7B. To ensure the robustness of the results, each configuration was iteratively tested 50 times, and mean accuracies were subsequently calculated.

External validation experiment

We performed a validation experiment on an open-source, external dataset to assess the robustness of the trends observed in our primary analyses. Specifically, we utilized Medical Multiple-Choice Question Answering (MedMCQA)31, a publicly available, large-scale MCQA dataset covering a variety of healthcare topics across 21 clinical domains. From the database, we retrieved 120,765 single-answer medical MCQs (options A-D) with their correct answers. We tested the same ten LLMs used in the primary, private data evaluations. We prompted each LLM with randomly selected question sets of increasing size (5, 25, 50, and 75 questions) over 50 repeated trials. Model performance was measured by mean accuracy as well as JSON and Omission error rates. A random seed was fixed for all experiments.
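An illustrative sketch of the trial construction, assuming the seed value and the use of Python's random module (neither is specified in the paper):

import random

QUESTION_LOADS = [5, 25, 50, 75]
N_TRIALS = 50
rng = random.Random(0)  # fixed seed; the value 0 is an arbitrary placeholder

def build_trials(n_questions_total):
    # For each load, draw 50 random index sets from the MedMCQA question pool.
    return {load: [rng.sample(range(n_questions_total), load) for _ in range(N_TRIALS)]
            for load in QUESTION_LOADS}

trials = build_trials(120765)  # 120,765 single-answer MCQs retrieved from MedMCQA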

Simulation analysis of the impact of LLM query strategy on cost

We conducted a simulation analysis to demonstrate the economic impact of the LLM querying strategies tested in this study. The default strategy for querying LLMs on EHR notes would be to provide the note itself and a single question about it, along with instructions on how to answer. The concatenation strategy assessed in this study instead asks several questions about one or more notes at once, so each note and the answering instructions are included only once per query rather than repeated for every question. Therefore, in this simulation, we compared the cost of querying GPT-4-turbo-128k and GPT-3.5-turbo-16k for the Small task, i.e., 2 EHR notes of interest with an increasing total question load of 2, 4, 10, 20, 30, 40, and 50. We used 450 tokens to reflect each note's size (the median in this study) and 10 tokens for each question. The cost per OpenAI API call was obtained from the OpenAI website in July 2024: $10.00 and $1.50 per one million input tokens for GPT-4-turbo-128k and GPT-3.5-turbo-16k, respectively. As the cost of output would be identical across both scenarios, it was not incorporated into this experiment. Finally, we added overhead to the concatenation framework to account for the JSON and Omission failures we observed. Specifically, we added cost equal to the observed drop rate (JSON + Omission) at each time point. For instance, GPT-4-turbo-128k had a total failure rate of 1.10% at 10 questions, so an additional 1.10% was added to the cost for that point.
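A worked sketch of the cost comparison under the stated assumptions (450 tokens per note, 10 tokens per question, July 2024 input prices); instruction tokens are omitted here, so the totals fall slightly below the figures reported in the Results:

PRICE_PER_TOKEN = {"gpt-4-turbo-128k": 10.00 / 1e6, "gpt-3.5-turbo-16k": 1.50 / 1e6}
NOTE_TOKENS, QUESTION_TOKENS, N_NOTES = 450, 10, 2

def sequential_cost(n_questions, price):
    # One API call per question: the note is re-sent with every question.
    return n_questions * (NOTE_TOKENS + QUESTION_TOKENS) * price

def concatenated_cost(n_questions, price, failure_rate=0.0):
    # One call containing both notes and all questions, inflated by the observed
    # JSON + Omission drop rate at that question load.
    tokens = N_NOTES * NOTE_TOKENS + n_questions * QUESTION_TOKENS
    return tokens * price * (1 + failure_rate)

price = PRICE_PER_TOKEN["gpt-4-turbo-128k"]
print(sequential_cost(50, price))           # ~$0.23 before instruction overhead
print(concatenated_cost(50, price, 0.011))  # ~$0.014 before instruction overhead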

Statistical analyses

We evaluated continuous values using their means, standard deviations (SD), and 95% confidence intervals (CI). A p-value threshold of 0.05 was set to determine statistical significance. We calculated Spearman's correlation coefficient to assess the relationship between model accuracy and the number of questions for each experiment, across the various prompt complexities imposed. This analysis was intended to identify significant monotonic trends between the increasing number of questions and the LLMs' ability to maintain accuracy.
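For illustration, the correlation can be computed with Pandas; the accuracy values below are placeholders, not study data:

import pandas as pd

results = pd.DataFrame({
    "n_questions": [2, 4, 10, 20, 30, 40, 50],
    "accuracy":    [0.97, 0.96, 0.95, 0.93, 0.90, 0.88, 0.86],  # placeholder values
})
# Spearman's rho between question load and mean accuracy for one model/task
rho = results["n_questions"].corr(results["accuracy"], method="spearman")
print(round(rho, 3))  # -1.0 for this monotonically decreasing placeholder series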

Computational infrastructure

All research was conducted securely within the Mount Sinai Health System's (MSHS) firewalls. GPT experiments were conducted through the Mount Sinai Hospital's Azure tenant API. Experiments using other models were performed on a local cluster equipped with four H100 80 GB GPUs. We did not employ advanced inference techniques such as dynamic batching, paged attention, or quantization in our experiments. All models were executed with their default hyperparameters.

We utilized Python (3.9.18) for all data analyses. We used several Python libraries to facilitate data processing, model interaction, and analysis: NumPy (1.26.4) for numerical computations, Pandas (2.1.4) for data manipulation, Scikit-Learn (1.3.0) for statistical analysis, Hugging Face's Transformers (4.37.2) for accessing pre-trained NLP models, FuzzyWuzzy (0.18.0) for text comparison, Matplotlib (3.7.1) and Seaborn (0.12.2) for data visualization, and the json module (2.0.9) for handling JSON data formats. LLMs used: GPT-4-turbo-128k v2024-04-0932, GPT-3.5-turbo-16k v061333, Llama-3-70B34, Llama-3-8B34, OpenBioLLM-70B35, OpenBioLLM-8B35, Mixtral-8x22B36, Mixtral-8x7B36, BioMistral-7B37, Gemma-7B38, and GPT-4-8k v061332 (for question generation).

Results

The dataset utilized in this study comprised all patient encounters that occurred within the Mount Sinai Health System in 2023. There were 1,942,216 unique patients with at least one encounter in 2023, averaging approximately 21 notes per patient, with an average of 711 tokens per note. We selected 200 random clinical notes of between 400 and 500 tokens from the Mount Sinai Health System EHRs across note types and authors.

Clinical note breakdown and token load

The most represented author types were Physician (28.5%) followed by Registered Nurse/Nurse Practitioner (25.0%) and the most represented note types were Progress Note (57.5%) and Care Note (17.0%) (Supplementary Table 1).

Our analysis of LLMs for data extraction from EHRs under increasing prompt complexities began with the evaluation of three distinct tasks of increasing complexity (Table 4), ranging from Small to Large. Models in each cohort were subjected to the same stress test to assess and compare their performance in accurately interpreting and responding to varied prompt complexity in the form of a number of notes and a number of data extraction questions. Due to context window limitations, only models with a context window of at least 16k were assessed in the Large task. Supplementary Table 2 presents the average prompt length for each experiment, illustrating the scope of information models were expected to process and respond to. For instance, the hardest level assessed in the Small task was 1629.3 ± 79.2 (mean ± SD) tokens, compared to 3532.2 ± 128.4 (mean ± SD) in the Large. This metric is important for putting the results of this work in context and understanding the prompt complexity placed on each model by category.

Table 4 Overall accuracies of different models across different question types, including 50 iterations per experiment

Physician evaluation of question–answer pairs

In assessing the quality of the question-answer pairs generated from EHR notes by GPT-4-8k, two physicians independently reviewed a random sample of 250 pairs. Their evaluation focused on categorizing each pair as "True" (T) for complete and accurate, "True with Incomplete" (TI) for correct but incomplete, or "False" (F) for inaccurate. The first evaluator rated 234 pairs as "T", 8 as "TI", and 8 as "F" (96.8% acceptable). The second evaluator rated 238 pairs as "T", 4 as "TI", and 8 as "F" (96.8% acceptable). The agreement rate between the evaluators was 92.4%.

Model accuracy for question–answers of clinical notes

The primary outcome of this study was the ability of various LLMs to correctly respond to questions related to specific clinical notes separated into Small, Medium, and Large-sized tasks. We performed two sets of analyses relating to question-answer performance, one designating Omissions (i.e., an empty response, described in detail below) as errors and another excluding them. Due to the high JSON failure rates observed for OpenBioLLM-8B and BioMistral-7B, as well as a high Omission failure rate for Gemma-7B, we were unable to assess their question–answering performance.

Answer accuracy assessment for models, including Omission failures

The accuracy results for all models, including Omission failures as errors, are displayed in Fig. 2a, c, e, and the full results are provided in Supplementary Table 3. For the Small task (2 notes; Fig. 2a), Llama-3-70B had the highest accuracy rates, and GPT-4-turbo-128k similarly showed relatively robust, consistently high accuracy across increasing question loads. GPT-3.5-turbo-16k and OpenBioLLM-70B started with reasonable accuracies but were soon overwhelmed. Llama-3-8B and the Mixtral models showed more variability across question loads, with Mixtral-8x22B and Mixtral-8x7B struggling at the highest question loads. For the Medium task (4 notes; Fig. 2c), Llama-3-70B once again performed the best, maintaining robust performance and only gently tapering. The same patterns were apparent for the other models, albeit with generally poorer performance across the board, as expected with increased prompt complexity. OpenBioLLM-70B and Mixtral-8x22B exhibited steep declines. GPT-3.5-turbo-16k and Mixtral-8x7B generally had poor performance across question loads. Llama-3-8B similarly struggled but showed more variability across the experiments. Lastly, for the Large task (10 notes; Fig. 2e), GPT-4-turbo-128k performed the best, beginning with high accuracy but showing a marked decline. Mixtral-8x22B and Mixtral-8x7B had relatively weak performance for all question loads, and GPT-3.5-turbo-16k performed the worst.

Fig. 2: Accuracy of LLMs for question–answer pairs across task sizes with or without Omission Failures.

The overall accuracy of question answers for the clinical notes is presented across question burdens for each task size. Results are presented when including Omission Failures for the Small (a), Medium (c), and Large (e) tasks, and when excluding Omission Failures for the Small (b), Medium (d), and Large (f) tasks. The shaded area reflects 95% Confidence Intervals. Models that were unable to properly format responses were not included.

Answer accuracy assessment for models excluding Omission failures

We also analyzed LLM performance in correctly answering questions as before, except dropping Omission failures rather than counting them as errors. The rationale for this separate comparison is that these errors, like JSON errors, are easily identifiable, can be caught in practice, and arguably should not count against the LLMs' reasoning ability. The accuracy results for all models, excluding Omission failures as errors, are displayed in Fig. 2b, d, f, and the full results are provided in Supplementary Table 4.

There are stark differences in this evaluation set compared to when Omission errors were included. For the Small task (Fig. 2b), performance was much more uniformly high across the models. This pattern of high performance continued for the Medium task (Fig. 2d). The Large task (Fig. 2f) also showed much more consistently high performance when Omission failures were not considered. While GPT-4-turbo-128k had the best performance, the Mixtral models were close behind, and GPT-3.5-turbo-16k performed the worst.

Breakdown of accuracies by model and question type

Table 4 presents the accuracy for each model in each experiment, separated by task size and grouped by question type, namely FB (fact-based), NUM (numerical), and TMP (temporal). Interestingly, there was considerable variability in the best-performing question type across model types and task sizes. Some models handled certain question types better, but these patterns were also affected by task size. For instance, in the Small task, Llama-3-8B had relatively consistent performance across question types but then performed much better on NUM questions (77%) than FB ones (68%) in the Medium task. As seen in the prior section, GPT-4-turbo-128k and Llama-3-70B had the highest overall performance across question types for all task sizes, but the latter stood out more in the Medium task. Supplementary Fig. 1 shows how accuracy by question type changes with question burden for the best-performing models in each task size, namely Llama-3-70B for the Small and Medium tasks (panels A and B) and GPT-4-turbo-128k for the Large task (panel C). For each of these models per task, the best-performing category tends to be maintained across question burdens, indicating a slight but consistent tendency. For instance, in the Medium task, Llama-3-70B almost always performs best on NUM questions, followed by TMP, then FB. However, there are some minor fluctuations, such as at 15 questions in the Medium task and at 2 and 10 questions in the Small task. Overall, it is reassuring that accuracy for one question type does not strongly stand out and that divergence between question types does not occur at larger question burdens.

Assessment of response formatting and output

The secondary outcome of the study was assessing the effect of prompt complexity on LLMs' ability to properly format their responses. Two subtypes of this error were identified. The first, referred to as a JSON error, was when the output was not in proper JSON format. The second, referred to as an Omission failure, was when the model skipped answering a question and/or referenced the question incorrectly, failing to provide any answer (or providing an incorrect question reference) within a series of questions for a given note.

JSON failures

Failures in JSON loading were indicative of a model's limitations in structuring data retrieval, reflecting an overall degradation in processing ability. Figure 3a–c shows the failure rate of this type for all models in the Small (a), Medium (b), and Large (c) tasks, and Supplementary Table 5 presents the results in table format.

Fig. 3: Performance of formatting JSON output responses.

Assessment of LLMs' ability to properly format outputs, aggregated across configurations by task size. Models that had high JSON failure rates were not assessed for Omission failures. The JSON loading failure rates are plotted with 95% confidence intervals across all experiments for the (a) Small, (b) Medium, and (c) Large task sizes. Omission errors are plotted similarly for the (d) Small, (e) Medium, and (f) Large task sizes.

For the Small task, GPT-4-turbo-128k and Llama-3-70B demonstrated the lowest failure rates across the range of question loads. On the other hand, OpenBioLLM-8B and BioMistral-7B had high failure rates, indicating that they may not be suited for this query strategy. GPT-3.5-turbo-16k, Gemma-7B, and Llama-3-8B exhibited notable failure rate spikes, particularly under higher complexity scenarios. Interestingly, Mixtral-8x22B maintained relatively low failure rates but was not as effective as GPT-4-turbo-128k and Llama-3-70B. Results for the Medium task showed similar trends, with GPT-4-turbo-128k and Llama-3-70B continuing to be robust to JSON errors at this higher prompt length and complexity. As expected, OpenBioLLM-8B and BioMistral-7B continued to show near-complete failure across all question loads. While Llama-3-8B and Mixtral-8x7B again started with low failure rates, these quickly increased with added questions. This pattern, albeit with slightly worse performance, was also seen for Gemma-7B and OpenBioLLM-70B. As before, Mixtral-8x22B had consistently low, albeit not the lowest, failure rates across question loads. For the Large task, only four models were assessed based on their context length capacity. GPT-4-turbo-128k and GPT-3.5-turbo-16k started with low failure rates at 5 questions, which then increased. Mixtral-8x22B and Mixtral-8x7B were more consistent across question loads.

Omission failures

In addition to JSON failures, we evaluated the rates at which the LLMs omitted responses. As the JSON failure rates were near 100% for OpenBioLLM-8B and BioMistral-7B, it was not possible to evaluate them for Omission errors. Figure 3d–f shows the failure rate of this type for all models in the Small (d), Medium (e), and Large (f) tasks, and Supplementary Table 6 presents the results in table format.

For the Small task, Llama-3-70B performed the best, but GPT-4-turbo-128k also maintained relatively low Omission rates. Mixtral-8x22B started with low Omission rates at low question burdens, but these rose at 20 questions. GPT-3.5-turbo-16k, Llama-3-8B, Mixtral-8x7B, and OpenBioLLM-70B had relatively high Omission rates across all question loads. Similar trends were seen in the Medium task. Again, Llama-3-70B was the clear top performer, even more pronounced than in the Small task. GPT-4-turbo-128k again started with almost no Omission errors, but these ballooned at 20 questions. While Mixtral-8x22B and OpenBioLLM-70B also started at relatively low Omission rates, they rose sharply at 10 questions. As before, GPT-3.5-turbo-16k, Llama-3-8B, and Mixtral-8x7B had relatively high Omission rates across all question burdens. Gemma-7B had the highest Omission rate for both tasks. Only four LLMs were assessed in the Large task due to their context window capacities. GPT-3.5-turbo-16k, Mixtral-8x22B, and Mixtral-8x7B were unable to handle this increased prompt complexity. While GPT-4-turbo-128k performed the best at this task, an acceptable Omission rate was found only at 5 questions per note (i.e., 50 total questions).

Location of question–answer errors within the prompt structure

We assessed where in the prompt either Omission errors or incorrect responses were given to determine any potential bias within each model, i.e., whether a model preferentially answers questions correctly in a certain region of the prompt. Each question-answer pair was localized by the quartile in which it appeared in the prompt, e.g., for a task with 20 questions, questions 16-20 would be in the fourth quartile. Then, for each quartile and each model, we tabulated how many times errors occurred (Supplementary Table 7). For many models, we identified only slight discrepancies in the location of errors. For instance, Mixtral-8x7B ranged from 22.75% (20.16–25.35%) to 31.52% (28.13–34.92%) in quartiles 1 and 4, respectively. Interestingly, Llama-3-70B was consistently strong across all quartiles, ranging from 5.12% (4.24–5.99%) to 7.61% (6.45–8.78%) in quartiles 2 and 4, respectively. Other models, such as GPT-4-turbo-128k and Mixtral-8x22B, had far fewer errors at the beginning of the prompt. Specifically, GPT-4-turbo-128k had only 3.82% (2.94–4.70%) of failures in quartile 1 but 13.66% (11.28–16.03%) in quartile 4, perhaps indicating a bias towards questions presented earlier in the group. Figure 4 provides a visualization of the location of errors for a specific model and task.

Fig. 4: Graphical representation of output accuracy for a given experiment highlighting different types of errors.

Graphical depiction of accuracy and errors for GPT-4-turbo-128k in the Medium task (4 notes) with a question burden of 15 for each note. Omission and JSON failures are shown in addition to correct and incorrect answers.

Linguistic quality metrics sub-analyses

Sub-analyses were conducted to evaluate LLM performance across different aspects of linguistic quality. Flesch reading ease categories of the original notes were compared across the number of questions asked, by cohort and model (Supplementary Fig. 2). The Flesch score of the original EHR notes, which evaluates their readability, did not show a consistent association with the models' accuracies across cohorts and models (Supplementary Fig. 2a–c).

Comparison of different-sized prompts on LLM performance

In the final experiment, we assessed the effect of prompt size on performance, holding the total number of tasks constant at 50. Different configurations of notes and questions, e.g., 2 notes and 25 questions, were provided to all models that had a large enough context window. All relevant outcomes were assessed and are depicted in Fig. 5, including JSON failure rate (panel A), Omission failure rate (panel B), accuracy including omissions as errors (panel C), and excluding omissions as errors (panel D). The full set of results is also listed in Supplementary Table 8, and the corresponding prompt sizes per configuration can be found in Supplementary Table 1.

Fig. 5: Impact of prompt size on LLM performance for 50 total tasks.

The impact of prompt size when the number of tasks is held constant at 50 for GPT-3.5-turbo-16k, GPT-4-turbo-128k, Mixtral-8x22B, and Mixtral-8x7B, with 50 iterations for each experiment. (a) JSON failure rate; (b) Omission failure rate; (c) overall accuracy including Omission failures as errors; and (d) overall accuracy excluding Omission failures as errors, aggregated across configurations. As the number of notes increased, the number of tokens increased but the number of questions per note decreased (i.e., 2 notes with 25 questions each, 5 notes with 10 questions each, etc.). Shaded areas in plots (c) and (d) reflect 95% Confidence Intervals.

Overall, the outcomes of this task continue to support the suitability of GPT-4-turbo-128k for grouping queries and indicate that 50 tasks are a suitable load for this strategy. JSON and Omission failures remain low for all question and note configurations. Most importantly, accuracy remains consistently high, though it is better when Omission errors are excluded than when all outputs are considered.

The other models were less resilient and robust to prompt size and complexity, and a clear trend emerged: as the prompt size increased, while the total number of tasks remained fixed, the models' performance declined. Across the complete set of models at the smallest prompt level (i.e., 2 notes), accuracies were somewhat similar. GPT-3.5-turbo-16k had the steepest decline in performance and was unable to handle the largest prompt complexity. As with prior tasks, accuracy greatly improved when Omission errors were excluded.

External validation results of the MedMCQA dataset

We performed an external validation of the primary and secondary analyses using a public dataset, MedMCQA, which contains medical multiple-choice questions. Accuracy performance, excluding Omission failures, is presented in Supplementary Table 9 for all ten LLMs. Supplementary Table 10 details JSON failure rates, while Supplementary Table 11 outlines JSON and Omission failures across models. These validation results align with the trends observed in the primary experiments. Higher capacity models, such as GPT-4-turbo-128k and Llama-3-70B, exhibit minimal JSON and Omission errors. For GPT-4-turbo-128k, accuracies remain stable from 5 to 50 questions, with a minimal decrease at 75 questions. For Llama-3-70B, accuracy holds from 5 to 25 questions, then decreases slightly at 50 questions and again at 75 questions.

Economic modeling simulation by LLM query strategy

We evaluated the economic impact of different LLM querying strategies, comparing the standard approach of querying notes and questions individually to our concatenation method. For this simulation experiment, we compared GPT-4-turbo-128k and GPT-3.5-turbo-16k using the framework of the Small task, specifically 2 notes with the number of questions increasing in select intervals from 1 to 25 per note (Supplementary Fig. 3). To account for potential failures, we added overhead cost to the concatenation version equal to the drop rate (JSON + Omission) observed at each time point. For instance, GPT-4-turbo-128k had a total failure rate of 1.10% at 10 questions, so an additional 1.10% was added to the cost for that point.

For this experiment, the cost difference between the two strategies was, as expected, minimal at low question loads, such as $0.02 for the sequential strategy vs. $0.01 for the concatenation strategy at 4 total questions for GPT-4-turbo-128k. The price difference, however, becomes more pronounced at higher question loads: at 50 questions, the sequential strategy would cost $0.25 versus only $0.02 for the concatenation framework. In health system-scale scenarios, where notes can number in the hundreds of millions, the economic impact of query strategy decisions is extremely relevant. As expected, repeating the note for every question accounts for the largest share of the cost difference.

Discussion

Population-scale questions in a hospital system are currently being answered without LLMs. Many institutions have developed health services and computational teams that process and integrate unstructured and structured data streamed from points of care delivery to produce detailed reports. However, the frequency, type, detail, and number of such reports that can be produced are constrained by the required expertise and skills. With their intuitive interfaces, LLMs can empower a broader range of individuals to engage with the data, thereby facilitating an increase in the volume of impactful research conducted.

While LLMs are equipped to process healthcare data, there remains limited experience with, and few guidelines for, integrating them into clinical workflows. Setting regulatory and safety concerns aside, the cost of running state-of-the-art LLM APIs would have economic implications, especially for health systems with millions of patients and hundreds of millions of records. This level of need is realistic given the number of patients seen in large health systems and the volume of data that stems from these cases.

While there is certainly utility in interactive queries to LLMs, such as chatbots, several questions can be asked at the population scale that do not require immediate output. For instance, daily reports can be generated about hospital resource utilization and allocation, or clinical summaries can be aggregated for patients in a ward during shift changes. One could imagine how querying LLMs to generate information on patients at this scale could work in practice. The default process, i.e., individually and sequentially inputting specific notes from cases to ask a specific question, would incur the highest cost. There are alternative strategies, however, that would be more cost-efficient. For instance, multiple notes could be loaded at once, and multiple questions could be asked at once. This "concatenation" strategy is beneficial because the same note does not have to be uploaded multiple times per question, reducing the number of tokens sent in each query. However, while economically efficient, this strategy adds complexity, or stress, to the LLM. It was unknown how increasing levels of burden affect performance, namely accuracy, which is paramount in the healthcare space. The primary goal of this study was to systematically evaluate burden loads for LLMs of multiple types and complexities to emulate how these systems could be most efficiently implemented in practice to minimize cost while maintaining sufficient accuracy. We evaluated the capabilities and limitations of ten different LLMs of various complexities when tasked with answering questions about real-world clinical notes, including fact-based, temporal, and numerical questions. Finally, we assessed the ability of LLMs to follow instructions in their answers, reflecting model integrity.

In addition to incorrect responses, we identified and analyzed two other kinds of errors LLMs could make in answering questions. The first was a JSON failure, in which the models failed to return a properly formatted answer. The other, which we refer to as an Omission failure, occurred when models simply failed to answer a question, often doing so for all questions affiliated with the same note. There were significant differences in the performance of various LLMs across experiments as the prompt complexity increased. Smaller models, especially OpenBioLLM-8B and BioMistral-7B, had trouble correctly loading and formatting responses in JSON, even in the least complex prompt experiment with 2 notes. As expected, the number of JSON and Omission failures increased with task size across all models, but remained at a tolerable level for the more capable models. It is encouraging that LLMs tend to deliver improperly formatted or missing responses rather than inaccurate answers.

In terms of overall accuracy, there was another clear trend in which performance suffered across the board as more notes were added in every task size. This performance drop-off was more pronounced in the tasks with larger prompt complexities, but it also depended on model complexity. While there was performance degradation in models of all types, the higher capacity models, namely GPT-4-turbo-128k and Llama-3-70B, were superior in almost every metric. Based on these findings, GPT-4-turbo-128k and Llama-3-70B are clearly able to handle up to 50 tasks in a series in any configuration. Interestingly, the other high-capacity model, OpenBioLLM-70B, did not operate at the same level of accuracy as the other two. Of the high-performing models, Llama-3-70B was consistent in where its errors were made, whereas GPT-4-turbo-128k had fewer errors for questions located in the first quartile of the prompt. While Omission failures reflect model fidelity, they are addressable in the implementation construct we propose (i.e., these questions can simply be re-queried in another attempt). Therefore, we additionally calculated accuracy without counting Omission failures as inaccuracies. Using this approach, we found that performance was recovered throughout all experiments, with GPT-4-turbo-128k and Llama-3-70B remaining the top performers. At extremely large question burdens, however, this strategy is less desirable in practice due to the extremely high failure rates that occur. The results of the external validation analysis on the publicly available MedMCQA dataset reproduced the observed trends. These results further support the utility of GPT-4-turbo-128k and Llama-3-70B for concatenated queries of up to 50 tasks.

There was also variability in how LLMs handled different types of questions under various prompt complexities. Generally speaking, questions with a numeric answer were the easiest for all models, while temporally related questions were the most challenging. However, the overall differences in performance between categories were slight, and the best-performing categories often switched as the number of questions rose. These trends signal that there was not a profound difference in LLMs' abilities to answer these types of questions. Fortunately, at the 50-task level, GPT-4-turbo-128k had strong performance (>90%) in all question categories. The relationship between prompt token size and accuracy was further revealed when the number of tasks was held constant. Four models with sufficiently large context windows were provided with 50 tasks across various note and question configurations, with more notes reflecting larger prompt complexity. As before, there was a clear decline as the number of notes increased for most models. Encouragingly, however, GPT-4-turbo-128k tended to maintain performance above 90% for most note-question configurations, and above 95% when excluding Omission errors. This finding further supports the use of high-capacity models for grouping up to 50 tasks.

To put these findings in real-world terms, an economic simulation experiment was performed in which total API cost was compared between query strategies. Specifically, we compared the total cost of running up to 50 tasks on 2 notes (simulating the Small experiment) either sequentially or as a group for GPT-4-turbo-128k and GPT-3.5-turbo-16k. After adjusting for possible failures, the concatenation strategy achieved roughly 17-fold cost savings for 50 tasks with GPT-4-turbo-128k. The price difference of $0.24 per 50 tasks may seem limited but would amount to significant savings at the health system scale.

There are several limitations to consider in light of this work. First, the generalizability of the results may be limited by the number and selection of clinical notes and question types, which do not fully represent the variety seen in broader clinical practice; however, we provided a variety of questions relevant to clinical workflows. Second, the study did not account for all possible clinical scenarios and was conducted within one health system. Third, the specific LLMs and types considered do not reflect the comprehensive scope of all those available; however, we included commonly used LLMs across a variety of sizes. Fourth, we did not explore fine-tuning of models or retrieval-augmented generation (RAG) techniques, as we were interested in "out-of-the-box" performance. Fifth, we used GPT-4-0613 8k solely for question generation and GPT-4-turbo-2024-04-09 128k for evaluation; while these are different models, GPT-4-turbo may have had a potential advantage because both belong to the same model family. Finally, only 5% of the question-answer pairs were evaluated by human experts. This review revealed that a small percentage (<5%) of the reviewed pairs were not correct. However, since all LLMs were subjected to the same dataset errors, the comparative analyses remain valid.

In conclusion, our study demonstrates the value of this concatenation strategy for LLMs in real-world clinical settings and provides evidence for effective usage under different model complexities and burdens. Future research should further explore different combinations of stressors and question types. As LLMs are continually refined, expanded, and newly developed, more model types should be included in this assessment. Additionally, alternative strategies should be explored to accomplish cost-efficient, enterprise-level LLM integration, such as pre-processing techniques, attention mechanisms like paged attention, decoding strategies such as speculative decoding, and optimization techniques like quantization. Perhaps most importantly, with proper regulatory oversight, this experiment should be assessed in real time, in real clinical environments, and under real caseload burdens. Operationalizing this framework will also enable the evaluation of its effects on computational time. Understanding the most cost-efficient and robust ways to implement LLMs in healthcare scenarios will make their integration more impactful and effective.