Abstract
Artificial Intelligence (AI) is well-suited to support complex decision-making tasks within clinical medicine, including clinical imaging applications like radiographic differential diagnosis of central nervous system (CNS) tumors. Numerous theoretical AI solutions have been proposed for this space, including large-scale corporate efforts such as IBM’s Watson AI. However, clinical implementation remains limited, largely because of poor alignment between the technology and the clinical setting. User-Centered Design (UCD) is a design philosophy that focuses on developing tailored solutions for specific users or user groups. In this study, we applied UCD to develop an explainable AI tool to support clinicians in our use case. Through four design iterations, starting from basic functionality and visualizations, we progressed to functional prototypes in a realistic testing environment. We discuss our motivation and approach for each iteration, along with key insights gained. This UCD process has advanced our conceptual idea from feasibility testing to interactive functional AI interfaces designed for specific clinical and cognitive tasks. It has also provided us with directions for further developing an AI system for the non-invasive diagnosis of CNS tumors.
1. Introduction
AI has the power to support clinical decision-making, but the lack of transparency of many advanced AI models makes clinical translation difficult [6, 15]. Explainable AI (XAI) methods aim to clarify these models to mitigate the black-box problem and foster user trust [1, 9]. A crucial aspect of an XAI approach is the communication method between the human user and the AI system [1]. Visual analytics approaches can facilitate communication by representing data, models, and outputs in a manner that is readily understandable for human users [2]. These approaches leverage visual representations and interactive interfaces to analyze large amounts of data rapidly, uncover discernible patterns, and derive meaningful insights [2, 11, 22].
Visual analytics approaches tailored for the clinical setting are only just emerging. Ooge et al. reviewed 71 visual analytics platforms for healthcare [22]. They reported that most such systems are built for classical and mainstream statistical methods (e.g., clustering and regression analysis). Few platforms were designed for reporting predictive modeling, i.e., analyses that use data mining, machine learning, and statistical techniques to identify patterns and relationships in data and then use those patterns to forecast future outcomes or events. Similarly, Alicioglu and Sun surveyed 55 visual analytics methods for XAI, noting a lack of visual analytics tools built for predictive modeling [2]. These works demonstrate an opportunity to develop strategies for explaining complex AI models to non-expert users, such as clinicians and healthcare professionals [22]. While many recognize the necessity of incorporating explainability features in AI models, addressing user needs for understanding AI remains an open question. As the requirements of interpretability vary depending on the context, it is clear that XAI must take a user-centered approach [9].
UCD is a valuable methodology for developing customized tools for specific end users, such as clinicians [10]. In clinical decision-making, various visualization options are available to represent data from an electronic health record system, including images, graphs, plots, and text. Implementing reliable AI tools in clinical practice driven by UCD principles is an ongoing challenge [1]. In clinical practice, it is essential to investigate the factors influencing clinicians’ acceptance and adoption of clinical AI systems and how they interact with them [6].
We present our initial work towards designing a realistic user study for assessing the effectiveness of a visual analytics tool for a specific clinical use case: radiographic differential diagnosis of central nervous system (CNS) tumors from preoperative imaging. In this clinical context, when a patient presents with symptoms or signs, radiologists or clinicians typically review the imaging findings and compare them to known patterns associated with different diseases or conditions. Based on the radiographic appearance, location, and other image features, they create a list of potential differential diagnoses, representing the conditions that could account for the observed findings. The radiographic differential diagnosis is crucial as it guides subsequent diagnostic and therapeutic decisions. It provides valuable information for pathologists, surgeons, and other specialists involved in the patient’s care. By considering the potential diagnoses, they can determine appropriate next steps, such as additional testing (e.g., laboratory tests, biopsies), treatment planning (e.g., surgery, medical intervention), or referrals to other specialists [16].
This paper focuses on the first three levels of Munzner’s nested model for visualization design and validation [19]: domain problem characterization, data/operation abstraction design, and encoding/interaction technique design. While we have previously worked on algorithm design [25, 27], our current research goal is to improve clinicians’ decision-making by designing and evaluating interfaces that contextualize and explain AI models. With that in mind, we follow a user-centered design approach, collaborating with clinical experts to co-create an explainable AI tool through iterative designs. We specifically discuss the initial phase, involving domain characterization, task specification definition, and prototype development. For each iteration, we report our rationale, its design, and the insights gained from a qualitative evaluation with clinical experts. In future work, we will utilize our prototype implementation to conduct a formal user study assessing the effectiveness of our explainable AI approach in a clinical setting with a larger pool of clinical experts.
2. Related Work
The Role of UCD for Clinical AI Software Development and Evaluation
UCD is crucial for developing AI systems in clinical settings [8]. A single institution has few clinical specialists in any given subspecialty, and they work demanding schedules with little free time. Therefore, the development of an AI tool to support a clinician’s decision-making is hindered by limited access to the target user group. Fortunately, studies have demonstrated that testing with just five users can identify most interface design problems, improving usability, user satisfaction, and adoption rates [20].
Various examples exist of a UCD approach applied to AI in clinical medicine. Systematic definitions and guidelines for what comprises an AI explanation have been developed in the computer science community [17, 32]. He and colleagues developed the XAI User Needs Library for the medical field, focusing on explanation content rather than technical details, algorithms, or design methods [14]. The library was a joint effort of engineers, AI professionals, and clinical end users, organizing explanation needs by question type: model provenance, performance, comparisons, counterfactual reasoning, ethics, and law. For our work, we utilized the XAI User Needs Library as a starting point for the quantitative interface design.
The DoReMi method combines domain analysis, requirements elicitation, and multi-modal interaction design and evaluation [31]. Following this approach, the authors examined the literature on diagnosing ADHD, identified 20 information elements for XAI, and factored in social contexts in which the XAI would be applied. They created explanation design templates and prototypes for user studies by routinely integrating clinical user feedback. We apply the same principles in our work and emphasize the perspective that different application environments and social contexts will impact how clinical decision support software is used.
Calisto and colleagues developed an AI software system called BreastScreening-AI to diagnose breast cancer using medical imaging [7]. The software uses deep learning to analyze images and improve diagnostic accuracy. Its impact on clinicians’ decision-making was assessed by comparing clinicians’ breast cancer diagnoses with and without the integration of AI technology, examining how clinicians interact with the AI and whether it could help reduce diagnostic errors. The authors found that AI assistants in medical workflows can improve accuracy and reduce errors without affecting clinicians’ performance or productivity. These benefits are accomplished by increasing clinicians’ acceptance of AI systems and giving them control over the diagnostic process, yielding a higher rate of correct diagnoses and a lower incidence of false positives and false negatives.
Calisto and colleagues also explored how AI assistance affected clinicians’ decision-making. In a follow-up study, the BreastScreening-AI researchers compared two methods for conveying information to different user groups, utilizing either a suggestive or assertive tone to communicate their predictive diagnostic explanation [5]. They suggested that the style of communication used when presenting predictive models can affect response. Additionally, varying the assertiveness of text may communicate uncertainty in generative text results. We adopted a similar approach but focused on designing software for diagnosing CNS tumors. Additionally, our research explores using visual aids like charts and graphs instead of traditional text, tables, and images to enhance the presentation of findings, aiming to improve the effectiveness of systems in presenting AI results to medical professionals.
AI for Supporting the Radiographic Differential Diagnosis of Central Nervous System Tumors
Assisting in generating a radiographic differential diagnosis is the most common AI application in neuro-oncology [12, 16]. While not currently available, a potential outcome of AI-assisted imaging diagnosis would be to no longer depend upon invasive tissue biopsy to obtain the biological data needed to guide therapy. Using pediatric Adamantinomatous Craniopharyngioma (ACP) as an example, diagnosis currently includes neurosurgical biopsy more than 91% of the time [28]. However, radiotherapy alone could be an effective treatment for ACP without any neurosurgical intervention [28, 37]. Neuroradiology experts can identify ACP from preoperative MRI with an average accuracy of 86%. Existing deep learning algorithms perform equally well at this task [21, 27]. While AI can assist neuroradiologists in developing an efficient and thorough differential diagnosis, it does not yet provide enough evidence to eliminate the need for neurosurgical biopsy in establishing a diagnosis.
A practical example of XAI in neuro-oncology is AutoRAPNO, a completely automated system for segmenting pediatric medulloblastomas, high-grade gliomas, and tumors that have seeded the leptomeninges, based on the Response Assessment in Pediatric Neuro-Oncology (RAPNO) guidelines [23]. Integrating AI support with guidelines like RAPNO has yielded a more reproducible and standardized assessment of treatment response across clinicians, specifically in patients with low-grade glioma [34]. Towards this overall goal, we aim to use a scientifically reproducible UCD approach to develop an AI system that enhances the clinician’s ability to diagnose CNS tumors non-invasively.
3. Methods
In our journey toward designing an explainable AI system for CNS tumor diagnosis, we began with publicly available tools for explaining AI predictions. We worked with three board-certified clinicians representing neuroradiology, neuro-oncology, and neurosurgery. The following subsections describe the iterative design cycle we performed as well as software implementation details.
Iterative Design Cycle
Working alongside our clinical collaborators, we conducted this study in a design cycle with four iterations. Each iteration includes three steps: (1) curating examples of visualization designs based on task specification – e.g., plots, static renderings, functional implementations, (2) conducting structured interviews with target clinical users regarding examples and tasks, and (3) incorporating feedback to refine task specification and visualization design.
In the first iteration, the visualization examples curated included area-under-the-curve plots reflecting AI model performance and saliency maps generated using SHAP [18] to indicate imaging regions contributing to a diagnosis. In the second iteration, we used an off-the-shelf tool for explainable machine learning (the What-If Tool; WIT [36]) to communicate predictions to clinicians. This iteration differed from the others because we conducted a small user study using survey methods to determine how the WIT impacted diagnostic accuracy as well as clinicians’ decision-making confidence and perceived difficulty [26]. In the third iteration, we refined our task specifications by engaging in a clinical immersion experience at Children’s Hospital Colorado (CHCO). Following this experience, we rendered a pair of static prototypes using Adobe Illustrator. In the fourth iteration, we replicated a radiology reading terminal to the specifications of the neuroradiology department at CHCO and implemented basic functional AI interfaces as web applications.
AI Functionality
The study’s first and second iterations use the AI model and dataset we described previously [25, 27]; the third iteration did not include AI functionality. In the fourth iteration, we generated an AI model to support our target patient comparison functionality. We utilized our ATPC50 dataset comprising 50 unique patients with CT and MRI data totaling 46,879 individual 2-D DICOM instances. Each instance was resized to 224×224×3 pixels using an anti-aliasing algorithm in scikit-image and preprocessed for a ResNet V2 using the built-in TensorFlow function. Next, a ResNet V2 50 model trained on ImageNet extracted vector embeddings of length 2,048 (using average pooling). We then created a small vanilla autoencoder with a latent dimensionality of 256. From this latent space, we connected four classifiers which, for each instance, predicted which patient it belonged to, which modality, which study, and which acquisition protocol. The resulting multi-task loss function was the sum of the classifier losses and the reconstruction loss of the autoencoder. This model was trained for 200 epochs and saved. Because our dataset comprises only ACP patients, we needed to simulate multiple diagnoses within this dataset to achieve our target demo functionality. We clustered the 256-dimensional latent space of the autoencoder using a Gaussian mixture model with five components. We then labeled these components arbitrarily as glioma, craniopharyngioma, meningioma, and germ cell tumor. While these diagnostic labels are not real, the comparison between patient images is functional, and these groups therefore reflect hypothetical subgroups of ACP patients based on imaging phenotypes.
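To make this pipeline concrete, the sketch below outlines one way it could be assembled under stated assumptions: the grayscale-to-RGB channel replication, the study/protocol head sizes, and the optimizer are illustrative choices not specified above, and variables such as `slices`, `embeddings`, and `y_patient` are placeholders rather than our actual training code.

```python
# Minimal sketch of the fourth-iteration embedding model; head sizes other than
# patient (50) and modality (2), the optimizer, and the channel replication are assumptions.
import numpy as np
import tensorflow as tf
from skimage.transform import resize
from sklearn.mixture import GaussianMixture

def preprocess(slice_2d):
    """Resize a 2-D DICOM slice with anti-aliasing and prepare it for ResNet V2."""
    img = resize(slice_2d, (224, 224), anti_aliasing=True)       # scikit-image resize
    img = np.stack([img] * 3, axis=-1) * 255.0                   # replicate to 3 channels (assumption)
    return tf.keras.applications.resnet_v2.preprocess_input(img)

# ImageNet-pretrained ResNet50V2 backbone with average pooling -> 2,048-d embeddings.
backbone = tf.keras.applications.ResNet50V2(include_top=False, weights="imagenet", pooling="avg")

# Small autoencoder with a 256-d latent space and four classifier heads off the latent layer.
emb_in = tf.keras.Input(shape=(2048,))
latent = tf.keras.layers.Dense(256, activation="relu", name="latent")(emb_in)
recon = tf.keras.layers.Dense(2048, name="reconstruction")(latent)
heads = [
    tf.keras.layers.Dense(n, activation="softmax", name=name)(latent)
    for name, n in [("patient", 50), ("modality", 2), ("study", 400), ("protocol", 20)]
]
model = tf.keras.Model(emb_in, [recon] + heads)
model.compile(
    optimizer="adam",
    loss=["mse"] + ["sparse_categorical_crossentropy"] * 4,      # summed multi-task loss
)
# embeddings = backbone.predict(np.stack([preprocess(s) for s in slices]))
# model.fit(embeddings, [embeddings, y_patient, y_modality, y_study, y_protocol], epochs=200)

# Cluster the 256-d latent space with a five-component Gaussian mixture to simulate diagnoses.
encoder = tf.keras.Model(emb_in, latent)
# pseudo_dx = GaussianMixture(n_components=5, random_state=0).fit_predict(encoder.predict(embeddings))
```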
Software Implementation
Aside from renderings made in Adobe Illustrator, all aspects of this project were conducted within the Explainable Analytical Systems Lab (EASL) framework [26], deployed as a web application hosted on our AI server (described below). Our implementation of EASL includes a Python environment for constructing, training, and evaluating AI models, as well as the SQL, Python, PHP, JavaScript, and HTML/CSS stack used to implement prototype interfaces.
Computational Hardware
AI model development and application hosting were performed on a CentOS 7 server comprising a 64-core Intel Gold 6226R CPU, 512 GB of DDR3 RAM, 2 NVIDIA Tesla T4 GPUs, and 4 NVIDIA Tesla K80 GPUs. Our replicated workstation for the study in the fourth design iteration was matched in specification to that in the CHCO radiology reading room, which consists of a Barco Coronis Fusion 6MP DICOM monitor (MDCC-6530) as the primary reading monitor and two Dell 24” P-series monitors stacked vertically to the left. The top-left monitor displays the simulated electronic health record view, and the bottom-left monitor shows AI model information interactively and visually. These monitors are connected to a Dell Precision 5820 machine with an NVIDIA RTX A4000 and an AMD-based MXRT-6700 GPU to drive the display monitors.
4. Results
Domain Characterization and Task Specification
We engaged in an immersive experience alongside the neurosurgery, neuroradiology, and neuro-oncology groups at CHCO. We focused on observing the workflows related to radiographic imaging in these settings, emphasizing how data and technology are currently being utilized. Through this process, we dissected the nuances regarding how each specialist uses the neuroimaging data. The neuroradiologist specializing in brain tumors follows a specific gaze pattern when examining MRI and CT scans. They begin by assessing the overall image quality and anatomy. Then, they systematically evaluate various brain regions, including the midline structures, ventricular system, supratentorial region, posterior fossa, skull base, and paranasal sinuses. Throughout this process, they pay attention to the specific characteristics of brain tumors, such as size, location, enhancement pattern, and any associated findings. Neurosurgeons primarily focus on the surgical treatment of brain tumors and other neurological conditions. Their interpretation of brain imaging is often geared toward surgical planning and intraoperative guidance. Some areas of emphasis for a neurosurgeon may include lesion localization, infiltration and involvement, vascular supply, mass effect, and midline shift. Neuro-oncologists specialize in the medical management of brain tumors, including chemotherapy and radiation therapy. Their interpretation of brain imaging focuses on evaluating the tumor’s characteristics for treatment planning and response assessment. Some areas of emphasis for a neuro-oncologist may include tumor characteristics, peritumoral edema, and treatment response assessment.
We identified the following tasks which are common across these three sub-specialties:
T1 Identify the top predicted diagnosis for a given patient
T2 Identify the most similar patient to the target patient and their diagnosis
T3 Compare and contrast between the target patient and similar patient
First Design Iteration – Visual Representations for AI Explanations
Rationale.
Before this project, we demonstrated the feasibility of an AI model that predicts a CNS tumor diagnosis from preoperative imaging at a level comparable to human experts [27]. We also presented a method for estimating the epistemic uncertainty of these predictions, which increased our AI model performance [25]. Subsequently, we were motivated to address how these methods can be translated into a clinical environment. Numerous methods have been designed to explain AI model functionality. These include post hoc methods, which use surrogate models to explain the relationship between the input and output of a trained model, and ante hoc methods, which explicitly define the model architecture based on prior knowledge, at the cost of model generalizability. Finally, there is a balance to be struck between information that describes model performance and provenance and the predictive information output by the model to support the clinical decision-making task directly.
Design.
We utilized our previously published model and dataset [27]. In a Python environment, we generated visualizations of model performance, including the area under the receiver operating characteristic curve and bar charts comparing our model to XGBoost and Random Forest implementations. In addition, we included saliency maps generated by the SHAP algorithm to visualize how specific regions of a given image contribute to a predicted diagnosis (Figure 2). We then conducted individual structured interviews with our clinical collaborators to assess their perceived utility of each visual.
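As an illustration, the sketch below shows how such visuals can be produced; `model`, `x_test`, `y_test`, and `x_background` stand in for our previously published model and dataset [27], and the choice of SHAP's GradientExplainer is one of several possible explainers rather than our exact configuration.

```python
# Sketch of the first-iteration visuals: an ROC curve and SHAP saliency maps.
# Model/data variables are placeholders; the SHAP explainer choice is an assumption.
import matplotlib.pyplot as plt
import shap
from sklearn.metrics import auc, roc_curve

# ROC curve from the predicted probability of the positive (e.g., ACP) class.
y_score = model.predict(x_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_score)
plt.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.2f}")
plt.plot([0, 1], [0, 1], linestyle="--", color="gray")      # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()

# SHAP saliency maps: attribute the predicted diagnosis to image regions.
explainer = shap.GradientExplainer(model, x_background)     # background image sample
shap_values = explainer.shap_values(x_test[:4])             # explain a few test images
shap.image_plot(shap_values, x_test[:4])
```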
Insights.
Although this design iteration satisfies T1, through our interviews we observed that performance charts and saliency maps are not particularly informative in the context of making clinical decisions. While visualizations of AI model performance are critical for engineers developing and evaluating models, this task does not equate to a clinician evaluating why a predicted diagnosis has been provided. We learned that clinicians want to know that the model is performing at a clinically reasonable level; if so, performance becomes a minor factor in their downstream analysis, and if not, the tool will not be adopted. We also realized that our initial domain characterization and task specifications were insufficient and required revision.
Second Design Iteration – Representativeness and Clinician Confidence
Rationale.
We revisited our foundational concepts based on our feedback from the first iteration. Our initial prototypes reflected typical visuals found in the AI and explainable AI communities. Instead, in this iteration, we focused on clinicians’ decision-making process and how the AI tool might affect that process. It is important to consider clinicians’ typical workflows when making a patient care decision. AI presents a paradigm shift in healthcare by mimicking human mental processes [3]. Interestingly, it is known that clinicians utilize a cognitive heuristic known as representativeness to aid their decision-making when faced with uncertainty [29]. They consider how a new patient might be similar to a patient they have previous experience with (T3). When it comes to radiographic images, AI can be used to emulate this heuristic by applying machine learning algorithms to analyze and interpret the images. AI algorithms can be trained on a large dataset of radiographic images with labeled diagnoses. By learning from these examples, the AI system can develop the ability to recognize patterns and similarities between new radiographic images and the known prototypes or stereotypes. Therefore we adapted our specifications to include T2 and T3.
Design.
In this iteration, we performed a small user study to evaluate how an interactive AI software—Google’s WIT, which facilitates the comparison of data points, in our case patients—impacts a clinician’s ability to diagnose a CNS tumor from preoperative imaging. Described in more detail as a use case for the EASL framework [26], this user study was conducted using three conditions and two clinical subjects. The conditions included a standard imaging interface, the standard interface with static predictive values from the model, and the standard interface with the WIT. The reader is directed to [26] for a view of the WIT in our context.
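As an illustration, a minimal notebook-style sketch of how patient cases might be loaded into the WIT is shown below; the feature encoding, the `cases` list, the `predict_proba` function, and the label vocabulary are assumptions rather than our deployed configuration.

```python
# Hypothetical sketch of configuring the What-If Tool (WIT) for patient comparison.
import tensorflow as tf
from witwidget.notebook.visualization import WitConfigBuilder, WitWidget

def to_example(case):
    # Encode one patient case (embedding vector + diagnosis label) as a tf.train.Example.
    feats = {
        "embedding": tf.train.Feature(float_list=tf.train.FloatList(value=case["embedding"])),
        "diagnosis": tf.train.Feature(bytes_list=tf.train.BytesList(value=[case["diagnosis"].encode()])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feats))

examples = [to_example(c) for c in cases]                 # `cases` is a placeholder list of dicts

def custom_predict(example_batch):
    # Return per-example class probabilities from the trained classifier (placeholder).
    return predict_proba(example_batch)

config = (WitConfigBuilder(examples)
          .set_custom_predict_fn(custom_predict)
          .set_label_vocab(["not ACP", "ACP"]))           # label vocabulary is an assumption
WitWidget(config, height=800)
```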
For each condition, the clinical subject was provided 30 pediatric CNS tumor cases to review and diagnose. In addition, we adapted parts of the ICE-T survey instrument [35] to have study subjects respond on a 5-point Likert scale regarding how difficult it was for them to make the diagnosis and how confident they were in their diagnosis. Finally, we utilized the NASA Task Load Index instrument [13] to assess the usability of the overall interface. The study used interactive PDF documents that hyperlinked to the various interfaces, through which the clinical users performed their tasks.
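For reference, raw (unweighted) NASA-TLX scores are commonly aggregated as the mean of the six subscale ratings; the sketch below shows that standard aggregation with illustrative values, not our recorded data or exact scoring procedure.

```python
# Raw (unweighted) NASA-TLX: the mean of the six subscale ratings (0-100 each).
SUBSCALES = ("mental_demand", "physical_demand", "temporal_demand",
             "performance", "effort", "frustration")

def raw_tlx(ratings: dict) -> float:
    """Return the unweighted NASA-TLX workload score for one response."""
    return sum(ratings[s] for s in SUBSCALES) / len(SUBSCALES)

# Illustrative values only, not data from our study.
example = {"mental_demand": 55, "physical_demand": 10, "temporal_demand": 35,
           "performance": 25, "effort": 45, "frustration": 30}
print(raw_tlx(example))  # -> 33.33...
```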
Insights.
We observed that when clinicians used the WIT, they reported higher confidence and less difficulty relative to the other conditions. No change in accuracy was noted. Notably, and unsurprisingly, users reported that the WIT was challenging to understand at first due to the unfamiliar visual representations and mathematical terminology displayed in the interface. Further discussion revealed that once the WIT functionality was understood, the clinicians perceived representativeness as a meaningful conceptual target for explaining predictions. An additional crucial insight from this study was that the physical testing environment was perceived as highly restrictive for the clinicians to perform their typical workflow and will, therefore, likely not lead to clinically translatable results. Overall, we took away from this iteration that we need to adapt the WIT functionality into a novel interface tailored for clinicians. We also require specifications for a physical testing environment that does not restrict the clinical user’s ability to perform their task normally.
Third Design Iteration – Qualitative and Quantitative Visualizations for Predictions and Patient Comparison
Rationale.
After iteration 2, we were encouraged to target representativeness as a primary AI functionality for explainability. We expanded our interface prototypes to present this functionality and assess which types of visuals clinicians found aesthetically pleasing and perceived to have high utility. Choosing effective visualizations for clinically translatable AI tools is a major challenge; the decision requires consideration of factors such as the type of data being presented, the target user group, and the specific purpose of the visualization in supporting clinical decision-making. For instance, people possess different levels of numeracy, which refers to their ability to comprehend and utilize numerical information [24]. Consequently, when presented with numbers, individuals may feel frustrated or disregard the information due to its complexity. Moreover, lower numeracy skills are linked to an increased likelihood of making mistakes in clinical decision-making [30]. Therefore, using visual representations accessible to clinicians is crucial to effectively communicate information and facilitate informed decision-making in a clinical setting. Failure to do so will result in software that is unusable by the target user group and unable to support clinical decision-making effectively.
Design.
Using Adobe Illustrator, we created static renderings of prototype designs presenting our comparative AI functionality (T2 and T3) in two formats. The first format (Figure 3, top) was designed to be more numerical/quantitative in nature and comprises multiple views, each representing a category in the XAI User Needs Library. The second format (Figure 3, middle) was designed to have a more qualitative theme and splits the views into assenting and dissenting evidence with respect to a predicted diagnosis.
Insights.
Interviews with clinical collaborators regarding these prototypes provided feedback about presenting familiar information to the target user group. In our case, it was suggested that the second format contained more familiar visuals like table and radiographic image views. We incorporated that feedback and created a revised design (Figure 3, bottom), which placed more emphasis on these visuals.
Fourth Design Iteration – Testing Functional Prototypes in a Realistic Environment
Rationale.
Based on feedback from every previous phase, we determined that the physical testing environment for the explainable AI software is critical to deriving clinically usable solutions. We also learned that clinicians felt hindered in providing feedback on systems they could not utilize or interact with, indicating a need for functional prototypes. This observation aligns with previous work: to ensure the validity of AI systems in healthcare settings, a reliable and reproducible testing environment must be established [4]. If clinical AI experiments are conducted in an unrealistic setting, the results may not accurately reflect the performance of existing methods of diagnosis and treatment, which could lead to incorrect conclusions about the efficacy of the AI system [8]. Additionally, it may be difficult to replicate the results in other settings, hindering the development of the AI system.
Design.
In response to our clinical user feedback, we developed a physical testing environment built to the specifications of the regular setup used in the Department of Radiology at CHCO (Figure 4). In addition, we addressed clinician comments regarding interactive prototypes by developing functional implementations of quantitative and qualitative interfaces using EASL (Figure 1). We conducted one-on-one structured interviews with clinical collaborators and solicited feedback about the perceived utility of each interface, contextualized within their existing workflows.
Insights.
Our clinical collaborators have verified that this design iteration is perceived as beneficial in functionality and appears to satisfy T1 and T2. Furthermore, they have attested to the accuracy and adequacy of the testing environment we created, which effectively emulates a realistic workflow. However, we also learned that our task specification required further revision. Specifically, we need to refine T3 to consider how patients should be compared based on the top-down logical workflow clinicians utilize in their daily practice. Two key examples relate to the level of diagnosis provided within the workflow of coarse and nuanced analyses, and to differentiating between coarse and precise display instruments. In addition, we learned about the importance of screen real estate. Finally, we learned about the temporal dependency of what information is available and the importance of clinically indicated variables.
5. Discussion, Challenges, and Limitations
Although our testing environment is aligned with a typical clinical workflow, replicating the hardware setup clinicians typically use also raised new challenges regarding the interface design. These challenges are fundamentally related to the increased screen real estate provided by the large monitors in the reading terminal.
Following our pilot study of the WIT, we created the two static prototype formats described above: a more quantitative format organized around the XAI User Needs Library (Figure 3, top) and a more qualitative format splitting the views into assenting and dissenting evidence (Figure 3, middle). Interviews with clinical collaborators suggested that the second format contained more familiar visuals, such as table and radiographic image views, and we incorporated that feedback into a revised prototype emphasizing those views (Figure 3, bottom).
However, transitioning to the reading terminal caused us to rethink this trajectory towards image- and table-based views. During prototype implementation, we observed that providing radiographic image views on any screen other than the specialized Barco reading monitor seemed contradictory. Yet removing the image views from our prototype design also left a surplus of unused screen real estate. We discussed this concern with our clinical collaborators, who suggested that it is reasonable for a clinician to want a radiographic view on any of the screens. More specifically, we learned additional details about the workflow: it can be separated into coarse and fine analyses. As an example, ependymoma is a CNS tumor that presents proximal to the brain stem but has also been observed to have distant metastases in the white matter of the brain. Clinicians may not immediately notice these distant growths as they may be focused on the central mass at the brain stem. Therefore, it would be helpful to have a system that can automatically flag the presence of metastases, prompting the clinician to review a region in more detail using the precision view of the Barco monitor. This observation also serves as a reminder of the importance of the iterative design process and the co-creation philosophy of UCD. As the prototype designs and implementations become more refined, the clinical experts can provide feedback on aspects not considered in initial design and domain characterization discussions.
Limitations and Challenges.
First, we have no information regarding how the underlying AI performance of this tool would impact the clinicians’ perception. Villain et al. developed a visualization technique highlighting the mean difference between class activation maps [33]. Remarkably, these activation maps can effectively guide a human’s attention to the region of interest within the image, even with a relatively low model accuracy of 65%. These findings suggest that AI does not necessarily have to be highly accurate to be helpful: a weak classifier can be used to direct a person’s attention, which can lead to performance improvement for that person. However, mitigating human bias using AI may be more important than accuracy alone. An increasing number of papers discuss the social contexts in which these human-AI interactions exist and how clinical experts may respond to information that agrees or disagrees with their interpretation. This issue is intriguing and complex and will require a multidisciplinary team of clinicians, computational experts, and human factors researchers to define this interaction formally.
Second, the comparison between interfaces is not a true comparison. Instead, we are using the two interfaces as a sandbox to present a wide variety of possible options with the goal of narrowing down the specific types of representations. Additionally, clinicians commented that to provide additional in-depth feedback on a prototype interface, they would ultimately need time to interact with it and use it to perform real tasks. A complete prototype implementation is a costly proposition, which is why we prototype extensively before implementation. This limitation could confound our future prototypes.
Finally, UCD has advantages and disadvantages in our specific use case. On one hand, UCD is demonstrably helpful for creating generalizable and valuable tools with only a few users providing feedback. Conversely, clinicians have demanding work schedules and limited free time, making access to the target user group a significant challenge. In addition, the number of eligible clinical users at a single institution is limited, which ultimately means recruiting nearly 100% of a rare group of users. To maximize the effectiveness of user feedback and respect clinicians’ time, it is essential to ensure that each meeting is designed to test specific hypotheses regarding design and implementation choices. In practice, this can be quite challenging to execute, particularly in the initial stages of a project. We have found that immersion into the everyday work environment of the clinicians, as well as semi-structured interviews, is invaluable for making design and implementation choices at an early stage.
Future Directions and Conclusion
Our next steps are to continue the design cycle for one more iteration to simplify our explainable AI tool and enhance its capacity to let clinicians quickly discern between supporting/consistent evidence and inconsistent evidence. Following this design iteration and evaluation, we will use our realistic testing environment to conduct a case-based user study to quantitatively assess the effectiveness of our explainable AI system in improving the clinician’s ability to diagnose a brain tumor patient. For this study, we will enroll a larger subject pool, including clinicians from multiple subdisciplines and with varied experience.
Precision medicine for diagnosing CNS tumors will likely include more data from various modalities, specifically a combination of imaging features and circulating biomarkers obtained via blood or cerebrospinal fluid. With the increase in radiogenomics research and the previous discussion point regarding the efficient use of screen real estate, it will be essential to begin considering how radiogenomics data may be presented alongside classical imaging views. UCD is a robust methodology for approaching the clinical translation of AI tools and should be utilized when undertaking this task.
AI methods have become more advanced in recent years and can now be used to improve clinical decision-making tasks. However, relying solely on AI model predictions is unsuitable for high-risk clinical scenarios like diagnosing CNS tumors using radiographic images. Instead, advanced clinical applications require tailored solutions supporting clinicians’ specific tasks, enhancing their abilities. In this study, we applied UCD to develop an explainable AI tool that helps clinicians in our particular use case. We started with basic functionality and visualizations, going through four design iterations to create functional prototypes in a realistic testing environment. With these iterations, UCD has transformed our initial idea of the prediction model’s feasibility into interactive, practical, and explainable AI interfaces designed for specific clinical and cognitive tasks. Additionally, this approach has provided various directions for the future, moving us closer to our long-term goal of developing an AI system for the non-invasive diagnosis of CNS tumors.
Contributor Information
Eric W. Prince, University of Colorado.
Todd C. Hankinson, Children’s Hospital Colorado.
Carsten Görg, Colorado School of Public Health.
REFERENCES
- [1]. Albahri A, Duhaim AM, Fadhel MA, Alnoor A, Baqer NS, Alzubaidi L, Albahri O, Alamoodi A, Bai J, Salhi A, et al. A systematic review of trustworthy and explainable artificial intelligence in healthcare: Assessment of quality, bias risk, and data fusion. Information Fusion, 2023.
- [2]. Alicioglu G and Sun B. A survey of visual analytics for explainable artificial intelligence methods. Computers & Graphics, 102:502–520, 2022.
- [3]. Bharati S, Mondal MRH, and Podder P. A review on explainable artificial intelligence for healthcare: Why, how, and when? IEEE Transactions on Artificial Intelligence, 2023.
- [4]. Bi WL, Hosny A, Schabath MB, Giger ML, Birkbak NJ, Mehrtash A, Allison T, Arnaout O, Abbosh C, Dunn IF, et al. Artificial intelligence in cancer imaging: clinical challenges and applications. CA: A Cancer Journal for Clinicians, 69(2):127–157, 2019.
- [5]. Calisto FM, Fernandes J, Morais M, Santiago C, Abrantes JM, Nunes N, and Nascimento JC. Assertiveness-based agent communication for a personalized medicine on medical imaging diagnosis. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, pp. 1–20, 2023.
- [6]. Calisto FM, Nunes N, and Nascimento JC. Modeling adoption of intelligent agents in medical imaging. International Journal of Human-Computer Studies, 168:102922, 2022.
- [7]. Calisto FM, Santiago C, Nunes N, and Nascimento JC. BreastScreening-AI: Evaluating medical intelligent agents for human-AI interactions. Artificial Intelligence in Medicine, 127:102285, 2022.
- [8]. Chen H, Gomez C, Huang C-M, and Unberath M. Explainable medical imaging AI needs human-centered design: guidelines and evidence from a systematic review. npj Digital Medicine, 5(1):1–15, Oct. 2022. doi: 10.1038/s41746-022-00699-2
- [9]. Combi C, Amico B, Bellazzi R, Holzinger A, Moore JH, Zitnik M, and Holmes JH. A manifesto on explainability for artificial intelligence in medicine. Artificial Intelligence in Medicine, 133:102423, 2022.
- [10]. Dorton SL, Maryeski LR, Costello RP, and Abrecht BR. A case for user-centered design in satellite command and control. Aerospace, 8(10):303, 2021.
- [11]. Faiola A, Srinivas P, and Duke J. Supporting clinical cognition: a human-centered approach to a novel ICU information visualization dashboard. In AMIA Annual Symposium Proceedings, vol. 2015, p. 560. American Medical Informatics Association, 2015.
- [12]. Fischer C, Petriccione M, Donzelli M, and Pottenger E. Improving care in pediatric neuro-oncology patients: an overview of the unique needs of children with brain tumors. Journal of Child Neurology, 31(4):488–505, 2016.
- [13]. Hart SG and Staveland LE. Development of NASA-TLX (Task Load Index): Results of empirical and theoretical research. In Advances in Psychology, vol. 52, pp. 139–183. Elsevier, 1988.
- [14]. He X, Hong Y, Zheng X, and Zhang Y. What are the users’ needs? Design of a user-centered explainable artificial intelligence diagnostic system. International Journal of Human–Computer Interaction, pp. 1–24, 2022.
- [15]. Javed AR, Saadia A, Mughal H, Gadekallu TR, Rizwan M, Maddikunta PKR, Mahmud M, Liyanage M, and Hussain A. Artificial intelligence for cognitive health assessment: State-of-the-art, open challenges and future directions. Cognitive Computation, pp. 1–46, 2023.
- [16]. Kann BH, Hosny A, and Aerts HJ. Artificial intelligence for clinical oncology. Cancer Cell, 39(7):916–927, July 2021. doi: 10.1016/j.ccell.2021.04.002
- [17]. Leslie D. Understanding artificial intelligence ethics and safety. arXiv preprint arXiv:1906.05684, 2019.
- [18]. Lundberg SM and Lee S-I. A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30, 2017.
- [19]. Munzner T. A nested model for visualization design and validation. IEEE Transactions on Visualization and Computer Graphics, 15(6):921–928, 2009.
- [20]. Nielsen J and Landauer TK. A mathematical model of the finding of usability problems. In Proceedings of the INTERACT ’93 and CHI ’93 Conference on Human Factors in Computing Systems, CHI ’93, pp. 206–213. Association for Computing Machinery, New York, NY, USA, 1993. doi: 10.1145/169059.169166
- [21]. Norris GA, Garcia J, Hankinson TC, Handler M, Foreman N, Mirsky D, Stence N, Dorris K, and Green AL. Diagnostic accuracy of neuroimaging in pediatric optic chiasm/sellar/suprasellar tumors. Pediatric Blood & Cancer, 66(6):e27680, 2019.
- [22]. Ooge J, Stiglic G, and Verbert K. Explaining artificial intelligence with visual analytics in healthcare. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 12(1):e1427, 2022.
- [23]. Peng J, Kim DD, Patel JB, Zeng X, Huang J, Chang K, Xun X, Zhang C, Sollee J, Wu J, et al. Deep learning-based automatic tumor burden assessment of pediatric high-grade gliomas, medulloblastomas, and other leptomeningeal seeding tumors. Neuro-Oncology, 24(2):289–299, 2022.
- [24]. Peters E, Västfjäll D, Slovic P, Mertz C, Mazzocco K, and Dickert S. Numeracy and decision making. Psychological Science, 17(5):407–413, 2006.
- [25]. Prince EW, Ghosh D, Görg C, and Hankinson TC. Uncertainty-aware deep learning classification of adamantinomatous craniopharyngioma from preoperative MRI. Diagnostics, 13(6):1132, 2023.
- [26]. Prince EW, Hankinson TC, and Görg C. Explainable Analytical Systems Lab (EASL): A framework for designing, implementing, and evaluating ML solutions in clinical healthcare settings. In Machine Learning for Healthcare Conference, vol. 219, 2023.
- [27]. Prince EW, Whelan R, Mirsky DM, Stence N, Staulcup S, Klimo P, Anderson RC, Niazi TN, Grant G, Souweidane M, et al. Robust deep learning classification of adamantinomatous craniopharyngioma from limited preoperative radiographic images. Scientific Reports, 10(1):16885, 2020.
- [28]. Recinos MA, Momin A, Soni P, Cioffi G, Patil N, Recinos PF, Kruchko C, Barnholtz-Sloan JS, and Kshettry VR. Descriptive epidemiology of craniopharyngiomas in the United States. Journal of Neurological Surgery Part B: Skull Base, 82:S65–S270, 2021.
- [29]. Richie M and Josephson SA. Quantifying heuristic bias: anchoring, availability, and representativeness. Teaching and Learning in Medicine, 30(1):67–75, 2018.
- [30]. Rolison JJ, Morsanyi K, and Peters E. Understanding health risk comprehension: The role of math anxiety, subjective numeracy, and objective numeracy. Medical Decision Making, 40(2):222–234, 2020.
- [31]. Schoonderwoerd TA, Jorritsma W, Neerincx MA, and Van Den Bosch K. Human-centered XAI: Developing design patterns for explanations of clinical decision support systems. International Journal of Human-Computer Studies, 154:102684, 2021.
- [32]. Sokol K and Flach P. Explainability fact sheets: A framework for systematic assessment of explainable approaches. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 56–67, 2020.
- [33]. Villain E, Mattia GM, Nemmi F, Péran P, Franceries X, and Le Lann MV. Visual interpretation of CNN decision-making process using simulated brain MRI. In 2021 IEEE 34th International Symposium on Computer-Based Medical Systems (CBMS), pp. 515–520. IEEE, 2021.
- [34]. Vollmuth P, Foltyn M, Huang RY, Galldiks N, Petersen J, Isensee F, van den Bent MJ, Barkhof F, Park JE, Park YW, et al. AI-based decision support improves reproducibility of tumor response assessment in neuro-oncology: an international multi-reader study. Neuro-Oncology, 2022.
- [35]. Wall E, Agnihotri M, Matzen L, Divis K, Haass M, Endert A, and Stasko J. A heuristic approach to value-driven evaluation of visualizations. IEEE Transactions on Visualization and Computer Graphics, 25(1):491–500, 2018.
- [36]. Wexler J, Pushkarna M, Bolukbasi T, Wattenberg M, Viégas F, and Wilson J. The What-If Tool: Interactive probing of machine learning models. IEEE Transactions on Visualization and Computer Graphics, 26(1):56–65, 2019.
- [37]. Young M, Delaney A, Jurbergs N, Pan H, Wang F, Boop FA, and Merchant TE. Radiotherapy alone for pediatric patients with craniopharyngioma. Journal of Neuro-Oncology, pp. 1–10, 2022.