As a final evaluation of our framework, we conducted an exploratory think-aloud study with 10 professional data scientists tasked with using AIFinnity to choose between two image captioning models. This study aimed at understanding how people use a complete sensemaking system, including how the different stages interact and how data scientists approach the process. We believe that these initial empirical insights can highlight the primary benefits and key features of AI analysis systems grounded in the sensemaking framework.
To recruit participants, we sent an email to 200 data scientists at Microsoft. We continued to invite participants in order of their responses until the qualitative themes in our iterative analysis converged at 10 participants (8 male, 2 female, mean age 32). The participants had an average of 6.8 years of data science experience and worked with various domains and models, including recommendation systems, search, captioning, and cybersecurity. The study lasted between 40 and 60 minutes, for which we compensated the participants with a $25 Amazon gift card.
6.1 Study Procedure and Analysis
We started the study with a few background questions about the data scientist's experience with AI and behavioral analysis. The researcher then spent 10–20 minutes walking participants through AIFinnity, specifically for a task comparing two optical character recognition models used to read street signs, the same as the example in Section 5. The researcher explained the primary features and components of AIFinnity and had the participant create at least one schema and hypothesis. We used a different domain and task for the introduction to not bias the behaviors that the participants looked for in the last part of the study.
In the final and main part of the study, which lasted 30–40 minutes, participants were tasked with using AIFinnity to choose between two image captioning models on a dataset of outdoor activities. This task was motivated by a common use case for image captioning: making photos accessible to people who are blind or visually impaired, for example, on social networks [53]. The task focused on model comparison to give participants a concrete goal, but since comparison requires participants to understand each model's behavior, our findings also cover understanding the behavior of a single model. The first model, model A, was Microsoft's Cognitive Services image captioning system, and the second model, model B, was a pre-trained, off-the-shelf captioning model. Participants analyzed the behavior of the models on the UIUC Sports Event dataset [51], a collection of images from various indoor and outdoor sports. We chose this dataset because it has a wide variety of conditions, scenarios, and actions while being a limited enough domain to explore in 30–40 minutes. To avoid limiting or cherry-picking the types of behaviors participants searched for, we gave them the general task of understanding the two models well enough to describe to a client, with supporting evidence, which model they should use for the given sports dataset.
As we conducted the studies, we transcribed the recordings and performed iterative open coding of the results [70]. We also summarized the participants' schemas and hypotheses as additional data on how they analyzed the two AI systems. With 10 participants, we found that the themes of how data scientists use a complete sensemaking system converged, with substantially overlapping interaction patterns and hypotheses. After completing all the interviews, we conducted selective coding of the transcripts focused on the main themes identified in the open coding. We separate the findings into broader insights that are likely to generalize to other sensemaking systems and findings specific to the AIFinnity system.
6.2 Results
Making Sense of Model Behavior. The challenges and goals that participants described for AI analysis matched those identified in the empirical studies reviewed in Section 4. When describing their AI analysis workflows, all 10 participants talked about taking steps to better understand their AI systems beyond aggregate metrics. One participant (P8), a manager of an AI team, described their primary role as "metric development": conducting behavioral analyses on a deployed AI system and converting those insights into metrics to track and improve the system. Another participant (P5) described behavioral analysis as necessary because metrics like "precision and recall can lie," but found that this deeper analysis is "a very challenging problem."
Many of the strategies that participants use for AI analysis also reflect those described in the sensemaking framework. Five participants use human judges to label or gather instances, while two participants mostly rely on ad hoc spot checking, such as dogfooding, to check whether the AI is behaving as expected. Some data scientists have developed their own systems for unit testing and validating model behaviors, with four participants using a form of "regression sets" that track specific model behaviors, or hypotheses. They use these sets to ensure that updates to their AI do not cause it to regress on important behaviors or subgroups of instances. Even the participants with bespoke tooling found behavioral analysis to be an open challenge; as one participant (P1) stated, "we don't really have a way of checking for patterns to see if a problem is a one-off or something more systematic." Like the data scientists in the empirical studies, our participants tended to perform behavioral analysis in an ad hoc and post hoc manner, reacting to discovered failures.
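As a rough illustration of the "regression set" practice participants described, the sketch below checks a model against per-behavior test suites and flags any behavior that falls below an accuracy threshold; the function names, data format, and threshold are hypothetical and not drawn from any participant's tooling.

```python
# Illustrative sketch only: a minimal "regression set" check of the kind some
# participants described, asserting that an updated model still meets a
# per-behavior accuracy threshold. Names, data format, and threshold are hypothetical.
from typing import Callable, Dict, List, Tuple

# One regression set per behavior: (input, expected output) pairs.
RegressionSet = List[Tuple[str, str]]

def behavior_accuracy(model: Callable[[str], str], cases: RegressionSet) -> float:
    """Fraction of cases where the model's output matches the expected output."""
    return sum(model(x) == y for x, y in cases) / len(cases)

def check_regressions(model: Callable[[str], str],
                      suites: Dict[str, RegressionSet],
                      threshold: float = 0.9) -> Dict[str, bool]:
    """Return, per behavior, whether the model still meets the accuracy threshold."""
    return {name: behavior_accuracy(model, cases) >= threshold
            for name, cases in suites.items()}
```

Run after each model update, such a check turns a previously discovered failure into a standing test rather than a one-off observation.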
Process and Strategy. When the participants used the AIFinnity system, we noticed differences in how they approached the sensemaking process. The first pattern we found was that participants started the AI analysis process from different stages. Since AIFinnity does not provide preexisting hypotheses, most of the participants (8) began their analysis by looking at the initial schema of instances with the largest output differences. The other two participants, who train image models in their work, started the analysis with their own preexisting hypotheses. They created these hypotheses from their experience and knowledge of how image models are most likely to fail. For example, one participant (P2) specifically created hypotheses for "high contrast lighting" and "low light" before looking at any of the instances. Despite starting at different stages, all participants eventually adopted an iterative process, going back to the image explorer to find new instances and using the affinity diagram to create schemas and hypotheses.
Another significant difference in participants' processes was whether they took a breadth-first or depth-first approach. Four of the participants took a breadth-first strategy, exploring multiple instances in the original schema before creating more specific schemas and hypotheses. The other six participants used a depth-first approach, immediately creating schemas and hypotheses for the first interesting instance they found. These different strategies led to a tradeoff between the number of hypotheses and the amount of evidence participants found: participants using the breadth-first strategy tended to find more hypotheses with less supporting evidence, while depth-first participants found fewer hypotheses with more evidence.
Complementary Tools. One of the most salient benefits of having an integrated sensemaking system was the complementarity of tools across stages. As participants progressed through the sensemaking process, they had tools available to help them at each stage. For example, when participants wanted to validate an initial idea of a behavior from a schema, they could create a hypothesis and find evidence using AIFinnity’s similar image search feature. Participants found the progressions between tools and stages to be natural as they created schemas and validated hypotheses.
An unexpected benefit of AIFinnity was the complementarity of the features within each sensemaking stage. This complementarity was most apparent in the schema stage with the similar search and filtering tools. The benefit of having both tools was highlighted by one participant (P9), who, while validating the hypothesis that the models could not describe large groups of people, found that "using the tool together is useful, because otherwise, I was trying to look at [images with] groups of people but [similar image search] didn't give me that, but the object detection model is more specific." Similar search is a less structured but quicker schema tool, while filtering can create more specific and structured schemas. Participants generally started with the similar search tool to get an initial group of instances for a schema but were concerned about missing evidence with the "black box" search, and so moved on to the filtering approach. Having a quick heuristic tool combined with a more deliberate schema method was an essential feature of AIFinnity.
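As a rough illustration of how the two schema tools can complement each other, the sketch below pairs a cosine-similarity search over precomputed image embeddings (the quick, heuristic tool) with an object-count filter (the more structured tool); the embeddings, detector outputs, and thresholds are placeholders, not AIFinnity's implementation.

```python
# Illustrative sketch only: combining a quick "similar image" search with a
# structured object-count filter to build a schema. Embeddings and per-image
# object counts are assumed to be precomputed by any image encoder and object
# detector; this is not AIFinnity's actual implementation.
import numpy as np

def similar_images(embeddings: np.ndarray, query_idx: int, k: int = 10) -> np.ndarray:
    """Return indices of the k images most similar to the query (cosine similarity)."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed[query_idx]
    order = np.argsort(-sims)
    return order[order != query_idx][:k]  # drop the query image itself

def filter_by_person_count(person_counts: np.ndarray, min_people: int = 4) -> np.ndarray:
    """Return indices of images whose detected 'person' count meets a threshold."""
    return np.where(person_counts >= min_people)[0]

# Example: seed a "large groups of people" schema from one interesting image,
# then tighten it with the structured filter.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 512))        # stand-in image embeddings
person_counts = rng.integers(0, 12, size=100)   # stand-in detector output

candidates = similar_images(embeddings, query_idx=3, k=20)        # quick, heuristic
structured = filter_by_person_count(person_counts, min_people=4)  # specific, structured
schema = np.intersect1d(candidates, structured)  # instances supported by both tools
```

The heuristic search surfaces candidates quickly, while the structured filter guards against the "black box" search silently missing relevant evidence.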
Dealing with Confirmation Bias. Confirmation bias is a significant challenge when creating and validating any hypothesis: how does a data scientist know that they have enough diverse instances to support their hypothesis? We found that having a combined sensemaking system helped data scientists combat confirmation bias. This was especially true when participants went from the hypothesis stage back to the schema stage to find more evidence, as they had various techniques at their disposal to discover or create more evidence. Six of the 10 participants found that at least one of their hypotheses did not hold after finding additional evidence. For example, a participant (P8) thought model A typically confused racquets for video game controllers, but quickly disproved their hypothesis by using the similar image search to find more images of people with racquets that were correctly described. Three participants also actively reflected on their potential confirmation bias and took steps to counteract it by proactively looking for disconfirming evidence.
Actionable, Evidenced Hypotheses. Overall, the participants found various hypotheses with significant supporting evidence. Participants created 4.1 hypotheses on average, which ranged from specific failures to high-level patterns. The most specific hypotheses included "model cannot describe images with cliff backgrounds" and "model fails to describe large groups of people on boats." Some of the most general hypotheses included "model doesn't describe the central activity," "the model is often too vague," and "bad lighting leads to inaccurate captions." Despite the wide range of described behaviors, there was significant overlap in the hypotheses and behaviors the participants discovered. Five of the 10 participants found that model B confused climbing images with snow, skiing, or snowboarding. Four participants found that both models described most of the racquet sports as tennis and did not appear to have "badminton" in their vocabulary. Lastly, the most common groupings, created by eight participants, were for a specific activity, for example, climbing, boats, or tennis.
At the end of the study, most participants had developed nuanced conclusions about which model they would choose for a given task. The most common conclusion, which seven of the participants came to, was that model A is more conservative, less detailed, but often correct, while model B provides more detailed captions, but is often wrong. Given these findings, they decided to make different recommendations for which model should be used depending on the risk profile and domain of the client.
Beyond describing the differences between the two models, some participants also asked questions about the underlying model and data and came up with potential fixes for the issues they saw. Three participants attributed the patterns they found to biases in the training data or labels. Two of these participants hypothesized that there might be an "alpine" or "snow" bias in the data, causing model B to describe people climbing as snowboarding or skiing, and they wanted to look at the training data to verify their hypotheses. Two other participants hypothesized that the models themselves might be causing the problem by not having certain words in their vocabulary, specifically "badminton" and "croquet," which the models often described as "tennis" and "baseball."
Using the AIFinnity System. We also found insights specific to the AIFinnity system and the analysis of image and text models. Participants generally found the affinity diagram to be intuitive and usable, with five participants specifically stating that it was their favorite part of the interface and one participant (P3) stating that it "makes total sense, especially for images." One participant (P8) especially liked the split between the top and bottom areas of AIFinnity, seeing them as two different representations of the data, or schemas: "Switching between text and visual representations is very interesting—I can have a hypothesis and go back and forth." Affinity diagramming is a prolific sensemaking tool in other domains [32], which lends further support to taking a sensemaking lens to AI analysis.
A feature that received mixed feedback in AIFinnity was the thumbs up or down quality judgment. Two participants (P1, P9) used it as their primary way of tracking which model was performing better, and a third participant (P2) liked that "having them [images] colored gives you a quantitative feel for how strong your hypothesis is or not." While more than half of the participants (6) liked having a quantitative view of their findings, four participants found the judgment too coarse to be very useful: often both captions were wrong but one was slightly better, or whether a caption was "good" depended on the situation. These participants would have liked richer judgments to capture such nuances, such as a rating scale or free-text descriptions.
Participants thought the counterfactual feature was useful but found that AIFinnity’s implementation of drawing black boxes was too simple. The participants wanted more image manipulation tools, such as adding new objects or changing image properties. Counterfactuals are a powerful tool for generating more evidence, and participants wanted these improved interactions to test more nuanced and complex behaviors.
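As a rough illustration of this style of counterfactual, the sketch below masks an image region with a black box and re-captions the image to see whether the output changes; the captioning function is a placeholder, and the code is not AIFinnity's implementation.

```python
# Illustrative sketch only: the kind of simple "black box" counterfactual
# participants used, implemented with Pillow. The caption_fn argument stands in
# for whichever captioning model is being analyzed; this is not AIFinnity's
# actual implementation.
from typing import Callable, Tuple
from PIL import Image, ImageDraw

def mask_region(image: Image.Image, box: Tuple[int, int, int, int]) -> Image.Image:
    """Return a copy of the image with the (left, top, right, bottom) region blacked out."""
    masked = image.copy()
    ImageDraw.Draw(masked).rectangle(box, fill="black")
    return masked

def counterfactual_captions(image: Image.Image,
                            box: Tuple[int, int, int, int],
                            caption_fn: Callable[[Image.Image], str]) -> Tuple[str, str]:
    """Caption the original and the masked image so an analyst can compare the outputs."""
    return caption_fn(image), caption_fn(mask_region(image, box))

# Hypothetical usage:
# original, masked = counterfactual_captions(Image.open("climbing.jpg"),
#                                             (40, 10, 200, 180), my_caption_model)
# If the caption no longer mentions the climber once that region is masked,
# the analyst has evidence that the model relies on it.
```

The richer manipulations participants asked for, such as inserting objects or changing lighting, would replace the simple black-box mask in `mask_region` with more sophisticated image edits.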