1 Introduction
AI-driven decision aids have been increasingly deployed to support human decision making in many activities, ranging from making investment choices, to detecting harmful online content, to annotating biomedical images. To help people evaluate the trustworthiness of these decision aids and determine the best strategies for relying on their recommendations, it is critical to provide people with some insight into why the AI model underlying the decision aid makes a particular decision recommendation on a decision making task. To this end, many explainable AI (XAI) methods have been designed to explain the reasoning processes underneath black-box algorithmic decisions. For example, post-hoc techniques such as LIME [70] and SHAP [58] have been developed to illustrate the importance of different features to an AI model’s final prediction.
In real life, however, the AI model underlying the decision aid is not always static; it may get updated over time. A model may be updated for different reasons, such as the availability of additional or higher-quality training data, the incorporation of user feedback, the development of more advanced learning algorithms, or the need to ensure fairness in the model.
A growing body of recent research has started to explore how end-users of an AI model perceive and react to the model as it changes over time. For example, it was found that a good first impression of the AI model is crucial for people to develop trust in the model [64, 84], while those with sufficient domain expertise are capable of dynamically adjusting their trust based on their observations of model performance over time [64]. On the other hand, novice users who have limited knowledge about AI or machine learning may expect the AI model to correct its errors and improve on its own, which reflects their misconceptions of AI models [79]. It was also shown that when the updated AI model has an error boundary that is “incompatible” with the old AI model (i.e., the updated model makes mistakes on cases where the old model used to be correct), users who make decisions with the help of this AI model can suffer a significant decrease in decision making performance [6]. Beyond changes in the model’s decision recommendations and performance, updates to the AI model can also result in changes in the model’s explanations for why it makes certain recommendations. For instance, recent studies have reported that when different learning algorithms are used to train a model, the explanations for the model’s predictions can be quite different [49, 51]. This means that after an update, it is possible for the AI model’s explanations to have a very low level of similarity with the explanations that would have been provided by the old model.
While many empirical studies have been carried out to understand the effects of AI explanations on end-users’ interactions with a static AI model in AI-assisted decision making [7, 19, 52, 56, 63, 87, 91, 95], a natural but currently under-explored question to ask is: how will changes in the AI explanations caused by a model update impact end-users’ perceptions and usage of the AI model?
Obtaining a solid understanding of this question can not only advance our empirical knowledge of people’s interactions with an evolving AI model, but also inform the appropriate design of AI explanations during model updates, so as to ensure a smooth transition of people’s mental models of the AI and minimize negative unintended consequences, if any. Therefore, in this paper, we conduct an experimental study to empirically examine how, in AI-assisted decision making, end-users of an AI-driven decision aid react to changes in AI explanations as the AI model gets updated. Specifically, we ask the following research questions:
• RQ1: Can end-users perceive changes in model explanations after a model update?
• RQ2: Will the level of similarity between the updated model’s explanations and the old model’s explanations change end-users’ trust in and satisfaction with the AI model?
• RQ3: What are the potential mechanisms through which changes in model explanations affect end-users’ trust in and satisfaction with an AI model?
Conjecturing the answer to any of these questions turns out to be quite challenging, and an important moderating factor here can be the level of prior knowledge that users have in the decision making domain. For example, one may rightfully conjecture that users will be able to perceive the AI explanation changes if the explanations of the updated model are sufficiently dissimilar from the explanations of the old model. However, when users have limited domain knowledge in the decision making tasks, they may have difficulty making sense of the AI explanations [87] and thus can be less responsive to changes in them.

Even if users successfully detect changes in model explanations, how these changes will impact users’ trust in and satisfaction with the AI model is still unclear: competing hypotheses exist, and many factors may play a role in mediating this impact. One plausible hypothesis is that human users may not desire a new model explanation that is significantly different from the old one, since it implies a substantial violation of their established mental model of the AI, potentially leading to a degree of cognitive dissonance [30]. Following this line of thought, one may expect users to decrease their trust in and satisfaction with the AI model as its updated explanations become more dissimilar from the old ones. In contrast, if users generally expect an AI model to improve its performance after an update [79], it is also possible for them to use the similarity between the AI explanations before and after the update as a heuristic to gauge the magnitude of the improvement. In this case, it is reasonable to hypothesize that users may consider an updated AI model with more dissimilar explanations as a “better” model with more improvement, and therefore perceive it as more trustworthy and satisfactory.

To complicate things further, when users have some prior knowledge in the decision making domain, their perceived differences between the AI explanations before and after the model update may concern not only the similarity between the two explanations (i.e., the “size” of the change), but also whether the updated explanations become more or less consistent with their domain knowledge compared to the old ones (i.e., the “direction” of the change). Previous research has shown that for a static AI model, the more its explanations align with human rationale, the more accurate users perceive the model to be [63]. However, whether similar observations can be made when the AI model gets updated is unknown. For instance, when the AI model’s explanations become less aligned with users’ domain knowledge after the update, users may consider the updated model as less “reasonable” and indeed decrease their trust in and satisfaction with the updated AI model. Yet, users may also justify this misalignment as a sign that, through the update, the already-trustworthy AI model (trustworthy because its explanations largely aligned with human rationale before the update) has further uncovered new hidden patterns in the data that they were not previously aware of [76], which may even lead to an increase in their trust in and satisfaction with the updated model.

To answer these questions, we designed and conducted a set of human-subject experiments in which participants recruited from Amazon Mechanical Turk (MTurk) were asked to complete the same sequence of decision making tasks with the help of an AI model. The tasks were divided into two phases, and the model was updated between the two phases.
All participants used the same AI model and saw the same AI explanations in Phase 1. However, in Phase 2, the AI model was updated in different ways for participants of different treatments, which led to varying levels of similarity between the updated model’s explanations and the old model’s explanations. To isolate the impacts of AI explanation changes before and after the model update on participants’ trust in and satisfaction with the AI model, for participants across all treatments, the decision recommendations they received from the AI model were kept the same for both Phase 1 and Phase 2.
Furthermore, to account for decision making domains in which users may have different levels of prior knowledge, we conducted two experiments in two different decision making contexts. Our Experiment 1 focuses on a decision making context in which laypeople have little domain knowledge: determining whether a mushroom is poisonous. In our Experiment 2, we look into a different decision making context in which laypeople have more domain knowledge: predicting the default risk of loans.
In addition, to cover both the case where the model update results in explanations that are more consistent with users’ prior knowledge in the domain and the case where it results in explanations that are less consistent, we conducted two sub-experiments in Experiment 2. In the first sub-experiment (Experiment 2.1), explanations of the old model presented in Phase 1 were largely inconsistent with users’ prior knowledge; thus, in Phase 2, explanations of updated models with lower similarity to those of the old model were more consistent with users’ prior knowledge. In contrast, the second sub-experiment (Experiment 2.2) was the opposite: explanations of the old model presented in Phase 1 were largely consistent with users’ prior knowledge, while in Phase 2, explanations of updated models with lower similarity to those of the old model were less consistent with users’ prior knowledge.
Our experimental results show that in both experiments, participants can perceive the changes in model explanations after the AI model gets updated. This means that in general, users have some capability to detect explanation changes during the model update regardless of their level of prior knowledge in the decision making domain. In addition, in both experiments, we find no reliable evidence suggesting that the changes in AI explanations during the model update can affect users’ objective trust in the AI model in terms of how frequently users are willing to adopt the AI model’s decision recommendations. However, we find that when users have a degree of prior knowledge in the decision making domain, as the AI model gets updated, their subjective trust in and satisfaction with the AI model will change with the increased or decreased level of consistency between the new AI explanations and their prior knowledge. This highlights the importance of taking the “compatibility” of human rationale and AI explanations into account when updating AI models to make the model update more “understandable” to end-users, or to help them understand why a “counter-intuitive” model update occurs. Finally, through path analyses, we confirm that the impacts of AI explanation changes on users’ trust in and satisfaction with the AI model during the model update are partially mediated by users’ perceived changes in the AI model’s accuracy, and their perceived changes in the consistency between the AI model’s explanations and their domain knowledge.
Taken together, our findings provide important implications for constructing and communicating AI explanations to human users after updating an AI model. Techniques for integrating humans’ domain knowledge into the explanation generation and updating processes, and techniques for supporting people in making sense of the changes in explanations after a model update, are both promising directions to explore. We conclude with a discussion of our study’s implications and limitations (e.g., simplified AI explanations and simplified AI model updates resulting in explicit explanation changes). Despite these limitations, we hope this study can inspire more future work on empirically understanding the impacts of AI explanation updates, and on developing explainable AI methods that better support human-AI joint decision making in a fast-evolving AI development and deployment lifecycle.
3 Experiment 1: Poisonous Mushroom Prediction
The goal of our study is to empirically understand whether, how, and why changes in AI model explanations due to an update affect end-users’ perceptions and usage of the AI model in AI-assisted decision making. We begin our study with a first randomized human-subject experiment on a decision making domain in which people may have limited domain knowledge.
3.1 Experimental Task
In this experiment, we asked participants to complete a sequence of decision making tasks in which they predicted whether a mushroom is poisonous or not, with the help of a decision aid powered by an AI model. Specifically, in each task, participants were asked to review the profile of a mushroom, which consisted of 5 categorical features describing the mushroom’s physical characteristics: the surface texture of the mushroom’s cap, the spacing between the mushroom’s gills, the shape of the mushroom’s stalk, the habitat in which this mushroom species usually grows, and the growth habit of a population of this mushroom species. In addition to the mushroom’s profile, participants were also presented with a binary prediction given by our AI model as to whether the mushroom was predicted to be poisonous, along with the model’s explanations for its prediction (in the form of the top two features in the mushroom’s profile that contribute the most to the AI model’s prediction; see more details in Section 3.2). After reviewing all of this information, participants were asked to decide whether they believed this mushroom was poisonous or not. The mushroom profiles that we presented to participants were selected from the UCI mushroom dataset [28], which includes 8,124 North American mushroom samples described in terms of physical characteristics, with each sample identified as either edible or poisonous. In the original dataset, each mushroom sample is described by 22 categorical features. To simplify the decision making task, we reduced the number of categorical features presented to participants in a profile to five. Figure 1 shows an example of the task interface.
We chose the poisonous mushroom prediction task for our Experiment 1 because we speculated that most participants would not have much domain knowledge for this task. As a result, when the AI model and its explanations get updated, participants may only be able to tell whether the updated model explanations are consistent with the old ones (i.e., how similar the model explanations are before and after the update), without having strong feelings about whether the updated explanations become more or less aligned with their prior knowledge, or making further judgments on whether the explanation updates are sensible. Conducting our experiment on this task thus allows us to isolate the effects of model explanation updates on people in AI-assisted decision making that are caused directly by the similarity levels between the explanations before and after the model update.
3.2 Experimental Design
3.2.1 Overview of Experimental Treatments.
We created three experimental treatments for Experiment 1. Specifically, all participants of Experiment 1 went through a sequence of 30 decision making tasks in the experiment. These 30 tasks were divided into two phases, each containing 15 tasks. In the first 15 tasks (i.e., Phase 1), participants in all treatments saw the same set of 15 mushroom profiles, and they were aided by the same AI model M0. Since all subjects were given the predictions produced by the same model M0 in Phase 1, the model explanations they saw in Phase 1 (i.e., the top two most “important” features for the AI prediction in each task) were also the same. Details on how we developed M0 and its explanations in Phase 1 are described in Section 3.2.2.

After Phase 1, we explicitly told the participants that the AI model was updated. In the next 15 tasks (i.e., Phase 2), participants in all treatments still saw the same set of mushroom profiles, but participants in different treatments used a different version of the updated AI model (i.e., M1, M2, or M3). The tasks in Phase 2 were carefully selected such that the different updated AI models still made the same binary predictions on each task. However, the explanations of the updated models differed on Phase 2 tasks across the three treatments, and they exhibited varying levels of similarity when compared to the model explanations that would have been provided by the AI model before the update (i.e., M0). In particular, we had the following three experimental treatments:
• High similarity (HS): Participants in this treatment received an updated model M1 in Phase 2, whose explanations on Phase 2 tasks had a high similarity with the explanations that would have been provided by M0 (i.e., the AI model before the update).
• Medium similarity (MS): Participants in this treatment received an updated model M2 in Phase 2, whose explanations on Phase 2 tasks had a medium similarity with the explanations that would have been provided by M0 (i.e., the AI model before the update).
• Low similarity (LS): Participants in this treatment received an updated model M3 in Phase 2, whose explanations on Phase 2 tasks had a low similarity with the explanations that would have been provided by M0 (i.e., the AI model before the update).
Details on how we operationalized these three treatments in Phase 2 are described in Section 3.2.3.

3.2.2 Operationalization of Phase 1.
We randomly selected 50% of the data samples in the original UCI mushroom dataset as the held-out test dataset, and used the remaining 50% as the training dataset. Using a random subset of the training dataset, we first trained a logistic regression model, which was used as the AI model M0 in Phase 1. We further adopted the SHAP algorithm [58], a model-agnostic explanation method that can be applied to any supervised learning model, to compute the contribution that each of the five features in the mushroom’s profile made to the AI model’s prediction on each task. We then explained the model’s prediction to participants by highlighting on the mushroom’s profile the feature-value pairs for the top two features with the highest contribution scores in the same direction as the AI model’s prediction (an illustrative code sketch of this procedure appears at the end of this subsection).

Moreover, the goal of Phase 1 was to help participants establish a mental model of how the AI model makes predictions. Since we used the top two most important features identified by the SHAP algorithm as the model’s explanation on each task, it is natural to expect that participants’ mental model of the AI model’s logic takes the form of if-then rules, e.g., “if X1 = a and X2 = b, then the model will predict Y = y.” Thus, the 15 task instances in Phase 1 were selected so that participants repeatedly observed the following three explanation patterns:
• Pattern 1.a: When “cap surface=fibrous” and “gill spacing=crowded”, the AI model M0 predicts “edible.”
• Pattern 1.b: When “cap surface=smooth” and “gill spacing=close”, the AI model M0 predicts “poisonous.”
• Pattern 1.c: When “stalk shape=enlarging” and “gill spacing=close”, the AI model M0 predicts “poisonous.”
In other words, we hoped that after participants completed the 15 tasks in Phase 1, they could form their mental models of the AI model by memorizing these three explanation patterns. We note that a complete description of the AI model M0’s global behavior on all kinds of task instances would require many more explanation patterns. Here, we selected the task instances to restrict participants’ attention to the above three patterns only and to enable them to develop some mental model of the AI model’s local (instead of global) behavior.
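To make this procedure concrete, below is a minimal sketch, not the authors’ exact code, of how a model like M0 and its top-2 feature explanations could be produced. It assumes a one-hot encoded feature table with hypothetical column names and a 0/1 “poisonous” label; rather than calling the shap package, it uses the known closed form for linear models, in which the SHAP value of each one-hot column reduces to its coefficient times the deviation of the feature value from the background mean.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical column names: five categorical features plus a 0/1 "poisonous" label.
FEATURES = ["cap_surface", "gill_spacing", "stalk_shape", "habitat", "population"]

X_train = pd.get_dummies(train_df[FEATURES]).astype(float)   # one column per feature-value pair
y_train = train_df["poisonous"]                               # assumed 0/1 label

m0 = LogisticRegression(max_iter=1000).fit(X_train, y_train)  # the Phase 1 model M0

# For a linear model, the (interventional) SHAP value of feature j on instance x
# is coef_j * (x_j - E[x_j]), so it can be computed directly.
background_mean = X_train.mean().values
coefs = m0.coef_[0]

def top2_explanation(profile_row: pd.Series) -> list:
    """Return the two feature-value pairs contributing most strongly
    in the same direction as M0's prediction on this task instance.
    profile_row: one row of the one-hot encoded profile (same columns as X_train)."""
    phi = coefs * (profile_row.values - background_mean)   # SHAP value per one-hot column
    pred = m0.predict(profile_row.to_frame().T)[0]
    signed = phi if pred == 1 else -phi                    # keep contributions pushing toward the prediction
    top_idx = np.argsort(signed)[::-1][:2]
    return [X_train.columns[i] for i in top_idx]
```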
3.2.3 Operationalization of Phase 2.
The goal of Phase 2 was to have participants in the medium or low similarity treatments realize that their mental models had “broken down.” That is, given a task instance in Phase 2, participants in the medium or low similarity treatments might find that it directly related to their mental model. They would then retrieve an if-then rule from memory and expect the AI model to predict Y = y on this instance because X1 = a and X2 = b, only to find out that while the updated AI model still predicted Y = y, the top two feature-value pairs it highlighted as its explanation (again computed by the SHAP algorithm) had changed.
To obtain the different updated AI models M1, M2, and M3, whose explanations on Phase 2 task instances would show different levels of similarity with those of M0, we re-sampled the training dataset and re-trained the logistic regression model. For instance, to train M2 (whose explanations on Phase 2 tasks have a medium similarity with those of M0), we re-sampled the training dataset mostly within the set of data samples with the feature-value pair “cap surface = smooth” and then re-trained the logistic regression model. By doing so, the updated model M2 would seldom highlight the feature “cap surface” in its explanations (because most data samples in its training dataset had the same value on this feature, making it uninformative for the prediction). Thus, given a task instance for which the old model M0 would use either Pattern 1.a or Pattern 1.b to explain its prediction, the explanation of the updated model M2 would likely differ on at least one highlighted feature-value pair. Note that this kind of model update can be realistic in the real world, as the training dataset may constantly get updated [39], yet the additional training data obtained may be biased (e.g., due to sampling biases).
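A minimal sketch of this kind of biased re-sampling and re-training follows. The column names and the sampling proportions are illustrative assumptions, not the authors’ exact procedure.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

def retrain_on_biased_sample(train_df: pd.DataFrame, frac_smooth: float = 0.9, seed: int = 0):
    """Re-train the logistic regression on a training set drawn mostly from samples with
    'cap surface = smooth', so that this feature becomes nearly constant and thus
    uninformative (and rarely highlighted) in the updated model's explanations."""
    n = len(train_df)
    smooth = train_df[train_df["cap_surface"] == "smooth"]
    other = train_df[train_df["cap_surface"] != "smooth"]
    biased = pd.concat([
        smooth.sample(n=int(frac_smooth * n), replace=True, random_state=seed),
        other.sample(n=n - int(frac_smooth * n), replace=True, random_state=seed),
    ])
    X = pd.get_dummies(biased.drop(columns=["poisonous"])).astype(float)
    y = biased["poisonous"]
    return LogisticRegression(max_iter=1000).fit(X, y)
```

The re-trained model (e.g., M2) would then be explained on Phase 2 instances in the same way as M0.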
With the updated models prepared, we then moved on to select task instances for Phase 2. Given a task instance, we can compute the similarity between two AI models’ explanations on this instance using the feature agreement metric introduced in [49], i.e., the size of the intersection of the two sets of top-k features divided by k (k = 2 in our study).
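A minimal implementation of this metric (the feature names in the usage comment are illustrative):

```python
def feature_agreement(top_features_a, top_features_b, k=2):
    """Fraction of the top-k explanation features shared by two models on the same task instance."""
    return len(set(top_features_a[:k]) & set(top_features_b[:k])) / k

# e.g., if M0 highlights {"cap_surface", "gill_spacing"} and an updated model highlights
# {"gill_spacing", "stalk_shape"}, the feature agreement score is 0.5.
```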
We carefully selected the 15 tasks in Phase 2 such that, on each task:
(1) all four AI models (i.e., the original model M0 and the three updated models M1, M2, M3) made the same binary prediction;
(2) the explanation that would have been provided by M0 was one of the three patterns shown above; and
(3) compared to the two most important feature-value pairs highlighted by M0 as its explanation, the explanation given by M1 in the high similarity treatment was identical (the average feature agreement score between M0’s and M1’s explanations across the 15 tasks in Phase 2 was 1.0), the explanation given by M2 in the medium similarity treatment usually had one feature-value pair in common (average feature agreement score of 0.6), while the explanation given by M3 in the low similarity treatment usually had no feature-value pair in common (average feature agreement score of 0.1).

3.3 Experimental Procedure
We posted our experiment as a human intelligence task (HIT) on Amazon Mechanical Turk (MTurk). Upon arrival, participants were randomly assigned to one of the 3 experimental treatments described in Section 3.2. They first completed a questionnaire on their background, including their demographics, technical literacy, and expertise in AI and machine learning. Then, we presented participants with an interactive tutorial to explain the task to them and walk them through the interface. Since participants might have little prior knowledge on how to determine if a mushroom is poisonous, we added a training component to the tutorial to help participants get familiar with the mushroom prediction task. In particular, we provided participants with a list of assistive information extracted from the UCI mushroom dataset about how values on the five features of a mushroom’s profile may relate to the mushroom’s poisonous status (e.g., “in a large database, 10% of mushrooms whose gill spacing is crowded are poisonous”). This assistive information was also made available to participants during the actual 30 decision making tasks.
Upon completion of the tutorial, participants were asked to answer a few qualification questions to show that they understood all the information presented in the tutorial, and they could not proceed to the next part of the experiment unless they answered all the qualification questions correctly.

After passing the qualification, participants started to work on the same set of mushroom prediction tasks, divided into two phases with 15 tasks each (the order of tasks was randomized within each phase). As discussed earlier, in Phase 1, participants in all three treatments saw exactly the same model prediction and explanations for each task. In contrast, in Phase 2, participants still saw the same model prediction for each task, but the model explanations had, across treatments, different levels of similarity compared with the explanations provided by the old model used in Phase 1. In each task, participants followed a three-step procedure. They were first asked to review the profile of the mushroom and make their own prediction. Then, we presented them with the AI model’s prediction along with its explanations. Lastly, the participants made a final prediction. The AI models made correct predictions on 10 tasks in Phase 1 and on 12 tasks in Phase 2, although participants were not given any accuracy feedback on either their own or the model’s predictions throughout the experiment.
Note that between Phase 1 and Phase 2, we explicitly told participants that the AI model was being updated and asked them to complete a mid-point questionnaire while waiting for the model update to finish. To see whether a participant had successfully formed a mental model of the AI model in Phase 1, we included in the questionnaire three multiple-choice understanding questions, each corresponding to one of the three explanation patterns that appeared in Phase 1 (e.g., “If a mushroom’s cap surface is smooth and its gill spacing is close, what is our machine learning model’s prediction?”). In addition, participants were asked to self-report their subjective trust in and satisfaction with the AI model in Phase 1 on a 7-point Likert scale (1 is the lowest and 7 is the highest), and to indicate their agreement with the following statement from 1 (“strongly disagree”) to 7 (“strongly agree”):
• Perceived explanation consistency with prior knowledge: “The machine learning model’s explanations in Phase 1 agrees with my own knowledge about how to predict poisonous mushroom.”
To make participants feel the update of the AI model was real, after participants completed the mid-point questionnaire, we had them wait for 10 more seconds before telling them that the model update was completed and allowing them to proceed to Phase 2.
Finally, after the participant completed Phase 2, they needed to complete an exit questionnaire to again self-report their subjective trust in and satisfaction with the AI model in Phase 2, as well as their perceived consistency of the AI model’s explanations in Phase 2 with their own prior knowledge on a 7-point Likert scale. They were also asked to express their agreement with two statements regarding their perceived changes of the AI model after the update, using a scale of 1 (“strongly disagree”) to 7 (“strongly agree”):
• Perceived explanation change: “After the model update, the updated model in the last 15 tasks utilizes very different features to make predictions compared to the old model shown in the first 15 tasks.”
• Perceived accuracy change: “The updated machine learning model in the last 15 tasks seems to be more accurate than the old machine learning model in the first 15 tasks.”
We included three attention check questions at different places throughout the HIT (one each among the Phase 2 prediction tasks, in the mid-point questionnaire, and in the exit questionnaire). In these questions, participants were instructed to select a pre-specified option as their prediction in the task or as their response to a 7-point Likert question in the questionnaire. These attention check questions later helped us filter out the data from inattentive participants. Our experiment was open to U.S. workers only, and each worker was allowed to participate only once. The base payment of the experiment was $1.80. To incentivize participants to carefully read the model’s explanation in each task and adjust their trust accordingly, we further provided them with additional performance-contingent bonuses: if the overall accuracy of a participant’s final predictions on the 30 tasks was at least 55%, they earned a bonus of $0.04 for each of their correct final predictions. Thus, the maximum bonus a participant could earn in this experiment was $1.20.
3.4 Analysis Methods
3.4.1 Independent Variables.
The main independent variable we used in our analysis is the experimental treatment that a participant was assigned to, i.e., the level of similarity between the explanations of the updated AI model that the participant received in Phase 2 and the explanations of the AI model M0 used in Phase 1.
3.4.2 Dependent Variables.
To quantify participants’ perceived changes in the model explanations due to the model update, we use their self-reported score on the “perceived explanation change” statement in the exit questionnaire as our dependent variable; the higher the score, the more the participant found the updated model explanations in Phase 2 to differ from what would have been provided by the old model in Phase 1.
Moreover, to measure the changes in participants’ trust in the model due to the model update, we compute their trust gain from Phase 1 to Phase 2, for both objective trust and subjective trust. Participants’ objective trust in the model in a phase is computed as the fraction of tasks of that phase in which the participant’s final prediction was the same as the model’s prediction. Meanwhile, participants’ subjective trust in the model in a phase is obtained from their self-reports at the end of that phase. Given a participant’s objective trust or subjective trust scores in both phases, their trust gain is then computed as the Phase 2 trust score minus the Phase 1 trust score; the larger the difference, the more the participant increased their trust in the model after the model update.
Finally, to measure the changes in participants’ satisfaction with the model due to the model update, we compute their satisfaction gain from Phase 1 to Phase 2 as their self-reported satisfaction with the model in Phase 2 in the exit questionnaire minus that reported for Phase 1 in the mid-point questionnaire. Again, the higher the value, the more the participant increased their satisfaction with the model after the model update.
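A minimal sketch of how these per-participant dependent variables can be computed, assuming hypothetical column names for a participant’s trial-level log and questionnaire responses:

```python
import pandas as pd

def objective_trust(trials: pd.DataFrame, phase: int) -> float:
    """Agreement fraction: share of a phase's tasks where the final prediction matches the AI's."""
    p = trials[trials["phase"] == phase]
    return (p["final_prediction"] == p["ai_prediction"]).mean()

def gains(trials: pd.DataFrame, survey: dict) -> dict:
    """Trust and satisfaction gains from Phase 1 to Phase 2 for one participant."""
    return {
        "objective_trust_gain": objective_trust(trials, 2) - objective_trust(trials, 1),
        # Subjective ratings come from the mid-point (Phase 1) and exit (Phase 2) questionnaires.
        "subjective_trust_gain": survey["trust_p2"] - survey["trust_p1"],
        "satisfaction_gain": survey["satisfaction_p2"] - survey["satisfaction_p1"],
    }
```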
3.4.3 Statistical Methods.
We start by examining whether, after a model update, participants can perceive the changes in model explanations (RQ1) and whether the level of similarity between the model explanations before and after the update changes participants’ trust in and satisfaction with the AI model (RQ2). To avoid multiple comparison problems and to control false discovery, we conduct our analyses using the interval estimate method [26]. That is, we first visualize our data by plotting the mean values of the dependent variables of interest for each treatment along with their 95% bootstrap confidence intervals (R = 5000). Then, we construct OLS regression models to predict the dependent variables’ values while controlling for covariates (e.g., participants’ demographics), both for the entire set of participants and for subsets of participants with different levels of understanding of how the AI model worked after Phase 1 (e.g., the subsets of participants who answered different numbers of understanding questions correctly in the mid-point questionnaire). Results of these models are interpreted via the estimated coefficient values for the independent variables as well as their 95% bootstrap confidence intervals.
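As an illustration, here is a minimal sketch of fitting such an OLS model and bootstrapping 95% confidence intervals for its coefficients. The formula and column names are hypothetical placeholders for the dependent variable, treatment indicator, and covariates used in our analyses.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

FORMULA = "perceived_expl_change ~ C(treatment) + age + C(gender) + C(education)"

def ols_with_bootstrap_ci(df: pd.DataFrame, formula: str = FORMULA, R: int = 5000, seed: int = 0):
    """Fit an OLS model and return, for each coefficient, the point estimate
    together with a 95% percentile bootstrap confidence interval."""
    rng = np.random.default_rng(seed)
    fitted = smf.ols(formula, data=df).fit()
    names = fitted.params.index
    draws = np.empty((R, len(names)))
    for r in range(R):
        resample = df.sample(n=len(df), replace=True, random_state=int(rng.integers(2**31 - 1)))
        draws[r] = smf.ols(formula, data=resample).fit().params.reindex(names).values
    lo, hi = np.percentile(draws, [2.5, 97.5], axis=0)
    return pd.DataFrame({"coef": fitted.params.values, "ci_low": lo, "ci_high": hi}, index=names)
```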
Next, to explore RQ3 (i.e., the mechanisms underlying the effects of model explanation updates on end-users’ trust in and satisfaction with an AI model), we posit three hypotheses and illustrate our hypothesized model in Figure 2:
• [H1.1] The similarity level of model explanations before and after the model update (i.e., between Phase 1 and Phase 2) has a direct effect on participants’ perceived change in the model explanations.
• [H1.2] Participants’ perceived change in the model explanations has a direct effect on their perceived change in the AI model’s accuracy after the model update.
• [H1.3] After the model update, participants’ perceived change in the AI model’s accuracy directly affects their objective and subjective trust in the AI model, and their satisfaction with the AI model.
In other words, we hypothesize that the effects of model explanation updates on end-users’ trust in and satisfaction with an AI model are mediated by their perceived change in the model explanations (i.e., the perceived similarity between the explanations of the updated model and the old model) and their perceived change in the model’s accuracy. Since participants are not likely to have much domain knowledge in the mushroom prediction task, in this experiment we do not expect the model explanation updates to affect participants’ trust in and satisfaction with the AI model through influencing their perceived change in the consistency between the model explanations and their domain knowledge.
We perform path analysis [22], a type of structural equation modeling (SEM) [41, 60, 72] without latent variables, to test these hypotheses and explore the potential causal mechanisms underlying the effects of model explanation updates. We use five indicators to evaluate the goodness of fit of the model: (1) the χ2 test, indicating absolute/predictive fit; (2) the Comparative Fit Index (CFI) and (3) the Tucker–Lewis Index (TLI), indicating comparative fit; (4) the Root Mean Square Error of Approximation (RMSEA); and (5) the Standardized Root Mean Square Residual (SRMR). A model fits the data well when the p-value associated with the χ2 test is non-significant, the CFI and TLI values are over 0.90, and the RMSEA and SRMR values are below 0.08 [9, 11, 21].
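A minimal sketch of the hypothesized path model, estimated as a set of OLS regressions on standardized variables, is shown below. For a recursive path model without latent variables, each structural equation can be estimated separately; the column names are hypothetical, and the global fit indices above would come from dedicated SEM software rather than from these individual regressions.

```python
import statsmodels.formula.api as smf

def fit_hypothesized_paths(df):
    """Estimate the H1.1 -> H1.2 -> H1.3 chain with one OLS regression per structural equation."""
    num = df.select_dtypes("number")
    z = (num - num.mean()) / num.std()          # standardize to obtain standardized path coefficients
    z["treatment"] = df["treatment"]            # keep the categorical treatment indicator
    paths = {
        # H1.1: explanation similarity treatment -> perceived explanation change
        "H1.1": smf.ols("perceived_expl_change ~ C(treatment)", data=z).fit(),
        # H1.2: perceived explanation change -> perceived accuracy change
        "H1.2": smf.ols("perceived_acc_change ~ perceived_expl_change", data=z).fit(),
        # H1.3: perceived accuracy change -> trust and satisfaction changes
        "H1.3_objective_trust": smf.ols("objective_trust_gain ~ perceived_acc_change", data=z).fit(),
        "H1.3_subjective_trust": smf.ols("subjective_trust_gain ~ perceived_acc_change", data=z).fit(),
        "H1.3_satisfaction": smf.ols("satisfaction_gain ~ perceived_acc_change", data=z).fit(),
    }
    return {name: (model.params, model.pvalues) for name, model in paths.items()}
```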
Since this path analysis is most meaningful for people who had actually formed an accurate mental model of how the AI model worked, for RQ3 we restrict our analysis to the data obtained from those participants who correctly answered all three understanding questions in the mid-point questionnaire.
3.5 Experimental Results
In total, 475 participants completed our experiment HIT. The median time participants spent on the experiment was 12.5 minutes, leading to a median hourly wage of $11.00. After filtering out the data from participants who did not pass the attention checks, we were left with valid data from 361 participants for Experiment 1 (49.9% male; average age 38). We analyze these valid data to answer our research questions. As a sanity check, we first construct an OLS regression model to examine whether there are any differences across the three treatments in participants’ perceived changes in how consistent the model explanations are with their own prior knowledge, using their self-reports at the end of Phase 1 and Phase 2. We do not find any reliable differences, which is consistent with our expectation.
3.5.1 RQ1: Effects on perceived explanation change.
We start by examining participants’ perceived change in the model explanations between Phase 1 and Phase 2. Figure 3(a) compares participants’ perceptions of the change in the model explanations across the three treatments (see the “overall” group). To explore whether participants’ understanding of the AI model (or their capability to form an accurate mental model of the AI) has any moderating effect, we also present the same comparison separately for participants with relatively low levels of understanding (i.e., those who answered no more than 2 understanding questions correctly in the mid-point questionnaire; the “understanding ≤ 2” group) and those with high levels of understanding (i.e., those who answered all 3 understanding questions correctly; the “understanding = 3” group). We find that participants’ perceived change in explanations increased as the explanations of the updated model in Phase 2 became more dissimilar from those of the old model used in Phase 1. In other words, participants in our experiment could perceive the change in model explanations brought about by a model update. Moreover, it appears that the better participants could form a mental model of the AI model, the more they could perceive the change in model explanations.
We then construct OLS regression models to predict a participant’s perceived change in the model explanations between the two phases while controlling for the participant’s demographic background (e.g., age, gender, education) as covariates. Our regression results are consistent with what we observed in Figure 3(a). In particular, participants in both the medium and low similarity treatments reported higher levels of change in the model explanations due to the model update (MS: estimated coefficient β = 0.232, 95% CI = [0.017, 0.461]; LS: β = 0.202, 95% CI = [-0.025, 0.432]). We further construct two separate regression models for participants who answered no more than 2 understanding questions correctly in the mid-point questionnaire and those who answered all 3 understanding questions correctly, respectively. For the former group (the “understanding ≤ 2” group), we do not obtain coefficients that are reliably different from zero, while for the latter group (the “understanding = 3” group), we find that they reported a slightly higher level of perceived model explanation change if they were in the low similarity treatment (β = 0.429, 95% CI = [-0.022, 0.869]).
3.5.2 RQ2: Effects on trust and satisfaction change.
We next analyze our data to examine whether people’s trust in and satisfaction with the AI model are influenced by the model explanation updates. Figures 3(b) and 3(c) show participants’ objective and subjective trust gains in the AI model from Phase 1 to Phase 2, both across all participants and within subgroups of participants with different levels of understanding of the AI model. We find that neither participants’ objective trust nor their subjective trust appears to be affected by the similarity level of model explanations between Phase 1 and Phase 2. Figure 3(d) further shows participants’ subjective satisfaction gain from Phase 1 to Phase 2, conditioned on their understanding score. Again, participants did not seem to significantly change their satisfaction with the AI model as the similarity of model explanations before and after the update varied. Our regression models do not show any reliable treatment effects either, whether for all participants or for any subset of participants.
3.5.3 RQ3: Mechanisms underlying the effects of model explanation updates.
As discussed earlier, we restrict our attention to the 98 participants who correctly answered all three understanding questions in the mid-point questionnaire, and we test the hypothesized path model on the data obtained from them. We start by adding all covariates (i.e., the participant’s age, gender, education, task familiarity, technical literacy, and expertise in AI and machine learning) to the regression models for all paths, and we then refine the regression models by pruning covariates with insignificant contributions to achieve a better model fit. The fit statistics for the final model are p(χ2) = 0.240, CFI = 0.971, TLI = 0.932, RMSEA = 0.047, and SRMR = 0.051, which indicate a good fit. Estimates of the path coefficients and the results of significance testing of the path model are presented in Figure 4.
Our path analysis results validate all of our hypotheses H1.1–H1.3. The results show that the first mediation step of the treatment effects is whether people can perceive the change in model explanations after a model update; in Experiment 1, we detect that participants for whom the updated model explanations in Phase 2 had a low similarity with those of the old model in Phase 1 perceived a significantly larger change in the model explanations. Interestingly, the more people perceive the model explanations to have changed, the more likely they are to feel that the updated model’s accuracy has increased. Finally, the changes in people’s trust in and satisfaction with the AI model after the update are all positively affected by their perceived increase in the updated AI model’s accuracy. Notably, while all mediation paths in our path analysis are significant, we do not observe a total effect of the treatment on participants’ trust gain or satisfaction gain in Section 3.5.2, which may seem contradictory. We conjecture that there may exist other competing effects that suppress the path we have tested in our path analysis, such that multiple direct and indirect effects of opposing directions result in a near-zero total effect [3, 37, 75]. Identifying the additional mediation paths for the effects of model explanation updates on changes in people’s trust in and satisfaction with the AI model is an interesting direction for future work.
5 Discussion
In this section, we provide further discussion of our results and their implications, and we discuss limitations and future work.
5.1 The role of domain knowledge
Comparing the results obtained from the two experiments, we indeed find that the effects of model explanation updates on end-users’ trust in and satisfaction with the AI model are moderated by the level of domain knowledge people have in the decision making domain. While how much users are willing to accept the AI model’s recommendations (i.e., people’s objective trust in the AI model) is not significantly affected by the AI explanation updates regardless of their prior knowledge level, their subjective feelings about the AI model (e.g., subjective trust and satisfaction) are affected by the AI explanation updates when they have some prior knowledge in the task domain. In fact, as shown in Figure 9, people’s perceived change in the explanations’ consistency with their domain knowledge largely dominates their perceived change in the model accuracy in influencing their trust in and satisfaction with the AI model after the update. Similarly, if we compare the standardized path coefficients estimated for the effects of people’s perceived change in the model accuracy on the changes in their trust and satisfaction between Figure 4 and Figure 9, we can see that those in Figure 9 are consistently smaller, indicating that people’s perceived change in the model accuracy has a smaller impact when they have domain knowledge in the tasks. All of this highlights the key role that users’ prior knowledge in a domain plays when they observe explanation updates in an AI model.
One possible explanation for the different results in the two experiments is that, without additional information, people may only be able to make sense of feature contribution explanations if they have some domain knowledge about the task. For example, participants working on the poisonous mushroom prediction task might notice the change in model explanations after a model update, but they might not be able to judge whether the new patterns utilized by the model were more or less meaningful; so, they simply reacted to different AI explanations similarly. On the contrary, participants performing the loan default prediction task might find it rather straightforward to apply their prior knowledge (i.e., a proxy/heuristic of what is meaningful) and focus more on analyzing how consistent the AI explanation was with their knowledge when evaluating the quality of the updated explanations [63]. This highlights the importance of helping end-users make sense of explanations when they have limited prior knowledge in the task domain. To this end, one promising direction is to supplement explanations of AI models with explanations of the underlying data [4], which in effect may help people establish some “knowledge” or data-driven insights about the domain. Moreover, our findings on how users adjust their subjective trust in and satisfaction with the updated AI model when they have some knowledge in the task domain are largely consistent with what we would expect from users’ reactions to explanations of a static AI model. This implies that, without additional information, users are unlikely to interpret “human-meaningless” explanations as revealing novel insights, even in the context of AI models getting updated.
5.2 On people’s perceived change in model accuracy after a model update
An interesting finding that we consistently see in both experiments is that the dissimilarity between the model explanations before and after the update positively affects people’s perceived accuracy increase of the model. As discussed earlier, we conjecture that this may result from a combination of two factors. First, people may use the similarity between the model explanations as a heuristic to gauge how different the two models’ accuracy is, and they associate less similar model explanations with larger differences in model accuracy. Second, people may hold a biased belief or misconception that a model update will always result in a “better” model, due to their day-to-day experience (e.g., the newer generation of a product is always advertised as having improved performance). Thus, people may consider updated models with less similar explanations as having a larger accuracy improvement.
Another interesting observation is that in the cases where people have some domain knowledge (i.e., Experiment 2), while we hypothesized that the similarity between the model explanations before and after the update would indirectly affect people’s perceptions of the model accuracy through their perceptions of the explanations’ consistency with their prior knowledge, our results show that this is not always the case: we only observe this indirect effect in Experiment 2.2, where the update results in a decrease of consistency between the model’s explanations and people’s prior knowledge. In fact, in Experiment 2.1, the correlation between people’s perceived change in model accuracy and their perceived change in the explanations’ consistency with their domain knowledge is quite weak (Pearson’s r = 0.135). We speculate that this asymmetric effect arises because the model explanation update in Experiment 2.1 naturally aligned with people’s expectations, while the explanation update in Experiment 2.2 did not. In other words, most people might believe that the updated model should utilize more of the information that they (i.e., humans) consider “predictive” to make decisions. Therefore, participants might be “shocked” by the seemingly unreasonable updates they saw in Experiment 2.2, such that this violation of expectations became a key driver of the decrease in their perceived model accuracy. On the other hand, participants in Experiment 2.1 might perceive the updated model explanations as simply meeting their expectations, without giving the updated model extra credit for its performance.
5.3 Implications for designing AI explanations during updates
Our findings carry a few important implications for designing effective AI explanations during model updates. First, as we find that people’s subjective trust in and satisfaction with the AI model during the model update can largely be influenced by the consistency of the AI explanations with their domain knowledge, novel methods should be developed for incorporating human expertise into the model development/updating process or the explanation generation process. This is closely connected to the line of research on human-in-the-loop machine learning [29], in which feedback is solicited from humans to improve and update the AI model. Indeed, as shown by many previous studies [18, 32, 33, 66], integrating expert knowledge into AI models may not only enhance the robustness and trustworthiness of the models, but also satisfy users’ expectations for expert-informed and user-centric explanations.
However, it is also possible that people may inappropriately decrease their trust in and satisfaction with an AI model because the updated AI explanations contain novel and truly meaningful patterns of which people themselves are not aware. Indeed, one of the greatest promises of AI technologies is their strong capability to process huge amounts of data to automatically identify hidden patterns and generate data-driven insights. To avoid these undesirable scenarios, after a model update, instead of simply presenting the updated model explanations, it may be helpful to put more emphasis on the components of the explanations that have changed, and to provide more insights into why these changes occur. Compared to plainly explaining the updated model’s prediction, highlighting the changes in the explanation may attract users’ attention to the updated part of the explanation. Additional information on why explanation changes occur may enable people to go beyond their potentially limited domain knowledge in evaluating the “utility” of the changes, supporting them in better calibrating their perceptions of the updated model’s trustworthiness.
5.4 Limitations and future work
Our study has a few limitations. First, we adopted a relatively simplified setting in our experiment to study how changes in AI explanations during a model update affect users’ perceptions and usage of the AI model: the explanation used is simple (i.e., the top-2 important features), the task instances are selected so that participants repeatedly observe the AI model’s behavior in the same local area, and the experimental treatments are designed with rather salient changes in model explanations after the AI model gets updated. We acknowledge that in the real world, the explanations of an AI model can be much more complex, especially when trying to explain an AI model’s global behavior, and model updates may have a low chance of resulting in fundamentally different explanation patterns. However, we believe the study we conducted in this simplified setting has two important advantages and provides a starting point for more future research along this line. First, by using simple explanations and restricting participants’ attention to the AI model’s local behavior, we maximized the possibility for participants to successfully form a mental model of the AI model before the update. This is critical because it allowed us to rule out the possibility that any null result of our study is simply caused by participants’ inability to understand how the AI model works before the update. Second, by having participants in some treatments (e.g., the low similarity treatment) observe very distinct explanations after the model update, we pushed our experimental manipulations to the extreme to maximize their possible effects, if any. In this sense, one can argue that the empirical effects of model explanation changes that we found in this study are likely upper-bound estimates. These upper-bound estimates can still be quite informative. For example, even in our setting where the changes in model explanations are very salient, we did not find that users’ objective trust in the AI model was reliably affected by explanation changes during the model update. This may imply that in a more practical setting where the explanation changes during the model update are much more subtle, users’ objective trust in the AI model is also unlikely to be influenced.
Another limitation of our study is the choice of some measurements. For example, we used “agreement fraction” (i.e., the chance that a participant’s final prediction in a task agrees with the AI) to quantify participants’ objective trust in the AI model. Although widely used in the literature [7, 15, 23, 52, 55, 57, 68, 95], we acknowledge that this metric may, to some extent, reflect the natural agreement between people’s independent decisions and the AI recommendations. In practice, agreement fraction is often the only metric that can be adopted to objectively quantify people’s trusting behavior when no information about people’s own independent decisions is available. In our study, however, we collected participants’ initial prediction in each task, which allowed us to quantify participants’ objective trust in the AI model using “switch fraction” (i.e., the fraction of tasks for which the participant’s final prediction agreed with the model’s prediction, among all tasks where the participant’s initial prediction disagreed with the model’s prediction), another metric commonly used in previous studies [36, 93, 95]. We found that when using agreement fraction or switch fraction as the objective trust metric, the corresponding values for participants’ objective trust gain are highly correlated (e.g., Pearson correlations are 0.69, 0.63, and 0.66 for Experiments 1, 2.1, and 2.2, respectively), suggesting that agreement fraction still largely reflects participants’ true willingness to adopt the AI recommendation. As another limitation, the dependent variables we measured in this study are not comprehensive. Future studies should be carried out to better understand how changes in AI explanations during the model update may affect other aspects of user experience and performance (e.g., users’ trust calibration and understanding). In general, we caution readers not to over-generalize our results to other settings.
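For reference, a minimal sketch (with hypothetical prediction arrays) of the two objective-trust metrics discussed above:

```python
import numpy as np

def agreement_fraction(final_pred, ai_pred) -> float:
    """Fraction of tasks where the participant's final prediction matches the AI's prediction."""
    final_pred, ai_pred = np.asarray(final_pred), np.asarray(ai_pred)
    return float((final_pred == ai_pred).mean())

def switch_fraction(initial_pred, final_pred, ai_pred) -> float:
    """Among tasks where the participant's initial prediction disagreed with the AI,
    the fraction where their final prediction agrees with the AI."""
    initial_pred, final_pred, ai_pred = map(np.asarray, (initial_pred, final_pred, ai_pred))
    disagreed = initial_pred != ai_pred
    if not disagreed.any():
        return float("nan")   # undefined if the participant never initially disagreed with the AI
    return float((final_pred[disagreed] == ai_pred[disagreed]).mean())
```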
Our study was conducted in two selected decision making domains, and how model explanation updates affect people’s perceptions and usage of the AI model in other domains may be impacted by nuances of those domains. For example, explanation formats in domains like image classification [2] and text classification [54] can be much more complicated; task domains such as autonomous driving can be highly situation-dependent with respect to the need for explanations [89]; and it is hard to even provide scalable explanations for the unsupervised learning models [61] used in human-AI co-writing, chatbots, or AI art generators. To simplify the experimental design, we also only investigated the effects of model explanation updates when the AI model’s prediction does not change, while in reality changes in AI predictions and explanations often go hand in hand. Our study results may not hold in settings where decision makers have significant domain expertise or where the decision stakes are especially high (e.g., doctors making life-or-death decisions), and the effects of AI explanation updates may also be moderated or mediated by other factors such as the accuracy level of the AI model. Overall, future studies should be conducted to explore the effects of AI explanation updates in more realistic settings and diverse domains, for different types of end-users, and to explore in more detail how these effects may be moderated by various factors.