Abstract
Trust has been identified as a critical factor in the success and safety of interaction with automated systems. Researchers have referred to “trust calibration” as an apt design goal: user trust should be at an appropriate level given a system’s reliability. One factor in user trust is the degree to which a system is perceived as humanlike, or anthropomorphic. However, relevant prior work does not explicitly characterize trust appropriateness, and generally considers visual rather than behavioral anthropomorphism. To investigate the role of humanlike system behavior in trust calibration, we conducted a 2 (communication style: machinelike, humanlike) \(\times \) 2 (reliability: low, high) between-subject study online where participants collaborated alongside an Automated Target Detection (ATD) system to classify a set of images in 5 rounds of gameplay. Participants chose how many images to allocate to the automation before each round, where appropriate trust was defined by the number of images that optimized performance. We found that communication style and reliability influenced perceptions of anthropomorphism and trustworthiness. Low and high reliability participants demonstrated overtrust and undertrust, respectively. The implications of our findings for the design and research of automated and autonomous systems are discussed in the paper.
1 Introduction
A user’s trust has been identified as a critical factor in both the safety and efficacy of interactions with automation and computers [5, 10]. Parasuraman and Riley [25] note that human-automation systems can suffer from both disuse of reliable and effective automation and misuse of unreliable automation, which may result from undertrust and overtrust, respectively. Thus, it is not necessarily desirable to increase trust, but to promote good ‘trust calibration.’ A user’s trust should be appropriate with respect to that system’s capabilities [10].
Anthropomorphism, or the degree to which an entity is perceived as humanlike, is one among numerous approaches to trust-building design in HCI. In general, the notion that greater anthropomorphism makes users more comfortable has motivated the development of humanlike agents (e.g., Apple’s Siri, Amazon’s Alexa, Google Assistant) as the interface to computing systems. Automated systems researchers have recognized the role of anthropomorphic agents in user perceptions and trust in collaborative tasks (e.g., [35, 36]). As artificial intelligence (AI) continues to drive the development of complex autonomous technologies in various applications, the significance of anthropomorphism or humanness in the interface will only grow. We argue that the process by which a user interacts with an automated system is less straightforward than “more humanlike = good.” Rather, anthropomorphic features or social cues in the interface color a user’s expectations of the system’s future behavior. While greater anthropomorphism may increase trust, this does not necessarily equate to better outcomes. The question of how appropriateness of user trust, with respect to system reliability, is influenced by humanlike features remains unanswered.
To investigate this, we designed a 2 (communication style) \(\times \) 2 (reliability) between-subject study where participants collaborated with a low or high reliability Automated Target Detection (ATD) system in 5 rounds of gameplay. The task involved classifying a set of 20 images as “Dangerous” or “Not Dangerous.” A score incentivized the speed and accuracy with which all of the images were collectively classified. Participants chose how many images to allocate to the automation before each round. Trust appropriateness was the difference between a user’s allocation in a round and an ideal level of allocation in each reliability condition that maximized the score (i.e., 5 images in low reliability, 15 in high). We attempted to elicit low and high anthropomorphism perceptions via machinelike and humanlike communication styles, respectively. Machinelike messages were minimal and informational, while humanlike messages were friendly, apologetic, and framed from the perspective of the automation.
We found that both communication style and reliability influenced measures of perceived anthropomorphism, confirming that behavioral cues can affect humanness perceptions without a visually humanlike representation. Humanlike communication was associated with greater perceptions of the automation’s benevolence. We also demonstrate the utility of a trust appropriateness measure. Despite the lack of a communication style effect, we found that high and low reliability participants were prone to undertrust and overtrust, respectively. Characterizing trust with respect to good performance in experimental setups can help to inform human-centered design that assists with appropriate trust calibration.
2 Related Work
Much of the prior work on anthropomorphism and social cues in HCI, as well as human-machine trust, is inspired by the Computers as Social Actors (CASA) paradigm [27]. CASA has been applied not only to justify studying the human construct of trust in HCI contexts generally, but to rationalize that a more socially-oriented system will lead to better interactions. We challenge the assertion that more humanlike systems will bring about the best outcomes by investigating the role of perceived humanness in the process of trust calibration. We discuss relevant prior work below.
2.1 Trust in Automation and HCI
A substantial body of work has investigated the notion of “trust” in automation (see [10] and [5] for reviews). Trust has been defined in the organizational psychology literature as “the willingness of a party to be vulnerable to the actions of another party based on the expectation that the other will perform a particular action important to the trustor, irrespective of the ability to monitor or control that party” [12]. This definition readily applies to interactions with computers and automated systems. We rely on systems to assist us with a task despite an inability to observe the exact process by which they execute the task. We make ourselves vulnerable to the possibility that the trusted computer fails to assist us.
Trust in automation researchers have suggested “trust calibration” to be a critical consideration when designing systems. That is, rather than necessarily aiming to increase trust, it is most desirable that a user is able to maintain an appropriate level of trust relative to the system’s reliability [10]. Trust that is too high can be counterproductive or dangerous. We consider trust appropriateness as ultimately more important than trust. The best outcomes will be achieved when a user has an accurate understanding of a technology’s strengths and shortcomings.
Despite the breadth of research on the importance of human-computer and human-automation trust, researchers generally investigate whether system features increase trust. Even in the work that acknowledges trust calibration in human-automation interaction, trust is often viewed as an end, without reference to what is an “ideal” level of trust. Actual measures of trust appropriateness or performance are relatively rare (for one measure, see [34]).
McDermott and ten Brink [13] defined “Calibration Points” as moments when automation reliability changes and a user must adjust their level of trust. In practice, changing environmental conditions or inputs that change reliability can affect automation performance and, therefore, necessitate a change in user trust. Appropriate trust occurs when a user adjusts their perceptions and behavior to accommodate changes in reliability at a Calibration Point. Our study involves a static level of reliability where the best automation performance occurred at a predetermined “ideal” level of reliance on the automation. In this way, we quantified trust appropriateness with respect to the ideal level and observed how system features influenced users’ calibration to this level throughout extended interaction with the automation.
De Visser, Pak, and Shaw [37] effectively outline how systems may be designed to support appropriate trust calibration via dynamic trust repair mechanisms. Noting that the shift from automated to autonomous systems has a significant impact on the nature of “human-machine relationships,” they point not only to the importance of trust repair acts in response to detrimental behaviors by a system, but also to that of trust dampening acts in response to beneficial behaviors [37]. Trust that is too high can be equally as problematic as trust that is too low, and systems can be designed to prevent both.
More recently, de Visser et al. [38] referred to “relationship equity” as a key variable in trust calibration on human-robot teams. Over the course of an interaction, the machine’s behaviors and its responses to those behaviors (i.e., dampening and repair acts) influence the accumulated relationship equity between the human and the technological team member. In turn, relationship equity affects trust in the technology in the future. Perceptions of an automated or autonomous teammate’s humanness may play a significant role in the interpretation of trust repair and dampening messages and, thus, how relationship equity among teammates develops. We sought to investigate how humanlike features influence users’ perceptions of and behavior toward an automated teammate over the course of a collaborative task.
2.2 Computers as Social Actors and Anthropomorphism
As mentioned, a body of research using the Computers as Social Actors paradigm has informed much of the research on trustworthy design. CASA studies have found that social rules learned for human-human interaction often still apply to computers. This has also been referred to as “the media equation” (i.e., “media = real life”) [27]. For instance, one study utilizing CASA found that participants are more likely to disclose information to a computer that first discloses to them, as is the case when interacting with another person [15]. Another study found that participants were polite toward computers, in that they evaluated a computer more positively when giving feedback directly on that computer than when providing the evaluation on a separate machine [20]. This phenomenon has been replicated in various social contexts (see [27]), suggesting that humans have a general tendency to treat computers like people.
Given that participants in early CASA studies were often from technical backgrounds and denied that they would treat computers like people, Nass and Moon [18] suggested that social responses occur due to mindlessness: when computers possess a sufficient number of social cues, we cannot help but automatically engage the scripts we use for interacting with other humans. Nass, Steuer, Henriksen, and Dryer [19] opt for the Greek term “Ethopoeia” to describe this automatic process, rather than “anthropomorphism,” which they define as a conscious and sincere belief that a non-human is human. Because conscious anthropomorphism is inconsistent with CASA, researchers have more recently argued that anthropomorphism may indeed be a mindless process [8]. While the media equation posits that computers elicit the same social responses as people, there is ample evidence that the degree of humanness influences the strength of the social response. Morkes, Kernal, and Nass [16] suggest that “soft” social responses to communication technologies (SRCT) may be a more appropriate model than the “hard” SRCT implied by the media equation. Soft SRCT implies a continuum of socialness on which entities are perceived, rather than a binary judgment of human or not human.
We use anthropomorphism to refer to the assignment of humanlike qualities to an entity by an observer, and the process by which mindless social responses to computers occur. Anthropomorphic is used to describe a target that is perceived as humanlike, and more or less anthropomorphic represents the degree to which a target is perceived as humanlike. Degrees of anthropomorphism are synonymous with degrees of socialness, as anthropomorphizing has been referred to as the act of evaluating another entity’s “social potential” and informing expectations of that entity’s future behavior [21]. A sensitivity to humanness is understandable from an evolutionary perspective, since human intelligence could pose a unique threat to survival [21]. As we face an increasing variety of intelligent and evolutionarily unfamiliar technologies, we may be anthropomorphizing even if we are not actively and consciously considering our technological interaction partner as a human. An understanding of the dynamics of anthropomorphism can critically inform interface design and ensure that the appropriate amount of humanness is employed in a system.
Research on visual anthropomorphism, or the extent to which an entity has a humanlike appearance, tends to show that agents with a greater degree of anthropomorphism elicit more positive perceptions than those that appear less humanlike [4, 22]. The role of a humanlike appearance in fostering positive perceptions has motivated research and design of systems represented by anthropomorphic software agents as well as humanlike robots.
Currently there is less understanding of behavioral anthropomorphism, or the extent to which an entity acts in a humanlike manner. These perceptions may arise more subtly given that an explicit representation of a human is not given. Nass and Moon [18] pointed to various features of computers that may elicit a social response, such as words for output or the filling of roles usually held by people. These may be thought of as behaviorally anthropomorphic features. Parasuraman and Miller [24] investigated the role of “etiquette” in human-automation interaction, manipulating communication style to compare an interruptive system to a patient one. They found that the latter “good etiquette” system improved performance significantly, even compensating for low automation reliability. While etiquette may contribute to perceived humanness, the study did not measure this. The current study explicitly measures how perceptions of anthropomorphism are elicited by the communication style of messages displayed by an automated system.
One group of researchers has observed the effects of anthropomorphism of an automated agent that suggested answers in a number pattern guessing game, using a visual representation and background stories to convey the humanness of the agent [34,35,36]. In the first study, the appropriateness of participants’ compliance with the agent’s suggestions was reported, and performance with a human agent appeared better than with a computer agent [34]. All three studies demonstrated that, in response to reliability that steadily degraded, trust decrements were less drastic when the automation was represented by a more anthropomorphic agent [34,35,36]. Pak, Fink, Price, Bass, and Sturre [23] similarly found that a visually anthropomorphic agent led to better performance and greater compliance with advice (i.e., behavioral trust) than a less anthropomorphic agent. Kulms and Kopp [9] found that subjective trust was higher for an agent represented by a human compared to a computer, although there was no agent effect on behavioral trust.
While it is often assumed that anthropomorphism positively influences trust, researchers in various domains have suggested that anthropomorphism may not always have a positive effect on perceptions and interactions. Duffy [3] referred to the careful interplay between humanness perceptions and expectations for robot behavior. Culley and Madhavan [2] specifically called on researchers and designers to consider the consequences of inappropriate trust calibration that may result from overly anthropomorphic agents. Moreover, the metaphor of the “uncanny valley” [31] has been used to study the discomfort and unpleasantness that arise when an agent or robot is “too humanlike” [32].
Our study builds on this work to investigate how anthropomorphism perceptions elicited by communication style influence an explicit measure of trust appropriateness.
3 Methodology
We conducted a 2 (communication style: machinelike, humanlike) \(\times \) 2 (reliability: low, high) between-subject study to observe how users calibrated trust in an error-prone automated system. Reliability determined the percentage of images that the system correctly identified and overall performance when it was relied upon. The communication style of the system’s messages was intended to elicit different anthropomorphism perceptions.
Given that prior work has shown positive effects of anthropomorphism on trust, such as less drastic trust declines in the face of degrading performance [34,35,36] and more positive subjective trust [9], we predicted that participants would have more appropriate trust when the system’s communication style “matched” its reliability. In other words, positive associations with greater humanness would lead participants to expect better reliability. Thus, we expected that trust would be less appropriate in the cases where 1) the system acted machinelike but reliability was high, and 2) the system acted humanlike but reliability was low. We formed the following hypotheses:
H1: Participants in the low reliability-machinelike group will have more appropriate trust than participants in the low reliability-humanlike group.
H2: Participants in the high reliability-humanlike group will have more appropriate trust than participants in the high reliability-machinelike group.
To test these hypotheses, we designed an online game where participants collaborated with and were able to adjust their level of reliance on an automated system. Participants reported on their perceptions of the system in a post-gameplay survey. The details of our methods are presented below.
3.1 The Target Identification Task
In each of 5 rounds of gameplay, participants classified a subset of 20 images on a map. To manually classify an image, participants clicked on a marker on the map, after which an image of a vehicle was shown in the Vehicle Identification Panel to the right of the map. “Non-dangerous” vehicles had only text on top of them, while “dangerous” vehicles had numbers in addition to text. Participants used “Zoom In,” “Zoom Out,” and “Rotate” buttons to determine whether the vehicle was dangerous. The manual task was intentionally made simple to reduce variability in individual performance.
While participants manually identified their portion of the images, the ATD system “worked” in parallel on the rest of the images. In reality, the system took a fixed amount of time per image and had accuracy determined by the reliability condition. The low reliability automation correctly identified 60% of its allocated images and the high reliability automation correctly identified 90%, both rounded down to the nearest integer so that the automation always misidentified at least one image. This level of accuracy was chosen based on prior work finding that a decision aid with less than 70% accuracy is considered worse than no aid [41], although our task involved reliance on automation’s simultaneous performance rather than decisions to comply with an aid’s suggestions.
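The following minimal sketch illustrates how the automation’s accuracy could be computed under this description; the function and its handling of edge cases are our own assumptions, not the authors’ implementation.

```python
import math

def atd_correct_count(n_allocated: int, reliability: float) -> int:
    # A fixed percentage of the allocated images is identified correctly,
    # rounded down to the nearest integer as described above.
    return math.floor(reliability * n_allocated)

# Hypothetical example: 15 images allocated in each reliability condition.
print(atd_correct_count(15, 0.60))  # 9 correct, 6 misidentified (low reliability)
print(atd_correct_count(15, 0.90))  # 13 correct, 2 misidentified (high reliability)
```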
The round ended when all 20 images had been classified. The gameplay interface is shown in Fig. 1.
3.2 Scoring
Participants received a score representing their performance in each round, which was motivated as follows. Manual classification involved clicking multiple times and was more time consuming than automated classification. However, the automation was not perfect and could lead to low accuracy based on the reliability condition. We told participants that their “Round Score” credited speed (overall time spent to classify the 20 images) and accuracy (overall percentage of correctly identified images). Participants had to decide how willing they were to rely on the automation to quickly assist in classifying images despite the risk to overall accuracy.
In reality, the Round Score was determined solely by the number of images allocated to the automation in a given round. The best Round Score occurred for an “ideal” number of images in each reliability group: 5 images for low reliability, 15 for high. Round Scores decreased linearly with the distance away from this value. Moreover, while we told participants that their Round Score was up to 100 points, we limited the maximum score to 90 points so that it was not obvious when ideal calibration had been achieved. We added a small, randomly generated “noise” amount to each Round Score so that it appeared that scoring varied based on performance.
Trust appropriateness was characterized with respect to the ideal value. For instance, allocating 18 images to the high reliability automation is indicative of overtrust, 10 of undertrust, and 15 of appropriate calibration. This allowed us to observe how the communication style and reliability manipulations influenced the degree of miscalibration.
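A minimal sketch of the fixed scoring mechanism and the trust appropriateness measure is given below. The penalty slope and noise magnitude are hypothetical, since the paper states only that scores decreased linearly with distance from the ideal allocation and included a small random noise term.

```python
import random

IDEAL_ALLOCATION = {"low": 5, "high": 15}  # images allocated to the automation
MAX_SCORE = 90         # actual cap; participants were told the maximum was 100
PENALTY_PER_IMAGE = 4  # hypothetical slope of the linear decrease
NOISE = 3              # hypothetical bound on the random noise added to each score

def round_score(allocated: int, reliability: str) -> float:
    # Score depends only on the distance from the ideal allocation, plus noise.
    distance = abs(allocated - IDEAL_ALLOCATION[reliability])
    return max(0, MAX_SCORE - PENALTY_PER_IMAGE * distance) + random.uniform(-NOISE, NOISE)

def trust_appropriateness(allocated: int, reliability: str) -> int:
    # Negative values indicate undertrust, positive values overtrust, 0 ideal calibration.
    return allocated - IDEAL_ALLOCATION[reliability]

print(trust_appropriateness(18, "high"))  # +3: overtrust
print(trust_appropriateness(10, "high"))  # -5: undertrust
```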
This fixed scoring mechanism was in line with the ostensible speed and accuracy incentives. Low reliability participants should have allocated fewer images to the automation because of its reduced accuracy, although they still needed to rely on it to achieve reasonable speed. High reliability participants should have allocated more images to the automation because it did not greatly compromise accuracy, although they still should have helped to identify some images to achieve reasonable speed.
To motivate good performance, we told participants that they would be awarded a bonus based on their cumulative Round Scores in addition to the compensation they would receive for completion of the study. In reality, all participants first received $2 immediately following participation and then a $2 bonus after we were no longer recruiting participants. The timing of the bonus ensured that workers were not able to reveal to other workers that an additional $2 was awarded regardless of performance.
3.3 Feedback Page and Message Design
Following each round of the game, a feedback page (Fig. 2) was shown containing three elements: 1) the Round Score for the previous round, 2) a message noting the number of errors made by the automation, and 3) the allocation decision for the next round.
The Round Score was shown with a colored gauge indicating where the score fell out of 100 possible points, as well as the compensation amount ostensibly associated with that score (100 points = $0.40).
The design of the feedback messages was based on research investigating the effects of trust repair messages following system errors. For instance, Tzeng [33] found that apologies by computers led to more positive impressions. Jensen et al. [7] found that a self-blaming automated system (“I was unable...”) was perceived as more trustworthy than one blaming its developers for errors (“The developers were not able...”) using the same collaborative game used in the current study. Moreover, Sebo, Krishnamurthi, and Scassellati [30] and Quinn, Pak, and de Visser [26] have noted how apologies and denials by robots and automation influence future trust.
In line with this prior work, the humanlike text was designed using first-person (the automation referred to itself as “I”), an apology when reporting errors and, in general, social niceties intended to elicit the perception of humanness. On the other hand, the machinelike text was designed to be minimal, informational, and impersonal. The manipulation is similar to that in Parasuraman and Miller’s [24] etiquette study, although they used interruptive or patient timing of messages and did not measure perceived humanness of the automated system. After gameplay, we tested whether the humanlike communication style led to greater perceived anthropomorphism with two different scales. The text we displayed on the feedback page in each condition is shown in the second column of Table 1.
3.4 Study Procedure
Participants were recruited on Amazon Mechanical Turk (MTurk) and restricted to workers 18 years or older, living in the United States, and having completed at least 1000 Human Intelligence Tasks (HITs) with an approval rate of at least 95%. Upon accepting the HIT, participants were shown an information sheet describing the general study procedure and were asked whether they consented to participate. Those who gave consent were forwarded to our online game.
First, an instruction page was displayed describing the Target Identification Task. After reading about the task and game controls, participants were required to correctly answer a series of multiple choice questions about the game. This included a question confirming their understanding of the importance of speed and accuracy for performance. These motivational details were critical so that there was risk and reward in relying on the ATD system, both components of a trusting relationship. One multiple choice question also confirmed that participants were using audio so that they would hear the sound effects accompanying correct and incorrect image identifications.
After completing the instruction page, an introduction message was displayed, followed by the allocation decision for Round 1. The introduction text was the first manifestation of the communication style, and is shown in the first column of Table 1 for each group. Participants played the first round of the game after confirming their allocation decision.
After the feedback page that followed the fifth round of the Target Identification Task, participants were forwarded to a post-gameplay survey. Only those who completed all 5 rounds of gameplay and the survey were compensated.
3.5 Survey Measures
The post-gameplay survey was hosted on our university’s Qualtrics server. Participants first responded to a series of demographic questions regarding gender, age, race, education level, and video gaming frequency before reporting on their experience in the game.
Game-Related Questions. Two items were used to assess the perceived reliability of both automated and manual identification. For instance, the automation item asked, “How reliable was the automation? Specifically if the automation analyzed 100 images, how many would it correctly identify?” Two additional items assessed the perceived speed of automated and manual identification. The automation item in this case asked, “How fast was the automation? Specifically, if the automation analyzed images for one minute, how many would it identify?”
Then, in one multiple answer item, participants were asked to select which factors contributed to their score, with the following options in order: “None of the below,” “All of the below,” “The automation’s accuracy,” “Your accuracy,” “The automation’s speed,” and “Your speed.” This was used to confirm that participants were not aware of the fixed scoring mechanism.
Perceived Behavioral Anthropomorphism. To check the effect of the communication style manipulation, we included 4 of the items from the anthropomorphism index of the Godspeed questionnaire [1] rated on 5-point semantic differential scales with reference to the ATD system (Fake/Natural, Machinelike/Humanlike, Unconscious/Conscious, Artificial/Lifelike). The Godspeed scale was developed for robotic systems, and so the “Moving rigidly/Moving elegantly” item was not relevant given our automated system’s lack of physical embodiment.
Three additional perceived behavioral anthropomorphism (PBA) items were developed for the current study referring specifically to the humanness of the system’s behavior and messages. Participants rated agreement with the statements, “The system communicated with me like a human would,” “My interaction with the system felt like one with another person,” and “The system acted in a humanlike manner” on a 7-point Likert scale.
Individual Differences in Anthropomorphism. The Individual Differences in Anthropomorphism Questionnaire (IDAQ) was developed by Waytz, Cacioppo, and Epley [39] and has been found to predict trust in technology [40]. Items were rated on an 11-point scale from “Not at all” to “Very much.” Sample items include, “To what extent do cows have intentions?” and “To what extent does the average computer have a mind of its own?”
Perceived Trustworthiness Characteristics. We measured subjective trust in the ATD system by adapting perceived ability, integrity, and benevolence items from McKnight, Choudhury, and Kacmar [14], originally from Mayer and Davis [11]. Prior work on automated systems has found that these trusting perceptions are influenced by information about system performance and process [6] as well as system accuracy and attribution of blame for errors in a similar image classification task [7]. These “trusting beliefs” can help to paint a thorough picture of how humanness relates to automation perception and behavior beyond unidimensional subjective trust measures.
We included 3 attention check questions throughout the survey asking for a specific multiple choice answer (e.g., “Please select ‘Disagree’ for this statement”).
The study was approved by the University of Connecticut Institutional Review Board (IRB).
4 Evaluation
A total of 158 participants completed the study. We first removed the data of 27 participants who incorrectly answered at least one of the attention check questions. Next, only one of the remaining participants answered “None of the below” in the multiple answer item asking which factors contributed to their score, suggesting that this participant did not believe the score was actually based on performance; this participant’s data was also removed. All subsequent analyses were conducted on the remaining 130 participants, with the group distribution shown in Table 2.
The sample consisted of 71 (54.6%) males and 59 (45.4%) females, and there were 106 (81.5%) white, 11 (8.5%) African American, 5 (3.8%) Hispanic, 5 (3.8%) Asian, and 3 (2.3%) Native American participants. The average age was 37.2 years (SD = 11.7). Regarding education level, 57 (43.8%) participants reported having at least a 4-year college degree. When asked how often they play games on a computer or mobile device, 54 (41.5%) said they play daily, 41 (31.5%) a few times a week, and 35 (26.9%) a few times a month or less.
A Chi-square test revealed that the experimental groups did not significantly differ in terms of gender (\(\chi ^{2}(3)\) = 0.29, p = 0.96). A Fisher’s Exact Test likewise found that groups were similar in terms of race (p = 0.49). Lastly, Kruskal-Wallis tests found that there were no significant differences in age (\(\chi ^{2}(3)\) = 3.20, p = 0.36), education level (\(\chi ^{2}(3)\) = 0.97, p = 0.81), or gaming frequency (\(\chi ^{2}(3)\) = 3.11, p = 0.38). Thus, group differences can be attributed to our manipulations.
We expected that participants would rate manual identification as more accurate than automated identification, and automated identification as faster than manual identification. Given the non-normal distributions of these responses, two Wilcoxon Signed-Ranks Tests (the non-parametric equivalent of a paired t-test) confirmed this prediction. Participants reported that, out of 100 images, they would correctly identify significantly more (M = 86.9, SD = 17.0) than the automation (M = 72.6, SD = 17.8; \(p < 0.001\)) and that, in one minute, the automation would identify significantly more images (M = 57.1, SD = 31.5) than they would (M = 34.5, SD = 28.0; \(p < 0.001\)). Consistent with Mayer, Davis, and Schoorman’s definition of trust [12], participants had to decide how willing they were to be vulnerable to the ATD system’s lower accuracy, with the expectation that it would help improve their speed.
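For illustration, a paired comparison of this kind can be run as a Wilcoxon signed-rank test with SciPy; the arrays below are hypothetical stand-ins for the per-participant responses, not the study data.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
# Hypothetical per-participant estimates (0-100 images correct), NOT the study data.
manual_accuracy = rng.normal(87, 17, size=130).clip(0, 100)
auto_accuracy = rng.normal(73, 18, size=130).clip(0, 100)

stat, p = wilcoxon(manual_accuracy, auto_accuracy)  # paired, non-parametric test
print(f"W = {stat:.1f}, p = {p:.4f}")
```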
4.1 Manipulation Checks
Participants accurately reported the number of images they would expect the automation to correctly identify out of 100 in both the low (M = 60.5, SD = 15.0) and high (M = 84.0, SD = 11.5) reliability groups. A Mann-Whitney U-test confirmed that this difference in perceived reliability between groups was significant (U = 328.50, \(p < 0.001\)).
Next, we tested the effect of the automation’s communication style on perceived anthropomorphism measured using the 4 established Godspeed items (\(\alpha \) = 0.94) and the 3 PBA items developed for this study (\(\alpha \) = 0.91). The IDAQ scale (\(\alpha \) = 0.91) was entered as a covariate for each test to control for individual anthropomorphic tendencies and because it reduced the error term in both cases (Note 1).
A 2 (reliability) \(\times \) 2 (communication style), Analysis of Covariance (ANCOVA) on the Godspeed measure yielded a significant main effect of reliability (F(1, 125) = 9.00, p = 0.003, \(\eta _{p}^{2}\) = 0.067) when controlling for the IDAQ (F(1, 125) = 25.51, \(p < 0.001\), \(\eta _{p}^{2}\) = 0.160). High reliability participants (\(M_{adj}\) = 2.96, SE = 0.13) reported greater Godspeed perceived anthropomorphism than low reliability participants (\(M_{adj}\) = 2.41, SE = 0.13). However, the main effect of communication style was not significant (F(1, 125) = 2.07, p = 0.153, \(\eta _{p}^{2}\) = 0.016).
A separate ANCOVA on the PBA measure yielded significant main effects of both reliability (F(1, 125) = 4.29, p = 0.040, \(\eta _{p}^{2}\) = 0.033) and communication style (F(1, 125) = 10.56, p = 0.001, \(\eta _{p}^{2}\) = 0.078) when controlling for the IDAQ (F(1, 125) = 18.39, \(p < 0.001\), \(\eta _{p}^{2}\) = 0.128). High reliability participants (\(M_{adj}\) = 4.14, SE = 0.17) reported greater perceived anthropomorphism than low reliability participants (\(M_{adj}\) = 3.63, SE = 0.18). Also, participants in the humanlike condition (\(M_{adj}\) = 4.28, SE = 0.17) reported greater perceived anthropomorphism than those in the machinelike condition (\(M_{adj}\) = 3.49, SE = 0.17).
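An ANCOVA of this form could be specified along the following lines with statsmodels; the data frame and column names are assumptions for illustration (the paper does not report its analysis code or the sums-of-squares type used).

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical per-participant data standing in for the survey responses.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "pba": rng.uniform(1, 7, 130),                         # PBA composite score
    "idaq": rng.uniform(0, 10, 130),                       # IDAQ covariate
    "reliability": rng.choice(["low", "high"], 130),
    "style": rng.choice(["machinelike", "humanlike"], 130),
})

# 2 (reliability) x 2 (communication style) ANCOVA controlling for the IDAQ.
model = ols("pba ~ C(reliability) * C(style) + idaq", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```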
There are three important conclusions from these results. First, individual tendencies appear to play a significant role in perceptions of anthropomorphism of automation. Second, the PBA items may measure something slightly conceptually distinct from the Godspeed items, given that the latter were not significantly influenced by communication style. The Likert-style PBA items were more explicit in their reference to system behaviors being humanlike. Third, the high reliability system was consistently perceived as more humanlike than the low reliability system. When the system was more accurate and could handle more of the task load, it was considered more humanlike. It is worth noting that participants responded to the perceived anthropomorphism items after the entire interaction. Perceptions of humanness may have been influenced to a greater extent by communication style initially, when participants had only seen the introduction message and had not yet watched the system in action. Over time, observation of the automation’s performance likely came to inform humanness perceptions to a greater extent.
4.2 Allocation to the Automation and Trust Appropriateness
To test hypotheses H1 and H2, we operationalized a participant’s trust appropriateness for each round of gameplay as the difference between their allocation to the ATD system for that round and the ideal allocation amount for their reliability condition (5 images for low reliability, 15 for high). Negative values represent undertrust of the automation and positive values represent overtrust. Mean trust appropriateness for each group across the 5 rounds of gameplay is shown in Fig. 3.
Trust appropriateness was submitted to a repeated measures analysis of variance (rm-ANOVA) with round of the game as the within-subject factor and reliability and communication style as between-subject factors. There was a significant main effect of reliability on trust appropriateness (F(1, 126) = 192.64, \(p < 0.001\), \(\eta _{p}^{2}\) = 0.605) (Note 2). Regardless of round, high reliability participants tended to undertrust the automation (\(M_{adj}\) = −2.63, SE = 0.32) whereas low reliability participants tended to overtrust (\(M_{adj}\) = 3.64, SE = 0.33).
Moreover, there was a significant interaction between round and reliability (F(4, 504) = 13.32, \(p < 0.001\), \(\eta _{p}^{2}\) = 0.096). Post-hoc comparisons were conducted with a Bonferroni adjustment to \(\alpha \) = 0.05/10 = 0.005. Low reliability participants’ trust appropriateness in Round 1 (M = 5.19, SD = 3.40) differed significantly from Round 3 (M = 3.21, SD = 4.32; p = 0.001), Round 4 (M = 2.25, SD = 4.20; \(p < 0.001\)), and Round 5 (M = 2.37, SD = 3.96; \(p < 0.001\)). Round 2 trust appropriateness (M = 5.19, SD = 3.94) also differed significantly from Round 3 (p = 0.002), Round 4 (\(p < 0.001\)), and Round 5 (\(p < 0.001\)). High reliability participants’ trust appropriateness in Round 1 (M = −3.62, SD = 4.14) differed significantly from Round 4 (M = −1.96, SD = 3.25; p = 0.004). The lack of differences between the later rounds suggests that participants found a relatively stable level of trust after observing the automation perform throughout the game. However, appropriate trust was never achieved by either reliability group. One-sample t-tests using a Bonferroni-adjusted significance level confirmed that group mean trust appropriateness in each round was significantly different from 0.
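The one-sample tests against the ideal value of 0 take the following form; the array and the assumed number of comparisons for the Bonferroni adjustment are illustrative, not taken from the paper.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Hypothetical Round 1 trust appropriateness values for one group, NOT the study data.
ta_round1_low = np.random.default_rng(2).normal(5.2, 3.4, size=65)

t, p = ttest_1samp(ta_round1_low, popmean=0)  # test against ideal calibration (0)
alpha_adj = 0.05 / 10                         # assuming 5 rounds x 2 groups of tests
print(f"t = {t:.2f}, p = {p:.4f}, significant: {p < alpha_adj}")
```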
The interaction between communication style and reliability was not significant, and thus H1 and H2 were not supported.
4.3 Perceived Trustworthiness
To observe whether our manipulations influenced subjective trust, we conducted a multivariate analysis of variance (MANOVA) on the perceived ability, integrity, and benevolence of the ATD system. We found significant main effects of both reliability (F(3, 124) = 23.64, \(p < 0.001\), Wilks’ \(\lambda \) = 0.636, \(\eta _{p}^{2}\) = 0.364) and communication style (F(3, 124) = 3.07, p = 0.030, Wilks’ \(\lambda \) = 0.931, \(\eta _{p}^{2}\) = 0.069) and conducted a series of follow-up univariate ANOVAs. The significance level for follow-ups was Bonferroni-adjusted to \(\alpha \) = 0.05/3 = 0.0167.
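A comparable MANOVA can be specified with statsmodels as sketched below; again, the data frame and column names are assumptions rather than the authors’ code.

```python
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

# Hypothetical per-participant trustworthiness ratings and conditions.
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "ability": rng.uniform(1, 7, 130),
    "integrity": rng.uniform(1, 7, 130),
    "benevolence": rng.uniform(1, 7, 130),
    "reliability": rng.choice(["low", "high"], 130),
    "style": rng.choice(["machinelike", "humanlike"], 130),
})

# mv_test() reports Wilks' lambda for each effect, the statistic cited above.
mv = MANOVA.from_formula(
    "ability + integrity + benevolence ~ C(reliability) * C(style)", data=df)
print(mv.mv_test())
```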
Ability. There was a main effect of reliability on perceived ability (F(1, 126) = 68.04, \(p < 0.001\), \(\eta _{p}^{2}\) = 0.357). High reliability participants (M = 5.81, SD = 0.87) rated the system’s ability higher than low reliability participants (M = 4.23, SD = 1.29).
Integrity. The overall model for integrity was not significant, suggesting that the reliability and communication style manipulations did not influence participants’ ratings of the system’s integrity.
Benevolence. There was a marginally significant effect of reliability on perceived benevolence (F(1, 126) = 4.68, p = 0.032, \(\eta _{p}^{2}\) = 0.036), where high reliability participants (M = 4.62, SD = 1.36) rated the system as slightly more benevolent than low reliability participants (M = 4.15, SD = 1.26). Additionally, there was a main effect of communication style (F(1, 126) = 6.34, p = 0.013, \(\eta _{p}^{2}\) = 0.048). Participants in the humanlike groups (M = 4.67, SD = 1.15) rated the system as more benevolent than those in the machinelike groups (M = 4.11, SD = 1.46).
5 Discussion
We believe our method and findings offer valuable insights toward a synthesis of research on CASA, anthropomorphism, and trust in automation, as discussed below.
5.1 System Behaviors and Anthropomorphism
First, we wanted to observe whether perceptions of anthropomorphism could be elicited by subtle system behaviors and features, and not just visual cues to humanness that are often studied. Nass and Moon [18] suggested that one cue that may lead to social responses to computers is the filling of roles normally held by people. In our study, the automation may have been perceived as humanlike because it was performing the same task as each participant. Likewise, the mere presence of words for communication may have been enough of a cue to perception of the system as a social entity.
On top of these features common across groups, we sought to demonstrate that the communication style of system messages would elicit differences in perceptions of humanness. We found that the system’s communication style influenced our PBA measure, but not the Godspeed measure. This may point to differences in what our explicitly-worded Likert items measured compared to the more general phrasing of the Godspeed items. Among the 3 PBA items, one referred specifically to the system’s communication, while the 4 Godspeed items noted general characteristics of the automation.
Additionally, reliability affected both the PBA and Godpseed measures. Participants perceived the more accurate system as more humanlike. Prior findings on the relationship between accuracy and humanness are mixed. One study similarly found that a more smoothly operating robot was perceived as more humanlike [29]. Another found that greater inconsistency between a robot’s spoken instructions and gestures actually led to greater perceived anthropomorphism [28].
In light of communication style’s lack of effect on the Godspeed measure, it seems that reliability carried greater weight in perceptions of the automation in our experiment. Had perceived anthropomorphism been measured early in the interaction, perhaps before participants observed the automation’s performance, communication style may have shown a stronger effect. Nonetheless, this suggests that anthropomorphism may be a relatively dynamic perception with respect to an entity. Further research is needed to understand the extent to which systems represented by a humanlike agent are consistently perceived as humanlike, or whether perceptions of humanness change based on familiarity or the observation of certain behaviors.
In general, this finding supports the idea that perceptions of a system’s anthropomorphism are drawn from how it acts, and not just what it looks like. We encourage further work identifying specific features or system actions that contribute to perceptions of behavioral anthropomorphism. For instance, the messages in the current study were consistent across rounds. Messages that change over time may act as a cue to more dynamic and humanlike behavior, and therefore have a more substantial influence on trust in a system.
The more accurate system may also have been perceived as more humanlike due to a greater sense of similarity, given that participants felt that they were more accurate than the automation. Prior work has found that computers with personalities that are similar to the user elicit more positive impressions [17].
We found that participants in the humanlike communication style groups perceived the system as more benevolent than those in the machinelike groups. In fact, the effect of communication style on perceived benevolence (\(\eta _{p}^{2}\) = 0.048) was slightly larger than that of system reliability (\(\eta _{p}^{2}\) = 0.036). The humanlike messages appeared to express that the system had the capacity to care for the user. This supports previously found positive perceptual benefits of apologies by computers [33]. Prior work has also found that self-blame by an automated system following errors leads to perceptions of greater benevolence compared to blame of system developers [7]. As of yet, however, the relationship between benevolence perceptions and behavioral outcomes such as reliance has not been demonstrated in the human-automation context. Future work in this area may help to better understand the relationship between anthropomorphism, trustworthiness perceptions, and performance.
5.2 Trust Calibration and Appropriateness
A second goal of the current study was to demonstrate a relationship between humanlike system design features and trust appropriateness. As mentioned, the communication style manipulation did not strongly influence behavioral trust. Within our experimental setup, the influence of the scoring mechanism may have been too dominant to observe such an effect. Speed and accuracy incentives as well as the clear representation of performance given by the Round Score may have reduced the salience of the communication style. A task that was more social in nature than the Target Identification task or a stronger manipulation of anthropomorphism could have amplified the effect we hoped to observe.
Nonetheless, we believe the quantifiable measure of trust appropriateness employed in this study sheds light on the importance of trust calibration as a design goal. Our participants gradually calibrated their trust to a more appropriate level over the course of the game. Participants in the low reliability group demonstrated overtrust. The degree of overtrust was significantly greater in the first two rounds compared to the later three. On the other hand, participants in the high reliability group generally demonstrated undertrust. A lack of significant differences between rounds (except between Rounds 1 and 4) suggests that these participants may have been better calibrated initially. Without considering what represents an “appropriate” level of trust, prior work tends to miss the fact that increasing trust is not always desirable.
This measure was not without its limitations. For one, the predetermined ideal levels of trust for both groups generally led to longer times for each round. The shortest round durations would be more likely to occur at more equal levels of allocation, especially when the speed of manual classification increased with experience. For instance, 10 manual and 10 automated images would likely lead to less overall time than the ideal 5 manual and 15 automated in the high reliability condition. Although we motivated participants to perform both quickly and accurately, while also fixing the score to motivate certain levels of allocation, MTurk users’ desire to complete HITs quickly may have incentivized speed to a greater extent. This was likely especially true for low reliability participants, since the ideal level of allocation for their score required a great deal of effort in manually identifying 15 images.
Our study also defined trust appropriateness in a specific way. Reliability was represented as a fixed percentage of images that the automation would classify correctly. Thus, calibration of trust involved building an understanding of this percentage and finding what level of allocation would optimize performance. In practice, other types of trust calibration may be more prevalent, as noted in McDermott and ten Brink’s [13] idea for Calibration Points. Understanding the situations in which system reliability changes and adjusting behavior accordingly is a critical aspect of the calibration process. Further research is needed to clarify how anthropomorphism and agents that represent systems help or hinder the goal of appropriate trust in contexts where reliability is dynamic.
6 Conclusion
In this study, we sought to observe whether perceptions of behavioral anthropomorphism influenced the extent to which individuals were able to calibrate their trust in an automated system to an appropriate level. We found that, while both communication style and reliability influenced measures of perceived anthropomorphism, only reliability significantly influenced the appropriateness of participants’ trust in the automation throughout the game. In particular, low reliability participants tended to overtrust the automation while high reliability participants tended to undertrust. The humanlike system was perceived as more benevolent than the machinelike system. Further research is needed to identify other system features and behaviors that elicit behavioral anthropomorphism perceptions. Additionally, we hope that the measure of trust appropriateness that we employed inspires other researchers to focus on trust calibration, rather than simply increasing levels of trust.
Notes
1. Estimated marginal means are reported at the mean IDAQ score of 4.17.
2. Levene’s test was violated only for first round trust appropriateness.
References
Bartneck, C., Kulić, D., Croft, E., Zoghbi, S.: Measurement instruments for the anthropomorphism, animacy, likeability, perceived intelligence, and perceived safety of robots. Int. J. Soc. Robot. 1(1), 71–81 (2009). https://doi.org/10.1007/s12369-008-0001-3
Culley, K.E., Madhavan, P.: A note of caution regarding anthropomorphism in HCI agents. Comput. Hum. Behav. 29(3), 577–579 (2013)
Duffy, B.R.: Anthropomorphism and the social robot. Robot. Auton. Syst. 42(3–4), 177–190 (2003)
Gong, L.: How social is social responses to computers? The function of the degree of anthropomorphism in computer representations. Comput. Hum. Behav. 24(4), 1494–1509 (2008)
Hoff, K.A., Bashir, M.: Trust in automation: integrating empirical evidence on factors that influence trust. Hum. Factors 57(3), 407–434 (2015)
Jensen, T., Albayram, Y., Khan, M.M.H., Buck, R., Coman, E., Fahim, M.A.A.: Initial trustworthiness perceptions of a drone system based on performance and process information. In: Proceedings of the 6th International Conference on Human-Agent Interaction, pp. 229–237. ACM (2018)
Jensen, T., Albayram, Y., Khan, M.M.H., Fahim, M.A.A., Buck, R., Coman, E.: The apple does fall far from the tree: user separation of a system from its developers in human-automation trust repair. In: Proceedings of the 2019 on Designing Interactive Systems Conference, pp. 1071–1082. ACM (2019)
Kim, Y., Sundar, S.S.: Anthropomorphism of computers: is it mindful or mindless? Comput. Hum. Behav. 28(1), 241–250 (2012)
Kulms, P., Kopp, S.: More human-likeness, more trust? The effect of anthropomorphism on self-reported and behavioral trust in continued and interdependent human-agent cooperation. Proc. Mensch und Comput. 2019, 31–42 (2019)
Lee, J.D., See, K.A.: Trust in automation: designing for appropriate reliance. Hum. Factors 46(1), 50–80 (2004)
Mayer, R.C., Davis, J.H.: The effect of the performance appraisal system on trust for management: a field quasi-experiment. J. Appl. Psychol. 84(1), 123 (1999)
Mayer, R.C., Davis, J.H., Schoorman, F.D.: An integrative model of organizational trust. Acad. Manag. Rev. 20(3), 709–734 (1995)
McDermott, P.L., Brink, R.N.T.: Practical guidance for evaluating calibrated trust. In: Proceedings of the Human Factors and Ergonomics Society Annual Meeting, vol. 63, pp. 362–366. SAGE Publications, Los Angeles (2019)
McKnight, D.H., Choudhury, V., Kacmar, C.: Developing and validating trust measures for e-commerce: an integrative typology. Inf. Syst. Res. 13(3), 334–359 (2002)
Moon, Y.: Intimate exchanges: using computers to elicit self-disclosure from consumers. J. Consum. Res. 26(4), 323–339 (2000)
Morkes, J., Kernal, H.K., Nass, C.: Effects of humor in task-oriented human-computer interaction and computer-mediated communication: a direct test of SRCT theory. Hum.-Comput. Interact. 14(4), 395–435 (1999)
Nass, C., Lee, K.M.: Does computer-generated speech manifest personality? An experimental test of similarity-attraction. In: Proceedings of the SIGCHI conference on Human Factors in Computing Systems, pp. 329–336. ACM (2000)
Nass, C., Moon, Y.: Machines and mindlessness: social responses to computers. J. Soc. Issues 56(1), 81–103 (2000)
Nass, C., Steuer, J., Henriksen, L., Dryer, D.C.: Machines, social attributions, and ethopoeia: performance assessments of computers subsequent to “self-” or “other-” evaluations. Int. J. Hum.-Comput. Stud. 40(3), 543–559 (1994)
Nass, C., Steuer, J., Tauber, E.R.: Computers are social actors. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 72–78. ACM (1994)
Nowak, K.L.: Examining perception and identification in avatar-mediated interaction. In: Sundar, S.S. (ed.) Handbooks in Communication and Media. The Handbook of the Psychology of Communication Technology, pp. 89–114. Wiley-Blackwell (2015)
Nowak, K.L., Biocca, F.: The effect of the agency and anthropomorphism on users’ sense of telepresence, copresence, and social presence in virtual environments. Presence Teleoperators Virtual Environ. 12(5), 481–494 (2003)
Pak, R., Fink, N., Price, M., Bass, B., Sturre, L.: Decision support aids with anthropomorphic characteristics influence trust and performance in younger and older adults. Ergonomics 55(9), 1059–1072 (2012)
Parasuraman, R., Miller, C.A.: Trust and etiquette in high-criticality automated systems. Commun. ACM 47(4), 51–55 (2004)
Parasuraman, R., Riley, V.: Humans and automation: use, misuse, disuse, abuse. Hum. Factors 39(2), 230–253 (1997)
Quinn, D.B., Pak, R., de Visser, E.J.: Testing the efficacy of human-human trust repair strategies with machines. In: Proceedings of the Human Factors and Ergonomics Society Annual Meeting, vol. 61, pp. 1794–1798. SAGE Publications, Los Angeles (2017)
Reeves, B., Nass, C.I.: The Media Equation: How People Treat Computers, Television, and New Media Like Real People and Places. Cambridge University Press, New York (1996)
Salem, M., Eyssel, F., Rohlfing, K., Kopp, S., Joublin, F.: To err is human (-like): effects of robot gesture on perceived anthropomorphism and likability. Int. J. Soc. Robot. 5(3), 313–323 (2013)
Salem, M., Lakatos, G., Amirabdollahian, F., Dautenhahn, K.: Would you trust a (faulty) robot? Effects of error, task type and personality on human-robot cooperation and trust. In: 2015 10th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pp. 1–8. IEEE (2015)
Sebo, S.S., Krishnamurthi, P., Scassellati, B.: “I don’t believe you”: investigating the effects of robot trust violation and repair. In: 2019 14th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pp. 57–65. IEEE (2019)
Seyama, J., Nagayama, R.S.: The uncanny valley: effect of realism on the impression of artificial human faces. Presence Teleoperators Virtual Environ. 16(4), 337–351 (2007)
Strait, M., Vujovic, L., Floerke, V., Scheutz, M., Urry, H.: Too much humanness for human-robot interaction: exposure to highly humanlike robots elicits aversive responding in observers. In: Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pp. 3593–3602. ACM (2015)
Tzeng, J.Y.: Toward a more civilized design: studying the effects of computers that apologize. Int. J. Hum.-Comput. Stud. 61(3), 319–345 (2004)
de Visser, E.J., et al.: The world is not enough: trust in cognitive agents. In: Proceedings of the Human Factors and Ergonomics Society Annual Meeting, vol. 56, pp. 263–267. SAGE Publications, Los Angeles (2012)
de Visser, E.J., et al.: A little anthropomorphism goes a long way: effects of oxytocin on trust, compliance, and team performance with automated agents. Hum. Factors 59(1), 116–133 (2017)
de Visser, E., et al.: Almost human: anthropomorphism increases trust resilience in cognitive agents. J. Exp. Psychol. Appl. 22(3), 331 (2016)
de Visser, E.J., Pak, R., Shaw, T.H.: From ‘automation’ to ‘autonomy’: the importance of trust repair in human-machine interaction. Ergonomics 61(10), 1409–1427 (2018)
de Visser, E.J., et al.: Towards a theory of longitudinal trust calibration in human-robot teams. Int. J. Soc. Robot. 12, 459–478 (2019). https://doi.org/10.1007/s12369-019-00596-x
Waytz, A., Cacioppo, J., Epley, N.: Who sees human? The stability and importance of individual differences in anthropomorphism. Perspect. Psychol. Sci. 5(3), 219–232 (2010)
Waytz, A., Heafner, J., Epley, N.: The mind in the machine: anthropomorphism increases trust in an autonomous vehicle. J. Exp. Soc. Psychol. 52, 113–117 (2014)
Wickens, C.D., Dixon, S.R.: The benefits of imperfect diagnostic automation: a synthesis of the literature. Theor. Issues Ergon. Sci. 8(3), 201–212 (2007)
Acknowledgments
The authors would like to thank Md Abdullah Al Fahim and Kristine Nowak for their insights while preparing this experiment and manuscript.