This section discusses the results of the study through usability surveys, quantitative analysis, and qualitative analysis. We describe our findings to validate our considerations and design decisions. Throughout, we use the convention \((\overline{x}=\_\_,\ \sigma=\_\_)\) to report the mean and standard deviation of a quantity.
6.1 Utility of ConverSense as a Data-driven Communication Feedback Tool
Data-driven communication feedback helps providers evaluate patient interactions objectively — All five participants liked the idea of using a data-driven visualization to uncover their communication patterns with patients. Participants reported that the methods currently employed in clinical practice (if any), such as patient reports [P2, P3] and physician coaches [P2], don’t provide contextualized feedback for each visit and are hence not comprehensive enough to fully surface patterns. P3 also explained that he sometimes evaluates trainees’ patient interactions and provides them with feedback, and that more data-driven feedback would help this process. P5 explained that his peers rely on methods such as self-reflection without data, which can be subjective and prone to their own biases. Patient reports of provider communication behaviors are also typically anonymized and shared as monthly or quarterly summaries, making it hard to reconstruct "who was the patient or which interaction it was" [P2] and leaving the actionability of these surveys poor.
The proposed approach has potential to generalize to the real world — Our model was trained on samples from the EF dataset [62]. To validate the model in an "out of sample" setting, we tested it on unseen data: we manually coded each interaction from the Lab Study (Section 5; n=10) using a protocol similar to Section 3.2. In our post-hoc analysis, the pipeline achieved a mean normalized accuracy per visit of 0.94 (dominance), 0.77 (interactiveness), 0.89 (engagement), and 0.95 (warmth). This shows that the models are capable of performing reasonably even on unseen data from a different distribution.
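As a rough illustration of this post-hoc analysis, the sketch below computes a per-visit normalized accuracy as one minus the scale-normalized mean absolute error between model output and manual codes. The function name, data layout, and exact normalization here are our assumptions for illustration, not the precise metric implementation in the pipeline.

```python
import numpy as np

def normalized_accuracy_per_visit(predicted, coded, scale_width=5.0):
    """Illustrative per-visit accuracy: 1 - mean absolute error,
    normalized by the width of the affect scale (assumed 1-6, width 5)."""
    predicted = np.asarray(predicted, dtype=float)
    coded = np.asarray(coded, dtype=float)
    return 1.0 - np.mean(np.abs(predicted - coded)) / scale_width

# Hypothetical example: one signal (e.g., warmth) across the slices of one visit.
model_scores = [3.1, 3.8, 4.2, 3.9]   # pipeline output per slice
human_scores = [3.0, 4.0, 4.0, 4.0]   # manual coding per slice
print(normalized_accuracy_per_visit(model_scores, human_scores))  # -> 0.97

# Averaging over the n=10 lab-study visits would then yield one number per
# signal, analogous to the 0.94 / 0.77 / 0.89 / 0.95 reported above.
```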
Computational approaches could help reduce observer bias — Compared to their existing methods of communication feedback, which are largely subjective in nature, participants saw merit in ConverSense’s computational approach for self-reflecting on one’s own implicit bias. Human observers, such as providers or physician coaches, can offer impactful personalized feedback due to their expertise and experience, but they always carry some bias shaped by their experiences. One of our participants remarked that the ConverSense computational approach "is independent of overt or underlying bias and can [thus] provide more accurate insights in complex, individualized patient interactions" [P3].
Use in near-real-time and cross-device is valued — Participants concurred that they would prefer a feedback tool that visualizes their communication post hoc, for review after visits. They argued that real-time notifications during patient visits can be intimidating and distracting. Three of the five participants [P1, P4, P5] also said that they would ideally review each visit while their memory of it is still fresh, potentially at the end of each work day. In contrast, P3 advocated for real-time feedback during visits and sporadic interactions with a communication feedback dashboard. All participants agreed that future tools should be accessible on their computer screens in clinics as well as on mobile phones outside the clinic. These findings may imply that future tools should also work in near real time to facilitate this intended use. P2 and P3 further suggested that communication feedback is critical for providers in training, opening up the opportunity to use ConverSense during medical education, with a user group that might amplify the impact of our tool.
Interacting with social signals in a data-driven tool is hard, but can be learned — On average, participants rated the usability of ConverSense on the SUS as just OK, or marginally acceptable (\(\overline{x}=60.55,\ \sigma=22.17\) out of 100). Participants also reported that they would need to learn a lot before using the system (\(\overline{x}=3,\ \sigma=1.58\) out of 5), but that they could imagine learning it quickly (\(\overline{x}=3.6,\ \sigma=1.5\)). The NASA-TLX gave us a view of perceived cognitive load while using ConverSense: participants overall reported low load, with all participants except P3 rating their overall cognitive load below 30 out of 100 and all dimensions except performance below 25 out of 100. Performance (level of success in completing tasks) was, however, rated higher (\(\overline{x}=51,\ \sigma=31.50\) out of 100), placing the perceived demand of completing tasks right at the widely used threshold of 50 [74].
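For readers unfamiliar with SUS scoring, the 0–100 scale reported above comes from the standard formula (Brooke, 1996): each of the ten 1–5 Likert items contributes a 0–4 value, and the sum is scaled by 2.5. The response vector below is illustrative, not a participant’s actual data.

```python
def sus_score(responses):
    """Standard SUS scoring: odd-numbered items contribute (response - 1),
    even-numbered items contribute (5 - response); the 0-40 sum of
    contributions is scaled by 2.5 to a 0-100 range."""
    assert len(responses) == 10
    total = sum(r - 1 if i % 2 == 0 else 5 - r  # i=0 is item 1 (odd-numbered)
                for i, r in enumerate(responses))
    return total * 2.5

# Illustrative response vector for the ten SUS items:
print(sus_score([4, 2, 4, 3, 4, 2, 4, 2, 3, 3]))  # -> 67.5
```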
Line charts can become confusing during overlaps — ConverSense depicts social signal affect scores over time using line charts. Participants, however, found it confusing when lines overlapped one another, as exemplified by P1: "Why did my warmth signal go missing in this time period?". The interactive legend and configurable plot were also not intuitive to users. For example, during the think-aloud exploration, P1 never discovered that the legend was interactive, whereas P2, P4, and P5 spent a lot of time selecting and unselecting signals to reach a desired configuration and became frustrated with the number of clicks required.
6.2 Social Signals as an Approach to Communication Feedback
Social signals alone may not show the complete picture — All participants agreed that social signals can represent more "nuanced" [P1, P3] communication patterns than non-verbal behavioral cues alone. However, these same nuances can make the patterns difficult to fully understand, and participants found them to be incomplete descriptors for communication feedback. For example, one participant stated that each person has their own "definition of [these] social signals" [P3], which makes them "open to various interpretations" [P1]. Not knowing the system’s interpretation led participants to question the validity of the tool. Their repeated attempts to decipher the definition of each social signal were also visible in the number of flips participants performed from one page to another: participants clicked to the "About" page approximately once per view (\(\overline{x}=0.99,\ \sigma=0.39\)) when they were confused. Further, most did not recall the definitions later, leading them to revisit the About page multiple times.
Providers need more contextualized and actionable feedback — All participants mentioned the tool’s inability to provide actionable strategies for improving communication behavior when particular patterns emerge. In particular, P2 was concerned about the inability to act on ConverSense’s feedback and suggested that the tool could benefit from showing strengths and weaknesses in terms of social signals: "I guess that my warmth must have been low in a certain slice, but the tool never told me how to be warmer" [P2]. P3 likewise discussed how the current feedback mechanism lacks the actionable insights that a human observer or evaluator can provide.
Participants also experienced a mismatch of expectations relative to their existing methods of communication feedback from an evaluator (e.g., a coach), which can provide more actionable feedback. Reflecting on his own experience as an evaluator, P3 questioned if and how improving scores for a particular social signal would actually improve patients’ experience:
"Feedback through visualizing social signals deviates from conventional paradigms of assessment feedback [such as visualizing non-verbal gestures]. It does not use discrete behaviors as perceived by the patients that we [come to know] from a patient’s perspective. We don’t have a third eye that is independently gathering information and describing it in domains which are not very discrete, and instead quite ambiguous and uses unfamiliar terms. What exactly is interactiveness? And how do I look at that in a way that might provide commentary on what I do well, or don’t?" [P3]
Finally, when thinking about how to better understand the specific social signal feedback, participants expressed how they would "lose memory of a specific interaction over time and having context about that interaction, could help remember the interaction better" [P5]. Participants also suggested using low affect scores as a mechanism for identifying "clips" from the interaction where communication could be improved.
Comparing provider and patient affect scores helps in understanding actions and reactions — Most participants found it useful to compare their affect scores with those of patients to gain a relative understanding of the dynamics of a visit. They compared the patient scores to their own and even tried to correlate them with their memory of the interactions. P5 found the juxtaposition helpful because it let them map their own behavior against a global baseline; seeing the patient’s warmth score come out similar to their own, they said, "I felt pretty warm to her and I thought she was actually much more cold" [P5]. P2 shared a similar experience and added that "if an interaction did not go well, I would wonder if it was mine or the patient’s fault", and that seeing patient scores helps them understand why they acted or reacted in a certain manner. However, participants also voiced that this approach might make them more defensive when interpreting communication breakdowns, as evidenced by P2: "they weren’t very warm so I wasn’t too warm to them".
Absolute affect scores are difficult to interpret — Participants described the absolute affect scores, the numerical measures of affect shown throughout the dashboard, as difficult to interpret. While the RIAS scale on which the scores are based designates "3" as neutral behavior, participants were not able to determine what makes for a "good score". They expected a more local benchmark, compared either to themselves or to their peers. P1 shared: "my overall affect is 2.86, is that good or bad?... Having a point of reference, either through self-comparison over time or by comparing to others, would provide clearer context for understanding a score like 2.8".
P4 instead interpreted the scores as percentages to judge how far he was from "100%", and found the choice of the RIAS denominator (six) problematic for calculating percentages (see the sketch below). Three participants (P1, P2, P4) demonstrated that they fundamentally aspired to achieve optimal scores, while also acknowledging the perils of chasing numbers in the context of their interactions. P1 described this belief:
"Attaining a score of 6/6 might actually make the providers [to have] excessive empathy, potentially compromising professionalism. A balanced approach, combining rationality and empathy, is crucial for maintaining professional behavior and earning patient trust."
A lack of knowledge of the underlying models might reduce the trust in the system — There were instances where participants disagreed with the tool’s evaluation and expected their affect scores to be higher than those of the patient. This reduced their trust in the affect scores [P1, P3, P4], with P1 even saying "2.8 was too low given the number of concessions [she] made". Perceptions of mistrust were strong enough that participants rated their overall trust in the model only moderate (\(\overline{x}=5.20,\ \sigma=1.49\)) on a scale of 1 to 10. Each participant questioned the interpretability of the SSP models and explained that, while the models were based on non-verbal behaviors, they would only be able to trust them if they knew the specifics. For example, one participant shared:
"I’m curious about the methods and like, how it came up with these things. Do I think it did a good job? Yeah, I mean, I don’t know was I less warm than the patient was? " [P4]
Mistrust expressed by participants reflected three dimensions: (1) Validity (do the signals measure true social behavior?) [P1, P3], (2) Performance (how well do the audio features map to the perceived definition of the signal?) [P1, P2, P5], and (3) Interpretability (why did I score so low?) [P2, P4].