To understand how people would like to augment their speech with on-the-fly visuals, we conducted two brainstorming sessions. These results informed a design space for visually augmenting conversations.
3.2 Design Space for Augmenting Verbal Communication with Visuals
Two researchers organized participants’ responses using the affinity diagram approach. Informed by the resulting set of low-level and high-level themes, we developed a design space for systems that augment verbal communication with visuals. We followed design space analysis methods [6] and held iterative discussion sessions, identifying eight key dimensions, as detailed in Figure 2.
D1. Temporal. To augment verbal communication with visuals, systems can be either synchronous or asynchronous. The majority of prior systems provide augmentations asynchronously: users either have to set up the corresponding visuals beforehand (e.g., pre-configuring visuals for an upcoming presentation and triggering them with gestures or keywords [42]) or select and edit visuals after the text is composed [25, 51]. Our system falls under the paradigm of synchronous augmentation, where users select appropriate visuals on-the-fly while engaging in conversations.
D2. Subject. Visual augmentations of spoken language can be used either by the speaker to express their ideas (visualize their own speech) or by the listener to understand others (visualize others’ speech). The majority of prior art in this domain falls under the former paradigm, where speakers select and design visuals to support their own speech. Our system aims to support both roles, allowing all parties to visually supplement their own speech and ideas.
D3. Visual. Participants in our formative study wished to augment speech with a variety of visuals. We identified three main aspects to consider when providing visual augmentations:
(1) Visual Content — what information should be visualized? A segment of speech contains different pieces of information that can be visualized. For example, consider the statement “I went to Disneyland with my family last weekend”. One could visualize the generic term Disneyland, a picture of the speaker (“I”), or more specific, contextual information such as me and my family at Disneyland. The system should be able to disambiguate the most critical and relevant information to visualize in the current context.
(2) Visual Type — how should the visual be presented? There are often multiple ways to present a visual, ranging from abstract to concrete. For example, the term Disneyland could be visualized as an icon of Disneyland, a photo of Disneyland, an interactive 3D map of Disneyland, or a video of people riding a roller coaster. While these visuals convey similar meanings, they evoke different levels of attention and provide different levels of detail.
(3) Visual Source — where should the visual be retrieved from? Different sources can be used for visual augmentations, including both personal and public assets. One might want to retrieve personal photos from one’s own phone, or public images from the internet. While personal photos provide more contextual and specific information, images from the internet provide more generic information with fewer privacy concerns.
Our system leverages a large language model that considers the context of the conversation and identifies the most appropriate visual content, type, and source to suggest.
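To make the three aspects of D3 concrete, the sketch below frames the model’s prediction target as a small structured record per utterance. It is a minimal illustration under assumed names (VisualIntent, suggestVisual, and an injected completeWithLLM callback, with example type and source values); it is not the system’s actual interface.

```typescript
// Illustrative sketch only (assumed names, not the paper's implementation):
// the prediction target is a small structured record covering the three
// aspects of D3, produced by prompting an LLM with the latest utterance.

type VisualType = "emoji" | "icon" | "clip-art" | "photo" | "gif" | "video" | "3d-map";
type VisualSource = "personal-album" | "web-search";

interface VisualIntent {
  content: string;      // D3(1): what to visualize, e.g. "me and my family at Disneyland"
  type: VisualType;     // D3(2): how to present it, from abstract to concrete
  source: VisualSource; // D3(3): where to retrieve it from
}

// `completeWithLLM` is an injected callback standing in for any LLM
// completion endpoint; it is an assumption, not a specific API.
async function suggestVisual(
  transcript: string,
  completeWithLLM: (prompt: string) => Promise<string>,
): Promise<VisualIntent> {
  const prompt =
    `Given the last sentence of a conversation, suggest a relevant visual.\n` +
    `Sentence: "${transcript}"\n` +
    `Answer as JSON with fields "content", "type", and "source".`;
  return JSON.parse(await completeWithLLM(prompt)) as VisualIntent;
}
```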
D4. Scale & D5. Space. Visual augmentations can be used in various communication scenarios, including one-on-one meetings, one-to-many lectures, and many-to-many discussions. The number of participants and their location (e.g., in-person vs. remote) can greatly affect best practices for such visual augmentations. We developed Visual Captions on top of existing video conferencing software to augment meetings at different scales, supporting one-on-one, one-to-many, and many-to-many scenarios.
D6. Privacy. Visual augmentations should take privacy into consideration from the outset and allow users to select among multiple privacy options: 1) Privately shown visuals are presented only to the speaker and are invisible to the audience. 2) Publicly shown visuals are presented to everyone in the conversation. 3) In between, visuals can be selectively presented to a subset of the audience. We provide speakers with options 1) and 2), along with the ability to privately preview visualizations before displaying them to the audience. Listeners can also use our system to privately see visuals based on the speech they hear.
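As a minimal illustration of how these visibility options could gate what each participant sees, the sketch below models the three options as a discriminated union; the names are assumptions rather than the system’s API.

```typescript
// Illustrative sketch of the three visibility options in D6 (assumed names).

type Visibility =
  | { kind: "private" }                        // 1) visible only to the speaker
  | { kind: "public" }                         // 2) visible to everyone
  | { kind: "selective"; viewers: string[] };  // 3) visible to a chosen subset

// Decide whether a given participant should see a displayed visual.
function canSee(visibility: Visibility, viewerId: string, speakerId: string): boolean {
  switch (visibility.kind) {
    case "private":
      return viewerId === speakerId;
    case "public":
      return true;
    case "selective":
      return viewerId === speakerId || visibility.viewers.includes(viewerId);
  }
}
```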
D7. Initiation. Six participants in our formative study wanted to spend minimal effort generating visuals during a conversation, and therefore preferred that the system proactively provide visual augmentations without user interaction. Other participants, however, wanted more control over the visuals, including when to trigger them and what to show. To meet these different preferences, we designed three levels of AI proactivity: on-demand-suggest, auto-suggest, and auto-display.
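A minimal sketch of how these three levels might gate the suggestion pipeline follows; the function and parameter names are illustrative assumptions, not the system’s actual code.

```typescript
// Illustrative sketch of the three proactivity levels in D7 (assumed names).

type Proactivity = "on-demand-suggest" | "auto-suggest" | "auto-display";

interface Suggestion { imageUrl: string; caption: string; }

function handleUtterance(
  level: Proactivity,
  userRequested: boolean,                      // explicit trigger, e.g. a space-bar press (D8)
  suggest: () => Suggestion[],                 // LLM-backed suggestion step (D3)
  showCandidates: (s: Suggestion[]) => void,   // speaker previews and picks what to share
  display: (s: Suggestion) => void,            // visual is shown to the audience directly
): void {
  if (level === "on-demand-suggest") {
    // Suggestions appear only when the user explicitly asks for them.
    if (userRequested) showCandidates(suggest());
  } else if (level === "auto-suggest") {
    // Suggestions appear automatically; the user still confirms what to display.
    showCandidates(suggest());
  } else {
    // Fully proactive: the top suggestion is displayed without confirmation.
    const [top] = suggest();
    if (top) display(top);
  }
}
```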
D8. Interaction. Participants mentioned six domains of potential interactions: speech (e.g., “let’s show an image here”), gesture (e.g., pinching), body pose (e.g., waving hands), facial expression (e.g., to trigger emojis), gaze (e.g., for selecting visuals from suggestions), and custom input devices (e.g., a controller). We support traditional input devices (e.g., keyboard, mouse, and touch screens) given their universal use in video conferencing. In Visual Captions, we trigger visual generation by understanding the language via speech-to-text engines or by the user pressing the space bar. In the future, we can expand these capabilities to enable interaction via body pose, facial expressions, and other devices.
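The sketch below shows one way the two supported triggers (continuous speech-to-text and an explicit space-bar press) could be wired up in a browser-based client. It uses the browser’s Web Speech API purely as a stand-in speech-to-text engine and is an illustration, not the system’s actual implementation.

```typescript
// Illustrative wiring of the two trigger paths in D8 (assumed structure).
// `onTrigger` would hand the latest transcript to the suggestion pipeline.

function wireTriggers(onTrigger: (transcript: string) => void): void {
  let latestTranscript = "";

  // Path 1: implicit trigger from a speech-to-text engine.
  // The Web Speech API is used here only as a stand-in STT engine.
  const recognition = new (window as any).webkitSpeechRecognition();
  recognition.continuous = true;
  recognition.interimResults = true;
  recognition.onresult = (event: any) => {
    latestTranscript = event.results[event.results.length - 1][0].transcript;
    onTrigger(latestTranscript); // downstream proactivity settings decide what to do (D7)
  };
  recognition.start();

  // Path 2: explicit trigger when the user presses the space bar.
  window.addEventListener("keydown", (event) => {
    if (event.code === "Space") onTrigger(latestTranscript);
  });
}
```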
During the sessions, participants discussed various potential use cases for Visual Captions and how they believed it would be helpful (Figure 3). Many participants expressed a desire to use Visual Captions in educational and casual scenarios, and said that they would appreciate the addition of visuals to make conversations more informative and clear.