1 Introduction
Mutual awareness of visual attention, i.e., the ability to identify collaborators' visual attention, is crucial for successful collaboration [22, 26, 77, 99, 111]. As such, prior studies have shown that introducing bi-directional visual attention cues in collaborative VR can improve mutual awareness [53]. Although such cues offer improvements over having no visual attention cues in virtual collaborative environments, correctly representing users' attention remains an open challenge. Field-of-view-based visualisations only provide an estimate of visual attention [16], while pointer-based cues that use natural pointing modalities, such as eye gaze [44] and head [115], do not afford a dynamic visual representation of different types of attention (e.g., focused and distributed [103]). Moreover, there are inherent limitations to using natural pointing modalities to represent visual attention. For example, attention cues based on gaze input can be distracting for an observer due to natural looking behaviour [119], or 'confusing' when there is a misalignment between a collaborator's verbal references and the depicted eye-gaze location due to eye-tracker calibration issues [26].
In this paper, we explore how combining an existing field-of-view-based visual attention cue ('Cone of Vision' [16]) with verbal communication can improve gaze inference and mutual awareness for exploratory data analysis in VR. The Cone of Vision (CoV) is a visual attention cue developed by Bovo et al. [16] that leverages head behaviour to represent users' attention within their field of view (FoV) more accurately. Existing FoV-based techniques display the entire area within a user's vision. Although the CoV narrows the FoV (based on gaze probability in head coordinates), the visualisation can still contain a high density of information within the 'cone'. By using speech to direct the CoV region towards the visual elements mentioned in verbal communication, we can create an adaptive multi-modal approach that continuously refines the focus of visual attention towards such elements. Figure 1 shows how the combination of CoV+Speech can narrow down the CoV visual attention cue based on keywords uttered by a collaborator during exploratory data analysis.
Our proposed approach of combining CoV and verbal communication mirrors how collaborators communicate in face-to-face settings. Research has shown that collaborators often leverage cues from multiple modalities, including verbal cues, to gauge each other's visual attention [117]. In particular, they first understand the general orientation of their collaborators (i.e., by evaluating the general direction of their head gaze) and then confirm or refine the location of the visual context using verbal communication [26, 89]. Collaborative verbal communication is also used as a fallback method when visual cues are not accurate enough or when there are calibration errors [117]. Further, our approach is well suited to the cross-virtuality analytics (XVA) context because XR headsets typically have access to head and verbal behaviour, while they do not always have eye-tracking capabilities. This makes our approach widely applicable within the XR device ecosystem. Given the potential benefits of our proposed technique, we aim to answer the following research question: How does speech in conjunction with head behaviour impact joint attention and gaze inference during collaboration? To address this question, we designed and conducted a within-group study that compares three conditions: CoV, CoV+Speech, and Eye-Gaze Cursor. In the study, ten pairs of participants performed collaborative exploratory data analysis tasks using three different dataset visualisations, testing each of the three conditions. In the CoV condition, we modelled a cone of vision using a statistical model of gaze probability and projected it onto the VR screens [16]. In the CoV+Speech condition, we processed the collaborative verbal communication using speech recognition as input to narrow the CoV around the enunciated elements of the visualisation. Lastly, we added the Eye-Gaze Cursor condition, a widely used method to represent visual attention, in which we mapped the raw eye-gaze position to a live cursor.
Our results showed that speech recognition did not lead to better joint attention than CoV alone, due to lag and limited speech recognition accuracy. To further investigate the potential of verbal communication to negotiate shared attention, we performed a follow-up analysis in which we transcribed the verbal communication collected during the study using a highly accurate speech-to-text model. The transcription allowed us to analyse the types of verbal references used by participants, which validated our assumption that the most common form of communication relies on explicit keyword utterances rather than implicit verbal references or pointing-based communication. The transcription also allowed us to perform an offline approximation of eye gaze using speech as an input: our proposed method improves eye-gaze approximation accuracy by 50 pixels when the CoV regions do not constrain the eye gaze. This suggests that speech has the potential to improve the shared context of visual attention. Therefore, we also release the collected data and accurate transcriptions as a dataset, which, to the best of our knowledge, is the first dataset of collaborative head, eye, and transcribed speech behaviour. Our findings and dataset contribute to a deeper understanding of verbal communication and gaze during collaboration.
Furthermore, we were able to estimate the impact that CoV and CoV+Speech have on individual visual attention by testing the statistical model of gaze on which the CoV cues are based. In the Eye-Gaze Cursor condition, the gaze distribution followed the earlier model (approximately 70% of gaze samples fell within the non-displayed CoV). In contrast, in the CoV conditions the eye-gaze distribution was considerably narrower: more than 85% of gaze samples fell within the displayed CoV. In other words, when head-based visual attention cues are visible (i.e., CoV), head gaze becomes a better predictor of eye gaze than when they are not used. Our study also enabled us to compare bidirectional head-gaze visual attention cues with eye-tracking cues, finding that CoV cues foster joint attention equally well or better than eye-tracking cues. We measured joint attention as the fraction of concurrent gaze on the same area of interest (AOI), a method used in prior research to analyse attention to individual objects [84], at two resolutions: the chart level and the screen level (a brief sketch of this computation is given after the contribution list). We discuss the implications of this finding in the discussion and conclusion sections. The contribution of this paper is threefold:
(1) A novel FoV-based visual cue for collaboration that dynamically changes size based on verbal communication to balance broad and narrow information.
(2) The results of a study comparing three visual cues for collaboration during an exploratory data analysis task in VR. Results showed that our proposed approach approximates eye gaze better than head gaze alone. Moreover, head-based visual attention cues foster joint attention equally well or better than eye-tracking visual attention cues.
(3) A dataset containing the verbal, head, and eye behaviour of ten pairs of participants collaborating in VR.
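To make the joint-attention measure concrete, the following Python sketch computes the fraction of time-aligned gaze samples that fall on the same AOI. The function name, the label format, and the choice to exclude samples that hit no AOI are illustrative assumptions, not the study's analysis code.

```python
def joint_attention_fraction(aoi_a, aoi_b):
    """Fraction of time-aligned gaze samples on which two collaborators
    attend to the same area of interest (AOI).

    `aoi_a` and `aoi_b` are equally sampled sequences of AOI labels, e.g.
    "screen2/chart3" at chart resolution or "screen2" at screen resolution;
    None marks samples that hit no AOI (excluding them is an assumption of
    this sketch).
    """
    n = min(len(aoi_a), len(aoi_b))
    pairs = [(a, b) for a, b in zip(aoi_a[:n], aoi_b[:n])
             if a is not None and b is not None]
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)


def to_screen_level(labels):
    """Coarsen chart-level labels ("screen2/chart3") to screen level ("screen2")."""
    return [lab.split("/")[0] if lab is not None else None for lab in labels]
```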
3 Study
We designed a within-group study that compares three conditions: Cone of Vision (CoV), Cone of Vision+Speech (CoV+Speech), and Eye-Gaze Cursor (Figure 3). Participants were embodied in a ReadyPlayerMe avatar and could also use a hand pointer to reference the observed dataset. We counterbalanced the three conditions using a Latin square; however, due to the number of participants, one order had one more pair of participants than the others.
3.1 Conditions
3.1.1 Cone of Vision (CoV).
Inspired by Bovo et al. [16], we use a graphic element different from the classic FoV frustum, called the Cone of Vision (CoV) (Figure 4). Geometrically, this visualisation is obtained by intersecting a cone, whose vertex lies at the centre of the head and whose axis is parallel to the head direction, with the observed 2D surface. This depiction is designed to work with data displayed on 2D surfaces (such as panels or VR screens) that are immersed in a 3D scenario (Figure 3). The main difference between the FoV frustum and the CoV is their spatial dimensionality: 3D for the former and 2D for the latter. However, both convey probabilistic information about the gaze location, since they are aligned with the head. The CoV is displayed as the contour surrounding the area with a 70% probability of containing the users' fixations, derived from the dataset of Agtzidis et al. [1].
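As an illustration of this geometric construction, the Python sketch below approximates the CoV contour by sampling rays on the boundary of a head-aligned cone and intersecting them with the screen plane. The half-angle value, function names, and sampling density are assumptions made for illustration; the actual contour is derived from the fixation dataset of Agtzidis et al. [1].

```python
import numpy as np

def cov_contour(head_pos, head_fwd, head_up, plane_point, plane_normal,
                half_angle_deg=15.0, n_samples=64):
    """Approximate the Cone-of-Vision contour on a flat VR screen.

    Rays on the boundary of a cone (apex at the head, axis along the head
    direction) are intersected with the screen plane.  The half-angle here is
    an illustrative placeholder for the value that encloses ~70% of fixations
    in head coordinates.
    """
    fwd = head_fwd / np.linalg.norm(head_fwd)
    right = np.cross(fwd, head_up)
    right /= np.linalg.norm(right)
    up = np.cross(right, fwd)
    theta = np.radians(half_angle_deg)

    contour = []
    for phi in np.linspace(0.0, 2.0 * np.pi, n_samples, endpoint=False):
        # Boundary ray: tilt the forward axis by the half-angle at azimuth phi.
        d = (np.cos(theta) * fwd
             + np.sin(theta) * (np.cos(phi) * right + np.sin(phi) * up))
        denom = np.dot(d, plane_normal)
        if abs(denom) < 1e-6:          # ray parallel to the screen plane
            continue
        t = np.dot(plane_point - head_pos, plane_normal) / denom
        if t > 0:                      # keep only intersections in front of the head
            contour.append(head_pos + t * d)
    return np.asarray(contour)
```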
3.1.2 Cone of Vision+Speech (CoV+Speech).
The second visual cue combines the CoV with the effects of the user's verbal interaction with the system. Although we use the same CoV calculation as in the CoV condition, we designed a novel algorithm that modifies the CoV after processing the user's speech. Section 3.2.1 describes the dataset contained in the HTML pages rendered by the virtual screens in the 3D office. To extract the semantics of the speech, we first capture the audio of the user talking to their collaborator. We then stream this audio to an online speech service that transcribes it and returns a string, which we parse and process with NLP algorithms. The HTML context is then searched for elements that contain the keywords isolated by the NLP step; we retrieve their bounding-box coordinates within the browser page, convert these local coordinates into world coordinates, and add them to a list. The next phase is the modification of the current CoV. To reduce the CoV size, we determine the principal component of the keyword coordinates by fitting a linear regression, and then calculate the standard deviation of the points along the principal component and along its orthogonal direction. We then draw an ellipse centred on the centre of mass of the points, with the two standard deviations as its two radii. Finally, we interpolate between the CoV points and the ellipse points by a factor of 0.5 (Figure 5 (d)). The visual cue is displayed as shown in Figure 5 (e). This condition includes two different input channels: head position and orientation coming from the HMD hardware, and an analysis that converts speech into a morphing function of the CoV. While the CoV calculation relies on internal code, part of the speech processing relies on an external service.
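The following Python sketch illustrates the narrowing step, assuming 2D screen-space coordinates. Function names are illustrative, the principal direction is obtained here from an SVD rather than the linear-regression fit described above, and the point correspondence used for the interpolation (projecting each contour point onto the ellipse along the ray from the ellipse centre) is a simplification of the procedure.

```python
import numpy as np

def narrow_cov(cov_points, keyword_points, blend=0.5):
    """Shrink the CoV contour towards the screen elements named in speech.

    `keyword_points` are the 2D centres of the bounding boxes of the uttered
    keywords; `cov_points` is the current CoV contour.  An ellipse is built
    around the keyword points (centre of mass, radii equal to the standard
    deviations along the principal direction and its orthogonal), and each
    contour point is interpolated towards the ellipse by `blend`.
    Assumes at least two keyword points; with fewer, the ellipse degenerates.
    """
    pts = np.asarray(keyword_points, dtype=float)
    contour = np.asarray(cov_points, dtype=float)
    centre = pts.mean(axis=0)

    # Principal direction of the keyword cloud and its orthogonal.
    centred = pts - centre
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    major = vt[0]
    minor = np.array([-major[1], major[0]])

    # Ellipse radii: standard deviations along the two directions.
    a = max((centred @ major).std(), 1e-6)
    b = max((centred @ minor).std(), 1e-6)

    # For each contour point, find the ellipse point along the ray from the
    # ellipse centre, then blend the contour towards it.
    rel = contour - centre
    um, un = rel @ major, rel @ minor
    scale = 1.0 / np.sqrt((um / a) ** 2 + (un / b) ** 2 + 1e-12)
    ellipse_pts = centre + rel * scale[:, None]
    return (1.0 - blend) * contour + blend * ellipse_pts
```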
3.1.3 Eye-Gaze Cursor.
The Eye-Gaze Cursor condition displays a graphical cursor at the gaze location on the VR screen. The position is computed by intersecting the gaze direction with the VR screen. The cursor is visualised as a ring (Figure 3 (b)) with a radius of \(\tfrac{1}{3}\) of the radius of the foveal region, a value determined through pilot testing to ensure that the cursor is noticeable. The eye-gaze cursor is subject to noise from the eye-tracker.
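For illustration, a minimal sketch of the underlying ray-screen projection is given below; it maps a gaze ray to normalised screen coordinates and is a generic reconstruction, not the Unity code used in the study.

```python
import numpy as np

def gaze_cursor_uv(eye_pos, gaze_dir, screen_origin, screen_right, screen_up):
    """Project a gaze ray onto a rectangular VR screen and return (u, v).

    `screen_origin` is the screen's bottom-left corner in world space and
    `screen_right`/`screen_up` are its (non-normalised) edge vectors, so the
    returned (u, v) lie in [0, 1] when the gaze lands on the screen.
    """
    d = gaze_dir / np.linalg.norm(gaze_dir)
    normal = np.cross(screen_right, screen_up)
    denom = np.dot(d, normal)
    if abs(denom) < 1e-6:
        return None                       # gaze parallel to the screen
    t = np.dot(screen_origin - eye_pos, normal) / denom
    if t <= 0:
        return None                       # screen is behind the user
    hit = eye_pos + t * d
    rel = hit - screen_origin
    u = np.dot(rel, screen_right) / np.dot(screen_right, screen_right)
    v = np.dot(rel, screen_up) / np.dot(screen_up, screen_up)
    return u, v
```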
3.2 Apparatus
For our study, we provided each participant with a Pico Neo 2 Eye HMD with a 4K resolution (3840 × 2160) and a 75 Hz refresh rate. The HMD has an embedded Tobii eye-tracker that runs at 90 Hz with a declared accuracy of 0.5 degrees. For verbal communication, we set up a Microsoft Teams connection with Bluetooth headphones and a microphone, while we used the Pico Neo 2 microphone to capture the audio stream for transcription. We designed and implemented our collaborative VR application with Unity 2020.3.34f1, in which two users shared the same digital (but not physical) space, with three different ways of exchanging visual cues during the exploratory data analysis task. The visual cues of each user are displayed in two different colours: red for the local visual cue and green for the remote visual cue. All cues are refreshed in each Unity loop at a fixed rate of 50 Hz.
3.2.1 VR Environment.
We designed the 3D scene with the participants positioned in two locations close to each other, in front of four panels arranged as in Figure 2. We developed a convex egocentric layout in line with the findings of [60, 97] described in section 2.1; however, since we are not limited by physical space constraints as in [60], and we have more participants than [97], we set our environment to have a radius of 3 m compared to the 2 m setup in [97] (Figure 2). The participants remained in their initial locations for all sessions and could not translate their avatars, to avoid occluding the other participant's view. The avatars were created and imported from ReadyPlayerMe: we used the torso version of a custom avatar and implemented lip and eye synchronisation. The four VR screens contain charts rendered by a browser embedded in the panels, which decodes HTML/JavaScript files containing the datasets. The positions of the keywords in this HTML are extracted for use in the CoV+Speech condition.
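As a sketch of this layout, the snippet below places four panels on a 3 m arc facing the participants; the arc span and panel height are illustrative assumptions, since only the radius is specified above.

```python
import numpy as np

def panel_transforms(n_panels=4, radius=3.0, arc_deg=120.0, height=1.6):
    """Place n_panels on a circular arc facing a shared centre point.

    Returns (position, yaw_deg) pairs for a convex egocentric layout: panels
    sit on an arc of `radius` metres around the participants and face inward.
    Arc span and panel height are placeholder values.
    """
    half = np.radians(arc_deg) / 2.0
    angles = np.linspace(-half, half, n_panels)
    transforms = []
    for a in angles:
        pos = np.array([radius * np.sin(a), height, radius * np.cos(a)])
        yaw = np.degrees(np.arctan2(-pos[0], -pos[2]))  # face the origin
        transforms.append((pos, yaw))
    return transforms
```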
3.2.2 Data Visualisations.
The three datasets used during the experiment are the "Hollywood movie gender bias" dataset based on the Bechdel Test [12], the success of Hollywood movies with information taken from IMDb [50], and the insurance risk for cars taken from the UCI machine learning repository [98]. These datasets were also used in collaborative analysis tasks by Bovo et al. [16]. Each test includes 38 views on seven screens, one of which contains instructions. The visualisations contain scatter plots, stacked bar plots, histograms, and box-and-whisker plots. The dataset is stored in a GitHub repository, and the charts are available via an accompanying link.
3.2.3 Speech to Text.
Real-time captioning services provide transcriptions of spoken information from audio streams. Google Speech-to-Text [39], Microsoft Cognitive Services [75], Dialogflow [38], IBM Watson [49], and Amazon Transcribe [4] are the most widely used services offering a real-time transcription API. We chose Google Speech-to-Text as the service compatible with our platform (Android) and framework (Unity) requirements. The Google Speech-to-Text service can be configured with parameters such as language, sample rate, automatic punctuation, and context adaptation. We ran several pilots to optimise accuracy and reduce the latency of the service. We used audio with a sample rate of 16,000 Hz, set the language to British English (and consequently recruited participants who were native British speakers), and filtered out punctuation.
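For illustration, the following Python sketch shows an equivalent configuration using Google's Cloud Speech-to-Text client; the study streamed audio from a Unity/Android application, so this is a reconstruction of the configuration parameters rather than the code used in the system.

```python
from google.cloud import speech

# Configuration mirroring the study settings: 16 kHz audio, British English,
# no automatic punctuation.
client = speech.SpeechClient()

recognition_config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-GB",
    enable_automatic_punctuation=False,
)
streaming_config = speech.StreamingRecognitionConfig(
    config=recognition_config,
    interim_results=True,   # receive partial transcripts with lower latency
)

def transcripts(audio_chunks):
    """Yield final transcript strings for a stream of raw 16-bit PCM chunks."""
    requests = (speech.StreamingRecognizeRequest(audio_content=chunk)
                for chunk in audio_chunks)
    for response in client.streaming_recognize(streaming_config, requests):
        for result in response.results:
            if result.is_final:
                yield result.alternatives[0].transcript
```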
3.3 Participants
We recruited 20 participants over two weeks (13 women, 7 men; \(M_{Age} = 29.4\), \(SD_{Age} = 9.1\)) through an online platform managed by University College London. We applied several inclusion criteria during screening: being a native English speaker, having normal vision, and holding at least a high-school degree. The latter criterion was intended to ensure that participants had sufficient knowledge to interpret the graphs of the visualisations. In addition, we required participants to be confident in interpreting the charts included in the study, which consisted of bar or candlestick plots, histograms, and scatter plots. Each participant self-assessed this confidence with a questionnaire, and we summarised the characteristics of these plots during the task presentation. One participant declared being an expert VR user, five reported average experience, six described themselves as occasional users, eleven reported low experience, and four had no experience. Participants received compensation of £15 each for a 90-minute study. We incentivised participants to perform at their best by introducing an additional reward of £15 each for the pair that reported the highest number of valid insights. We recommended that participants collaborate instead of splitting their attention across different visualisation areas.
3.4 Procedure
Upon arrival, participants were asked to read the information sheet and sign the consent form. We carried out the experiment in the lab using two separate offices, one for each participant. Next, we explained the duration of the task and the three experimental conditions, allowing participants to test each for 1 min. We then asked participants to perform an exploratory data analysis task, extracting insights from the displayed visualisations. We provided participants with examples of valid insights. In our context, a valid insight is a recorded utterance that conveys a precise and deep understanding of two or more measures displayed on a graph or a series of graphs [80]. Next, we explained how to record insights and use the hand pointer. Once the instructions were clear, participants were asked to wear the Pico Neo 2 Eye, perform an eye-tracking calibration, connect to the virtual environment, and start the collaborative task. After all VR trials (i.e., experimental conditions), we asked participants to complete the questionnaire (section 4.3).
At the end of the experiment, we conducted semi-structured interviews with each participant individually. Participants were asked to report cases in which each experimental condition helped with the assigned task and cases in which it did not. The study lasted between 75 and 90 minutes (M = 80, SD = 12.7), and each trial lasted approximately 10 minutes (M = 13 min, SD = 3 min). The stop condition was reached when the time limit (13 minutes) was up.
3.5 Offline Analysis Methods
To understand the role of verbal communication in negotiating shared visual attention, we transcribed the recorded audio to obtain high-accuracy transcriptions (section 3.5.1). We analysed the transcribed data to quantify how often participants uttered displayed keywords to reference the data and how often they used alternative referencing methods (section 3.5.2). Furthermore, we evaluated whether participants' utterances can be used in conjunction with head direction to refine eye-gaze inference (section 3.5.3).
3.5.1 Speech to Text.
We used an offline speech recognition system to analyse the audio recordings with higher accuracy than the real-time system used in the study. This framework, released in September 2022, is the open-source project Whisper [93], developed by OpenAI. The system is trained on many hours of multilingual spoken language; its end-to-end architecture is based on an encoder-decoder transformer [110] and produces very accurate transcriptions. We used Whisper with Python 3.8.3 and PyTorch 1.12.3 [86]. The manual analysis described in section 3.5.2 ensured the transcription quality.
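A minimal sketch of this offline transcription step with the openly available Whisper package is shown below; the model size and file naming are illustrative assumptions, not the exact settings used in the analysis.

```python
# Offline transcription sketch using OpenAI's Whisper
# (https://github.com/openai/whisper).
import whisper

model = whisper.load_model("medium.en")   # English-only model; size is a placeholder

def transcribe_trial(audio_path):
    """Return timestamped segments for one recorded trial."""
    result = model.transcribe(audio_path, language="en")
    return [(seg["start"], seg["end"], seg["text"].strip())
            for seg in result["segments"]]

# Example usage with a hypothetical per-pair recording file.
segments = transcribe_trial("pair01_trial_cov_speech.wav")
```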
3.5.2 Classification of verbal references.
We quantified how often participants uttered displayed keywords to reference the data and how often they used alternative methods, such as pointing gestures or implicit references to visual cues.
We started by merging the verbal reference taxonomies from D’Angelo and Begel [26] (i.e., remote pair programming; Table 1 (1, 3, 8)) and Pettersson et al. [89] (i.e., collaboration over tabletop map visualisations; Table 1 (3, 5)) to include both text and visual element classes in the same context. We expanded the resulting taxonomy by considering novel verbal references such as sequential statements (Table 1 (2)) that rely on an implicit left-to-right (LTR) directional bias [33]. Furthermore, we added context/user-relative references (Table 1 (4, 6)) and temporal references (Table 1 (7)). The transcripts were analysed alongside video and audio recordings to gather the context of non-verbal communication (e.g., pairs being mutually aligned or oriented in opposite directions, performing pointing gestures with a laser pointer, etc.). Three coders performed the analysis: each transcribed trial was analysed by one coder and then reviewed by a second; a third coder resolved any disagreement between the first two. The coders rotated roles for each trial. For each transcribed sentence, we identified whether it contained a verbal element aimed at identifying or changing the focus of the collaborative exploratory data analysis task with respect to the visualised data. If the sentence contained a visual-context negotiation, it was classified (using the aforementioned classes), and we identified which area of interest the verbal communication was aimed at (i.e., data, chart, page). After classifying all transcriptions, we counted the number of occurrences for each pair of participants in each experimental condition. The difference between the “Keyword” class and all other classes was immediately apparent, as the Keyword class was more prevalent than all other classes combined.
3.5.3 Head+Speech Gaze inference.
We evaluated whether participants' keyword utterances can be used in conjunction with head direction to refine eye-gaze inference. To perform a fairer analysis, we considered only the data segments in which verbal communication was used. Figure 9b shows the steps we used to calculate the accuracy of the head-gaze and head-gaze+speech methods with respect to the ground truth, the eye-gaze information. Firstly, our model focuses on bi-grams, i.e., the last two words spoken by the user at any point in time; this number of words performed best among the n-grams we tested. Secondly, we ran the well-established text similarity metric Recall-Oriented Understudy for Gisting Evaluation [68], in its longest-common-subsequence variant (ROUGE-L), against all the keywords located inside the CoV. This similarity metric ranges between 0 and 1, and we kept only positive scores. We then determined the bounding boxes of the accepted keywords and selected the closest box by calculating its Euclidean distance to the head-gaze point. Finally, we calculated the RMSE of head gaze and head gaze+speech with respect to the eye gaze.
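The Python sketch below illustrates this pipeline: an LCS-based F-score in the spirit of ROUGE-L between the bi-gram and each candidate keyword, selection of the closest matching bounding box to the head-gaze point, and the RMSE against the eye-gaze ground truth. Data structures and function names are illustrative, not taken from the analysis code.

```python
import numpy as np

def rouge_l(candidate, reference):
    """LCS-based F-score between two token lists (ROUGE-L-style similarity)."""
    m, n = len(candidate), len(reference)
    if m == 0 or n == 0:
        return 0.0
    dp = np.zeros((m + 1, n + 1), dtype=int)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if candidate[i - 1] == reference[j - 1]:
                dp[i, j] = dp[i - 1, j - 1] + 1
            else:
                dp[i, j] = max(dp[i - 1, j], dp[i, j - 1])
    lcs = dp[m, n]
    precision, recall = lcs / m, lcs / n
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def speech_refined_gaze(bigram, keywords_in_cov, head_gaze_px):
    """Pick the keyword bounding box matching the last two spoken words.

    `bigram` is a list of the last two spoken tokens; `keywords_in_cov` maps
    keyword text to the centre (x, y), in screen pixels, of its bounding box
    inside the current CoV.  Among keywords with a positive similarity score,
    the one closest to the head-gaze point is returned as the speech-refined
    estimate; if no keyword matches, the head gaze is kept.
    """
    head = np.asarray(head_gaze_px, dtype=float)
    best, best_dist = None, np.inf
    for text, centre in keywords_in_cov.items():
        if rouge_l(bigram, text.lower().split()) <= 0.0:
            continue
        dist = np.linalg.norm(np.asarray(centre, dtype=float) - head)
        if dist < best_dist:
            best, best_dist = np.asarray(centre, dtype=float), dist
    return best if best is not None else head

def rmse(estimates, eye_gaze):
    """RMSE (in pixels) of gaze estimates against eye-tracking ground truth."""
    diffs = np.asarray(estimates, dtype=float) - np.asarray(eye_gaze, dtype=float)
    return float(np.sqrt((np.linalg.norm(diffs, axis=1) ** 2).mean()))
```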
5 Qualitative Analysis
Post-experiment semi-structured interviews were audio-recorded, fully transcribed, and analysed through thematic analysis [19]. Our research questions focused on verbal communication as input for visual attention cues and, more broadly, on the role of verbal communication in collaborative exploratory data analysis. The codes for the analysis were initially based on our research questions; we therefore focused on capturing aspects related to the perception of the cues, comments about verbal communication, and the impact of the CoV. However, we also included codes that emerged from the interviews, such as lag and accuracy issues with the speech-to-text technology, workarounds when the visual attention cues lacked precision, or the CoV helping individuals to focus. The resulting 30 codes were grouped into three themes, reported in the following subsections.
5.1 Comparing the different visual cues
Although most participants reported the eye-tracking condition as their favourite due to its precision, some noted that it did not allow them to focus on the charts and, for this reason, preferred the larger contoured region of the cones. For example, we heard:
“Eye-tracking was helpful If I was trying to say something specific. But then if either of us were talking about something broader then it would not be helpful because you missed the bigger picture. [P12]", or: “...so sometimes during a discussion you are not talking about the specific data point but more about the broader and the specific cursor led me to focus on one thing at the detriment of other facts. [P16]"
Some participants reported that gaze movements were hectic and distracting. For example, P13 mentioned: "...it was like really distracting as I have ADHD, so it’s hard to focus its hard to concentrate on the task, so I could not focus on my collaborator’s visual cue because it was very confusing". Similarly, P3 said: "...the eye-tracking visual cue it felt like he’s pulling me away from the where I need to focus...".
Most participants reported the CoV to be most useful during the initial phase of mutual alignment. For example: “it was very helpful to get aligned initially and so just for the moment what he needs to align and maybe the moment when the other one goes away [P4]" or: “it was helpful when the person I was collaborating with was talking about something, but I didn’t know where she was looking for pictures so I could see the different colour area and turns towards it [P7]". P15 said that the CoV was useful to confirm the two participants were looking at the same thing: “I found it helpful because I knew where she was looking, so we were able to basically be on the same page".
Several participants reported that the CoV contours allowed them to focus on the encircled data: "...the good part of the cone was that it helped me focus on where I was looking at, so I won’t look in under directions. [P9]". P17 explicitly stated that they ‘liked’ the circle produced by the cones: "We were both focussing on the same things. Like having a line around, you know what you are focussing your attention on, that kind of helps the kind of block out everything else". Similarly, P2 stated: "The cone was helpful as it just helps me concentrate. I don’t really feel like it was getting on the way"
Participants commented that the bidirectional visual attention cues made them feel more coupled and accountable and not wandering around but staying on the same page. For example, P13 reported: “The cone was helpful in that it kind of kept me in the room so I need to look at the same thing that she was describing, so she would see that I was looking at it was very helpful".
Participants complained that in the CoV+Speech condition, the cue lagged considerably due to delays in speech recognition: “I feel the speech was picking it up like in 10 seconds I did not find it to be reliable as when it was shirking down it would do so unreliable not in the specific area. [P13]". Similarly, P12 stated: “it felt like it was slow".
5.2 Verbal communication as a fallback.
Most participants reported that the CoV worked well for keeping them on the same page; however, it was not very precise, so they reported using verbal communication as the default method to refine the accuracy of the CoV. For example: “I think it was helpful seeing in general where the other person is looking at and then also aligning myself with that we did without yeah but it wasn’t like with the specifics obviously it wasn’t as helpful so I think we used more like verbal things to see like which chart each person’s actual we read [P4]". P6 explicitly compared the CoV to the eye-tracking cues: “with the fixed cone compared to the eye-tracker there is a lot more to verbalise so you had to find out like oh yeah, I have a look over there and get more details to say to the other collaborator".
Some participants reported feeling that the eye-tracking was sometimes not perfectly calibrated. Although initially they tried to compensate for the error by moving their eyes, they ended up using verbal communication to specify the location of the data they were discussing: “the eye-tracking did get in the way a little bit because I feel like it was sometimes it wasn’t calibrated that well, so I’m trying to fix my attention on the specific part of the chart and then the cursor was slightly off in another place so at first I was trying to compensate with my eyes but that wasn’t working so I just had to ignore it and communicate the region of interest verbally [P2]".
5.3 CoV + Speech condition.
The CoV narrowing down on the region of interest was reported as a welcome confirmation of the shared visual attention: “[it] was nice to have a confirmation of the cone shrinking as it increases the confidence that we were both looking at something [P5]". Participants commented that it rarely focused on the wrong area: “I felt it rarely narrows down on the wrong area, but there was delay [P19]".
Sometimes participants went beyond explicitly looking for labels to refer to, and attempted to direct the CoV narrowing with spatial voice commands (e.g., top left, bottom right, etc.). The positions were expected to be understood with respect to the virtual screen at which the participant was looking. For example, P13 reported feeling disappointed that such a strategy did not work: “I found also that was limited in the functionality as it would not recognise top left bottom right corners". P2 mentioned that the other participant instead quickly reacted to such spatial references: “There were few charts in which some of the information on the y and x axis were the same; however, with them, I was mentioning top left or top right, and she would very quickly look there". Therefore, a future system that uses speech as input for visual attention cues could recognise this type of verbal spatial reference to inform the visual cue contours without semantic knowledge of the context.
7 Future Work and Limitations
We recognise several limitations of our work. First, subjective responses highlighted voice detection issues during the CoV+Speech condition. Our post hoc analysis addressed this issue with a different speech recognition engine, but it is possible that speech behaviour during the study was affected. Second, our qualitative analysis (section 5.3) showed that participants often made spatial references (e.g., "on my left" or "top right corner"). These references were not used by our technique; further work could explore spatial references as explicit controls of visual cues. In addition, other verbal references could be exploited to infer areas with specific colours, shapes, or images, or synonyms of visible keywords (section 3.5.2). Third, our current speech-based system is limited to HTML-based VR screens that must contain tags useful for verbal referencing. This aspect could be expanded to other environments, for example by leveraging the meta-information of 3D environments or the real-time segmentation of video to provide a layer of meta-information that can be queried for collaborative communication [94].
We also envision several avenues for future work. First, our analysis highlighted how individual visual attention is affected by the CoV; future work could explore how the CoV size affects this phenomenon. Second, the qualitative analysis revealed different qualities of head-based and eye-tracking visual cues. The CoV is easier to find because it is wider and more stable, and participants found it to be the best cue for mutual alignment; however, it lacks precision once mutual alignment is achieved (section 5.1). Meanwhile, the eye-tracking cursor is precise but moves erratically, distracting users, and can be difficult to find. Future work could investigate a hybrid version that combines CoV and eye-tracking cues to gather their respective advantages. Finally, our dataset of human behaviour can be used for multiple purposes, such as evaluating leadership [3], competence skills [23, 31], the success of collaboration [112, 113], and other behavioural analyses. However, the dataset is limited to 2-dimensional data, and future work could explore 3D data. Future challenges for 3D data include occlusions, illumination, and different approaches to generating visual cues.
8 Conclusions
In this paper, we investigated how combining verbal communication with the Cone of Vision (CoV) can improve gaze inference and mutual awareness for exploratory data analysis in VR. We proposed a novel method, Speech-Augmented Cone-of-Vision, which aims to dynamically balance the broadness of the cone of vision with the pinpointing ability of verbal communication. We conducted a within-group study in which ten pairs of participants performed collaborative data analysis tasks under three conditions. We used quantitative and qualitative methods, including participants' head and eye-gaze behaviour, post-task questionnaires, and semi-structured interviews. Our findings suggest that visual attention cues based on head gaze (i.e., CoV and CoV+Speech) are equally effective as, if not more effective than, those based on eye tracking in fostering joint attention, leading to an increase of about 20% in concurrent gaze on the same VR screen. The questionnaire results and the analysis of the interviews suggest that the CoV+Speech condition was affected by the lag and limited accuracy of the real-time speech recognition implementation we used. To overcome this limitation, we transcribed the recorded audio using an offline, high-accuracy speech-to-text model. Accurate transcription allowed us to classify the types of verbal references and to validate our assumption that participants used keywords to negotiate shared visual attention. This approach also allowed us to perform a non-real-time approximation of eye gaze using speech as input. The results of this analysis show that our proposed method improved eye-gaze approximation accuracy by 50 pixels when the gaze was not constrained by the CoV regions. We therefore demonstrate that speech has the potential to be used as input to dynamically alter CoV cues by narrowing the focus of visual attention. To support further research in this area, we release the data collected in our study as a public research dataset. To the best of our knowledge, this is the first publicly available dataset of collaborative head, eye, and transcribed speech behaviour.