We designed a user study to evaluate the usability and performance of the SGM concept in two abstract inspection-like tasks. We compare SGM against a no-marker baseline and include task variations to assess how distractions further affect user performance.
5.1 Study Design
The user study is split into two tasks, both designed as within-subject. The first task, Repair, has two independent variables, namely Technique and Distraction, counterbalanced using a Latin Square [3] in two steps, first on Technique and then on Distraction. We compare having SGM versus a baseline of No Marker and having the user complete the task with No Distraction versus being Distracted by a secondary task before finishing (Section 4.2.1). This results in four conditions for the Repair task: (1) No Marker+No Distraction (NM+ND), (2) SGM+No Distraction (SGM+ND), (3) No Marker+Distraction (NM+D), (4) SGM+Distraction (SGM+D). The second task of the study, Inspection, has one independent variable, namely Technique (NM vs. SGM), as in the Repair task, counterbalanced using a Latin Square [3]. In sum, 2 Techniques × 2 Distractions × 8 trials + 2 Techniques × 8 trials = 48 successful trials per participant. We had 20 participants complete our study, totalling 960 successful trials.
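The trial arithmetic and the idea of rotating condition orders can be illustrated with a minimal sketch. It assumes a simple cyclic Latin square; the exact two-step counterbalancing scheme used in the study is not reproduced here, and the names `cyclic_latin_square` and `order_for` are hypothetical helpers introduced only for illustration.

```python
from itertools import product

TECHNIQUES = ["NM", "SGM"]
DISTRACTIONS = ["ND", "D"]
TRIALS_PER_CONDITION = 8
PARTICIPANTS = 20


def cyclic_latin_square(items):
    """Return a square where row i is `items` rotated left by i positions.

    Every item appears exactly once in each row and each column.
    (Assumption: a plain cyclic square, not necessarily the authors' scheme.)
    """
    n = len(items)
    return [[items[(i + j) % n] for j in range(n)] for i in range(n)]


def order_for(participant_index, conditions):
    """Pick the row of the square assigned to a participant (hypothetical helper)."""
    square = cyclic_latin_square(conditions)
    return square[participant_index % len(square)]


# Four Repair conditions (Technique x Distraction) and two Inspection conditions.
repair_conditions = [f"{t}+{d}" for d, t in product(DISTRACTIONS, TECHNIQUES)]
inspection_conditions = list(TECHNIQUES)

trials_per_participant = (
    len(repair_conditions) + len(inspection_conditions)
) * TRIALS_PER_CONDITION
total_trials = trials_per_participant * PARTICIPANTS

print(repair_conditions)
print(trials_per_participant, total_trials)
print(order_for(1, repair_conditions))
```

The six conditions with 8 successful trials each give the 48 trials per participant stated above, and 20 participants give 960 in total.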
5.3 Procedure
Participants received a study briefing, completed consent and demographics forms, and watched a video of the Repair task with SGM, ensuring that participants understood the study elements before starting. Participants then wore the HoloLens 2 and underwent eye-tracking calibration.
Starting with Repair, participants had three training trials following the counterbalancing and were instructed to be as fast and as accurate as possible. Subsequently, participants completed 8 successful study trials for the current Repair condition, followed by a post-condition questionnaire consisting of a NASA-TLX [20] with one additional Eye Demand question. After each condition, participants waited until the screen was cleared of stuck-on shapes before proceeding to the next condition according to the counterbalancing. Additional training sessions were completed when new elements of the study design were reached.
Ending with Inspection, participants had three training trials, followed by 8 study trials according to the counterbalancing, and completed the same post-condition questionnaire as for Repair. After completing both Inspection conditions, participants completed a final post-study questionnaire to gauge how they felt about the system and the setup. The study lasted around 60 minutes on average.
5.4 Evaluation Metrics
For dependent variables, we include the following measures.
Relocalisation Time measures how much time it takes for the participant to find the correct Post-it after returning to the work area, counting from the first time gaze hits the work area after returning from the tools area until the correct tool is placed under the target Post-it. Relocalisation Time is thereby not directly affected by the additional distraction task in Repair, but only indirectly affected by how the distraction task impacts the participants’ memory. Rechecks count how many Post-its the participant had to check before finding the correct Post-it, manually noted by the study conductor. An Error Rate was calculated as the fraction of error trials out of total trials, with trials marked as an error if the participant checked more than 5 Post-its or picked an incorrect printed-out shape. To get insight into the participants’ task load we used the NASA-TLX in a 7-point Likert scale variant [8, 16, 34] conducted as Raw TLX [4, 19] answered immediately after each condition, and User Feedback on SGM was gathered in a post-study questionnaire after all six conditions were completed.
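The Error Rate definition above can be expressed as a short sketch. The `Trial` record and its field names are hypothetical and only illustrate the two error criteria (more than 5 Post-its checked, or an incorrect printed-out shape picked); the study's actual logging format is not known.

```python
from dataclasses import dataclass

# Threshold from the text: checking more than 5 Post-its marks an error trial.
MAX_CHECKED_POSTITS = 5


@dataclass
class Trial:
    """Hypothetical per-trial record; field names are illustrative assumptions."""
    postits_checked: int
    picked_correct_shape: bool


def is_error(trial: Trial) -> bool:
    """A trial is an error if too many Post-its were checked or the wrong shape was picked."""
    return trial.postits_checked > MAX_CHECKED_POSTITS or not trial.picked_correct_shape


def error_rate(trials: list[Trial]) -> float:
    """Fraction of error trials out of all trials in a condition."""
    if not trials:
        raise ValueError("error_rate needs at least one trial")
    return sum(is_error(t) for t in trials) / len(trials)


example_trials = [
    Trial(postits_checked=1, picked_correct_shape=True),   # successful
    Trial(postits_checked=6, picked_correct_shape=True),   # error: more than 5 checks
    Trial(postits_checked=2, picked_correct_shape=False),  # error: wrong shape
    Trial(postits_checked=5, picked_correct_shape=True),   # successful: exactly 5 is allowed
]
print(error_rate(example_trials))
```

Note that a trial with exactly 5 checked Post-its still counts as successful under the "more than 5" criterion.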
5.6 Results
After outlier removal (≈3.8%), the data were not normally distributed (as indicated by Shapiro-Wilk tests). Therefore, we ran a series of Friedman tests with post-hoc Bonferroni-corrected Wilcoxon signed-rank tests on both logged objective data and subjective survey data. Statistical significance is shown in graphs as * for p < .05, ** for p < .01, and *** for p < .001. Note that significance tests were not carried out between Repair and Inspection.
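To make the omnibus test and the correction step concrete, the following is a minimal pure-Python sketch of the Friedman statistic and a Bonferroni adjustment. It is not the analysis code used in the study: it omits the tie correction that statistical packages typically apply, it does not implement the Wilcoxon signed-rank test itself, and the function names are hypothetical.

```python
def average_ranks(values):
    """Rank values in ascending order; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def friedman_statistic(rows):
    """Return (Q, df) where rows[p][c] is participant p's score in condition c.

    Q = 12 / (n k (k + 1)) * sum(R_c^2) - 3 n (k + 1), without tie correction.
    """
    n, k = len(rows), len(rows[0])
    rank_sums = [0.0] * k
    for row in rows:
        for c, rank in enumerate(average_ranks(row)):
            rank_sums[c] += rank
    q = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    return q, k - 1


def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Illustrative data only: 4 participants who all order 3 conditions identically.
toy_rows = [[1.0, 2.0, 3.0], [0.5, 0.9, 1.4], [2.0, 2.5, 4.0], [1.1, 1.2, 1.3]]
print(friedman_statistic(toy_rows))
print(bonferroni([0.003, 0.5]))
```

With four conditions, as in the Repair task, there are six possible pairwise comparisons, so an unadjusted p-value would be multiplied by 6 if all pairs were tested; whether all six pairs were compared is an assumption here.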
This section focuses on Relocalisation Time, Rechecks, Error Rate, NASA-TLX, and User Feedback. Full analysis results on logged data can be found in supplementary materials, including results on additional measures that did not add to the discussion, such as Task Completion Time (i.e. the time from the start of the trial until the participant completes the trial). However, as reference points for Relocalisation Time, the average Task Completion Time for each condition was, in the Repair task, 22.84s for NM+ND, 16.63s for SGM+ND, 50.29s for NM+D, and 38.07s for SGM+D. In the Inspection task, NM took 19.59s and SGM took 18.30s.
5.6.1 Relocalisation Time (Figure 4a).
We found significant differences in Relocalisation Time in the Repair task (χ2(3) = 53.1, p < 0.001). Both the SGM+ND (M = 3.01, SD = 0.425) and SGM+D (M = 3.39, SD = 0.683) conditions were faster than both the NM+ND (M = 5.14, SD = 1.508) and NM+D (M = 7.13, SD = 2.33) conditions (all p < 0.001). In addition, NM+ND was faster than NM+D (p < 0.001), and SGM+ND was faster than SGM+D (p = 0.042). In the Inspection task, no significant difference was indicated (χ2(1) = 3.2, p = 0.074).
5.6.2 Rechecks (Figure 4b).
In terms of Rechecks, we found significant differences in the Repair task (χ2(3) = 31.194, p < 0.001). Both the SGM+ND (M = 0.1, SD = 0.447) and SGM+D (M = 0.65, SD = 1.226) conditions exhibited significantly fewer rechecks than the equivalent NM+ND (M = 2.15, SD = 2.323) and NM+D (M = 2.9, SD = 2.47) conditions within the same level of Distraction (all p ≤ 0.018), and SGM+ND exhibited fewer rechecks than NM+D (p < 0.001). We also found a significant difference in the Inspection task (χ2(1) = 5.333, p = 0.021), with SGM (M = 0.4, SD = 1.188) exhibiting significantly fewer rechecks than NM (M = 1.9, SD = 3.007).
5.6.3 Error Rate (Figure 4c).
No significant differences were indicated in Error Rate in the Repair task (χ2(3) = 4.244, p = 0.236) or in the Inspection task (χ2(1) = 0.667, p = 0.414).
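As a worked cross-check, a reported p-value can be recovered from its χ2 statistic and degrees of freedom. The sketch below uses closed-form upper-tail probabilities of the chi-square distribution for the two cases that occur in this study (1 and 3 degrees of freedom); `chi2_upper_tail` is a hypothetical helper written for this illustration and is not a general-purpose routine.

```python
import math


def chi2_upper_tail(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with df of 1 or 3.

    df = 1: P = erfc(sqrt(x / 2))
    df = 3: P = erfc(sqrt(x / 2)) + sqrt(2 x / pi) * exp(-x / 2)
    """
    if x < 0:
        raise ValueError("the chi-square statistic must be non-negative")
    tail = math.erfc(math.sqrt(x / 2))
    if df == 1:
        return tail
    if df == 3:
        return tail + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)
    raise ValueError("only df = 1 and df = 3 are handled in this sketch")


# Error Rate statistics as reported in the text.
p_repair = chi2_upper_tail(4.244, df=3)
p_inspection = chi2_upper_tail(0.667, df=1)
print(round(p_repair, 3), round(p_inspection, 3))
```

Both values agree with the reported p = 0.236 and p = 0.414 to three decimals, consistent with the Repair test comparing four conditions (df = 3) and the Inspection test comparing two (df = 1).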
5.6.4 NASA-TLX (Figure 5).
As for the NASA-TLX scores, we found significant differences in Mental Demand in the Repair task (χ2(3) = 52.175, p < 0.001). SGM+ND (M = 2.15, SD = 0.67) was rated significantly less mentally demanding than all three other conditions (all p < 0.001), SGM+D (M = 4.75, SD = 1.29) was less mentally demanding than NM+D (M = 6.1, SD = 1.02) (p < 0.001), and NM+ND (M = 4.3, SD = 1.26) was less mentally demanding than NM+D (p < 0.001). A significant difference was also found in Mental Demand in the Inspection task (χ2(1) = 6.250, p = 0.012), with SGM (M = 2.45, SD = 1.05) being less mentally demanding than NM (M = 3.3, SD = 1.22). Regarding Performance in the Repair task (χ2(3) = 12.677, p = 0.03), SGM+ND (M = 2.0, SD = 1.56) was rated better (lower rating) than SGM+D (M = 3.1, SD = 1.33) (p = 0.018). In terms of Effort in the Repair task (χ2(3) = 43.235, p < 0.001), SGM+ND (M = 2.35, SD = 0.67) was rated significantly less effortful than all three other conditions (SGM+D (M = 4.3, SD = 1.3), NM+ND (M = 4.4, SD = 1.54), and NM+D (M = 5.6, SD = 1.27)) (all p < 0.001), NM+ND required less effort than NM+D (p = 0.012), and SGM+D required less effort than NM+D (p = 0.03). Finally, we found significant differences in Frustration in the Repair task (χ2(3) = 28.938, p < 0.001): SGM+ND (M = 1.55, SD = 0.83) was rated less frustrating than all three other conditions (SGM+D (M = 3.0, SD = 1.72), NM+ND (M = 2.6, SD = 1.35), and NM+D (M = 3.6, SD = 1.7)) (all p ≤ 0.018).
5.6.5 User Feedback.
13 participants mentioned that they appreciated that SGM reduced task load, making tasks easier to manage. 12 participants mentioned that SGM helped them remember target locations. 3 participants mentioned feeling that SGM improved task efficiency, especially in Distraction conditions. However, our participants also noted several challenges. 5 participants mentioned that SGM would occasionally provide incorrect guidance. These incorrect placements were infrequent and most often caused an error trial, so they are captured by the Error Rate. 2 participants mentioned that SGM disappeared too quickly, causing them to lose track of it. 4 participants mentioned that SGM was less useful in the Inspection task. Finally, 1 participant mentioned that the limited FOV of the HoloLens 2 required them to move their head more.