5.3 Apparatus
Figure 4 shows the overall simulated conversation setting of the experiment. A virtual conversation partner (a muted talking-head video following [36]) and two virtual rooms (a living room and a kitchen) were displayed at eye level on three 27” LCD monitors (refresh rate = 60 Hz, resolution = 1920 × 1080 px). The virtual partner was modeled after an average female (head height = 24 cm [61], FoV = 9.15° vertical at 1.5 m) and was displayed on the central monitor 1.5 meters away from the participants, following the social conversation distance defined by Hall et al. [29, 36]; the two virtual rooms were displayed on the side monitors at the same distance to provide an immersive feeling of a home. A Python program running on desktop computers controlled the virtual conversation partner and the other stimuli. Note that the choice of a virtual conversation partner reflects a trade-off between external and internal validity [53]. While a real conversation partner would enhance external validity, it would significantly reduce internal validity by introducing potential confounding factors, such as replies that are inconsistent in content and duration, which could affect participants' manipulation behavior. Thus, we selected a virtual conversation partner to enable a fair comparison in this study.
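For reference, the quoted 9.15° vertical FoV is consistent with the 24 cm head height viewed at 1.5 m, taking the full vertically subtended visual angle:

\[
\theta = 2\arctan\!\left(\frac{h/2}{d}\right) = 2\arctan\!\left(\frac{0.12\,\text{m}}{1.5\,\text{m}}\right) \approx 9.15^{\circ}
\]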
There were a total of eight IoT devices, four in the living room (two lights, an air-conditioner, and a smart speaker) and four in the kitchen (a light, a dishwasher, and two drink machines), following common smart home settings [23, 35, 37]. To manipulate the IoT devices, as shown in Figure 4, participants used either an OHMD (Nreal Light, 1920 × 1080 px, 60 Hz, FoV ≈ 45° horizontal × 25° vertical) with a ring mouse (Sanwa 400-MAW151BK with 4 buttons and 1 touchpad), a smartphone (Google Pixel 4, 5.7”), or a smart speaker (Google Nest Hub 2), depending on the condition. A mobile eye tracker (Pupil Core / Pupil Core Addon) was either attached to the OHMD or worn directly on the head. Four AprilTags were attached to the central monitor so that the eye tracker could register the location of the virtual conversation partner.
For ParaGlassMenu and the Linear Interface, participants wore the Nreal Light along with the ring mouse on their dominant hand. The menus were implemented in Unity and displayed at the same depth as the virtual partner using Nreal's mixed reality mode. The OpenCV plus Unity asset was used to track the target face with the Nreal's camera and position the menu around the face. The size of the menu icons, 5 cm in diameter, was chosen based on a pilot study (N = 5) in which participants could recognize the menu while looking at the virtual face from a 1.5 m distance (Figure 1c and Figure 3b).
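The face-anchored placement can be illustrated with a minimal Python sketch. The actual system used the OpenCV plus Unity asset inside Unity; the detector, icon count, and ring radius below are illustrative assumptions rather than the study's implementation.

```python
import math
import cv2

# Illustrative sketch only: the study positioned the menu in Unity with the
# OpenCV plus Unity asset; this shows the face-anchored circular placement idea.
face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def circular_menu_anchors(frame, n_icons=6, radius_scale=1.4):
    """Return pixel positions arranged in a circle around the first detected face."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return []                                # keep the previous layout if no face is found
    x, y, w, h = faces[0]
    cx, cy = x + w / 2.0, y + h / 2.0            # face centre in image coordinates
    radius = radius_scale * max(w, h) / 2.0      # ring placed just outside the face
    return [(cx + radius * math.cos(2 * math.pi * i / n_icons),
             cy + radius * math.sin(2 * math.pi * i / n_icons))
            for i in range(n_icons)]
```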
For the Phone Interface, participants used a Google Pixel 4 phone with the Google Home and YouTube Music apps installed. The YouTube Music app was used to select and stream songs to the smart speaker, as the Google Home app does not allow playing songs directly. The locked phone was placed on the table within hand reach. For the Voice Interface, participants used the Google Assistant built into the Google Nest Hub 2. In addition, Google Home Playground was used to generate the virtual IoT devices and rooms for the Google Home app (Phone Interface) and the Google Nest Hub 2 (Voice Interface).
5.4 IoT Manipulation Tasks Design
IoT manipulation tasks can be divided into two types: 1) information tasks, in which users get information about a device, and 2) command tasks, in which users execute a command on a device [72]. Moreover, our analysis of IoT tasks in smart home scenarios, based on the Google Home device traits [28], revealed six major sub-tasks related to IoT manipulation: 1) Activation: turning on the manipulation interface; 2) Navigation: going to the corresponding room/device; 3) Selection: selecting the room/device/item; 4) Checking: examining the state of the device; 5) Discrete manipulation: changing the discrete state of the device; and 6) Continuous manipulation: changing the continuous state of the device.
By aligning the two IoT manipulation task types with the six sub-tasks, we found that activation, navigation, and selection are common to both types. In addition, an information task involves checking, whereas a command task includes discrete or continuous manipulation, depending on the capabilities of the device. Task complexity, i.e., the number of steps or the duration required to complete a task, depends on the number of states supported by the device.
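One way to summarize this mapping is as a simple lookup, shown below with hypothetical identifiers of our own:

```python
# Hypothetical encoding of the mapping described above (identifiers are ours).
COMMON_SUBTASKS = ["activation", "navigation", "selection"]

SUBTASKS_BY_TASK_TYPE = {
    "information task": COMMON_SUBTASKS + ["checking"],
    # A command task ends in a discrete or a continuous manipulation,
    # depending on the states the target device supports.
    "command task (discrete)":   COMMON_SUBTASKS + ["discrete manipulation"],
    "command task (continuous)": COMMON_SUBTASKS + ["continuous manipulation"],
}
```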
Thus, to evaluate the interfaces across tasks of different complexities, four IoT manipulation tasks (i.e., IoT Tasks), comprising one information task and three command tasks, were selected to cover the full spectrum of sub-tasks. Checking Info (checking the device's current state) was selected as the information task, while Discrete Manipulation (changing the active state of the device), Continuous Manipulation (changing the continuous state of the device), and Selecting From List (changing the discrete state of a device with more than two states) were selected as the command tasks. Appendix A.1 presents sample IoT tasks, and Table 2 summarizes the interaction methods used for each IoT Task on each Interface.
5.5 Study Design
A repeated-measures within-subjects design was used in which the independent variables were IoT Interface (ParaGlassMenu, Linear, Phone, Voice) and IoT Task (Checking Info, Discrete Manipulation, Continuous Manipulation, Selecting From List), resulting in 16 sessions per participant. The order of IoT Interface was counterbalanced across participants using a Latin square, whereas the IoT Tasks were presented in a fixed order of increasing complexity (Checking Info, then Discrete Manipulation, then Continuous Manipulation, then Selecting From List), because comparing conversation and IoT manipulation quality across task types was not in the scope of this research.
To avoid potential biases due to the menu layouts, three trials were designed for each IoT Task, and each trial involved different devices of the same complexity. In summary, the final design involved 960 IoT trials in total: 20 participants × 4 Interfaces × 4 IoT Tasks × 3 trials per task.
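A counterbalanced ordering of this kind can be generated with a short script. The sketch below assumes the standard balanced Latin square construction for an even number of conditions; the identifiers are our own, not the study's code.

```python
# Minimal sketch: balanced Latin square ordering of the four Interfaces,
# with the IoT Tasks kept in a fixed order of increasing complexity.
INTERFACES = ["ParaGlassMenu", "Linear", "Phone", "Voice"]
IOT_TASKS = ["Checking Info", "Discrete Manipulation",
             "Continuous Manipulation", "Selecting From List"]
TRIALS_PER_TASK = 3
N_PARTICIPANTS = 20

def balanced_latin_square_row(n, participant):
    """One row of a balanced Latin square (condition indices for one participant)."""
    row, j, h = [], 0, 0
    for i in range(n):
        if i < 2 or i % 2 != 0:
            val, j = j, j + 1
        else:
            val, h = n - h - 1, h + 1
        row.append((val + participant) % n)
    if n % 2 != 0 and participant % 2 != 0:   # odd n needs a reversed second pass
        row.reverse()
    return row

def interface_order(participant):
    """Interface presentation order for a given participant index."""
    return [INTERFACES[i] for i in balanced_latin_square_row(len(INTERFACES), participant)]

# 20 participants x 4 Interfaces x 4 IoT Tasks x 3 trials = 960 trials in total
assert N_PARTICIPANTS * len(INTERFACES) * len(IOT_TASKS) * TRIALS_PER_TASK == 960
```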
5.6 Task and Procedure
After giving consent, participants first received brief guidance and training sessions to familiarize themselves with each Interface; they then completed the 16 sessions of the formal experiment.
For each session, the eye tracker was first calibrated and then three trials were conducted. For each trial, the manipulation command was first displayed in text form, consisting of action, device name, and location (e.g., “Raise the Temperature of the AC Above 27 in the Living Room”; see Appendix A.1), on the central monitor for seven seconds to ensure participants could read it at least twice [15]. Next, the text “Start” was shown on the monitor for one second to indicate that participants could start manipulating the device once the virtual conversation partner appeared; the virtual conversation partner was then displayed on the central monitor and spoke continuously (moving mouth) until the participant successfully completed the trial (see the details of the stimuli in Appendix A.2). We asked participants to act as if they were listening to their conversation partner while manipulating the devices.
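To make the trial flow concrete, a minimal sketch of the stimulus timeline is given below. The display and IoT callbacks are hypothetical placeholders standing in for the Python stimulus-control program, not its actual code.

```python
import time

COMMAND_DISPLAY_S = 7   # command shown long enough to be read at least twice
START_CUE_S = 1         # duration of the "Start" cue

def run_trial(command_text, show_text, play_partner_video, stop_partner_video,
              trial_completed, reset_state):
    """Hypothetical trial loop mirroring the procedure described above."""
    show_text(command_text)          # e.g. "Raise the Temperature of the AC ..."
    time.sleep(COMMAND_DISPLAY_S)
    show_text("Start")
    time.sleep(START_CUE_S)
    play_partner_video()             # muted talking head keeps "speaking"
    while not trial_completed():     # until the participant finishes the manipulation
        time.sleep(0.05)
    stop_partner_video()
    reset_state()                    # reset IoT devices and interface to defaults
```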
To ensure a consistent experience among all participants, the state of the IoT devices and the status of the Interface were reset to the defaults after each trial. After finishing all three trials of a session, participants filled out questionnaires (detailed in Section 5.7) about their experience with the corresponding Interface and IoT Task pair.
Moreover, participants were given a 10-minute break upon completing all four sessions for each Interface. After completing all sixteen sessions, they filled out a questionnaire with their overall rankings and attended an 8–12 minute semi-structured post-interview. The entire experiment took approximately 120 minutes per participant.
5.8 Results
During the study, a total of 320 data points (20 participants × 16 sessions) were collected. Figure 6 and Figure 7 show the participants' performance (see Appendix A.3 for details).
5.8.1 Quality of (simulated) conversation.
Overall, there was a significant (p < 0.05) main effect of Interface for all measures, and ParaGlassMenu afforded the highest conversation quality compared with the other interfaces.
Face Focus: A repeated-measures ANOVA after ART indicated significant main effects of Interface (F(3, 285) = 85.155, p < 0.001) and IoT Task (F(3, 285) = 9.394, p < 0.001), as well as a significant interaction effect (F(9, 285) = 2.583, p = 0.007). Besides, there were simple effects (p < 0.05) for the individual levels of Interface and IoT Task, except for the Phone Interface. Moreover, the post-hoc analysis revealed that Voice and ParaGlassMenu were significantly higher than Linear and Phone (p_bonf < 0.05), with Linear significantly higher than Phone (p_bonf < 0.05). There was no significant difference between ParaGlassMenu and Voice.
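As a reference point for this analysis style, a hedged sketch of a two-way repeated-measures ANOVA on aligned-rank-transformed data is shown below. The file name, column names, and use of statsmodels are assumptions for illustration, not the authors' analysis pipeline; the ART step itself (e.g., via the ARTool package) is applied beforehand and not reproduced here.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Long-format data with one row per participant x Interface x IoT Task session;
# "face_focus_art" holds the aligned-rank-transformed response.
df = pd.read_csv("face_focus_art.csv")          # hypothetical file name

anova = AnovaRM(df, depvar="face_focus_art", subject="participant",
                within=["interface", "task"], aggregate_func="mean").fit()
print(anova)   # F and p values for the two main effects and their interaction
```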
Overall, Voice enabled the highest Face Focus (M = 0.253, SD = 0.192) on the virtual conversation partner, as it did not provide any visual feedback that drew the visual focus away from the partner's face. However, six participants disagreed, mentioning that they could focus better with ParaGlassMenu (M = 0.235, SD = 0.119) than with Voice because they tended to look at the smart speaker before speaking, whereas the circular layout of ParaGlassMenu helped them concentrate on the face. In contrast, Phone had the lowest Face Focus (M = 0.044, SD = 0.043), as IoT manipulation with Phone required users to switch between the phone and the face.
Politeness: There was only a significant main effect of Interface (F(3, 285) = 50.731, p < 0.001), and the post-hoc analysis revealed that ParaGlassMenu and Linear were significantly higher (p_bonf < 0.001) than Phone and Voice, with no significant difference between the other Interface pairs.
Overall, the OHMD interfaces showed high Politeness, with ParaGlassMenu being the highest (M = 5.51, SD = 1.12), as it enabled participants to keep their focus on the face. In contrast, participants felt it was “rude” and “impolite” to use Phone (M = 3.73, SD = 1.84) to manipulate devices during a conversation, as it required attention switching between the face and the phone and violated social norms. Similarly, participants felt that using Voice (M = 3.84, SD = 1.75) was impolite and “awkward”, as it could interrupt and pause the conversation; however, two participants mentioned that using Voice was acceptable for playing songs when the conversation topics were related to songs, as it could increase shared interactions.
Naturalness: There was only a significant main effect of Interface (F(3, 285) = 12.800, p < 0.001), and the post-hoc analysis revealed that the ParaGlassMenu and Linear Interfaces were significantly higher (p_bonf < 0.02) than Phone and Voice, with no significant difference between the other Interface pairs.
Overall, ParaGlassMenu showed the highest Naturalness (M = 5.23, SD = 1.04), indicating that it allowed the manipulation of IoT devices with less interruption, according to the post-interviews.
RTLX: There were only significant main effects of Interface (F(3, 285) = 4.234, p = 0.006) and IoT Task (F(3, 285) = 4.040, p = 0.008). Moreover, the post-hoc analysis revealed that ParaGlassMenu was significantly lower (p_bonf = 0.004) than Voice, with no significant difference between the other Interface pairs.
Overall, ParaGlassMenu had the lowest RTLX (M = 22.23, SD = 14.34), as it enabled easier multi-tasking with the IoT devices while focusing on the face. Additionally, ParaGlassMenu, Linear, and Phone provided visual cues, which reduced the burden of remembering commands and the likelihood of mistakes compared to Voice. In contrast, Voice caused the highest RTLX (M = 27.67, SD = 19.99) due to command recognition errors that made participants repeat voice commands. Moreover, as expected, it made users “wait” for confirmation feedback, which imposed a higher time demand than the other Interfaces.
5.8.2 Quality of IoT manipulation.
Overall, there was a significant (p < 0.05) main effect of Interface for all measures, and ParaGlassMenu provided higher IoT manipulation quality than the other interfaces.
Task Duration: There were significant main effects of Interface (F(3, 285) = 321.711, p < 0.001) and IoT Task (F(3, 285) = 58.370, p < 0.001), as well as a significant interaction effect (F(9, 285) = 15.496, p < 0.001). Besides, there were simple effects (p < 0.05) for all individual levels of Interface and IoT Task. The post-hoc analysis revealed significant differences (p_bonf < 0.001) between all Interface pairs, with ParaGlassMenu having the lowest duration and Voice the highest.
Overall, ParaGlassMenu had the lowest Task Duration (M = 5.75 s, SD = 2.28 s), as it enabled participants to locate and navigate to individual devices easily while maintaining focus on the face, provided “more intuitive” manipulation compared to Linear, and reduced attention switching between the face and the menu compared to Phone. On the contrary, as expected, Voice had the highest Task Duration (M = 14.18 s, SD = 5.60 s) due to the longer time needed to issue voice commands and receive feedback, and to multiple attempts caused by voice recognition errors.
Task Accuracy: There were significant main effects of Interface (F(3, 285) = 64.194, p < 0.001) and IoT Task (F(3, 285) = 100.873, p < 0.001), as well as a significant interaction effect (F(9, 285) = 20.279, p < 0.001). Besides, there were simple effects (p < 0.05) for Voice and for the IoT Tasks except Discrete Manipulation. The post-hoc analysis revealed that ParaGlassMenu, Linear, and Phone were significantly higher (p_bonf < 0.001) than Voice, with no significant difference between the other Interface pairs.
Overall, Voice had the lowest accuracy (M = 0.844, SD = 0.183) due to speech recognition inaccuracy, which led to repeated commands. On the contrary, ParaGlassMenu had the highest accuracy (M = 0.997, SD = 0.028) due to its intuitive spatial mapping, and Phone had the second highest accuracy (M = 0.994, SD = 0.039) due to its familiar UI design with touch interaction.
Relaxation: There was only a significant main effect of Interface (F(3, 285) = 12.523, p < 0.001), and the post-hoc analysis revealed that Voice was significantly lower (p_bonf < 0.05) than the other Interfaces, and Linear was significantly lower (p_bonf < 0.05) than ParaGlassMenu. There was no significant difference between the other Interface pairs.
As expected, Phone had the highest Relaxation (M = 5.69, SD = 0.91) due to device familiarity. ParaGlassMenu had the second highest Relaxation (M = 5.68, SD = 1.13) due to its quick and intuitive manipulation. Voice felt the least relaxing (M = 4.80, SD = 1.69), as incorrect recognition of voice commands caused repeated attempts and delayed feedback.
SUS: A one-way repeated-measures ANOVA with Greenhouse-Geisser correction (ε = 0.71) revealed a significant effect of Interface (F(2.155, 40.952) = 5.288, p = 0.008, η² = 0.218; note that SUS was calculated only per Interface). The post-hoc analysis revealed that Voice was significantly lower (p_bonf < 0.05) than ParaGlassMenu, Linear, and Phone, with no significant difference between the pairs of these three Interfaces.
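A one-way repeated-measures analysis with sphericity correction can be sketched similarly; the use of pingouin, the file name, and the column names below are assumptions for illustration, as the paper does not name its statistics software.

```python
import pandas as pd
import pingouin as pg

# One SUS score per participant x Interface, long format (hypothetical file name).
sus = pd.read_csv("sus_scores.csv")

# correction=True reports Greenhouse-Geisser corrected degrees of freedom and p value.
aov = pg.rm_anova(data=sus, dv="sus", within="interface",
                  subject="participant", correction=True, detailed=True)
print(aov)

# Bonferroni-corrected pairwise comparisons between the four Interfaces.
posthoc = pg.pairwise_tests(data=sus, dv="sus", within="interface",
                            subject="participant", padjust="bonf")
print(posthoc)
```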
Overall, ParaGlassMenu was perceived as the most usable system (M = 83.00, SD = 9.82) for manipulating IoT devices in a conversation setting, as it was “intuitive”, “easy to use”, “polite”, “faster than others”, and “help[ed] to concentrate on people’s face”. In contrast, Voice had the lowest SUS score (M = 70.88, SD = 18.84), which was below the threshold (i.e., 80 [11]) for good usability.
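For context on the 0–100 scale used above, SUS scores are conventionally computed from the ten 5-point items as follows; this is the standard scoring procedure, not code from the study.

```python
def sus_score(responses):
    """Standard SUS scoring: ten 1-5 ratings, alternating positive/negative items."""
    assert len(responses) == 10
    odd = sum(r - 1 for r in responses[0::2])    # items 1, 3, 5, 7, 9 (positively worded)
    even = sum(5 - r for r in responses[1::2])   # items 2, 4, 6, 8, 10 (negatively worded)
    return 2.5 * (odd + even)                    # scale the 0-40 raw sum to 0-100

# Example: uniformly favourable answers yield the maximum score of 100.
assert sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1]) == 100.0
```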
5.8.3 Preference rankings.
Figure 8 shows the overall preference ranking of the Interfaces.
The majority of participants (12) ranked ParaGlassMenu as their most preferred Interface, while Voice was the least preferred (11). They reported that ParaGlassMenu was intuitive, easy to use, polite, and less distracting to the conversation than the other Interfaces, whereas Voice could interrupt conversations because voice commands could pause the conversation and speech recognition errors caused repeated attempts.
The participants (5) who selected Phone as their first preference mentioned that familiarity helped them control the IoT devices easily and conveniently, and that it was acceptable because “most people have gotten used to people occasionally checking their phones”. Meanwhile, the two participants who chose Voice as their first preference mentioned that it took less effort, did not affect their focus on the partner, felt more natural, and was easier to use than the ring mouse or phone. Lastly, the remaining participant who chose Linear as the first preference mentioned that the 1D layout of Linear was simpler and easier to navigate than the 2D layout of ParaGlassMenu.
5.9 Discussion
Overall, ParaGlassMenu achieved the highest conversation quality: the most focus on the conversation partner (\(M = 23.5\%,\: SD = 11.9\%\)), the highest politeness (M = 5.51, SD = 1.12 / 7) and naturalness (M = 5.23, SD = 1.04 / 7), and the lowest cognitive load (M = 22.23, SD = 14.34 / 100). ParaGlassMenu also enabled the most effective IoT manipulation, with the lowest manipulation time (M = 5.75 s, SD = 2.28 s), the highest accuracy (\(M = 99.7\%,\: SD = 2.8\%\)), and the best usability score (M = 83.00, SD = 9.82 / 100), all in a relaxed manner (M = 5.68, SD = 1.13 / 7). Thus, manipulating IoT devices with ParaGlassMenu interfered least with the conversation, and it was also the most preferred Interface. The Linear Interface is recommended as the second choice due to its familiar linear layout, but interacting with it demands considerably more attention, which results in noticeably lower focus on the conversation partner.
On the other hand, as expected, the Phone and Voice Interfaces have limitations in a conversation setting: Phone failed to support high conversation quality, and Voice failed to support both high conversation quality and high usability in social interactions. This does not mean, however, that the Phone and Voice Interfaces should be excluded. Phone is the most accessible and familiar interface today, making it the easiest and default choice for most users. Voice allows users to maintain visual attention on a target and can be accessed ubiquitously, which can be particularly useful in non-social settings, such as single-user and driving scenarios.