research-article

Open access

Perception and Action Augmentation for Teleoperation Assistance in Freeform Telemanipulation

Authors:

Tsung-Chi Lin,

Achyuthan Unni Krishnan,

Zhi LiAuthors Info & Claims

ACM Transactions on Human-Robot Interaction, Volume 13, Issue 1

Article No.: 11, Pages 1 - 40

https://doi.org/10.1145/3643804

Published: 11 March 2024 Publication History

PDF eReader

Abstract

Teleoperation enables controlling complex robot systems remotely, providing the ability to impart human expertise from a distance. However, these interfaces can be complicated to use as it is difficult to contextualize information about robot motion in the workspace from the limited camera feedback. Thus, it is required to study the best manner in which assistance can be provided to the operator that reduces interface complexity and effort required for teleoperation. Some techniques that provide assistance to the operator while freeform teleoperating include: (1) perception augmentation, like augmented reality visual cues and additional camera angles, increasing the information available to the operator; (2) action augmentation, like assistive autonomy and control augmentation, optimized to reduce the effort required by the operator while teleoperating. In this article, we investigate: (1) which aspects of dexterous telemanipulation require assistance; (2) the impact of perception and action augmentation in improving teleoperation performance; and (3) what factors impact the usage of assistance and how to tailor these interfaces based on the operators’ needs and characteristics. The findings from this user study and resulting post-study surveys will help identify task-based and user-preferred perception and augmentation features for teleoperation assistance.

1 Introduction

1.1 Remote Perception and Teleaction Problems in Freeform Telemanipulation

Problem Statement. Contemporary motion-tracking interfaces (e.g., HTC Vive virtual reality system [93]) enable manipulator robots to track the natural human arm and hand motions to perform more dexterous, freeform manipulation. While human operators can efficiently and intuitively control gross manipulation (e.g., reaching to or moving an object), they may experience significant cognitive and physical workload when trying to control precise manipulation, such as carefully adjusting the robot end-effector near an object for grasping or placing. This is usually because human operators can not acquire the necessary sensory information (e.g., visual or haptic) to perceive and control the remote tasks [32, 85]. For example, the operators may need the camera viewpoint from a different perspective to perceive the depth information not available in the primary camera viewpoint. They may also need proprioceptive and haptic feedback to precisely control the end-effector’s motions or postures. Besides the remote perception problems, the cognitive and physical workload may also come from the difficulties in remote robot motion control. Freeform manipulation tasks typically involve both gross and precise manipulation, which can be difficult to perform efficiently through interfaces designed around motion tracking. Changing the interface mapping and scaling (from human controlled inputs to robot motion outputs) based on task state or user input will be required to control the robot with efficiency and precision.

Limitations of Related Work. Related work has proposed to various approaches assist human teleoperators’ remote perception and motion control. The existing solutions regarding the design of the telemanipulation interfaces include the methods to:

—

Improve the capabilities of interface to display additional sensory information (e.g., multi-camera viewpoints, haptic interfaces [12, 57]);

—

Resort to alternative sensory feedback to present the missing information, such as using augmented reality (AR) visual cues to represent the remote contact or force feedback [64];

—

Delegate the part of the telemanipulation task difficult for humans to capable and reliable autonomy [46] so that the interface only needs to present feedback on the autonomy’s performance instead of the detailed sensory information.

However, related work in the literature mostly compares to the same types of approaches to validate the effectiveness of their proposed methods. There is still no work to compare differenttypes of approaches, to inform how to choose among or integrate them when multiple types of approaches are available. For example, if the telemanipulation interfaces are capable of displaying AR visual cues, adjusting interface mapping dynamically, or providing autonomy for action assistance: (1) Which method will be more effective to assist (which part of) the telemanipulation? (2) Which method will be preferred by human operators? (3) Which operator specific factors affect the effectiveness and preferences of the teleoperation assistance features?

1.2 Division of Human and Robot Efforts in Assisted Teleoperation

Problem Statement. Another problem we are concerned with is: how to combine human operator and robot autonomy to optimally control telemanipulation? Our insight from the related work in the literature and our prior work is that: shared autonomy to assist telemanipulation can be more effective if it is designed to enable an appropriate division of task and effortbetween the human and robot. Such task division should allow humans to have sufficient freeform control to perform the unstructured parts of the task and allow robot autonomy to handle the structured parts of the task with desired performance (e.g., speed, accuracy, reliability).

Limitations of Related Work. In recent related work, the shared autonomy to assist remote manipulation is mostly designed to assist as much and as early as possible, based on the prediction of human intents (e.g., target object [61], expected motion trajectory [91]). These shared autonomy designs are created to minimize control inputs and efforts, and may not always be necessary and effective to assist the operators who would prefer to have more freedom than assistance to control gross manipulation. While humans can easily perform freeform reaching motions to clearly indicate the object they intend to grasp, the most effective way to reduce human workload is to provide autonomy only to the part of the task that causes humans high cognitive or physical workload. Our prior work has developed a shared autonomy that provides autonomous actions (for grasping) to reduce human physical workload [60]. In this article, we will extend the assistance design to investigate the need for robots to estimate human cognitive and physical workload to detect whether the human needs assistance, and determine if the robot can offer perception or action autonomy useful to the human.

1.3 Overview of Research Efforts

Motivating Example. Consider a comprehensive telemanipulation task such as workspace organization, which may involve control of reaching, grasping, moving, and placing of various kinds of objects. This task is mostly unstructured and requires freeform control, because it does not follow (and require) any procedure; What, when, and how to handle each object will be decided by the user on-the-fly as the task goes. Sufficient freeform control will also allow human to improvise based on their knowledge and experience (leveraging environmental constraints, physical properties of handled objects, etc.) to facilitate or enable some manipulation. To effectively assist telemanipulation without compromising the human’s control authority and freedom, the robot autonomy can provide an additional camera viewpoint from a different perspective or AR visual cues, to augment their remote perception and enable them to control the manipulation themselves. The robot autonomy can also be different interface mappings or autonomous actions that can effectively perform the task. As the human has moved the robot end-effector close enough to the target object or container to place, it is easier for the robot to infer the human’s intent and assist in the structured component (placing object) of the unstructured task using simple but effective autonomy.

Proposed Method. To this end, we propose systematic approaches for action and perception augmentation. The Action Augmentation allows humans to control the robot motions using hand pose tracking and trackpad available on the hand-held controller, for freeform or constrained motion control. It is also implemented by dynamically adjusting the scaling of the operator to robot interface mapping to support both gross and precise manipulation. For Perception Augmentation, we provide AR visual cues to convey the visual information information difficult to perceive in 2D camera (e.g., the task and robot status, interface control mode, and autonomous action affordance). We also provide a complementary camera viewpoint from a significantly different perspective, in which missing visual information, like loss of depth perception, can be more easily perceived.

Implementation. We have implemented the proposed perception and action augmentation on a representative interface. Shown in Figure 1, we used the HTC Vive hand-held controller to control robot motion and the desktop monitor to display the remote camera viewpoints and AR visual cues. The robot could provide autonomous actions (e.g., grasping and placing actions) or switch to constrained motion control using a trackpad when humans operate the robot end-effector near the target object or location to place. The implementation can be generalized to the teleoperation system using various contemporary telemanipulation control devices (e.g., hand-held, touch-based, wearable), and displays (e.g., screen-based and head-mounted visual display). Note that we are concentrating on autonomy and motion constraints specifically for pick-and-place tasks. This approach stems from the complex coordination and interaction requirements involved, laying the groundwork for potential future extensions.

Fig. 1.

Experiments and Results. We conducted three pilot studies to determine the design parameters of perception and action assistance. In Pilot Study I, we provided the complementary viewpoint from different perspectives according to the recommendation in the literature [26] to determine the complementary viewpoint angle and distance useful to our task and setup. In Pilot Study II, we evaluated what combinations of perception and action augmentation humans may prefer to inform the formal integration of the interface. In Pilot Study III, we validated our proposed methods for online estimation of muscle efforts using human motion trackers based on surface electromyography (sEMG) measurement. Our formal user study to validate and compare the proposed perception and action augmentation included 23 participants of diverse genders, professions, and experiences with technology. Since nursing tasks require nurses to handle patient medicines and articles in a non-specific order, requiring the decision-making skills possessed by human healthcare workers, and given that nurses and healthcare workers are the intended end users of our interfaces within healthcare applications, this study recruited participants with a background in nursing tasks. We use objective metrics to evaluate the task performance (e.g., completion time, types and occurrence of errors), cognitive workload (from the eye tracking data), and physical workload (from motion-tracking data), to compare the effectiveness of the proposed perception and action augmentation and their combinations. We also analyzed the survey feedback from participants to understand their preference for different perception and action augmentation. Our comprehensive user study shows that the effectiveness of and preference for the perception and action augmentation depend on the task performance objective, the user’s need for assistance, and the types of users.

Main Contributions. The specific contributions of our work include (1) a novel shared autonomy to leverage human capabilities of freeform control and an assist-as-needed robot autonomy for effective, intuitive human-robot collaboration to perform telemanipulation tasks (Section 3.2); (2) a generalizable design of AR visual cues to provide the information critical to the precision and performance of the remote manipulation (Section 3.2); (3) the integration and comparison of different types of perception and action augmentation to discover new knowledge on optimal human-robot collaboration for freeform telemanipulation (Section 3.5); (4) a novel approach for objective physical and cognitive workload estimation based on human motion and eye tracking devices (Section 3.6); and (5) a demonstration of the effectiveness and usability of the proposed interfaces and evaluation methods via a human experiment (Section 4).

2 Related Work

The challenges encountered during robot remote manipulation via freeform control have been organized in terms of control effectiveness and effort in the following section. Our cognitive and physical workload estimation method enables the evaluation of non-negligible effort while performing the freeform telemanipulation. We further implement the action and perception augmentation drawn from prior work (i.e., control mapping and scaling [93] described in Section 3.4 as well as augmented reality visual cues [59] described in Section 3.2) to reduce the physical and cognitive workload.

2.1 Cognitive and Physical Workload in Freeform Telemanipulation

Maintaining human freedom in motion control is essential to the freeform teleoperation of unstructured, unpredictable manipulation tasks (e.g., including tele-robotic surgery [29], nursing assistance [57], manufacturing [38], hazardous material handling [92], explosive ordnance disposal [83]). Such telemanipulation tasks are usually not feasible or error-prone for high-level robot autonomy (refer to the review on the level of autonomy [9]), and heavily depend on human knowledge, expertise, and robot control dexterity. To assist freeform telemanipulation, it is preferred to enable humans to efficiently and intuitively control the remote robot and cameras while having some low-level robot autonomy for perception or action support to reduce human operator’s cognitive and physical workload. Consider the various motion mapping interfaces (e.g., soft/hard exoskeletons [41], camera/IMU-based motion capture systems [58]) that can map the human body, arm, and hand motions to efficiently and intuitively control the freeform motions of manipulator robots. These interfaces tend to cause non-trivial cognitive and physical workload [58], because precise control of manipulation motion or posture could be difficult without the necessary haptic or proprioceptive feedback [5, 72]. The remote visual perception problems, including the limited field of view, loss of depth information, and unnatural camera viewpoint control (typically for eye-in-hand cameras), also contribute to the cognitive and physical workload. This cognitive and physical workload not only fatigues the teleoperator if they use the interfaces for hours but may also lead to work-related musculoskeletal injury for the operator who uses the teleoperation interfaces on a daily basis. For the telemanipulation tasks that are designed for manipulation dexterity rather than handling heavy payload, the “assist-as-need” action augmentation helps humans to efficiently and reliably complete the manipulation actions clearly indicated by humans (e.g., by moving the end-effector close enough to the target location or object). This will be more effective than the autonomy predicting human intents (e.g., References [7, 30, 46]) to assist as early and as much as possible. To effectively reduce the cognitive workload, perception augmentation can present additional visual information using the camera viewpoint from a different perspective, or present AR visual cues to communicate the high-level task and robot states so that human operators do not need to perceive and comprehend the low-level feedback from various sensors [28]

2.2 Action Augmentation via Interface Mapping and Scaling Design

Control Interface. Table 1 categorizes the conventional and contemporary control interfaces for assisted telemanipulation that represent the state-of-the-art. Compared to conventional interfaces, contemporary interfaces tend to: (1) improve the control dexterity of high degrees-of-freedom motion coordination (e.g., multi-finger coordination, hand-arm coordination), and simultaneous position and orientation control; (2) improve the intuitiveness of manipulation control, either by mapping natural human motions to robots, or replicating/representing the controlled robots or the manipulated objects (e.g., using 3D-printed prototypes or virtual reality); (3) improve control accuracy by providing (shared) autonomy with/out haptic feedback. Considering the telemanipulation assistance in recent related work, we also found that: while the action support that (partially) automates task-specific manipulation actions can improve the control accuracy and is used more for structured tasks [19, 54], control augmentation, such as the design of interface mapping and scaling, can better enhance the dexterity and intuitiveness, can be generalized across various interfaces, and are more used for freeform manipulation [67].

Table 1.

Type	Representative Interfaces
Conventional
Desktop	Keyboard + mouse + point-and-click graphical user interface [16, 50, 102, 103]
Hand-held	Xbox gamepad [55, 70, 87, 98], (Haptic) Joystick [4, 12, 23, 30, 39, 68, 69, 103], Customized teleoperation console (e.g., Da Vinci Surgeon Console [1, 22])
Contemporary
Wearable	Arm/hand exoskeleton [10], Data glove [31, 56], Soft haptic glove [63, 71]
Hand-held	Hand-held controllers of virtual reality systems (e.g., HTC Vive, Oculus) with trackpad and buttons, Robopuppet [27], Chopstick [49], Haptic tweezer [84], Tangible interface [14]
Motion/Gesture	Touchless: Arm/hand motion tracking (e.g., vision-based [24, 25, 90, 103], marker-based [20, 58], and IMU-based [100]), Mid-air gesture control [17, 43]; Touch-based: Touch-screen gesture control [11, 62, 101]

Table 1. Conventional and Contemporary Control Interfaces for Assisted Telemanipulation

Mapping Design. While being intuitive, motion-tracking interfaces are generally limited in their control efficiency. This is a result of the limited accuracy of human motions, and interference of intended and unintended motions due to simultaneous control of many Degrees of Freedom (DOFs). The efficiency of the controlled motions can be improved by introducing constraints in terms of virtual fixtures [76] or autonomy for teleoperation assistance (e.g. collision avoidance [73], motion guidance toward intended goal [60]). From a more general perspective, the interfaces that map gestures or point-and-click actions to autonomous robot motions or movement primitives [103] can all be considered as some kind of constraints that limit the extent to which human operator can control the robot motion freely. In addition, motion constraints can also be introduced by the separation of DOFs in the design of interface mapping. For instance, people may use separate controllers to manipulate a 3D object’s position and orientation, to avoid the interference of intended and unintended motion control [103]. Some interface hardware, such as the trackpad of hand-held controllers and the joystick of gamepads, are naturally suitable for the separation of DOFs as they can clearly distinguish the control inputs for different motion directions based on the controlled axes. For screen- or projection-based interfaces, interactive avatars such as the ring-and-arrow markers [75], the virtual handlebar [37] enable the independent control of individual DOF(s) of the manipulated (virtual) object or robot end-effector.

Scaling Design. The motion scaling ratio of the interface affects both the control efficiency and accuracy of the telemanipulation tasks. The major concern in the design of interface scaling ratio is how to achieve the trade-off of control efficiency, adaptability and predictability. Scaling up the control motion can increase robot motion speed and range but may compromise the accuracy of motion control. However, scaling down the control motion will increase the motion control accuracy in the concerned small-scale workspace, but may also improve the efficiency by reducing operational errors. The scaling ratios can be fixed (commonly used by tele-robotic laparoscopic or eye surgery interfaces [40]), or vary with the user’s operating speed (e.g., PRISM method [33]) or regions of operation [67]. Both fixed and varying scaling have pros and cons. Interfaces with varying scaling ratios can better adapt to the control of fast and slow motions in a large or small workspace. In contrast, interfaces with fixed scaling ratios tend to be more stable and predictable to the teleoperator. Related work in the literature proposed several solutions to achieve such trade-off, which allow the user to: (a) manually switch among several pre-defined scaling ratios (suggested by task experts) depending on the types of operation or size of the workspace [40]; (b) manually adjust the scaling ratio as a continuous control parameter (e.g., by changing the distance between the controlling hands [37, 88]); (c) autonomously adjust the scaling ratio (e.g., according to the location of operation in the workspace [25]). While these representative designs balance the performance objectives to some extent, it was also revealed that: (1) being able to adjust the scaling ratio (manually or autonomously) is useful overall but may lead to a complicated interface design hard for users to learn; (2) the manual switching of scaling ratio modes leads to more predictable interface behavior, but may increase the control efforts and mental workload; (3) autonomously adjusting the scaling ratio reduces the control efforts and cognitive workload, but has to be carefully designed to the nature of the task and the preference of users.

2.3 Visual Assistance for Perception Augmentation

Complementary Viewpoint. Human operators need to have visual information regarding the global, spatial relationship in the workspace, and the detailed, local visual information of region-of-interest critical to the precise operation. Realistic tele-manipulator robots also need to have multiple cameras dedicated to providing global and local viewpoints, or a single camera with sufficient mobility and motion of range to serve both purposes. Related literature and our prior work proposed to present both (or switch between) the global viewpoint from an onboard or standalone workspace camera, and the local viewpoint from a high-mobility, eye-in-hand camera to focus on the region of interest (e.g., using detail-in-context display [48, 86, 94]). However, for gross manipulation, humans may need to reach beyond the workspace covered by the workspace cameras. Providing a much-too-large camera field of view (FOV) is not efficient for limited communication bandwidth and may compromise the resolution in visual display for the workspace of frequent manipulation operations. For precise manipulation, an additional camera viewpoint from a different perspective may be necessary to confirm if the manipulation motion satisfies the task constraints in multiple degrees-of-freedom. Another problem we have to address is the dilemma to retain human’s authority and freedom to control the camera while reducing the human’s effort for camera control. The autonomy for dynamic camera viewpoint control and optimization can reduce human’s camera control efforts, yet it may move the camera in an unexpected and unpredictable way and disrupt the operator’s manipulation.

Augmented Reality Visual Support. AR visual cues can communicate very rich, detailed information using a variety of colors, shapes, and displayed text. AR visual cues are preferred to assist the estimation of spatial relationships (e.g., gap estimation for driving assistance [82]), to direct and enhance visual attention (of drivers [74] and video game players [21]). More recently, the design of AR visual cues emphasizes how to intuitively communicate robot motion intent (e.g., goal and trajectory [15]) to assist the human control or supervision of more autonomous robots, and emphasizes how to enhance the depth perception [6], contextual understanding of 3D spatial relationship [95], and real-time status of robot, task and environment [97], to assist robot teleoperation. The AR display can be augmented to provide information about the robot, interface, and environment [96]. The AR visual displays can also be integrated with virtual reality display of robot models [80], or presented with the haptic cues to communicate interaction force, motion constraints, or desired trajectories [36, 53, 66, 73].

3 Interface and Evaluation Design

This section will present our proposed approaches for perception and action augmentation, and their implementation on a representative telemanipulation system and for a general-purpose pick-and-place task. We further propose methods for the estimation of eye-tracking-based cognitive workload, and motion-tracking-based physical workload to enable the evaluation of integrated interfaces in formal user studies.

3.1 System Overview

Figure 1 shows the telemanipulation system developed in our prior work [59], which enables the development of the perception and action augmentation proposed for this project. The robot platform is a 7-DoF Kinova Gen 3 manipulator with a two fingered Robotiq gripper that can detect contact with the grasped object. Two RealSense Cameras (D435) were standing alone in the workspace for primary and complementary remote perception.

For robot motion control, we use an HTC Vive hand-held controller (referred to as “controller” in the rest of this article) that allows human operators to control the freeform robot motions using their natural hand motions and constrained motion using controller’s trackpad. By default (i.e., Mode 1 of Figure 2), the linear velocity of the controller will be mapped to the linear velocity of the robot. The input-to-output motion mapping ratio is 1:5 along the x-axis and 1:3 along the y- and z-axis. This mode will be referred to as baseline mode in the rest of this article. We locked the robot’s rotational motions, because this work focuses on developing and comparing different modalities of teleoperation assistance instead of the capabilities of robot control. To perform a telemanipulation task, the operator will: (1) press the menu button on the controller to send the robot to the home configuration, (2) press the menu button again to get the robot ready, and press the grip (side) button to initiate the control.

Fig. 2.

For visual feedback, we used 1,440\(\times\)1,080 pixel resolution Unity 3D window on a 27-inch desktop monitor to display the remote camera video stream (at 30 Hz frame rate) and to display augmented reality visual cues (see Section 3.2 for details). By default, the graphical user interface (GUI) will only display the robot’s operation state. Specifically, the GUI will display: (1) “WAITING” when the tele-robotic system is ready for operation and waiting for a control command; (2) “SENDING HOME” when the operator presses a controller button to set the robot to the default pose; (3) “READY” when the robot is posed at the start position for the current task; (4) “TELEOPERATING” when the robot is being teleoperated; (5) “PAUSED” when the robot is paused by the teleoperator. Figure 3 shows the control architecture and data communication pipeline of the telemanipulation system. The RGB video from the remote cameras are streamed at 30 Hz frame rate. A screen-based eye tracker (Tobii Pro Nano) was attached below the monitor to track the human operator’s gaze and eye movements (e.g., pupil diameter) at 60 Hz. The autonomy for perception and action can detect the ArUco tags attached to the objects, container, and counter workspace [35, 79], to estimate the information for the AR visual cues and control the robot autonomous actions for precise manipulation (e.g., object grasping and placing).

Fig. 3.

3.2 Design and Implementation of AR Visual Cues and Assistive Autonomy

To assist robot remote manipulation, we implemented systematic AR visual cues and user-triggered autonomous actions as the baseline representing the common solution for remote perception and action problems. We then develop, integrate, and compare different types of perception and action assistance upon the baseline AR visual support and assistive autonomy, to discover new knowledge on optimal human-robot collaboration for freeform telemanipulation.

AR Visual Cues. Our prior work [59] has proposed four types of AR visual cues for freeform teleoperation assistance, including: (1) the Target Locator to indicate the robot’s movement direction and distance to the targeted object or goal pose; (2) the Action Affordance to indicate if the robot is ready to afford the action to be performed (e.g., grasping or stacking an object, with a good chance of success); (3) the Action Confirmation to indicate that the robot has successfully performed an appropriate action; and (4) the Collision Alert to alert the teleoperator if the end-effector is about to violate any environment constraints (e.g., hitting the table). Figure 2 (Mode 2) shows the implementation of these AR visual cues to assist a pick-and-place task for this work. This mode will be referred to as the AR mode in the rest of this article.

—

The Height indicator shows the robot’s distance to the table surface. Besides the display of numerical distance, the height bar display also turns from green to red if the robot is too close (within 0.1 m) to the table.

—

The Alignment indicator (displayed as a dot-in-circle) shows if the robot is aligned with the object to grasp or the container to place the object, in x- and y-direction. Once the blue dot moving with the robot is aligned with the pick circle displayed on the object or containers, the pink circle will change its color to light blue to indicate the operator can reliably close or open the gripper to reliably grasp or drop the object into the container.

—

The Grasping/Placing Hint includes two square-shape that turn on and off to show whether the robot is aligned with the object or the container so that the operator can confidently close or open the gripper. It is designed to confirm the critical information conveyed in the Alignment and Height cues.

—

The Arrow with Distance indicator shows the distance (in cm) and direction (using green and pink arrows, respectively) to show the target object to grasp or container to place.

The proposed implementation of AR visual cues is refined based on our prior design and evaluation results [59]. Specifically, we have adjusted the Height indicator to be vertical instead of horizontal for a more intuitive visual display. We grouped the Grasping/Placing Hint into a white box and highlighted the boundary of the container to make them easier to spot at a glance. We also extended the AR support to picking-and-placing.

Assistive Autonomy. Shown in Figure 2 (autonomy mode), we also provide autonomous actions to assist the operators to perform precise manipulation (e.g., picking and placing an object). The robot autonomy can detect the human’s goal and action intents based on robot states, including the distance to the object or container, and whether the gripper is open or closed. When the gripper is open, we predict human intent to grasp the object if the robot is within the predefined distances to the center of the object (0.05 m, 0.08 m, and 0.13 m in the x-, y-, and z-direction). This mode will be referred to as autonomy mode in the rest of the article. A hint of “AUTONOMY” will be displayed to show the robot has detected the human’s goal and action intent by filling the box when the robot can reliably perform the action. Humans therefore can press the controller’s trigger to confirm the execution of the autonomous action, after which the robot will autonomously reach to grasp the object and lift it to 0.2 m above the table surface. To place an object, the operator needs to move the robot to be within a predefined distance (0.08 m, 0.08 m, and 0.15 m in the x-, y-, and z-direction) to the center of the top of the container. Once confirmed by the human, the robot will autonomously move to the top of the container and drop the object into it reliably.

Remarks. Our proposed visual and action augmentation depends on the robot autonomy to predict human intents, determine action affordance and success, and detect and avoid collision. Here, we implemented a simple design of autonomy that predicts the human’s intent to grasp or place an object based on the robot state. The object detection, action affordance, and collision are also simplified given that we know the location and geometry of the object and the environmental constraints. Note that more advanced methods to predict human intents, from human control inputs [24, 46, 81], gaze [3, 78], or their fusion [77], can be integrated with our proposed visual and action augmentation for more complex manipulation tasks. Advanced methods to detect objects and their action affordance (e.g., using Sim2real approach [47], for unknown objects [18]) can also be incorporated to enable more complicated precise manipulation and the delicate control of interaction forces. Collision in dynamic and cluttered environments can be detected using advanced methods such as generalized velocity obstacles [99].

3.3 Complementary Viewpoint for Perception Augmentation

We propose to leverage an additional workspace camera to provide a complementary viewpoint in which the operator can better perceive the information missed in the primary workspace camera viewpoint. Shown in Figure 4, the GUI presents a picture-in-picture (PIP) display to embed the complementary viewpoint into the primary viewpoint. The perception augmentation in the form of the complementary viewpoint can be presented always (i.e., the fixed viewpoint) or dynamically given the robot and task states (i.e., dynamic viewpoint). It can also be augmented with different interface control modes. Here, we present the pilot user study (Pilot Study I) for complementary viewpoint iterative design and evaluation. We conducted a pilot study with one expert participant (female, age = 33, without visual or motor disability, 100+ h experience with robot) to answer Q1–Q3.

Fig. 4.

Q1—Do we need multiple viewpoints? . We follow the experiment setup in the literature [26], and set up five workspace cameras (4 RealSense cameras and 1 Webcam) to observe the workspace from different perspectives (the front, back, left, right, top). Including the viewpoint from the eye-in-hand camera of the robot, we presented six viewpoints to the user and tracked her gaze fixation on each viewpoint during the pick-and-place task. As shown in Figure 5, the operator was asked to grasp four blocks of different colors placed around a red cup and place each into the cup. During the task, the participant used an HTC Vive controller to control the robot’s motions. For the pilot study, the head-mounted display of the HTC Vive Pro Eye system is used to display graphical user interfaces and track human gaze.

Fig. 5.

In Figure 5, the camera viewpoints that the human looked at are compared between different manipulation actions, and compared between the manipulation of different objects. The operator’s gaze fixation mostly switched between the back view that looks at the workspace from the operator’s standing point and the viewpoint in which she could observe the object to pick up with minimal occlusion. We also found that the participant spent more time looking at the back view than any other viewpoint, which implies that we need to distinguish the primary and complementary viewpoints based on the duration of their fixation.

Q2—Which camera is preferred for the complementary viewpoint?. We conducted another round of the pilot study with the same participant to determine the preferred camera view for a complementary viewpoint. Based on the result from Q1, we implemented a picture-in-picture multi-viewpoint display. By default, we displayed the back view camera to be the primary viewpoint and the front view camera to be the complementary viewpoint. The complimentary viewpoint system was selected, since the back view was utilized the most with different viewpoints used only when additional information was required. This implied that only one additional viewpoint to the primary viewpoint would be required. Shown in Figure 6, the operator could also press the controller’s button to switch the complementary viewpoint to be from other workspace cameras. We recorded the robot and task states and tracked the human gaze. We noticed that the operator preferred to only use the left view camera for the complementary viewpoint, because (1) it shows the additional objects not visible in the back view, and (2) it is less occluded by the robot arm. The participant also mentioned that manually switching the complementary viewpoint increased her cognitive workload and control efforts during a post-study interview. She also mentioned that the complementary view could be improved with a zoom functionality to provide detailed information on the task and workspace.

Fig. 6.

Q3—Do we need to adjust the field of view for the complementary viewpoint?. Based on the feedback from Q2, we enabled the operator to use the controller’s trackpad to control the complementary viewpoint to shift the center of the FOV, and to zoom in and out (Figure 7). We found that during the same pick-and-place task, the operator still chose the complementary viewpoint cameras in a similar way, but preferred to zoom in and shift the FOV to make the target object or container more centered and visible.

Fig. 7.

3.4 Dynamic Interface Mapping for Action Augmentation

Dynamic interface mapping, controlled by humans or autonomy, enables humans to use different interface mappings or scaling ratios to effectively control precise and gross manipulation. While manually adjusting the interface mapping could be annoying and tedious, existing autonomy to adjust the interface mapping [34, 88] tends to confuse humans, because they do not intuitively inform humans about this change, due to which humans may find the interface inconsistent and unpredictable. Our recent work [93] shows that humans can more efficiently control precise manipulation if the interface mapping: (1) allows humans to constrain the motions to be only for the position or orientation control, and (2) autonomously reduces the human-to-robot motion mapping ratio (which will reduce the robot motion speed) close to the objects and environment constraints. For this work, we propose to improve efficiency to control precise, directional motions by: (1) allowing humans to use constrained motion control input channels (e.g., the controller’s Trackpad) to control the motions of an individual DOF, and (2) reducing the scaling ratio only in the precise motion control direction. These interface designs build upon the principles suggested in Reference [93].

Figure 8 shows the implementation of our two proposed action augmentation approaches. In the “Trackpad” control mode, the human can still control the robot in the x- and y-axis (horizontal plane motion) using natural hand motions but will control the motion in the z-axis (vertical motion) using the controller’s trackpad (to avoid collision with the table and object to be grasped). When the human controls the robot to move close to the object or containers, the corresponding AR visual cue will turn from “MOCAP” (in green) to “TRACKPAD” (in red) to inform humans that the interface mapping mode has changed. The trackpad control region for the object (container) is a 0.14 m \(\times\) 0.2 m \(\times\) 0.6 m (0.2 m \(\times\) 0.26 m \(\times\) 0.4 m) bounding box w.r.t. to the center of the object (top of the container). In the “Scaling” mode, the interface will reduce the mapping ratio in the x- and y-axis to allow humans to precisely adjust the robot to align with the object or container while maintaining the scaling ratio to be “Regular” in the z-axis. We define the reduced scaling region to be the same as the trackpad control region. The corresponding AR visual cue will turn from “SPEED: REGULAR” in green to “SPEED: SLOW” in red to inform the change in scaling ratio.

Fig. 8.

3.5 Integration of Perception and Action Augmentation

We conducted a pilot user study (Pilot Study II) to determine the effective integration of perception and action augmentation based on human preference. In total, we have 15 different experiment conditions, considering the three interface control modes with different perception and action augmentation. Our pilot study involved eight participants (four male and four female, five novices, and three participants who have used the same teleoperation system before). The participants performed a single-object pick-and-place task once under every experimental condition (the order of interfaces was randomized) and reported their preferred combinations of control modes and perception/action augmentation after the experiments.

Table 2 highlighted the augmentation combination preferred by the majority of the participants for each mode. In the baseline control mode, five of eight participants preferred to have the pop-up picture-in-picture (PIP) display of the complementary viewpoint (interface-1c) and to use trackpad control (interface-1d). Some participants commented: “...would like to have the pop-up PIP display to provide more workspace information when needed and use a trackpad to control the robot in a single direction for precise motion.” In AR mode when the interface can display AR visual cues, six of eight participants preferred to use the single camera view display (interface-2a) without any perception augmentation and trackpad control (interface-2d). Participants commented: “...the PIP display overwhelms the user interface while the AR visual cues are available.” In autonomy mode, seven of eight participants preferred to use the fixed PIP display of the complementary viewpoint (interface-3b) and motion scaling (interface-3e). As the participants commented: “...the fixed PIP display increases the awareness of the region where autonomy is triggered” and “...motion scaling prevents large movement that moves the robot out of the autonomy zone.”The preferred combination of interface control modes with perception and action augmentation will be further evaluated in our formal user study.

Table 2.

	Baseline	AR Visual Cues	Assistive Autonomy
Default	1a: Single View	2a: Single View	3a: Single View
PA = Perception	1b: Fixed PIP (PA1)	2b: Fixed PIP (PA1)	3b: Fixed PIP (PA1)
Augmentation	1c: Pop-up PIP (PA2)	2c: Pop-up PIP (PA2)	3c: Pop-up PIP (PA2)
AA = Action	1d: Trackpad (AA1)	2d: Trackpad (AA1)	3d: Trackpad (AA1)
Augmentation	1e: Motion Scaling (AA2)	2e: Motion Scaling (AA2)	3e: Motion Scaling (AA2)

Table 2. Pilot Study II: Testing Conditions with Highlighted Combinations Preferred by the Participants

We further refined the interface display based on the freeform comments from the participants. Shown in Figure 9, we use a sidebar in pink and green to prominently indicate the activation of action augmentation. In autonomy mode, we also highlight the region to activate the autonomous actions in both the primary and complementary viewpoint. The corresponding AR visual cue (i.e., the square around the object) will be turned from white to light blue color.

Fig. 9.

3.6 Estimation of Cognitive and Physical Workload

We estimate the cognitive workload using the operator’s gaze and eye movement tracked by a Tobbi Pro Nano eye tracker. We also propose a novel method to estimate the physical workload online from human motion tracking.

Estimation of Cognitive Workload (Offline). Following the methods in the literature [44, 52, 89], we will estimate cognitive workload caused by stress \(C_{str}\), interface complexity \(C_{int}\) and task workload \(C_{tsk}\) from the operator’s pupil diameter, gaze fixation and movements, and task duration. We will track the difference between the operator’s pupil diameter, and estimate the cognitive workload caused by stress as the difference between average pupil diameter (\(D_{tsk}\) during a task and the operator’s calibrated pupil diameter \(D_{cal}\) before the task start, and will be normalized with respect to the maximum cognitive workload among all the participants, i.e., \(C_{str} = \frac{\overline{D_{tsk}} - D_{cal}}{\max \nolimits _{p=p_1,\ldots ,p_n}(\overline{D_{tsk}} - D_{cal})}\). Prior literature suggests that [44, 52, 89] pupil dilates with the increased workload, thus increasing the difference between the average pupil diameter during a task (\(D_{tsk}\)) and the operator’s calibrated pupil diameter (\(D_{cal}\)) prior to the start of the task.

The cognitive workload caused by interface complexity \(C_{int}\) will be computed as the ratio between the average distance in pixels of the operator’s gaze fixation and the center of visual display and the maximum distance in pixels (from edge to center of visual display, i.e., \(S_{tsk}\) and \(S_{max}\)). Thus, the interface complexity can be calculated as, \(C_{int} = \overline{S_{tsk}}/S_{max}\)). To compute the cognitive workload for each sub-task (e.g., picking-and-place one object), we will also estimate the cognitive workload caused by task complexity as the ratio between the time to complete a sub-task and total task completion time (namely, \(C_tsk = T_{sub}/T_{total}\)). Thus, the cognitive workload for a sub-task can be computed as the average of \(C_{str}\), \(C_{int}\) and \(C_{tsk}\). We also contribute the overall workload \(C_{task}\) of the entire task caused the stress and interface complexity as the average of \(C_{str}\) and \(C_{int}\), assuming they have equal contributions.

Estimation of Physical Workload (Online). Surface Electromyography (sEMG) signals can provide more accurate measurements of the muscle efforts and physical workload than using subjective feedback (e.g., Rapid Upper Limb Assessment, namely, RULA [2, 42, 65]). Our recent work has used sEMG for the objective but offline estimation of physical workload in robot teleoperation via whole-body motion mapping [58, 60]. Here, we propose to learn predictive models for the online, accurate muscle effort prediction from human motion-tracking data. Our prior work [58] shows that: the muscle efforts of the anterior, lateral deltoid, and bicep muscle groups, caused by shoulder flexion, abduction, and elbow flexion, contribute most to the physical workload when humans control telemanipulation using their arm and hand motions.

Shown in Figure 10, we thus attached 6 body trackers (Vive Tracker 3.0) to the upper arms, forearms, chest, and waist of the human operator, to estimate the shoulder and elbow joint angles. Specifically, the shoulder flexion (\(\theta _{SF}\)) is estimated on the sagittal plane as

\begin{equation} \theta _{SF} = \arccos \left(\frac{\vec{T}_{ua} \cdot \vec{g}}{||\vec{T}_{ua}||\ ||\vec{g}||}\right)\!, \end{equation}

(1)

which has \(\vec{T}_{ua}\) to be the upper arm vector estimated from shoulder and elbow trackers, and the \(\vec{g}\) to be the gravity vector, both of which are projected on the sagittal plane (i.e., the x-y plane).

Fig. 10.

The shoulder abduction \(\theta _{SA}\) is estimated on the frontal plane as

\begin{equation} \theta _{SA} = \arccos \left(\frac{\vec{T}_{vertical} \cdot \vec{T}_{ua}}{||\vec{T}_{vertical}||\ ||\vec{T}_{ua}||}\right)\!, \end{equation}

(2)

which has \(\vec{T}_{vertical}\) to be the vector perpendicular to the vector connecting two shoulder trackers, and \(\vec{T}_{ua}\) to be the vector of the upper arm formed by shoulder and elbow trackers, both of which are projected on the frontal plane (i.e., the x-z plane).

The elbow flexion \(\theta _{EF}\) is estimated as

\begin{equation} \theta _{EF} = \arccos \left(\frac{\vec{T}_{ua} \cdot \vec{T}_{la}}{||\vec{T}_{ua}||\ ||\vec{T}_{la}||}\right)\!, \end{equation}

(3)

which has the \(\vec{T}_{ua}\) to be the upper arm vector, and \(\vec{T}_{la}\) is the forearm vector estimated from the elbow tracker and hand-held controller positions. Both these vectors are projected on the sagittal plane (i.e., the y-z plane). Note that: \(0^{\circ }\lt \theta _{SF}\lt 150^{\circ }\), \(0^{\circ }\lt \theta _{SA}\lt 120^{\circ }\) and \(0^{\circ }\lt \theta _{EF}\lt 150^{\circ }\).

We conducted a pilot study (Pilot Study III) to validate the physical workload estimation with sEMG data. Shown in Figure 11 (Left), before telemanipulation, we asked the human operator to perform a compound arm exercise that involves the active coordination of the anterior and lateral deltoid and the bicep muscle groups. The participants held one HTC Vive controller in each hand and moved their shoulder and elbow from fully extended to fully flexed for 20 s at the speeds and angular velocities for typical robot control motions. We computed the joint angles of shoulder flexion, abduction, and elbow flexion from the body and arm motions tracked by the HTC Vive trackers, and used the corresponding sEMG data to estimate the offline muscle efforts [58]. For the offline workload estimation, we used a bandpass filter to extract the 40–700 Hz EMG signals from the wireless sEMG sensors (Delsys Trigno Avanti Sensors) attached to the anterior, lateral deltoid and bicep muscle groups. We pre-processed the data using a high pass filter (cutoff frequency 10 Hz) to remove the soft tissue artifact and offset the frequency baseline and used a full-wave rectification then a sixth-order elliptical low pass filter (cutoff frequency 50 Hz) to remove noise and transients and develop a linear envelope of the EMG signals, following the method in the literature [45] but choose tunable parameters for our own task and data. The shoulder muscle efforts were computed using the weighted sum of the anterior and lateral deltoid (at the ratio of 3:4 based on their capabilities of force generation [51]), while the elbow efforts were calculated from the bicep flexion. The muscle efforts were computed by normalizing the processed EMG data with respect to the person’s maximum voluntary contraction following the standard procedure in the literature [13]. We averaged the shoulder and elbow muscle efforts for each arm, and estimated the operator’s overall physical workload as the weighted sum of muscle efforts from the dominant and non-dominant arm (at the ratio of 9:1), for the tasks that operators extensively move their dominant arms for robot motion control:

\begin{equation} P_{overall} = 0.9 \times \left(\frac{P_{DS} + P_{DE}}{2}\right) + 0.1 \times \left(\frac{P_{NDS} + P_{NDE}}{2}\right)\!, \end{equation}

(4)

where \(P_{DS}\) and \(P_{NDS}\) are the shoulder muscle efforts of the dominant and non-dominant arms, while the \(P_{DE}\) and \(P_{NDE}\) are the elbow muscle efforts. A set of injunctive mapping functions was learned to predict the muscle efforts based on the arm joint angles with good accuracy.

Fig. 11.

Figure 10 shows that our predictive model can estimate the sEMG-based physical workload based on the joint angles in isolation exercises, comparable to literature results [2, 42]. For compound exercises, Figure 11 (Right) shows an example of the prediction accuracy of our simple models for one male (32 years old) and one female (33 years old) of functional upper extremities and normal body mass index. The root-mean-square errors between the proposed method and EMG data are 3.68, 4.52, 4.63, and 3.78 for males and 4.97, 3.81, 4.12, and 4.37 for female participants for the left should, left elbow, right shoulder, and right elbow.

4 User Study

Research Questions. We conducted a user study to address the following research questions:

—

RQ1: What aspects of dexterous telemanipulation are improved while using generally preferred improvements upon freeform teleoperation?

—

RQ2: How do the different types of augmentation impact the performance, workload, and preference?

—

RQ3: When should the teleoperators be provided with the visual and action augmentation to increase task performance and decrease operational workload?

—

RQ4: Who should be provided with what type of augmentation for freeform teleoperation?

Experiment Setup. Figure 1 shows the telemanipulation system used for our experiments. The participants were instructed to control the Kinova Gen 3 manipulator robot to perform a single-object pick-and-place task using an HTC Vive hand-held controller. Two RealSense D435 cameras were set up to provide the primary and complementary viewpoints, to provide the back view and left side view of the workspace, while a desktop monitor was used to display the GUI (with camera viewpoints and AR visual cues) to the operator. HTC Vive body trackers and hand-held controllers were used to track their body and arm motions for online physical workload estimation. The Tobii Pro Nano screen-based eye tracker attached to the desktop monitor display was used to track the operator’s gaze and eye movements for cognitive workload estimation. Unlike the pilot studies, for this user study a screen-based tracker was used, because the visual interface was relayed on a computer screen as opposed to a head-mounted display.

Participants. Our experiments include 23 participants (28\(\pm\)10 years old) diverse in gender, technological, and professional experience. These participants were categorized into distinct user groups and these groups may result in participant overlap, with each category still comprising 23 participants. The participants were divided into different groups based on the following factors:

—

Gender: Based on gender, participants comprised 14 male and 9 female participants.

—

Background: The 23 participants comprised 5 nurses and 18 users who do not have a nursing background. Participants were determined to have a nursing background if they are a nursing student or a registered nurse. Participants with nursing backgrounds were recruited to incorporate our intended future users for a teleoperation platform nursing in the development stage.

—

Proficiency: The participants could be divided into 9 experienced and 14 inexperienced users based on their experience in having used the teleoperation system. Users were classified as experienced users if they had more than one hour of experience controlling the robot via teleoperation. They must have also teleoperated the robot within one year to the day of their participation in the user study. The experienced users included participants of the pilot study for this user study, in addition to other experienced participants from prior user studies for different experiments.

—

Gaming: The participants were divided into 16 infrequent video game players and 7 frequent video game players. Participants who spent less than 5 h a week playing video games were classified as infrequent video game players.

—

Spatial: Based on their spatial reasoning skills (via a spatial test from AssessmentDay [8]), the 23 participants were divided into 10 people with low spatial reasoning skills and 13 people with high spatial reasoning skills. The participants’ spatial reasoning skill was evaluated using a spatial reasoning test that was part of the pre-user study survey. Participants who scored less than 60 percentile on the test were evaluated to have low spatial reasoning skills.

—

Mode Order: The participants were also divided based on the order in which the participants used the interface modes. 8 users used the interface modes in the 1 \(\rightarrow\) 2 \(\rightarrow\) 3 order. 7 users used the interface modes in the 2 \(\rightarrow\) 3 \(\rightarrow\) 1 order. 8 users used the interface modes in the 3 \(\rightarrow\) 1 \(\rightarrow\) 2 order.

Task. Our participants performed the same single-object pick-and-place task in all the trials. To focus on the comparison of different augmentation approaches, we simplify the task of picking up and dropping a block object into the container, which does not involve the control of robot/object orientation. The robot, object, and container were set to the same positions at the start of each trial (Figure 12(a)). This task involves general-purpose manipulation actions and requires both gross and precise manipulation. Specifically, each task can be decomposed to four action phases (Figure 12(b) and Figure 12(c), including: (1) Reaching to the object from the robot start position to within 0.3 m from the target object; (2) adjust the robot pose for Grasping the object; (3) Moving the grasped object to be 0.15 m from the container; (4) adjust the robot pose for Placing the object to the container. Note that the telemanipulation tasks could be more diverse and difficult than the single-object pick-and-place, our event-based robot autonomy and augmentation (perception and action) can still adapt to different purpose tasks (e.g., assist the object alignment in the stacking task or interact with multi-target workspace in correct order).

Fig. 12.

Experiment Conditions. In each mode, a participant performed the task twice: (1) without any augmentation (Default), (2) with each perception augmentation (PA1 and PA2), (3) with each action augmentation (AA1 and AA2), and (4) using the preferred combination identified in the pilot study in Section 3.5. The total trials for each participant is 36 = 2 repetitions \(\times\) 3 modes \(\times\) (1 default + 2 PAs + 2 AAs + 1 preferred integration). To avoid the learning effects, participants performed a random selection of one of the three mode orders mentioned above.

Experiment Procedure. The experiment consists of a training section and performing section. In the training section, the experimenter explained and demonstrated how to use the default interface of the selected starting mode, to perform the telemanipulation task without any robot autonomy and interface augmentation for perception and action. The participants could practice the task (for a maximum of 15 min). The participants who felt confident to perform the task after practice would perform the practiced task under the aforementioned experiment conditions. Every participant stated they felt confident in using the teleoperation interface within the allocated practice time. We recorded the task performance (e.g., task completion time, types, and occurrence of errors) during both the training and performing sections. Before the experiments, participants filled in a survey to report their experiences in video games, virtual reality environments, and spatial reasoning (via a spatial test from AssessmentDay [8]). Before the performing section, we asked the participants to look at the monitor for 30 s and recorded their pupil diameters, for the calibration required to estimate their cognitive workload. After completing the trials in each control mode, the participants filled in generic surveys, including NASA-TLX and System Usability Survey (SUS), and reported their rating for each of the six interface conditions. After the experiment, the participants also filled in a customized questionnaire to report their preference for the control mode and interface conditions.

Data Analysis. Our data analysis considered interface conditions and action phases to be the independent variables, and considered the task performance, workload, and user preference to be the dependent variables. We measured the task performance objectively using the completion time, robot trajectory length, and types and occurrence of errors, for the entire task and for each action phase. We consider the physical and cognitive workload estimated using the methods in Section 3.6, and reported in the NASA-TLX survey. We also consider the user’s preference inferred from the gaze fixations and distributions on the interfaces and reported in the SUS and the customized surveys.

5 Results

Driven by our research questions, we outline the step-by-step process we designed to analyze and present our findings. Illustrated in Figure 13, our approach commences by offering cross-mode comparisons for individual dependent variables, concentrating solely on default interfaces devoid of augmentations. Subsequently, we delve into comprehensive cross- and within-mode comparisons for each dependent variable, encompassing diverse augmentations. Moving forward, we unveil within-mode comparisons spanning distinct action phases, encompassing completion time and workload, while accommodating a spectrum of augmentations. Finally, we extend our investigation to within-mode comparisons across diverse user demographics, scrutinizing completion time and augmentation utilization in the context of varied augmentations. For all the comparisons, we analyzed data from all dependent variables using one-way repeated-measures analysis variance (ANOVA), including control modes, augmentations, action phases, and user groups, as a within-participants variable. Upon obtaining a significant result in ANOVA, all subsequent pairwise comparisons (performed through Student’s t-test) employ the Bonferroni correction to control for Type I error in the context of multiple comparisons.

Fig. 13.

5.1 Effects of AR Visual Cues and Assistive Autonomy

From the comparison between different control modes (without any perception or action augmentations), we have the following results regarding task performance, workload, and user preference. As shown in Figure 14, we found that: (1) using autonomous actions can significantly reduce the occurrence of errors; (2) using AR visual cues can significantly reduce cognitive workload; (3) the overall preference of the participants for each mode was autonomy mode \(\gt\) AR mode \(\gt\) baseline. The results in this section address (RQ1) discussing the impact of improvements to freeform teleoperation benefiting different aspects of dexterous telemanipulation.

Fig. 14.

Task Performance. The task completion time for the baseline, AR, and autonomy modes were on average 33.3 \(\pm\) 7.7, 30.6 \(\pm\) 8.7, and 29.2 \(\pm\) 4.3 s for all the participants. The participants completed the task faster (by 8% and 12%) with the assistive AR visual cues or autonomous actions than in baseline mode. The total trajectory lengths of the robot during the task for baseline, AR, and autonomy modes are 1.99 \(\pm\) 0.53, 1.76 \(\pm\) 0.53, and 1.78 \(\pm\) 0.59 m, respectively. The trajectory lengths were shorter in AR and autonomy modes (by 12% and 11%) than in baseline mode. The occurrence of errors during the task were 4.86 \(\pm\) 1.38, 3.93 \(\pm\) 0.91 and 0.15 \(\pm\) 0.14 occurrences, for baseline, AR, and autonomy modes, respectively. The ANOVA analysis showed no significant differences in the task completion time or the total trajectory lengths. However, the ANOVA analysis revealed a significant difference in the occurrence of errors (\(F(2,66)=27.4\), \(p \lt 0.01\)). Post-hoc analyses revealed that the autonomy mode demonstrated significantly fewer errors compared to both the baseline mode (\(p \lt 0.01\), corrected \(\alpha = 0.017\)) and the AR mode (\(p \lt 0.01\), corrected \(\alpha = 0.017\)), exhibiting error reductions of 97% and 96%, respectively.

Workload. The physical workloads while using baseline, AR, and autonomy modes were on average 49.3 \(\pm\) 4.7, 45.8 \(\pm\) 1.2, and 49.7 \(\pm\) 2.8 percent of the muscle capabilities. Mode with AR visual cues led to a lower physical workload than baseline and autonomy modes but without significant differences. The cognitive workloads were on average 59.3 \(\pm\) 11, 50 \(\pm\) 18.9, and 54.2 \(\pm\) 10.9, respectively, when using baseline, AR, and autonomy modes. The ANOVA analysis revealed a significant difference in the cognitive workload (\(F(2,66)=22.1\), \(p \lt 0.01\)). Post-hoc analyses revealed that the AR mode demonstrated significantly less cognitive workload compared to both the baseline mode (\(p \lt 0.01\), corrected \(\alpha = 0.017\)) and the autonomy mode (\(p \lt 0.01\), corrected \(\alpha = 0.017\)).

Preference. Our post-experiment survey asked the participants “Overall, how much do you prefer to use this interface for controlling the robot on a daily basis for your work?” The participants rated their preference for each mode using the Likert scale from 1 (the least) to 5 (the most). The feedback for baseline, AR, and autonomy modes were on average 2.1 \(\pm\) 1.3, 3.8 \(\pm\) 1 and 4.6 \(\pm\) 0.8. The ANOVA analysis revealed a significant difference in the rate of preference (\(F(2,66)=34.5\), \(p \lt 0.01\)). Post-hoc analyses revealed that the autonomy mode demonstrated significantly more preference compared to both the baseline mode (\(p \lt 0.01\), corrected \(\alpha = 0.017\)) and the AR mode (\(p \lt 0.01\), corrected \(\alpha = 0.017\)).

5.2 Effects of Various Perception and Action Augmentations

From the comparison between different augmentations (perception, action, and integration), we have the following results regarding task performance and workload. Table 3 shows the descriptive statistics for all objective measures. The green (red) color indicates the best (worst) case among all the augmentation interfaces for within-mode comparisons. Figure 15 further compares each objective measure for the best and worst cases across three modes. We found that: (1) using Fixed PIP (PA 1) can improve the performance of task completion time; (2) using Trackpad (AA 1) can reduce the occurrence of errors; (3) using the Integrated interface can reduce the cognitive workload. Similarly, Table 6 shows the subjective feedback from NASA-TLX and SUS forms between the interfaces for each mode. We found that: (1) using Trackpad (AA 1) results in higher mental and physical workload; (2) using Scaling (AA 2) results in higher overall workload and lower SUS score. Moreover, the analysis of gaze fixation indicates that: (1) participants tend to check on the PIP more for all perception augmentations in baseline mode; (2) the perception augmentations reduce the usage of the AR visual cues in AR mode. The impact of different types of augmentation on performance, workload, and preference (as mentioned in RQ2) will be highlighted through the results of the following section.

Table 3.

Table 4.

	PA 1	PA 2	Integrated
Baseline mode	29 (16)	17 (9)	21 (13)
AR mode	16 (13)	10 (9)	–
Autonomy mode	17 (16)	10 (11)	12 (15)

Table 4. Duration of Gaze Fixation on the Complementary Viewpoints (Percentage of Task Completion Time)

Table 5.

	AR Visual Cues
	Object AR	Box AR	Height Bar	Hint Boxes	Distance
Default	33 (12)	26 (12)	2 (4)	0.8 (1)	0.3 (1)
PA 1	30 (12)	19 (11)	1 (3)	0.3 (0.9)	0.2 (0.6)
PA 2	32 (13)	22 (13)	0.2 (0.5)	0.2 (0.5)	0.8 (1)
AA 1	34 (12)	25 (13)	2 (3)	0.2 (0.6)	0.2 (0.6)
AA 2	31 (11)	28 (13)	0.7 (1)	0.5 (2)	0.3 (2)
Integrated	36 (12)	25 (13)	0.3 (1)	0.4 (2)	0.6 (2)

Table 5. Duration of Gaze Fixation on the AR Visual Cues (Percentage of Task Completion Time)

Table 6.

Fig. 15.

5.2.1 Task Performance and Workload.

Based on Table 3, we first conducted the ANOVA and then multiple post-hoc comparisons: (1) between all augmentation interfaces for each mode to identify the best and worst case, and (2) of the best and worst case across three modes. Note that the selection of both the best and worst-case scenarios was determined by identifying the most significant difference within each mode. In cases where no significant difference was observed, the criteria shifted to prioritize the minimum mean value with a lower standard deviation.

Task Completion Time. In baseline mode, the ANOVA analysis indicated a significant difference in the completion time between all augmentation interfaces (\(F(5,132)=21.8\), \(p \lt 0.05\)). Post-hoc analyses revealed that the interface with Fixed PIP outperformed the interface with Trackpad, which was the comparison with the most significant difference (\(p \lt 0.001\), corrected \(\alpha = 0.003\)). In AR mode, the ANOVA analysis indicated a significant difference in the completion time between all augmentation interfaces (\(F(5,132)=17.5\), \(p \lt 0.05\)). Post-hoc analyses revealed that the interface with Pop-up PIP outperformed the interface with Trackpad, which was the comparison with the most significant difference (\(p \lt 0.001\), corrected \(\alpha = 0.003\)). In autonomy mode, the ANOVA analysis indicated a significant difference in the completion time between all augmentation interfaces (\(F(5,132)=25.4\), \(p \lt 0.01\)). Post-hoc analyses revealed that the interface with Fixed PIP outperformed the interface with Trackpad, which was the comparison with the most significant difference (\(p \lt 0.001\), corrected \(\alpha = 0.003\)). The ANOVA analysis showed no significant difference for the comparisons of the best and worst cases across three modes.

Total Trajectory Length. The ANOVA analysis showed no significant difference for the comparisons between all augmentation interfaces in each mode as well as of the best and worst cases across three modes.

Occurrence of Errors. In baseline mode, the ANOVA analysis indicated a significant difference in the occurrence of errors between all augmentation interfaces (\(F(5,132)=22.2\), \(p \lt 0.05\)). Post-hoc analyses revealed that the interface with Trackpad outperformed the interface with Scaling, which was the comparison with the most significant difference (\(p \lt 0.001\), corrected \(\alpha = 0.003\)). In AR mode, the ANOVA analysis indicated a significant difference in the occurrence of errors between all augmentation interfaces (\(F(5,132)=48.2\), \(p \lt 0.01\)). Post-hoc analyses revealed that the interface with Trackpad outperformed the interface with Fixed PIP, which was the comparison with the most significant difference (\(p = 0.0026\), corrected \(\alpha = 0.003\)). In autonomy mode, the ANOVA analysis indicated a significant difference in the occurrence of errors between all augmentation interfaces (\(F(5,132)=23.1\), \(p \lt 0.01\)). Post-hoc analyses revealed that the interface with Default outperformed the interface with Integration, which was the comparison with the most significant difference (\(p = 0.0018\), corrected \(\alpha = 0.003\)). For the best interfaces in each mode, the ANOVA analysis showed a significant difference (\(F(2,66)=21.9\), \(p \lt 0.01\)) between modes and the post-hoc comparisons indicated the Default in autonomy mode outperforms the Trackpad in both baseline (\(p \lt 0.01\), corrected \(\alpha = 0.017\)) and AR modes (\(p \lt 0.01\), corrected \(\alpha = 0.017\)). For the worst interfaces in each mode, the ANOVA analysis showed a significant difference (\(F(2,66)=58.1\), \(p \lt 0.01\)) between modes and the post-hoc comparisons indicated the Integrated in autonomy mode outperforms the Scaling in baseline mode (\(p \lt 0.01\), corrected \(\alpha = 0.017\)) and Fixed PIP in AR mode (\(p = 0.01\), corrected \(\alpha = 0.017\)).

Physical Workload. In AR mode, the ANOVA analysis indicated a significant difference in the physical workload between all augmentation interfaces (\(F(5,132)=17.7\), \(p \lt 0.05\)). Post-hoc analyses revealed that the interface with Default outperformed the interface with Trackpad, which was the comparison with the most significant difference (\(p \lt 0.001\), corrected \(\alpha = 0.003\)). The ANOVA analysis showed no significant difference for the comparisons between all augmentation interfaces in both baseline and autonomy modes as well as of the best and worst cases across three modes.

Cognitive Workload. In baseline mode, the ANOVA analysis indicated a significant difference in the cognitive workload between all augmentation interfaces (\(F(5,132)=31.3\), \(p \lt 0.01\)). Post-hoc analyses revealed that the interface with Integration outperformed the interface with Default, which was the comparison with the most significant difference (\(p \lt 0.001\), corrected \(\alpha = 0.003\)). In AR mode, the ANOVA analysis showed no significant difference for the comparisons between all augmentation interfaces. In autonomy mode, the ANOVA analysis indicated a significant difference in the cognitive workload between all augmentation interfaces (\(F(5,132)=15.6\), \(p \lt 0.01\)). Post-hoc analyses revealed that the interface with Integration outperformed the interface with Default, which was the comparison with the most significant difference (\(p \lt 0.002\), corrected \(\alpha = 0.003\)). For the worst interfaces in each mode, the ANOVA analysis showed a significant difference (\(F(2,66)=22.1\), \(p \lt 0.01\)) between modes and the post-hoc comparisons indicated the Default in AR mode outperforms the Default in baseline mode (\(p \lt 0.01\), corrected \(\alpha = 0.017\)) and Fixed PIP in autonomy mode (\(p = 0.01\), corrected \(\alpha = 0.017\)). The ANOVA analysis showed no significant difference for the comparisons of the best cases across three modes.

5.2.2 Usage of Complementary View and AR Visual Cues.

Table 4 shows the duration of the gaze fixation on the complementary viewpoint (measured by the percentage of task completion time). Note that we only count the gaze fixation longer than 0.1 s on a particular interface feature. We found that with perception augmentations including PA 1 (Fixed PIP), PA 2 (Pop-up PIP), and Integrated, the time that participants spent looking at the complementary viewpoint is longer (yet not significantly) in baseline mode than in the other two modes.

Table 5 compares the gaze fixation duration on the different AR visual cues (measured by the percentage of task completion time). Gaze fixations on the height bar, the hint boxes, and the distance box were low, because the participants only glanced at them to find out the action affordance and confirmation. However, the object and box AR features were much more used. This is because the participants need to look at them to control the continuously performed reaching and moving motions, and to precisely adjust the robot position for grasping or placements. Table 5 also shows that the perception and action augmentations may change the participants’ reliance and usage of the AR visual cues. The ANOVA analysis indicated a significant difference in the use of the “object” AR visual cue between all augmentation interfaces (\(F(5,132)=10.8\), \(p \lt 0.05\)). Post-hoc analyses revealed that the PA 1 significantly reduced the use of the “object” AR visual cue, compared to AA 1 (\(p \lt 0.001\), corrected \(\alpha = 0.003\)) or the integrated (\(p \lt 0.001\), corrected \(\alpha = 0.003\)) interface. However, the perception and action augmentations have no significant impacts on the use of the “box” cue based on the ANOVA analysis.

5.2.3 Subjective Feedback.

Table 6 compares the reported mental and physical workload from the NASA-TLX survey between the augmentation interfaces for each mode, and the reported usability from the SUS. We also compare the overall NASA-TLX score, using the coefficients of: 5 for mental demand, 4 for physical demand, 0 for temporal demand, 2 for performance, 3 for effort, and 1 for frustration. The weighting coefficients were generated by choosing from a series of pairs of rating scale factors that were deemed to be important based on the official instructions. For each mode, we first conducted an ANOVA analysis, followed by a set of post-hoc comparisons for all augmentation interfaces. Here, we focused on identifying the best and worst case in each mode. Note that the selection of both the best and worst-case scenarios was determined by identifying the most significant difference within each mode. In cases where no significant difference was observed, the criteria shifted to prioritize the minimum mean value with a lower standard deviation. Similar to Table 3, the green (red) color indicates the best (worst) case for each mode in Table 6.

Mental Demand. In baseline mode, the ANOVA analysis indicated a significant difference in the mental workload between all augmentation interfaces (\(F(5,132)=4.1\), \(p \lt 0.01\)). Post-hoc analyses revealed that the interface with Pop-up PIP outperformed the interface with Trackpad, which was the comparison with the most significant difference (\(p \lt 0.001\), corrected \(\alpha = 0.003\)). In AR mode, the ANOVA analysis indicated a significant difference in the mental workload between all augmentation interfaces (\(F(5,132)=2.9\), \(p \lt 0.05\)). Post-hoc analyses revealed that the interface with Default outperformed the interface with Scaling, which was the comparison with the most significant difference (\(p = 0.0013\), corrected \(\alpha = 0.003\)). In autonomy mode, the ANOVA analysis indicated a significant difference in the mental workload between all augmentation interfaces (\(F(5,132)=6.7\), \(p \lt 0.01\)). Post-hoc analyses revealed that the interface with Default outperformed the interface with Trackpad, which was the comparison with the most significant difference (\(p \lt 0.001\), corrected \(\alpha = 0.003\)).

Physical Demand. In baseline mode, the ANOVA analysis indicated no significant difference in the physical workload between all augmentation interfaces. In AR mode, the ANOVA analysis indicated a significant difference in the physical workload between all augmentation interfaces (\(F(5,132)=2.0\), \(p \lt 0.05\)). Post-hoc analyses revealed that the interface with Default outperformed the interface with Trackpad, which was the comparison with the most significant difference (\(p \lt 0.001\), corrected \(\alpha = 0.003\)). In autonomy mode, the ANOVA analysis indicated a significant difference in the physical workload between all augmentation interfaces (\(F(5,132)=2.3\), \(p \lt 0.05\)). Post-hoc analyses revealed that the interface with Integrated outperformed the interface with Trackpad, which was the comparison with the most significant difference (\(p \lt 0.001\), corrected \(\alpha = 0.003\)).

Overall Workload. In baseline mode, the ANOVA analysis indicated a significant difference in the overall workload between all augmentation interfaces (\(F(5,132)=3.9\), \(p \lt 0.01\)). Post-hoc analyses revealed that the interface with Pop-up PIP outperformed the interface with Scaling, which was the comparison with the most significant difference (\(p \lt 0.001\), corrected \(\alpha = 0.003\)). In AR mode, the ANOVA analysis indicated a significant difference in the overall workload between all augmentation interfaces (\(F(5,132)=3.6\), \(p \lt 0.01\)). Post-hoc analyses revealed that the interface with Default outperformed the interface with Scaling, which was the comparison with the most significant difference (\(p \lt 0.001\), corrected \(\alpha = 0.003\)). In autonomy mode, the ANOVA analysis indicated a significant difference in the overall workload between all augmentation interfaces (\(F(5,132)=5.9\), \(p \lt 0.01\)). Post-hoc analyses revealed that the interface with Integrated outperformed the interface with Trackpad, which was the comparison with the most significant difference (\(p \lt 0.001\), corrected \(\alpha = 0.003\)).

SUS. In baseline mode, the ANOVA analysis indicated a significant difference in the SUS score between all augmentation interfaces (\(F(5,132)=11.7\), \(p \lt 0.01\)). Post-hoc analyses revealed that the interface with Pop-up PIP outperformed the interface with Scaling, which was the comparison with the most significant difference (\(p \lt 0.001\), corrected \(\alpha = 0.003\)). In AR mode, the ANOVA analysis indicated a significant difference in the SUS score between all augmentation interfaces (\(F(5,132)=5.7\), \(p \lt 0.01\)). Post-hoc analyses revealed that the interface with Default outperformed the interface with Scaling, which was the comparison with the most significant difference (\(p \lt 0.001\), corrected \(\alpha = 0.003\)). In autonomy mode, the ANOVA analysis indicated a significant difference in the SUS score between all augmentation interfaces (\(F(5,132)=9.4\), \(p \lt 0.01\)). Post-hoc analyses revealed that the interface with Default outperformed the interface with Trackpad, which was the comparison with the most significant difference (\(p \lt 0.001\), corrected \(\alpha = 0.003\)).

5.3 Effects on Different Action Phases

We further analyze the interface modes and perception/action augmentation on the performance and workload for the different action phases defined in Section 4. From the comparison between different action phases in each mode, we have the following results regarding the task performance, cognitive and physical workload. As shown in Figure 16, we averaged the data from all augmentation interfaces (default, PAs, AAs, integrated) for each action phase (i.e., reaching, grasping, moving, and placing) in each mode (i.e., baseline, AR visual cues, and assistive autonomy). The ANOVA analysis and multiple post hoc comparisons showed that: (1) the action phase of grasping takes a significantly longer time than the reaching phase in all modes; (2) the action phase of placing results in a significantly higher physical workload in all modes; (3) the action phase of grasping and placing results in a significantly higher cognitive workload than the reaching and moving phases, respectively, in all modes. The results of the following section will help identify when visual and action augmentation is to be provided to improve performance and workload, which is RQ3.

Fig. 16.

Task completion time. Table 7 shows the task completion time of each action phase for all the interfaces of all three modes. The ANOVA analysis indicated a significant difference in the completion time between action phases for all baseline (\(F(3,88)=12.7\), \(p \lt 0.01\)), AR (\(F(3,88)=9.1\), \(p \lt 0.01\)), and autonomy (\(F(3,88)=7.1\), \(p \lt 0.01\)) modes. Post-hoc analyses revealed that the grasping action takes significantly longer time than the reaching action, for all interfaces and all the modes (all \(p \lt 0.001\), corrected \(\alpha = 0.008\)). The placing action also takes significantly less time than the moving action for all the interfaces of baseline (\(p \lt 0.001\), corrected \(\alpha = 0.008\)) and AR (\(p \lt 0.001\), corrected \(\alpha = 0.008\)) modes, but not for autonomy mode. We noticed that the interface that takes the least time for the grasping action is different for each mode. Specifically, it is the interface with PA 2 (Pop-up PIP) in baseline mode, the Default interface in AR mode, and the interface with PA 1 (Fixed PIP) in autonomy mode. The interface with PA 1 (Fixed PIP) in autonomy mode takes the least time to grasp, across all the modes and interfaces. We also noticed that the interface that takes the least time for the placing action is different for each mode. Specifically, it is the interface with PA 1 (Fixed PIP) in baseline mode, the interface with PA 2 (scaling) for AR and autonomy modes, respectively. The interface with PA 2 in AR mode takes the least time to place, across all the modes and interfaces. Note that the interfaces that take the least time for the grasping and placing action are highlighted in green in Table 7.

Table 7.

Physical Workload. Table 8 shows the physical workload of each action phase for all the interfaces of all three modes. The ANOVA analysis indicated a significant difference in the physical workload between action phases for all baseline (\(F(3,88)=13.2\), \(p \lt 0.01\)), AR (\(F(3,88)=10.9\), \(p \lt 0.01\)), and autonomy (\(F(3,88)=11.1\), \(p \lt 0.01\)) modes. Post-hoc analyses revealed that the grasping action has a significantly higher physical workload than the reaching action, for all interfaces and all the modes (all \(p \lt 0.001\), corrected \(\alpha = 0.008\)). The placing action also has a significantly higher physical workload than the moving action for all the interfaces of all the modes (all \(p \lt 0.001\), corrected \(\alpha = 0.008\)). We noticed that the interface that has the least physical workload for the grasping action is different for each mode. Specifically, it is the interface with PA 2 (Pop-up PIP) in baseline mode, the Default interface in AR mode, and the interface with PA 1 (Fixed PIP) in autonomy mode. The interface with PA 1 (Fixed PIP) in autonomy mode takes the least time to grasp, across all the modes and interfaces. However, the interface that has the least physical workload for the placing action is the Default interface for all three modes. The Default interface in AR mode has the least physical workload to place, across all the modes and interfaces. Note that the interfaces that have the least physical workload for the grasping and placing action are highlighted in green in Table 8.

Table 8.

Cognitive Workload. Table 9 shows the cognitive workload of each action phase for all the interfaces of all three modes. The ANOVA analysis indicated a significant difference in the physical workload between action phases for all baseline (\(F(3,88)=9.6\), \(p \lt 0.01\)), AR (\(F(3,88)=17.7\), \(p \lt 0.01\)), and autonomy (\(F(3,88)=15.4\), \(p \lt 0.01\)) modes. Post-hoc analyses revealed that the grasping action has a significantly higher physical workload than the reaching action, for all interfaces and all the modes (all \(p \lt 0.001\), corrected \(\alpha = 0.008\)). The placing action also has a significantly higher physical workload than the moving action for all the interfaces of all the modes (all \(p \lt 0.001\), corrected \(\alpha = 0.008\)). We noticed that the interface that has the least cognitive workload for the grasping action is different for each mode. Specifically, it is the integrated interface in baseline mode, the AA 2 (scaling) interface in AR mode, and the integrated interface in autonomy mode. The integrated interface in autonomy mode takes the least cognitive workload to grasp, across all the modes and interfaces. In terms of placing, PA 2 (pop-up PIP) caused the least cognitive workload for baseline and AR mode, and integrated interface caused the least cognitive workload for autonomy mode. The PA 2 (pop-up PIP) interface in AR mode caused the least cognitive workload to place, across all the modes and interfaces. Note that the interfaces that have the least cognitive workload for the grasping and placing action are highlighted in green in Table 9.

Table 9.

5.4 Effects of Other Human Factors

We further analyze the effects of several human factors, including gender, background, and experience with technology, on the performance and usage of the interfaces. The participants were divided into groups based on conditions as defined in Section 4. From the comparison between different user groups, we have the following results regarding the task performance in terms of the completion time. As shown in Figure 17, the multiple pairwise comparisons (Welch’s t-test) showed that: (1) users’ background impacts the task performance with more augmentation interfaces and control modes; (2) using assistive autonomy can mitigate the gap between different user groups in task performance. The following results will provide insights into how to utilize different types of augmentation for freeform teleoperation based on the operator’s characteristics (RQ4).

Fig. 17.

Table 10 compares the task completion time between user groups of different genders, backgrounds, proficiency with the telemanipulation interface, gaming frequency, spatial reasoning level, and mode order. We highlighted the identified significant differences, which indicate that the male, non-nursing, experienced users, frequent video game players, and high-level spatial reasoning users took less time to complete the task.

Table 10.

Table 11 compares the use of the complementary viewpoint between user groups, using the duration of gaze fixation on the complementary viewpoint with respect to the total task completion time. We found significant differences between the male and female user groups when using the interface with PA 1 (Fixed PIP) and using the integrated interface in baseline mode. For both interfaces, the male participants used the complementary viewpoints more than the female participants by 13\(\%\) and 8.7\(\%\) on average, respectively. We also noticed that: in baseline mode, both PA 1 and PA 2 led to significant differences in the use of complementary viewpoints between the users with and without nursing professional experience, with \(p \lt 0.05\) and \(p \lt 0.01\), respectively. Participants without nursing experience or training used the complementary viewpoint 13.4% more on average for the interface with PA 1 (Fixed PIP), and 9.8% more on average for the interface with PA 2 (Pop-up PIP). Moreover, we found significant differences between user groups of different spatial reasoning skills in the use of complementary viewpoint in baseline mode and autonomy mode when using the integrated interface, with \(p \lt 0.01\) and \(p \lt 0.05\). In baseline mode and autonomy mode, when using the integrated interface, participants with lower spatial reasoning skills use the complementary viewpoint more by 10.3\(\%\) and 9.2\(\%\) on average.

Table 11.

Table 12 highlights the significant difference between user groups by the use of the “object” and “box” AR visual cues. In AR mode, we found significant differences (\(p \lt 0.05\)) between users of different background in the use of the “object” cue when using the interface with AA 1 (Trackpad), and in the use of the “box” cue when using the Default interface. Specifically, the users without nursing experience used the “object” cue more by 8% when using the interface with AA 1, while the users with nursing experience used the “box” cue more by 10.3% when using the Default interface. Regarding the factor of proficiency, we found that: in AR mode, the participants with prior experience of robot teleoperation used the “object” cue significantly more (with \(p \lt 0.05\)) than the other group by 7.9%, when using the interface with PA 1 (Fixed PIP). We also found that the high-proficiency group used the “object” cue significantly more (with \(p \lt 0.05\) and \(p \lt 0.01\)) than the other group by 8.8% and 9.0% when using the interface with AA 1 (trackpad) and AA 2 (scaling). Regarding the effects of gaming experience, we found that frequent video game players used the “object” cue significantly more (\(p \lt 0.01\)) than the other user group when using the interface with AA 1 (Trackpad), by 10.3% on average. We also found that frequent video game players used the “object” cue significantly more (\(p \lt 0.05\)) than the other user group when using the integrated interface, by 7.3% on average. Regarding the factor of spatial reasoning skills, we found that the participants with better spatial reasoning skills used the “object” cue significantly more (\(p \lt 0.05\)) than the other users when using the interface with AA 1 (Trackpad), by 7.0% on average. We also found that the participants of better spatial reasoning skills used the “box” cue significantly less (\(p \lt 0.05\)) than the other users when using the interface with AA 1 (Trackpad, by 9.1% on average). Regarding the effects of mode orders, we found that the users performed the mode in 3 \(\rightarrow\) 1 \(\rightarrow\) 2 order used the “box” cue significantly more (\(p \lt 0.05\)) than in 2 \(\rightarrow\) 3 \(\rightarrow\) 1 order when using the Default interface, by 12.3% on average. We also found that the participants used the mode in 1 \(\rightarrow\) 2 \(\rightarrow\) 3 order used the “box” cue significantly more (\(p \lt 0.05\)) than in 2 \(\rightarrow\) 3 \(\rightarrow\) 1 order when using the interface with PA 1 (Fixed PIP, by 10.8% on average).

Table 12.

Table 13 shows the trend in the use of the height bar, hint box and distance between user groups. The table presents a comparison of the total number of frames for each AR cue used while performing the task across all the participants in the user group. We highlighted the user group that used these features more, measured by the total number of frames.

Table 13.

Remarks:. These analyses demonstrate the influence of various user demographics on task performance and augmentation utilization. We acknowledge the need for further investigation, including the possibility of conducting controlled experiments dedicated to studying the demographics of users in more depth.

6 Discussion

6.1 Summary of Novelty and Contributions

Our prior work developed shared autonomy to assist the remote perception and action control in freeform telemanipulation. Specifically, we have leveraged AR visual cues [59], interface mapping [93] and autonomous actions for precise manipulation (e.g., grasping) [60], to effectively reduce the human’s workload while improving the control efficiency. The work in this article further proposed new approaches for perception and action augmentation, to achieve more efficient and effortless freeform telemanipulation motion control. Moreover, we proposed a novel method for objective physical and cognitive workload estimation based on human motion and eye tracking devices, to accurately evaluate human comfort while performing remote manipulation. To address the research questions (listed in Section 4), we conducted a comprehensive user study to evaluate and compare various integrations of telemanipulation assistance and discovered new knowledge about their impacts on performance, workload and the preference of the user’s groups that differ in various human factors. The main findings include:

(1)

Based on the comparison between different control modes (direct manual, AR visual cues, and assistive autonomy without any augmentations), using AR visual cues can significantly reduce cognitive workload while using assistive autonomy can significantly reduce the occurrence of errors.

(2)

Based on the comparison between different augmentations for all control modes, using Fixed PIP (PA 1) can improve the performance of completion time and trajectory lengths, using Trackpad (AA 1) can reduce the occurrence of errors but result in higher mental and physical workload, while using the Integrated (preferred PA+AA) can reduce the cognitive workload.

(3)

Based on the comparison between different action phases (reaching, grasping, moving, and placing) in the object pick-and-place task, using perception augmentations can reduce the completion time for grasping action (takes the longest time), and using Integrated (preferred PA+AA) can reduce the cognitive workload for both grasping and placing actions (with higher workload than reaching and moving).

(4)

Based on the comparison between user groups, using assistive autonomy control mode with perception augmentations and Trackpad can mitigate the gap in task performance between different user groups.

6.2 The Effective Integration of Perception and Action Assistance is Task-Dependent

Related work in the literature has developed various augmented reality interfaces and shared autonomy for action supports, to assist remote perception and robot motion control [6, 15, 95], yet there are still no comprehensive comparisons to evaluate their individual and integrated impacts on task performance, workload, and user preference. To fill this gap, we first conducted a user study to compare interfaces that only have the perception assistance (of AR visual cues) or the action assistance (of autonomous actions), to the baseline interface without any assistance. Our results show that both perception and action assistance can effectively reduce the cognitive workload. Moreover, the assistive action can effectively reduce the occurrence of errors.

We further evaluated to what extent the additional perception and augmentations proposed in this work can further assist the telemanipulation. Our results show that the effectiveness of perception and action assistance (and their integration) depends on the task performance objective. Specifically, we found that: for the tasks that need to be completed as fast as possible, the integration of Fixed PIP display and autonomous actions is the most effective, because it led to the least task completion time and motion efficiency (measured by the total robot trajectory length). As the participants commented: “...the complementary viewpoint helped me to clearly understand the relationship between robot and target so that can move faster in the right direction to active the autonomy feature.” For the task that emphasizes the reliability and precision of motion control, the interface that only provides autonomous actions for assistance turns out to be most effective, because the operators can focus more on the use of autonomous actions and will not be distracted by other interface augmentation designed to enhance task efficiency or to reduce the workload. However, when the human’s workload and comfort are prioritized in the telemanipulation, it is more effective to provide only the AR visual cues and assistive autonomy integrated with Fixed PIP interfaces, as they can effectively reduce human’s physical and cognitive workload, respectively. These findings suggest the extension of the task- or goal-dependent perception and augment assistance design, which will be intelligent to not only provide suitable augmentation but activate the assistance based on the online estimation of human comfort. From the user preference and SUS, we also found that humans strongly prefer to use autonomy to assist their remote perception and action, but only if they are reliable. Most of the participants commented that: “if the robot autonomy is reliable, I would like to use it on daily basis.” Particularly, the nursing participants “...would like the robot to be as autonomous as possible, because we do not have more bandwidth to control the robot during nursing duty.” However, the robot autonomy may not be consistently reliable due to the perception and action uncertainty of the robots, as well as the complexity of the manipulation tasks. It is still unclear how to adjust the level and type of robot assistance if the reliability of the robot autonomy may vary. Our future work therefore will further investigate the impacts of unreliable autonomy in human-robot collaboration for robot remote manipulation.

6.3 Need for Assist-as-needed Interfaces for Different Tasks and User Groups

Effects on Different Action Phases. The work in this article also compared the effects of telemanipulation assistance on different action phases during the tasks. While our comparison is limited to the action phases of a general-purpose pick-and-place task, the findings may still imply a design guideline generally applicable to freeform telemanipulation. The pick-and-place task we performed has four action phases, namely, reaching, grasping, moving, and placing. Our results show that: the grasping and placing, which require more precise motion control, took significantly longer time for human control and need more perception and action assistance. Most participants found: “...it was a relief that the robot can take over the grasping and placing part of the task.” We also found that precise manipulation actions cause significantly higher physical and cognitive workload than gross manipulation (i.e., reaching and moving). Especially, the physical workload for the placing action is also significantly higher than the grasping action. Our results show that the most effective interface to assist the grasping action needs to integrate the AR visual cues, fixed PIP display, and assistive autonomous actions, while the most effective interface to assist the placing action only needs the AR visual cues. In terms of cognitive workload, the interface with assistive autonomy integrated with fixed PIP and scaling motion significantly reduced the workload while grasping. Moreover, the interface with AR visual cues and pop-up PIP display of the complementary viewpoint can significantly reduce the cognitive workload of the placing action. The participants in the post-study survey commented that: “it will be great if the assistive autonomy can integrate the certain AR visual cues to gain more information,” which is similar to the design we developed in the integrated interface for autonomy mode. This insight suggests the extension of the research to investigate what type of AR visual cues should be displayed to enable humans to seamlessly utilize various levels of robot autonomy for remote manipulation.

Effects on Different User Groups. We also find significant differences regarding the impacts of perception and action assistance on different user groups. Regarding the task performance, we found that: while the assistive autonomous actions can effectively reduce the task completion time for almost all the comparisons between user groups, the AR visual cues are only significantly more effective for some user groups (of specific gender, background, and proficiency). We also found that the female and nursing participants were less capable of using the complementary viewpoints when the interface did not provide AR visual cues or assistive autonomous actions. The action augmentations proposed in this work, including the “Trackpad” and “Scaling” to improve the efficiency and accuracy of motion control, do not effectively reduce the task completion time for most of the users. However, when these action augmentations were provided, we found significant differences between user groups in their use of the AR visual cues during grasping. This finding implies that for some user groups, the action augmentation only reduced their task completion time to a limited extent (for the task we studied), because they can better leverage the AR visual cues. We also found significant differences from most of the between-user-group comparisons (in the task performance and in the use of the complementary viewpoint), when participants used an interface that integrated some perception and action augmentation proposed in this work (i.e., pop-up PIP complementary viewpoint display and the Trackpad control, as preferred by the pilot study user). This finding implies when the robot autonomy like the AR visual cues and assistive autonomous action are not reliable, some user groups can benefit from some perception or action augmentation. Our future work therefore will investigate how to adjust the integration of the perception and action assistance for different users.

6.4 Limitations

We are aware of several limitations of the work in this article. Our user study can be solidified by recruiting more participants from the nursing profession (e.g., nursing students, faculty, and practitioners), and considering their professional experiences (e.g., grade of students, teaching years as faculty, hours of nursing services) as the factors in our data analysis. More end-users as participants will facilitate the human-in-the-loop design ranging from the relevant task setup to robot assistance development (e.g., testing of the feasibility of the nursing tasks [57]). We will further collaborate with the hospital and nursing school to not only work with the nursing profession but also deploy the robot to work in a (simulated) hospital room.

The task in our experimental study is limited to a general-purpose pick-and-place task with four simple manipulation action phases. Even though our robot autonomy and augmentation (perception and action) might adapt to different purpose tasks, it is unclear if our findings can be generalized to other manipulation actions, tasks, and specialized manipulation for nursing assistance tasks. The telemanipulation task is also limited, because it only involves freeform position control. It is unclear whether the proposed perception and action assistance can be effective in manipulation tasks that emphasize orientation control or 3D object pose control. A natural extension of our implementation is to further map the angular velocity as an input to control the robot’s orientation. Such an approach could enable robots to perform a wider range of remote manipulation tasks such as tracking the path, aligning, and stacking. Future work could implement similar perception and action assistance proposed in this article for assisting the orientation control.

7 Conclusion

This article conducted a comprehensive evaluation of individual and integrated approaches to assist the remote perception and motion when humans control a 7 DOF manipulator (using natural hand motions) to perform a single-object pick-and-place task. We analyzed the performance, workload, and human preference when using the interfaces that integrated different perception and action assistance and discovered new knowledge about how to effectively provide task- and use-dependent telemanipulation assistance.

Specifically, we have discovered the effective integration of perception assistance (e.g., AR visual cues, display of complementary viewpoint) and action assistance (e.g., assistive autonomous actions, directional precision motion control using the trackpad and larger motion mapping scaling) vary with the task performance objectives, manipulation actions, and the human factors of the user groups.

References

[1]

Intuitive. [n.d.]. Da Vinci Surgeon Console. Retrieved from https://www.intuitive.com/en-us/products-and-services/da-vinci/systems. Accessed: 2022-04-28.

Abstract

1 Introduction

1.1 Remote Perception and Teleaction Problems in Freeform Telemanipulation

1.2 Division of Human and Robot Efforts in Assisted Teleoperation

1.3 Overview of Research Efforts

2 Related Work

2.1 Cognitive and Physical Workload in Freeform Telemanipulation

2.2 Action Augmentation via Interface Mapping and Scaling Design

2.3 Visual Assistance for Perception Augmentation

3 Interface and Evaluation Design

3.1 System Overview

3.2 Design and Implementation of AR Visual Cues and Assistive Autonomy

3.3 Complementary Viewpoint for Perception Augmentation

3.4 Dynamic Interface Mapping for Action Augmentation

3.5 Integration of Perception and Action Augmentation

3.6 Estimation of Cognitive and Physical Workload

4 User Study

5 Results

5.1 Effects of AR Visual Cues and Assistive Autonomy

5.2 Effects of Various Perception and Action Augmentations

5.2.1 Task Performance and Workload.

5.2.2 Usage of Complementary View and AR Visual Cues.

5.2.3 Subjective Feedback.

5.3 Effects on Different Action Phases

5.4 Effects of Other Human Factors

6 Discussion

6.1 Summary of Novelty and Contributions

6.2 The Effective Integration of Perception and Action Assistance is Task-Dependent

6.3 Need for Assist-as-needed Interfaces for Different Tasks and User Groups

6.4 Limitations

7 Conclusion

References

Index Terms

Recommendations

Motor Activity-Perception Based Approach for Improving Teleoperation Systems

A comparison of two cost-differentiated virtual reality systems for perception and action tasks

Research on Mixed Reality Visual Augmentation Method for Teleoperation Interactive System

Comments

Information

Published In

Publisher

Publication History

Check for updates

Author Tags

Qualifiers

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

View options

PDF

eReader

Login options

Full Access

Figures

Other

Share

Share this Publication link

Share on social media

Affiliations