This section presents our proposed approaches for perception and action augmentation and their implementation on a representative telemanipulation system for a general-purpose pick-and-place task. We further propose methods for the estimation of eye-tracking-based cognitive workload and motion-tracking-based physical workload, to enable the evaluation of the integrated interfaces in formal user studies.
3.1 System Overview
Figure 1 shows the telemanipulation system developed in our prior work [59], which enables the development of the perception and action augmentation proposed for this project. The robot platform is a 7-DoF Kinova Gen 3 manipulator with a two-finger Robotiq gripper that can detect contact with the grasped object. Two stand-alone RealSense cameras (D435) are placed in the workspace for primary and complementary remote perception.
For robot motion control, we use an HTC Vive hand-held controller (referred to as the “controller” in the rest of this article) that allows human operators to control freeform robot motions using their natural hand motions and constrained motions using the controller’s trackpad. By default (i.e., Mode 1 of Figure 2), the linear velocity of the controller is mapped to the linear velocity of the robot. The input-to-output motion mapping ratio is 1:5 along the x-axis and 1:3 along the y- and z-axes. This mode will be referred to as the baseline mode in the rest of this article. We locked the robot’s rotational motions because this work focuses on developing and comparing different modalities of teleoperation assistance rather than the capabilities of robot control. To perform a telemanipulation task, the operator will: (1) press the menu button on the controller to send the robot to the home configuration, (2) press the menu button again to get the robot ready, and (3) press the grip (side) button to initiate the control.
For visual feedback, we used a 1,440 \(\times\) 1,080 pixel Unity 3D window on a 27-inch desktop monitor to display the remote camera video stream (at a 30 Hz frame rate) and the augmented reality (AR) visual cues (see Section 3.2 for details). By default, the graphical user interface (GUI) only displays the robot’s operation state. Specifically, the GUI displays: (1) “WAITING” when the tele-robotic system is ready for operation and waiting for a control command; (2) “SENDING HOME” when the operator presses a controller button to send the robot to the default pose; (3) “READY” when the robot is posed at the start position for the current task; (4) “TELEOPERATING” when the robot is being teleoperated; and (5) “PAUSED” when the robot is paused by the teleoperator. Figure 3 shows the control architecture and data communication pipeline of the telemanipulation system. The RGB video from the remote cameras is streamed at a 30 Hz frame rate. A screen-based eye tracker (Tobii Pro Nano) was attached below the monitor to track the human operator’s gaze and eye movements (e.g., pupil diameter) at 60 Hz. The autonomy for perception and action detects the ArUco tags attached to the objects, the container, and the counter workspace [35, 79] to estimate the information needed for the AR visual cues and to control the robot’s autonomous actions for precise manipulation (e.g., object grasping and placing).
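The five operation states above form a simple state machine; a minimal illustrative sketch (with hypothetical names, not the system's actual Unity implementation) is:

```python
from enum import Enum, auto

class OperationState(Enum):
    """GUI operation states displayed to the operator (Section 3.1)."""
    WAITING = auto()        # system ready and waiting for a control command
    SENDING_HOME = auto()   # robot moving to the default (home) pose
    READY = auto()          # robot posed at the start position for the task
    TELEOPERATING = auto()  # robot being teleoperated
    PAUSED = auto()         # robot paused by the teleoperator
```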
3.2 Design and Implementation of AR Visual Cues and Assistive Autonomy
To assist remote robot manipulation, we implemented systematic AR visual cues and user-triggered autonomous actions as the baseline representing the common solutions to remote perception and action problems. We then develop, integrate, and compare different types of perception and action assistance on top of the baseline AR visual support and assistive autonomy, to discover new knowledge on optimal human-robot collaboration for freeform telemanipulation.
AR Visual Cues. Our prior work [59] proposed four types of AR visual cues for freeform teleoperation assistance: (1) the Target Locator, which indicates the robot’s movement direction and distance to the targeted object or goal pose; (2) the Action Affordance, which indicates whether the robot is ready to afford the action to be performed (e.g., grasping or stacking an object with a good chance of success); (3) the Action Confirmation, which indicates that the robot has successfully performed an appropriate action; and (4) the Collision Alert, which alerts the teleoperator if the end-effector is about to violate any environment constraints (e.g., hitting the table). Figure 2 (Mode 2) shows the implementation of these AR visual cues to assist a pick-and-place task in this work. This mode will be referred to as the AR mode in the rest of this article.
— The Height indicator shows the robot’s distance to the table surface. Besides displaying the numerical distance, the height bar also turns from green to red if the robot is too close (within 0.1 m) to the table.
— The Alignment indicator (displayed as a dot-in-circle) shows whether the robot is aligned, in the x- and y-directions, with the object to grasp or the container in which to place the object. Once the blue dot moving with the robot is aligned with the pink circle displayed on the object or container, the pink circle changes its color to light blue to indicate that the operator can close or open the gripper to reliably grasp the object or drop it into the container.
— The Grasping/Placing Hint includes two square-shaped indicators that turn on and off to show whether the robot is aligned with the object or the container, so that the operator can confidently close or open the gripper. It is designed to confirm the critical information conveyed in the Alignment and Height cues.
— The Arrow with Distance indicator shows the distance (in cm) and direction (using green and pink arrows, respectively) to the target object to grasp or the container in which to place it.
The proposed implementation of the AR visual cues was refined based on our prior design and evaluation results [59]. Specifically, we adjusted the Height indicator to be vertical instead of horizontal for a more intuitive visual display. We grouped the Grasping/Placing Hint into a white box and highlighted the boundary of the container to make them easier to spot at a glance. We also extended the AR support to both picking and placing.
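The logic behind the Height and Alignment cues described above reduces to simple threshold checks on the robot and target poses. The sketch below is illustrative only; the 0.1 m height warning follows the value reported above, while the alignment radius and all function names are our own assumptions.

```python
def height_cue(gripper_height_m, warn_threshold_m=0.10):
    """Return the height-bar color: red when the gripper is within 0.1 m
    of the table surface, green otherwise (Section 3.2)."""
    return "red" if gripper_height_m < warn_threshold_m else "green"

def alignment_cue(robot_xy, target_xy, radius_m=0.02):
    """Return True (the circle turns light blue) when the blue dot moving with
    the robot lies inside the target circle in the x-y plane.
    The 0.02 m radius is an illustrative value, not the system's."""
    dx = robot_xy[0] - target_xy[0]
    dy = robot_xy[1] - target_xy[1]
    return (dx * dx + dy * dy) ** 0.5 <= radius_m
```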
Assistive Autonomy. As shown in Figure 2 (autonomy mode), we also provide autonomous actions to assist operators in performing precise manipulation (e.g., picking and placing an object). The robot autonomy detects the human’s goal and action intents based on the robot states, including the distance to the object or container and whether the gripper is open or closed. When the gripper is open, we predict the human’s intent to grasp the object if the robot is within predefined distances of the center of the object (0.05 m, 0.08 m, and 0.13 m in the x-, y-, and z-directions). This mode will be referred to as the autonomy mode in the rest of the article. When the robot has detected the human’s goal and action intent and can reliably perform the action, an “AUTONOMY” hint is displayed by filling the box. The human can then press the controller’s trigger to confirm the execution of the autonomous action, after which the robot will autonomously reach to grasp the object and lift it to 0.2 m above the table surface. To place an object, the operator needs to move the robot to within a predefined distance (0.08 m, 0.08 m, and 0.15 m in the x-, y-, and z-directions) of the center of the top of the container. Once confirmed by the human, the robot will autonomously move to the top of the container and reliably drop the object into it.
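A minimal sketch of this state-based intent prediction, using the distance thresholds reported above, is given below; the function and variable names are our own, not those of the system's implementation.

```python
GRASP_THRESHOLDS = (0.05, 0.08, 0.13)   # x, y, z distances to the object center (m)
PLACE_THRESHOLDS = (0.08, 0.08, 0.15)   # x, y, z distances to the container top center (m)

def predict_intent(gripper_open, ee_pos, object_pos, container_top_pos):
    """Predict the operator's intent from the robot state alone: 'grasp' when the
    open gripper is near the object, 'place' when the closed gripper (holding an
    object) is near the container, otherwise None."""
    def within(pos, target, thresholds):
        return all(abs(p - t) <= th for p, t, th in zip(pos, target, thresholds))

    if gripper_open and within(ee_pos, object_pos, GRASP_THRESHOLDS):
        return "grasp"
    if not gripper_open and within(ee_pos, container_top_pos, PLACE_THRESHOLDS):
        return "place"
    return None
```

Once an intent is detected, the GUI fills the “AUTONOMY” box, and the operator confirms with the trigger before the autonomous reach, grasp or drop, and lift is executed.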
Remarks. Our proposed visual and action augmentation depends on the robot autonomy to predict human intents, determine action affordance and success, and detect and avoid collisions. Here, we implemented a simple design of autonomy that predicts the human’s intent to grasp or place an object based on the robot state. The object detection, action affordance, and collision detection are also simplified, given that we know the location and geometry of the object and the environmental constraints. Note that more advanced methods to predict human intents, from human control inputs [24, 46, 81], gaze [3, 78], or their fusion [77], can be integrated with our proposed visual and action augmentation for more complex manipulation tasks. Advanced methods to detect objects and their action affordances (e.g., using a Sim2Real approach [47] or methods for unknown objects [18]) can also be incorporated to enable more complicated precise manipulation and the delicate control of interaction forces. Collisions in dynamic and cluttered environments can be detected using advanced methods such as generalized velocity obstacles [99].
3.3 Complementary Viewpoint for Perception Augmentation
We propose to leverage an additional workspace camera to provide a complementary viewpoint in which the operator can better perceive information missed in the primary workspace camera viewpoint. As shown in Figure 4, the GUI presents a picture-in-picture (PIP) display to embed the complementary viewpoint into the primary viewpoint. The perception augmentation in the form of a complementary viewpoint can be presented at all times (i.e., the fixed viewpoint) or dynamically based on the robot and task states (i.e., the dynamic viewpoint). It can also be combined with different interface control modes. Here, we present the pilot user study (Pilot Study I) for the iterative design and evaluation of the complementary viewpoint. We conducted the pilot study with one expert participant (female, age 33, without visual or motor disabilities, 100+ hours of experience with the robot) to answer Q1–Q3.
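The PIP display can be realized by compositing a downscaled complementary frame into a corner of the primary frame. The OpenCV-based sketch below is an illustrative approximation, not the Unity implementation used in our system, and its parameter values are assumptions.

```python
import cv2

def picture_in_picture(primary, complementary, scale=0.3, margin=10):
    """Overlay a downscaled complementary view onto the top-right corner
    of the primary view (both BGR images)."""
    h, w = primary.shape[:2]
    small = cv2.resize(complementary, (int(w * scale), int(h * scale)))
    sh, sw = small.shape[:2]
    out = primary.copy()
    out[margin:margin + sh, w - sw - margin:w - margin] = small
    return out
```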
Q1—Do we need multiple viewpoints? We followed the experiment setup in the literature [26] and set up five workspace cameras (four RealSense cameras and one webcam) to observe the workspace from different perspectives (front, back, left, right, and top). Including the viewpoint from the robot’s eye-in-hand camera, we presented six viewpoints to the user and tracked her gaze fixation on each viewpoint during the pick-and-place task. As shown in Figure 5, the operator was asked to grasp four blocks of different colors placed around a red cup and place each into the cup. During the task, the participant used an HTC Vive controller to control the robot’s motions. For this pilot study, the head-mounted display of the HTC Vive Pro Eye system was used to display the graphical user interfaces and track the human gaze.
In Figure 5, the camera viewpoints that the operator looked at are compared across different manipulation actions and across the manipulation of different objects. The operator’s gaze fixation mostly switched between the back view, which observes the workspace from the operator’s standing point, and the viewpoint in which she could observe the object to pick up with minimal occlusion. We also found that the participant spent more time looking at the back view than at any other viewpoint, which implies that we need to distinguish the primary and complementary viewpoints based on the duration of their fixation.
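Distinguishing the primary from the complementary viewpoints by fixation duration amounts to summing the time the gaze spends on each viewpoint's screen region. The sketch below assumes each 60 Hz gaze sample has already been labeled with the viewpoint it falls on, which is a simplification of the actual analysis.

```python
from collections import defaultdict

def fixation_time_per_viewpoint(gaze_samples, sample_period_s=1 / 60):
    """Sum fixation time per viewpoint from 60 Hz gaze samples, where each sample
    is the name of the viewpoint the gaze falls on (or None when off-screen)."""
    totals = defaultdict(float)
    for viewpoint in gaze_samples:
        if viewpoint is not None:
            totals[viewpoint] += sample_period_s
    return dict(totals)

# The viewpoint with the largest total would be treated as the primary view.
```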
Q2—Which camera is preferred for the complementary viewpoint? We conducted another round of the pilot study with the same participant to determine the preferred camera view for the complementary viewpoint. Based on the results from Q1, we implemented a picture-in-picture multi-viewpoint display. By default, we displayed the back-view camera as the primary viewpoint and the front-view camera as the complementary viewpoint. This single-complementary-viewpoint design was chosen because the back view was utilized the most in Q1, with other viewpoints used only when additional information was required, implying that only one viewpoint in addition to the primary one would be needed. As shown in Figure 6, the operator could also press the controller’s button to switch the complementary viewpoint to any of the other workspace cameras. We recorded the robot and task states and tracked the human gaze. We noticed that the operator preferred to use only the left-view camera for the complementary viewpoint, because (1) it shows additional objects not visible in the back view, and (2) it is less occluded by the robot arm. In a post-study interview, the participant also mentioned that manually switching the complementary viewpoint increased her cognitive workload and control effort. She also mentioned that the complementary view could be improved with a zoom functionality to provide detailed information about the task and workspace.
Q3—Do we need to adjust the field of view of the complementary viewpoint? Based on the feedback from Q2, we enabled the operator to use the controller’s trackpad to shift the center of the complementary viewpoint’s field of view (FOV) and to zoom in and out (Figure 7). We found that, during the same pick-and-place task, the operator still chose the complementary viewpoint cameras in a similar way, but preferred to zoom in and shift the FOV to make the target object or container more centered and visible.
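Digitally shifting and zooming the complementary viewpoint can be approximated by cropping the camera frame around a movable center and resizing it back to full resolution. The sketch below is an illustrative approximation of this behavior, not the controller-trackpad implementation.

```python
import cv2
import numpy as np

def shift_and_zoom(frame, center_uv, zoom=2.0):
    """Crop a window of size (w/zoom, h/zoom) around center_uv (pixel coordinates)
    and resize it back to the original resolution."""
    h, w = frame.shape[:2]
    cw, ch = int(w / zoom), int(h / zoom)
    cx = int(np.clip(center_uv[0], cw // 2, w - cw // 2))
    cy = int(np.clip(center_uv[1], ch // 2, h - ch // 2))
    crop = frame[cy - ch // 2:cy + ch // 2, cx - cw // 2:cx + cw // 2]
    return cv2.resize(crop, (w, h))
```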
3.5 Integration of Perception and Action Augmentation
We conducted a pilot user study (Pilot Study II) to determine the effective integration of perception and action augmentation based on human preference. In total, we have 15 different experimental conditions, considering the three interface control modes with different perception and action augmentation. Our pilot study involved eight participants (four male and four female; five novices and three participants who had used the same teleoperation system before). The participants performed a single-object pick-and-place task once under every experimental condition (the order of interfaces was randomized) and reported their preferred combinations of control modes and perception/action augmentation after the experiments.
Table 2 highlights the augmentation combination preferred by the majority of the participants for each mode. In the baseline control mode, five of eight participants preferred to have the pop-up picture-in-picture (PIP) display of the complementary viewpoint (interface-1c) and to use trackpad control (interface-1d). Some participants commented: “...would like to have the pop-up PIP display to provide more workspace information when needed and use a trackpad to control the robot in a single direction for precise motion.” In the AR mode, when the interface can display AR visual cues, six of eight participants preferred to use the single camera view display (interface-2a) without any perception augmentation and trackpad control (interface-2d). Participants commented: “...the PIP display overwhelms the user interface while the AR visual cues are available.” In the autonomy mode, seven of eight participants preferred to use the fixed PIP display of the complementary viewpoint (interface-3b) and motion scaling (interface-3e). As the participants commented: “...the fixed PIP display increases the awareness of the region where autonomy is triggered” and “...motion scaling prevents large movement that moves the robot out of the autonomy zone.” The preferred combinations of interface control modes with perception and action augmentation will be further evaluated in our formal user study. We further refined the interface display based on the freeform comments from the participants. As shown in Figure 9, we use a sidebar in pink and green to prominently indicate the activation of action augmentation. In the autonomy mode, we also highlight the region that activates the autonomous actions in both the primary and complementary viewpoints. The corresponding AR visual cue (i.e., the square around the object) turns from white to light blue.
3.6 Estimation of Cognitive and Physical Workload
We estimate the cognitive workload using the operator’s gaze and eye movements tracked by a Tobii Pro Nano eye tracker. We also propose a novel method to estimate the physical workload online from human motion tracking.
Estimation of Cognitive Workload (Offline). Following the methods in the literature [44, 52, 89], we will estimate the cognitive workload caused by stress \(C_{str}\), interface complexity \(C_{int}\), and task workload \(C_{tsk}\) from the operator’s pupil diameter, gaze fixation and movements, and task duration. We will estimate the cognitive workload caused by stress as the difference between the average pupil diameter during a task (\(\overline{D_{tsk}}\)) and the operator’s calibrated pupil diameter (\(D_{cal}\)) before the task starts, normalized with respect to the maximum difference among all participants, i.e., \(C_{str} = \frac{\overline{D_{tsk}} - D_{cal}}{\max \nolimits _{p=p_1,\ldots ,p_n}(\overline{D_{tsk}} - D_{cal})}\). Prior literature [44, 52, 89] suggests that the pupil dilates with increased workload, thus increasing the difference between the average pupil diameter during a task (\(\overline{D_{tsk}}\)) and the operator’s calibrated pupil diameter (\(D_{cal}\)) prior to the start of the task.
The cognitive workload caused by interface complexity \(C_{int}\) will be computed as the ratio between the average distance (in pixels) of the operator’s gaze fixation from the center of the visual display (\(\overline{S_{tsk}}\)) and the maximum distance (in pixels) from the edge to the center of the visual display (\(S_{max}\)). Thus, the interface complexity can be calculated as \(C_{int} = \overline{S_{tsk}}/S_{max}\). To compute the cognitive workload for each sub-task (e.g., picking and placing one object), we will also estimate the cognitive workload caused by task complexity as the ratio between the time to complete a sub-task and the total task completion time (namely, \(C_{tsk} = T_{sub}/T_{total}\)). Thus, the cognitive workload for a sub-task can be computed as the average of \(C_{str}\), \(C_{int}\), and \(C_{tsk}\). We also compute the overall workload \(C_{task}\) of the entire task caused by stress and interface complexity as the average of \(C_{str}\) and \(C_{int}\), assuming they have equal contributions.
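The three workload components defined above can be computed directly from the logged eye-tracking and timing data. A minimal sketch follows, assuming the pupil-diameter normalization is performed after all participants' data are available; the function names are ours.

```python
import numpy as np

def stress_workload(pupil_task, pupil_cal, max_diff_across_participants):
    """C_str: normalized difference between the mean task pupil diameter
    and the calibrated baseline (Section 3.6)."""
    return (np.mean(pupil_task) - pupil_cal) / max_diff_across_participants

def interface_workload(gaze_xy, display_center_xy, max_center_to_edge_px):
    """C_int: mean gaze distance from the display center, normalized by the
    maximum center-to-edge distance in pixels."""
    dists = np.linalg.norm(np.asarray(gaze_xy) - np.asarray(display_center_xy), axis=1)
    return np.mean(dists) / max_center_to_edge_px

def task_workload(t_subtask, t_total):
    """C_tsk: sub-task completion time as a fraction of the total task time."""
    return t_subtask / t_total

def subtask_workload(c_str, c_int, c_tsk):
    """Per-sub-task cognitive workload as the average of the three components."""
    return (c_str + c_int + c_tsk) / 3.0
```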
Estimation of Physical Workload (Online). Surface electromyography (sEMG) signals can provide more accurate measurements of muscle effort and physical workload than subjective assessments (e.g., the Rapid Upper Limb Assessment, namely, RULA [2, 42, 65]). Our recent work has used sEMG for the objective but offline estimation of physical workload in robot teleoperation via whole-body motion mapping [58, 60]. Here, we propose to learn predictive models for the online, accurate prediction of muscle effort from human motion-tracking data. Our prior work [58] shows that the muscle efforts of the anterior deltoid, lateral deltoid, and bicep muscle groups, caused by shoulder flexion, shoulder abduction, and elbow flexion, contribute most to the physical workload when humans control telemanipulation using their arm and hand motions.
As shown in Figure 10, we thus attached six body trackers (Vive Tracker 3.0) to the upper arms, forearms, chest, and waist of the human operator to estimate the shoulder and elbow joint angles. Specifically, the shoulder flexion (\(\theta _{SF}\)) is estimated on the sagittal plane as
\[\theta _{SF} = \arccos \left(\frac{\vec{T}_{ua}\cdot \vec{g}}{\Vert \vec{T}_{ua}\Vert \,\Vert \vec{g}\Vert }\right),\]
where \(\vec{T}_{ua}\) is the upper arm vector estimated from the shoulder and elbow trackers, and \(\vec{g}\) is the gravity vector, both of which are projected on the sagittal plane (i.e., the x-y plane).
The shoulder abduction \(\theta _{SA}\) is estimated on the frontal plane as
\[\theta _{SA} = \arccos \left(\frac{\vec{T}_{vertical}\cdot \vec{T}_{ua}}{\Vert \vec{T}_{vertical}\Vert \,\Vert \vec{T}_{ua}\Vert }\right),\]
where \(\vec{T}_{vertical}\) is the vector perpendicular to the vector connecting the two shoulder trackers, and \(\vec{T}_{ua}\) is the upper arm vector formed by the shoulder and elbow trackers, both of which are projected on the frontal plane (i.e., the x-z plane).
The elbow flexion \(\theta _{EF}\) is estimated as
\[\theta _{EF} = \arccos \left(\frac{\vec{T}_{ua}\cdot \vec{T}_{la}}{\Vert \vec{T}_{ua}\Vert \,\Vert \vec{T}_{la}\Vert }\right),\]
where \(\vec{T}_{ua}\) is the upper arm vector and \(\vec{T}_{la}\) is the forearm vector estimated from the elbow tracker and hand-held controller positions; both vectors are projected on the sagittal plane (i.e., the y-z plane). Note that \(0^{\circ }\lt \theta _{SF}\lt 150^{\circ }\), \(0^{\circ }\lt \theta _{SA}\lt 120^{\circ }\), and \(0^{\circ }\lt \theta _{EF}\lt 150^{\circ }\).
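Under the projection conventions above, each joint angle reduces to the angle between two projected vectors. A sketch follows, assuming tracker positions are already expressed in a common world frame; which coordinate axis is dropped in each projection depends on the tracker frame and is an assumption here.

```python
import numpy as np

def _angle_deg(v1, v2):
    """Angle (degrees) between two vectors."""
    cosang = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0)))

def _project(v, drop_axis):
    """Project a 3D vector onto a plane by zeroing one axis (0 = x, 1 = y, 2 = z)."""
    v = np.array(v, dtype=float)
    v[drop_axis] = 0.0
    return v

def shoulder_flexion(upper_arm_vec, gravity_vec=(0, 0, -1), drop_axis=2):
    """theta_SF: angle between the upper-arm vector and gravity on the sagittal plane."""
    return _angle_deg(_project(upper_arm_vec, drop_axis), _project(gravity_vec, drop_axis))

def shoulder_abduction(upper_arm_vec, vertical_vec, drop_axis=1):
    """theta_SA: angle between the upper-arm vector and the trunk-vertical vector
    (perpendicular to the shoulder-to-shoulder vector) on the frontal plane."""
    return _angle_deg(_project(upper_arm_vec, drop_axis), _project(vertical_vec, drop_axis))

def elbow_flexion(upper_arm_vec, forearm_vec, drop_axis=0):
    """theta_EF: angle between the upper-arm and forearm vectors on the sagittal plane."""
    return _angle_deg(_project(upper_arm_vec, drop_axis), _project(forearm_vec, drop_axis))
```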
We conducted a pilot study (Pilot Study III) to validate the physical workload estimation against sEMG data. As shown in Figure 11 (Left), before telemanipulation, we asked the human operator to perform a compound arm exercise that involves the active coordination of the anterior deltoid, lateral deltoid, and bicep muscle groups. The participants held one HTC Vive controller in each hand and moved their shoulders and elbows from fully extended to fully flexed for 20 s at the speeds and angular velocities typical of robot control motions. We computed the joint angles of shoulder flexion, shoulder abduction, and elbow flexion from the body and arm motions tracked by the HTC Vive trackers, and used the corresponding sEMG data to estimate the offline muscle efforts [58]. For the offline workload estimation, we used a bandpass filter to extract the 40–700 Hz EMG signals from the wireless sEMG sensors (Delsys Trigno Avanti sensors) attached to the anterior deltoid, lateral deltoid, and bicep muscle groups. We pre-processed the data using a high-pass filter (cutoff frequency 10 Hz) to remove the soft tissue artifact and offset the frequency baseline, and then applied full-wave rectification followed by a sixth-order elliptical low-pass filter (cutoff frequency 50 Hz) to remove noise and transients and develop a linear envelope of the EMG signals, following the method in the literature [45] but choosing tunable parameters for our own task and data. The shoulder muscle efforts were computed as the weighted sum of the anterior and lateral deltoid efforts (at a ratio of 3:4 based on their capabilities of force generation [51]), while the elbow efforts were calculated from the bicep flexion. The muscle efforts were computed by normalizing the processed EMG data with respect to the person’s maximum voluntary contraction, following the standard procedure in the literature [13]. We averaged the shoulder and elbow muscle efforts for each arm and estimated the operator’s overall physical workload as the weighted sum of the muscle efforts of the dominant and non-dominant arms (at a ratio of 9:1), for tasks in which operators extensively move their dominant arms for robot motion control:
\[P = 0.9\cdot \frac{P_{DS} + P_{DE}}{2} + 0.1\cdot \frac{P_{NDS} + P_{NDE}}{2},\]
where \(P_{DS}\) and \(P_{NDS}\) are the shoulder muscle efforts of the dominant and non-dominant arms, while \(P_{DE}\) and \(P_{NDE}\) are the corresponding elbow muscle efforts. A set of injective mapping functions was learned to predict the muscle efforts from the arm joint angles with good accuracy.
Figure 10 shows that our predictive model can estimate the sEMG-based physical workload from the joint angles in isolation exercises, comparable to literature results [2, 42]. For compound exercises, Figure 11 (Right) shows an example of the prediction accuracy of our simple models for one male (32 years old) and one female (33 years old) participant with functional upper extremities and normal body mass index. The root-mean-square errors between the proposed method and the EMG data are 3.68, 4.52, 4.63, and 3.78 for the male participant and 4.97, 3.81, 4.12, and 4.37 for the female participant, for the left shoulder, left elbow, right shoulder, and right elbow, respectively.