1 Introduction
Affordable yet highly sophisticated drone devices and operating systems are increasingly available on the consumer market. Drones are now being used for aerial photography and scanning of static objects [23, 24, 45, 56] and for interactive applications with environments, objects, and humans, including disaster investigation and rescue [44], product delivery [7], remote repairs [81], haptic proxies for VR [1], outdoor navigation [37], and communication agents [3, 83]. For such scenarios, which require the drone operator's real-time judgment, interactive drone-piloting approaches have become dominant over recently advanced autopiloting technologies [68, 73]. However, drone operation is generally difficult for most pilots due to the inevitable challenges of speed-control mechanisms [9, 72, 80], environmental factors (e.g., wind, field complexity [14]), poor spatial awareness [67], and so on. Reflecting this difficulty, most countries have enacted strict regulations that rarely permit remote drone piloting, i.e., beyond visual line of sight (BVLOS) flights, by inexperienced pilots.
Our modern societies nevertheless desire BVLOS drone operations that fully utilize drones' sensing and locomotion capabilities [7]. The biggest challenge in achieving BVLOS operation is the pilot's poor spatial awareness of the remote drone and its surrounding environment. With a typical consumer drone system, a video feed from an on-drone front-facing camera (FPV: first-person view) is streamed to the pilot. Since this camera does not cover the left, right, and rear directions, the pilot cannot fully grasp the drone's spatial status relative to its surroundings. If the drone enters such a blind-spot region, the pilot is naturally concerned about the operation's safety. This issue becomes critical when pilots must position drones near surrounding objects or humans in remote places.
To address these blind-spot issues of FPVs, previous works have implemented multiple cameras (e.g., DJI Matrice 210 [12]) or a 360-degree camera attached to the aircraft [34, 59, 85]. However, the images obtained by these methods are still from a first-person perspective and do not fully support the pilot's perception of the distance between the drone and its surroundings.
3D digital map generation using SLAM [69, 70] has also been explored as a way to provide environmental knowledge that FPV cameras cannot capture. Such a world-reconstruction approach using on-drone sensors is quite promising for concisely representing the remote drone's operating area, but current technologies still have difficulty reflecting dynamic changes in unknown remote environments and objects. Specifically, running current SLAM algorithms over a wireless network suffers from significant data-transmission delays and low frame rates [77, 82], suggesting that a fully reliable, real-time 3D digital map cannot yet be reconstructed [8].
Another notable approach is using additional cameras to provide a third-person view (TPV) that shows an overview of the area around the operating drone. The idea of using a wide, elevated perspective has increased operators' spatial awareness in the teleoperated-robot domain (e.g., [32, 40, 64]) and improved 2D/3D content-navigation interfaces (e.g., [10]). This idea was recently applied to drone interface systems that allow pilots to manipulate a remote drone (i.e., the main drone) through a TPV from an additional, higher-positioned drone [67] or a preset indoor overhead camera [46]. Hereinafter, we call this unique navigation style TPV-based piloting. This is an important advance toward successful BVLOS drone operation because the TPV captures real-world information in real time and covers the blind-spot regions of FPV cameras from higher viewing angles.
Despite the great potential of TPV-based piloting, studies of it are scant, and two critical challenges remain. The first challenge is the main drone's low visibility in the TPV. Pilots often struggle to distinguish the main drone's body against a colorful or dark background or to precisely perceive its spatial status (e.g., height, position) relative to the surroundings, since only vague geometrical depth cues are available from an overview perspective (Fig. 2a) [9, 11, 36, 67, 74, 86]. The second challenge concerns TPV framing. Depending on the main drone's motion direction, a fixed bird's-eye TPV is not always suitable; it needs to be dynamically adjusted to better focus on the drone's current motion direction [70]. However, prior TPV-based piloting interfaces offer either a static or pre-adjustable TPV [46, 67], which does not fully enhance the pilot's spatial awareness while the main drone moves flexibly. Although addressing these two challenges is crucial for practical TPV-based piloting, many other technical difficulties remain, including GPS's limited distance-tracking accuracy, potential optical noise in outdoor environments, and radio-wave interference.
In this work, we propose BirdViewAR, a novel surroundings-aware remote drone-operation system that significantly expands a pilot's spatial awareness using an augmented third-person view (TPV) from an autopiloted follower drone. The follower drone autonomously flies behind the main drone at a higher altitude to offer a bird's-eye TPV that entirely captures the main drone's 3D motions: horizontal translation, ascending/descending motion (along the z-axis), and yaw rotation (around the z-axis). We expand these basic TPV capabilities by introducing two augmentations that address the two crucial TPV challenges mentioned above. First, to improve the pilot's spatial understanding of the TPV's contents, we employ AR overlays that visually highlight the main drone's spatial statuses, including its current position, heading, height, camera field-of-view (FOV), and proximity areas (Fig. 1B). Second, to improve the TPV framing for the fast-moving main drone, we employ motion-dependent automatic TPV framing, in which the follower drone's position and orientation are dynamically controlled to clearly capture the main drone's status, the visualized overlays, and the drone's near-future destination in the TPV (Fig. 1A). We introduce the two augmentations together because they work in a complementary manner; for example, the AR overlays become more effective when the TPV captures more motion-related areas, while the dynamic TPV framing would not be recognizable without clear visual feedback from our AR overlay graphics. Later, we show how their combination is superior to either augmentation alone.
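To make the framing idea concrete, the minimal Python sketch below shows one way a motion-dependent follower placement could be derived from the main drone's pose and velocity: the follower sits behind and above the main drone, offset against the motion direction, and its camera aims between the drone and a short-horizon predicted destination. The function and parameters (follower_target, back_dist, lookahead_s, etc.) are hypothetical illustrations, not BirdViewAR's actual control algorithm.

# Illustrative sketch (not the authors' algorithm): deriving a
# motion-dependent follower-drone pose from the main drone's state.
import numpy as np

def follower_target(main_pos, main_yaw, main_vel,
                    back_dist=6.0, height_offset=4.0, lookahead_s=2.0):
    """Place the follower behind and above the main drone, shifted opposite
    to the motion direction so the TPV also frames the near-future destination.

    main_pos : (3,) world position of the main drone [m]
    main_yaw : heading of the main drone [rad]
    main_vel : (3,) world velocity of the main drone [m/s]
    Returns (follower_pos, gimbal_yaw, gimbal_pitch).
    """
    main_pos = np.asarray(main_pos, dtype=float)
    main_vel = np.asarray(main_vel, dtype=float)

    heading = np.array([np.cos(main_yaw), np.sin(main_yaw), 0.0])
    speed = np.linalg.norm(main_vel[:2])
    # Direction the main drone is actually moving; fall back to its heading when hovering.
    motion_dir = main_vel / (np.linalg.norm(main_vel) + 1e-6) if speed > 0.3 else heading

    # Predicted near-future position of the main drone (constant-velocity assumption).
    lookahead = main_pos + main_vel * lookahead_s

    # Follower sits behind the motion direction and above the main drone.
    follower_pos = main_pos - motion_dir * back_dist + np.array([0.0, 0.0, height_offset])

    # Aim the follower's camera at the midpoint between the main drone and
    # its predicted destination so both stay in frame.
    aim = 0.5 * (main_pos + lookahead)
    d = aim - follower_pos
    gimbal_yaw = np.arctan2(d[1], d[0])
    gimbal_pitch = np.arctan2(d[2], np.linalg.norm(d[:2]))  # negative: looking down
    return follower_pos, gimbal_yaw, gimbal_pitch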
While our work builds upon previous TPV-based piloting studies [46, 67], we significantly expand them by introducing the above two technical augmentations. To demonstrate our concept, we focus on designing and prototyping the BirdViewAR system, for which we further propose a feasible drone-formation tracker using the follower drone's on-board camera, an accurate AR-overlay generator, and an optimization-based follower-drone control algorithm. In terms of target scenarios, among the many application opportunities for drones, we focus on both aerial recording and a drone's interactive activities with remote objects (e.g., communication, delivery, inspection), where a pilot's real-time spatial awareness of remote areas and accurate drone-positioning ability are required.
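As a complement, the sketch below illustrates the geometric core that an AR-overlay generator of this kind relies on: projecting the main drone's estimated 3D position into the follower camera's image plane with a calibrated pinhole model. It is a simplified illustration under standard camera-model assumptions; the function name and the overlay-anchoring example are hypothetical and do not reflect the system's actual implementation.

# Minimal sketch of the geometry behind image-anchored AR overlays:
# project a 3D world point into the follower camera's pixel coordinates.
import numpy as np

def project_to_image(point_world, cam_pos, cam_R, K):
    """Project a 3D world point into pixel coordinates.

    point_world : (3,) point in world frame (e.g., the main drone's position)
    cam_pos     : (3,) follower camera position in world frame
    cam_R       : (3, 3) rotation from world frame to camera frame
    K           : (3, 3) camera intrinsic matrix
    Returns (u, v) pixel coordinates, or None if the point is behind the camera.
    """
    point_world = np.asarray(point_world, dtype=float)
    cam_pos = np.asarray(cam_pos, dtype=float)
    p_cam = cam_R @ (point_world - cam_pos)   # world frame -> camera frame
    if p_cam[2] <= 0:                          # behind the image plane
        return None
    uvw = K @ p_cam
    return uvw[0] / uvw[2], uvw[1] / uvw[2]

# Example use: projecting both the drone and the ground point directly below it
# yields the two pixel endpoints of a height-indicator line (drawing code omitted).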
The following are the main contributions of this work:
• We propose BirdViewAR, a surroundings-aware remote drone-operation system that increases pilots' spatial awareness using a TPV augmented through AR overlays and motion-dependent automatic TPV framing.
• We describe all of the design considerations for the TPV visual augmentations and the follower drone's motion-dependent automatic control.
• We present technical insights for implementing BirdViewAR on consumer-level programmable drones, including our vision-based drone-sensing platform and an optimization-based control process for the follower drone.
• We describe BirdViewAR's potential, as well as its remaining challenges for beginner pilots, through a preliminary outdoor user study.