Article

Single-Handed Gesture Recognition with RGB Camera for Drone Motion Control

by Guhnoo Yun 1, Hwykuen Kwak 2 and Dong Hwan Kim 1,*
1 Korea Institute of Science and Technology, Seoul 02792, Republic of Korea
2 Hanwha Systems Co., Ltd., Seongnam 13524, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(22), 10230; https://doi.org/10.3390/app142210230
Submission received: 31 July 2024 / Revised: 18 October 2024 / Accepted: 5 November 2024 / Published: 7 November 2024
(This article belongs to the Section Aerospace Science and Engineering)

Abstract

Recent progress in hand gesture recognition has introduced several natural and intuitive approaches to drone control. However, effectively maneuvering drones in complex environments remains challenging. Drone movements are governed by four independent factors: roll, yaw, pitch, and throttle. Each factor includes three distinct behaviors—increase, decrease, and neutral—necessitating hand gesture vocabularies capable of expressing at least 81 combinations for comprehensive drone control in diverse scenarios. In this paper, we introduce a new set of hand gestures for precise drone control, leveraging an RGB camera sensor. These gestures are categorized into motion-based and posture-based types for efficient management. We then develop a lightweight hand gesture recognition algorithm capable of real-time operation even on edge devices, ensuring accurate and timely recognition. Subsequently, we integrate hand gesture recognition into a drone simulator to execute 81 commands for drone flight. Overall, the proposed hand gestures and recognition system offer natural control for complex drone maneuvers.

1. Introduction

Hand gestures are a common form of nonverbal communication in everyday life interactions. They are relatively more natural, private, and intuitive than other forms, such as facial expressions [1], gaze [2], and silent speech [3]. This makes them a valuable tool for interaction with people and machines. Hand gesture recognition (HGR) is the process of extracting information from hand gestures and converting it into a machine-readable format. This can be performed for a variety of applications, such as sign language recognition [4], human–robot interaction  [5,6], virtual environments [7], drone flight control [8], and so on. Moreover, there have also been attempts to use hand gestures to engage with unmanned systems during rescue operations, fostering effective human–machine collaboration in specific environments [9].
HGR can be divided into two main categories: wearable glove-based and camera vision-based approaches [10]. The first HGR for human–computer interaction (HCI) started with the invention of the data glove sensor [11]. Data gloves use different sensor types to capture hand motion and positions. The sensors detect signals of a physical response, such as the bending of the fingers and the movement of the hands, and then use this information to determine the coordinates of the location of the palm and fingers [12]. Since then, various sensors exploiting different physical principles have been used to measure the angle of finger bending and hand location to capture hand motion and positions, such as curvature sensors [13], angular displacement sensors [14], flex sensors [15], and accelerometer sensors [16]. While these methods yield accurate results due to the direct measurement of movements, the inconvenience of wearing them and the high costs of the sensors are problems that reduce their accessibility and usability. In contrast, vision-based HGR has gained popularity due to its ease of use and advances in open-source software. It has been applied in various domains, such as assisting visually impaired individuals [17], personal computers [18], robot control [19], clinical operations [20], and so on. Vision-based HGR utilizes different visual sensors, including RGB, time-of-flight (TOF), infrared (IR), thermal, or night vision, to track hand movement. This approach offers advantages such as cost-effectiveness and improved user comfort compared to wearable glove-based methods. Recently, efforts have been made to combine both approaches, known as multimodal hand gesture recognition (HGR), to enhance the performance [21].
In this paper, we propose a simple and efficient single-handed hand gesture recognition (HGR) system for human–machine interaction, primarily designed for human–drone interaction but adaptable to various environments, including mobile and edge devices. Previous works [8,22,23,24] have introduced their own hand gesture vocabularies and HGR models. However, these vocabularies often have a limited number of commands, restricting the drone flight dynamics, and rely on external devices or expensive vision sensors. Our research focuses on a vision-based HGR approach. The objective is to create a user-friendly hand gesture set for the control of dynamic drone movements and to develop a real-time, accurate, lightweight, and efficient vision-based HGR algorithm suitable for mobile computing environments. We begin by analyzing the four fundamental elements of drone movement, which serve as the basis for the design of the new hand gesture set. Subsequently, we construct a simplified and efficient vision-based HGR with an RGB camera sensor.
Our contributions to this research field can be summarized in two ways. Firstly, we design a new hand gesture set specifically tailored for human–drone interaction. This gesture set aims to enhance the user convenience while aligning with the movement characteristics of drones. By incorporating the four fundamental movement elements of drones, our gesture set enables the representation of complex flight tasks. Secondly, we propose a vision-based HGR that is both simple and efficient for the control of drones using hand gestures. While our system is designed for our specific application, it exhibits the flexibility to be utilized on other mobile and edge devices for diverse tasks.
The structure of this paper is as follows. Section 2 discusses related work on the interaction between hand gestures and drone flight, as well as various approaches to vision-based HGR. In Section 3, we introduce our new hand gesture set for drone maneuvers and present our vision-based HGR for the execution of commands. The experimental results are presented in Section 4, and the discussion is provided in Section 5. Finally, Section 6 summarizes our research findings and conclusions.

2. Related Works

Human–machine interaction has evolved significantly over the years, with applications ranging from entertainment to critical assistance in industrial and personal domains. Effective communication between humans and machines is essential for successful interaction. Traditionally, human–machine interaction relies on learning to operate conventional controllers, which often requires familiarity with various buttons and complex functions, posing challenges for beginners. Recent advancements have introduced natural user interfaces (NUIs) as a transformative approach, offering more intuitive methods for interaction by enabling users to interact with devices similarly to the ways in which they interact with the physical world, using voice commands, hand gestures, and body movements [25]. For instance, video game components like the Wiimote and Kinect allow users to experience gaming in an immersive and natural manner [26].
Recent years have seen a significant rise in research exploring gesture-based interactions across a range of human–machine systems. In particular, gesture control extends to broader human–robot interaction (HRI) contexts. For instance, Qi et al. [5] developed a multi-sensor guided hand gesture recognition system for teleoperated robots, highlighting the effectiveness of intuitive hand gestures in robotic control scenarios. Similarly, Gao et al. [6] used multimodal data fusion to enhance hand gesture recognition for human–robot interaction, emphasizing the value of robust and flexible gesture-based communication in diverse robotics applications. These studies illustrate the applicability of hand gestures in providing an intuitive interface for the control of various robotic systems, showcasing the versatility and utility of gesture-based interaction for effective control.
Human–drone interaction, in particular, has emerged as a promising field with applications ranging from surveillance to delivery services and entertainment [27]. Among various modalities, hand gestures stand out as being highly intuitive and natural, making them particularly suitable for drone control. Designing an optimized hand gesture set is therefore crucial to enhance the interaction between humans and drones. The drone’s flight involves complex motions represented by a combination of four independent factors: roll, yaw, pitch, and throttle. Each movement axis can express three behaviors: increase, decrease, and neutral. In order to enable the drone to navigate through complex motions within unfamiliar and cluttered environments, it is essential to consider well-aligned hand gestures of high accuracy that can represent at least 81 (i.e., $3^4$) motion combinations.
Lu et al. [8] suggested a single-hand gesture set corresponding to the direction of drone flight. They developed a set of gestures to control the direction of the drone flight; however, each gesture can only represent one motion at a time, potentially limiting the dynamics of the drone flight. Bello et al. [22] also introduced intuitive hand gestures for a drone control vocabulary; however, their gesture set encompassed a restricted range, potentially constraining its capacity to address complex flight scenarios. Konstantoudakis et al. [23] conducted a comprehensive study to examine the comfort and intuitiveness of hand gestures associated with flight concepts across a broad spectrum of scenarios. They designed two sets of gestures, categorized as palm- and finger-based, and assessed two distinct modes of gesture control across a varied user demographic, encompassing individuals of both genders, as well as first responders and members of the general population. Referring to this, we design a set of gestures, as detailed in Section 3.2.
After designing a set of hand gestures, consideration must be given to the methods of acquiring and classifying the gesture data. The sources for data acquisition in HGR algorithms can be categorized into two approaches: image-based and non-image-based methods [8]. Image-based methods employ a range of vision sensors, including depth cameras, stereo cameras, and single cameras. Conversely, non-image-based methods typically utilize wearable sensors such as gloves and bands [24], which often require sophisticated sensing devices and can be costly. Vision sensors such as depth and stereo cameras also incur costs. However, image-based methods are generally regarded as more comfortable compared to non-image-based methods because the tools are less cumbersome to wear or install. Therefore, we propose an image-based HGR method.
The most common method of utilizing visual data is to extract 2D or 3D keypoints of the hands from an image to assess the hand posture and shape, which can be used to recognize gestures [10]. These require additional computation to model the hand shape before recognizing the gestures. On the other hand, there are approaches to inferring gestures in images by detecting the hand positions and estimating their shapes without any modeling information about the human hand [28]. These methods commonly aim to find the exact location and area of the hands by considering all aspects other than the hands as the background. Although HGR utilizing keypoints employs extra modules to estimate the hand pose, many lightweight hand pose estimation methods [29,30] have been introduced. Furthermore, the extraction of hand keypoints is more robust in capturing sophisticated hand gestures from different viewpoints. In this paper, we present a vision-based HGR method designed for drone control. Initially, we define a set of hand gestures, considering both a natural user interface and drone flight patterns. Subsequently, we propose a method of classifying these gestures. Our approach is cost-effectively implemented using a simple RGB camera capable of edge device operation.

3. Methodology

In this section, we present a hand gesture recognition system based on a hand model. The proposed method consists of three steps. First, we employ a hand tracker to estimate the hand pose and define the input data format in Section 3.1. Next, we define the hand gesture classes that need to be recognized in Section 3.2. Finally, we propose two hand gesture recognition (HGR) models: motion-based and posture-based. The motion-based HGR deterministically detects commands for yaw and pitch, as discussed in Section 3.3.1 and Section 3.3.2. In contrast, the posture-based HGR, implemented as a network model and discussed in Section 3.3.3, detects commands for throttle and roll. Figure 1 provides an overview of our HGR pipeline.

3.1. Hand Tracking for Hand Pose Estimation

Accurate hand gesture recognition requires a reliable hand tracker that can detect the hand position and estimate its pose. The hand model is defined by a set of joint coordinates. Additionally, its real-time feasibility on edge devices must be considered. Since real-time hand pose estimation is a challenging problem beyond the scope of our study, we employed Google MediaPipe [29], a reliable two-stage hand tracking module, to simplify our implementation. MediaPipe has been validated in various real-time applications, making it suitable for edge device deployment. This enabled us to focus on developing an intuitive hand gesture vocabulary and a classification model. MediaPipe consists of two pretrained models: a palm detector and a hand pose estimator. The palm detector first takes an RGB image as input and locates the hand by estimating the palm bounding box. The hand pose estimator then identifies 21 keypoints from the detected hand region. We implemented it using the Python v3.10 package mediapipe, which provides a simple API for the integration of the palm detection and hand pose estimation models. For the keypoint nodes of a single hand, the node sequence S can be expressed as follows:
$$S = \{ v_i \in \mathbb{R}^2 \mid i = 0, \ldots, K-1 \},$$
where $v_i$ is the $i$-th keypoint in the Cartesian coordinates $(x, y)$ of a landmark, and the number of keypoints $K$ is set to 21. Then, the obtained keypoint coordinates for the hand are converted into relative coordinates with regard to the wrist:
$$\tilde{v}_i \leftarrow v_i - v_0,$$
where $v_0$ represents the coordinate of the wrist and $\tilde{v}_i$ is the updated relative coordinate of the $i$-th keypoint. Finally, the pose information is flattened into a one-dimensional tensor and normalized to the maximum absolute value so that it is prepared for input to the posture-based HGR model.
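As a minimal sketch of this preprocessing step, assuming the standard mediapipe Python API (the function names, confidence threshold, and capture loop below are illustrative and not taken from the paper):

```python
import cv2
import numpy as np
import mediapipe as mp

mp_hands = mp.solutions.hands

def extract_keypoints(frame_bgr, hands):
    """Run MediaPipe Hands on a BGR frame and return the 21 (x, y) keypoints
    of the first detected hand in pixel coordinates, or None if no hand is found."""
    h, w = frame_bgr.shape[:2]
    result = hands.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if not result.multi_hand_landmarks:
        return None
    landmarks = result.multi_hand_landmarks[0].landmark
    return np.array([[p.x * w, p.y * h] for p in landmarks], dtype=np.float32)  # (21, 2)

def preprocess_keypoints(keypoints):
    """Convert keypoints to wrist-relative coordinates, flatten to a 42-dimensional
    vector, and normalize by the maximum absolute value."""
    rel = keypoints - keypoints[0]              # v_i - v_0, wrist is landmark #0
    flat = rel.flatten()
    max_abs = np.max(np.abs(flat))
    return flat / max_abs if max_abs > 0 else flat

# Example usage inside a capture loop (frame is a BGR image from cv2.VideoCapture):
# with mp_hands.Hands(max_num_hands=1, min_detection_confidence=0.5) as hands:
#     kps = extract_keypoints(frame, hands)
#     if kps is not None:
#         features = preprocess_keypoints(kps)  # input to the posture-based HGR model
```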

3.2. Hand Gesture Vocabulary for Drone Control

In this section, we design a set of hand gestures for the control of drone flights, utilizing an RGB camera mounted on a user’s head to capture hand gestures from an egocentric view for first-person-view (FPV) flights. Our goal is to ensure user convenience and accessibility in drone control. Lu et al. [8] designed a single-handed gesture set that was as close as possible to the drone’s flight direction and was easy to perform and remember. In addition, they proposed a multiple-handed gesture set to reduce the rate of hand gesture misoperation and improve the adaptability of hand gesture interaction. Nevertheless, limitations exist in executing simultaneous roll, pitch, yaw, and throttle maneuvers. Konstantoudakis et al. [23] developed two gesture vocabularies, finger-based and palm-based, for virtual drone navigation, evaluating their effectiveness in a user study. However, palm-based gestures require precise hand pose estimation using a 3D sensor, while finger-based gestures lack roll movements. Building upon the insights gained from the aforementioned references, we define a set of precise hand gestures tailored for drone maneuvering.
Traditional drones with four rotors, also known as quadcopters, are maneuvered by a combination of four basic controls, namely throttle, roll, pitch, and yaw. Throttle controls the speed of the rotors, which in turn controls the altitude of the drone. Roll controls the rotation of the drone around its longitudinal axis, which causes it to move left or right. Pitch controls the rotation of the drone around its lateral axis, which causes it to move forward or backward. Yaw controls the rotation of the drone around its vertical axis, which causes it to rotate clockwise or counterclockwise. In particular, throttle and yaw are interrelated and can be affected by the pitch and roll and vice versa. Therefore, the movement of a drone can be described as a combination of the following four elements:
$$\text{Movement} = \text{Throttle} \cdot I_T + \text{Roll} \cdot I_R + \text{Pitch} \cdot I_P + \text{Yaw} \cdot I_Y,$$
where $I_T$, $I_R$, $I_P$, and $I_Y$ denote the indicator functions, which represent whether throttle, roll, pitch, and yaw are active (1) or not (0), respectively. Each element has three states: increase, decrease, and neutral. Therefore, there are at least 81 (i.e., $3^4$) movement commands.
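For concreteness, the command space implied by this decomposition can be enumerated directly; a minimal sketch (the factor and state names below are only labels):

```python
from itertools import product

# Each of the four movement factors takes one of three states.
FACTORS = ("throttle", "roll", "pitch", "yaw")
STATES = ("increase", "neutral", "decrease")

# Enumerate the full command space: 3^4 = 81 combinations.
commands = [dict(zip(FACTORS, combo)) for combo in product(STATES, repeat=len(FACTORS))]
assert len(commands) == 81
```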
To ensure intuitive and convenient drone maneuvering, it is crucial to align the hand gestures closely with the desired drone flight direction. Figure 2 illustrates a method of controlling drone movement using the hand gestures that we propose. We initially establish gestures to represent the stop and neutral states of the drone. The neutral state occurs when a drone is not moving and is not controlled by the user. In this state, the drone must remain stationary and wait for commands to move. The stop command tells the drone to stop moving and remain stationary. This command stops the drone from moving in any direction. In the study conducted by Konstantoudakis et al. [23], the participants reported experiencing a notably high level of ease and comfort when executing index-finger-related gestures. Consequently, we design the neutral state to correspond to an index-finger-only extension motion. The stop command is defined as a closed fist, which is a gesture that is commonly associated with stopping or halting an action.
Next, it is necessary to design gestures for yaw, roll, pitch, and throttle. We divide the hand gestures into two categories based on the characteristics of the four components: motion-based and posture-based gestures. For motion-based gestures, the translation of a user’s hand forward and backward along the camera’s focal axis can be intuitively aligned with the pitch direction. Consequently, moving the hand forward results in the drone pitching up and moving forward, while moving the hand backward causes the drone to pitch down and move backward. Similarly, the control of yaw can be achieved by rotating the hand to the right or left. From the neutral gesture, when the hand rotates from the baseline, we can measure the angle and direction of the rotation. This measurement can effectively correspond to the yaw rotation of the drone.
The roll and throttle correspond to posture-based gestures. The roll represents a movement to the left or right. Since it is often used to maneuver the drone around obstacles or to follow a moving target, a low-fatigue and intuitive hand gesture is required. To address this, we use the thumb, which can be easily extended along with the index finger [23]. In other words, when the thumb is extended to the left with the index finger extended, it indicates a leftward movement. Conversely, it indicates a rightward movement when the thumb is extended to the right. Similarly, the throttle is defined by the extension of the middle and pinky fingers. When the middle finger is extended along with the index finger, the drone will increase its throttle and ascend. Conversely, if the pinky finger is extended, the drone will decrease its throttle and descend.
The extension of the pinky finger to decrease the throttle is comparatively less intuitive than other gestures. This discrepancy arises because the direction of finger extension does not align well with the corresponding throttle movement. Conversely, extending the middle finger downward for throttle control is more intuitive. Nevertheless, the configuration of our RGB camera viewpoint leads to the occlusion of other fingers when employing this gesture. As a mitigation strategy, we opt to maintain the less intuitive throttle-down gesture, prioritizing the minimization of occlusion while ensuring the intuitiveness of the remaining gestures. Additionally, both of these finger gestures can be combined with the thumb gesture. Our posture-based gestures are shown in Figure 3.

3.3. Hand Gesture Recognition (HGR)

This section introduces a straightforward and efficient HGR approach. For pitch control, Section 3.3.1 describes a method to measure the hand distance from the RGB camera without employing a distance sensor. Section 3.3.2 outlines the estimation of the hand rotation angle for yaw control. Finally, a simple network is presented for the classification of the posture-based hand gestures, enabling the manipulation of the remaining movements.

3.3.1. Hand Distance Estimation Module

In Section 3.2, pitch manipulation is achieved by adjusting the hand’s position. Several depth estimation methods exist to determine the distance between an object and an RGB camera in the absence of a dedicated distance sensor. One common approach involves estimating the depth from a single image by measuring the scale change of a target object based on the pinhole camera model. Yoo et al. [31] utilized this to estimate the scale variations of objects according to the distance. Similarly, Vakunov et al. [32] computed the distance between a subject and a pinhole camera by considering the focal length and the size of the horizontal iris diameter.
For this application, sophisticated and complex approaches are unnecessary. Instead, it suffices to classify the distances into three broad categories, forward, backward, and neutral, providing a simplified representation. Consequently, we also utilize a pinhole camera model and estimate the real-world distance $d_w$ between the camera and the hand by examining the pixel length $l_p$ between two hand joint landmarks. To achieve this, we describe a distance function $F$ that represents the relationship between $l_p$ and $d_w$ as
$$F: l_p \mapsto d_w.$$
To do this, $l_p$ is expressed as
$$l_p = \| \tilde{v}_i - \tilde{v}_j \|,$$
where $\tilde{v}_i$ and $\tilde{v}_j$ represent the relative coordinates of the $i$-th and $j$-th landmarks, respectively. We use the 5th and 17th landmarks. Then, following the pinhole camera model as described in [31], the hand distance can be computed as follows:
$$d_w = F(l_p) = \frac{f}{l_p} L,$$
where $f$ denotes the focal length obtained from our camera specifications, and $L$ represents the real length between the joints corresponding to the landmarks. We approximate this constant value to be 7 cm, the average size of our four participants. Since our goal is to classify the hand’s movement direction rather than measure precise distances, we omit the correction for radial distortion as it does not impact the classification accuracy. Finally, we simply establish two thresholds, $\tau_1$ and $\tau_2$, to delineate the distance bands as follows:
$$\text{Pitch} = \begin{cases} \text{forward}, & d_w > \tau_1 \\ \text{backward}, & d_w < \tau_2 \\ \text{neutral}, & \text{otherwise}, \end{cases}$$
where $\tau_1$ and $\tau_2$ represent user-adaptable values and the condition is $0 < \tau_2 < \tau_1$.
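A compact sketch of this pitch classifier follows. The threshold values and the focal length argument are placeholders; the paper treats the thresholds as user-adaptable and reads the focal length from the camera specifications.

```python
import numpy as np

def classify_pitch(keypoints, focal_length_px, real_length_cm=7.0,
                   tau1_cm=45.0, tau2_cm=30.0):
    """Classify the pitch command from the estimated hand-camera distance.

    keypoints: (21, 2) array of hand landmarks in pixel coordinates.
    The distance is estimated with a pinhole model from the pixel length
    between landmarks #5 and #17; the thresholds here are placeholder values."""
    l_p = np.linalg.norm(keypoints[5] - keypoints[17])    # pixel length between joints
    if l_p < 1e-6:
        return "neutral"
    d_w = focal_length_px * real_length_cm / l_p          # d_w = (f / l_p) * L
    if d_w > tau1_cm:
        return "forward"
    if d_w < tau2_cm:
        return "backward"
    return "neutral"
```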

3.3.2. Hand Rotation Estimation Module

The manipulation of the yaw axis is related to altering the orientation of the drone’s perspective. This adjustment can be intuitively matched to the rotation of the hand as if operating the steering wheel of a vehicle. We utilize the finger joint information to measure the rotation angle of the hand, which is then used to determine the rotation direction of the yaw axis. First, we designate two points to establish a baseline. One is the origin set at the point of the wrist, and the other is an arbitrary point located far away in the vertical direction above the origin. Specifically, the wrist point (landmark #0) serves as the origin, while the arbitrary point is selected as the middle finger point (landmark #9). Subsequently, the angle θ of rotation of the hand is approximated by determining the angle between a straight line on the palm and the baseline, as shown in Figure 4.
This straight line is defined by connecting two points in the current frame: the joint connecting the middle finger (landmark #9) and the wrist (landmark #0). To obtain the value of $\theta$, we use np.arctan2. Specifically, the yaw factor can be determined using the following expression:
$$\text{Yaw} = \begin{cases} \text{yaw\_left}, & \theta > \theta_1 \\ \text{yaw\_right}, & \theta < \theta_2 \\ \text{neutral}, & \text{otherwise}, \end{cases}$$
where $\theta_1$ and $\theta_2$ denote user-adaptable values and the condition is $\theta_2 < 0 < \theta_1$.
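A minimal sketch of this yaw classifier (the threshold values are placeholders, and the sign convention may need to be flipped depending on how the camera is mounted and whether the image is mirrored):

```python
import numpy as np

def classify_yaw(keypoints, theta1_deg=20.0, theta2_deg=-20.0):
    """Classify the yaw command from the hand rotation angle, measured between
    the wrist-to-middle-finger line (landmarks #0 and #9) and a vertical baseline."""
    wrist, middle_mcp = keypoints[0], keypoints[9]
    dx = middle_mcp[0] - wrist[0]
    dy = middle_mcp[1] - wrist[1]
    # Image y grows downward, so an upright hand gives theta = 0; the sign of
    # theta follows the on-screen rotation of the hand relative to the baseline.
    theta = np.degrees(np.arctan2(-dx, -dy))
    if theta > theta1_deg:
        return "yaw_left"
    if theta < theta2_deg:
        return "yaw_right"
    return "neutral"
```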

3.3.3. Posture-Based Gesture Classification Module

We design a network model using a multilayer perceptron (MLP) [33] to classify posture-based gestures from keypoint sequences. The MLP is a feedforward neural network consisting of an input layer, hidden layers, and an output layer. Generally, the input layer size corresponds to the number of input features, while the output layer size equals the number of classes. The number and width of the hidden layers are crucial factors that directly impact the performance of the MLP. By adjusting the number of hidden layers and their sizes, the MLP can be scaled to handle a larger set of hand gestures, enhancing its generalization capabilities. These hyperparameters are determined based on the complexity and variability of the training data. The general MLP model with N hidden layers is expressed as follows:
$$h_1 = \sigma(\mathrm{Dropout}(W_1 p + b_1))$$
$$h_2 = \sigma(\mathrm{Dropout}(W_2 h_1 + b_2))$$
$$\vdots$$
$$h_N = \sigma(\mathrm{Dropout}(W_N h_{N-1} + b_N))$$
$$h_{\mathrm{FC}} = W_{\mathrm{FC}} h_N + b_{\mathrm{FC}}$$
$$\mathrm{output} = \mathrm{softmax}(W_{\mathrm{out}} h_{\mathrm{FC}} + b_{\mathrm{out}}),$$
where $p \in \mathbb{R}^D$ denotes the input tensor of dimension $D$; $W_k$ and $b_k$ are the weight matrix and bias of the $k$-th layer, respectively; and $h_k$ represents the features at this layer. Each layer’s transformation, involving matrix multiplication and summation, is performed by a fully connected (FC) layer that connects all inputs to all outputs. Dropout [34] is applied during training to randomly set a fraction of the activations to zero, helping to prevent overfitting. $\sigma$ is the activation function, set to ReLU [35]. After $N$ hidden layers, an FC layer aggregates and transforms the features into a final representation before classification. Finally, softmax [36] is applied to the output layer for classification.
An overview of our model is shown in Figure 5. The input layer consists of 42 nodes, representing the $(x, y)$ coordinates of the 21 keypoints. The output layer has ten nodes, corresponding to the ten posture-based gestures defined in Section 3.2. Generally, increasing the number of hidden layers and nodes can improve the model accuracy but also increases the training time and computational requirements. To achieve high classification accuracy for the ten classes while maintaining efficiency for edge device deployment, we apply hyperparameter optimization using a grid search [37] to determine the optimal number of hidden layers and the width of each layer. The implementation details are provided in Section 4.2.2.
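A minimal Keras sketch of such an architecture is given below. The helper name, the default widths, and the dropout rate are illustrative assumptions; the paper selects the final configuration via grid search (Section 4.2.2). Dropout is placed after the ReLU activation here, which is equivalent to the Dropout-before-activation form in the equations because ReLU is positively homogeneous.

```python
import tensorflow as tf

def build_posture_mlp(hidden_widths=(16, 32, 16), dropout_rate=0.2,
                      num_keypoints=21, num_classes=10):
    """MLP for posture-based gesture classification: 42 inputs (x, y of the 21
    wrist-relative keypoints), hidden Dense + Dropout blocks with ReLU, and a
    softmax output over the ten posture-based gestures."""
    layers = [tf.keras.layers.Input(shape=(num_keypoints * 2,))]
    for width in hidden_widths:
        layers.append(tf.keras.layers.Dense(width, activation="relu"))
        layers.append(tf.keras.layers.Dropout(dropout_rate))
    layers.append(tf.keras.layers.Dense(num_classes, activation="softmax"))
    model = tf.keras.Sequential(layers)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# With hidden_widths=(16, 32, 16) and 10 classes, this yields the 1930
# trainable parameters reported in Section 4.2.2.
```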

4. Experiments and Results

4.1. Motion-Based Gesture Recognition

As detailed in Section 3.3.1 and Section 3.3.2, the motion-based gestures for yaw and pitch movements are determined by interpretable rules. Consequently, we can anticipate consistent performance, given that the distance from the camera and the rotation angle are determined using landmark coordinates obtained from a standardized hand model. To evaluate the effectiveness of these gestures, we conducted experiments with four participants, each of whom performed the pitch- and yaw-related gestures illustrated in Figure 6, with 25 repetitions per gesture. Consequently, as depicted in Figure 7, the average classification accuracy is 100%. This is because relying on the less noisy landmarks in the palm of the hand makes gesture recognition more reliable.
Since the recognition of motion-based gestures relies on heuristic rules, the recognition time is primarily determined by the speed of landmark detection. The MediaPipe hand tracker operates in real time, with an average inference time of approximately 17.12 milliseconds (ms) on a Google Pixel 6 device [29]. This allows landmarks to be provided immediately, as soon as a hand is detected, enabling instantaneous classification. As a result, the overall average recognition time per gesture is minimal, making the system suitable for effective real-time application without noticeable latency.

4.2. Posture-Based Gesture Recognition

4.2.1. Dataset Acquisition

Before implementing the proposed MLP model-based HGR, it is essential to collect data to learn the posture-based hand gestures that need to be recognized. By default, in our system, an RGB camera is mounted on the forehead, so all scenes capturing hand gestures are considered from an egocentric perspective. As discussed in Section 3.2, the hand gestures associated with controlling the pitch and yaw can align directly with the drone’s flight direction. These gestures can be estimated using only the coordinates of the hand model, eliminating the need for training. Hence, the data acquisition focused solely on ten posture-based gestures, specifically those indicating the roll, throttle, stop, and neutral states. Four subjects participated, with each subject performing each gesture an average of 25 times, resulting in a total of 1000 scenes. Of these, 75% were used for training and the remaining 25% for testing. To capture natural hand motions, the subjects were instructed to transition from the neutral state (with only the index finger extended) to other gestures during the experiment. To minimize storage usage, the gesture data were collected and stored as the 2D coordinates of 21 keypoints extracted from the MediaPipe hand tracker.
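For reference, a minimal sketch of how such samples could be logged during collection (the file layout and helper name are assumptions, not the authors' tooling):

```python
import csv
import numpy as np

def log_sample(csv_path, label, keypoints):
    """Append one gesture sample to a CSV file: an integer class label (0-9)
    followed by the flattened (21, 2) keypoint array from the MediaPipe tracker."""
    with open(csv_path, "a", newline="") as f:
        csv.writer(f).writerow([label] + np.asarray(keypoints).flatten().tolist())
```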

4.2.2. Model Implementation and Results

As mentioned in Section 3.3.3, we utilize a grid search [37] to explore the optimal designs considering the model size and performance. Specifically, the search space for the width, which determines the size of $W$ in each hidden layer, is set to four values by linearly scaling from 16: 16, 32, 48, and 64. $N$ is verified by increasing it by one from 1 until the minimum depth that yields the optimal performance is found. The dropout rate for training is explored from 0 to 0.5 in increments of 0.1. We split the acquired dataset into training and validation sets at a ratio of 3:1. The batch size is configured to 64, and the models are trained for 1000 epochs using the Adam optimizer [38] with a learning rate of $1 \times 10^{-3}$. As a result, we design the MLP with three hidden layers containing 16, 32, and 16 nodes, respectively, resulting in a total of 1930 trainable parameters. These configurations are chosen to balance the model accuracy and size. Additionally, to enable deployment in resource-constrained environments such as edge devices, we convert the model weights to TensorFlow Lite [39].
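A minimal sketch of the training and TensorFlow Lite export path, reusing the build_posture_mlp helper sketched after Section 3.3.3 (the dataset files, dropout value, and output file name are placeholders):

```python
import numpy as np
import tensorflow as tf

# Placeholder data: X_* are (N, 42) float arrays from the preprocessing in
# Section 3.1, y_* are integer labels in [0, 10) for the ten posture gestures.
X_train, y_train = np.load("train_x.npy"), np.load("train_y.npy")
X_val, y_val = np.load("val_x.npy"), np.load("val_y.npy")

# Selected configuration from the grid search: hidden layers of 16, 32, 16 units.
model = build_posture_mlp(hidden_widths=(16, 32, 16), dropout_rate=0.2)
model.fit(X_train, y_train,
          batch_size=64, epochs=1000,
          validation_data=(X_val, y_val), verbose=2)

# Convert the trained model to TensorFlow Lite for edge deployment.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
with open("posture_hgr.tflite", "wb") as f:
    f.write(converter.convert())
```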
Figure 8 shows the confusion matrix of the posture-based hand gesture recognition. As shown in the results, the classification accuracy for the 10 types of hand gestures exceeds 98% on the test set, with the exception of the ‘down’ and ‘down+left’ gestures, whose accuracies are 95.4% and 96.9%, respectively. We observed that extending the pinky finger in slightly occluded views resulted in noisy estimated joint values, leading to inaccuracies in the data acquisition and inference results. This issue can be mitigated by employing a robust hand pose estimator. Overall, the gestures were classified with high accuracy, making them well suited for the control of the drone.
To evaluate the real-time feasibility of the proposed model, we measured the average recognition time per gesture using the trained model on a set of 100 samples. The inference was performed on an Intel i5-1135G7 processor, and the average time per gesture was found to be 74 milliseconds (ms), demonstrating that the model is capable of recognizing hand gestures at a rate suitable for real-time applications.

4.3. Drone Flight Performed in a Simulated Environment

To assess the functionality of the proposed hand gesture vocabularies in controlling drone flight, we employ a Tello UAV simulator developed with Unity [40]. Specifically, we align the classified gesture signals with the control signals according to the remote control message format used in the simulator, corresponding to the four elements: roll, pitch, throttle, and yaw. For example, when a hand gesture is classified as the stop sign, zeros are assigned to all factors. If a user extends both the index finger and thumb, and the gesture is classified as ‘move left’, only the throttle factor is assigned a zero value because the yaw and pitch factors are influenced by the hand distance and rotation. Additionally, we set the speed to a fixed value for simplified implementation. Figure 9 shows a sample scene from the drone simulator integrated with our hand gesture controller.
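As an illustration of this mapping, the sketch below combines the three recognizer outputs into a single control tuple. The gesture class names follow the labels used in this paper, while the fixed speed value, the sign conventions, and the exact remote-control message layout are assumptions that would need to match the simulator's format.

```python
FIXED_SPEED = 50  # placeholder fixed speed value

def gestures_to_control(posture_cmd, pitch_cmd, yaw_cmd):
    """Combine the recognizer outputs into a (roll, pitch, throttle, yaw) tuple.

    posture_cmd covers roll/throttle combinations plus 'stop' and 'neutral'
    (e.g. 'left', 'up+right', 'down+left'); pitch_cmd and yaw_cmd come from
    the motion-based modules."""
    if posture_cmd == "stop":
        return (0, 0, 0, 0)                      # no motion command is sent
    roll = FIXED_SPEED * ((1 if "right" in posture_cmd else 0)
                          - (1 if "left" in posture_cmd else 0))
    throttle = FIXED_SPEED * ((1 if "up" in posture_cmd else 0)
                              - (1 if "down" in posture_cmd else 0))
    pitch = FIXED_SPEED * {"forward": 1, "neutral": 0, "backward": -1}[pitch_cmd]
    yaw = FIXED_SPEED * {"yaw_left": -1, "neutral": 0, "yaw_right": 1}[yaw_cmd]
    return (roll, pitch, throttle, yaw)
```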
Before commencing the hand gesture control test for the drone, we conducted a preliminary session with four participants to familiarize them with drone manipulation using keyboards across 81 distinct flight patterns. Subsequently, they were instructed to replicate the same drone control using hand gestures. Feedback from the participants indicated that controlling individual factors such as the roll, pitch, and yaw with hand gestures felt intuitive, with a difficulty level similar to that of keyboard control. Additionally, the participants perceived gesture control as more intuitive when simultaneously manipulating the roll and yaw. However, the participants noted initial challenges with throttle-related controls. Specifically, they reported confusion when attempting to move the drone downwards by folding the middle finger and extending the pinky finger and vice versa. This challenge arose from the less intuitive nature of the proposed throttle gestures, making them difficult for the users to execute accurately. While defining throttle reduction as extending the middle finger downward presents a more physically aligned gesture, it hinders the performance of other gestures and obscures other fingers in the head-mounted 2D RGB camera view, thereby complicating recognition. This challenge will be addressed in the future by finding an optimal camera installation location to reduce occlusion issues and exploring alternative gestures for throttle control.

5. Discussion

In this study, we introduce a new hand gesture command set and develop a vision-based HGR algorithm for the control of drones using single-handed gestures in FPV flights. Our approach aims to address the limitations of previous methods [8,22,23,24], which either utilized a limited number of commands, restricting the drone flight dynamics, or required external devices and costly vision sensors. One of the key strengths of our method is its comprehensive command vocabulary, derived from categorizing drone movements into four elements, throttle, roll, pitch, and yaw, with each having three distinct states—increase, decrease, and neutral. This results in 81 possible commands, which we map to intuitive hand gestures based on user-friendly designs informed by a comprehensive user study [23]. Our lightweight HGR algorithm, leveraging an RGB camera, demonstrates accurate performance in classifying these gestures, making it suitable for deployment on edge devices. However, we observed a performance degradation in controlling the throttle for downward motions. This issue stemmed from the less intuitive design of the downward throttle gesture and noisy pose estimation caused by occlusions in the 2D RGB camera view. These findings highlight the need for further refinement in the design of certain gestures to enhance the intuitiveness and accuracy. Additionally, addressing hardware constraints, such as improving pose estimation under occluded conditions, is crucial for the practical deployment of our system. Future work will focus on exploring alternative gesture designs and more robust pose estimation techniques to mitigate these issues. We will also conduct practical manipulation tests in real-world scenarios to demonstrate the efficacy and robustness of our proposed HGR algorithm. These efforts will contribute to more dynamic and user-friendly drone control systems, advancing the field of human–drone interaction.

6. Conclusions

In this work, we aimed to develop an intuitive hand gesture system for the control of drone flights, accurately recognized using a 2D RGB camera. Motivated by the need for natural and efficient drone manipulation, we introduced a novel set of hand gesture vocabularies encompassing four key movement factors: roll, yaw, pitch, and throttle. Each factor had three values (increase, decrease, neutral), enabling complex flight motions with at least 81 combinations. To implement these, we categorized gestures into motion-based and posture-based types. Motion-based gestures relate to hand pose alignment with distance and rotation, suitable for pitch and yaw movements, utilizing the pinhole camera model and hand joint lines for measurement. Posture-based gestures, with ten designs, control the throttle, roll, stop, and neutral states. A lightweight MLP model classifies these gestures effectively. Our HGR algorithm demonstrated high accuracy in gesture classification. We tested the system’s usability by integrating the HGR model with a drone flight simulator, comparing gesture control to keyboard control. The participants found gesture control similarly effective overall, although throttle control initially posed challenges due to occlusion issues with the 2D RGB camera. In future work, we will address these challenges by exploring solutions involving head-mounted RGB sensors to improve the throttle gesture recognition and overall control efficiency.

Author Contributions

Conceptualization, G.Y. and D.H.K.; methodology, G.Y.; software, G.Y.; validation, G.Y.; formal analysis, G.Y. and D.H.K.; investigation, G.Y. and D.H.K.; resources, D.H.K.; data curation, G.Y.; writing—review and editing, G.Y. and D.H.K.; visualization, G.Y.; supervision, D.H.K.; project administration, D.H.K. and H.K.; funding acquisition, D.H.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Korean Research Institute for Defense Technology Planning and Advancement (KRIT)—grant funded by the Defense Acquisition Program Administration (DAPA) (KRIT-CT-21-027).

Institutional Review Board Statement

Ethical review and approval were waived for this study due to the use of hand gesture data collection, which does not require approval from an ethical committee.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest

Author Hwykuen Kwak was employed by the company Hanwha Systems Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Kim, C.; Kim, C.; Kim, H.; Kwak, H.; Lee, W.; Im, C.H. Facial electromyogram-based facial gesture recognition for hands-free control of an AR/VR environment: Optimal gesture set selection and validation of feasibility as an assistive technology. Biomed. Eng. Lett. 2023, 13, 465–473.
2. Chen, X.L.; Hou, W.J. Gaze-Based Interaction Intention Recognition in Virtual Reality. Electronics 2022, 11, 1647.
3. Kwon, J.; Nam, H.; Chae, Y.; Lee, S.; Kim, I.Y.; Im, C.H. Novel three-axis accelerometer-based silent speech interface using deep neural network. Eng. Appl. Artif. Intell. 2023, 120, 105909.
4. Rinalduzzi, M.; De Angelis, A.; Santoni, F.; Buchicchio, E.; Moschitta, A.; Carbone, P.; Bellitti, P.; Serpelloni, M. Gesture recognition of sign language alphabet using a magnetic positioning system. Appl. Sci. 2021, 11, 5594.
5. Qi, W.; Ovur, S.E.; Li, Z.; Marzullo, A.; Song, R. Multi-sensor guided hand gesture recognition for a teleoperated robot using a recurrent neural network. IEEE Robot. Autom. Lett. 2021, 6, 6039–6045.
6. Gao, Q.; Liu, J.; Ju, Z. Hand gesture recognition using multimodal data fusion and multiscale parallel convolutional neural network for human–robot interaction. Expert Syst. 2021, 38, e12490.
7. Ilyina, I.A.; Eltikova, E.A.; Uvarova, K.A.; Chelysheva, S.D. Metaverse-death to offline communication or empowerment of interaction? In Proceedings of the 2022 Communication Strategies in Digital Society Seminar (ComSDS), Saint Petersburg, Russia, 13 April 2022; pp. 117–119.
8. Lu, C.; Zhang, H.; Pei, Y.; Xie, L.; Yan, Y.; Yin, E.; Jin, J. Online Hand Gesture Detection and Recognition for UAV Motion Planning. Machines 2023, 11, 210.
9. Liu, C.; Szirányi, T. Real-time human detection and gesture recognition for on-board UAV rescue. Sensors 2021, 21, 2180.
10. Oudah, M.; Al-Naji, A.; Chahl, J. Hand gesture recognition based on computer vision: A review of techniques. J. Imaging 2020, 6, 73.
11. Premaratne, P. Historical development of hand gesture recognition. In Human Computer Interaction Using Hand Gestures; Springer: Singapore, 2014; pp. 5–29.
12. Ahuja, M.K.; Singh, A. Static vision based Hand Gesture recognition using principal component analysis. In Proceedings of the 2015 IEEE 3rd International Conference on MOOCs, Innovation and Technology in Education (MITE), Amritsar, India, 1–2 October 2015; pp. 402–406.
13. Kramer, R.K.; Majidi, C.; Sahai, R.; Wood, R.J. Soft curvature sensors for joint angle proprioception. In Proceedings of the 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, San Francisco, CA, USA, 25–30 September 2011; pp. 1919–1926.
14. Jesperson, E.; Neuman, M.R. A thin film strain gauge angular displacement sensor for measuring finger joint angles. In Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society, New Orleans, LA, USA, 4–7 November 1988; p. 807.
15. Shrote, S.; Deshpande, M.; Deshmukh, P.; Mathapati, S. Assistive Translator for Deaf & Dumb People. Int. J. Electron. Commun. Comput. Eng. 2014, 5, 86–89.
16. Gupta, H.P.; Chudgar, H.S.; Mukherjee, S.; Dutta, T.; Sharma, K. A continuous hand gestures recognition technique for human-machine interaction using accelerometer and gyroscope sensors. IEEE Sens. J. 2016, 16, 6425–6432.
17. Alashhab, S.; Gallego, A.J.; Lozano, M.Á. Efficient gesture recognition for the assistance of visually impaired people using multi-head neural networks. Eng. Appl. Artif. Intell. 2022, 114, 105188.
18. Rajesh, R.J.; Nagarjunan, D.; Arunachalam, R.; Aarthi, R. Distance transform based hand gestures recognition for PowerPoint presentation navigation. Adv. Comput. 2012, 3, 41.
19. Van den Bergh, M.; Carton, D.; De Nijs, R.; Mitsou, N.; Landsiedel, C.; Kuehnlenz, K.; Wollherr, D.; Van Gool, L.; Buss, M. Real-time 3D hand gesture interaction with a robot for understanding directions from humans. In Proceedings of the 2011 Ro-Man, Atlanta, GA, USA, 31 July–3 August 2011; pp. 357–362.
20. Wachs, J.P.; Kölsch, M.; Stern, H.; Edan, Y. Vision-based hand-gesture applications. Commun. ACM 2011, 54, 60–71.
21. Zhang, A.; Li, Q.; Li, Z.; Li, J. Multimodal Fusion Convolutional Neural Network Based on sEMG and Accelerometer Signals for Inter-Subject Upper Limb Movement Classification. IEEE Sens. J. 2023, 23, 12334–12345.
22. Bello, H.; Suh, S.; Geißler, D.; Ray, L.S.S.; Zhou, B.; Lukowicz, P. CaptAinGlove: Capacitive and inertial fusion-based glove for real-time on edge hand gesture recognition for drone control. In Proceedings of the Adjunct Proceedings of the 2023 ACM International Joint Conference on Pervasive and Ubiquitous Computing & the 2023 ACM International Symposium on Wearable Computing, Cancun, Mexico, 8–12 October 2023; pp. 165–169.
23. Konstantoudakis, K.; Albanis, G.; Christakis, E.; Zioulis, N.; Dimou, A.; Zarpalas, D.; Daras, P. Single-Handed Gesture UAV Control for First Responders—A Usability and Performance User Study. In Proceedings of the 17th International Conference on Information Systems for Crisis Response and Management (ISCRAM 2020), Blacksburg, VA, USA, 24–27 May 2020; pp. 24–27.
24. Khaksar, S.; Checker, L.; Borazjan, B.; Murray, I. Design and Evaluation of an Alternative Control for a Quad-Rotor Drone Using Hand-Gesture Recognition. Sensors 2023, 23, 5462.
25. Helen, S.; Jenny, P.; Yvonne, R. Interaction Design: Beyond Human-Computer Interaction; John Wiley & Sons: Hoboken, NJ, USA, 2019.
26. Glonek, G.; Pietruszka, M. Natural user interfaces (NUI). J. Appl. Comput. Sci. 2012, 20, 27–45.
27. Herdel, V.; Yamin, L.J.; Cauchard, J.R. Above and beyond: A scoping review of domains and applications for human-drone interaction. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, New Orleans, LA, USA, 29 April–5 May 2022; pp. 1–22.
28. Al Farid, F.; Hashim, N.; Abdullah, J.; Bhuiyan, M.R.; Shahida Mohd Isa, W.N.; Uddin, J.; Haque, M.A.; Husen, M.N. A structured and methodological review on vision-based hand gesture recognition system. J. Imaging 2022, 8, 153.
29. Zhang, F.; Bazarevsky, V.; Vakunov, A.; Tkachenka, A.; Sung, G.; Chang, C.L.; Grundmann, M. MediaPipe Hands: On-Device Real-Time Hand Tracking. 2020. Available online: https://arxiv.org/abs/2006.10214 (accessed on 15 June 2023).
30. Leap Motion Developer. 2020. Available online: https://leap2.ultraleap.com/ (accessed on 31 March 2024).
31. Yoo, J.H.; Kim, D.H.; Park, S.K. Categorical object recognition method robust to scale changes using depth data from an RGB-D sensor. In Proceedings of the 2015 IEEE International Conference on Consumer Electronics (ICCE), Las Vegas, NV, USA, 9–12 January 2015; pp. 98–99.
32. MediaPipe Iris: Real-Time Iris Tracking & Depth Estimation. 2020. Available online: https://ai.googleblog.com/2020/08/mediapipe-iris-real-time-iris-tracking.html (accessed on 15 June 2023).
33. Taud, H.; Mas, J. Multilayer perceptron (MLP). In Geomatic Approaches for Modeling Land Change Scenarios; Springer: Cham, Switzerland, 2018; pp. 451–455.
34. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958.
35. Agarap, A.F. Deep Learning Using Rectified Linear Units (ReLU). 2018. Available online: https://arxiv.org/abs/1803.08375 (accessed on 15 June 2023).
36. Bridle, J. Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters. In Proceedings of the Advances in Neural Information Processing Systems, Denver, CO, USA, 27–30 November 1989; pp. 211–217.
37. Yang, L.; Shami, A. On hyperparameter optimization of machine learning algorithms: Theory and practice. Neurocomputing 2020, 415, 295–316.
38. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
39. David, R.; Duke, J.; Jain, A.; Janapa Reddi, V.; Jeffries, N.; Li, J.; Kreeger, N.; Nappier, I.; Natraj, M.; Wang, T.; et al. TensorFlow Lite Micro: Embedded machine learning for TinyML systems. Proc. Mach. Learn. Syst. 2021, 3, 800–811.
40. Tello UAV Simulator. 2022. Available online: https://github.com/PYBrulin/UAV-Tello-Simulator (accessed on 31 March 2024).
Figure 1. Overview of the HGR pipeline. Hand keypoints are estimated and input into two HGR models, which detect drone motion commands from hand gestures.
Figure 2. An illustration of the manipulation of drone movement with a combination of hand gestures. When a user presents the stop gesture, no motion command is transmitted to the drone. Meanwhile, when presenting the neutral gesture, the drone movement is controlled by a combination of hand gestures.
Figure 3. Hand gesture examples for manipulation of roll and throttle. From the neutral position, the commands for both throttle and roll movements consist of the thumb, middle, and pinky fingers. The stop sign is excluded from this illustration.
Figure 4. An illustration demonstrating the manipulation of the yaw axis of the drone. When the hand rotates counterclockwise, the positive rotation angle causes the drone to turn left along the yaw axis. Conversely, the drone turns right if the rotation angle is negative.
Figure 5. Overview of the MLP architecture for the classification of the 21-keypoint sequence. The network comprises three dense blocks, one FC layer, and a softmax activation function. The input to the network consists of preprocessed, normalized relative keypoints; see Section 3.1.
Figure 6. Examples of motion-based hand gesture recognition. (a–c) Pitch-related gestures: forward, neutral, and backward. (d–f) Yaw-related gestures: yaw_left, neutral, and yaw_right.
Figure 7. The confusion matrices of motion-based hand gesture recognition. (a) Pitch-related gestures. (b) Yaw-related gestures.
Figure 8. The confusion matrix of the posture-based hand gesture recognition model.
Figure 9. An illustrative scenario from the drone simulator demonstrates the application of the proposed hand gesture vocabulary in controlling a drone flight. In this example, the drone executes commands for ‘pitch forward’, ‘roll left’, and ‘yaw left’ simultaneously.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
