
Article

LEVIOSA: Natural Language-Based Uncrewed Aerial Vehicle Trajectory Generation

by Godwyll Aikins 1,†, Mawaba Pascal Dao 2,†, Koboyo Josias Moukpe 2,†, Thomas C. Eskridge 2 and Kim-Doang Nguyen 1,*

1 Department of Mechanical and Civil Engineering, Florida Institute of Technology, Melbourne, FL 32901, USA
2 Department of Electrical Engineering and Computer Science, Florida Institute of Technology, Melbourne, FL 32901, USA
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Electronics 2024, 13(22), 4508; https://doi.org/10.3390/electronics13224508
Submission received: 16 October 2024 / Revised: 9 November 2024 / Accepted: 15 November 2024 / Published: 17 November 2024
(This article belongs to the Collection Predictive and Learning Control in Engineering Applications)

Abstract: This paper presents LEVIOSA, a novel framework for text- and speech-based uncrewed aerial vehicle (UAV) trajectory generation. By leveraging multimodal large language models (LLMs) to interpret natural language commands, the system converts text and audio inputs into executable flight paths for UAV swarms. The approach aims to simplify the complex task of multi-UAV trajectory generation, which has significant applications in fields such as search and rescue, agriculture, infrastructure inspection, and entertainment. The framework involves two key innovations: a multi-critic consensus mechanism to evaluate trajectory quality and a hierarchical prompt structuring for improved task execution. The innovations ensure fidelity to user goals. The framework integrates several multimodal LLMs for high-level planning, converting natural language inputs into 3D waypoints that guide UAV movements, and per-UAV low-level controllers that guide each UAV in executing its assigned 3D waypoint path based on the high-level plan. The methodology was tested on various trajectory types with promising accuracy, synchronization, and collision avoidance results. The findings pave the way for more intuitive human–robot interactions and advanced multi-UAV coordination.

1. Introduction

The integration of large language models (LLMs) into robotics has opened up new avenues for high-level planning, including automating search and rescue flight planning [1], enhancing agricultural monitoring [2] and infrastructure inspection [3], and creating dynamic drone show performances [4]. Recognizing the challenges inherent in planning robot and drone trajectories, this paper presents a novel framework that simplifies trajectory generation by leveraging LLMs to generate waypoints for UAV or drone swarms based on user commands in text and audio inputs. Unlike existing approaches that primarily rely on single-mode decision models or predefined plans, our approach introduces a multi-critic consensus mechanism to evaluate the generated waypoints by using multiple multimodal LLMs, in which a plot of the previously generated waypoints is provided as an image input for the multi-critic consensus. By integrating the multi-critic system, LEVIOSA refines its trajectory planning based on diverse feedback, making it more adaptable and precise than conventional methods. The code for this project is available at: https://github.com/sesem738/Leviosa.
The challenges include error accumulation in multi-step reasoning [5] and difficulties grounding language in physical environments [6]. Our approach addresses the issues by allowing users to translate complex high-level natural language commands into executable UAV trajectories, thereby reducing the technical barriers to trajectory generation and enabling more intuitive human–robot interactions.
Coordinating multi-UAV and multi-robot systems presents significant communication and task allocation challenges, driving research toward advanced planning frameworks [7,8]. LLMs have emerged as a promising solution, offering benefits for multi-robot coordination [7,9,10] through their ability to process complex commands [11,12,13], incorporate feedback mechanisms for environmental adaptation [7], and generate structured plans from high-level instructions. However, challenges persist in ensuring trajectory safety in dynamic environments [14] and maintaining accurate user intent interpretation, limitations stemming from LLMs’ auto-regressive nature and complex reasoning constraints [15].
Our research addresses the issues by developing a framework that leverages LLMs for high-level planning and coordination of UAV trajectories and uses reinforcement learning for low-level control and flight error corrections. We tackle the issues through two key contributions that advance the understanding and application of LLMs in drone trajectory coordination:
  • Multi-critic consensus mechanism: a system that utilizes multiple critics to evaluate generated trajectories with a majority voting scheme to ensure high-quality outputs.
  • Hierarchical prompt structuring: a method that organizes and summarizes outputs from multiple critics into a coherent context, improving the LLMs’ ability to understand and execute complex tasks.
Complementing the high-level planning, LEVIOSA incorporates per-UAV reinforcement learning for low-level control. The approach involves training individual RL policies for each UAV, allowing for precise execution of the high-level trajectories. The per-UAV RL control offers several advantages, including adaptability to individual UAV dynamics, scalability to large swarms, and ability to handle local perturbations and obstacles.
Multi-drone path generation is an excellent testbed for our framework as multimodal processing is required to combine natural language understanding with visual perception, spatial and geometric reasoning to determine appropriate trajectories, code synthesis for waypoint generation, and multi-agent interactions for coordination. The framework integrates several multimodal LLMs for high-level planning by converting natural language inputs into 3D waypoints that guide UAV movements. The waypoints are then executed by the per-UAV low-level RL controllers, which navigate each UAV along an assigned path based on the high-level plan. The hierarchical approach combines the strengths of LLMs in interpreting complex commands with the adaptive capabilities of RL in real-time flight control. Our goal is to advance the state of the art in human–robot interaction to enhance significantly the accessibility and capability of autonomous robotic systems, opening new avenues for their application in solving complex real-world problems across various domains.
This paper is structured as follows: Section 2 reviews related work on LLMs for robotics and reinforcement learning for UAV control; Section 3 presents LEVIOSA’s methodology, including high-level planning and low-level control; Section 4 details our experimental setup, results, and discussion; Section 5 presents ablation studies that examine the impact of critic agents, computational timing, and drone capacity scaling; Section 6 highlights our key findings; and Section 7 concludes with future directions.

2. Related Work

2.1. Large Language Models in Robotics

The use of large language models (LLMs) in UAV trajectory generation and multi-robot systems has been explored in various innovative frameworks [10,16,17]. A notable example is Swarm-GPT [16], which integrates LLMs with safe swarm motion planning to automate synchronized drone performances. The system processes both music audio and user text prompts to choreograph drone trajectories, and the LLM-generated waypoints are refined through a safety filter based on AMSwarm [18], an advanced distributed motion planning framework. The safety filter ensures collision-free and feasible motion, incorporating prior knowledge of the drone system. Swarm-GPT achieved a mean sim-to-real root-mean-square error (RMSE) of 28.7 mm, demonstrating high practical applicability at live events [16]. Swarm-GPT effectively generates synchronized trajectories for entertainment applications but is limited by a focus on predefined performances and relies on a static model-based safety filter for collision avoidance. In contrast, LEVIOSA introduces a multi-critic consensus mechanism to enable a more flexible, real-time evaluation of generated trajectories. LEVIOSA assesses the trajectories from the perspective of efficiency and user intent to ensure applicability to diverse tasks beyond synchronized drone performances such as search and rescue or infrastructure inspection.
LLMs have also shown promise in multi-robot collaboration, which is relevant for managing UAV swarms or heterogeneous robot systems that must communicate and coordinate autonomously. In this context, RoCo [10] leverages pretrained LLMs to facilitate dialogue between robots and enable them to plan tasks collaboratively and generate waypoints through discussion. The approach is particularly useful in environments requiring dynamic task allocation as robots are allowed to reason about their roles and coordinate their actions. Human-in-the-loop interaction is also supported by users who can guide and adjust the system’s performance. Evaluated on RoCoBench, RoCo demonstrated high success rates and adaptability to varied tasks [10]. RoCo excels in environments where robots must coordinate complex tasks, like manipulation or concurrent task execution, but primarily focuses on grounded, structured settings where robots communicate and negotiate task allocation through natural language. In contrast, LEVIOSA is a hierarchical framework focused on real-time, dynamic UAV trajectory generation suited for unpredictable environments. The multi-critic consensus mechanism guides high-level planning, while the reinforcement learning-driven low-level controller enables UAVs to adapt their paths continuously based on environmental feedback. The hierarchical design offers greater adaptability and autonomy than RoCo’s pre-negotiated task plans, making LEVIOSA more suitable for fast-paced, adaptive multi-UAV coordination.
In a related effort, LLMs have been used for logical inference in robotic systems to translate high-level language commands into executable motion sequences [17]. Liu et al.’s system combines YOLOv5 [19] for environmental perception with dynamic movement primitives for real-time action correction. However, the approach is constrained by task-specific frameworks and does not scale well in dynamic, unstructured environments. LEVIOSA overcomes these limitations by incorporating reinforcement learning for low-level UAV control, allowing the system to refine and adjust flight paths in real time based on feedback from the environment. Combining LLM-based high-level planning with adaptive low-level control provides a more robust solution for complex, multi-UAV coordination in unpredictable environments.
The concept of grounding LLMs in robotic affordances, allowing robots to interpret and execute high-level instructions, is central to SayCan [20]. SayCan integrates pretrained language models with visual inputs to empower robots to plan and perform actions, demonstrating strong generalization across novel tasks and environments. SayCan leverages vision to enable robots to perceive and understand the physical surroundings and ground high-level instructions in real-world contexts. Without this, the language model alone lacks the necessary environmental awareness to ensure safe and effective execution. Furthermore, VOILA [21] combines both vision language models (VLMs) and LLMs in a modular architecture. The integration allows robots to execute complex tasks informed by both visual observations and natural language to make the system adaptable to various platforms and application domains. While SayCan and VOILA primarily focus on vision-language grounding for task execution, LEVIOSA incorporates vision during the multi-critic reflection phase, in which multiple critics evaluate images of generated waypoints to suggest adjustments, ensuring safety and precision. The use of multiple critics provides more diverse and accurate feedback. By combining the outputs of multiple critics, LEVIOSA reduces the risk of over-correcting to any one source of feedback or perspective [22]. The approach enhances resilience to errors and promotes better decision-making [22,23].

2.2. Reinforcement Learning in Aerial Vehicle Control

While LLMs play a crucial role in enhancing high-level planning and multi-agent collaboration, RL has emerged as a powerful tool for low-level control in aerial vehicles. RL’s ability to handle the complexity and dynamism of UAV environments makes it a powerful component for trajectory generation and flight stability [24,25,26,27,28,29]. For instance, Geles et al. demonstrated agile flight from raw visual inputs without the need for state estimation, showcasing RL’s potential for agile UAV control in unstructured environments [30]. Moreover, recent work in [31] developed an RL-based policy that enabled drones to land accurately on moving targets, even in the presence of sensor noise and failures. These advances highlight the versatility of RL in managing UAV operations under challenging real-world conditions.
Recent work has demonstrated the effectiveness of MARL for UAV swarm coordination. A multi-critic policy gradient optimization method has been proposed to achieve optimal coordination of multiple UAVs while maintaining constraints like collision avoidance [32]. The approach uses multiple value-estimating networks and a novel advantage function to optimize a stochastic actor policy network. Another study [33] introduces a comprehensive testbed that systematically evaluates MARL algorithms under various environmental disturbances and agent failures, providing a standardized framework for benchmarking robustness. The paper’s findings underscore the importance of developing more resilient MARL techniques for UAV control.
Combining LLMs for high-level decision-making and RL for low-level control forms a comprehensive approach to UAV trajectory generation. While LLMs excel in converting natural language inputs into high-level flight plans, RL is instrumental in refining the plans to ensure safe, agile, and efficient execution. RL policies adapt to real-time environmental changes, compensate for uncertainties, and optimize flight performance based on immediate feedback. The per-UAV RL approach in LEVIOSA offers significant advantages in scalability, flexibility, and integration with high-level planning. As the number of UAVs increases, the complexity of per-UAV RL remains constant, unlike MARL approaches in which complexity grows exponentially. The scalability allows for easier deployment of large-scale UAV operations. The approach also seamlessly accommodates heterogeneous swarms, in which each UAV is trained with a policy tailored to specific characteristics and constraints. The adaptability enables more diverse and specialized swarm compositions. Furthermore, using individual RL controllers for each UAV aligns well with LEVIOSA’s hierarchical structure. The high-level planner, powered by LLMs, generates waypoints for each UAV independently, which are then executed by the corresponding low-level RL controller. The separation of concerns simplifies the integration between high-level planning and low-level control and allows for a more flexible and efficient system that easily adapts to various mission requirements and UAV configurations. However, challenges remain in integrating the systems, particularly in environments with dynamic obstacles, where both LLMs and RL operate in tandem to achieve safe and robust UAV operations [34].

3. Methodology

3.1. Problem Formulation

LEVIOSA is a framework that integrates several LLMs to enable operators to specify trajectories for one or more UAVs in natural language and to ensure that the trajectories produced are correct and error-free.
In this work, we developed a control architecture for a swarm of UAVs that leverages a set of multimodal LLMs as a high-level planner (see Figure 1). Multimodal LLMs process multiple types of input, including text, audio, and images, capabilities that are crucial to our method. The LEVIOSA system was designed to enable an operator to use speech or audio commands to control a swarm of UAVs such as drones, allowing them to execute coordinated trajectories.
The LEVIOSA system comprises two main components:
  • High-level planner: The component employs four types of multimodal LLM agents to process the operator’s natural language input and create the final coordinated waypoint paths to be executed by the UAVs. The agents collaborate in a multi-agent flow to convert the speech or audio command into a Python (v3.10) script that generates 3D waypoints. Each type of LLM is assigned a unique role that contributes to the system’s overall effectiveness. Some roles require only text input and output, as in the case of the aggregator LLM agent, while other roles handle multiple modalities, such as audio in the case of the instructor LLM agent or images in the case of the critic LLM agents. We discuss the details of the high-level planner system, its LLM organization, and the roles of each LLM in the following sections.
  • Per-UAV low-level controllers: Implemented as policies trained with RL to provide low-level motor control, the controllers are identical for each drone or UAV. Each low-level controller takes the 3D waypoints provided by the high-level planner system and produces the low-level motor commands that drive its UAV along the intended trajectory.
A vital aspect of the high-level planner system is that it runs several iteration cycles of generations and reflections (as illustrated in Figure 1) to progressively improve the generations until they are correct or reach the maximum number of iterations $M$ allowed. LLMs used for trajectory generation often struggle to interpret accurately and maintain user intent due to their auto-regressive nature, accumulating errors over generated tokens [15]. In our framework, we set $M = 10$, meaning that, unless a generation is immediately correct, the final generation is selected by default after ten cycles of generations and reflections. By generations, we mean the 3D waypoint paths obtained via Python code synthesis from the generator LLM agent (i.e., the generator’s output, Figure 2b). By reflections, we mean the combination of error messages from the Python interpreter and the aggregated feedback from visual inspection of the plotted 3D waypoint paths by the critic LLM agent(s) (i.e., the aggregator’s output, Figure 2d). Theoretically, more iteration cycles could lead to the generation of better trajectories. The current limit of 10 iterations was chosen from experience with the algorithm, trading off computation time against correct output (as seen later in Section 5); optimizing the number of iterations was not implemented but could be the subject of future work.
After iteration is complete, the finalized waypoints are input for the individual low-level controller policies. Unlike the high-level planner, the low-level controllers do not use LLMs but employ Multi-Layer Perceptron (MLP) networks trained with RL. Each UAV in the swarm uses its own low-level controller to execute the planned trajectories.
The core challenge lies in designing an LLM-based system that can translate natural language into spatially coherent 3D waypoints, requiring both semantic understanding and geometric reasoning. Reliability is crucial here, as the outputs directly control physical UAV movements. While the LLM must handle natural language ambiguities to ensure safe execution, using RL policies as low-level controllers helps mitigate this burden through emergent behaviors like obstacle avoidance. The detailed algorithm is presented in Algorithm A1.
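To make the division of labor concrete, the sketch below outlines how the two components compose in Python; every name is an illustrative placeholder rather than the released implementation, and Sections 3.2 and 3.3 describe the real components.

# Illustrative sketch of LEVIOSA's two-level architecture; all names
# here are placeholders, not the released implementation.
from typing import List, Tuple

Waypoint = Tuple[float, float, float]  # (x, y, z)

def high_level_planner(command: str) -> List[List[Waypoint]]:
    """Stand-in for the planner of Section 3.2: one waypoint path per UAV."""
    # The real planner runs instructor -> generator -> critics -> aggregator
    # for up to M = 10 generation-reflection cycles before returning paths.
    return [[(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)],
            [(0.0, 1.0, 1.0), (1.0, 1.0, 1.0)]]

def low_level_policy(state: List[float], path: List[Waypoint]) -> Tuple[float, ...]:
    """Stand-in for the shared RL policy of Section 3.3: rotor RPM commands."""
    return (0.0, 0.0, 0.0, 0.0)

# Each UAV executes its own path with the same trained policy.
for path in high_level_planner("Fly two drones in parallel lines"):
    rpms = low_level_policy(state=[0.0] * 15, path=path)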

3.2. High-Level Planning

This section discusses in detail the high-level planner system, denoted as $\Phi$, used to generate 3D waypoint paths for a fleet of drones. The planner’s primary function is to translate natural language commands (in speech or text) into precise 3D waypoint paths, which are then executed by RL low-level controller policies $\pi_\theta$. The planner $\Phi$ is parameterized by a set of multiple multimodal LLMs, which take as input a text prompt $p$ or a voice command $v$ and output a waypoint path for each of the $N$ drones. Each drone’s waypoint path is defined by an array of $D$ triplets of three-dimensional coordinates $(\delta x_{n,d}, \delta y_{n,d}, \delta z_{n,d})$. The low-level policy $\pi_\theta$ takes the current state $s_t^n$ of each drone and conditions on the waypoints generated by $\Phi$ to output the necessary control actions $a_t^n$.
$\left\{ \{ (\delta x_{n,d}, \delta y_{n,d}, \delta z_{n,d}) \}_{d=1}^{D} \right\}_{n=1}^{N} = \Phi(p, v)$
$a_t^n = \pi_\theta\!\left( s_t^n, \, \{ (\delta x_{n,d}, \delta y_{n,d}, \delta z_{n,d}) \}_{d=1}^{D} \right)$
The planner system leverages a set of LLM agents to process both the linguistic and geometric aspects of input commands to ensure that the framework generates waypoints that are accurate and feasible for real-time execution. The set is organized around four types of LLM agents:
  • The instructor multimodal LLM agent converts the input audio from the user into high-level requirements that capture the semantics of the audio. Optionally, the agent also converts text commands from the user into requirements.
  • The generator LLM agent takes the high-level requirements provided by the instructor agent and synthesizes a Python program that generates the 3D waypoints for all the drones when executed.
  • The critic multimodal LLM agents or critics, characteristic of our multi-critic consensus mechanism, take as input the visual plot of the generated waypoints and the requirements to provide feedback on the quality of the waypoints.
  • The aggregator LLM agent, illustrative of our hierarchical prompt structuring, aggregates current and previous aggregated feedback from the critics to provide new, comprehensive feedback and progress direction to the generator.
Each agent type addresses critical aspects of the high-level planning process as shown in Figure 1, which is elaborated below.

3.2.1. Instructor Agent

To support natural language commands in text or audio, the system leverages the natural language understanding of multimodal LLMs. The multimodal instructor LLM agent, denoted $\phi_i$, handles both the natural language text prompt $p$ and audio commands $v$ from the user or operator. The instructor serves as the entry point to the high-level planner system. The agent is tasked with interpreting high-level natural language instructions and converting them to requirements $p_{req}$ that capture their semantics (Figure 2, top-left quadrant (a)); the agent must understand both the intended shape for drone paths and the number of drones involved in creating that shape. For instance, a user might provide a prompt to create a “star-shaped trajectory using five drones”, where each drone is responsible for tracing one arm of the star. The LLM processes the instructions, understanding the geometric requirements and the spatial coordination necessary among the drones, and then translates them into actionable requirements that lead to the generation of 3D waypoints. Mathematically, we have the following:
$p_{req} = \phi_i(p, v)$
The LLM’s ability to handle complex prompts relies heavily on its integrated understanding of 3D geometry and natural language. The model must accurately infer the spatial relationships and dynamics involved in forming the described shapes to ensure that the drones’ paths are synchronized and that the resulting formation matches the intended design.
We note that this agent type only runs once per user command. In contrast, the agent types in the following sections run in multiple iterative cycles until the generated waypoints are correct or the maximum number M of cycles allowed is reached (see Algorithm A1 for the step-by-step algorithm).
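As a concrete illustration, the instructor step might be wrapped as follows; this is a minimal sketch assuming a generic multimodal chat client llm, and the wrapper name and prompt text are ours rather than the released implementation.

# Hypothetical sketch of the instructor agent phi_i; `llm` stands in for
# any multimodal chat-completion client, not the paper's actual API.
INSTRUCTOR_SYSTEM = (
    "You are the instructor agent. Convert the operator's text or audio "
    "command into explicit trajectory requirements: the intended shape, "
    "the number of drones, and the portion of the shape each drone traces."
)

def instructor(llm, text_prompt: str | None = None,
               audio_bytes: bytes | None = None) -> str:
    """phi_i: runs once per user command and returns text requirements p_req."""
    content = [INSTRUCTOR_SYSTEM]
    if text_prompt is not None:
        content.append(f"Operator text: {text_prompt}")
    # Audio is forwarded through the client's multimodal input channel.
    return llm(content, audio=audio_bytes)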

3.2.2. Generator Agent

To generate arbitrarily long waypoint paths for an arbitrary number of UAVs or drones, our framework leverages code synthesis via the auto-regressive generation capabilities of LLMs. The generator agent, denoted $\phi_g$, generates a set of 3D waypoints via Python code synthesis based on the requirements $p_{req}$ produced by the instructor agent and the aggregated feedback $f$ from the aggregator agent (illustrated in Figure 2, top-right quadrant (b)). To generate the 3D waypoints, the agent synthesizes a Python program that defines the generation process of the waypoints for the drones. The Python program, when executed, yields the following waypoints:
$\left\{ \{ (\delta x_{n,d}, \delta y_{n,d}, \delta z_{n,d}) \}_{d=1}^{D} \right\}_{n=1}^{N} = \phi_g(p_{req}, f)$
Similarly to the instructor agent, the generator agent must rely on an implicit understanding of 3D geometry and natural language. For the generator, inferring the spatial relationships and dynamics involved in forming the described shapes is essential to synthesize the correct corresponding Python program auto-regressively.
The generator agent, together with the critics and the aggregator agent, forms the generation–reflection cycle, in which generations from the generator are evaluated by the critics and condensed by the aggregator. The aggregated feedback and any execution error from the Python interpreter are then sent back to the generator to produce an improved generation. In practice, the loop steers the generator toward working Python programs and usable outputs that would otherwise remain erroneous.
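To make the code-synthesis step concrete, the listing below shows the kind of program the generator might emit for the circle prompt used later in Section 4.1; the radius, altitude, and waypoint density are illustrative, since the prompts deliberately leave dimensions unspecified.

# Example of the kind of program the generator agent might synthesize for
# "Create a circular path using 2 drones, where each drone traces out one
# half of the circle" (Section 4.1). All numeric values are illustrative.
import numpy as np

def generate_waypoints(num_drones: int = 2, points_per_drone: int = 20,
                       radius: float = 2.0, altitude: float = 1.5):
    """Return waypoints[n][d] = (x, y, z); each drone covers an equal arc
    of the circle so the swarm traces the complete shape."""
    waypoints = []
    arc = 2 * np.pi / num_drones
    for n in range(num_drones):
        thetas = np.linspace(n * arc, (n + 1) * arc, points_per_drone)
        waypoints.append([(radius * np.cos(t), radius * np.sin(t), altitude)
                          for t in thetas])
    return waypoints

paths = generate_waypoints()  # two half-circles, one per drone
assert len(paths) == 2 and len(paths[0]) == 20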

3.2.3. Critic Agents: Multi-Critic Consensus Mechanism

A significant challenge identified during the development of the LLM-based planner was the presence of hallucinations, a common issue in LLMs, where the model generates plausible but incorrect or nonsensical outputs. To mitigate the problem and enhance the system’s reliability, we adopted a multi-critic consensus mechanism that leverages redundancy by deploying a group of critic agents to analyze and validate the waypoints collaboratively. The critic agents or critics, denoted $\phi_{c_j}$ for $j \in \{1, \ldots, C\}$ with a total of $C$ critics, are multimodal LLMs tasked with visually analyzing the image plot $i$ rendered from the waypoints produced by the generator agent against the requirements $p_{req}$ of the instructor agent. Each critic then provides feedback $f_j$ on the waypoints and a score for how well the waypoints visually adhere to the requirements. The critics assess whether the generated waypoints meet the specified requirements, focusing on aspects such as trajectory continuity, geometric accuracy, obstacle avoidance, and synchronization among the drones. The feedback from each critic agent is given by
$f_j = \phi_{c_j}(p_{req}, i), \quad j \in \{1, \ldots, C\}$
We note that the image plot $i$ is internal to the high-level planner system and used only for the critics’ evaluation. The critics are instructed to provide a score out of 100 and helpful suggestions for improvement.
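A hedged sketch of one critic call and the consensus check follows; the JSON feedback schema and the llm client are our illustration, not the paper’s exact format.

# Sketch of the multi-critic consensus step; the feedback schema is an
# assumption, and `llm` stands in for a multimodal client that accepts
# text plus an image.
import json

CRITIC_SYSTEM = (
    "You are a critic. Given trajectory requirements and a 3D plot of the "
    "generated waypoints, check continuity, geometric accuracy, collision "
    "risk, and synchronization, then return JSON "
    '{"valid": bool, "score": int, "feedback": str} with score out of 100.'
)

def run_critics(llm, requirements: str, plot_png: bytes, num_critics: int = 3):
    """Collect feedback f_1..f_C and decide validity by majority vote."""
    reviews = []
    for _ in range(num_critics):
        raw = llm([CRITIC_SYSTEM, requirements], image=plot_png)
        reviews.append(json.loads(raw))
    votes = sum(r["valid"] for r in reviews)
    return reviews, votes > num_critics // 2  # odd C guarantees no ties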

3.2.4. Aggregator Agent: Hierarchical Prompt Structuring

Transformer-based LLMs used in this framework have a limited context window, and processing many feedback inputs from multiple critics is contextually expensive and confusing for the generator agent. Additionally, critics can hallucinate or offer conflicting opinions, leading to confusing feedback. A final issue is conveying whether the current generation is better or worse than the previous one; a direction of progress helps the generator steer its output. To alleviate these issues, we leveraged the text summarization capabilities of LLMs to condense multiple pieces of feedback into a single, cohesive, aggregated response. The aggregator agent, denoted $\phi_a$, condenses the feedback $f_1, \ldots, f_C$ from the critics and the previous aggregated feedback $f_{prev}$ to produce the current aggregated feedback $f$, which is then provided to the generator agent. The process is mathematically described as follows:
$f = \phi_a(f_1, \ldots, f_C, f_{prev})$
The aggregator agent is instructed to reconcile any conflicts among the critics and, when the generations are not yet acceptable, to indicate whether they are improving in the right direction.
The generator, critics, and aggregator agents operate in the generation–reflection loop, where the generator agent refines the output based on the aggregated feedback until the majority of critics evaluate the generation to be correct or the maximum number of attempts M is reached. At that point, the generation stops, and the last provided waypoints are retrieved.
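Putting the four agent types together, the generation–reflection loop of Algorithm A1 can be sketched as follows, continuing the sketches above; generator, aggregator, run_python, and render_plot are hypothetical helpers, not the paper’s API.

# Minimal sketch of the generation-reflection loop (cf. Algorithm A1).
# instructor/run_critics are the sketches above; generator, aggregator,
# run_python, and render_plot are hypothetical helpers.
M = 10  # maximum generation-reflection cycles, as set in Section 3.1

def plan_trajectories(llm, text=None, audio=None):
    p_req = instructor(llm, text, audio)           # runs once per command
    feedback, waypoints = "", None
    for _ in range(M):
        code = generator(llm, p_req, feedback)     # synthesize a Python script
        waypoints, error = run_python(code)        # execute it, capturing errors
        if error:
            feedback = error                       # interpreter errors feed back
            continue
        reviews, ok = run_critics(llm, p_req, render_plot(waypoints))
        if ok:                                     # majority of critics approve
            return waypoints
        feedback = aggregator(llm, reviews, feedback)  # condensed reflection
    return waypoints                               # last generation by default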
This multi-agent architecture introduces redundancy and cross-verification, significantly enhancing the robustness and reliability of the waypoint generation process. By distributing the process among specialized agents and incorporating rigorous verification and feedback aggregation mechanisms, the system achieves higher confidence in the generated waypoints, a fact that is reflected in improved score ratings from the critics. However, one drawback of the setup is the increased computational expense, as the involvement of multiple agents extends the compute time required to achieve the desired level of accuracy. Nonetheless, the trade-off is justified by substantial gains in the precision and reliability of the drone control system. The trade-off is further discussed in the ablation studies of Section 5.

3.3. Per-UAV Low-Level Controllers

In our framework, the per-UAV low-level controller policies, illustrated in Figure 3, translate the high-level waypoints generated with the LLM into precise control commands for each UAV. To achieve this, we formulate the multi-drone navigation problem within an RL framework, where the drones autonomously learn to execute trajectories while minimizing errors and collisions. In RL, the primary objective is to derive optimal policies $\pi$ that maximize the expected cumulative reward within a Markov decision process (MDP). The MDP is formally defined as a tuple $(S, A, T, r, \gamma)$ characterized by state space $S$, action space $A$, transition function $T$, a reward function $r(s, a)$, and discount factor $\gamma \in (0, 1)$. At each discrete time step $t$, the low-level policy observes a drone’s current state $s_t \in S$. Based on its observation, an action $a_t \in A$ is sampled from the stochastic policy $\pi(a_t | s_t)$, which defines the probability distribution over actions given the current state. Upon execution, the drone transitions to a new state $s_{t+1}$ according to the state transition probability $T(s_{t+1} | s_t, a_t)$. Simultaneously, the agent receives an immediate reward $r_t = r(s_t, a_t)$. The ultimate aim is to find a policy $\pi^*$ that maximizes the expected sum of discounted rewards, formally expressed as
$\pi_\theta^* = \operatorname*{argmax}_{\pi} \; \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{\infty} \gamma^t \, r(t) \right]$
Central to our control architecture is the use of proximal policy optimization (PPO), a robust and widely adopted RL algorithm [35]. PPO’s simplicity and effectiveness in policy optimization make the algorithm well suited for multi-drone applications, where stability and sample efficiency are crucial. To improve performance, we adopt a shared policy architecture in which all drones utilize the same learned policy to ensure consistent behavior across the fleet and simplify coordination while reducing the training complexity.
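As one possible realization, the shared policy can be trained with an off-the-shelf PPO implementation. The paper does not name its implementation, so Stable-Baselines3 and the toy WaypointFollowEnv below are our assumptions; the environment is a stand-in for the PyBullet simulation of Section 4.1.1.

# Hedged sketch: training one shared low-level policy with PPO.
# Stable-Baselines3 is our choice of library; WaypointFollowEnv is a toy
# stand-in for the PyBullet quadrotor environment.
import gymnasium as gym
import numpy as np
from stable_baselines3 import PPO

class WaypointFollowEnv(gym.Env):
    """Toy stand-in: 15-D observation (Section 3.3) and 4 rotor commands."""
    def __init__(self):
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, (15,), np.float32)
        self.action_space = gym.spaces.Box(-1.0, 1.0, (4,), np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        return np.zeros(15, dtype=np.float32), {}

    def step(self, action):
        obs = np.zeros(15, dtype=np.float32)  # real env: integrate dynamics
        reward = 0.0                          # real env: reward of Section 3.3
        return obs, reward, False, False, {}

model = PPO("MlpPolicy", WaypointFollowEnv(), n_steps=2048, batch_size=64)
model.learn(total_timesteps=10_000)           # budget is illustrative
model.save("shared_lowlevel_policy")          # one policy deployed on every UAV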
At the core of the approach lies the definition of observation and action spaces, which provide drones with the necessary information to make real-time decisions and adjust movements accordingly. The spaces must not only capture the drones’ current states but also allow for temporal reasoning based on past actions.
We detail the design of the observation and action spaces and explain the reward structure that guides the learning process to ensure optimal trajectory execution.
  • Observations and actions: Each drone is fed its own sequence of observations, comprising a $1 \times 15$ vector concatenated with a buffer of previous actions. The vector consists of the drone’s current position $(x, y, z)$, orientation $(\phi, \theta, \psi)$, linear velocity $(v_x, v_y, v_z)$, angular velocity $(\omega_x, \omega_y, \omega_z)$, and the difference between the current position and the target waypoint $(\delta x, \delta y, \delta z)$. Additionally, a buffer of previous actions $a_{t-k:t-1}$ is appended to the observations; in our case, these are the actions from the previous 0.3 s, a window chosen after some trial and error. The temporal information allows the agent to account for both the current state $s_t$ and the sequence of recent actions, and the state representation allows the RL policy to understand the drone’s kinematic state and its relation to the waypoints. The observation space of each drone excludes the positions of the other drones, as the LLM generates collision-free paths given a constant velocity for all drones. The action space $A$ consists of continuous low-level control inputs for the drone’s rotors, specifically the revolutions per minute (RPM) for each rotor $(\text{RPM}_1, \text{RPM}_2, \text{RPM}_3, \text{RPM}_4)$, which allows for fine-grained control over the drone’s movement and orientation.
  • Reward: The reward function of our task is designed to encourage the drones to navigate through assigned waypoints efficiently while avoiding collisions. The reward balances immediate positional goals with broader flight characteristics (a Python transcription of the terms follows this list). The primary component, $R_d$, uses an exponential decay based on the squared Euclidean distance to the current waypoint, providing a continuous gradient that intensifies as the drone approaches its target:
    $R_d = e^{-T_d \cdot d^2}, \quad \text{where } d^2 = (\delta x)^2 + (\delta y)^2 + (\delta z)^2$
    The distance reward is complemented by $R_{\text{success}}$, a binary reward triggered when the drone is within a tight threshold of a waypoint, offering a substantial bonus for precision:
    $R_{\text{success}} = \begin{cases} 15, & \text{if } d^2 < 0.1 \\ 0, & \text{otherwise} \end{cases}$
    The reward function also incorporates a velocity-based reward, $R_{\text{linvel}}$, to encourage the drone to maintain moderate speeds, with a peak reward at 0.5 units per time step, while $R_{\text{angvel}}$ and the corresponding penalty $P_{\text{angvel}}$ work in tandem to promote smooth rotational movements and discourage erratic changes in orientation:
    $R_{\text{linvel}} = e^{-T_v \cdot (|v| - 0.5)^2} - 1$
    $R_{\text{angvel}} = e^{-T_\omega \cdot |\omega|}$
    $P_{\text{angvel}} = -e^{T_p \cdot (|\omega| - 0.5)}$
    To address long-term objectives, $R_{\text{traj}}$ provides a significant reward upon completing the entire set of waypoints, motivating the drone to navigate efficiently through the full course. The careful tuning of the temperature parameters $(T_d, T_v, T_\omega, T_p)$ allows for fine adjustment of each component’s influence and enables the reward function to be adapted to various mission profiles and drone capabilities. The reward function ensures that the drone reaches the waypoints and exhibits a flight pattern that is efficient, stable, and suitable for real-world applications:
    $R_{\text{traj}} = \begin{cases} 50, & \text{if all waypoints completed} \\ 0, & \text{otherwise} \end{cases}$
    $R_{\text{total}} = R_{\text{success}} + R_d + R_{\text{linvel}} + R_{\text{angvel}} + P_{\text{angvel}} + R_{\text{traj}}$
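Translated directly into code, the composite reward might read as follows; this is a sketch following our reading of the terms above, with placeholder temperature values and an assumed sign convention for the angular penalty.

# Python transcription of the reward terms above (our reading of the
# reconstructed equations; temperatures and P_angvel's sign are assumptions).
import numpy as np

def total_reward(delta, lin_vel, ang_vel, all_waypoints_done,
                 T_d=1.0, T_v=1.0, T_w=1.0, T_p=1.0):
    d2 = float(np.sum(np.square(delta)))                # squared distance to waypoint
    r_d = np.exp(-T_d * d2)                             # continuous distance gradient
    r_success = 15.0 if d2 < 0.1 else 0.0               # tight-threshold bonus
    speed = float(np.linalg.norm(lin_vel))
    r_linvel = np.exp(-T_v * (speed - 0.5) ** 2) - 1.0  # peak near 0.5 units/step
    w = float(np.linalg.norm(ang_vel))
    r_angvel = np.exp(-T_w * w)                         # reward smooth rotation
    p_angvel = -np.exp(T_p * (w - 0.5))                 # penalize erratic rotation
    r_traj = 50.0 if all_waypoints_done else 0.0        # course-completion bonus
    return r_success + r_d + r_linvel + r_angvel + p_angvel + r_traj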
We would like to note that the high-level planner is designed to generate collision-free waypoints by ensuring that the trajectories assigned to each UAV do not intersect when executed at constant velocities and that any potential collision courses are flagged and fixed during the reflection phase. This setup significantly reduces the likelihood of collisions during flight. However, the per-UAV low-level controllers operate independently and are not explicitly aware of the positions or intentions of other UAVs in the swarm. While the high-level planner aims to prevent collisions through careful waypoint generation, there are no guarantees of collision avoidance at the low-level control layer. This limitation arises because the low-level controllers do not share state information or coordinate their actions based on the movements of neighboring UAVs.
Future work would explore the integration of Multi-Agent Reinforcement Learning (MARL) techniques, where each UAV’s low-level controller is trained to be aware of other UAVs in the environment. By incorporating inter-agent communication and coordination, the controllers can actively learn to avoid collisions, even in dynamic and unpredictable scenarios, providing stronger guarantees for collision-free operation during complex multi-UAV missions.

4. Experiments

4.1. Setup

This section outlines the experimental methodology employed to evaluate the performance of our framework. The experiments were designed to assess the framework’s ability to generate accurate and efficient drone paths based on user-defined prompts with variations in the geometry of the paths, the number of drones involved, and the underlying LLM utilized. Conducted in a controlled simulation environment, the experiments ensured reproducibility and consistency.
The primary objective was to evaluate our framework’s performance in generating both simple continuous paths produced by a single function and composite discontinuous paths produced by multiple functions. Three LLMs (Gemini 1.5 Pro, Gemini 1.5 Flash, and GPT-4o) were used to interpret user prompts and synthesize Python scripts for drone waypoint generation. The number of drones involved in each path varied with the complexity of the shape: more straightforward paths like circles involved fewer drones, while more complex shapes like octagons required more.
For each path type, we hand-designed specific prompts to guide the LLMs in generating the desired waypoints. The prompts were crafted to ensure clarity and precision in the instructions provided to the LLMs. For example, a continuous path prompt for a circle instructed the LLM as follows: “Create a circular path using 2 drones, where each drone traces out one half of the circle. The drones should move in perfect synchronization to form a complete circle”. To introduce variability in path dimensions and avoid overfitting the low-level controller, we intentionally left the circle’s radius unspecified to allow the model to generate paths of different sizes. The variability enhanced the diversity of the training dataset, thereby improving the generalization capabilities of the low-level controller by preventing the memorization of paths of a fixed scale. Similarly, a composite path prompt for a star instructed the LLM as follows: “Generate a star-shaped path using 5 drones. The drones should move in such a way that their combined flight paths trace out a symmetrical star with equal arm lengths”. Again, we did not specify any dimensions. For the complete description of paths and prompts, see Table A1 in Appendix B.
Ten trials were conducted for each path type and prompt to assess the performance of the LLM-generated paths, and the maximum number of generation–reflection cycles was set at M = 10 . The primary metrics used for evaluation included accuracy, measured as the fidelity of the generated path to the intended sequence of waypoints as described in the prompt; synchronization, evaluated based on the drones’ ability to maintain coordinated movement, particularly in simple paths; and collision avoidance, assessed by observing the drones’ ability to avoid intersecting paths, especially in complex, multi-drone scenarios.
The experimental procedure began by initializing the simulation environment and loading the LLM, which was then provided with a specific waypoint prompt; the generated path was subsequently executed in the simulation. Data on path accuracy were collected for each trial to determine the performance of each LLM in generating the specified waypoints and corresponding paths. The experiments were conducted using Gemini 1.5 Pro, Gemini 1.5 Flash, and GPT-4o to compare performance in path generation, focusing on model responsiveness, waypoint precision, and adaptability to varying numbers of drones and path complexities. By systematically evaluating these aspects, the experiments aimed to provide a comprehensive understanding of the capabilities and limitations of LLMs in drone path generation, ensuring that the findings were robust, reproducible, and applicable to real-world scenarios involving autonomous aerial systems.

4.1.1. Simulation Setup

We developed our experimental framework using the PyBullet physics simulator (v3.2.6) [36], which provided a comprehensive environment for evaluating both individual UAV controllers and the complete LEVIOSA system. The simulation environment was explicitly designed to model quadrotor dynamics based on the Crazyflie 2.X platform, chosen to facilitate future transition to real-world hardware implementation.
To ensure high-fidelity simulation, we implemented a multi-rate system architecture where core motor dynamics were computed at 200 Hz, enabling the precise modeling of quadrotor behavior and accurate representation of rapid control adjustments. The observation system operated at up to 100 Hz, balancing the trade-off between control fidelity and processing overhead. This dual-rate approach reflects real-world constraints where sensor readings and control updates often occur at different frequencies. The simulator accurately replicated the Crazyflie platform’s sensor suite, including position sensors, inertial measurement units (IMU), and motor feedback systems, ensuring control policies developed in simulation could transfer effectively to real hardware.
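The dual-rate loop can be expressed directly with the PyBullet API, as in the sketch below; the URDF is a placeholder for the Crazyflie model, and motor force application is omitted.

# Hedged sketch of the dual-rate simulation loop; physics steps at 200 Hz,
# observations sample at 100 Hz. The URDF is a placeholder body, not the
# actual quadrotor model used in the paper.
import pybullet as p
import pybullet_data

p.connect(p.DIRECT)                           # headless physics server
p.setAdditionalSearchPath(pybullet_data.getDataPath())
p.setGravity(0, 0, -9.81)
p.setTimeStep(1.0 / 200.0)                    # 200 Hz core motor dynamics
body = p.loadURDF("sphere2.urdf", [0, 0, 1])  # stand-in for the quadrotor model

for step in range(1000):
    # apply rotor forces here at the full 200 Hz control rate
    p.stepSimulation()
    if step % 2 == 0:                         # 100 Hz observation channel
        pos, orn = p.getBasePositionAndOrientation(body)
p.disconnect()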
The simulation environment supported arbitrary spatial dimensions, which could be specified to the high-level planner via prompting. This flexibility enabled the generation of waypoint spaces of varying sizes to accommodate different experimental scenarios. The spatial limitations were primarily determined by the physical capabilities of the simulated UAVs, as different manufacturers offer varying specifications for flight range, altitude limits, and maneuverability. By specifying appropriate spatial dimensions in the generated code, we ensured that waypoints remained within the operational envelope of the UAVs being used. The low-level controller was designed to handle any collision-free coordinates generated by the high-level planner.
This carefully designed simulation environment, with its high-fidelity physics modeling and realistic sensor simulation, formed the foundation for our low-level policy training, providing a reliable and reproducible testbed for developing robust control strategies that could effectively transfer to real-world applications.

4.1.2. Low-Level Policy Training

Building upon our high-fidelity simulation environment, we developed a comprehensive training approach for the low-level control policy. We adopted a curriculum learning strategy [37], progressively increasing task complexity to build robust and reliable control capabilities. This training was implemented within the PyBullet environment, leveraging its accurate physics modeling to ensure realistic behavior.
The curriculum consisted of three distinct stages, each building upon the skills developed in previous stages (a configuration sketch follows this list):
  • Basic control: The initial stage focused on fundamental hover capability, requiring the policy to maintain stable position control at fixed points in space. This established the foundation for all subsequent flight behaviors.
  • Structured navigation: Once hovering was mastered, the policy progressed to following predefined circular trajectories, introducing continuous motion and coordinated control across multiple axes.
  • Advanced trajectory tracking: The final stage involved tracking arbitrary trajectories, requiring the policy to generalize its learned skills to diverse and complex flight paths.
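One simple way to organize the curriculum, continuing the PPO sketch of Section 3.3, is as staged configuration; stage names, task labels, and timestep budgets here are illustrative, and make_env is a hypothetical factory for task-specific environments.

# Illustrative curriculum configuration; all labels and budgets are
# assumptions, and make_env is a hypothetical environment factory.
CURRICULUM = [
    {"stage": "hover",     "task": "hold_position",    "timesteps": 500_000},
    {"stage": "circle",    "task": "circular_track",   "timesteps": 750_000},
    {"stage": "arbitrary", "task": "random_waypoints", "timesteps": 1_000_000},
]

def train_with_curriculum(model, make_env):
    """Keep training the same PPO policy across stages so later stages
    build on skills learned earlier."""
    for cfg in CURRICULUM:
        model.set_env(make_env(cfg["task"]))
        model.learn(total_timesteps=cfg["timesteps"],
                    reset_num_timesteps=False)  # preserve learning state
    return model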
We implemented the training using the PPO algorithm, chosen for its stability and sample efficiency in continuous control tasks. A key innovation in our approach was the use of a shared policy architecture, where a single trained policy was deployed across all drones in the swarm. This design choice significantly reduced training complexity while ensuring consistent behavior across the fleet.
The curriculum learning approach proved pivotal in enhancing the final performance of the policy compared to direct training on complex tasks. The incremental building of capabilities resulted in more robust and stable control behaviors, with the policy demonstrating improved generalization to various flight scenarios [38]. Our method ensured that the agent could handle a wide variety of situations with enhanced stability and efficiency, ultimately leading to a more effective quadrotor control system suitable for real-world applications.

4.2. Results

The results of our experiments are summarized in Table 1. The success rates for generating each trajectory type are reported using the three different LLMs: Gemini, Gemini Flash, and GPT-4o. The success rate is the percentage of trials in which the LLM successfully generated a trajectory that met the specified criteria for accuracy, synchronization, and collision avoidance. Each trajectory was evaluated over ten trials, and the success rate was calculated as the ratio of successful trials to the total number of trials.
Table 1 shows the success rates for simple and composite trajectories separately, enabling a detailed comparison of the models’ performance across different trajectory types.
In our experiments, GPT-4o (with an average success rate of 76.0%) generally outperformed the Gemini models (Gemini: 64.0%, Gemini Flash: 50.5%), demonstrating superior capability in generating complex and accurate drone trajectories based on user-defined prompts. For example, the star-shaped paths produced by the Gemini models (Figure 4 and Figure 5) were simpler and less aligned with the complex structure typically associated with a star. In contrast, GPT-4o (Figure 6) produced a design that was more intricate and recognizable as a star from the same prompt: “Generate a star-shaped trajectory using 5 drones. The drones should move in such a way that their combined flight paths trace out a symmetrical star with equal arm lengths”. Although the prompt did not specify a high level of detail, GPT-4o successfully captured the underlying intent of the design and created a sophisticated star pattern, reflecting an advanced understanding of the desired outcome and highlighting its ability to interpret and expand upon vague or minimally detailed instructions.
The quantitative performance of the models, as detailed in Table 1, further supported the observation that GPT-4o consistently achieved higher success rates across various path types.
We also note a difference in success rates between Gemini Flash (50.5%) and Gemini (64.0%), illustrating that model size mattered in our waypoint path generation task. Both Gemini and GPT-4o (76.0%), which are known to have more parameters than Gemini Flash, outperformed Gemini Flash in most of the path types.
The Gemini model struggled significantly with petal flower geometries, as evidenced by performance on the three-petal, four-petal, and five-petal flower shapes.
Figure 7 shows the one successful instance in which the Gemini model correctly generated the five-petal flower. However, the success was an outlier, as the model generally struggled to manage complex trajectories.
As illustrated in Figure 8, the most common failure mode for the Gemini model in these scenarios involved generating the wrong number of petals or failing to assign a drone to each petal. Often, the model reused a single drone to generate multiple petals, leading to incomplete or incorrect geometries.
The results suggested that the Gemini model struggled with the complexity of petal flower trajectories, particularly when multiple drones had to coordinate to produce the intended shape. While optimized prompts might yield better results, the goal of this study was not to find the most effective prompts but to compare multiple models using the same prompt, reflecting real-world use cases in which lay users may not employ specialized terms or refined instructions. As shown in Table 1, the Gemini model’s success rates were consistently lower than those of its counterparts across all petal flower geometries.
Our experiments revealed the crucial role of the various components of the high-level planning module, particularly the consensus mechanism involving the critic agents. Using multiple critic agents significantly improved the robustness of the generated trajectories by providing a rigorous evaluation of the waypoint paths, allowing the system to correct iteratively and refine outputs. The iterative process, driven by a majority voting scheme among the critics, helped mitigate errors and promote convergence toward the desired shape, albeit with some limitations.
For example, when generating a three-petal rose curve, the initial output of the Gemini model was far from the target shape. However, after several iterations in which the critic agents identified issues and provided corrective feedback, the system showed incremental improvements, as demonstrated by the log excerpt below.
2024-08-25 21:14:22,929 - INFO - Aggregated feedback from multiple critics:
 MAJORITY INVALID (0/3)
 Feedback Summary:
The feedback highlights several issues with the drone trajectories,
particularly concerning the completeness and shape of the 3-petal rose curve.
**Common Points:**
* **Drone 3’s trajectory is the biggest problem.
** All critics agree that Drone 3’s path is incomplete and does not match
the expected shape of a petal.
* **The overall shape is incorrect.
** The combined trajectories do not form a proper
3-petal rose curve. This is mainly due to Drone 3’s incomplete path.
* **Starting positions are generally good.
** There is no consensus on issues with starting positions,
except for Drone 3, which doesn’t follow its designated starting point in
the second iteration.
**Consensus:**
The consensus is that the drone trajectories are not valid and need
significant improvement. The primary focus should be on fixing Drone 3’s
path to ensure it traces a complete petal and adjusting the other drones’
paths to achieve the correct overall shape.
The feedback-driven approach allowed the system to refine the waypoints gradually, although multiple retries (up to the maximum of 10 reflection cycles) were often required to approach the desired shape. The iterative refinement process, however, was computationally expensive and sometimes resulted in marginal improvements, as indicated by the relatively low success rates for more complex geometries like the three-petal rose. Including critic agents and a consensus mechanism significantly enhanced the system’s ability to correct and improve its output, though several iterations may be required to achieve satisfactory results. The approach was particularly effective for complex, multi-agent coordination tasks in which precision and synchronization are critical.

4.3. Discussion on Varying Results Among LLMs

We observed significant variations in the performance of the three LLMs: Gemini, Gemini Flash, and GPT-4o. The discrepancy was attributed to several factors: model architecture, training data, and shape complexity.
The architecture and size of the models played a crucial role in performance. A larger model, GPT-4o, exhibited superior capabilities in generating complex and accurate trajectories compared to the Gemini models. Studies have shown that larger models perform better on tasks requiring intricate reasoning [39,40] and understanding of spatial reasoning [41], which are essential for waypoint generation tasks involving multiple drones. On the other hand, the Gemini model struggled with more complex shapes, particularly composite paths, due to a comparatively smaller size and possibly less sophisticated architecture.
The complexity of the path shapes significantly influenced the models’ performance. Our findings indicated that both Gemini and Gemini Flash had difficulty generating complex shapes, such as petal flower geometries, in which precise coordination among multiple drones is critical. This observation aligns with recent research on LLMs’ spatial reasoning capabilities. The Minesweeper study [42] found that LLMs possessed foundational abilities for spatial reasoning, though they struggled to integrate these into a coherent, multi-step logical reasoning process. This limitation explained the models’ challenges in generating intricate shapes like multi-petal flowers. Such tasks require not only basic spatial understanding but also the ability to coordinate multiple elements (drones in this case) logically across several steps. The models’ difficulty in accurately counting and assigning drones to specific path components, leading to frequent failures in more complex geometries, further exemplified the struggle with multi-step spatial reasoning.
Another factor contributing to the performance differences was the implementation of critic agents in our framework. The consensus mechanism provided by the agents allowed for an iterative refinement of the trajectories, significantly enhancing the robustness of the generated paths. While this approach proved beneficial, it also highlighted the limitations of the models, particularly in handling complex geometries. The models often required multiple iterations to converge on a satisfactory trajectory. The iterative feedback mechanism was crucial in multi-agent coordination tasks, where precision and synchronization were paramount.
In summary, the varying results among LLMs are attributed to architectural differences, the complexity of the trajectory shapes, and the effectiveness of our iterative refinement process. Due to the novelty of our approach and the lack of readily available benchmarks or open-source implementations for direct comparison, we could not include benchmark comparisons in this study. We made our code open source to facilitate future research and encourage benchmarking in this area. Future work will explore optimizing model architectures or incorporating additional training data focused on complex trajectory generation to improve performance across all models. Furthermore, implementing LLM-based frameworks from existing research for comparison with our open-source implementation and data is a compelling area of future work.

5. Ablation Studies

To assess specifically the impact of critic agents on our framework, we conducted three ablation studies: (1) comparing the performance of no critics, one critic, and three critics, (2) analyzing computational time across different model configurations, and (3) evaluating drone capacity scaling. The primary critic study was designed to isolate the role of critic agents in improving trajectory generation, synchronization, and accuracy. By varying the number of critics, we aimed to quantify their effect on the system’s ability to refine and correct drone waypoints and to provide insights into the optimal configuration for robust performance. Additionally, our timing analysis investigated the computational costs of different model configurations, while our drone capacity study explored the scalability limits of our system from simple to complex shapes.

5.1. Contributions of High-Level Planning Modules

The final high-level planning module was a group of LLM agents, each with a distinct function. The benefits and drawbacks of each agent type other than the generator in achieving the desired waypoints are explained below.
  • Instructor agents translate the natural language prompt into a detailed set of requirements that guide the generator agent, serving efficiency more than performance. Since audio data are much larger in storage size than text, sharing audio across multiple agent calls is expensive compared to simple text descriptions. Because text is more efficient to manipulate than audio, we utilized an instructor agent so that the audio command is handled only once, during the initial conversion of the audio command to text.
  • Critic agents provide feedback to the generator. In Table 2, we compare the performance of no critic, one critic, and three critics. We chose three critics to achieve two objectives: to overcome hallucinations in the feedback evaluation through redundancy and to use an odd number of critics to guarantee no ties. Additional critics can be included as long as the total number remains odd to prevent tie votes; however, more critics add computation cost and overhead. Three is the minimum number of critics satisfying both objectives while keeping the framework efficient. In Table 2, we observe that three critics significantly outperformed the other configurations with an average success rate of 64.0%, compared to a single critic (56.0%) and no critics (54.5%). Three critics achieved the best performance in most path types, demonstrating substantial improvements in complex trajectories like the cross (100%), helix (100%), and zigzag (90%). A single critic showed modest improvements over no critics (56.0% vs. 54.5%), suggesting that even minimal feedback helps refine trajectories. However, we observed interesting failure modes where no critics performed better, particularly in paths like the triangle (90% vs. 70%), square (80% vs. 60%), and octagon (40% vs. 30%). This suggests that when the generator is already confident in its generation, adding critics may introduce confusion and lead the model away from an initially correct trajectory. These findings highlight the trade-off between the benefits of multiple perspectives and the potential for overcomplicated feedback in simpler scenarios.
  • Aggregator agents: Like the instructor agent, the aggregator agent exists for efficiency. With multiple critics, the generator’s context window was quickly exhausted by mostly redundant information from the critics. The critics’ outputs also sometimes conflicted because of occasional hallucinations in the LLM/VLM experiments. We therefore used an aggregator agent to provide a single, unambiguous feedback signal to the generator agent.
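To make the agents’ interaction concrete, the following Python sketch outlines the generate–critique–aggregate loop. It is a minimal illustration rather than our production code: `instructor`, `generator`, `critics`, `aggregator`, and `execute_and_plot` are hypothetical callables standing in for the LLM/VLM calls and plotting utilities described above.

```python
# A minimal sketch of the generate-critique-aggregate loop. The agent
# callables are hypothetical stand-ins for LLM/VLM API calls, not the
# actual interfaces used in our implementation.

def refine_waypoints(prompt, instructor, generator, critics, aggregator,
                     execute_and_plot, max_iters=10):
    requirements = instructor(prompt)        # audio/text -> text requirements
    feedback = ""                            # empty feedback on round one
    waypoints = None
    for _ in range(max_iters):
        code = generator(requirements, feedback)        # synthesized Python
        waypoints, plot_image = execute_and_plot(code)  # run code, render plot
        # Each critic independently reviews the rendered trajectory.
        reviews = [critic(requirements, plot_image) for critic in critics]
        approvals = sum(1 for r in reviews if r["approved"])
        if approvals > len(critics) // 2:    # odd critic count -> no ties
            break                            # majority approves: stop early
        # Condense possibly conflicting critiques into one clear signal.
        feedback = aggregator(reviews, feedback)
    return waypoints
```

The early-exit condition implements the majority vote: with an odd number of critics, `approvals > len(critics) // 2` can never produce a tie, which is precisely why we require an odd critic count.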

5.2. Timing Analysis

The timing analysis in Table 3 reveals several notable patterns in the computational costs of different model configurations; a sketch of one way to instrument such per-phase timings follows the list:
  • Examining the Gemini configurations with varying numbers of critics shows a clear trade-off between refinement capability and computational cost. While Gemini without reflection was the fastest, with a total time of 5.45 s, it lacked any refinement mechanism. Adding critics significantly increased total computation time: one critic (127.72 s), three critics (203.60 s), and five critics (444.36 s). The increase was driven by two factors: (1) reflection time per round grew substantially with more critics (from 9.23 s with one critic to 59.33 s with five) as more feedback had to be processed, and (2) the generation phase became more complex (from 4.24 s to 14.73 s) as the model had to process more comprehensive feedback. Notably, adding critics did not necessarily reduce the number of rounds needed: one and five critics averaged six rounds while three critics averaged eight, suggesting diminishing returns from adding critics beyond a certain point.
  • Comparing Gemini Flash (small model) with Gemini (large model) in their three-critic configurations reveals the efficiency–capability trade-off. While Gemini Flash had a notably lower average success rate (50.5% vs. Gemini’s 64.0%), it achieved faster generation (1.73 s vs. 5.21 s) and reflection times (9.38 s vs. 20.78 s). However, it required more reflection rounds on average (nine vs. eight) to reach satisfactory results, indicating that while individual operations were faster, the model often needed more iterations to converge and still achieved lower performance. The result was a substantially lower total time (94.95 s vs. 203.60 s), suggesting that the smaller model may be preferable when computational resources are constrained and longer convergence is acceptable, though this comes at a significant performance cost (−13.5% success rate).
  • GPT-4o demonstrated superior per-iteration performance among the three-critic configurations, with the highest average success rate of 76.0% compared with Gemini (64.0%) and Gemini Flash (50.5%). Despite higher generation (8.55 s) and reflection (31.48 s) times than both Gemini (5.21 s, 20.78 s) and Gemini Flash (1.73 s, 9.38 s), it required significantly fewer reflection rounds (two vs. eight and nine, respectively). This efficiency in convergence produced the best total time (80.06 s) among configurations with critics, outperforming even the smaller Gemini Flash model (94.95 s). GPT-4o’s enhanced capabilities thus enabled it to generate higher-quality outputs that needed less refinement, making it more efficient overall despite higher per-operation costs. The combination of the highest success rate (+12.0% over Gemini, +25.5% over Gemini Flash) and the fastest total computation time underscores GPT-4o’s strength in both quality and efficiency. These results show that investing in a stronger base model for the LEVIOSA framework yields compounding benefits: better generation quality requires fewer refinement iterations, leading to both superior performance and faster convergence.
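The per-phase numbers above can be collected with simple wall-clock instrumentation around each phase of the refinement loop. The sketch below shows one plausible harness, not our exact measurement code; the phase labels and the wrapped calls are illustrative.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

timings = defaultdict(list)

@contextmanager
def timed(phase):
    """Record the wall-clock duration of one phase under the given label."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[phase].append(time.perf_counter() - start)

# Inside the refinement loop, each phase would be wrapped, e.g.:
#   with timed("generation"):
#       code = generator(requirements, feedback)
#   with timed("reflection"):
#       reviews = [critic(requirements, plot_image) for critic in critics]

def report_averages():
    for phase, samples in sorted(timings.items()):
        print(f"{phase}: {sum(samples) / len(samples):.2f} s "
              f"average over {len(samples)} rounds")
```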

5.3. Drone Capacity

The system’s scalability theoretically allows it to coordinate an arbitrary number of drones, denoted N, by dynamically generating the code needed to produce the required waypoints. In practice, however, the system’s capacity depends heavily on the language model’s ability to comprehend the complexity of the query and generate Python code that manages that complexity effectively. Specifically, the challenge lies in decomposing a complex overall trajectory into manageable sub-trajectories that individual drones can execute. The model must understand the intricate inter-dependencies between the drones’ paths and generate efficient, executable code that ensures coordination and avoids conflicts. Given the current limitations of large language models, handling such complexity at scale is difficult, so the system’s effectiveness hinges on the model’s ability to navigate these challenges and produce reliable code supporting the desired level of drone coordination. For example, Figure 9 shows that the system has no trouble synthesizing the Python code necessary to coordinate 1000 drones for easily decomposed shapes like parallel lines (a representative sketch of such code follows this paragraph). Similarly, Figure 10 shows that the system can work out how to distribute a spiral across one hundred drones. For highly intricate shapes such as a dragon (Figure 11), however, the system struggles to produce the desired shape with a thousand drones. Handling such shapes will require further research into imbuing LLMs with spatial reasoning and planning, so that the model can break a complex shape into more manageable sub-shapes, reason about how those sub-shapes compose back into the original, and plan the generation process; this is the subject of future work.
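To illustrate why parallel lines decompose so easily, the snippet below is representative of the kind of code the generator agent synthesizes for this prompt; the function name and parameters are ours, and the model’s actual output varies from run to run. Each drone receives an independent straight-line sub-trajectory, so per-drone complexity is constant regardless of N.

```python
import numpy as np

def parallel_line_waypoints(n_drones=1000, n_points=20,
                            length=10.0, spacing=0.5, altitude=5.0):
    """Waypoints for n_drones flying parallel straight lines.

    Returns an array of shape (n_drones, n_points, 3) holding (x, y, z)
    waypoints. Each drone's line is independent of every other drone's,
    which is why this shape decomposes trivially at any scale.
    """
    xs = np.linspace(0.0, length, n_points)    # shared progression along x
    waypoints = np.zeros((n_drones, n_points, 3))
    for d in range(n_drones):
        waypoints[d, :, 0] = xs                # forward motion
        waypoints[d, :, 1] = d * spacing       # one lane per drone
        waypoints[d, :, 2] = altitude          # constant flight height
    return waypoints

print(parallel_line_waypoints().shape)  # (1000, 20, 3)
```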

6. Findings

Our experiments demonstrated the effectiveness of the LEVIOSA framework for natural language UAV orchestration across various trajectory types. The framework showed particular strength in handling composite trajectories like stars, zigzags, and pentagons, while continuous trajectories presented more challenges. The results highlighted several key findings:
  • Role of critic agents: The inclusion of critic agents and the consensus mechanism significantly enhanced the robustness of the generated trajectories. By providing iterative feedback, the system could correct and refine outputs over several iterations. This approach was particularly effective for complex, multi-agent coordination tasks, where precision and synchronization are critical.
  • Model performance: GPT-4o consistently achieved higher success rates than Gemini and GeminiFlash across various path types. For instance, GPT-4o produced a more intricate and recognizable design for the star-shaped path, successfully capturing the underlying intent despite minimal detail in the prompt. In contrast, the trajectories generated by the Gemini models were simpler and less aligned with the complex structure typically associated with a star.
  • Complex geometries: The Gemini model struggled significantly with petal flower geometries, as evidenced by lower success rates for the 3-petal, 4-petal, and 5-petal rose curves. Common failure modes included generating the wrong number of petals or failing to assign a drone to each petal, leading to incomplete or incorrect geometries. This suggests the model has difficulty handling complex spatial reasoning tasks required for intricate shapes.
  • Impact of model size: There was a noticeable difference in success rates between GeminiFlash and the larger models (Gemini and GPT-4o), illustrating that model size matters in the waypoint path generation task. Both Gemini and GPT-4o, which are larger in terms of parameters, outperformed GeminiFlash in most path types, indicating that larger models may have more capacity to handle the complex spatial reasoning and code synthesis required for generating accurate trajectories.
  • Generative iterations: The iterative process involving generation and reflection cycles contributed to incremental improvements in trajectory generation. However, achieving satisfactory results for more complex geometries often required multiple retries up to the maximum allowed, which could be computationally expensive and yield marginal improvements.
  • Computational efficiency: Timing analysis revealed three key insights about computational trade-offs: (1) adding critics increased computational time; (2) while smaller models like Gemini Flash had faster per-operation times, they required more refinement iterations, leading to longer total execution times than the more capable GPT-4o; (3) GPT-4o, despite higher per-operation costs, achieved better overall efficiency through fewer refinement iterations, demonstrating that model capability has more impact on total performance than raw operational speed.
These findings suggest that while the LEVIOSA framework effectively translates natural language commands into executable drone trajectories, its performance is influenced by the capabilities of the underlying LLMs, particularly in handling complex spatial reasoning tasks. The critic agents are also vital in refining the outputs, especially for complex geometries requiring precise coordination among multiple drones.

7. Conclusions

This paper presented LEVIOSA, a novel framework for UAV trajectory generation that leverages large language models to translate natural language commands into executable flight paths for drone swarms. Through innovative components like the multi-critic consensus mechanism and hierarchical prompt structuring, LEVIOSA demonstrated effective coordination of multiple UAVs while maintaining safety and trajectory fidelity.
Our experimental results revealed both the strengths and limitations of different LLM architectures in this domain. GPT-4o consistently outperformed Gemini models in generating complex and accurate trajectories, particularly for composite shapes like stars and pentagons. The framework showed particular promise in handling decomposable geometric patterns, successfully coordinating up to 1000 drones for simple formations. However, performance degraded with highly intricate shapes, indicating current limitations in LLMs’ spatial reasoning capabilities.
The ablation studies highlighted the crucial role of the multi-critic system in improving trajectory quality, though at the cost of increased computational overhead. This trade-off between accuracy and processing time emerges as a key consideration for real-world applications. Additionally, while the current framework excels in static environments, its inability to handle dynamic obstacles and real-time trajectory adjustments represents a significant limitation for practical deployments. Despite these challenges, LEVIOSA represents a significant advancement in natural language-based UAV control, offering an intuitive interface for drone swarm coordination. The framework’s success in translating high-level commands into precise flight paths demonstrates the potential of LLM-based approaches in robotics, paving the way for more accessible and flexible autonomous systems.
In future work, we plan to implement and test the LEVIOSA framework on real UAV hardware to evaluate its effectiveness under real-world conditions, including dynamic and unstructured environments. This deployment will involve near real-time analysis and adaptation to ensure the system can respond quickly to environmental changes, further enhancing the robustness of UAV swarm coordination.

Future Work

Beyond the hardware deployment and real-time adaptation outlined above, future research will focus on three directions: dynamic obstacle handling, scalability, and heterogeneous swarm coordination. A critical area for improvement is the development of robust algorithms for detecting and avoiding dynamic obstacles while maintaining formation and mission objectives. This could involve integrating real-time sensor-data processing, predictive modeling of obstacle movements, and rapid trajectory re-planning techniques. The work of Lin et al. on dual-game-based UAV swarm obstacle avoidance highlights the potential of such game-theoretic approaches [43].
As UAV swarms grow in size and complexity, investigating methods to scale LEVIOSA efficiently becomes crucial. This could involve developing hierarchical control structures, distributed computing approaches, and communication protocols optimized for large-scale swarm coordination. The research by Albrekht and Pysarenko on heterogeneous UAV swarms using reinforcement learning provides insights into scaling swarm intelligence [44].
Additionally, extending the framework to manage heterogeneous swarms comprising different types of UAVs with varying capabilities would significantly broaden its applicability. This research direction might explore task allocation strategies, role-based coordination methods, and adaptive formation control algorithms suitable for diverse UAV teams. The comprehensive review by Chen et al. on collaborative task assignment for heterogeneous UAVs underscores the importance and challenges of this area [45].

Author Contributions

Conceptualization, G.A., M.P.D. and K.J.M.; data curation, M.P.D. and K.J.M.; formal analysis, G.A., M.P.D. and K.J.M.; funding acquisition, K.-D.N.; investigation, G.A., M.P.D. and K.J.M.; methodology, G.A., M.P.D., K.J.M. and T.C.E.; project administration, K.-D.N.; software, G.A., M.P.D. and K.J.M.; supervision, T.C.E. and K.-D.N.; validation, G.A., M.P.D., K.J.M. and K.-D.N.; visualization, G.A., M.P.D. and T.C.E.; writing—original draft, G.A., M.P.D., K.J.M., T.C.E. and K.-D.N.; writing—review and editing, K.J.M., T.C.E. and K.-D.N. All authors have read and agreed to the published version of the manuscript.

Funding

This material is based upon work supported by the U.S. National Science Foundation under CMMI Grant No. 2138206 and EEC Grant No. 2245022. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

Data Availability Statement

The data supporting this study’s findings are available from the corresponding author upon reasonable request. The code for this project is available at: https://github.com/sesem738/Leviosa (accessed on 17 October 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Algorithm Description

Algorithm A1 presents the two main components of LEVIOSA: the high-level planner that processes natural language commands and generates waypoints using multiple LLM/VLM agents, and the per-UAV low-level controller policies that execute the generated trajectories.
Algorithm A1 LEVIOSA framework.
1. High-level planner Φ
Input: text prompt p or voice audio v; maximum iterations M; number of critics C
Output: set of waypoint coordinates {(δx_{n,d}, δy_{n,d}, δz_{n,d})} for N drones
1:  p_req ← φ_i(p, v)                          ▷ Instructor VLM agent converts the input to requirements
2:  f ← ∅                                      ▷ Initialize empty feedback
3:  for i ← 1 to M do                          ▷ Generation–reflection loop
    % Generation phase
4:      code ← φ_g(p_req, f)                   ▷ Generator LLM agent synthesizes Python code
5:      {(δx_{n,d}, δy_{n,d}, δz_{n,d})} ← ExecuteCode(code)    ▷ Non-agent function
6:      img ← PlotWaypoints({(δx_{n,d}, δy_{n,d}, δz_{n,d})})   ▷ Non-agent function
    % Reflection phase
7:      for j ← 1 to C do
8:          f_j ← φ_{c_j}(p_req, img)          ▷ Critic VLM agents evaluate the plot
9:      end for
10:     f ← φ_a(f_{1:C}, f)                    ▷ Aggregator LLM agent combines feedback
11:     if MajorityCriticsApprove(f_{1:C}) then     ▷ Non-agent function
12:         break                              ▷ Stop: satisfactory waypoints found
13:     end if
14: end for
15: return {(δx_{n,d}, δy_{n,d}, δz_{n,d})}
 
2. Per-UAV low-level controller π_θ            ▷ Executed in parallel for each of the N drones
Input: waypoint sequence {(δx_{n,d}, δy_{n,d}, δz_{n,d})}_{d=1..D} from the high-level planner; current state s_t^n
Output: sequence of control actions {a_t^n} for drone n until the terminal position is reached
16: while not reached terminal position do
17:     a_t^n ← π_θ(s_t^n)                     ▷ Generate a control action using the RL policy
18:     Execute a_t^n and observe the next state s_{t+1}^n
19:     s_t^n ← s_{t+1}^n
20: end while
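As a complement to the pseudocode, the following is a minimal Python sketch of the per-UAV control loop in part 2 of Algorithm A1. The `policy` and `env` objects are hypothetical stand-ins for the trained RL policy and the quadcopter simulation environment; the real interfaces differ.

```python
# Minimal sketch of the per-UAV low-level control loop (Algorithm A1,
# part 2). `policy` and `env` are hypothetical stand-ins, not our exact API.

def fly_waypoints(policy, env, drone_id, waypoints):
    """Drive one drone along its waypoint sequence using the RL policy."""
    state = env.reset(drone_id, waypoints)   # state encodes the next waypoint
    done = False
    while not done:                          # until the terminal position
        action = policy(state)               # a_t^n = pi_theta(s_t^n)
        state, done = env.step(drone_id, action)
    return state                             # final state at terminal position
```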

Appendix B

Table A1. Path types and corresponding prompts.

Path Type | Path Name | Prompt
Single | Circle | Create a circular trajectory using 2 drones, where each drone traces out one half of the circle. The drones should move in perfect synchronization to form a complete circle.
Single | Hyperbola | Design a hyperbolic path using 2 drones, with each drone tracing one branch of the hyperbola. The drones should maintain symmetry and smoothness in their paths.
Single | 3-Petal rose | Generate a 3-petal rose curve using 3 drones, where each drone is responsible for tracing out one petal. The drones should coordinate to form a seamless rose pattern.
Single | 4-Petal rose | Create a 4-petal rose curve using 4 drones, with each drone tracing one petal. The drones should work together to ensure the rose curve is smooth and continuous.
Single | 5-Petal rose | Design a 5-petal rose curve using 5 drones, where each drone forms one petal. The drones should synchronize their movements to create a harmonious rose shape.
Single | Sine wave | Construct a sine wave pattern using 3 drones, where each drone covers a separate section of the wave. The drones should ensure a continuous and smooth wave formation.
Single | Helix | Draw a helical path using 1 drone, creating a spiral in three-dimensional space. The drone should maintain a consistent radius and pitch throughout the helix.
Single | Double helix | Create a double helix trajectory using 2 drones, with each drone forming one strand of the helix. The drones should maintain parallel paths and synchronized movement.
Single | Triple helix | Generate a triple helix pattern using 3 drones, with each drone forming one strand. The drones should coordinate to maintain uniform spacing and synchronization.
Single | Double conical helix | Design a double conical helix using 2 drones, where each drone traces one conical spiral. The drones should ensure the cones are symmetrical and the paths are smooth.
Composite | Star | Generate a star-shaped trajectory using 5 drones. The drones should move in such a way that their combined flight paths trace out a symmetrical star with equal arm lengths.
Composite | Zigzag | Create a dynamic zigzag pattern using 3 drones. The drones should move in unison, forming a synchronized zigzag path. Each drone should follow a separate path within the zigzag, ensuring the pattern is evenly spaced and consistent throughout the trajectory.
Composite | Heart | Design a geometric, angular heart-shaped path using 2 drones. Each drone should trace one half of the heart, starting from the bottom point and meeting at the top. The heart should have an angular appearance, with both halves perfectly mirroring each other.
Composite | Cross | Generate a cross-shaped path using 2 drones. Each drone should be responsible for one arm of the cross. Ensure that the paths are perpendicular to each other and intersect at the center.
Composite | Pentagon | Create a pentagon using 5 drones. Each drone should trace one side of the pentagon, with their paths combining to form the shape.
Composite | Hexagon | Design a hexagon-shaped path using 3 drones, each responsible for two sides of the hexagon. The drones should work together to form a complete hexagon, ensuring that the drones’ paths connect seamlessly at the vertices to maintain the shape’s integrity.
Composite | Triangle | Create an equilateral triangle path using 3 drones. Each drone should trace one side of the triangle, starting from a common point and moving outward to form the triangle. The drones should synchronize their movements to complete the triangle simultaneously.
Composite | Square | Generate a square trajectory using 4 drones. Each drone should be responsible for one side of the square, ensuring that the angles at each corner are well-defined. The drones should coordinate their movements to maintain equal side lengths and complete the square simultaneously.
Composite | Octagon | Design an octagon-shaped path using 8 drones. Each drone should be responsible for tracing two sides of the octagon. Ensure that the drones’ paths create a symmetric and precise overall shape.
Composite | Pyramid | Create a pyramid-shaped path using 4 drones. Each drone should trace one side of the pyramid, starting from the base and converging at the apex. The drones should coordinate their movements to form a symmetrical and well-defined pyramid shape.

References

  1. Javaid, S.; Fahim, H.; He, B.; Saeed, N. Large language models for UAVs: Current state and pathways to the future. arXiv 2024, arXiv:2405.01745.
  2. Tzachor, A.; Devare, M.; Richards, C.; Pypers, P.; Ghosh, A.; Koo, J.; Johal, S.; King, B. Large language models and agricultural extension services. Nat. Food 2023, 4, 941–948.
  3. Shi, L.; Mehrooz, G.; Jacobsen, R.H. Inspection Path Planning for Aerial Vehicles via Sampling-based Sequential Optimization. In Proceedings of the 2021 International Conference on Unmanned Aircraft Systems (ICUAS), Athens, Greece, 15–18 June 2021; pp. 679–687.
  4. Pu, H.; Yang, X.; Li, J.; Guo, R. AutoRepo: A general framework for multimodal LLM-based automated construction reporting. Expert Syst. Appl. 2024, 255, 124601.
  5. Wan, G.; Wu, Y.; Chen, J.; Li, S. CoT Rerailer: Enhancing the Reliability of Large Language Models in Complex Reasoning Tasks through Error Detection and Correction. arXiv 2024, arXiv:2408.13940.
  6. Mikami, Y.; Melnik, A.; Miura, J.; Hautamäki, V. Natural Language as Policies: Reasoning for Coordinate-Level Embodied Control with LLMs. arXiv 2024, arXiv:2403.13801.
  7. Chen, Y.; Arkin, J.; Zhang, Y.; Roy, N.; Fan, C. Scalable multi-robot collaboration with large language models: Centralized or decentralized systems? In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 4311–4317.
  8. Ivanova, N. Swarm Robotics-Coordination and Cooperation: Exploring Coordination and Cooperation Strategies in Swarm Robotics Systems for Achieving Collective Tasks. J. Comput. Intell. Robot. 2024, 4, 1–13.
  9. Zu, W.; Song, W.; Chen, R.; Guo, Z.; Sun, F.; Tian, Z.; Pan, W.; Wang, J. Language and Sketching: An LLM-driven Interactive Multimodal Multitask Robot Navigation Framework. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1019–1025.
  10. Mandi, Z.; Jain, S.; Song, S. RoCo: Dialectic multi-robot collaboration with large language models. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 286–299.
  11. Mi, J.; Liang, H.; Katsakis, N.; Tang, S.; Li, Q.; Zhang, C.; Zhang, J. Intention-related natural language grounding via object affordance detection and intention semantic extraction. Front. Neurorobot. 2020, 14, 26.
  12. Stramandinoli, F.; Tikhanoff, V.; Pattacini, U.; Nori, F. Grounding speech utterances in robotics affordances: An embodied statistical language model. In Proceedings of the 2016 Joint IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob), Cergy-Pontoise, France, 19–22 September 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 79–86.
  13. Mees, O.; Borja-Diaz, J.; Burgard, W. Grounding language with visual affordances over unstructured data. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 11576–11582.
  14. Wu, X.; Xian, R.; Guan, T.; Liang, J.; Chakraborty, S.; Liu, F.; Sadler, B.; Manocha, D.; Bedi, A.S. On the safety concerns of deploying LLMs/VLMs in robotics: Highlighting the risks and vulnerabilities. arXiv 2024, arXiv:2402.10340.
  15. Huang, W.; Xia, F.; Xiao, T.; Chan, H.; Liang, J.; Florence, P.; Zeng, A.; Tompson, J.; Mordatch, I.; Chebotar, Y.; et al. Inner monologue: Embodied reasoning through planning with language models. arXiv 2022, arXiv:2207.05608.
  16. Jiao, A.; Patel, T.P.; Khurana, S.; Korol, A.M.; Brunke, L.; Adajania, V.K.; Culha, U.; Zhou, S.; Schoellig, A.P. Swarm-GPT: Combining large language models with safe motion planning for robot choreography design. arXiv 2023, arXiv:2312.01059.
  17. Liu, H.; Zhu, Y.; Kato, K.; Tsukahara, A.; Kondo, I.; Aoyama, T.; Hasegawa, Y. Enhancing the LLM-Based Robot Manipulation Through Human-Robot Collaboration. arXiv 2024, arXiv:2406.14097.
  18. Adajania, V.K.; Zhou, S.; Singh, A.K.; Schoellig, A.P. AMSwarm: An Alternating Minimization Approach for Safe Motion Planning of Quadrotor Swarms in Cluttered Environments. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 1421–1427.
  19. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
  20. Brohan, A.; Chebotar, Y.; Finn, C.; Hausman, K.; Herzog, A.; Ho, D.; Ibarz, J.; Irpan, A.; Jang, E.; Julian, R.; et al. Do as I can, not as I say: Grounding language in robotic affordances. In Proceedings of the Conference on Robot Learning, Atlanta, GA, USA, 6–9 November 2023; PMLR: New York, NY, USA, 2023; pp. 287–318.
  21. Yan, K.; Ji, L.; Wang, Z.; Wang, Y.; Duan, N.; Ma, S. Voila-A: Aligning Vision-Language Models with User’s Gaze Attention. arXiv 2023, arXiv:2401.09454.
  22. Naik, R.; Chandrasekaran, V.; Yuksekgonul, M.; Palangi, H.; Nushi, B. Diversity of Thought Improves Reasoning Abilities of LLMs. arXiv 2024, arXiv:2310.07088.
  23. Wang, X.; Wang, Z.; Liu, J.; Chen, Y.; Yuan, L.; Peng, H.; Ji, H. MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback. arXiv 2024, arXiv:2309.10691.
  24. Lou, J.; Wu, W.; Liao, S.; Shi, R. Air-M: A Visual Reality Many-Agent Reinforcement Learning Platform for Large-Scale Aerial Unmanned System. In Proceedings of the 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 1–5 October 2023; pp. 5598–5605.
  25. Aikins, G.; Jagtap, S.; Gao, W. Resilience Analysis of Deep Q-Learning Algorithms in Driving Simulations Against Cyberattacks. In Proceedings of the 2022 1st International Conference on AI in Cybersecurity (ICAIC), Victoria, TX, USA, 24–26 May 2022; pp. 1–6.
  26. Ho, T.M.; Nguyen, K.K.; Cheriet, M. UAV Control for Wireless Service Provisioning in Critical Demand Areas: A Deep Reinforcement Learning Approach. IEEE Trans. Veh. Technol. 2021, 70, 7138–7152.
  27. Amendola, J.; Cenkeramaddi, L.R.; Jha, A. Drone Landing on Moving UGV Platform with Reinforcement Learning Based Offsets. In Proceedings of the 2023 IEEE International Symposium on Smart Electronic Systems (iSES), Ahmedabad, India, 18–20 December 2023; pp. 16–21.
  28. Yun, W.J.; Park, S.; Kim, J.; Shin, M.; Jung, S.; Mohaisen, D.A.; Kim, J.H. Cooperative Multiagent Deep Reinforcement Learning for Reliable Surveillance via Autonomous Multi-UAV Control. IEEE Trans. Ind. Inform. 2022, 18, 7086–7096.
  29. Tovarnov, M.S.; Bykov, N.V. Reinforcement learning reward function in unmanned aerial vehicle control tasks. J. Phys. Conf. Ser. 2022, 2308, 012004.
  30. Geles, I.; Bauersfeld, L.; Romero, A.; Xing, J.; Scaramuzza, D. Demonstrating Agile Flight from Pixels without State Estimation. arXiv 2024, arXiv:2406.12505.
  31. Aikins, G.; Jagtap, S.; Nguyen, K.D. A Robust Strategy for UAV Autonomous Landing on a Moving Platform under Partial Observability. Drones 2024, 8, 232.
  32. Alon, Y.; Zhou, H. Multi-agent reinforcement learning for unmanned aerial vehicle coordination by multi-critic policy gradient optimization. arXiv 2020, arXiv:2012.15472.
  33. Guo, J.; Chen, Y.; Hao, Y.; Yin, Z.; Yu, Y.; Li, S. Towards comprehensive testing on the robustness of cooperative multi-agent reinforcement learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 115–122.
  34. Zhang, J.; Hu, C.; Cai, R.; Wang, W.; Yan, J.; Lv, C. Safe Trajectory Generation for Complex Urban Environments Using Spatio-temporal Semantic Corridor. IEEE Robot. Autom. Lett. 2018, 3, 2784–2791.
  35. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347.
  36. Panerati, J.; Zheng, H.; Zhou, S.; Xu, J.; Prorok, A.; Schoellig, A.P. Learning to fly—a gym environment with pybullet physics for reinforcement learning of multi-agent quadcopter control. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 7512–7519.
  37. Hou, Y.; Liang, X.; Lv, M.; Yang, Q.; Li, Y. Subtask-masked curriculum learning for reinforcement learning with application to UAV maneuver decision-making. Eng. Appl. Artif. Intell. 2023, 125, 106703.
  38. Kurkcu, A.; Acar, C.; Campolo, D.; Tee, K.P. Discrete Task-Space Automatic Curriculum Learning for Robotic Grasping. In Proceedings of the 2021 21st International Conference on Control, Automation and Systems (ICCAS), Jeju, Republic of Korea, 12–15 October 2021; pp. 731–738.
  39. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.H.; Le, Q.V.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, New Orleans, LA, USA, 28 November–9 December 2022; Curran Associates Inc.: Red Hook, NY, USA, 2024.
  40. Angel, M.; Rinehart, J.B.; Canneson, M.; Baldi, P. Clinical Knowledge and Reasoning Abilities of AI Large Language Models in Anesthesiology: A Comparative Study on the American Board of Anesthesiology Examination. Anesth. Analg. 2024.
  41. Xu, L.; Zhao, S.; Lin, Q.; Chen, L.; Luo, Q.; Wu, S.; Ye, X.; Feng, H.; Du, Z. Evaluating Large Language Models on Spatial Tasks: A Multi-Task Benchmarking Study. arXiv 2024, arXiv:2408.14438.
  42. Li, Y.; Wang, H.; Zhang, C. Assessing Logical Puzzle Solving in Large Language Models: Insights from a Minesweeper Case Study. In Proceedings of the North American Chapter of the Association for Computational Linguistics, Mexico City, Mexico, 16–21 June 2024.
  43. Lin, Y.; Na, Z.; Feng, Z.; Lin, B.; Lin, Y. Dual-game based UAV swarm obstacle avoidance algorithm in multi-narrow type obstacle scenarios. EURASIP J. Adv. Signal Process. 2023, 2023, 118.
  44. Albrekht, Y.; Pysarenko, A. Exploring the power of heterogeneous UAV swarms through reinforcement learning. Technol. Audit Prod. Reserv. 2023, 6, 6–10.
  45. Chen, J.; Xiao, K.; You, K.; Qing, X.; Ye, F.; Sun, Q. Hierarchical task assignment strategy for heterogeneous multi-UAV system in large-scale search and rescue scenarios. Int. J. Aerosp. Eng. 2021, 2021, 7353697.
Figure 1. Our framework incorporates several LLMs to generate and refine drone waypoints based on user commands.
Figure 2. Illustrative diagram of the components of the high-level planner system, showing the role of each LLM agent type, their inputs, and outputs. (a) Instructor agent. (b) Generator agent. (c) Critic agents. (d) Aggregator agent.
Figure 3. The overall trajectory is divided into individual waypoints for each drone. The waypoints, combined with each drone’s real-time observations, are then processed by the dedicated low-level policy for that UAV. The process generates the specific actions required to guide the drone’s movement.
Figure 4. Sample star trajectory generated by Gemini.
Figure 5. Sample star trajectory generated by GeminiFlash.
Figure 6. Sample star trajectory generated by GPT-4o.
Figure 7. Successful 5-petal flower trajectory generated by the Gemini model.
Figure 8. Common failure mode of the Gemini model for petal flower geometries.
Figure 9. A thousand drones successfully form parallel lines in a trajectory generated by Gemini.
Figure 10. One hundred drones successfully form a spiral in a trajectory generated by Gemini.
Figure 11. A thousand drones unsuccessfully form a dragon in a trajectory generated by Gemini.
Table 1. Success rates of LLMs in generating specified waypoint paths. All columns use the multi-agent flow. Bold and underline represent the best and second-best results, respectively.

Path Type | Path Name | Gemini (%) | GeminiFlash (%) | GPT-4o (%)
Single | Circle | 90 | 90 | 80
Single | Hyperbola | 70 | 10 | 10
Single | 3-Petal rose | 70 | 50 | 90
Single | 4-Petal rose | 70 | 30 | 100
Single | 5-Petal rose | 70 | 70 | 100
Single | Sine wave | 20 | 60 | 60
Single | Helix | 100 | 90 | 100
Single | Double helix | 90 | 30 | 80
Single | Triple helix | 80 | 60 | 100
Single | Double conical helix | 50 | 0 | 30
Composite | Star | 40 | 40 | 80
Composite | Zigzag | 90 | 60 | 90
Composite | Heart | 10 | 0 | 10
Composite | Cross | 100 | 60 | 100
Composite | Pentagon | 70 | 80 | 90
Composite | Hexagon | 10 | 20 | 80
Composite | Triangle | 70 | 60 | 30
Composite | Square | 60 | 90 | 100
Composite | Octagon | 30 | 40 | 90
Composite | Pyramid | 90 | 70 | 100
 | Average Success Rate | 64.0 | 50.5 | 76.0
Table 2. Ablation results for critic agents in our framework. Results are from Gemini. Bold and underline represent the best and second-best results, respectively.

Path Type | Path Name | No Critic (%) | One Critic (%) | Three Critics (%)
Single | Circle | 90 | 80 | 90
Single | Hyperbola | 50 | 50 | 70
Single | 3-Petal rose | 40 | 50 | 70
Single | 4-Petal rose | 30 | 70 | 70
Single | 5-Petal rose | 70 | 70 | 70
Single | Sine wave | 40 | 40 | 20
Single | Helix | 90 | 90 | 100
Single | Double helix | 30 | 80 | 90
Single | Triple helix | 80 | 90 | 80
Single | Double conical helix | 40 | 20 | 50
Composite | Star | 30 | 40 | 40
Composite | Zigzag | 50 | 40 | 90
Composite | Heart | 0 | 0 | 10
Composite | Cross | 80 | 70 | 100
Composite | Pentagon | 40 | 30 | 70
Composite | Hexagon | 30 | 40 | 10
Composite | Triangle | 90 | 60 | 70
Composite | Square | 80 | 80 | 60
Composite | Octagon | 40 | 30 | 30
Composite | Pyramid | 100 | 90 | 90
 | Average Success Rate | 54.5 | 56.0 | 64.0
Table 3. Computational time analysis across different model configurations. The times shown are averages across all trials. Generation time represents the waypoint generation phase; reflection time represents one reflection cycle; rounds represent the average number of reflection rounds; and total time includes all iterations. Bold and underline represent the best and second best values, respectively, for each metric.

Model | Generation (s) | Reflection (s) | Rounds | Total Time (s)
Gemini (no reflection) | 5.45 | N/A | N/A | 5.45
Gemini (1 critic) | 4.24 | 9.23 | 6 | 127.72
Gemini (3 critics) | 5.21 | 20.78 | 8 | 203.60
Gemini (5 critics) | 14.73 | 59.33 | 6 | 444.36
GPT-4o (3 critics) | 8.55 | 31.48 | 2 | 80.06
Gemini Flash (3 critics) | 1.73 | 9.38 | 9 | 94.95