1. Introduction
Unmanned aerial vehicles (UAVs) have been widely adopted in emergency rescue, instant messaging, and other fields due to their excellent flexibility and cost-effectiveness. Flexibly deployed UAVs can carry communication payloads and act as airborne base stations (BSs) to provide access services for ground users.
However, the open nature of wireless channels and the broadcast nature of wireless signals significantly increase the security risks of UAV communication. Malicious users can eavesdrop on the communication content of legitimate users by intercepting UAV signals, which poses serious security threats, especially in applications involving sensitive data transmission, such as surveillance, disaster response, and secure communication networks [1,2,3]. Therefore, ensuring the communication security of legitimate users and effectively combating potential eavesdroppers has become a major challenge in the field of UAV communication. To address this challenge, a method called "friendly UAV jamming" has been proposed, which sends dedicated interference signals to eavesdroppers to prevent them from recovering the eavesdropped content, thereby ensuring the information security of legitimate users [4,5].
Unfortunately, operating the aforementioned secure UAV communication mechanism requires solving complex mathematical problems that traditional optimization methods struggle to address. Researchers have therefore begun to apply artificial intelligence to these problems. Notable achievements include methods based on deep reinforcement learning (DRL) [6,7], including the twin delayed deep deterministic policy gradient (TD3) algorithm [8], proximal policy optimization (PPO) [9], and the soft actor-critic (SAC) algorithm [10]. Additionally, multi-agent DRL approaches are effective in providing distributed and online solutions. For instance, the authors of [11] introduced multi-agent DRL approaches to jointly optimize critical parameters, such as UAV trajectories, user association variables, and transmit power, in multi-UAV-assisted communication systems.
Ensuring secure communication involves strategies that protect data transmissions between UAVs and legitimate users from eavesdropping, which is especially challenging in open environments; many studies have addressed the UAV secure communication problem [6,12,13]. UAVs are often deployed to monitor or communicate within designated areas, providing consistent coverage over time for tasks such as surveillance, data collection, or relaying communication signals. Periodic coverage ensures that UAVs revisit specific areas regularly, which is essential in dynamic environments.
Therefore, this paper studies the periodic coverage-assisted secure communication of UAVs under coverage evaluation constraints. Unlike existing studies [14,15], this paper fully considers scenarios with both active and potential eavesdroppers. Specifically, we use multiple UAV BSs to provide services to legitimate users while also deploying a number of UAV jammers to send interference signals to eavesdroppers. Considering the limited carrying capacity of UAVs, as described in [16], this paper adopts a cyclic coverage evaluation scheme to improve the service capability of UAV clusters.
The main purpose of this paper is to propose a new strategy that maximizes the minimum secrecy rate among legitimate users. The resulting optimization objective is a complex mixed-integer nonlinear optimization problem: maximizing the minimum secrecy rate requires the simultaneous optimization of the user association variables, the UAV trajectories, and the transmit power, while the coverage constraints, the mobility of the UAVs, and the discrete nature of the user association variables make the problem mathematically intractable for conventional solvers. DRL algorithms are well suited to such problems. Therefore, this paper formulates the optimization objective as a sequential decision-making problem, which single-agent DRL methods (i.e., the SAC and TD3 algorithms) and the multi-agent SAC (MASAC) algorithm can solve effectively. The numerical results show that the MASAC algorithm is superior in terms of the cumulative discounted reward but at the cost of higher time complexity during training. In contrast, the SAC algorithm performs best in terms of stability and obtains a better cumulative discounted reward than the TD3 algorithm.
The rest of this paper is organized as follows. Section 2 describes the system model and problem formulation. Deep reinforcement learning-based solutions for the joint optimization are discussed in Section 3. The numerical results are provided in Section 4, and Section 5 concludes this paper.
Notations: $(\cdot)^T$ represents the transpose, and $|\cdot|$ and $\|\cdot\|$ refer to the modulus and the Euclidean norm, respectively. $[\cdot]^+$ means that the calculation result inside the square brackets is non-negative, i.e., $[x]^+ = \max(x, 0)$. $\mathbb{E}[\cdot]$ denotes the mathematical expectation. $\cup$ and $\gg$ represent the union and "much greater than" operations, respectively.
2. System Model and Problem Formulation
This paper mainly studies the secure communication model of jamming-enhanced UAVs.
Figure 1 shows the deployment of a jamming-enhanced secure UAV communication system. Assume that the number of single-antenna users is U and that these users are served by M UAV BSs. In addition, assume that the system is equipped with J UAV jammers, which protect the information security of legitimate users by sending noise-like interference signals to eavesdroppers. Without loss of generality, assume that the number of ground eavesdroppers is I. Furthermore, the sets of legitimate users, eavesdroppers, UAV BSs, and UAV jammers are denoted by $\mathcal{U} = \{1, \ldots, U\}$, $\mathcal{I} = \{1, \ldots, I\}$, $\mathcal{M} = \{1, \ldots, M\}$, and $\mathcal{J} = \{1, \ldots, J\}$, respectively. In addition to the I deterministic eavesdroppers, this paper also considers the possibility of potential eavesdroppers snooping on legitimate information. Assume that K potential eavesdroppers are randomly distributed within the target area, where $k \in \mathcal{K} = \{1, \ldots, K\}$, with $\mathbf{s}_k$ signifying the position of the k-th latent eavesdropper.
To facilitate both system trajectory planning and resource allocation, we preset the flight cycle of the UAV as T, which is divided into N time slots, each of duration $\delta_t = T/N$, where $n \in \mathcal{N} = \{1, \ldots, N\}$ indexes the slots. As long as each time slot is short enough and the UAV's flight speed is moderate, we can assume that the UAV's position remains almost unchanged within a slot. The flight altitude, or hovering altitude, of each UAV is denoted by H. In addition, $\mathbf{q}_m[n]$ and $\mathbf{v}_j[n]$ are used to characterize the horizontal positions of the m-th UAV BS and the j-th UAV jammer in the n-th time slot, respectively. For simplicity, the horizontal positions of all UAV BSs and all UAV jammers are collected in $\mathbf{Q}[n] = \{\mathbf{q}_m[n]\}_{m \in \mathcal{M}}$ and $\mathbf{V}[n] = \{\mathbf{v}_j[n]\}_{j \in \mathcal{J}}$, respectively. In addition, the position of the legitimate user u in time slot n is denoted by $\mathbf{w}_u[n]$ with zero altitude. We also use $\mathbf{e}_i$ to denote the location of the i-th deterministic eavesdropper.
Considering the limited carrying capacity of UAVs, this paper proposes a periodic coverage evaluation mechanism for the UAV jamming of potential eavesdroppers. This mechanism ensures that the system achieves strong anti-eavesdropping capabilities at the lowest energy cost. Specifically, we assume that at least one potential eavesdropping coverage state must be evaluated within each frame period. For ease of analysis, the total number of frames is set to $L = T/T_f$, where $T_f$ represents the frame length. Accordingly, the number of time slots contained in each coverage frame is $N_f = T_f/\delta_t$.
Once the UAV jammer j intends to access the potential eavesdropper k in time slot n, we set $\beta_{j,k}[n] = 1$; otherwise, $\beta_{j,k}[n] = 0$. Meanwhile, we limit access to a maximum of one potential eavesdropper per time slot. This separated coverage evaluation mechanism effectively reduces the computational overhead. Based on the above analysis, the association variables of the potential eavesdroppers must meet the following conditions:
$$\beta_{j,k}[n] \in \{0, 1\}, \qquad \sum_{k=1}^{K} \beta_{j,k}[n] \le 1, \quad \forall j \in \mathcal{J},\ n \in \mathcal{N}.$$
In the following, we employ $c_{j,k}[n]$ to denote the coverage state at the potential eavesdropper k in time slot n, as given by
$$c_{j,k}[n] = \begin{cases} 1, & \Gamma_{j,k}[n] \ge \Gamma_{th}, \\ 0, & \text{otherwise}, \end{cases}$$
where $\Gamma_{j,k}[n]$ denotes the signal-to-interference-plus-noise ratio (SINR) of the potential eavesdropper k. This equation indicates that as long as $\Gamma_{j,k}[n]$ is not lower than the system preset threshold $\Gamma_{th}$, the potential eavesdropper k is considered to stay inside the coverage range.
As in [17], we use the reference signal receiving power (denoted by RP) to determine the value of the SINR, as given by
$$\Gamma_{j,k}[n] = \frac{p^J_j[n]\, g_{j,k}[n]}{\sigma^2},$$
where $p^J_j[n]$ represents the transmission power of the j-th UAV jammer, $g_{j,k}[n]$ denotes the power gain of the jammer, and $\sigma^2$ stands for the variance of the additive white Gaussian noise (AWGN).
The channel power gain from the j-th jammer to the potential eavesdropper k is
$$g_{j,k}[n] = \frac{\rho_0}{\|\mathbf{v}_j[n] - \mathbf{s}_k\|^2 + H^2},$$
where $\rho_0$ represents the channel power at a reference distance of 1 m.
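To make the coverage evaluation concrete, the following is a minimal NumPy sketch, not from the paper, in which the free-space gain model and the symbol names (`rho0`, `p_jam`, `sinr_th`, and all numerical values) follow the reconstruction above and are assumptions for illustration only:

```python
import numpy as np

def channel_gain(uav_xy, node_xy, altitude, rho0):
    """Free-space LoS power gain: rho0 / (horizontal_distance^2 + H^2)."""
    d2 = np.sum((uav_xy - node_xy) ** 2) + altitude ** 2
    return rho0 / d2

def coverage_state(jammer_xy, eve_xy, altitude, p_jam, rho0, noise_var, sinr_th):
    """Return 1 if the jammer's received power at the eavesdropper
    clears the preset SINR threshold, else 0."""
    g = channel_gain(jammer_xy, eve_xy, altitude, rho0)
    sinr = p_jam * g / noise_var
    return int(sinr >= sinr_th)

# Hypothetical example: jammer at (100 m, 200 m), latent eavesdropper at (150 m, 260 m)
c = coverage_state(np.array([100.0, 200.0]), np.array([150.0, 260.0]),
                   altitude=150.0, p_jam=0.1, rho0=1e-5,
                   noise_var=1e-13, sinr_th=10.0)
```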
Let $\alpha_{m,u}[n]$ represent the association coefficient between the m-th UAV BS and the u-th legitimate user, where $\alpha_{m,u}[n] = 1$ implies that the legitimate user u in time slot n is served by the m-th UAV; otherwise, $\alpha_{m,u}[n] = 0$. Assume that each UAV has the ability to serve multiple targets simultaneously, while each target is exclusively served by only one UAV, i.e., $\sum_{m=1}^{M} \alpha_{m,u}[n] \le 1,\ \forall u \in \mathcal{U},\ n \in \mathcal{N}$.
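As an illustration of this exclusive-association constraint (our own sketch, not the paper's procedure), the snippet below assigns each user to its single nearest UAV BS, which satisfies $\sum_{m} \alpha_{m,u}[n] \le 1$ by construction; the nearest-UAV rule matches the association used later in the simulation setup:

```python
import numpy as np

def associate_users(bs_xy, user_xy):
    """Binary association matrix alpha[m, u]: each user is served by
    exactly one (the nearest) UAV BS, so each column sums to 1."""
    M, U = len(bs_xy), len(user_xy)
    alpha = np.zeros((M, U), dtype=int)
    for u in range(U):
        dists = np.linalg.norm(bs_xy - user_xy[u], axis=1)
        alpha[np.argmin(dists), u] = 1
    return alpha
```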
Considering the limited spectrum resources used by the system, UAVs adopt the principle of spectrum reuse to increase system capacity while ensuring that the interference they receive can be controlled at an acceptable level.
As described in [18], the data rate of the u-th legitimate user is given by
$$R_u[n] = \log_2\bigl(1 + \Gamma_u[n]\bigr), \qquad \Gamma_u[n] = \frac{\sum_{m \in \mathcal{M}} \alpha_{m,u}[n]\, p_m[n]\, g_{m,u}[n]}{\sigma_u^2},$$
where $\Gamma_u[n]$ denotes the user's SINR, $p_m[n]$ represents the transmit power of the UAV m, and $\sigma_u^2$ stands for the noise variance at the receiver. The power gain of the u-th legitimate user is thus given by
$$g_{m,u}[n] = \frac{\rho_0}{\|\mathbf{q}_m[n] - \mathbf{w}_u[n]\|^2 + H^2}.$$
Similarly, the data rate at which the eavesdropper i intercepts the signal intended for the user u is expressed as
$$R_{u,i}[n] = \log_2\Biggl(1 + \frac{\sum_{m \in \mathcal{M}} \alpha_{m,u}[n]\, p_m[n]\, g_{m,i}[n]}{\sum_{j \in \mathcal{J}} p^J_j[n]\, g_{j,i}[n] + \sigma_i^2}\Biggr),$$
where $i \in \mathcal{I} \cup \mathcal{K}$, $p^J_j[n]$ represents the transmit power of the j-th UAV jammer, and the power gains $g_{m,i}[n]$ and $g_{j,i}[n]$ are defined analogously to $g_{m,u}[n]$ and $g_{j,k}[n]$ with the eavesdropper's position in place of the receiver's.
Following (9) and (12), the worst achievable average secrecy rate of the u-th legitimate user over a T-duration in the presence of eavesdroppers can be given by
$$R^{\mathrm{sec}}_u = \frac{1}{N} \sum_{n=1}^{N} \Bigl[ R_u[n] - \max_{i \in \mathcal{I} \cup \mathcal{K}} R_{u,i}[n] \Bigr]^+,$$
where $[x]^+ = \max(x, 0)$.
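A short sketch of this worst-case average secrecy rate under the reconstruction above: per slot, the strongest eavesdropping rate is subtracted from the user's rate, clipped at zero, and averaged over the N slots. The array shapes are our assumption:

```python
import numpy as np

def avg_secrecy_rate(rate_user, rate_eves):
    """rate_user: shape (N,), legitimate rate per slot.
    rate_eves: shape (N, I+K), eavesdropping rates per slot.
    Returns (1/N) * sum_n [R_u[n] - max_i R_{u,i}[n]]^+ ."""
    worst_gap = rate_user - rate_eves.max(axis=1)
    return np.maximum(worst_gap, 0.0).mean()
```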
In the following, we focus on maximizing the minimum secrecy rate by optimizing the system parameters, including the trajectory planning, user association variables, and power allocation of the UAVs. The optimization goal can thus be formulated as
$$\max_{\boldsymbol{\alpha},\, \mathbf{P},\, \mathbf{P}^J,\, \mathbf{Q},\, \mathbf{V}}\ \min_{u \in \mathcal{U}}\ R^{\mathrm{sec}}_u, \tag{15}$$
where $\boldsymbol{\alpha}$ denotes the user association variables; $\mathbf{P}$ and $\mathbf{P}^J$ represent the transmission power of the UAV BSs and UAV jammers, respectively; and $\mathbf{Q}$ and $\mathbf{V}$ are the coordinates of the UAVs. For a given coverage-evaluating frequency, the coverage constraints for the potential eavesdroppers are shown in (15b), and the power constraints are shown in (15c)–(15e).
It is evident that the optimization objective (15) involves a mixed-integer nonlinear non-convex problem, making it difficult to solve using traditional iterative optimization methods. Therefore, we transform the above optimization problem into a sequential decision-making problem and adopt a DRL-based approach to achieve the joint optimization of the user association variables, power allocation, and trajectory planning in jamming-enhanced secure UAV communication systems.
3. Deep Reinforcement Learning-Based Solutions for Joint Optimization
The nonlinearity and non-convexity of the optimization objective (15) pose significant mathematical challenges for solving the aforementioned joint optimization problem. Considering that trajectory planning, user association, and power allocation are all sequential decision problems, the above optimization process can be recast as a Markov decision process (MDP). This section investigates single-agent and multi-agent DRL solutions.
3.1. The Single-Agent DRL Solution
Let $\langle S, A, R \rangle$ represent the tuple of the MDP, where S denotes the state space, A corresponds to the action space, and R signifies the reward function. The long-term cumulative discounted reward can be expressed as $G = \sum_{t} \gamma^{t} r_t$, where $\gamma \in (0, 1]$ represents the discount factor. The constituent elements are defined as follows:
State space S: $s_t$ represents the state during time slot t, encompassing the coordinates of the UAV BSs, the UAV jammers, and the legitimate users:
$$s_t = \{\mathbf{L}_t,\ \mathbf{W}_t,\ \mathbf{E},\ \mathbf{S}\},$$
where $\mathbf{L}_t$ denotes the coordinates of all UAVs in time slot t, which is composed of the coordinates of the UAV BSs $\mathbf{Q}_t$ and the coordinates of the UAV jammers $\mathbf{V}_t$. $\mathbf{W}_t$ represents the coordinates of the legitimate users. $\mathbf{E}$ and $\mathbf{S}$ are the coordinates of the active and potential eavesdroppers, respectively.
Action space A: $a_t$ represents the action taken during time slot t, encompassing the user association variables, the allocation of power to both legitimate users and eavesdroppers, and the variations in the UAV locations:
$$a_t = \{\boldsymbol{\alpha}_t,\ \mathbf{P}_t,\ \mathbf{P}^J_t,\ \Delta\mathbf{L}_t\},$$
where $\mathbf{P}_t$ and $\mathbf{P}^J_t$ denote the power allocated by the communication UAVs and the jamming UAVs, respectively, within time slot t. The flight displacement of the UAVs is represented by $\Delta\mathbf{L}_t$, which is composed of $\Delta\mathbf{Q}_t$ and $\Delta\mathbf{V}_t$.
Reward function R: The reward function $r_t$, $t \in \mathcal{N}$, comprises two components, i.e., the secrecy rate and the penalty of the coverage evaluation. The coverage evaluation of the UAV jammers is denoted by $C_t$, where $C_t = \sum_{j \in \mathcal{J}} \sum_{k \in \mathcal{K}} \beta_{j,k}[t]\, c_{j,k}[t]$. Therefore, the reward function can be expressed as
$$r_t = \min_{u \in \mathcal{U}} \Bigl[ R_u[t] - \max_{i \in \mathcal{I} \cup \mathcal{K}} R_{u,i}[t] \Bigr]^+ - \lambda\, (1 - C_t),$$
where $\lambda$ denotes the penalty factor.
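The following is a minimal sketch of this per-slot reward, assuming the penalty form reconstructed above: the minimum per-user secrecy rate minus a penalty whenever the scheduled coverage evaluation fails. The default penalty factor `lam` is a hypothetical value:

```python
def reward(secrecy_rates, coverage_ok, lam=1.0):
    """secrecy_rates: per-user secrecy rates in the current slot.
    coverage_ok: 1 if the scheduled potential eavesdropper is covered, else 0.
    lam: penalty factor (hypothetical value)."""
    return min(secrecy_rates) - lam * (1 - coverage_ok)
```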
Single-agent DRL algorithms, such as the SAC and TD3 algorithms, can be used to solve this problem. Take the SAC algorithm as an example. The SAC algorithm aims to maximize the long-term cumulative discounted reward while maximizing the policy entropy, as given by
$$J(\pi_{\phi}) = \mathbb{E}\Bigl[\sum_{t} \gamma^{t} \bigl( r_t + \alpha\, \mathcal{H}\bigl(\pi_{\phi}(\cdot \mid s_t)\bigr) \bigr)\Bigr].$$
Here, $\alpha$ and $\pi_{\phi}$ represent the temperature parameter and the actor network with parameter vector $\phi$, respectively.

In the SAC algorithm framework, there are two main critic networks, i.e., $Q_{\theta_1}$ and $Q_{\theta_2}$, with network parameter vectors $\theta_1$ and $\theta_2$. There are also two target critic networks, i.e., $Q_{\bar{\theta}_1}$ and $Q_{\bar{\theta}_2}$, with parameter vectors $\bar{\theta}_1$ and $\bar{\theta}_2$. The purpose of the critic networks is to fit the soft Q-function of the agent. Furthermore, the stochastic actor network generates actions based on the state of the agent.
Figure 2 shows a diagram of the single-agent SAC algorithm for the jamming-enhanced secure UAV communication network. In each time slot, the agent interacts with the environment to generate a new experience tuple $(s_t, a_t, r_t, s_{t+1})$, which is then stored in the replay memory buffer $\mathcal{D}$. As time passes, the number of tuples in the replay buffer gradually increases until a sufficient number of samples is reached. To optimize the parameter vectors of the critic and actor networks, the system randomly samples a minibatch $\mathcal{B}$ of B tuples from the replay buffer, that is, $\mathcal{B} \sim \mathcal{D}$.
The critic networks can be updated by minimizing
$$L(\theta_i) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim \mathcal{B}} \Bigl[ \bigl( Q_{\theta_i}(s_t, a_t) - y_t \bigr)^2 \Bigr],$$
where $i \in \{1, 2\}$, $\tilde{a}_{t+1} \sim \pi_{\phi}(\cdot \mid s_{t+1})$, and $y_t$ denotes the target value of the main critic network in time slot t, which is given by
$$y_t = r_t + \gamma \Bigl( \min_{i = 1, 2} Q_{\bar{\theta}_i}(s_{t+1}, \tilde{a}_{t+1}) - \alpha \log \pi_{\phi}(\tilde{a}_{t+1} \mid s_{t+1}) \Bigr).$$
The actor network can be updated by minimizing
$$L(\phi) = \mathbb{E}_{s_t \sim \mathcal{B},\ \tilde{a}_t \sim \pi_{\phi}} \Bigl[ \alpha \log \pi_{\phi}(\tilde{a}_t \mid s_t) - \min_{i = 1, 2} Q_{\theta_i}(s_t, \tilde{a}_t) \Bigr].$$
In addition, the temperature parameter $\alpha$ can be updated according to [10], and the target critic networks follow the soft update rule $\bar{\theta}_i \leftarrow \tau \theta_i + (1 - \tau) \bar{\theta}_i$, where $\tau \ll 1$ is the soft update parameter.
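The equations above translate into a compact PyTorch training step: the clipped double-Q soft Bellman target, the entropy-regularized actor loss, and the soft target update. This is a sketch under our own assumptions about the helper interfaces (in particular, `actor(s)` is assumed to return an action and its log-probability), not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def sac_update(actor, critics, target_critics, critic_opts, actor_opt,
               batch, log_alpha, gamma=0.99, tau=0.005):
    """One SAC training step on a minibatch (s, a, r, s2).
    `critics` and `target_critics` are pairs of Q-networks returning (batch, 1)."""
    s, a, r, s2 = batch
    alpha = log_alpha.exp().detach()

    # Target value y_t: clipped double-Q soft Bellman backup.
    with torch.no_grad():
        a2, logp2 = actor(s2)
        q_min = torch.min(target_critics[0](s2, a2), target_critics[1](s2, a2))
        y = r + gamma * (q_min - alpha * logp2)

    # Critic update: regress both main critics onto y.
    for critic, opt in zip(critics, critic_opts):
        loss_q = F.mse_loss(critic(s, a), y)
        opt.zero_grad()
        loss_q.backward()
        opt.step()

    # Actor update: minimize alpha * log pi - min Q (entropy-regularized).
    a_new, logp = actor(s)
    q_new = torch.min(critics[0](s, a_new), critics[1](s, a_new))
    actor_loss = (alpha * logp - q_new).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft target update: theta_bar <- tau * theta + (1 - tau) * theta_bar.
    with torch.no_grad():
        for critic, target in zip(critics, target_critics):
            for p, p_bar in zip(critic.parameters(), target.parameters()):
                p_bar.mul_(1 - tau).add_(tau * p)
```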
Note that the TD3 algorithm utilized in this paper is also a typical single-agent DRL solution and is similar to the SAC algorithm, with an off-policy actor-critic mechanism. The actor and critic networks are depicted in Figure 3, and their parameter updates follow existing studies [8,19]. For simplicity, the details of the TD3 algorithm are not elaborated further.
3.2. The Multi-Agent DRL Solution
In the multi-agent DRL solution, each UAV BS represents an agent. We use $\langle O, A, R \rangle$ to denote the tuple of the MDP, where O represents the global observation of all agents. The main elements are explained as follows:
Observation space O: $o^m_t$ represents the state of agent m during time slot t, $m \in \mathcal{M}$. The local observation space $o^m_t$ mainly consists of the coordinates of the UAVs, the coordinates of the legitimate users, and those of the active and potential eavesdroppers:
$$o^m_t = \{\mathbf{L}_t,\ \mathbf{W}_t,\ \mathbf{E},\ \mathbf{S}\},$$
where $O_t = \{o^1_t, \ldots, o^M_t\}$ and $\mathbf{L}_t$ is composed of $\mathbf{Q}_t$ and $\mathbf{V}_t$.
Action space A: $a^m_t$ represents the action of agent m during time slot t, and it is composed of the user association variables, the allocation of power to both legitimate users and eavesdroppers, and the variations in the UAV locations:
$$a^m_t = \{\boldsymbol{\alpha}^m_t,\ \mathbf{P}^m_t,\ \mathbf{P}^J_t,\ \Delta\mathbf{L}^m_t\}.$$
Reward function R: The reward function $r^m_t$, $t \in \mathcal{N}$, for the agent m comprises both the secrecy rate and the penalty of the coverage evaluation. The coverage evaluation of the UAV jammers is denoted by $C_t$. The reward function can be written as
$$r^m_t = \min_{u \in \mathcal{U}} \Bigl[ R_u[t] - \max_{i \in \mathcal{I} \cup \mathcal{K}} R_{u,i}[t] \Bigr]^+ - \lambda\, (1 - C_t).$$
In this paper, we employ the MASAC algorithm to solve the problem, where each agent corresponds to a UAV and comprises two main critic networks (i.e., $Q_{\theta^m_1}$ and $Q_{\theta^m_2}$), two target critic networks (i.e., $Q_{\bar{\theta}^m_1}$ and $Q_{\bar{\theta}^m_2}$), and one actor network $\pi_{\phi^m}$.
In the training process, the agent m is designed to maximize
$$J(\pi_{\phi^m}) = \mathbb{E}\Bigl[\sum_{t} \gamma^{t} \bigl( r^m_t + \alpha_m\, \mathcal{H}\bigl(\pi_{\phi^m}(\cdot \mid o^m_t)\bigr) \bigr)\Bigr].$$
Figure 4 shows a diagram of the MASAC algorithm for this jamming-enhanced secure UAV communication network. After each interaction with the environment, an experience tuple $(O_t, a^1_t, \ldots, a^M_t, r^1_t, \ldots, r^M_t, O_{t+1})$ is generated and stored in the replay buffer $\mathcal{D}$. To update the neural network parameters, a minibatch of B experience tuples is randomly sampled from $\mathcal{D}$.
The MASAC algorithm follows the centralized training and decentralized execution mechanism. The critic networks can be updated by minimizing the soft Bellman residuals:
$$L(\theta^m_i) = \mathbb{E}\Bigl[ \bigl( Q_{\theta^m_i}(O_t, a^1_t, \ldots, a^M_t) - y^m_t \bigr)^2 \Bigr],$$
where $i \in \{1, 2\}$ and $y^m_t$ is the target value of the main critic network in time slot t, as given by
$$y^m_t = r^m_t + \gamma \Bigl( \min_{i = 1, 2} Q_{\bar{\theta}^m_i}(O_{t+1}, \tilde{a}^1_{t+1}, \ldots, \tilde{a}^M_{t+1}) - \alpha_m \log \pi_{\phi^m}(\tilde{a}^m_{t+1} \mid o^m_{t+1}) \Bigr),$$
where $\tilde{a}^m_{t+1} \sim \pi_{\phi^m}(\cdot \mid o^m_{t+1})$ and $\alpha_m$ is the temperature parameter of agent m.
The actor network can be updated according to
$$L(\phi^m) = \mathbb{E}\Bigl[ \alpha_m \log \pi_{\phi^m}(\tilde{a}^m_t \mid o^m_t) - \min_{i = 1, 2} Q_{\theta^m_i}(O_t, \tilde{a}^1_t, \ldots, \tilde{a}^M_t) \Bigr].$$
In addition, the target critic networks of each agent follow the soft update rule $\bar{\theta}^m_i \leftarrow \tau \theta^m_i + (1 - \tau) \bar{\theta}^m_i$, where $\tau \ll 1$ is the soft update parameter, and the temperature parameter $\alpha_m$ can be updated according to [20].
The pseudocode of the MASAC algorithm is presented in Algorithm 1.

Algorithm 1 MASAC algorithm for jamming-enhanced secure UAV communications

1: For each $m \in \mathcal{M}$, initialize the main network parameters $\theta^m_1$, $\theta^m_2$, $\phi^m$; set the target network parameters $\bar{\theta}^m_1 \leftarrow \theta^m_1$, $\bar{\theta}^m_2 \leftarrow \theta^m_2$.
2: for each episode do
3:   Initialize the global observation $O_0$
4:   for $t = 0, 1, \ldots, N - 1$ do
5:     for $m \in \mathcal{M}$ do
6:       Select action $a^m_t \sim \pi_{\phi^m}(\cdot \mid o^m_t)$
7:     end for
8:     Execute the actions $\{a^m_t\}_{m \in \mathcal{M}}$
9:     Observe the rewards $\{r^m_t\}$ and the next global observation $O_{t+1}$
10:    Store the tuple $(O_t, \{a^m_t\}, \{r^m_t\}, O_{t+1})$ in $\mathcal{D}$
11:    $O_t \leftarrow O_{t+1}$
12:    for $m \in \mathcal{M}$ do
13:      Sample a minibatch B from $\mathcal{D}$
14:      Update the main critic network parameters by minimizing the soft Bellman residual
15:      Update the actor network parameter by minimizing the policy loss
16:      Update the target critic networks following the soft update rule
17:      Update the temperature parameter according to [20]
18:    end for
19:   end for
20: end for
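The centralized-training, decentralized-execution structure of Algorithm 1 can be summarized in a compact sketch: each agent's critic scores the global observation concatenated with the actions of all agents, while each actor conditions only on its own local observation. The class name, dimension arguments, and the assumption that an actor returns an (action, log-probability) pair are ours, not the paper's:

```python
import torch
import torch.nn as nn

class CentralizedCritic(nn.Module):
    """Q(O_t, a^1..a^M): input is the global observation joined with
    the actions of all agents (used only during centralized training)."""
    def __init__(self, global_obs_dim, joint_act_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(global_obs_dim + joint_act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, global_obs, joint_actions):
        return self.net(torch.cat([global_obs, joint_actions], dim=-1))

def act(actors, local_obs):
    """Decentralized execution: agent m acts from its local observation only.
    Each actor is assumed to return (action, log_prob)."""
    return [actor(o)[0] for actor, o in zip(actors, local_obs)]
```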
3.3. Computational Complexity Analysis
In this section, we investigate the optimization of the secrecy rate based on the single-agent and multi-agent DRL methods and analyze their complexity. The complexity of these algorithms is determined by their neural network architecture. In our DRL solutions, both the critic and actor networks are four-layer fully connected networks, consisting of one input layer, two hidden layers, and one output layer. Let $n^a_1$ and $n^a_2$ represent the numbers of nodes in hidden layers 1 and 2 of the actor network, respectively. Meanwhile, let $n^c_1$ and $n^c_2$ represent the numbers of nodes in hidden layers 1 and 2 of the critic network.
First, we analyze the complexity of the single-agent SAC algorithm. The state space and the action space have dimensions of $|S|$ and $|A|$, respectively, corresponding to the numbers of input nodes $n^a_0$ and output nodes $n^a_3$ of the actor network, i.e., $n^a_0 = |S|$ and $n^a_3 = |A|$. According to [21], the time complexity at each training step of each actor network is given by
$$\mathcal{O}\bigl( |S|\, n^a_1 + n^a_1 n^a_2 + n^a_2 |A| \bigr).$$
Similarly, the time complexity of each critic network at each step can be calculated as
$$\mathcal{O}\bigl( (|S| + |A|)\, n^c_1 + n^c_1 n^c_2 + n^c_2 \bigr).$$
Considering all the main and target networks of the SAC algorithm at each step, the time complexity of the training process is $\mathcal{O}\bigl( |S| n^a_1 + n^a_1 n^a_2 + n^a_2 |A| + 4\bigl( (|S| + |A|) n^c_1 + n^c_1 n^c_2 + n^c_2 \bigr) \bigr)$.
During the testing process, only the actor network is used to determine the action interacting with the environment, and its time complexity depends only on the matrix multiplication complexity of its layers, which can be expressed as $\mathcal{O}\bigl( |S| n^a_1 + n^a_1 n^a_2 + n^a_2 |A| \bigr)$.
Note that there are six neural networks in the TD3 algorithm, including two main critic networks, two target critic networks, one main actor network, and one target actor network. The time complexity at each step of the training process can be described as $\mathcal{O}\bigl( 2( |S| n^a_1 + n^a_1 n^a_2 + n^a_2 |A| ) + 4( (|S| + |A|) n^c_1 + n^c_1 n^c_2 + n^c_2 ) \bigr)$. During the testing process, as only the optimal actor network is applied to decide the actions, the time complexity of each step is approximately equivalent to that of the SAC algorithm.
For the MASAC algorithm, the local observation space of the m-th agent has a dimension of $|O_m|$ and its action space has a dimension of $|A_m|$. Therefore, the time complexity for the m-th agent at each training step of each actor network is given by $\mathcal{O}\bigl( |O_m| n^a_1 + n^a_1 n^a_2 + n^a_2 |A_m| \bigr)$. Similarly, since the centralized critic takes the global observation and the actions of all agents as its input, the time complexity of each critic network at each step is $\mathcal{O}\bigl( (|O| + M |A_m|) n^c_1 + n^c_1 n^c_2 + n^c_2 \bigr)$, where $|O|$ denotes the dimension of the global observation.
Therefore, the time complexity for all agents in each training step is
$$\mathcal{O}\Bigl( M \bigl( |O_m| n^a_1 + n^a_1 n^a_2 + n^a_2 |A_m| \bigr) + 4M \bigl( (|O| + M |A_m|) n^c_1 + n^c_1 n^c_2 + n^c_2 \bigr) \Bigr).$$
In the testing process, the time complexity of all agents can be calculated as
$$\mathcal{O}\Bigl( M \bigl( |O_m| n^a_1 + n^a_1 n^a_2 + n^a_2 |A_m| \bigr) \Bigr).$$
In summary, the number of input nodes of an actor network is usually smaller than that of a critic network, so the testing process has comparatively low computational complexity, whereas the training process has much higher computational complexity.
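For concreteness, a small helper (our own, not from the paper) counts the per-forward-pass multiplications of a four-layer fully connected network, matching the $\mathcal{O}(n_0 n_1 + n_1 n_2 + n_2 n_3)$ terms used above; the state and action dimensions in the example are hypothetical:

```python
def fc_mult_count(n_in, n_h1, n_h2, n_out):
    """Multiplications in one forward pass of a 4-layer MLP."""
    return n_in * n_h1 + n_h1 * n_h2 + n_h2 * n_out

# Example with the simulation's 128-neuron hidden layers and
# hypothetical dimensions |S| = 64, |A| = 16:
actor_cost  = fc_mult_count(64, 128, 128, 16)        # actor: S -> A
critic_cost = fc_mult_count(64 + 16, 128, 128, 1)    # critic: (S, A) -> Q
```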
Table 1 shows the computational complexity of the different algorithms in the training and testing processes. Moreover, since the training complexity of the MASAC algorithm grows with the number of agents M, the MASAC algorithm has higher time complexity than the SAC algorithm in the training process, that is, $\mathcal{O}(T_{\mathrm{MASAC}}) \gg \mathcal{O}(T_{\mathrm{SAC}})$.
4. Numerical Results
The simulation environment was a square area with a side length of 1 km. Legitimate users were randomly distributed throughout the entire area, and the positions of the UAVs in the target area were randomly initialized at the beginning of the simulation. We considered a periodic coverage-assisted area comprising three UAV BSs, two UAV jammers, 20 legitimate users, two ground eavesdroppers, and five latent eavesdroppers. The flight period was set to $T$ s, the coverage evaluation frame length to $T_f$ s, and the predetermined flight altitude of the drones to 150 m, and each UAV was associated with its nearest legitimate users. Moreover, the time slot length was set to $\delta_t$ s, the threshold for the coverage evaluation to $\Gamma_{th}$ dB, and the reference channel power to $\rho_0$ dB [22].
The experiments were simulated using Python v3.7 with the PyTorch deep learning framework. Both the critic and the policy networks were implemented as four-layer fully connected networks, with 128 neurons in each hidden layer. Each episode comprised 50 time slots. Furthermore, the number of sampled experience tuples was set to 256, and appropriate values were chosen for the discount factor $\gamma$ and the learning rates of the critic and actor networks.
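The stated architecture translates directly into PyTorch; the following builder (our sketch, with the input and output sizes left as arguments since they depend on the scenario dimensions) reproduces the four-layer fully connected design with 128 neurons per hidden layer used for both the critic and policy networks:

```python
import torch.nn as nn

def build_mlp(in_dim, out_dim, hidden=128):
    """Four-layer fully connected network: input layer, two 128-neuron
    hidden layers with ReLU activations, and a linear output layer."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim))
```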
Figure 5 shows the cumulative discounted return of the DRL algorithms versus the training episodes. It can be seen that the MASAC algorithm performed best in terms of convergence speed and cumulative discounted return. The SAC algorithm demonstrated the best stability during the training phase and a better cumulative discounted return than the TD3 algorithm. This is because the multiple agents in the MASAC algorithm have a better ability to explore and cooperate. On the one hand, multiple agents explore different parts of the environment simultaneously, which helps them learn better policies than a single agent. On the other hand, multiple agents coordinate their actions to achieve shared or individual goals more efficiently, and each agent can specialize in a role or a subset of tasks, leading to better performance. However, the MASAC algorithm had higher time complexity than the other algorithms, that is, $\mathcal{O}(T_{\mathrm{MASAC}}) \gg \mathcal{O}(T_{\mathrm{SAC}})$. The MASAC algorithm was quite time-consuming, which can be attributed to its centralized training mechanism.
To verify the effectiveness of the DRL-based solutions, we saved the neural network parameters after each algorithm’s training was completed. Then, only the actor network was utilized to determine the action interacting with the environment and further calculate the corresponding secrecy rate in the testing process.
Figure 6 shows the normalized average secrecy rate versus the number of time slots. It can be observed that the secrecy rate of each algorithm increased as the number of time slots increased, with the MASAC algorithm achieving a clearly higher secrecy rate than both the SAC and TD3 algorithms. The simulation results confirm the validity of the DRL algorithms in finding effective user association variables, UAV trajectories, and power allocation policies for the considered scenarios.
Finally, we studied the relationship between the normalized average secrecy rate and the number of eavesdroppers, as shown in Figure 7 and Figure 8. These experiments included three UAV BSs, two UAV jammers, and 20 legitimate users. We saved the parameters of the respective actor networks after each algorithm's training was completed, loaded them to decide on the variables in the testing process, and finally calculated the corresponding secrecy rate. It can be observed in Figure 7 that the secrecy rate tended to decrease as the number of eavesdroppers increased, and the MASAC algorithm achieved the best secrecy rate among the compared algorithms. In Figure 8, it can be seen that the number of latent eavesdroppers had less influence on the average secrecy rate, from which it can be deduced that the deployed UAV jammers effectively protected the secrecy rate in this scenario. Moreover, the MASAC algorithm achieved the best secrecy rate across different numbers of latent eavesdroppers.