
Article

Deep Reinforcement Learning-Driven Jamming-Enhanced Secure Unmanned Aerial Vehicle Communications

1 School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China
2 National School of Elite Engineering, University of Science and Technology Beijing, Beijing 100081, China
3 School of Cyberspace Science and Technology, Beijing Institute of Technology, Beijing 100081, China
* Author to whom correspondence should be addressed.
Sensors 2024, 24(22), 7328; https://doi.org/10.3390/s24227328
Submission received: 10 October 2024 / Revised: 11 November 2024 / Accepted: 13 November 2024 / Published: 16 November 2024
(This article belongs to the Special Issue Novel Signal Processing Techniques for Wireless Communications)

Abstract

Despite their flexibility, unmanned aerial vehicle (UAV) communications are susceptible to eavesdropping due to the open nature of wireless channels and the broadcasting nature of wireless signals. This paper studies secure UAV communications and proposes a jamming-enhanced scheme that maximizes the minimum secrecy rate of the system. To this end, the system not only deploys multiple UAV base stations (BSs) to provide services to legitimate users but also assigns dedicated UAV jammers to send interference signals to active or potential eavesdroppers to degrade their eavesdropping capability. Based on this configuration, we formulate the joint optimization of the user association variables, UAV trajectories, and output power as a sequential decision-making problem and apply the single-agent soft actor-critic (SAC) algorithm and the twin delayed deep deterministic policy gradient (TD3) algorithm to optimize these core parameters jointly. In addition, for specific scenarios, we also use the multi-agent soft actor-critic (MASAC) algorithm to solve the joint optimization problem mentioned above. The numerical results show that the normalized average secrecy rate of the MASAC algorithm increased by more than 6.6% and 14.2% compared with that of the SAC and TD3 algorithms, respectively.

1. Introduction

Unmanned aerial vehicles (UAVs) have been widely adopted in emergency rescue, instant messaging, and other fields due to their excellent flexibility and cost-effectiveness. Flexibly deployed UAVs can carry communication payloads and act as airborne base stations (BSs) to provide access services for ground users.
However, the open nature of wireless channels and the broadcasting nature of wireless signals significantly increase the security risks of UAV communication. Malicious users can eavesdrop on the communication content of legitimate users by intercepting and stealing UAV signals, which results in significant security threats, especially in applications involving sensitive data transmissions, such as surveillance, disaster response, and secure communication networks [1,2,3]. Therefore, ensuring the communication security of legitimate users and effectively combating potential eavesdroppers has become a major challenge in the field of UAV communication. In order to address the above challenges, a new method called “friendly UAV jamming” has been proposed, which sends special interference signals to eavesdroppers to prevent them from obtaining eavesdropped content, thereby ensuring the information security of legitimate users [4,5].
Unfortunately, operating the aforementioned UAV secure communication mechanism requires solving complex mathematical problems that traditional optimization methods struggle to handle. Therefore, researchers have begun to apply artificial intelligence-based methods to these problems. Notable achievements include deep reinforcement learning (DRL)-based methods [6,7], including the twin delayed deep deterministic policy gradient (TD3) algorithm [8], proximal policy optimization (PPO) [9], and the soft actor-critic (SAC) algorithm [10]. Additionally, multi-agent DRL approaches are also effective in providing distributed and online solutions. For instance, the authors of [11] introduced multi-agent DRL approaches to jointly optimize critical parameters, such as UAV trajectories, user association variables, and transmit power, in multi-UAV-assisted communication systems.
Ensuring secure communication involves strategies to protect data transmissions between UAVs and legitimate users from eavesdropping, which becomes especially challenging in open environments. Many studies have addressed this UAV secure communication problem [6,12,13]. UAVs are often deployed to monitor or communicate within designated areas, providing consistent coverage over time for tasks such as surveillance, data collection, or relaying communication signals. Periodic coverage ensures that UAVs revisit specific areas regularly, which is essential in dynamic environments.
Therefore, this paper studies the periodic coverage-assisted secure communication of UAVs with coverage evaluation constraints. Unlike existing studies [14,15], this paper fully considers scenarios with active and potential eavesdroppers. Specifically, we use multiple UAV BSs to provide services to legitimate users while also deploying a certain number of UAV jammers to send interference signals to eavesdroppers. Considering the limited carrying capacity of UAVs, as described in [16], this paper adopts a cyclic coverage estimation scheme to improve the service capability of UAV clusters.
The main purpose of this paper is to propose a new strategy that helps legitimate users maximize their minimum secrecy rate. The resulting objective is a complex mixed-integer nonlinear optimization problem: maximizing the minimum secrecy rate requires the simultaneous optimization of the user association variables, UAV trajectories, and output power, and the coverage constraints, the mobility of the UAVs, and the discrete nature of the user association variables make the problem intractable for conventional solvers. DRL algorithms are better suited to such problems. Therefore, this paper formulates the optimization objective as a sequential decision-making problem, which the single-agent DRL methods (i.e., the SAC and TD3 algorithms) and the multi-agent SAC (MASAC) algorithm can solve effectively. The numerical results show that the MASAC algorithm is superior in accumulating discounted rewards but at the cost of higher time complexity during the training process. In contrast, the SAC algorithm performs best in terms of stability and can obtain better cumulative discounted rewards than the TD3 algorithm.
The rest of this paper is organized as follows. Section 2 describes the system model and problem formulation. Deep reinforcement learning-based solutions for joint optimization are discussed in Section 3. The numerical results are provided in Section 4, and Section 5 concludes this paper.
Notations: $(\cdot)^T$ represents the transpose, and $|\cdot|$ and $\|\cdot\|$ refer to the modulus and the Euclidean norm, respectively. $[\cdot]^+$ means that the calculation result inside the square brackets is non-negative. $\mathbb{E}(\cdot)$ denotes the mathematical expectation. $\cup$ and $\gg$ represent the union and "much greater than" operations, respectively.

2. System Model and Problem Formulation

This paper mainly studies the secure communication model of jamming-enhanced UAVs. Figure 1 shows the deployment of a jamming-enhanced secure UAV communication system. Assume that the number of single-antenna users is U, and these users are served by M UAV BSs. In addition, assume that the system is equipped with J UAV jammers, which protect legitimate users' information security by sending noise-like interference signals to eavesdroppers. Without loss of generality, assume that the number of ground eavesdroppers is I. Furthermore, the sets of legitimate users, eavesdroppers, UAV BSs, and UAV jammers are denoted by $\mathcal{U} = \{1, 2, \ldots, U\}$, $\mathcal{I} = \{1, 2, \ldots, I\}$, $\mathcal{M} = \{1, 2, \ldots, M\}$, and $\mathcal{J} = \{1, 2, \ldots, J\}$, respectively. In addition to the I deterministic eavesdroppers, this paper also considers the possibility of potential eavesdroppers snooping on legitimate information.
Assume that K potential eavesdroppers are randomly distributed within the target area, where $\mathbf{s}_k = [c_k, d_k]^T$, with $k \in \mathcal{K} = \{1, \ldots, K\}$, signifies the position of the k-th latent eavesdropper.
To facilitate both system trajectory planning and resource allocation, we preset the flight cycle of the UAVs as T, which is divided into N time slots, each of duration $\delta_T = T/N$, with $n \in \{1, \ldots, N\}$. As long as each time slot is short enough and the UAV's flight speed is moderate, we assume that its position remains almost unchanged within a slot. The flight altitude, or hovering altitude, of each UAV is denoted by H. In addition, $\Theta_m[n] = [X_m[n], Y_m[n]]^T$ and $\theta_j[n] = [x_j[n], y_j[n]]^T$ characterize the horizontal positions of the m-th UAV BS and the j-th UAV jammer in the n-th time slot, respectively. For simplicity, the horizontal positions of all UAV BSs and all UAV jammers are collected as $\Theta[n] = [X_1[n], \ldots, X_M[n]; Y_1[n], \ldots, Y_M[n]]$ and $\theta[n] = [x_1[n], \ldots, x_J[n]; y_1[n], \ldots, y_J[n]]$, respectively. In addition, the position of the legitimate user u in time slot n is denoted by $A_u[n] = [a_u[n], b_u[n]]^T$ with zero altitude. We also use $B_i[n] = [a_i[n], b_i[n]]^T$ to denote the location of the i-th deterministic eavesdropper.
Considering the limited carrying capacity of UAVs, this paper proposes a periodic coverage evaluation mechanism for UAV jamming of potential eavesdroppers. This mechanism ensures that the system achieves strong anti-eavesdropping capabilities at the lowest energy cost. Specifically, we assume that at least one potential eavesdropping coverage state needs to be evaluated within each frame period. For ease of analysis, the total number of frames is set to $L = T/T_L$, where $T_L$ represents the frame length. Accordingly, the number of time slots contained in each coverage frame is $N_L = N/L$.
Once the UAV jammer j intends to cover the potential eavesdropper $\mathbf{s}_k$ in time slot n, we set $c_{j,k}[n] = 1$; otherwise, $c_{j,k}[n] = 0$. Meanwhile, we limit each jammer to at most one potential eavesdropper per time slot. This separated coverage evaluation mechanism effectively reduces the computational overhead. Based on the above analysis, the association variables of the potential eavesdroppers must satisfy the following conditions:
$$\sum_{n=(l-1)N_L+1}^{l N_L} c_{j,k}[n] = 1, \quad \forall j, k; \tag{1}$$
$$\sum_{k=1}^{K} c_{j,k}[n] \le 1, \quad \forall j, n; \tag{2}$$
$$\sum_{j=1}^{J} c_{j,k}[n] \le 1, \quad \forall k, n; \tag{3}$$
$$c_{j,k}[n] \in \{0, 1\}. \tag{4}$$
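Constraints (1)-(4) are simple counting conditions on the binary association tensor and can be verified directly. The following is a minimal sketch (not from the paper) that checks a candidate tensor $c_{j,k}[n]$ against them, assuming illustrative values for J, K, the number of frames, and the frame length $N_L$.

```python
import numpy as np

def check_coverage_association(c, n_frames, slots_per_frame):
    """Check a binary association tensor c[j, k, n] against constraints (1)-(4)."""
    J, K, N = c.shape
    assert N == n_frames * slots_per_frame
    # (4): binary variables only.
    if not np.isin(c, (0, 1)).all():
        return False
    # (1): each jammer j evaluates each potential eavesdropper k exactly once per frame.
    per_frame = c.reshape(J, K, n_frames, slots_per_frame).sum(axis=3)
    if not (per_frame == 1).all():
        return False
    # (2): each jammer covers at most one potential eavesdropper per time slot.
    if (c.sum(axis=1) > 1).any():
        return False
    # (3): each potential eavesdropper is covered by at most one jammer per time slot.
    if (c.sum(axis=0) > 1).any():
        return False
    return True

# Toy example: J = 1 jammer, K = 2 potential eavesdroppers, 2 frames of 4 slots each.
c = np.zeros((1, 2, 8), dtype=int)
c[0, 0, [0, 4]] = 1   # eavesdropper 0 visited once per frame
c[0, 1, [1, 5]] = 1   # eavesdropper 1 visited once per frame
print(check_coverage_association(c, n_frames=2, slots_per_frame=4))  # True
```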
In the following, we employ $C_{s_k}[n] \in \{0, 1\}$ to denote the coverage state at $\mathbf{s}_k$ in time slot n, as given by
$$C_{s_k}[n] = \begin{cases} 1, & \gamma_{s_k}[n] \ge \mu; \\ 0, & \text{otherwise}, \end{cases} \tag{5}$$
where $\gamma_{s_k}$ denotes the signal-to-interference-plus-noise ratio (SINR) at $\mathbf{s}_k$. This equation indicates that as long as $\gamma_{s_k}$ is not lower than the preset system threshold $\mu$, $\mathbf{s}_k$ is considered to be inside the coverage range.
Like in [17], we also use the reference signal receiving power (denoted by RP) to determine the value of the SINR, as given by
$$\gamma_{s_k}[n] = \frac{\mathrm{RP}(\mathbf{s}_k)}{\sum_{j=1}^{J} q_{j,k}[n] g_{j,k}[n] - \mathrm{RP}(\mathbf{s}_k) + \sigma^2} = \frac{\max_j \left( q_{j,k}[n] g_{j,k}[n] \right)}{\sum_{j=1}^{J} q_{j,k}[n] g_{j,k}[n] - \max_j \left( q_{j,k}[n] g_{j,k}[n] \right) + \sigma^2}, \tag{6}$$
where $q_{j,k}$ represents the transmission power of the j-th UAV jammer, $g_{j,k}$ denotes the power gain of the jammer, and $\sigma^2$ stands for the variance of the additive white Gaussian noise (AWGN).
The channel power gain from the j-th jammer to $\mathbf{s}_k$ is
$$g_{j,k}[n] = \frac{g_0}{d^2(\theta_j[n], \mathbf{s}_k)} = \frac{g_0}{H^2 + \|\theta_j[n] - \mathbf{s}_k\|^2}, \tag{7}$$
where $g_0$ represents the channel power gain at a reference distance of 1 m.
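As a concrete illustration of (5)-(7), the sketch below computes the distance-dependent channel gain and the RP-based SINR at a potential eavesdropper and then evaluates the coverage indicator. All parameter values (positions, powers, noise variance) are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def channel_gain(uav_xy, target_xy, g0, H):
    """Channel power gain g = g0 / (H^2 + ||horizontal distance||^2), as in (7)."""
    return g0 / (H**2 + np.sum((uav_xy - target_xy) ** 2))

def coverage_indicator(jammer_xy, q, s_k, g0, H, mu_linear, noise_var):
    """SINR at potential eavesdropper s_k per (6) and coverage state per (5)."""
    gains = np.array([channel_gain(xy, s_k, g0, H) for xy in jammer_xy])
    rx_powers = q * gains                      # received jamming power from each jammer
    rp = rx_powers.max()                       # reference signal receiving power (strongest jammer)
    sinr = rp / (rx_powers.sum() - rp + noise_var)
    return sinr, int(sinr >= mu_linear)

# Illustrative numbers (assumed, not taken from the paper).
g0 = 10 ** (-60 / 10)                          # -60 dB reference gain
jammers = np.array([[100.0, 200.0], [400.0, 350.0]])
q = np.array([0.5, 0.2])                       # transmit powers in watts
sinr, covered = coverage_indicator(jammers, q, np.array([150.0, 220.0]),
                                   g0, H=150.0, mu_linear=10 ** (3 / 10),
                                   noise_var=1e-13)
print(f"SINR = {sinr:.2f}, covered = {covered}")
```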
Let $\alpha_{m,u}[n]$ represent the association coefficient between the m-th UAV BS and the u-th legitimate user, where $\alpha_{m,u}[n] = 1$ implies that the legitimate user u in time slot n is served by the m-th UAV; otherwise, $\alpha_{m,u}[n] = 0$. Assume that each UAV can serve multiple targets simultaneously, while each target is exclusively served by only one UAV, i.e., $\alpha_{m,u}[n] \in \{0, 1\}, \forall m, u$, and
$$\sum_{m=1}^{M} \alpha_{m,u}[n] = 1, \quad \forall u \in \mathcal{U}. \tag{8}$$
Considering the limited spectrum resources used by the system, UAVs adopt the principle of spectrum reuse to increase system capacity while ensuring that the interference they receive can be controlled at an acceptable level.
As described in [18], the data rate of the u-th legitimate user is given by
$$R_{m,u}[n] = \alpha_{m,u}[n] \log \left( 1 + \gamma_{m,u}[n] \right), \tag{9}$$
where
$$\gamma_{m,u}[n] = \frac{p_{m,u}[n] g_{m,u}[n]}{\sum_{m' \in \mathcal{M} \setminus \{m\}} p_{m',u}[n] g_{m',u}[n] + \sigma^2} \tag{10}$$
denotes the user's SINR, $p_{m,u}$ represents the transmit power of UAV m, and $\sigma^2$ stands for the noise variance at the receiver. The power gain of the u-th legitimate user is thus given by
$$g_{m,u}[n] = \frac{g_0}{d^2(\Theta_m[n], A_u[n])} = \frac{g_0}{H^2 + \|\Theta_m[n] - A_u[n]\|^2}. \tag{11}$$
Similarly, the rate at which the i-th eavesdropper can intercept the link between UAV m and user u is expressed as
$$R_{E,i}^{m,u}[n] = \log \left( 1 + \gamma_{E,i}^{m,u}[n] \right), \tag{12}$$
where
$$\gamma_{E,i}^{m,u}[n] = \frac{p_{m,u}[n] g_{m,i}[n]}{I_i[n]}, \tag{13}$$
$I_i[n] = \sum_{m' \in \mathcal{M} \setminus \{m\}} p_{m',u}[n] g_{m',i}[n] + \sum_{j=1}^{J} q_{j,i}[n] g_{j,i}[n] + \sigma^2$, $q_{j,i}$ represents the power transmitted by the j-th UAV jammer toward the i-th eavesdropper, and the power gains are $g_{m,i}[n] = \frac{g_0}{d^2(\Theta_m[n], B_i[n])} = \frac{g_0}{H^2 + \|\Theta_m[n] - B_i[n]\|^2}$ and $g_{j,i}[n] = \frac{g_0}{d^2(\theta_j[n], B_i[n])} = \frac{g_0}{H^2 + \|\theta_j[n] - B_i[n]\|^2}$.
Following (9) and (12), the worst-case achievable average secrecy rate of the u-th legitimate user over a duration T in the presence of eavesdroppers can be given by
$$R_{\mathrm{sec}}^{u} = \frac{1}{N} \sum_{n=1}^{N} \sum_{m=1}^{M} \left[ R_{m,u}[n] - \max_{i} R_{E,i}^{m,u}[n] \right]^{+}, \tag{14}$$
where $[x]^+ = \max\{x, 0\}$.
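To make (9)-(14) concrete, the following sketch computes the worst-case average secrecy rate of a single user over N slots from precomputed legitimate-link and eavesdropping SINRs. The random inputs are placeholders for the quantities defined above, and base-2 logarithms are assumed for the rates.

```python
import numpy as np

def worst_case_secrecy_rate(alpha, gamma_user, gamma_eve):
    """
    Average secrecy rate of one user, following (14).

    alpha:      (N, M) binary user-association variables alpha_{m,u}[n]
    gamma_user: (N, M) SINR of the user when served by UAV m, per (10)
    gamma_eve:  (N, M, I) SINR of eavesdropper i on the (m, u) link, per (13)
    """
    r_user = alpha * np.log2(1.0 + gamma_user)                 # legitimate rate, (9)
    r_eve_worst = np.log2(1.0 + gamma_eve).max(axis=2)         # worst eavesdropper, (12)
    per_slot = np.maximum(r_user - alpha * r_eve_worst, 0.0)   # [.]^+ for the serving UAV
    return per_slot.sum(axis=1).mean()                         # sum over m, average over N slots

# Toy example with N = 4 slots, M = 2 UAV BSs, I = 2 eavesdroppers (illustrative values).
rng = np.random.default_rng(0)
alpha = np.zeros((4, 2)); alpha[np.arange(4), rng.integers(0, 2, 4)] = 1
print(worst_case_secrecy_rate(alpha, rng.uniform(1, 10, (4, 2)),
                              rng.uniform(0.1, 1, (4, 2, 2))))
```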
In the following, we focus on maximizing the minimum secrecy rate by optimizing the parameters, including trajectory planning, user association variables, and power distribution of UAVs. The optimization goal can thus be formulated as
$$(\mathrm{P0}): \ \max_{\boldsymbol{\alpha}, \Theta, \theta, \mathbf{p}, \mathbf{q}} \ \min_{u} \ R_{\mathrm{sec}}^{u} \tag{15}$$
$$\mathrm{s.t.} \quad (1), (2), (3), (4), (8), \tag{15a}$$
$$c_{j,k}[n] \gamma_{s_k}[n] \ge c_{j,k}[n] \mu, \quad \forall j, k, \tag{15b}$$
$$0 \le p_{m,u}[n] \le p_{\max}, \quad \forall u \in \mathcal{U}, \tag{15c}$$
$$0 \le q_{j,i}[n] \le p_{\max}, \quad \forall j \in \mathcal{J}, i \in \mathcal{I}, \tag{15d}$$
$$0 \le q_{j,k}[n] \le p_{\max}, \quad \forall j \in \mathcal{J}, k \in \mathcal{K}, \tag{15e}$$
where $\boldsymbol{\alpha}$ denotes the user association variables; $\mathbf{p}$ and $\mathbf{q}$ represent the transmission power of the UAV BSs and UAV jammers, respectively; and $\Theta$ and $\theta$ are the coordinates of the UAVs. For a given coverage-evaluating frequency, the coverage constraint for $\mathbf{s}_k$ is given in (15b), and the power constraints are given in (15c)-(15e).
It is evident that the optimization objective (15) involves a mixed-integer nonlinear non-convex problem, making it difficult to solve using traditional iterative optimization methods. Therefore, we transform the above optimization problem into a sequential decision-making problem and adopt a DRL-based approach to achieve the joint optimization of the user association variables, power allocation, and trajectory planning in jamming-enhanced secure UAV communication systems.

3. Deep Reinforcement Learning-Based Solutions for Joint Optimization

The nonlinearity and non-convexity of the optimization objective (15) pose significant mathematical challenges for solving the aforementioned joint optimization problem. Considering that trajectory planning, user association variables, and power allocation are all sequential decision problems, the above optimization process can be reconstructed as a Markov decision process (MDP). This section investigates single-agent and multi-agent DRL solutions.

3.1. The Single-Agent DRL Solution

Let $(\mathcal{S}, \mathcal{A}, R, \gamma)$ represent the tuple of the MDP, where $\mathcal{S}$ denotes the state space, $\mathcal{A}$ corresponds to the action space, and R signifies the reward function. The long-term cumulative discounted reward can be expressed as $R(\pi) = \mathbb{E}\left[\sum_{t=1}^{T} \gamma^{t-1} R(s_t, a_t, s_{t+1})\right]$, where $\gamma \in [0, 1)$ represents the discount factor. The constituent elements are defined as follows:
  • State space $\mathcal{S}$: $s_t \in \mathcal{S}$ represents the state during time slot t, encompassing the coordinates of the UAV BSs, the UAV jammers, the legitimate users, and the active and potential eavesdroppers:
    $$s_t = \{X_t, Y_t, U_{x,t}, U_{y,t}, I_{x,t}, I_{y,t}, K_{x,t}, K_{y,t}\},$$
    where $(X_t, Y_t)$ denotes the coordinates of all UAVs in time slot t, composed of the coordinates of the UAV BSs $(M_{x,t}, M_{y,t})$ and the coordinates of the UAV jammers $(J_{x,t}, J_{y,t})$. $(U_{x,t}, U_{y,t})$ represents the coordinates of the legitimate users, while $(I_{x,t}, I_{y,t})$ and $(K_{x,t}, K_{y,t})$ are the coordinates of the active and potential eavesdroppers, respectively.
  • Action space $\mathcal{A}$: $a_t \in \mathcal{A}$ represents the action taken during time slot t, encompassing the user association variables, the power allocated to the links toward legitimate users and eavesdroppers, and the variations in the UAV locations:
    $$a_t = \{\alpha_t, p_t, q_t, \Delta X_t, \Delta Y_t\},$$
    where $p_t$ and $q_t$ denote the power allocated by the communication UAVs and the jamming UAVs, respectively, within time slot t. The flight displacement of the UAVs is represented by $(\Delta X_t, \Delta Y_t)$, which is composed of $(\Delta M_{x,t}, \Delta M_{y,t})$ and $(\Delta J_{x,t}, \Delta J_{y,t})$.
  • Reward function: The reward function $r_t$, $r_t \in R$, comprises two components, i.e., the secrecy rate and the penalty of the coverage evaluation. The coverage evaluation of the UAV jammers is denoted by $R_c$, where $R_c = c_{j,k}[t] \gamma_{s_k}[t] - c_{j,k}[t] \mu$. Therefore, the reward function can be expressed as
    $$r_t = \min_{u \in \mathcal{U}} R_{\mathrm{sec}}^{u} + \beta R_c,$$
    where $\beta$ denotes the penalty factor.
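A minimal, gym-style sketch of how the state vector and reward defined above can be assembled is given below. The class, its field names, and the stubbed-out secrecy-rate and coverage terms are our own illustrative assumptions; only the state layout and the reward structure follow the definitions above.

```python
import numpy as np

class SecureUavEnvSketch:
    """Skeleton of the single-agent MDP (state/action/reward only; dynamics omitted)."""

    def __init__(self, M=3, J=2, U=20, I=2, K=5, beta=0.1):
        self.M, self.J, self.U, self.I, self.K, self.beta = M, J, U, I, K, beta
        # Positions of UAV BSs/jammers, users, and active/potential eavesdroppers.
        self.bs_xy = np.random.rand(M, 2)
        self.jam_xy = np.random.rand(J, 2)
        self.user_xy = np.random.rand(U, 2)
        self.eve_xy = np.random.rand(I, 2)
        self.pot_eve_xy = np.random.rand(K, 2)

    def state(self):
        # s_t = {X_t, Y_t, U_x, U_y, I_x, I_y, K_x, K_y}: all coordinates flattened.
        return np.concatenate([self.bs_xy.ravel(), self.jam_xy.ravel(),
                               self.user_xy.ravel(), self.eve_xy.ravel(),
                               self.pot_eve_xy.ravel()])

    def reward(self, min_secrecy_rate, coverage_penalty):
        # r_t = min_u R_sec^u + beta * R_c; both terms are computed elsewhere.
        return min_secrecy_rate + self.beta * coverage_penalty

env = SecureUavEnvSketch()
print(env.state().shape)   # 2 * (M + J + U + I + K) = 64 entries
```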
Single-agent DRL algorithms, such as the SAC and TD3 algorithms, can be used to solve this problem. Take the SAC algorithm as an example. The SAC algorithm aims to maximize the long-term cumulative discounted reward while maximizing the policy entropy, as given by $\max \mathbb{E}\left[\sum_{t=1}^{T} \gamma^{t-1} \left( r_t(s_t, a_t) - \rho \log \pi_\phi(a_t | s_t) \right)\right]$. Here, $\rho$ and $\pi_\phi$ represent the temperature parameter and the actor network with parameter $\phi$, respectively.
In the SAC algorithm framework, there are two main critic networks, i.e., $Q_{\theta_1}$ and $Q_{\theta_2}$, with network parameter vectors $\theta_1$ and $\theta_2$, and two target critic networks, i.e., $Q_{\theta_1'}$ and $Q_{\theta_2'}$, with parameter vectors $\theta_1'$ and $\theta_2'$. The purpose of the critic networks is to fit the soft Q-function of the agent. Furthermore, the stochastic actor network $\pi_\phi$ generates actions based on the state of the agent.
Figure 2 shows a diagram of the single-agent SAC algorithm for the jamming-enhanced secure UAV communication network. In each time slot, the agent interacts with the environment to generate a new experience tuple $(s_t, a_t, r_t, s_{t+1})$, which is then stored in the replay memory buffer $\mathcal{B}$. As time passes, the number of tuples in the replay buffer gradually increases until a sufficient number of samples is reached. To optimize the parameter vectors of the critic and actor networks, the system randomly samples a minibatch of tuples B from the replay buffer $\mathcal{B}$, that is, $\{(s_i, a_i, r_i, s_{i+1})\}_{i=1}^{|B|}$.
The critic network can be updated by minimizing
$$J(\theta_i) = \frac{1}{|B|} \sum_{(s_t, a_t) \in B} \left( Q_{\theta_i}(s_t, a_t) - y_t \right)^2,$$
where $i = 1, 2$, and $y_t$ denotes the target value of the main critic network in time slot t, which is given by
$$y_t = r_t + \gamma \left( \min_{i=1,2} Q_{\theta_i'}(s_{t+1}, \tilde{a}_{t+1}) - \rho \log \pi_\phi(\tilde{a}_{t+1} | s_{t+1}) \right), \quad \tilde{a}_{t+1} \sim \pi_\phi(\cdot | s_{t+1}).$$
The actor network can be updated according to
$$J(\phi) = \frac{1}{|B|} \sum_{s_t \in B} \left[ \min_{i=1,2} Q_{\theta_i}(s_t, a_t) - \rho \log \pi_\phi(a_t | s_t) \right].$$
In addition, the temperature parameter $\rho$ can be updated according to [10], and the target critic networks follow the soft update rule $\theta_i' \leftarrow \epsilon \theta_i + (1 - \epsilon) \theta_i'$, $i = 1, 2$, where $\epsilon$ is the soft update parameter.
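The critic loss, the entropy-regularized target, the actor objective, and the soft target update described above map onto a few lines of PyTorch. The following is a generic SAC update sketch under assumed network and optimizer objects (not the authors' code); in particular, the actor is assumed to expose a sample(state) method returning an action and its log-probability.

```python
import torch
import torch.nn.functional as F

def sac_update(batch, actor, critics, targets, optimizers, rho, gamma=0.96, eps=0.005):
    """One SAC step: critic loss, actor objective, and soft target update."""
    s, a, r, s_next = batch                          # tensors sampled from the replay buffer
    q1, q2 = critics
    q1_t, q2_t = targets
    critic_opt, actor_opt = optimizers

    # Target value y_t: bootstrapped, entropy-regularized soft Q-value.
    with torch.no_grad():
        a_next, logp_next = actor.sample(s_next)
        q_next = torch.min(q1_t(s_next, a_next), q2_t(s_next, a_next))
        y = r + gamma * (q_next - rho * logp_next)

    # Critic loss: mean squared soft Bellman residual for both main critics.
    critic_loss = F.mse_loss(q1(s, a), y) + F.mse_loss(q2(s, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor loss: maximize min-Q plus entropy (implemented by minimizing the negative).
    a_new, logp_new = actor.sample(s)
    actor_loss = (rho * logp_new - torch.min(q1(s, a_new), q2(s, a_new))).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft update of the target critics: theta' <- eps * theta + (1 - eps) * theta'.
    for net, tgt in ((q1, q1_t), (q2, q2_t)):
        for p, p_t in zip(net.parameters(), tgt.parameters()):
            p_t.data.mul_(1.0 - eps).add_(eps * p.data)
```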
Note that the TD3 algorithm utilized in this paper is also a typical single-agent DRL solution and is similar to the SAC algorithm, with an off-policy actor-critic mechanism. The actor and critic networks are depicted in Figure 3, and their parameter updating follows existing studies [8,19]. For simplicity, the details of the TD3 algorithm are not elaborated further.

3.2. The Multi-Agent DRL Solution

In the multi-agent DRL solution, each UAV acts as an agent. We use $(\mathcal{O}, \mathcal{A}, R, \gamma)$ to denote the tuple of the MDP, where $\mathcal{O}$ represents the global observation of all agents. The main elements are explained as follows:
  • Observation space $\mathcal{O}$: $o_{m,t} \in \mathcal{O}$ represents the local observation of agent m during time slot t, $m \in \mathcal{M} \cup \mathcal{J}$. The local observation $o_{m,t}$ mainly consists of the coordinates of the UAVs, the coordinates of the legitimate users, and those of the active and potential eavesdroppers:
    $$o_{m,t} = \{X_{x_m,t}, Y_{y_m,t}, U_{x_m,t}, U_{y_m,t}, I_{x_m,t}, I_{y_m,t}, K_{x_m,t}, K_{y_m,t}\},$$
    where $X \triangleq M_{x,t} \cup J_{x,t}$ and $Y \triangleq M_{y,t} \cup J_{y,t}$.
  • Action space $\mathcal{A}$: $a_{m,t} \in \mathcal{A}$ represents the action of agent m during time slot t, composed of the user association variables, the power allocated to the links toward legitimate users and eavesdroppers, and the variations in the UAV locations:
    $$a_{m,t} = \{\alpha_{m,t}, p_{m,t}, q_{m,t}, \Delta M_{x_m,t}, \Delta M_{y_m,t}, \Delta J_{x_m,t}, \Delta J_{y_m,t}\}.$$
  • Reward function R: The reward function $r_{m,t}$, $r_{m,t} \in R$, for agent m comprises both the secrecy rate and the penalty of the coverage evaluation. The coverage evaluation of the UAV jammers is denoted by $R_c = c_{j,k}[t] \gamma_{s_k}[t] - c_{j,k}[t] \mu$. The reward function can be written as
    $$r_{m,t} = \min_{u \in \mathcal{U}} R_{\mathrm{sec}}^{u} + \beta R_c.$$
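As a small illustration, one plausible layout of the local observation, consistent with the per-agent input dimension $2(1 + J + U + I + K)$ used in Section 3.3, stacks the agent's own position, the jammer positions, and the shared user and eavesdropper coordinates. The exact composition in the paper follows the definition above; the sketch below is an assumption.

```python
import numpy as np

def local_observation(own_xy, jammer_xy, user_xy, eve_xy, pot_eve_xy):
    """Local observation o_{m,t} of agent m: own position plus shared coordinates."""
    return np.concatenate([own_xy,                 # (x, y) of this UAV BS or jammer
                           jammer_xy.ravel(),      # UAV jammers
                           user_xy.ravel(),        # legitimate users
                           eve_xy.ravel(),         # active eavesdroppers
                           pot_eve_xy.ravel()])    # potential eavesdroppers

# J = 2 jammers, U = 20 users, I = 2 active and K = 5 potential eavesdroppers (illustrative).
obs = local_observation(np.array([0.1, 0.4]), np.random.rand(2, 2),
                        np.random.rand(20, 2), np.random.rand(2, 2), np.random.rand(5, 2))
print(obs.shape)   # 2 * (1 + J + U + I + K) = 60 entries
```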
In this paper, we employ the MASAC algorithm to solve the problem, where each agent corresponds to a UAV and comprises two main critic networks (i.e., $Q_{\theta_{m,1}}$, $Q_{\theta_{m,2}}$), two target critic networks (i.e., $Q_{\theta_{m,1}'}$, $Q_{\theta_{m,2}'}$, $m \in \mathcal{M} \cup \mathcal{J}$), and one actor network $\pi_{\phi_m}$.
In the training process, agent m is designed to maximize
$$\mathbb{E}\left[ \sum_{t=1}^{T} \gamma^{t-1} \left( r_{m,t}(o_{m,t}, a_{m,t}) - \rho_m \log \pi_{\phi_m}(a_{m,t} | o_{m,t}) \right) \right].$$
Figure 4 shows a diagram of the MASAC algorithm for this jamming-enhanced secure UAV communication network. After each interaction with the environment, an experience tuple $(O_t, A_t, R_t, O_{t+1})$ is generated and stored in the replay buffer $\mathcal{B}$. To update the neural network parameters, a minibatch of experience tuples B is randomly sampled from $\mathcal{B}$.
The MASAC algorithm follows the centralized training and decentralized execution mechanism. The critic network can be updated by minimizing the soft Bellman residuals:
$$J(\theta_{m,i}) = \frac{1}{|B|} \sum \left( Q_{\theta_{m,i}}(O_t, a_{1,t}, \ldots, a_{M+J,t}) - y_{m,t} \right)^2,$$
where $(O_t, a_{m,t}) \in B$, $i = 1, 2$, and $y_{m,t}$ is the target value of the main critic network in time slot t, as given by
$$y_{m,t} = r_{m,t} + \gamma V(Q_{t+1}),$$
where $V(Q_{t+1}) = \min_{i=1,2} Q_{\theta_{m,i}'}(O_{t+1}, \tilde{a}_{m,t+1}) - \rho_m \log \pi_{\phi_m}(\tilde{a}_{m,t+1} | o_{m,t+1})$ and $\tilde{a}_{m,t+1} \sim \pi(\cdot | o_{m,t+1})$.
The actor network can be updated according to
$$J(\phi_m) = \frac{1}{|B|} \sum \left[ \min_{i=1,2} Q_{\theta_{m,i}}(O_t, a_{1,t}, \ldots, a_{M+J,t}) - \rho_m \log \pi_{\phi_m}(a_{m,t} | o_{m,t}) \right].$$
In addition, the target critic networks of each agent follow the soft update rule $\theta_{m,i}' \leftarrow \epsilon \theta_{m,i} + (1 - \epsilon) \theta_{m,i}'$, $i = 1, 2$, where $\epsilon$ is the soft update parameter, and the temperature parameter $\rho_m$ can be updated according to [20].
The pseudocode of the MASAC is presented in Algorithm 1.
Algorithm 1 MASAC algorithm for jamming-enhanced secure UAV communications
 1: For each UAV m, initialize the main network parameters $\theta_{m,i}$, $\phi_m$, and set the target network parameters $\theta_{m,i}' \leftarrow \theta_{m,i}$, $i = 1, 2$, $m \in \mathcal{M} \cup \mathcal{J}$.
 2: for each episode do
 3:     Initialize the global observation $O_t$
 4:     for $t \in \{1, \ldots, T\}$ do
 5:         for $m \in \{1, \ldots, M + J\}$ do
 6:             Select action $a_{m,t} \sim \pi_{\phi_m}(\cdot)$
 7:         end for
 8:         Execute the joint action $A_t = (a_{m,t}, a_{-m,t})$
 9:         Observe the reward $R_t$ and the next global observation $O_{t+1}$
10:         Store the tuple $(O_t, A_t, R_t, O_{t+1})$ in $\mathcal{B}$
11:         $O_t \leftarrow O_{t+1}$
12:         for $m \in \{1, \ldots, M + J\}$ do
13:             Sample a minibatch B from $\mathcal{B}$
14:             Update the main critic network parameters:
15:                 $\theta_{m,i} \leftarrow \theta_{m,i} - \nabla_{\theta_{m,i}} J(\theta_{m,i})$, $i = 1, 2$
16:             Update the actor network parameters:
17:                 $\phi_m \leftarrow \phi_m - \nabla_{\phi_m} J(\phi_m)$
18:             Update the target critic networks $\theta_{m,i}'$, $i = 1, 2$, following the soft update rule
19:             Update the temperature parameter $\rho_m$ according to [20]
20:         end for
21:     end for
22: end for
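Algorithm 1 translates almost line by line into a training loop. The sketch below mirrors its centralized-training, decentralized-execution structure; the env, agents, and buffer objects and their method names are hypothetical interfaces assumed for illustration, not the authors' implementation.

```python
def train_masac(env, agents, buffer, episodes, horizon, batch_size):
    """Training loop mirroring Algorithm 1 (assumed interfaces for env/agents/buffer)."""
    for episode in range(episodes):
        obs_global = env.reset()                              # line 3: initial global observation
        for t in range(horizon):
            # Lines 5-7: each agent selects its action from its own stochastic policy.
            actions = [agent.act(obs_global) for agent in agents]
            # Lines 8-10: execute the joint action, observe rewards, store the transition.
            next_obs, rewards, _ = env.step(actions)
            buffer.store(obs_global, actions, rewards, next_obs)
            obs_global = next_obs                             # line 11
            # Lines 12-20: centralized training of every agent on a sampled minibatch.
            if len(buffer) >= batch_size:
                for agent in agents:
                    batch = buffer.sample(batch_size)         # line 13
                    agent.update_critics(batch)               # lines 14-15: critic loss J(theta)
                    agent.update_actor(batch)                 # lines 16-17: actor loss J(phi)
                    agent.soft_update_targets()               # line 18
                    agent.update_temperature(batch)           # line 19
```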

3.3. Computational Complexity Analysis

In this section, we investigate the optimization of the secrecy rate based on the single-agent and multi-agent DRL methods and analyze their complexity, which is determined by the neural network architecture of each algorithm. In our DRL solutions, both the critic and actor networks are four-layer fully connected networks, consisting of one input layer, two hidden layers, and one output layer. Let $n_{a1}$ and $n_{a2}$ represent the numbers of nodes in hidden layers 1 and 2 of the actor network, and let $n_{c1}$ and $n_{c2}$ represent the numbers of nodes in hidden layers 1 and 2 of the critic network.
First, we analyze the complexity of the single-agent SAC algorithm. The state space and the action space have dimensions of $2(M + J + U + I + K)$ and $2MU + JI + JK + 2(M + J)$, respectively, corresponding to the numbers of input nodes $n_i$ and output nodes $n_o$ of the actor network, i.e., $n_i = 2(M + J + U + I + K)$ and $n_o = 2MU + JI + JK + 2(M + J)$. According to [21], the time complexity at each training step of each actor network is given by
$$\mathcal{O}_a = O\left( n_i^2 + n_{a1}^2 + n_{a2}^2 + n_o^2 \right).$$
Similarly, the time complexity of each critic network at each step can be calculated as
$$\mathcal{O}_c = O\left( (n_i + n_o)^2 + n_{c1}^2 + n_{c2}^2 \right).$$
Considering all the main and target networks of the SAC algorithm, the time complexity of each step of the training process is $\mathcal{O}_{\mathrm{train}}^{\mathrm{SAC}} = \mathcal{O}_a + 4\mathcal{O}_c$.
During the testing process, only the actor network is used to determine the action interacting with the environment, and its time complexity depends only on the matrix multiplication complexity of the layers. The calculation can be expressed as
$$\mathcal{O}_{\mathrm{test}}^{\mathrm{SAC}} = O\left( n_i n_{a1} + n_{a1} n_{a2} + n_{a2} n_o \right).$$
Note that there are six neural networks in the TD3 algorithm, including two main critic networks, two target critic networks, one main actor network, and one target actor network. The time complexity at each step of the training process can be described as $\mathcal{O}_{\mathrm{train}}^{\mathrm{TD3}} = 2\mathcal{O}_a + 4\mathcal{O}_c$. During the testing process, as only the optimal actor network is applied to decide the actions, the time complexity of each step is approximately equivalent to that of the SAC algorithm.
For the MASAC algorithm, the local observation space has a dimension of $2(1 + J + U + I + K)$, denoted by $n_{m,i}$, $m \in \mathcal{M} \cup \mathcal{J}$, and the action space has a dimension of $2MU + JI + JK + 2(1 + J)$, denoted by $n_{m,o}$, $m \in \mathcal{M} \cup \mathcal{J}$. Therefore, the time complexity for the m-th agent at each training step of each actor network is given by $\mathcal{O}_{m,a} = O\left( n_{m,i}^2 + n_{a1}^2 + n_{a2}^2 + n_{m,o}^2 \right)$. Similarly, the time complexity of each critic network at each step is $\mathcal{O}_{m,c} = O\left( (n_{m,i} + n_{m,o})^2 + n_{c1}^2 + n_{c2}^2 \right)$.
Therefore, the time complexity for all agents in each training step is
$$\mathcal{O}_{\mathrm{train}}^{\mathrm{MASAC}} = \sum_{m=1}^{M+J} \mathcal{O}_{m,a} + 4 \sum_{m=1}^{M+J} \mathcal{O}_{m,c}.$$
In the testing process, the time complexity of all agents can be calculated as
$$\mathcal{O}_{\mathrm{test}}^{\mathrm{MASAC}} = \sum_{m=1}^{M+J} O\left( n_{m,i} n_{a1} + n_{a1} n_{a2} + n_{a2} n_{m,o} \right).$$
In summary, the number of input nodes of an actor network is usually smaller than that of a critic network, so the testing process has a comparatively low computational complexity, whereas the training process has a much higher one. Table 1 shows the computational complexity of the different algorithms in the training and testing processes. Moreover, it can also be concluded that $\sum_{m=1}^{M+J} \mathcal{O}_{m,a} + 4 \sum_{m=1}^{M+J} \mathcal{O}_{m,c} \gg \mathcal{O}_a + 4\mathcal{O}_c$; therefore, the MASAC algorithm has a higher time complexity than the SAC algorithm in the training process, that is, $\mathcal{O}_{\mathrm{train}}^{\mathrm{MASAC}} \gg \mathcal{O}_{\mathrm{train}}^{\mathrm{SAC}}$.
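The per-step expressions above are straightforward to evaluate numerically. The sketch below plugs in the scenario of Section 4 (M = 3, J = 2, U = 20, I = 2, K = 5, 128-neuron hidden layers) to compare the single-agent SAC and MASAC training terms; the resulting counts are rough proxies for per-step work, not measured runtimes.

```python
def actor_term(n_in, n_out, n_a1=128, n_a2=128):
    # O_a = O(n_i^2 + n_a1^2 + n_a2^2 + n_o^2)
    return n_in**2 + n_a1**2 + n_a2**2 + n_out**2

def critic_term(n_in, n_out, n_c1=128, n_c2=128):
    # O_c = O((n_i + n_o)^2 + n_c1^2 + n_c2^2)
    return (n_in + n_out)**2 + n_c1**2 + n_c2**2

M, J, U, I, K = 3, 2, 20, 2, 5
# Single-agent SAC: full state and action dimensions.
n_i = 2 * (M + J + U + I + K)
n_o = 2 * M * U + J * I + J * K + 2 * (M + J)
sac_train = actor_term(n_i, n_o) + 4 * critic_term(n_i, n_o)

# MASAC: per-agent local observation and action dimensions, summed over M + J agents.
n_mi = 2 * (1 + J + U + I + K)
n_mo = 2 * M * U + J * I + J * K + 2 * (1 + J)
masac_train = (M + J) * (actor_term(n_mi, n_mo) + 4 * critic_term(n_mi, n_mo))

print(f"SAC per-step training term:   {sac_train}")
print(f"MASAC per-step training term: {masac_train}")   # larger, consistent with the text
```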

4. Numerical Results

The simulation environment was a square area with a side length of 1 km. Legitimate users were randomly distributed throughout the area, and the positions of the UAVs were randomly initialized at the beginning of the simulation. We considered a periodic coverage-assisted area comprising three UAV BSs, two UAV jammers, 20 legitimate users, two ground eavesdroppers, and five latent eavesdroppers. The flight period was set to T = 50 s, the coverage evaluation frame length was $T_L$ = 20 s, the predetermined flight altitude of the UAVs was 150 m, and each UAV was associated with the nearest legitimate users. Moreover, the time slot length was set to $\delta_t$ = 0.25 s, the threshold for the coverage evaluation was set to $\mu$ = 3 dB, and the reference channel power was $g_0$ = −60 dB [22].
The experiments were simulated using Python v3.7 with the PyTorch deep learning framework. Both the critic and policy networks were implemented as four-layer fully connected networks with 128 neurons in each hidden layer. Each episode comprised 50 time slots. Furthermore, the discount factor $\gamma$ and the minibatch size $|B|$ were set to 0.96 and 256, respectively, while the learning rates of the critic and actor networks were set to 0.0001 and 0.00001, respectively.
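For readability, the simulation settings listed above can be collected into a single configuration object. The sketch below only restates the values given in this section; the field names are our own.

```python
from dataclasses import dataclass

@dataclass
class SimulationConfig:
    # Scenario (Section 4)
    area_side_m: float = 1000.0          # square target area, 1 km per side
    num_uav_bs: int = 3
    num_uav_jammers: int = 2
    num_users: int = 20
    num_eavesdroppers: int = 2
    num_latent_eavesdroppers: int = 5
    flight_period_s: float = 50.0        # T
    frame_length_s: float = 20.0         # T_L, coverage-evaluation frame
    altitude_m: float = 150.0            # H
    slot_length_s: float = 0.25          # delta_t
    coverage_threshold_db: float = 3.0   # mu
    ref_channel_power_db: float = -60.0  # g_0
    # Learning (PyTorch, four-layer fully connected networks)
    hidden_neurons: int = 128
    slots_per_episode: int = 50
    discount_factor: float = 0.96        # gamma
    batch_size: int = 256                # |B|
    critic_lr: float = 1e-4
    actor_lr: float = 1e-5

cfg = SimulationConfig()
print(cfg.discount_factor, cfg.batch_size)
```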
Figure 5 shows the cumulative discounted return of the DRL algorithms versus the training episodes. The MASAC algorithm performed best in terms of convergence speed and cumulative discounted return. The SAC algorithm demonstrated the best stability during the training phase and a better cumulative discounted return than the TD3 algorithm. This is because the multiple agents in the MASAC algorithm have a better ability to explore and cooperate. On one hand, multiple agents explore different parts of the environment simultaneously, which helps them learn better policies than a single agent. On the other hand, multiple agents coordinate their actions to achieve shared or individual goals more efficiently, and each agent can specialize in a role or a subset of tasks, leading to better performance. However, the MASAC algorithm had a higher time complexity than the other algorithms, that is, $\mathcal{O}_{\mathrm{train}}^{\mathrm{MASAC}} \gg \mathcal{O}_{\mathrm{train}}^{\mathrm{SAC}}$, and its training was quite time-consuming, which can be attributed to its centralized training and decentralized execution mechanism.
To verify the effectiveness of the DRL-based solutions, we saved the neural network parameters after each algorithm’s training was completed. Then, only the actor network was utilized to determine the action interacting with the environment and further calculate the corresponding secrecy rate in the testing process. Figure 6 shows the normalized average secrecy rate versus the number of time slots. It can be observed that the secrecy rate for each algorithm increased as the number of time slots increased, with the secrecy rate for the MASAC algorithm increasing by more than 6.6 % and 14.2 % compared to that of the SAC and TD3 algorithms, respectively. The simulation results reveal the validity of the DRL algorithms in finding the effective user association variables, UAV trajectory, and power allocation policy for the considered scenarios.
Finally, we studied the relationship between the normalized average secrecy rate and the number of eavesdroppers, as shown in Figure 7 and Figure 8. These experiments included three UAV BSs, two UAV jammers, and 20 legitimate users. In the solutions, we saved the parameters of their respective actor networks after each algorithm’s training was completed, then loaded them to decide on the variables in the testing process, and finally calculated the corresponding secrecy rate. It can be observed in Figure 7 that the secrecy rate tended to decrease as the number of eavesdroppers increased. The MASAC algorithm achieved the best secrecy rate compared with the other algorithms. In Figure 8, it can be seen that the number of latent eavesdroppers had less influence on the average secrecy rate. It can be deduced that the existing UAV jammers effectively ensured the secrecy rate in this scenario. Moreover, the MASAC algorithm achieved the best secrecy rate in terms of different numbers of latent eavesdroppers.

5. Conclusions

This paper investigated the problem of maximizing the minimum achievable secrecy rate in jamming-enhanced secure UAV communication systems, with a focus on the joint optimization of the user association variables, power allocation, and UAV trajectories. Due to the high complexity of the optimization objective, we transformed the problem into a Markov decision process and used both single-agent and multi-agent DRL algorithms (namely SAC, TD3, and MASAC) to solve it. The simulation results show that the MASAC algorithm is effective in accumulating discounted rewards but at the cost of higher time complexity during the training process. In contrast, the SAC algorithm performs the best in terms of stability, and its cumulative discounted reward is better than that of the TD3 algorithm.
In future work, we will explore secure transmission for integrated sensing and communication (ISAC)-enabled UAV networks. These single-agent and multi-agent DRL algorithms will be exploited to solve the user association variables, UAV trajectory planning, and power allocation problems in UAV-ISAC networks.

Author Contributions

Conceptualization, Z.X., Y.Q. and C.D.; methodology, Z.X. and Y.Q.; software, Z.X.; validation, C.D.; formal analysis, C.D. and Z.Z.; investigation, W.W.; resources, W.W.; data curation, Z.X.; writing—original draft preparation, Z.X.; writing—review and editing, Y.Q.; visualization, Y.Q. and W.W.; supervision, C.D. and Z.Z.; project administration, Z.Z.; funding acquisition, W.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the China Post-Doctoral Science Foundation (2023M740266) and the National Natural Science Foundation of China (No. 62071035).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, J.; Xu, J.; Lu, W.; Zhao, N.; Wang, X.; Niyato, D. Secure Transmission for IRS-Aided UAV-ISAC Networks. IEEE Trans. Wirel. Commun. 2024, 23, 12256–12269. [Google Scholar] [CrossRef]
  2. Lin, Z.; Lin, M.; De Cola, T.; Wang, J.B.; Zhu, W.P.; Cheng, J. Supporting IoT with rate-splitting multiple access in satellite and aerial-integrated networks. IEEE Internet Things J. 2021, 8, 11123–11134. [Google Scholar] [CrossRef]
  3. Yan, S.; Gu, Z.; Park, J.H.; Xie, X.; Dou, C. Probability-density-dependent load frequency control of power systems with random delays and cyber-attacks via circuital implementation. IEEE Trans. Smart Grid 2022, 13, 4837–4847. [Google Scholar] [CrossRef]
  4. Zhou, Y.; Yeoh, P.L.; Chen, H.; Li, Y.; Schober, R.; Zhuo, L.; Vucetic, B. Improving Physical Layer Security via a UAV Friendly Jammer for Unknown Eavesdropper Location. IEEE Trans. Veh. Technol. 2018, 67, 11280–11284. [Google Scholar] [CrossRef]
  5. Zhong, C.; Yao, J.; Xu, J. Secure UAV Communication With Cooperative Jamming and Trajectory Control. IEEE Commun. Lett. 2018, 23, 286–289. [Google Scholar] [CrossRef]
  6. Yao, Y.; Zhao, J.; Li, Z.; Cheng, X.; Wu, L. Jamming and eavesdropping defense scheme based on deep reinforcement learning in autonomous vehicle networks. IEEE Trans. Inf. Forensics Secur. 2023, 18, 1211–1224. [Google Scholar] [CrossRef]
  7. Min, M.; Yang, S.; Zhang, H.; Ding, J.; Peng, G.; Pan, M.; Han, Z. Indoor Semantic Location Privacy Protection with Safe Reinforcement Learning. IEEE Trans. Cogn. Commun. Netw. 2023, 9, 1385–1398. [Google Scholar] [CrossRef]
  8. Zhang, Z.; Tian, J.; Wang, D.; Qiao, J.; Li, T. TD3-based Joint UAV Trajectory and Power Optimization in UAV-Assisted D2D Secure Communication Networks. In Proceedings of the 2022 IEEE 96th Vehicular Technology Conference (VTC2022-Fall), London, UK, 26–29 September 2022; pp. 1–5. [Google Scholar]
  9. Dong, R.; Wang, B.; Tian, J.; Cheng, T.; Diao, D. Deep Reinforcement Learning Based UAV for Securing mmWave Communications. IEEE Trans. Veh. Technol. 2022, 72, 5429–5434. [Google Scholar] [CrossRef]
  10. Qin, Y.; Zhang, Z.; Li, X.; Huangfu, W.; Zhang, H. Deep Reinforcement Learning Based Resource Allocation and Trajectory Planning in Integrated Sensing and Communications UAV Network. IEEE Trans. Wirel. Commun. 2023, 22, 8158–8169. [Google Scholar] [CrossRef]
  11. Zhou, X.; Xiong, J.; Zhao, H.; Liu, X.; Ren, B.; Zhang, X.; Wei, J.; Yin, H. Joint UAV Trajectory and Communication Design with Heterogeneous Multi-Agent Reinforcement Learning. Sci. China Inf. Sci. 2024, 67, 132302. [Google Scholar] [CrossRef]
  12. Liu, Y.; Xiong, K.; Zhang, W.; Yang, H.C.; Fan, P.; Letaief, K.B. Jamming-enhanced Secure UAV Communications with Propulsion Energy and Curvature Radius Constraints. IEEE Trans. Veh. Technol. 2023, 72, 10852–10866. [Google Scholar] [CrossRef]
  13. Wang, D.; Zhao, Y.; He, Y.; Tang, X.; Li, L.; Zhang, R.; Zhai, D. Passive Beamforming and Trajectory Optimization for Reconfigurable Intelligent Surface-Assisted UAV Secure Communication. Remote Sens. 2021, 13, 4286. [Google Scholar] [CrossRef]
  14. Li, S.; Duo, B.; Di Renzo, M.; Tao, M.; Yuan, X. Robust Secure UAV Communications With the Aid of Reconfigurable Intelligent Surfaces. IEEE Trans. Wirel. Commun. 2021, 20, 6402–6417. [Google Scholar] [CrossRef]
  15. Tong, Y.; Sheng, M.; Liu, J.; Zhao, N. Energy-efficient UAV-NOMA aided wireless coverage with massive connections. Sci. China Inf. Sci. 2023, 66, 222303. [Google Scholar] [CrossRef]
  16. Cai, Y.; Wei, Z.; Li, R.; Ng, D.W.K.; Yuan, J. Joint Trajectory and Resource Allocation Design for Energy-Efficient Secure UAV Communication Systems. IEEE Trans. Wirel. Commun. 2020, 68, 4536–4553. [Google Scholar] [CrossRef]
  17. Qin, Y.; Huangfu, W.; Zhang, H.; Long, K.; Yuan, J. Rethinking Cellular System Coverage Optimization: A Perspective of Pseudometric Structure of Antenna Azimuth Variable Space. IEEE Syst. J. 2020, 15, 2971–2979. [Google Scholar] [CrossRef]
  18. Peng, H.; Shen, X. Multi-agent Reinforcement Learning Based Resource Management in MEC-and UAV-Assisted Vehicular Networks. IEEE J. Sel. Areas Commun. 2020, 39, 131–141. [Google Scholar] [CrossRef]
  19. Fujimoto, S.; Van Hoof, H.; Meger, D. Addressing Function Approximation Error in Actor-Critic Methods. In Proceedings of the International Conference on Machine Learning (PMLR), Stockholm, Sweden, 10–15 July 2018; pp. 1587–1596. [Google Scholar]
  20. Li, X.; Qin, Y.; Huo, J.; Huangfu, W. Heuristically Assisted Multiagent RL-Based Framework for Computation Offloading and Resource Allocation of Mobile Edge Computing. IEEE Internet Things J. 2023, 10, 15477–15487. [Google Scholar] [CrossRef]
  21. Truong, T.P.; Tuong, V.D.; Dao, N.N.; Cho, S. FlyReflect: Joint Flying IRS Trajectory and Phase Shift Design Using Deep Reinforcement Learning. IEEE Internet Things J. 2022, 10, 4605–4620. [Google Scholar] [CrossRef]
  22. Wu, Q.; Zeng, Y.; Zhang, R. Joint trajectory and communication design for multi-UAV enabled wireless networks. IEEE Trans. Wirel. Commun. 2018, 17, 2109–2121. [Google Scholar] [CrossRef]
Figure 1. The jamming-enhanced secure UAV communication deployment in the target area.
Figure 2. Diagram of the single-agent SAC algorithm for the jamming-enhanced secure UAV communication network.
Figure 3. Diagram of the agent in the single-agent TD3 algorithm.
Figure 4. Diagram of the MASAC algorithm for the jamming-enhanced secure UAV communication network.
Figure 5. The cumulative discounted reward versus the training episodes.
Figure 6. The normalized average secrecy rate versus the number of time slots.
Figure 7. The normalized average secrecy rate versus the number of ground eavesdroppers.
Figure 8. The normalized average secrecy rate versus the number of latent eavesdroppers.
Table 1. Computational complexity for different DRL algorithms in our considered scenarios.

Algorithm | Training Process | Testing Process
MASAC | $\sum_{m=1}^{M+J} \mathcal{O}_{m,a} + 4 \sum_{m=1}^{M+J} \mathcal{O}_{m,c}$ | $\sum_{m=1}^{M+J} O(n_{m,i} n_{a1} + n_{a1} n_{a2} + n_{a2} n_{m,o})$
SAC | $\mathcal{O}_a + 4\mathcal{O}_c$ | $O(n_i n_{a1} + n_{a1} n_{a2} + n_{a2} n_o)$
TD3 | $2\mathcal{O}_a + 4\mathcal{O}_c$ | $O(n_i n_{a1} + n_{a1} n_{a2} + n_{a2} n_o)$

