Open AccessArticle

Deep Reinforcement Learning-Driven Jamming-Enhanced Secure Unmanned Aerial Vehicle Communications

Zhifang Xing

¹,

Yunhui Qin

²,

Changhao Du

^3,*

Wenzhang Wang

³ and

Zhongshan Zhang

School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China

National School of Elite Engineering, University of Science and Technology Beijing, Beijing 100081, China

School of Cyberspace Science and Technology, Beijing Institute of Technology, Beijing 100081, China

Author to whom correspondence should be addressed.

Sensors 2024, 24(22), 7328; https://doi.org/10.3390/s24227328

Submission received: 10 October 2024 / Revised: 11 November 2024 / Accepted: 13 November 2024 / Published: 16 November 2024

(This article belongs to the Special Issue Novel Signal Processing Techniques for Wireless Communications)

Download

Browse Figures

Figure 1
The jamming-enhanced secure UAV communication deployment in the target area. "> Figure 2
Diagram of the single-agent SAC algorithm for the jamming-enhanced secure UAV communication network. "> Figure 3
Diagram of the agent in the single-agent TD3 algorithm. "> Figure 4
Diagram of the MASAC algorithm for the jamming-enhanced secure UAV communication network. "> Figure 5
The cumulative discounted reward versus the training episodes. "> Figure 6
The normalized average secrecy rate versus the number of time slots. "> Figure 7
The normalized average secrecy rate versus the number of ground eavesdroppers. "> Figure 8
The normalized average secrecy rate versus the number of latent eavesdroppers. ">

Versions Notes

Abstract

Despite its flexibility, unmanned aerial vehicle (UAV) communications are susceptible to eavesdropping due to the open nature of wireless channels and the broadcasting nature of wireless signals. This paper studies secure UAV communications and proposes a method to optimize the minimum secrecy rate of the system by using interference technology to enhance it. To this end, the system not only deploys multiple UAV base stations (BSs) to provide services to legitimate users but also assigns dedicated UAV jammers to send interference signals to active or potential eavesdroppers to disrupt their eavesdropping effectiveness. Based on this configuration, we formulate the optimization process of parameters such as the user association variables, UAV trajectory, and output power as a sequential decision-making problem and use the single-agent soft actor-critic (SAC) algorithm and twin delayed deep deterministic policy gradient (TD3) algorithm to achieve joint optimization of the core parameters. In addition, for specific scenarios, we also use the multi-agent soft actor-critic (MASAC) algorithm to solve the joint optimization problem mentioned above. The numerical results show that the normalized average secrecy rate of the MASAC algorithm increased by more than 6.6% and 14.2% compared with that of the SAC and TD3 algorithms, respectively.

Keywords:

unmanned aerial vehicle (UAV); jamming UAV; deep reinforcement learning; sequential decision problem

1. Introduction

Unmanned aerial vehicles (UAVs) have been widely adopted in emergency rescue, instant messaging, and other fields due to their excellent flexibility and cost-effectiveness. Flexibly deployed UAVs can carry communication payloads and act as airborne base stations (BSs) to provide access services for ground users.

However, the open nature of wireless channels and the broadcasting nature of wireless signals significantly increase the security risks of UAV communication. Malicious users can eavesdrop on the communication content of legitimate users by intercepting and stealing UAV signals, which results in significant security threats, especially in applications involving sensitive data transmissions, such as surveillance, disaster response, and secure communication networks [1,2,3]. Therefore, ensuring the communication security of legitimate users and effectively combating potential eavesdroppers has become a major challenge in the field of UAV communication. In order to address the above challenges, a new method called “friendly UAV jamming” has been proposed, which sends special interference signals to eavesdroppers to prevent them from obtaining eavesdropped content, thereby ensuring the information security of legitimate users [4,5].

Unfortunately, operating the aforementioned UAV security communication mechanism requires solving complex mathematical problems that traditional methods are powerless to address. Therefore, researchers have begun to use new methods based on artificial intelligence to solve these problems. Notable achievements include deep reinforcement learning (DRL)-based methods [6,7], including the twin delayed deep deterministic policy gradient (TD3) algorithm [8], proximal policy optimization (PPO) [9], and the soft actor-critic (SAC) algorithm [10]. Additionally, the multi-agent DRL approach is also effective in providing distributed and online solutions. For instance, the authors of [11] introduced multi-agent DRL approaches to jointly optimize critical parameters, such as UAV trajectories, user association variables, and transmit power, in multi-UAV-assisted communication systems.

Ensuring secure communication involves strategies to protect data transmissions between UAVs and legitimate users from eavesdropping, which becomes especially challenging in open environments. To solve the UAV security communication problem, many relevant studies exist [6,12,13]. UAVs are often deployed to monitor or communicate within designated areas, ensuring consistent coverage over time for tasks like surveillance, data collection, or relaying communication signals. Periodic coverage ensures that UAVs revisit specific areas regularly, which is essential in dynamic environments.

Therefore, this paper studies the periodic coverage-assisted secure communication of UAVs with coverage evaluation constraints. Unlike existing studies [14,15], this paper fully considers scenarios with active and potential eavesdroppers. Specifically, we use multiple UAV BSs to provide services to legitimate users while also deploying a certain number of UAV jammers to send interference signals to eavesdroppers. Considering the limited carrying capacity of UAVs, as described in [16], this paper adopts a cyclic coverage estimation scheme to improve the service capability of UAV clusters.

The main purpose of this paper is to propose a new strategy to help legitimate users maximize their minimum secrecy rate. The optimization objective presents a complex mixed-integer nonlinear optimization problem. This problem is mathematically intractable because maximizing the minimum secrecy rate requires the simultaneous optimization of the user association variables, UAV trajectory, and output power. However, due to the complex nature of the coverage constraints, the mobility of the UAVs, and the discrete nature of the user association variables, problem solving becomes highly challenging. DRL algorithms may be more suitable for addressing these problems. Therefore, this paper formulates the optimization objective as a sequential decision-making problem. The DRL method (i.e., single-agent SAC and TD3 algorithms) and the multi-agent SAC (MASAC) algorithm can effectively solve these problems. The numerical results show that the MASAC algorithm is superior in accumulating discounted rewards but at the cost of higher time complexity during the training process. In contrast, the SAC algorithm performs best in terms of stability and can obtain better cumulative discounted rewards than the TD3 algorithm.

The rest of this paper is organized as follows. Section 2 describes the system model and problem formulation. Deep reinforcement learning-based solutions for joint optimization are discussed in Section 3. The numerical results are provided in Section 4, and Section 5 concludes this paper.

Notations:

{(\cdot)}^{T}

represents the transpose, and

| \cdot |

and

∥ \cdot ∥

refer to the modulus and Euclidean norm, respectively.

{[\cdot]}^{+}

means that the calculation result inside the square brackets is non-negative.

E (\cdot)

denotes the mathematical expectation. ∪ and ≫ represent the union and “much greater than” operations, respectively.

2. System Model and Problem Formulation

This paper mainly studies the secure communication model of jamming-enhanced UAVs. Figure 1 shows the deployment of a jamming-enhanced secure UAV communication system. Assume that the number of single-antenna users is U, and these users are served by M UAV BSs. In addition, assume that the system is equipped with J UAV jammers, which protect legitimate user information security by sending noise-like interference signals to eavesdroppers. Without loss of generality, assume that the number of ground eavesdroppers is I. Furthermore, the sets of legitimate users, eavesdroppers, UAV BSs, and UAV jammers are denoted by

U = {1, 2, \dots U}

I = {1, 2, \dots I}

M = {1, 2, \dots M}

, and

J = {1, 2, \dots J}

, respectively. In addition to the I deterministic eavesdroppers, this paper also considers the possibility of potential eavesdroppers snooping on legitimate information.

Assume that K potential eavesdroppers are randomly distributed within the target area, where

s_{k} = {[c_{k}, d_{k}]}^{T} \in s

, with

k \in K = {1, \dots K}

signifying the position of the k-th latent eavesdropper.

To facilitate both system trajectory planning and resource allocation, we preset the flight cycle of the UAV as T, which can be divided into N time slots and has a duration of

δ_{T} = T / N

, where

N \in 1, \dots N

. As long as each time slot is short enough and the UAV’s flight speed is moderate, we assume that its position remains almost unchanged during this period. The flight altitude, or hovering altitude, of each UAV is expressed as H. In addition,

Θ_{m} [n] = {[X_{m} [n], Y_{m} [n]]}^{T}

and

θ_{j} [n] = {[x_{j} [n], y_{j} [n]]}^{T}

are used to characterize the m-th UAV BS and the j-th UAV jammer in the n-th time slot, respectively. For simplicity, the horizontal positions of all legitimate and eavesdropping UAVs are represented as

Θ [n] = \{X_{1} [n], \dots X_{M} [n]; Y_{1} [n], \dots Y_{M} [n]\}

and

θ [n] = \{x_{1} [n], \dots x_{J} [n]; y_{1} [n], \dots y_{J} [n]\}

, respectively. In addition, the position of the legitimate user u in time slot n is denoted by

A_{u} [n] = {[a_{u} [n], b_{u} [n]]}^{T}

with zero altitude. We also use

B_{i} [n] = {[a_{i}^{'} [n], b_{i}^{'} [n]]}^{T}

to denote the location of the i-th deterministic eavesdropper.

Considering the limited carrying capacity of UAVs, this paper proposes a periodic coverage evaluation mechanism for UAV interference and potential eavesdropping. This mechanism ensures that the system achieves strong anti-eavesdropping capabilities at the lowest energy cost. Specifically, we assume that at least one potential eavesdropping coverage state needs to be calculated within each frame period. For ease of analysis, the total number of frames is set to

L = T / T_{L}

, where

T_{L}

represents the frame length. Next, we assume that the number of time slots contained in each coverage frame is

N_{L} = N / L

Once the UAV jammer j intends to access the potential eavesdropper

s_{k}

in time slot n, we set

c_{j, k} [n] = 0

. Otherwise,

c_{j, k} [n] = 1

. Meanwhile, we limit access to a maximum of one potential eavesdropper per time slot. This separation coverage evaluation mechanism effectively reduces computational overhead. Based on the above analysis, the associated variables of potential eavesdroppers must meet the following conditions:

\begin{matrix} \sum_{n = (l - 1) N_{L} + 1}^{l N_{L}} c_{j, k} [n] & = 1, \forall j, k; \end{matrix}

(1)

\begin{matrix} \sum_{k = 1}^{K} c_{j, k} [n] & \leq 1, \forall j, n; \end{matrix}

(2)

\begin{matrix} \sum_{j = 1}^{J} c_{j, k} [n] & \leq 1, \forall k, n; \end{matrix}

(3)

\begin{matrix} c_{j, k} [n] & \in {0, 1} . \end{matrix}

(4)

In the following, we employ

C_{s_{k}} \in {0, 1}

to denote the coverage state at

s_{k}

in time slot n, as given by

C_{s_{k}} [n] = \{\begin{matrix} 1, γ_{s_{k}} [n] \geq μ; \\ 0, otherwise, \end{matrix}

(5)

where

γ_{s_{k}}

denotes the signal-to-noise-plus-interference ratio (SINR) of

s_{k}

. This equation indicates that as long as

γ_{s_{k}}

is not lower than the system preset threshold

μ

s_{k}

will be considered as staying inside the coverage range.

Like in [17], we also use the reference signal receiving power (denoted by RP) to determine the value of the SINR, as given by

\begin{matrix} γ_{s_{k}} [n] & = \frac{RP (s_{k})}{\sum_{j = 1}^{J} q_{j, k} [n] g_{j, k} [n] - RP (s_{k}) + σ^{2}} \\ = \frac{\max (q_{j, k} [n] g_{j, k} [n])}{\sum_{j = 1}^{J} q_{j, k} [n] g_{j, k} [n] - \max (q_{j, k} [n] g_{j, k} [n]) + σ^{2}} . \end{matrix}

(6)

where

q_{j, k}

represents the transmission power of the j-th UAV jammer,

g_{j, k}

denotes the power gain of the jammer, and

σ^{2}

stands for the variance of additive white Gaussian noise (AWGN).

The channel power gain from the j-th jammer to

s_{k}

\begin{matrix} g_{j, k} [n] = & g_{0} d^{- 2} (θ_{j} [n], s_{k}) \\ = & \frac{g_{0}}{H^{2} + {∥ θ_{j} [n] - s_{k} [n] ∥}^{2}}, \end{matrix}

(7)

where

g_{0}

represents the channel power at a reference distance of 1 m.

Let

α_{m, u} [n]

represent the association coefficient between the m-th UAV BS and the u-th legitimate user, where

α_{m, u} [n] = 1

implies that the legitimate user u in time slot n is served by the m-th UAV; otherwise,

α_{m, u} [n] = 0

. Assume that each UAV has the ability to serve multiple targets simultaneously, while each target is exclusively served by only one UAV, i.e.,

α_{m, u} [n] \in {0, 1}, \forall m, u

\sum_{m = 1}^{M} α_{m, u} [n] = 1, \forall u \in U,

(8)

Considering the limited spectrum resources used by the system, UAVs adopt the principle of spectrum reuse to increase system capacity while ensuring that the interference they receive can be controlled at an acceptable level.

As described in [18], the data rate of the u-th legitimate user is given by

R_{m, u} [n] = α_{m, u} [n] \log (1 + γ_{m, u} [n]),

(9)

where

γ_{m, u} [n] = \frac{p_{m, u} [n] g_{m, u} [n]}{\sum_{m^{'} \in M ∖ m} p_{m^{'}, u} [n] g_{m^{'}, u} [n] + σ^{2}}

(10)

denotes the user’s SINR,

p_{m, u}

represents the transmit power of the UAV m, and

σ^{2}

stands for the noise variance at the receiver. The power gain of the u-th legitimate user is thus given by

\begin{matrix} g_{m, u} [n] & = g_{0} d^{- 2} (Θ_{m} [n], A_{u} [n]) \\ = \frac{g_{0}}{H^{2} + {∥ Θ_{m} [n] - A_{u} [n] ∥}^{2}} . \end{matrix}

(11)

Similarly, the data rate of eavesdropping is expressed as

R_{E, i}^{m, u} [n] = \log (1 + γ_{E, i}^{m, u} [n]),

(12)

where

γ_{E, i}^{m, u} [n] = \frac{p_{m, u} [n] g_{m, i} [n]}{I},

(13)

I = \sum_{m^{'} \in M ∖ m} p_{m^{'}, u} [n] g_{m^{'}, i} [n] + \sum_{j = 1}^{J} q_{j, i} [n] g_{j, i} [n] + σ^{2}

q_{j, i}

represents the transmit power of the i-th UAV jammer, and the power gains are

g_{m, i} [n] = g_{0} d^{- 2} (Θ_{m} [n], B_{i} [n]) = \frac{g_{0}}{H^{2} + {∥ Θ_{m} [n] - B_{i} [n] ∥}^{2}}

and

g_{j, i} [n] = g_{0} d^{- 2} (θ_{j} [n], B_{i} [n]) = \frac{g_{0}}{H^{2} + {∥ θ_{j} [n] - B_{i} [n] ∥}^{2}}

Following (9) and (12), the worst achievable average secrecy rate of the u-th legitimate user over a T-duration in the presence of eavesdroppers can be given by

R_{\sec}^{u} = \frac{1}{N} \sum_{n = 1}^{N} \sum_{m = 1}^{M} {[R_{m, u} [n] - max_{\forall i} R_{E, i}^{m, u} [n]]}^{+},

(14)

where

{[x]}^{+} = max {x, 0}

In the following, we focus on maximizing the minimum secrecy rate by optimizing the parameters, including trajectory planning, user association variables, and power distribution of UAVs. The optimization goal can thus be formulated as

(P 0) max_{α, Θ, θ, p, q} min R_{\sec}^{u} .

(15)

\begin{matrix} (1), (2), (3), (4), (8) \end{matrix}

(15a)

\begin{matrix} c_{j, k} [n] γ_{s_{k}} [n] \geq c_{j, k} [n] μ, \forall j, k \end{matrix}

(15b)

\begin{matrix} 0 \leq p_{m, u} [n] \leq p_{\max}, \forall u \in U \end{matrix}

(15c)

\begin{matrix} 0 \leq q_{j, i} [n] \leq p_{\max}, \forall j \in J, i \in I \end{matrix}

(15d)

\begin{matrix} 0 \leq q_{j, k} [n] \leq p_{\max}, \forall j \in J, k \in K, \end{matrix}

(15e)

where

α

denotes the user association variables;

p

and

q

represent the transmission power of the UAV BS and UAV jammer, respectively; and

Θ

and

θ

are the coordinates of the UAVs. For a given coverage-evaluating frequency, the coverage constraints for

s_{k}

are shown in (15b), and the power constraints are shown in (15c)–(15e).

It is evident that the optimization objective (15) involves a mixed-integer nonlinear non-convex problem, making it difficult to solve using traditional iterative optimization methods. Therefore, we transform the above optimization problem into a sequential decision-making problem and adopt a DRL-based approach to achieve the joint optimization of the user association variables, power allocation, and trajectory planning in jamming-enhanced secure UAV communication systems.

3. Deep Reinforcement Learning-Based Solutions for Joint Optimization

The nonlinearity and non-convexity of the optimization objective (15) pose significant mathematical challenges for solving the aforementioned joint optimization problem. Considering that trajectory planning, user association variables, and power allocation are all sequential decision problems, the above optimization process can be reconstructed as a Markov decision process (MDP). This section investigates single-agent and multi-agent DRL solutions.

3.1. The Single-Agent DRL Solution

Let

(S, A, R, γ)

represent the tuple of the MDP, where S denotes the state space, A corresponds to the action space, and R signifies the reward function. The long-term cumulative discounted reward can be expressed as

R (π) = E [\sum_{t = 1}^{T} γ^{t - 1} [R (s_{t}, a_{t}, s_{t + 1})]]

, where

γ \in [0, 1)

represents the discount factor. The constituent elements are defined as follows:

State space S: $s_{t} \in S$ represents the state during time slot t, encompassing the coordinates of the UAV BSs, the UAV jammers, and the legitimate users:

$\begin{matrix} s_{t} = {X_{t}, Y_{t}, U_{x, t}, U_{y, t}, I_{x, t}, I_{y, t}, K_{x, t}, K_{y, t}}, \end{matrix}$

(16)

where $(X_{t}, Y_{t})$ denotes the coordinates of all UAVs in time slot t, which is composed of the coordinates of UAV BSs $(M_{x, t}, M_{y, t})$ and the coordinates of UAV jammers $(J_{x, t}, J_{y, t})$ . $(U_{x, t}, U_{y, t})$ represents the coordinates of legitimate users. $(I_{x, t}, I_{y, t})$ and $(K_{x, t}, K_{y, t})$ are the coordinates of the active and potential eavesdroppers, respectively.
Action space A: $a_{t} \in A$ represents the action taken during time slot t, encompassing the user association variables, the allocation of power to both legitimate users and eavesdroppers, and the variations in UAV locations:

$a_{t} = {α_{t}, p_{t}, q_{t}, Δ X_{t}, Δ Y_{t}},$

(17)

where $p_{t}$ and $q_{t}$ denote the allocated power by communication UAVs and jamming UAVs, respectively, within time slot t. The flight displacement of UAVs is represented by $(Δ X_{t}, Δ Y_{t})$ , which is composed of $(Δ M_{x, t}, Δ M_{y, t})$ and $(Δ J_{x, t}, Δ J_{y, t})$ .
Reward function: The reward function $r_{t}$ , $r_{t} \in R$ comprises two components, i.e., the secrecy rate and the penalty of the coverage evaluation. The coverage evaluation of UAV jammers is denoted by $R_{c}$ , where $R_{c} = c_{j, k} [t] γ_{s_{k}} [t] - c_{j, k} [t] μ$ . Therefore, the reward function can be expressed as

$r_{t} = min_{u \in U} R_{\sec}^{u} + β R_{c},$

(18)

where $β$ denotes the penalty factor.

Single-agent DRL algorithms, such as the SAC algorithm and TD3 algorithm, can be used to solve this problem. Take the SAC algorithm as an example. The SAC algorithm aims to maximize the long-term cumulative discounted reward while maximizing the strategy entropy, as given by

max E \sum_{t = 1}^{T} γ^{t - 1} [r_{t} (s_{t}, a_{t}) - ρ \log π_{ϕ} (a_{t} | s_{t})]

. Here,

ρ

and

π_{ϕ}

represent the temperature parameter and actor network with parameter

ϕ

, respectively.

In the SAC algorithm framework, there exist two main critic networks, i.e.,

Q_{θ_{1}}

and

Q_{θ_{2}}

, with network parameter vectors

θ_{1}

and

θ_{2}

. There also exist two target critic networks, i.e.,

Q_{θ_{1}^{'}}

and

Q_{θ_{2}^{'}}

, with parameter vectors

θ_{1}^{'}

and

θ_{2}^{'}

. The purpose of the critic networks is to fit the soft Q-function of the agent. Furthermore, the stochastic actor network

π_{ϕ}

generates actions based on the state of the agent.

Figure 2 shows a diagram of the single-agent SAC algorithm for the jamming-enhanced secure UAV communication network. In each time slot, the agent interacts with the environment to generate a new experience tuple

(s_{t}, a_{t}, r_{t}, s_{t + 1})

, which is then stored in the replay memory buffer

B

. As time passes, the number of tuples in the replay buffer gradually increases until a sufficient number of samples are reached. In order to optimize the parameter vectors of the critic and actor networks, the system randomly samples a batch of tuples B to form the replay buffer

B

, that is,

{(s_{i}, a_{i}, r_{i}, s_{i + 1})}_{i = 1}^{| B |}

The critic network can be updated by minimizing

J (θ_{i}) = \frac{1}{| B |} \sum {[(Q_{θ_{i}} (s_{t}, a_{1}, \dots a_{t}) - y_{t})]}^{2},

(19)

where

s_{t}, a_{t} \in B

i = 1, 2

, and

y_{t}

denotes the target value of the main critic network in time slot t, which is given by

\begin{matrix} y_{t} = r_{t} + γ (\min_{i = 1, 2} Q_{θ_{i}^{'}} (s_{t + 1}, a_{t + 1}) \\ - ρ \log_{π_{ϕ}} ({\tilde{a}}_{t + 1} | s_{t + 1})), {\tilde{a}}_{t + 1} = π_{ϕ} (\cdot | s_{t + 1}) . \end{matrix}

(20)

The actor network can be updated according to

\begin{matrix} J (ϕ) = \frac{1}{| B |} \sum [\min_{i = 1, 2} Q_{θ_{i}} (s_{t}, a_{t}) \\ - ρ \log π_{ϕ} (a_{t} | s_{t})] . \end{matrix}

(21)

In addition, the temperature parameter

ρ

can be updated according to [10], and the target critic networks follow the soft update rule

θ_{i}^{'} \leftarrow ϵ θ_{i} + (1 - ϵ) θ_{i}^{'}, i = 1, 2

, where

ϵ

is the soft update parameter.

Note that the TD3 algorithm utilized in this paper is also a typical single-agent DRL solution and is similar to the SAC algorithm, with an off-policy actor-critic mechanism. The actor and critic networks are depicted in Figure 3, and their parameter updating follows existing studies [8,19]. For simplicity, the details of the TD3 algorithm are not elaborated further.

3.2. The Multi-Agent DRL Solution

In the multi-agent DRL solution, each UAV BS represents an agent. We use

(O, A, R, γ)

to denote the tuple of the MDP, where O represents the global observation of all agents. The main elements are explained as follows:

Observation space O: $o_{m, t} \in O$ represents the state of agent m during time slot t, $\forall m \in M \cup J$ . The local observation space $o_{m, t}$ mainly consists of the coordinates of UAVs, the coordinates of legitimate users, and those of the active and potential eavesdroppers:

$\begin{matrix} o_{m, t} = & {X_{x_{m}, t}, Y_{y_{m}, t}, U_{x_{m}, t}, U_{y_{m}, t}, \\ I_{x_{m}, t}, I_{y_{m}, t}, K_{x_{m}, t}, K_{y_{m}, t}}, \end{matrix}$

(22)

where $X \in M_{x, t} \cup J_{x, t}$ and $Y \in M_{y, t} \cup J_{y, t}$ .
Action space A: $a_{m, t} \in A$ represents the action of agent m during time slot t, and it is composed of the user associative variables, the allocation of power to both legitimate users and eavesdroppers, and the variations in UAV locations:

$\begin{matrix} a_{m, t} = & {α_{m, t}, p_{m, t}, q_{m, t}, Δ M_{x_{m}, t}, Δ M_{y_{m}, t}, \\ Δ J_{x_{m}, t}, Δ J_{y_{m}, t}} . \end{matrix}$

(23)
Reward function R: The reward function $r_{m, t}$ , $r_{m, t} \in R$ for the agent m comprises both the secrecy rate and the penalty of the coverage evaluation. The coverage evaluation of UAV jammers is denoted by $R_{c} = c_{j, k} [t] γ_{s_{k}} [t] - c_{j, k} [t] μ$ . The reward function can be written as

$r_{m, t} = min_{u \in U} R_{\sec}^{u} + β R_{c} .$

(24)

In this paper, we employ the MASAC algorithm to solve the problem, where each agent corresponds to a UAV and comprises two main critic networks (i.e.,

Q_{θ_{m, 1}}

Q_{θ_{m, 2}}

), two target critic networks (i.e.,

Q_{θ_{m, 1}^{'}}

Q_{θ_{m, 2}^{'}}

\forall m \in M \cup J

), and one actor network

π_{ϕ_{m}}

In the training process, the agent m is designed to maximize

E \{\sum_{t = 1}^{T} γ^{t - 1} [r_{m, t} (o_{m, t}, a_{m, t}) - ρ_{m} \log π_{ϕ_{m}} (a_{m, t} | o_{m, t})]\} .

Figure 4 shows a diagram of the MASAC algorithm for this jamming-enhanced secure UAV communication network. After each interaction with the environment, the experience tuple

(O_{t}, A_{t}, R_{t}, O_{t + 1})

is gradually generated and stored in B. To update the neural network parameters, a minibatch of experience tuples B is randomly sampled from

B

The MASAC algorithm follows the centralized training and decentralized execution mechanism. The critic network can be updated by minimizing the soft Bellman residuals:

J (θ_{m, i}) = \frac{1}{| B |} \sum {[(Q_{θ_{m, i}} (O_{t}, a_{1, t}, \dots a_{m, t}) - y_{m, t})]}^{2},

(25)

where

O_{t}, a_{m, t} \in B

i = 1, 2

, and

y_{m, t}

is the target value of the main critic network in time slot t, as given by

y_{m, t} = \frac{1}{| B |} \sum [r_{m, t} + γ V (Q_{t + 1})],

(26)

where

V (Q_{t + 1}) = {min}_{i = 1, 2} Q_{θ_{m, i}^{'}} (O_{t + 1}, \dots {\tilde{a}}_{m, t + 1}) - ρ_{m} log π_{ϕ_{m}} ({\tilde{a}}_{m, t + 1} | o_{m, t + 1})

and

{\tilde{a}}_{m, t + 1} \sim π (\cdot | o_{m, t + 1})

The actor network can be updated according to

\begin{matrix} J (ϕ_{m}) = \frac{1}{| B |} \sum [min_{i = 1, 2} Q_{θ_{m, i}} (O_{t}, a_{1, t}, \dots a_{m, t}) \\ - ρ_{m} log π_{ϕ_{m}} (a_{m, t} | o_{m, t})] . \end{matrix}

(27)

In addition, the target critic networks of each agent follow the soft update rule

θ_{m, i}^{'} \leftarrow ϵ θ_{i} + (1 - ϵ) θ_{m, i}^{'}, i = 1, 2

, where

ϵ

is the soft update parameter, and the temperature parameter

ρ_{m}

can be updated according to [20].

The pseudocode of the MASAC is presented in Algorithm 1.

Algorithm 1 MASAC algorithm for jamming-enhanced secure UAV communications

1:: For each ${UAV}_{m}$ , initial main network parameters $θ_{m, i}$ , $ϕ_{m}$ , set target network parameters $θ_{m, i}^{'}$ , ← $θ_{m, i}$ , $i = 1, 2$ , $\forall m \in M \cup J$ .
2:: for each episode do
3:: Initial the global observation $O_{t}$
4:: for $t \leftarrow 1, T$ do
5:: for $m \leftarrow 1, M \cup J$ do
6:: Select policy $a_{m, t} \sim π_{ϕ_{m}} (\cdot)$
7:: end for
8:: Execute actions $A_{t} = (a_{m, t}, \dots a_{m, t})$
9:: Observe reward $R_{t}$ and the next global observation $O_{t + 1}$
10:: Store the tuple $(O_{t}, A_{t}, R_{t}, O_{t + 1})$ in $B$
11:: $O_{t} \leftarrow O_{t + 1}$
12:: for $m \leftarrow 1, M \cup J$ do
13:: Sample minibatch B from $B$
14:: Update the main critic network parameter:
15:: $θ_{m, i} \leftarrow \nabla_{θ_{m, i}} J (θ_{m, i}), i = 1, 2$
16:: Update actor network parameter:
17:: $ϕ_{m} \leftarrow \nabla_{ϕ_{m}} J (ϕ_{m})$
18:: Update target critic network $θ_{m, i}^{'}, i = 1, 2$ following the soft update rule
19:: Update temperature parameter $ρ_{m}$ according to [20]
20:: end for
21:: end for
22:: end for

3.3. Computational Complexity Analysis

In this section, we investigate the optimization of the secrecy rate based on single-agent and multi-agent DRL methods and analyze their complexity. The complexity of these algorithms is determined by the neural network framework of the algorithms. In our DRL solution, both the critic and actor networks are four-layer fully connected networks, with an architecture consisting of one input layer, two hidden layers, and one output layer. Let

n_{a_{1}}

and

n_{a_{2}}

represent the number of nodes in hidden layer 1 and hidden layer 2 in the actor network, respectively. Meanwhile, let

n_{c_{1}}

and

n_{c_{2}}

represent the number of nodes in hidden layer 1 and hidden layer 2 in the critic network.

First, we analyze the complexity of the single-agent SAC algorithm. It can be deduced that both the state space and the action space have a dimension of

2 (M + J + U + I + K)

and

2 M U + J I + J K + 2 (M + J)

, respectively, corresponding to the number of input nodes

n_{i}

and output nodes

n_{o}

of the actor network, i.e.,

n_{i} = 2 (M + J + U + I + K)

and

n_{o} = 2 M U + J I + J K + 2 (M + J)

. According to [21], the time complexity at each training step of each actor network is given by

\begin{matrix} O_{a} = O ({n_{i}}^{2} + n_{a_{1}}^{2} + n_{a_{2}}^{2} + {n_{o}}^{2}) . \end{matrix}

(28)

Similarly, the time complexity of each critic network at each step can be calculated as

\begin{matrix} O_{c} = O ({(n_{i} + n_{o})}^{2} + n_{c_{1}}^{2} + n_{c_{2}}^{2}) . \end{matrix}

(29)

Considering all the main and target networks of the SAC algorithm at each step, the time complexity in the training process is

O_{train}^{SAC} = O_{a} + 4 O_{c}

During the testing process, only the actor network is used to determine the action interacting with the environment, and its time complexity depends only on the matrix multiplication complexity of the layers. The calculation can be expressed as

\begin{matrix} O_{test}^{SAC} = O (n_{i} * n_{a_{1}} + n_{a_{1}} * n_{a_{2}} + n_{a_{2}} * n_{o}) . \end{matrix}

(30)

Note that there are six neural networks in the TD3 algorithm, including two main critic networks, two target critic networks, one main actor network, and one target actor network. The time complexity at each step in the training process can be described as

O_{train}^{TD 3} = 2 O_{a} + 4 O_{c}

. During the testing process, as we only applied the optimal actor network to decide the actions, the time complexity of each step is approximately equivalent to that of the SAC algorithm.

For the MASAC algorithm, we can deduce that the local observation space has a dimension of

2 (1 + J + U + I + K)

, denoted by

n_{m, i}

\forall m \in M \cup J

, and the action space has a dimension of

2 M U + J I + J K + 2 (1 + J)

, denoted by

n_{m, o}

\forall m \in M \cup J

. Therefore, the time complexity for the m-th agent at each training step of each actor network is given by

O_{m, a} = O ({(n_{m, i})}^{2} + n_{a_{1}}^{2} + n_{a_{2}}^{2} + {(n_{m, o})}^{2})

. Similarly, the time complexity of each critic network at each step is

O_{m, c} = O ({(n_{m, i} + n_{m, o})}^{2} + n_{c_{1}}^{2} + n_{c_{2}}^{2})

Therefore, the time complexity for all agents in each training step is

O_{train}^{MASAC} = \sum_{m = 1}^{M + J} O_{m, a} + 4 \sum_{m = 1}^{M + J} O_{m, c} .

(31)

In the testing process, the time complexity of all agents can be calculated as

O_{test}^{MASAC} = \sum_{m = 1}^{M + J} O (n_{m, i} * n_{a_{1}} + n_{a_{1}} * n_{a_{2}} + n_{a_{2}} * n_{m, o}) .

(32)

In summary, the number of input nodes for an actor network is usually smaller than the number of input nodes for a critic network, implying that the testing process has comparatively lower computational complexity, whereas the training process has much higher computational complexity. Table 1 shows the computational complexity of different algorithms in the training and testing processes. Moreover, it can also be concluded that

\sum_{m = 1}^{M + J} O_{m, a} + 4 \sum_{m = 1}^{M + J} O_{m, c} ≫ O_{a} + 4 O_{c}

; therefore, the MASAC algorithm has higher time complexity than the SAC algorithm in the training process, that is,

O_{train}^{MASAC} ≫ O_{train}^{SAC}

4. Numerical Results

The simulation environment was set up in a square area with a side length of 1 km. Assume that legitimate users were randomly distributed throughout the entire area, and the positions of UAVs in the target area were randomly initialized at the beginning of the simulation. We considered a periodic coverage-assisted area comprising three UAV BSs, two UAV jammers, 20 legitimate users, two ground eavesdroppers, and five latent eavesdroppers. The flight period was set to

T = 50

s, the coverage evaluation frame length was

T_{L} = 20

s, the predetermined flight altitude of the drones was 150 m, and each UAV was associated with the nearest legitimate users. Moreover, the time slot length was set to

δ_{t} = 0.25

s, and the threshold for the coverage evaluation was set to

μ = - 3

dB. The reference channel power was

g_{0} = - 60

dB [22].

The experiments were simulated using Python v3.7, and the deep learning framework used was PyTorch. Both the critic and the policy networks were implemented as four-layer fully connected networks, with 128 neurons implemented in each hidden layer. Each episode comprised 50 time slots. Furthermore, the discount factor

γ

and the number of experience tuples

| B |

were set to

0.96

and 256, respectively, while the learning rates of the critic and actor networks were set to

0.0001

and

0.00001

, respectively.

Figure 5 shows the cumulative discounted return of the DRL algorithms versus the training episodes. It can be seen that the MASAC algorithm performed best in terms of convergence speed and cumulative discounted return compared to the other algorithms. The SAC algorithm demonstrated the best stability during the training phase and better cumulative discounted return performance than the TD3 algorithm. This is because the multi-agents in the MASAC algorithm have a better ability to explore and cooperate. On one hand, multiple agents explore different parts of the environment simultaneously, which helps the agents learn better policies compared to a single agent. On the other hand, multiple agents cooperate in their actions to achieve shared or individual goals more efficiently, and each agent can specialize in a role or a subset of tasks, leading to better performance. However, the MASAC algorithm had higher time complexity than the other algorithms, that is,

O_{train}^{MASAC} ≫ O_{train}^{SAC}

. The MASAC algorithm was quite time-consuming, which can be attributed to its centralized training and decentralized execution mechanism.

To verify the effectiveness of the DRL-based solutions, we saved the neural network parameters after each algorithm’s training was completed. Then, only the actor network was utilized to determine the action interacting with the environment and further calculate the corresponding secrecy rate in the testing process. Figure 6 shows the normalized average secrecy rate versus the number of time slots. It can be observed that the secrecy rate for each algorithm increased as the number of time slots increased, with the secrecy rate for the MASAC algorithm increasing by more than

6.6 %

and

14.2 %

compared to that of the SAC and TD3 algorithms, respectively. The simulation results reveal the validity of the DRL algorithms in finding the effective user association variables, UAV trajectory, and power allocation policy for the considered scenarios.

Finally, we studied the relationship between the normalized average secrecy rate and the number of eavesdroppers, as shown in Figure 7 and Figure 8. These experiments included three UAV BSs, two UAV jammers, and 20 legitimate users. In the solutions, we saved the parameters of their respective actor networks after each algorithm’s training was completed, then loaded them to decide on the variables in the testing process, and finally calculated the corresponding secrecy rate. It can be observed in Figure 7 that the secrecy rate tended to decrease as the number of eavesdroppers increased. The MASAC algorithm achieved the best secrecy rate compared with the other algorithms. In Figure 8, it can be seen that the number of latent eavesdroppers had less influence on the average secrecy rate. It can be deduced that the existing UAV jammers effectively ensured the secrecy rate in this scenario. Moreover, the MASAC algorithm achieved the best secrecy rate in terms of different numbers of latent eavesdroppers.

5. Conclusions

This paper investigated the problem of maximizing the minimum achievable secrecy rate in interference-enhanced UAV secure communication systems, with a focus on addressing challenges such as joint user association variables, power allocation, and UAV trajectory optimization. Due to the high complexity of solving the optimization objective function, we transformed the problem into a Markov decision process with low complexity and used both single-agent DRL and multi-agent DRL algorithms (including SAC, TD3, and MASAC) to solve these problems. The simulation results show that the MASAC algorithm is effective in accumulating discounted rewards but at the cost of higher time complexity during the training process. In contrast, the SAC algorithm performs the best in terms of stability, and its cumulative discounted reward is better than that of the TD3 algorithm.

In future work, we will explore secure transmission for integrated sensing and communication (ISAC)-enabled UAV networks. These single-agent and multi-agent DRL algorithms will be exploited to solve the user association variables, UAV trajectory planning, and power allocation problems in UAV-ISAC networks.

Author Contributions

Conceptualization, Z.X., Y.Q. and C.D.; methodology, Z.X. and Y.Q.; software, Z.X.; validation, C.D.; formal analysis, C.D. and Z.Z.; investigation, W.W.; resources, W.W.; data curation, Z.X.; writing—original draft preparation, Z.X.; writing—review and editing, Y.Q.; visualization, Y.Q. and W.W.; supervision, C.D. and Z.Z.; project administration, Z.Z.; funding acquisition, W.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the China Post-Doctoral Science Foundation (2023M740266) and the National Natural Science Foundation of China (No. 62071035).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Zhang, J.; Xu, J.; Lu, W.; Zhao, N.; Wang, X.; Niyato, D. Secure Transmission for IRS-Aided UAV-ISAC Networks. IEEE Trans. Wirel. Commun. 2024, 23, 12256–12269. [Google Scholar] [CrossRef]
Lin, Z.; Lin, M.; De Cola, T.; Wang, J.B.; Zhu, W.P.; Cheng, J. Supporting IoT with rate-splitting multiple access in satellite and aerial-integrated networks. IEEE Internet Things J. 2021, 8, 11123–11134. [Google Scholar] [CrossRef]
Yan, S.; Gu, Z.; Park, J.H.; Xie, X.; Dou, C. Probability-density-dependent load frequency control of power systems with random delays and cyber-attacks via circuital implementation. IEEE Trans. Smart Grid 2022, 13, 4837–4847. [Google Scholar] [CrossRef]
Zhou, Y.; Yeoh, P.L.; Chen, H.; Li, Y.; Schober, R.; Zhuo, L.; Vucetic, B. Improving Physical Layer Security via a UAV Friendly Jammer for Unknown Eavesdropper Location. IEEE Trans. Veh. Technol. 2018, 67, 11280–11284. [Google Scholar] [CrossRef]
Zhong, C.; Yao, J.; Xu, J. Secure UAV Communication With Cooperative Jamming and Trajectory Control. IEEE Commun. Lett. 2018, 23, 286–289. [Google Scholar] [CrossRef]
Yao, Y.; Zhao, J.; Li, Z.; Cheng, X.; Wu, L. Jamming and eavesdropping defense scheme based on deep reinforcement learning in autonomous vehicle networks. IEEE Trans. Inf. Forensics Secur. 2023, 18, 1211–1224. [Google Scholar] [CrossRef]
Min, M.; Yang, S.; Zhang, H.; Ding, J.; Peng, G.; Pan, M.; Han, Z. Indoor Semantic Location Privacy Protection with Safe Reinforcement Learning. IEEE Trans. Cogn. Commun. Netw. 2023, 9, 1385–1398. [Google Scholar] [CrossRef]
Zhang, Z.; Tian, J.; Wang, D.; Qiao, J.; Li, T. TD3-based Joint UAV Trajectory and Power optimization in UAV -Assisted D2D Secure Communication Networks. In Proceedings of the 2022 IEEE 96th Vehicular Technology Conference (VTC2022-Fall), London, UK, 26–29 September 2022; pp. 1–5. [Google Scholar]
Dong, R.; Wang, B.; Tian, J.; Cheng, T.; Diao, D. Deep Reinforcement Learning Based UAV for Securing mmWave Communications. IEEE Trans. Veh. Technol. 2022, 72, 5429–5434. [Google Scholar] [CrossRef]
Qin, Y.; Zhang, Z.; Li, X.; Huangfu, W.; Zhang, H. Deep Reinforcement Learning Based Resource Allocation and Trajectory Planning in Integrated Sensing and Communications UAV Network. IEEE Trans. Wirel. Commun. 2023, 22, 8158–8169. [Google Scholar] [CrossRef]
Zhou, X.; Xiong, J.; Zhao, H.; Liu, X.; Ren, B.; Zhang, X.; Wei, J.; Yin, H. Joint UAV Trajectory and Communication Design with Heterogeneous Multi-Agent Reinforcement Learning. Sci. China Inf. Sci. 2024, 67, 132302. [Google Scholar] [CrossRef]
Liu, Y.; Xiong, K.; Zhang, W.; Yang, H.C.; Fan, P.; Letaief, K.B. Jamming-enhanced Secure UAV Communications with Propulsion Energy and Curvature Radius Constraints. IEEE Trans. Veh. Technol. 2023, 72, 10852–10866. [Google Scholar] [CrossRef]
Wang, D.; Zhao, Y.; He, Y.; Tang, X.; Li, L.; Zhang, R.; Zhai, D. Passive Beamforming and Trajectory Optimization for Reconfigurable Intelligent Surface-Assisted UAV Secure Communication. Remote Sens. 2021, 13, 4286. [Google Scholar] [CrossRef]
Li, S.; Duo, B.; Di Renzo, M.; Tao, M.; Yuan, X. Robust Secure UAV Communications With the Aid of Reconfigurable Intelligent Surfaces. IEEE Trans. Wirel. Commun. 2021, 20, 6402–6417. [Google Scholar] [CrossRef]
Tong, Y.; Sheng, M.; Liu, J.; Zhao, N. Energy-efficient UAV-NOMA aided wireless coverage with massive connections. Sci. China Inf. Sci. 2023, 66, 222303. [Google Scholar] [CrossRef]
Cai, Y.; Wei, Z.; Li, R.; Ng, D.W.K.; Yuan, J. Joint Trajectory and Resource Allocation Design for Energy-Efficient Secure UAV Communication Systems. IEEE Trans. Wirel. Commun. 2020, 68, 4536–4553. [Google Scholar] [CrossRef]
Qin, Y.; Huangfu, W.; Zhang, H.; Long, K.; Yuan, J. Rethinking Cellular System Coverage Optimization: A Perspective of Pseudometric Structure of Antenna Azimuth Variable Space. IEEE Syst. J. 2020, 15, 2971–2979. [Google Scholar] [CrossRef]
Peng, H.; Shen, X. Multi-agent Reinforcement Learning Based Resource Management in MEC-and UAV-Assisted Vehicular Networks. IEEE J. Sel. Areas Commun. 2020, 39, 131–141. [Google Scholar] [CrossRef]
Fujimoto, S.; Van Hoof, H.; Meger, D. Addressing Function Approximation Error in Actor-Critic Methods. In Proceedings of the International Conference on Machine Learning (PMLR), Stockholm, Sweden, 10–15 July 2018; pp. 1587–1596. [Google Scholar]
Li, X.; Qin, Y.; Huo, J.; Huangfu, W. Heuristically Assisted Multiagent RL-Based Framework for Computation Offloading and Resource Allocation of Mobile Edge Computing. IEEE Internet Things J. 2023, 10, 15477–15487. [Google Scholar] [CrossRef]
Truong, T.P.; Tuong, V.D.; Dao, N.N.; Cho, S. FlyReflect: Joint Flying IRS Trajectory and Phase Shift Design Using Deep Reinforcement Learning. IEEE Internet Things J. 2022, 10, 4605–4620. [Google Scholar] [CrossRef]
Wu, Q.; Zeng, Y.; Zhang, R. Joint trajectory and communication design for multi-UAV enabled wireless networks. IEEE Trans. Wirel. Commun. 2018, 17, 2109–2121. [Google Scholar] [CrossRef]

Figure 1. The jamming-enhanced secure UAV communication deployment in the target area.

Figure 2. Diagram of the single-agent SAC algorithm for the jamming-enhanced secure UAV communication network.

Figure 3. Diagram of the agent in the single-agent TD3 algorithm.

Figure 4. Diagram of the MASAC algorithm for the jamming-enhanced secure UAV communication network.

Figure 5. The cumulative discounted reward versus the training episodes.

Figure 6. The normalized average secrecy rate versus the number of time slots.

Figure 7. The normalized average secrecy rate versus the number of ground eavesdroppers.

Figure 8. The normalized average secrecy rate versus the number of latent eavesdroppers.

Table 1. Computational complexity for different DRL algorithms in our considered scenarios.

Algorithm	Training Process	Testing Process
MASAC	$\sum_{m = 1}^{M + J} O_{m, a} + 4 \sum_{m = 1}^{M + J} O_{m, c}$	$\sum_{m = 1}^{M + J} O (n_{m, i} * n_{a_{1}} + n_{a_{1}} * n_{a_{2}} + n_{a_{2}} * n_{m, o})$
SAC	$O_{a} + 4 O_{c}$	$O (n_{i} * n_{a_{1}} + n_{a_{1}} * n_{a_{2}} + n_{a_{2}} * n_{o})$
TD3	$2 O_{a} + 4 O_{c}$	$O (n_{i} * n_{a_{1}} + n_{a_{1}} * n_{a_{2}} + n_{a_{2}} * n_{o})$

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Xing, Z.; Qin, Y.; Du, C.; Wang, W.; Zhang, Z. Deep Reinforcement Learning-Driven Jamming-Enhanced Secure Unmanned Aerial Vehicle Communications. Sensors 2024, 24, 7328. https://doi.org/10.3390/s24227328

AMA Style

Xing Z, Qin Y, Du C, Wang W, Zhang Z. Deep Reinforcement Learning-Driven Jamming-Enhanced Secure Unmanned Aerial Vehicle Communications. Sensors. 2024; 24(22):7328. https://doi.org/10.3390/s24227328

Chicago/Turabian Style

Xing, Zhifang, Yunhui Qin, Changhao Du, Wenzhang Wang, and Zhongshan Zhang. 2024. "Deep Reinforcement Learning-Driven Jamming-Enhanced Secure Unmanned Aerial Vehicle Communications" Sensors 24, no. 22: 7328. https://doi.org/10.3390/s24227328

APA Style

Xing, Z., Qin, Y., Du, C., Wang, W., & Zhang, Z. (2024). Deep Reinforcement Learning-Driven Jamming-Enhanced Secure Unmanned Aerial Vehicle Communications. Sensors, 24(22), 7328. https://doi.org/10.3390/s24227328

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Deep Reinforcement Learning-Driven Jamming-Enhanced Secure Unmanned Aerial Vehicle Communications

Abstract

1. Introduction

2. System Model and Problem Formulation

3. Deep Reinforcement Learning-Based Solutions for Joint Optimization

3.1. The Single-Agent DRL Solution

3.2. The Multi-Agent DRL Solution

3.3. Computational Complexity Analysis

4. Numerical Results

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI