2. Related Work
Most of the recent research in tracking-sensor networks focuses on improving the performance of sensors performing collaborative tracking tasks. Generally, the existing works are designed for single-target tracking or tracking of multiple homogeneous targets.
In [7], a dynamic chain-based collaboration (DCBC) method was proposed for efficient target tracking and data gathering. In DCBC, sensors form a dynamic tracking chain around the target, which can adjust its structure as the target moves. The sensing data collected by tracking nodes is aggregated locally at each time step. In this way, DCBC can reduce the energy consumption of target tracking and prolong the network lifetime.
In [8], an energy-aware task-scheduling approach (TOTD) was proposed to achieve a trade-off between energy efficiency and tracking performance. The authors utilized a true online reinforcement learning framework to obtain the optimal task-scheduling scheme for sensors to track targets. Simulation results showed that TOTD outperforms other methods for target-tracking applications.
Ramluckun et al. proposed a chain-cluster-based routing protocol in [9], in which the combination of clustering and ant colony optimization (ACO) was used to reduce transmission delay and obtain the optimal routing path. In simulations, this protocol was shown to be superior in terms of residual energy, latency, and load balancing, making it suitable for target tracking.
To address the problem of selecting the optimal sensors for target tracking, Anvaripour et al. proposed a reliable sensor-selection approach in [10]. They developed an updated unscented Kalman filter (U2KF) to improve the efficiency of target tracking through sensor selection. Moreover, multi-objective optimization was utilized to select a sensor set without knowing in advance how many sensors need to be selected. Sigma-points probability (SP) and target-trajectory (TT) approaches were utilized to eliminate the effects of noise during selection. The effectiveness and utility of this method were demonstrated in extensive experiments.
A dynamic cluster establishment scheme and a distributed consensus-based adaptive Kalman estimation method were proposed by Zhang et al. [11] to reduce energy consumption and improve estimation accuracy in target tracking.
In [12], a dynamic cluster member scheduling (DCMS) algorithm was proposed by Wu et al. to achieve a trade-off between energy efficiency and tracking performance; this method constructs a reward function based on weighted values. A point-based online value-iteration algorithm was then designed to schedule sensors to collaboratively track targets. In simulations, DCMS was shown to improve tracking accuracy and extend network lifetime.
Combined with the particle swarm optimization (PSO) algorithm, Pang and Shan [13] proposed an improved bee colony algorithm based on a double-roulette scheme to obtain the optimal sensor-scheduling scheme. Additionally, to reduce the target-missing risk and the sensor-radiation-interception risk, they designed a sensor-scheduling objective function that takes these risk factors into account.
The authors in [14] proposed a dynamic-clustering protocol (EEDC) for target tracking and continuous event monitoring. In EEDC, the overlapping cluster structure was dynamically reconstructed to adapt to the changing location of targets, and candidate cluster heads were selected according to rough fuzzy C-means (RFCM) and a genetic algorithm (GA). EEDC can significantly reduce system energy consumption compared with other clustering protocols.
The authors in [15] proposed a hierarchical-tracking structure based on edge intelligence (EI) technology. The structure can integrate the computing resources of sensors and edge servers to provide efficient computing for real-time tracking. Moreover, the authors designed a long-term dynamic resource allocation algorithm to obtain the optimal sensor-scheduling scheme. Simulation results demonstrated that the proposed scheduling method can enhance tracking accuracy and reduce energy consumption.
In [16], Liu et al. proposed a message-pruning tree with shortcuts, which can reduce the total cost of updating the database and querying targets compared with the message-pruning tree.
A tree-based multicast routing protocol was proposed in [17]. Based on residual energy and minimum-hop-count paths, this method improves several quality-of-service parameters, including energy consumption, end-to-end delay, and packet delivery rate, by updating the multicast tree structure.
Overall, most existing sensor-scheduling algorithms focus on improving the performance of tracking networks by studying how to organize and schedule nodes so that they perform tracking tasks collaboratively. Their goal is to select an optimal set of tracking nodes based on metrics such as sensor energy and distance to the target. In multi-target-tracking scenarios, however, these metrics alone are insufficient for scheduling. Because they do not account for differences in target properties or in the network status around each target, existing sensor-scheduling algorithms may allocate tracking resources among the targets in an unbalanced way when handling multi-target tracking.
For example, Figure 1 shows the scheduling result of sensors using the classical nearest-neighbor method. It can be seen that, according to the nearest-neighbor principle, a node that can sense both targets is assigned to serve as a tracking node for target 1. However, target 2 then has only two tracking nodes, and the system lacks sufficient localization information for target 2, so its tracking accuracy is seriously degraded. Moreover, the two tracking nodes of target 2 have to undertake heavy tasks, which may cause unbalanced energy consumption. In this case, target 2 should be given higher priority than target 1 when the tracking task of this node is scheduled. Besides the nearest-neighbor method, other algorithms that do not consider balanced resource allocation in scheduling encounter similar problems.
It can be seen from the example in Figure 1 that once such an unbalanced task allocation occurs in the sensor-scheduling process, the overall system performance degrades in terms of tracking accuracy and energy efficiency. Therefore, to address this issue, we propose a sensor-scheduling algorithm for multi-target tracking that takes the balance of tracking-task allocation into account (MTT-SS). The contributions of this paper are summarized as follows:
Taking advantage of the entropy-weight method (EWM), MTT-SS calculates the aggregated weight for each target according to four decision criteria, including target type, distance to the cluster center, node density, and energy level. After that, linear weighting is conducted to obtain the overall priority function to evaluate target priority. Prioritizing targets can help sensors select appropriate tasks to execute based on the current status of targets and the network.
To obtain a balanced resource scheduling result, a Q-learning-based task allocation mechanism is developed, in which nodes are mapped as intelligent agents. Each agent selects the appropriate task to execute by using an improved exploration–exploitation policy. At each time step, agents update their Q value according to the environment state and the delayed reward.
Extensive experiments and analyses of different solutions are presented in the simulations. The results indicate that MTT-SS significantly outperforms existing sensor-scheduling algorithms in terms of tracking accuracy and energy efficiency.
Following the related work given in Section 2, Section 3 describes the network model and the energy-consumption model. The proposed sensor-scheduling algorithm is presented in Section 4. Section 5 provides the simulation results. Finally, Section 6 concludes this paper.
4. Detailed Description of the Proposed Algorithm
The proposed sensor-scheduling algorithm for multi-target tracking is composed of three mechanisms: dynamic clustering, target prioritizing, and task allocation. In what follows, we elaborate on each of the three component mechanisms.
4.1. Dynamic Clustering
Clustering, the most commonly used network structure, is widely employed to reduce energy consumption and prolong network lifetime. In a clustering structure, sensors are classified into a cluster head (CH) and cluster members (CMs). The CH has stronger capabilities than the CMs: it collects data from its CMs, performs data fusion to remove transmission redundancy, and forwards the data to the sink node. Additionally, the CH provides the necessary target information to its member nodes. Therefore, in the first step of our scheduling algorithm, sensors are organized into clusters to facilitate collaborative data processing in the network. In the traditional clustering method, the sink node selects several sensors with strong capabilities as CHs and determines their subordinate CMs. However, this approach is not suitable in target-tracking scenarios, since most of the sensors in the network do not need to participate in the tracking tasks. Therefore, we employ a delayed-broadcasting mechanism in which the CH voluntarily recruits sensors to form a dynamic cluster. In this way, we reduce the number of nodes forming clusters in the network and minimize unnecessary energy consumption. First, each sensor that can sense a target sets a back-off timer and broadcasts its competing message when the timer expires. If a sensor receives a competing message from another sensor before its back-off timer expires, it cancels its timer. Specifically, the back-off time $T_i$ used by sensor $i$ is defined as [18,20]:
where $T_{\min}$ and $T_{\max}$ denote the minimum and maximum back-off timer values; $\bar{d}_i$ is the average distance from node $i$ to the targets; $E_{res}$ is the residual energy of the sensor; and $\lambda$ is a parameter that nondimensionalizes the second term in Equation (4). $\lambda$ is given in terms of the initial energy of the sensor, $E_0$, and the node sensing range, $R_s$, which denotes the maximum distance from the node to the targets. Moreover, the third term in Equation (4) is a random component.
According to Equation (4), the delay determined by the back-off timer value can ensure that the sensor with a shorter average distance to targets and more residual energy will broadcast the competing message earlier, and thus have a higher chance of serving as the CH.
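To make the cluster-head competition concrete, the following Python sketch simulates the delayed-broadcasting step under stated assumptions: the constants `T_MIN`, `T_MAX`, `E_INIT`, and `SENSE_RANGE`, as well as the exact combination of terms inside `backoff_time`, are illustrative choices rather than the paper's Equation (4); only the qualitative behavior described above (shorter average distance and higher residual energy yield an earlier broadcast) is preserved.

```python
import random

# Illustrative constants; the paper does not fix these values here.
T_MIN, T_MAX = 0.01, 0.10           # back-off bounds (s), assumed
E_INIT, SENSE_RANGE = 2.0, 30.0     # initial energy (J) and sensing range (m), assumed

def backoff_time(avg_dist, residual_energy, jitter=0.005):
    """Back-off delay in the spirit of Equation (4): sensors that are closer to the
    targets (small avg_dist) and have more residual energy wait less.
    The exact formula is an assumption; only the monotonic behavior is kept."""
    d_norm = min(avg_dist / SENSE_RANGE, 1.0)        # nondimensionalized distance
    e_norm = residual_energy / E_INIT                # nondimensionalized energy
    score = 0.5 * d_norm + 0.5 * (1.0 - e_norm)      # 0 = best candidate, 1 = worst
    return T_MIN + (T_MAX - T_MIN) * score + random.uniform(0.0, jitter)

def elect_cluster_head(candidates):
    """candidates: list of (node_id, avg_dist_to_targets, residual_energy).
    The node whose timer expires first broadcasts its competing message;
    the others cancel their timers, so it becomes the CH."""
    timers = {nid: backoff_time(d, e) for nid, d, e in candidates}
    return min(timers, key=timers.get)

if __name__ == "__main__":
    nodes = [("s1", 12.0, 1.8), ("s2", 25.0, 1.9), ("s3", 10.0, 0.6)]
    print("elected CH:", elect_cluster_head(nodes))
```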
4.2. Target Prioritizing
In this subsection, to manage the problem of unbalanced task allocation, we first prioritize the targets according to target properties and network status. We introduce four indicators as follows to evaluate the priority of targets:
- (1)
Target Type
For different types of targets, the system pays different levels of attention. For example, in military applications, the system attaches different importance to monitoring soldiers and armored vehicles because of their differences in speed and combat effectiveness. Therefore, it is necessary to distinguish the target type and assign an evaluation metric of its importance, which can be used as a reference for sensor scheduling. We use a normalized importance score to measure the importance of each target; its value for different target types is evaluated and normalized according to the specific application requirements, so the score lies in the interval [0, 1].
- (2)
Distance to cluster center
Distance to cluster center is obtained by calculating the Euclidean distance between the target and its corresponding cluster center. When this target moves to the boundary of its cluster, it may activate new clusters. In this case, the system needs a precise estimate of the target position to prevent target loss caused by activating the wrong clusters.
- (3)
Node Density
Node density indicates the distribution density of the nodes around the target. Because sensors are randomly deployed, there are areas of the network where nodes are sparsely distributed. When a target moves into such an area, the system should give it priority to ensure tracking reliability.
- (4)
Energy Level
Energy level indicates the residual energy of the nodes around the target. As the tracking system operates, some nodes may take on the tracking task multiple times, resulting in greater energy consumption than others. To ensure the reliability of the tracking process, a target whose surrounding nodes have low residual energy should be given a high tracking priority.
To determine the weights of the different indicators, we use EWM to calculate a comprehensive score for target priority. EWM is an objective weighting method based on the degree of data dispersion of each indicator [21]. EWM first uses the information-entropy principle to calculate the entropy value of each indicator and then corrects the entropy weight according to the data distribution of that indicator, so as to obtain objective indicator weights. Compared with subjective methods such as the analytic hierarchy process (AHP), EWM offers better accuracy and adaptability. EWM includes the following steps:
First, we normalize all the indicators using the extreme-value processing method. When an indicator has a positive effect, we have:

$y_{ij} = \dfrac{x_{ij} - x_j^{\min}}{x_j^{\max} - x_j^{\min}}$ (6)

When an indicator has a negative effect, we have:

$y_{ij} = \dfrac{x_j^{\max} - x_{ij}}{x_j^{\max} - x_j^{\min}}$ (7)

where $y_{ij}$ is the normalized value; $x_{ij}$ is the value of the $j$-th indicator of the $i$-th target; and $x_j^{\max}$ and $x_j^{\min}$ are the maximum and minimum values of the corresponding indicator, respectively. After normalization, the entropy $e_j$ of the $j$-th indicator is given by:

$e_j = -k \sum_{i=1}^{n} p_{ij} \ln p_{ij}, \quad p_{ij} = \dfrac{y_{ij}}{\sum_{i=1}^{n} y_{ij}}$ (8)

where $m$ is the number of indicators; $n$ is the number of targets; and we assume that $p_{ij} \ln p_{ij} = 0$ when $p_{ij} = 0$. The constant $k$ is given by

$k = \dfrac{1}{\ln n}$ (9)

and the function of $k$ in Equation (9) is to make the entropy value fall within (0, 1). We then define the information utility value, which measures the amount of useful information contained in an indicator. It can be seen from Equation (8) that, when the values of an indicator are equal for all samples, $e_j$ reaches its maximum value, which means that the data do not change, that is, the indicator carries little information. Thus, we reverse it to calculate the information utility value $g_j$ of the $j$-th indicator:

$g_j = 1 - e_j$ (10)
Then, we determine the weight of each indicator by normalizing the information utility value:

$w_j = \dfrac{g_j}{\sum_{j=1}^{m} g_j}$ (11)

where $w_j$ is the weight of the $j$-th indicator. Finally, the comprehensive evaluation of target priority $P_i$ is obtained by linear weighting:

$P_i = \sum_{j=1}^{m} w_j \, y_{ij}$ (12)

where the subscript $i$ denotes the $i$-th target.
When the values of an indicator differ greatly among the evaluated targets and its entropy is small, the indicator provides more useful information than the other indicators, and accordingly it receives a higher weight. By applying EWM, we avoid errors caused by the subjective bias of decision-makers and thus achieve high accuracy in weight identification.
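For illustration, the following Python sketch implements the EWM steps just described (extreme-value normalization, entropy, information utility, weights, and linear weighting into a priority score). The example indicator matrix and the choice of which indicators are treated as positive or negative are assumptions for demonstration only.

```python
import numpy as np

def ewm_priorities(X, positive):
    """X: (n_targets, m_indicators) matrix of raw indicator values.
    positive[j] is True if a larger value of indicator j should raise priority.
    Returns (indicator weights, per-target priority scores) via the standard EWM steps."""
    X = np.asarray(X, dtype=float)
    n, m = X.shape
    xmin, xmax = X.min(axis=0), X.max(axis=0)
    span = xmax - xmin
    # Extreme-value normalization, Equations (6) and (7).
    with np.errstate(divide="ignore", invalid="ignore"):
        Y = np.where(positive, (X - xmin) / span, (xmax - X) / span)
    Y = np.where(span > 0, Y, 1.0)            # a constant indicator carries no information
    # Entropy of each indicator, Equations (8) and (9), with 0*ln(0) taken as 0.
    P = Y / Y.sum(axis=0)
    plogp = P * np.log(np.where(P > 0, P, 1.0))
    e = -plogp.sum(axis=0) / np.log(n)
    # Information utility (10) and normalized weights (11).
    g = 1.0 - e
    w = g / g.sum()
    # Linear weighting gives each target's comprehensive priority, Equation (12).
    return w, Y @ w

if __name__ == "__main__":
    # Rows: targets. Columns: target type, distance to cluster centre, node density, energy level.
    X = [[0.9, 15.0, 8, 0.7],
         [0.4, 28.0, 3, 0.3],
         [0.6, 10.0, 6, 0.9]]
    # Larger type score / distance raise priority; denser or higher-energy surroundings lower it.
    w, prio = ewm_priorities(X, positive=[True, True, False, False])
    print("indicator weights:", np.round(w, 3))
    print("target priorities:", np.round(prio, 3))
```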
4.3. Task Allocation
To allocate tasks to sensors reasonably, we propose a Q-learning-based multi-target-tracking task allocation mechanism that makes use of the target priority information. In this subsection, we first map the sensor network to the Q-learning model. Then, we present the self-driven process of the mechanism and detail the setting of the reward function, which evaluates the performance of the strategy at each step in Q-learning. Finally, we improve the exploration–exploitation policy according to the principle of balancing task allocation.
4.3.1. Reinforcement Learning and Q-Learning Model
Reinforcement learning (RL) mainly studies how agents can maximize their rewards in an environment. RL consists of three parts: environment, action, and reward. After obtaining the state from the environment, the agent uses this state to output an action. After the action is executed, the environment outputs the next state and a delayed reward. The delayed reward is a scalar feedback signal sent by the environment to the agent, indicating how well the agent performs when adopting a strategy at a certain step. We focus on RL for two reasons. First, the RL model can adapt well to the complexity and uncertainty of multi-target-tracking scenarios. Second, RL is free from the artificial labels of supervised learning and lets agents explore optimal strategies by themselves, which may achieve performance beyond human capabilities.
Q-learning is a value-based, model-free RL algorithm in which agents do not need to estimate state transitions or formulate explicit policies but instead maintain a value table. The agent selects the most valuable action through this table. That is, the Q-learning model indicates the optimal action in the current state, but it does not predict which state the agent will move to next. Therefore, Q-learning is suitable for multi-target-tracking scenarios, where the environment state is difficult to estimate and the action set is discrete and limited. Moreover, Q-learning is more generalizable than model-based methods, which require modeling a virtual environment. Since our training process is describable and the state of the agent is discrete and observable, we do not need a state-transition function and can obtain good model performance from training samples. In our Q-learning-based model, the sensor nodes in the network are mapped to agents. Each agent observes the current state of the environment and performs an action according to the exploration–exploitation policy. After that, the environment moves to the next state and produces a reward that strengthens or weakens the agent's tendency to choose this action. The system chooses actions for the sensors based on the value table, following the principle of maximizing the agent's value function. Our proposed Q-learning-based task-scheduling mechanism is shown in Figure 2.
Our goal is to maximize the cumulative reward and obtain the optimal scheduling decision. The proposed model can be described as follows:
Each sensor node acts as an intelligent agent that makes its own decisions, and the tracking status of the targets in the network represents the environment in our model. Interaction between the agents and the environment is achieved by executing actions and receiving rewards. According to the reward of each action taken by the agent in the various states encountered during training, we build a Q table. As shown in Figure 3, the ordinate of the Q table is the state, the abscissa is the action, and each Q value is obtained from the Q value of the previous state and the estimated future reward, as given by Equation (13). Whenever the agent is in a certain state, it retrieves from the table the action with the highest estimated future reward; if there are multiple identical values, one of them is selected at random. This operation is repeated until the final state is reached.
- (1)
Action Space
The action space $A = \{a_0, a_1, \ldots, a_M\}$ collects all the targets that the sensors in the cluster can sense, where $M$ denotes the number of targets. When a sensor executes action $a_j$ ($1 \leq j \leq M$), it selects the $j$-th target to track. In particular, executing action $a_0$ indicates that the sensor does not track any target and goes to sleep.
- (2)
State Space
The environment states that the agent can be in at each time step constitute the state space $S$, with $s_t \in S$. In our model, $s_t$ denotes the set of targets in the surveillance area of the sensors that do not meet the tracking-quality requirements at the $t$-th time step.
- (3)
Reward
The reward obtained by the agent for executing the $j$-th action at the $t$-th time step is expressed as $r_t(s_t, a_j)$. The specific numerical setting of the reward function is given by Equation (14). The reward provides an evaluation of the action executed by the agent in a certain environment state. The reward function in our proposed model is designed to trade off the energy efficiency and tracking quality of the sensors in a multi-target-tracking scenario.
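The mapping above can be expressed in a few lines of Python. The representation below (a sleep action indexed 0, the state encoded as the set of under-tracked targets, and a dictionary-based Q table) is an illustrative choice consistent with the description, not the paper's exact implementation.

```python
from collections import defaultdict

M = 3                               # number of targets sensed by the cluster (example value)
ACTIONS = list(range(M + 1))        # action 0 = sleep, action j = track the j-th target

def encode_state(under_tracked):
    """State s_t: the set of targets that currently fail the tracking-quality
    requirement, made hashable so it can index the Q table."""
    return frozenset(under_tracked)

# Q table: Q[state][action] -> value, every entry starting at zero.
Q = defaultdict(lambda: defaultdict(float))

if __name__ == "__main__":
    s = encode_state({2, 3})        # targets 2 and 3 lack enough tracking nodes
    print(Q[s][0], Q[s][2])         # all entries are initially 0.0
```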
4.3.2. Learning Process
In the learning process, each sensor needs to maintain a
Q table, as shown in
Figure 3. Initially, all entries in the
Q table are zero. Sensors select an action from the action space to execute at each time step. The environment then moves to the next state based on the action taken by the agent. The update of the
Q values is calculated by the following rules:
where
is the learning rate that controls the learning speed;
is the discount factor, which we set to 0.8 to consider the expected reward in the near future. The reward function
for the agent is defined as
where $\omega$ is a weight coefficient; $E_{res}$ is the residual energy; $E_0$ is the initial energy; $R_s$ is the sensing range; $d_j$ is the distance to the $j$-th target; $N_{req}$ is the number of nodes required to track a single target; $N_j$ is the number of nodes that currently choose to track the $j$-th target; and $N_{un}$ is the number of targets that have not been allocated enough tracking nodes.
According to Equation (14), when the node executes action $a_j$ ($1 \leq j \leq M$), it chooses to track the $j$-th target. In this case, if $N_j \leq N_{req}$, the node is a qualified tracking node and obtains a positive reward; otherwise, the node is a redundant tracking node and receives a zero or negative reward. In this branch, the reward function consists of two parts, energy and distance: more reward is given to nodes that have more residual energy and are closer to the target.
On the other hand, when the node executes action $a_0$, it chooses to go to sleep. In this case, if $N_{un} = 0$, all targets meet the tracking requirements, it is reasonable for the node to sleep, and a positive reward is obtained. Otherwise, if $N_{un} > 0$, there are targets that do not meet the tracking requirements; the node should assist in the tracking task, so choosing to sleep yields a zero or negative reward. Since the node chooses to sleep, the reward function no longer includes tracking-related indicators, and more reward is given to nodes with low residual energy.
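The two branches just described can be summarized in code. The sketch below only mirrors the qualitative structure of Equation (14): the weight `omega`, the normalizations, and the magnitudes of the penalties are assumptions, since the paper's exact expression is not reproduced here.

```python
def reward(action, residual_energy, initial_energy, sense_range,
           dist_to_target, n_required, n_choosing, n_under_tracked, omega=0.5):
    """Qualitative sketch of the reward structure of Equation (14).
    action > 0 means 'track that target'; action == 0 means 'go to sleep'."""
    e_norm = residual_energy / initial_energy              # residual energy in [0, 1]
    if action > 0:
        d_norm = min(dist_to_target / sense_range, 1.0)    # distance in [0, 1]
        base = omega * e_norm + (1.0 - omega) * (1.0 - d_norm)
        # Qualified tracking node (target still needs nodes): positive reward.
        # Redundant tracking node (quota already filled): non-positive reward.
        return base if n_choosing <= n_required else -base
    # Sleep branch: positive only if every target already has enough tracking nodes;
    # low-energy nodes are rewarded more for resting and punished less for sleeping.
    return (1.0 - e_norm) if n_under_tracked == 0 else -e_norm

if __name__ == "__main__":
    print(reward(1, 1.5, 2.0, 30.0, 10.0, 4, 3, 1))   # qualified tracker -> positive
    print(reward(0, 0.4, 2.0, 30.0, 0.0, 4, 0, 2))    # sleeping while targets need help -> negative
```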
At each time step, the agent either chooses the most rewarding action in the Q table or randomly selects an action to execute, according to the exploration–exploitation strategy. This process repeats until the maximum number of iterations is reached. Through this iterative calculation, each agent learns the Q values of performing different actions in different environment states and updates its Q table accordingly. After the iteration of the Q model is completed, we simply look up the final Q table to obtain the agent's optimal policy $\pi^*$, that is, the optimal scheduling strategy for sensor resource allocation in the multi-target-tracking scenario. $\pi^*$ can be expressed as

$\pi^*(s) = \arg\max_{a \in A} Q(s, a)$ (15)
4.3.3. Exploration–Exploitation Policy
The exploration–exploitation policy has a direct effect on the learning efficiency of Q-learning [22]. Most of the existing research uses the $\varepsilon$-greedy policy, in which the exploration probability $\varepsilon$ is given by:
where $\varepsilon_{\max}$ and $\varepsilon_{\min}$ are the upper and lower boundaries of the exploration factor, respectively; $c$ is a fixed constant; $N_S$ is the maximum number of states; and $n_S$ is the number of states already known. At each time step, a random action is selected with exploration probability $\varepsilon$, and the optimal action is selected with exploitation probability $1-\varepsilon$. However, this strategy cannot provide meaningful guidance for the agent to choose actions based on the real-time allocation of sensor resources in the network. To manage this issue, we propose, with the use of target priority information, a novel exploration probability $\varepsilon'$:
where $P_{\max}$ and $P_{\min}$ are the maximum and minimum values of the target priority. Note that a more reasonable task allocation yields more balanced priorities among the targets in real time. Early in the learning process, tasks are unevenly allocated and the maximum difference in target priorities is large, so the agent is encouraged to perform more exploratory actions. As the iterative learning proceeds, the resource allocation scheme learned by the agents gradually becomes more balanced, so $\varepsilon'$ should decrease so that more greedy actions are taken to complete the convergence of the final Q table.
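As an illustration of the behavior described above, the sketch below computes an exploration probability that grows with the spread between the highest and lowest target priority and shrinks as the priorities even out. The linear form and the bounds `eps_min` and `eps_max` are assumptions; the paper's exact expression is Equation (17).

```python
def exploration_probability(priorities, eps_min=0.05, eps_max=0.9):
    """Priority-gap-driven exploration probability in the spirit of Equation (17):
    a wide spread between the highest and lowest target priorities indicates an
    unbalanced allocation, so the agent should explore more."""
    p_max, p_min = max(priorities), min(priorities)
    if p_max <= 0:
        return eps_min
    gap = (p_max - p_min) / p_max            # normalized imbalance in [0, 1]
    return eps_min + (eps_max - eps_min) * gap

if __name__ == "__main__":
    print(exploration_probability([0.9, 0.2, 0.4]))     # unbalanced -> explore more
    print(exploration_probability([0.50, 0.48, 0.52]))  # balanced -> mostly exploit
```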
The pseudocode of the proposed multi-target-tracking task allocation mechanism is shown in Algorithm 1.
Algorithm 1. The Q-learning based task allocation mechanism. |
Input: sensor nodes, targets, state space , action space , Q table |
Output: The optimal policy for sensor scheduling |
1: Parameter Initialization: maximum number of iterations , learning rate , discount factor |
2: Initial Q value in Q table is 0 for each node |
3: repeat |
4: for each node before do |
5: Evaluate the environment state at time stept |
6: Select an exploitation or exploration action according to Equation (17) |
7: if random(0, 1) < then |
8: Exploration: node choose a random action |
9: else |
10: Exploitation: node choose the action |
11: end if |
12: The corresponding delayed reward is calculated based on Equation (14) |
13 Observe the next state and update value as Equation (13): |
14: . |
15: end for |
16: The optimal policy for sensor scheduling can be obtained by: |
17: |
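To tie Algorithm 1 together, the following self-contained Python sketch runs tabular Q-learning for a single agent against a toy environment. The environment dynamics, the reward values, the learning rate, the iteration count, and the exploration schedule are simplified stand-ins for Equations (13), (14), and (17); only the overall control flow of Algorithm 1 is retained, and the discount factor 0.8 follows the text.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, MAX_ITER = 0.1, 0.8, 2000   # learning rate and iteration count are arbitrary
M = 2                                     # two targets sensed by the cluster (toy setting)
ACTIONS = list(range(M + 1))              # 0 = sleep, 1..M = track that target

Q = defaultdict(lambda: defaultdict(float))

def toy_env_step(state, action):
    """Toy environment: the state is the frozenset of under-tracked targets.
    Tracking an under-tracked target helps (+1) and removes it from the set;
    sleeping is fine (+0.5) only when nothing is under-tracked, otherwise -1."""
    under = set(state)
    if action in under:
        under.discard(action)
        r = 1.0
    elif action == 0:
        r = 0.5 if not under else -1.0
    else:
        r = -0.5                          # tracking a target that is already covered
    return frozenset(under), r

def epsilon(state):
    """Stand-in exploration schedule: explore more while more targets are uncovered."""
    return 0.1 + 0.8 * len(state) / M

for _ in range(MAX_ITER):
    state = frozenset(random.sample(range(1, M + 1), k=random.randint(0, M)))
    for _step in range(M + 1):
        if random.random() < epsilon(state):
            action = random.choice(ACTIONS)                    # exploration
        else:
            action = max(ACTIONS, key=lambda a: Q[state][a])   # exploitation
        next_state, r = toy_env_step(state, action)
        best_next = max(Q[next_state][a] for a in ACTIONS)
        # Standard tabular Q-learning update (Equation (13)).
        Q[state][action] += ALPHA * (r + GAMMA * best_next - Q[state][action])
        state = next_state

policy = {s: max(ACTIONS, key=lambda a: Q[s][a]) for s in Q}    # Equation (15)
print({tuple(sorted(s)): a for s, a in policy.items()})
```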
In general, the operation of our proposed improved Q-learning-based sensor-scheduling algorithm is as follows. First, in Section 4.1, based on distance and energy information, sensors use a delay-based broadcasting mechanism to select the CH. The CH provides the sensor nodes, i.e., the intelligent agents, with the environment information and the reward computation required by the Q-learning model. In Section 4.2, we propose four indicators according to target properties and network status and use EWM to calculate the tracking priority of the targets. In Section 4.3, we detail the process of using the Q-learning model for task allocation; the reward function and the exploration–exploitation policy are designed to schedule nodes reasonably, thus achieving a balanced allocation of tracking tasks among targets. The overall flow chart of the proposed sensor-scheduling algorithm, MTT-SS, is shown in Figure 4.