Proceedings of the 2015 Winter Simulation Conference
L. Yilmaz, W. K. V. Chan, I. Moon, T. M. K. Roeder, C. Macal, and M. D. Rossetti, eds.
SIMULATION-DRIVEN TASK PRIORITIZATION USING
A RESTLESS BANDIT MODEL FOR ACTIVE SONAR MISSIONS
Cherry Y. Wakayama
Zelda B. Zabinsky
Maritime Systems Division
SPAWAR Systems Center Pacific
San Diego, CA 92152, USA
Industrial & Systems Engineering Department
University of Washington
Seattle, WA 98195, USA
ABSTRACT
We consider a task prioritization problem of an active sonar tracking system when available ping resources
may not be sufficient to sustain all tracking tasks at any particular time. In this problem, the time-varying
conditions of a tracking task are represented by a finite-state discrete-time Markov decision process. The
objective is to find a policy which decides at each time interval which tracking tasks to perform so as
to maximize the aggregate reward over time. This paper addresses the derivation of the Markov chain
parameters from the sonar tracking system simulations, the establishment of task prioritization as a restless
bandit (TPRB) problem, and the TPRB policy obtained by a primal-dual index heuristic based on a first-order
linear programming relaxation to the TPRB problem. The superior performance of the resulting TPRB
policy is demonstrated using Monte Carlo simulations on various multi-target scenarios.
1
INTRODUCTION
Active sonar systems detect underwater objects of interest by transmitting acoustic waveforms, or pings, from
sources and detecting the echoes with receivers. Active sonar systems are especially employed in military
applications, and can carry out a variety of tasks, such as surveillance and multi-target tracking (Grimmett
and Coraluppi 2006). This paper describes an active sonar system composed of spatially distributed sonar
nodes or co-located source/receiver pairs for multi-target tracking as illustrated in Figure 1. Since the sonar
tracking system contains numerous transmit/receive nodes, and the number of simultaneous transmits from
multiple sources is limited due to potential interference and communications and processing constraints,
effective sonar resource management is required to use the system capabilities efficiently. There are
generally two related aspects of resource management for a sonar tracking system: scheduling and task
prioritization. Scheduling decides how to commit resources (sources, waveforms and time slots) between a
given set of tasks so as to best accomplish the tasks. An adaptive ping scheduling approach for active sonar
systems for multi-objective anti-submarine warfare (ASW) area search and tracking missions is developed
by Wakayama et al. (2015). However, in certain operating conditions, there may not be enough resources
to perform all the required tasks without compromising quality, e.g., the number of tracking tasks on hand
is greater than the number of orthogonal waveforms that the system can transmit simultaneously without
interference. Task prioritization determines the set of active tasks to focus on in such instances, and becomes
critical to the performance of the sonar resource management and to subsequent mission success. In this
paper, we study the dynamic task prioritization problem associated with the allocation of pings to tracking
tasks over time.
The dynamic task prioritization problem fits naturally into the framework of multi-armed bandits
(Mahajan and Teneketzis 2008). In this problem, tracking tasks may be viewed as bandits. At each
ping interval, ping(s) are allocated to selected bandit(s) to generate reward. In particular, this problem is
characterized as a restless bandit problem because track quality deteriorates when the tracking task is not
978-1-4673-9743-8/15/$31.00 ©2015 IEEE
3725
Wakayama and Zabinsky
Figure 1: Multi-target tracking scenario for an active sonar system.
being processed (Whittle 1998). With the assumption of the utilization of a multi-target tracker in the sonar
tracking system, we define a tracking task as a target which is previously detected and associated in the
tracker. The tracking task states represent the quantized areas of uncertainty (AOU) owing to track quality
during the track updates in the tracker. We assume that the tracking task states are perfectly observable
with the utilization of a tracker. We also assume that the tracking task states evolve independently of
each other. At each instance of time or ping interval, tasks are selected to which pings must be allocated,
and the sonar tracking system collects a reward as a function of the states. When a track is selected for
prioritization (i.e., a ping is allocated for the selected track), information on its position and velocity is
obtained if the ping results in a detection, and hence it may result in an improved AOU state (smaller
AOU results in faster target acquisition time). For the tracks that are not selected, information is being
lost, because the targets will certainly be performing evasive maneuvering. The tracking tasks continue to
change states whether they are being selected or not. Each tracking task has potentially different state space,
and the state transition rules for each task with each prioritization decision will be different in general.
Considering complexity in active sonar tracking systems, the Markovian transition rules with respect to
the binary prioritization decisions on a tracking task are derived from sonar tracking system simulations.
The complex dynamics of a sonar tracking system are evident from maneuvering targets, and imperfect
and noisy sensors whose detection probability is dependent on acoustic environment conditions, target
states and waveform characteristics. In this instance, simulation is very effective in modeling and capturing
dynamic effects of a sonar tracking system. Monte Carlo track samples are generated and the statistics about
the AOU state evolutions associated with a tracking task are obtained by analyzing the resulting simulated
data sets. Extracting efficient high-level models from complex dynamic processes is highly desirable for
real-time control applications such as dynamic task prioritization for sonar resource management.
The estimated Markov transition matrices are utilized to formulate the task prioritization as a restless
bandit (TPRB) problem. The TPRB problem is then solved using a primal-dual heuristic index algorithm
based on a first-order linear programming relaxation (Bertsimas and Niño-Mora 2000). The approach to
solving the TPRB problem is referred to as the TPRB scheme in this paper. The TPRB scheme provides
a feasible policy with a guarantee for its suboptimality, and has been demonstrated numerically to have
excellent performance. The tracker provides updates on tracking tasks at regular intervals, and the TPRB
3726
Wakayama and Zabinsky
problem is solved repeatedly with an updated initial condition to derive task prioritization policy throughout
the mission duration.
This paper proceeds as follows. Section 2 describes a general active sonar tracking system model and
its submodules focusing on tracking tasks. Section 3 discusses the modeling of individual tracking task
dynamics using a finite-state discrete-time Markov chain from the simulated data sets. Section 4 presents a
restless bandit model formulation for the dynamic task prioritization problem. Section 5 explains a solution
approach to the TPRB problem based on a first-order linear programming relaxation. Section 6 compares
the performance of the resulting TPRB policy with the conventional policies including a round robin policy
and a greedy policy on various multi-target scenarios, and Section 7 provides insights and conclusions.
2
SYSTEM MODEL
A general active sonar tracking system model is shown in Figure 2, and is used for simulation and analyzing
the task prioritization techniques. In this closed-loop architecture, information on targets is obtained by
active sonar measurements and processed by the multi-target tracker and the resource management modules
(task prioritization and scheduling) to derive effective ping decisions, and the ping decisions are then fed
back to the sonar system for improved tracking. In this section, every module of the system model is
briefly described.
Figure 2: Active sonar tracking system model.
Scenario
A field of multiple sensor nodes is positioned in an acoustic environment of interest and its operating region
is identified. In this study, the scenario is formed with tracking tasks of N targets. The target tracks are
sampled from a set of randomly-generated tracks. Tracks are generated with random headings (uniform
between −180◦ and 180◦ ), random speeds (uniform between minimum and maximum assumed speeds) and
random initial positions (uniform within the operating region). A discrete-time, nearly-constant-velocity
model with assumed motion noise is used to update each track for a specified scenario time window.
Sonar Performance Model
The sonar performance model computes mean signal-to-noise ratios (SNR) or signal excesses (SE) for a
given sensor field configuration consisting of sources, receivers and moving targets, and is an essential
element of the ping decision derivation and the sonar system simulation. Detection probability predictions
can then be computed from the SE values. For this analysis, mean SE values are obtained using Signal
Systems Corporation’s Multistatic Sonar Performance Model software (MSPM 2010).
3727
Wakayama and Zabinsky
Ping Contact Generator
A contact generator is required in order to test and evaluate sensor management techniques. This module
provides target contacts for each receiver, given a waveform transmission from a particular source. Target
contacts are derived from simulated tracks, and contain information including SNR, bearing, arrival time for
ranging, and range-rate if available. They are modeled by obtaining mean SE from the sonar performance
model, and adding a random fluctuation term from an assumed Gaussian distribution, along with assumed
measurement (bearing, time, and range-rate) errors. Target contacts are generated when the resulting SE
is positive. For this study on track prioritization scenarios, it is assumed that false contacts have already
been removed by the multi-target tracker and hence they are not included in the ping contact generator.
Multi-target Tracker
The purpose of the tracker is to estimate the positions and velocities of targets of interest from the detection
contacts. This process requires the removal of false contacts, the association of true contacts, and the fusion
of contact information. There is a large literature on sensor data fusion and target tracking (Bar-Shalom
et al. 2011). An appropriate tracker implementation must balance the tradeoff between complexity and
quality depending on the applications. In this paper, a simple Kalman filter based tracker is implemented.
The track outputs of the tracker are represented by their expected 2-dimensional positional and velocity
state vectors and covariance matrices. Their AOU states are calculated from the 2-dimensional positional
covariance matrix.
Task Prioritization
The task prioritization module dynamically determines the set of active tracks to focus on at any particular
ping interval in order to optimize the overall sonar system performance. Task prioritization becomes critical
as the number of tracking tasks on hand becomes higher and available resources may not be sufficient to
sustain all tracks. The number of active tracking tasks that can be scheduled at each ping time interval is
constrained by the number of orthorgonal waveforms available for the sonar system as well as processing
and communications limitations. The main focus of this paper is dynamic task prioritization, and an
optimization problem formulation for the task prioritization problem is discussed in detail in Section 4.
Scheduling
The scheduling module together with the task prioritization module control the performance of a sonar
tracking system. Given the active tracking tasks selected by the task prioritization module at each ping
time interval, the scheduling module provides the ping decisions on which source/waveform pairs are best
to ping next. The performance is measured by the instantaneous detection probability and the resulting
AOU of each active track. The scheduling strategy is to choose a source/waveform pair for each active
tracking task that will result in a minimum AOU and a detection probability of greater than 0.5.
3
TRACKING TASK DYNAMIC MODEL FROM SYSTEM SIMULATION
We seek to predict the next AOUs of a given tracking task with respect to the binary prioritization decisions
on the task using a Markov chain. Since it is simply not feasible to obtain real-world data, the AOU data
required for Markov chain construction are generated from Monte Carlo track samples using the sonar
tracking system model (Section 2). We then apply Markov chain calculations to each of the simulated data
sets corresponding to each of the prioritization decisions to obtain a finite transition probability matrix with
the states representing the quantized AOUs. These Markov chains represent a priori statistics about the
tracking task dynamic processes for the task prioritization optimization problem discussed in Section 4.
3728
Wakayama and Zabinsky
3.1 AOU Data Set Generation
The sonar tracking system simulation framework for AOU data generation consists of a discrete event
simulation model with an embedded optimization module or scheduling module, a sonar performance model,
a ping contact generator and a multi-target tracker (Section 2). The user can modify sensor configurations,
environment conditions and simulation parameters in the scenario module. A specified number of Monte
Carlo tracks are generated with random headings, speeds and initial positions, and assumed motion noise.
For each sampled track at each ping time interval, two data pairs, (δ , δ 1 ) and (δ , δ 0 ), are obtained with
binary track prioritization decisions, namely active and passive (we identify superscript 1 with active (i.e.,
the track is selected and a ping is allocated for the track), and superscript 0 with passive (i.e., the track
is not selected and is not considered in the ping decision generation)). The AOU of the sampled track
just before the prioritization decision is denoted by δ , and calculated from the current predicted positional
covariance matrix. With the active decision on the sampled track at any given interval, the scheduling
module provides a ping (source/waveform pair) optimized for the track, the ping contact generator provides
a target contact (if any) and the multi-target tracker updates its position and velocity estimates from the
contact measurement. The active AOU, δ 1 , is then calculated from the updated positional covariance
matrix. With the passive decision, no contact is generated. The multi-target tracker propagates its position
and velocity estimates and the passive AOU, δ 0 , is calculated. Typically, the AOU grows rapidly in time
without successive measurements. Grouping those data pairs according to the active or passive decision
results in two data sets, denoted by D1 and D0 respectively.
3.2 Markov Transition Matrices
We calculate a state transition matrix from each data set, thereby obtaining two transition matrices denoted
by P1 and P0 corresponding to the data sets D1 and D0 respectively. A key step when creating a Markov chain
is determining the appropriate quantization of states. To identify the number and quantization intervals,
we analyze the data as well as consider the operational requirements of the sonar tracking system. In some
tracking applications, it may be desirable to maintain the AOUs of the targets of interest below a certain
threshold as dictated by the system requirements. When the AOU becomes greater due to successive misses,
a track may be considered lost. In this paper, the quantization of AOU states is done by analyzing detection
patterns and resulting AOUs. Histograms are useful in ensuring there are enough data points available to
capture the state transitional behavior and establishing the quantization intervals or bin ranges. Once the
intervals are created, each of the data pairs are labeled with the intervals to which they correspond. After
the data pairs are labeled, the number of transitions from one interval to the next is counted and empirical
state transition probabilities are calculated yielding the transition matrices P1 and P0 . The state quantization
for the examples in this paper is shown in Table 1. Figure 3 includes three histograms illustrating the
distributions of the initial and next AOU states obtained from the simulated data sets. Figure 3a corresponds
to the values of δ , and Figures 3b and 3c correspond to the values of δ 1 and δ 0 respectively. It is observed
that with the active decision the mass of the distribution shifted towards the smaller AOU states and with
the passive decision towards the larger AOU states.
Table 1: State Quantization
States
AOU ranges (km2 )
States
AOU ranges (km2 )
States
AOU ranges (km2 )
1
2
3
4
5
(0,0.35]
(0.35,0.5]
(0.5,0.7]
(0.7,1]
(1,1.8]
6
7
8
9
10
(1.8,3]
(3,5]
(5,8]
(8,12]
(12,17]
11
12
13
14
15
(17,27]
(27,40]
(40,62]
(62,100]
> 100
3729
Wakayama and Zabinsky
(a)
(b)
(c)
Figure 3: Histograms of the data sets used in the simulation examples: (a) initial AOU data {δ }, (b) next
AOU data with the active decision {δ 1 } and (c) next AOU data with the passive decision {δ 0 }.
4
RESTLESS BANDIT FORMULATION
In this section, we provide a brief description of key features of the classical multi-armed bandit (MAB)
problem. We then formulate the task prioritization as a variant of the classical MAB problem, known as
restless bandit problem, and refer to it as the TPRB problem.
Multi-armed bandit problems are a class of sequential resource allocation problems concerned with
allocating one or more resources among several competing projects (Mahajan and Teneketzis 2008). In
the classical MAB problem at each instant of time a single resource is allocated to one of many competing
projects. The project to which the resource is allocated changes its state according to a random dynamical
process with known statistical descriptions and generates a reward. The remaining projects remain frozen
and contribute no reward. Projects are assumed independent. Gittins (1979) showed that an optimal solution
is of the index type. The Gittins indices can be computed separately for each project (or bandit) and thus,
finding an optimal policy reduces to selecting the project with the highest index.
The task prioritization problem can be seen as a variation of the classical MAB problem. In the
following we introduce the notation and provide the formal problem definition. We consider a collection
of N tracking tasks (bandits), labeled n ∈ N = {1, . . . , N}. At ping interval k, task n can be in one of the
finite number of states ikn ∈ Sn . Let akn ∈ {0, 1} be the action for task n at ping time k. If task n in state ikn
is selected (i.e., akn = 1), then an active reward R1ik is earned, and its state changes in a Markovian fashion,
n
according to an active transition probability matrix P1 into state jnk+1 with probability p1ik jk+1 . If the task
n n
is not selected (i.e., akn = 0), then a passive reward R0ik is received, and its state then changes according to
n
a passive transition probability matrix P0 into state jnk+1 with probability p0ik jk+1 . The tracking tasks are
n n
characterized as restless bandits because each track’s AOU state evolves over time even when it is not
being selected (Whittle 1998).
In the TPRB problem, the system reward represents the value assigned according to the track quality
(smaller AOU states correspond to higher track quality). If task n is selected or active, information may
be updated depending on whether the ping will result in a detection or not. This probability is captured
by P1 . When the state of the n-th task is ikn , we suppose that the active reward is the mean track quality
value, which can be represented as
(1)
R1in = ∑ p1in jn Vin ,
jn ∈Sn
3730
Wakayama and Zabinsky
where Vin is the track quality value assigned for being in state in , which is pre-specified by the system
operator. Similarly, the passive reward is defined as
R0in =
p0in jn Vin .
∑
(2)
jn ∈Sn
In this formulation, rewards are time discounted by a discount factor β , where 0 < β < 1.
At each ping interval k = 0, 1, . . . , K, exactly M tasks must be selected, i.e., ∑Nn=1 akn = M, where
1 ≤ M < N, and M is determined by the number of orthogonal waveforms that the system can transmit
simultaneously without interference. Note that in the classical MAB problem, M = 1. Tasks are selected
over time according to a Markovian prioritization policy u (which selects the current action as a function of
the current state). Let U denote the class of admissible Markovian policies. The prioritization policy u ∈ U
is [akn ] ∈ {0, 1}N×K . The TPRB problem consists of finding a prioritization policy that maximizes the total
expected discounted reward over an infinite horizon (we assume K is large enough for the infinite-horizon
approximation) subject to the constraint on the number of tasks that must be selected and the dynamical
constraint of each of the Markov decision processes represented by a tuple < P0 , P1 , R0 , R1 , β >. The total
expected discounted reward objective function can be described as
"
#
∞
max
u∈U
5
Z = Eu
N
akn
∑ ∑ β k Ri
k=0 n=1
k
n
.
(3)
THE TPRB SCHEME
In this section, we solve the TPRB problem mentioned in Section 4 using the primal-dual index heuristic
algorithm based on a first-order relaxation to form the TPRB scheme which has been demonstrated to have
less complexity and nearly optimal performance (Bertsimas and Niño-Mora 2000).
5.1 A First-order LP Relaxation
The first order relaxation as proposed by Whittle states that the original requirement that exactly M tasks
must be selected at any ping interval is relaxed to an averaged version: the total expected discounted
number of active tasks must be M/(1 − β ) (Whittle 1998). According to Whittle this relaxed version must
be interpreted as the problem of controlling optimally N separate Markov decision processes, subject to the
binding constraint on the average number of active tasks. In this section Whittle’s relaxation is reformulated
using a linear programming model developed in Bertsimas and Niño-Mora (2000).
The convenient decision variables (denoted here by xiann ) for a linear programming model for the Markov
decision processes are defined as follows. For each n ∈ N , in ∈ Sn and an ∈ {0, 1}, let xiann be the expected
total discounted time that task n is in state in and action an is made, when the distribution of the initial
state is known and given by a vector α ; i.e.,
"
#
∞
xiann = Eu
k=0
where
ak
Iikn
n
=
1
0
akn
∑ Ii
k
n
βk ,
if task n is in state ikn and action akn is taken at interval k,
otherwise.
(4)
(5)
The initial states of the tracking tasks are assumed to be known from the output of the tracker, and the
vector α is represented as
1 if project n is in state in at interval k = 0,
(6)
αin =
0 otherwise.
3731
Wakayama and Zabinsky
The objective function in (3) can then be translated into
max
∑ ∑
Z=
∑
n∈N in ∈Sn an ∈{0,1}
Rainn xiann ,
(7)
and the value Z is interpreted as the expected total discounted cost. Applying classical results on the
long-run properties of Markov decision processes, the constraints on xiann are given by
∑
xiann − β
an ∈{0,1}
∑
∑
painnjn xiann = α jn , for jn ∈ Sn , n ∈ N ,
(8)
0, for an ∈ {0, 1}, in ∈ Sn , n ∈ N .
(9)
in ∈Sn an ∈{0,1}
xiann ≥
Now, Whittle’s condition on the average number of active tasks can be written as
"
#
∞
∞
M
∑ ∑ xi1n = ∑ Eu ∑ k∑ Ii1kn β k = ∑ Mβ k = 1 − β .
k=0
k=0
n∈N i ∈S
n∈N in ∈Sn
n
(10)
n
The objective function (7) and the constraints (8) – (10) form a first-order LP relaxation model, denoted
by (P1 ), of the TPRB problem.
5.2 Primal-Dual Index Heuristic
To find each action akn once the xiann values are obtained, we use the primal-dual index heuristic proposed
in (Bertsimas and Niño-Mora 2000). The dual of the linear program (P1 ) is given by
(D1 )
min
∑ ∑
Y=
n∈N jn ∈Sn
α jn y jn +
M
y
1−β
subject to
yin − β
∑
p0in jn y jn ≥ R0in , for in ∈ Sn , n ∈ N ,
(11)
jn ∈Sn
yin − β
∑
p1in jn y jn + y ≥ R1in , for in ∈ Sn , n ∈ N .
jn ∈Sn
We denote {x̄iann } and {ȳin , ȳ} as an optimal and dual solution pair to the primal problem (P1 ) and its dual
problem (D1 ) respectively. We denote γ̄iann as the corresponding optimal reduced cost coefficients which
are given by
γ̄i0n = ȳin − β
∑
p0in jn ȳ jn − R0in ,
jn ∈SN
γ̄i1n
= ȳin − β
∑
p1in jn ȳ jn + ȳ − R1in ,
(12)
jn ∈SN
where γ̄i0n , γ̄i1n ≥ 0. γ̄i0n and γ̄i1n can be interpreted as the rates of decrease in the objective function value of
the primal problem (P1 ) per unit increase in the value of variables xi0n and xi1n respectively.
Given the current states (i1 , . . . , iN ) of the N tasks, the index for each of the task when it is in state in
is then computed as
(13)
ζin = γ̄i1n − γ̄i0n .
According to the primal-dual index heuristic, the tasks that have the M smallest indices are set active. In
case of ties, tasks with x̄i1n > 0 are set active.
3732
Wakayama and Zabinsky
6
SIMULATION RESULTS
In this section we illustrate the performance of the TPRB policy on multi-target scenarios using active
sonar tracking system simulations (Section 2). A monostatic active system with 36 nodes distributed in
an operational region of approximately 120 × 120 km2 in a 6 × 6 grid with a spacing of 15 km between
neighboring nodes is used. The field layout and the probability of detection map associated with a node
are shown in Figure 4a. Randomly-generated tracks are implemented to simulate tracking tasks. A 4-target
tracking scenario example is illustrated in Figure 4b. The number of tracks vary for different case studies.
(a)
(b)
Figure 4: (a) Probability of detection map of sonar node 17. (b) A 4-target tracking scenario example.
For performance comparison we implement traditional task prioritization policies including a round
robin policy and a greedy policy in addition to the TPRB policy. In this study, we set M = 1 assuming
only one ping from one node can be transmitted at each ping interval. The round robin policy uses a
pre-specified sequence for the N tracks to determine the ping decisions, i.e., at each ping time a ping is
transmitted from the node that will result in a minimum AOU and a detection probability of greater than
0.5 for the selected track. The greedy policy selects a track with the largest reward, denoted by n∗ , from
the N tracks at each ping time, i.e., n∗ = argmaxn∈N (R1n + ∑i∈N ,i6=n R0i ). As currently implemented, the
simulation experiments were performed using MATLAB. The linear programming models were solved
using LP Solve 5.5 (LP Solve 2015).
The process flow of the active sonar tracking system simulation can be summarized as follows: given
the number of tracking tasks and their positions and velocities, pings are first generated according to a
round robin policy for track initialization. Task prioritization is initialized when at least one detection
from each track has been obtained. The task prioritization model (round robin, greedy, or TPRB policy)
is solved to select a set of tracking tasks to perform next. Based on the selected tasks, the scheduling
model provides the ping decisions using sonar performance model predictions, and pings are generated
accordingly. Detections from the pings are processed by the multi-target tracker and the AOU state estimates
of the tracking tasks are updated. At this point the optimization procedure iterates. The task prioritization
and scheduling models are solved using the updated tracking tasks to generate new ping decisions. This
process is repeated until the end of the operational scenario time window.
To gain insights into the effect of task prioritization on the tracking performance, a 4-target tracking
scenario example (Figure 4b) is presented. In this example, it is desired to maintain tracks with their AOU
values below 1 km2 (i.e., Gaou = 1 km2 ). Typically, the goal AOU threshold Gaou will be selected by the
3733
Wakayama and Zabinsky
system’s operator. The AOU performance on the tracks with the round robin, greedy and TPRB policies
are plotted in Figure 5. To initialize task prioritization, pings are first generated according to a round robin
policy. At 5 minutes into the scenario, all the 4 tracks have been initialized and task prioritization is being
performed for the duration from 5 minutes to 35 minutes. With each ping the AOU of the selected track
decreases if it results in a detection and increases if it results in a miss, while the AOUs of the non-selected
tracks increase. The total number of detections over the scenario duration are the same for each of the
policies. However, the round robin policy cannot maintain continuous tracks below the goal AOU threshold
of 1 km2 . The greedy and TPRB policies strive to maintain 2 continuous tracks below 1 km2 threshold. In
this example, the greedy policy schedules a ping for an additional track at 30 minutes yielding an inferior
tracking result to the TPRB policy.
(a)
(b)
(c)
Figure 5: The AOU trajectories of the tracks in the 4-target scenario example with (a) the round robin
policy, (b) the greedy strategy, and (c) the TPRB policy.
The tracking performance of the TPRB, greedy and round robin policies is evaluated for N-target
scenarios, where N = 3, 4, and Gaou = 0.5, 0.7, 1, 3 km2 . Setting Gaou to a smaller value imposes a more
stringent requirement on the sonar tracking system as each tracking task demands more pings. For each
scenario, N random tracks are generated. Performance evaluation of each policy for each case study is
based on 10 N-target scenarios with 10 trials for each scenario (a total of 100 trials). The other simulation
parameters are defined as follows: the ping interval is set to 1 minute. The number of pings at each interval
is set to M = 1. The AOU state of each tracking task is quantized into 15 states with the corresponding
AOU ranges shown in Table 1. The values associated with being in each state is set to Vi = v for each
state i = 1, . . . , 15, where v = 40 for the desirable states (states with their AOU values below Gaou ),
v = −1, −2, . . . for each deteriorating state thereafter, with v = −15 for i = 15. The time discounting
parameter for the TPRB policy is set to β = 0.6. Each scenario is initiated with a round robin policy,
and the task prioritization module is activated when N tracks have been detected. The task prioritization
duration is set to 30 minutes.
In Table 2 we report the results of our experiments for various values of the parameters. For a given
set of parameters, the following quantities are computed for each of the policies: Zaou : mean AOU value
of the portions of the output tracks with values below Gaou (in km2 ), nT : mean number of successfully
tracked targets with AOU values below Gaou , nΓ : mean number of track fragments, dΓ : mean duration of
track fragments (in minutes), and DΓ : mean of the maximum duration of track fragments (in minutes) in
each trial. In general, the values of Zaou <= Gaou are of equivalent performance. With nT , dΓ , and DΓ ,
the higher the values, the more desirable the system performance is. With nΓ , the smaller values are more
desirable due to less track fragmentation.
It is observed that in general the performance of the greedy and TPRB policies significantly improves
the track quality of the prioritized tracks in terms of track fragmentation at the expense of neglecting tracking
3734
Wakayama and Zabinsky
tasks that cannot be fulfilled. The mean numbers of successfully tracked targets (nT ) are comparable among
all policies except for the cases with a stringent goal AOU threshold (Gaou = 0.5) or an easy goal AOU
threshold (Gaou = 3). In the cases with Gaou = 0.5, without any successive detection opportunities on a
single track, the round robin policy struggles to bring the AOU values below the threshold. In the case
with Gaou = 3 and N = 3, the greedy and TPRB policy is overly conservative by putting too much weights
on maintaining quality tracks and their overall performance is inferior to that of the round robin policy,
suggesting that the tracking system can provide performance without task prioritization. The value of task
prioritization in terms of track quality parameters (nΓ , dΓ , DΓ ) is apparent when the requirement for the
tracking tasks exceeds available ping resources. It can be observed that the greedy policy significantly
improves track quality over the round robin policy except for the case with Gaou = 3 and N = 3. Although
not by much, the greedy policy even outperforms the TPRB policy for the cases with Gaou = 0.5. In all the
other cases, the TPRB policy provides at least an equivalent or better performance over the greedy policy
in terms of track fragmentation metrics. Since the TPRB policy provides a more consistent performance
with limited ping resources, it can be concluded that the TPRB policy is superior to that of the round robin
and greedy policies.
Table 2: Numerical Experiments
Parameters
Policies
Zaou
nT
nΓ
dΓ
DΓ
N = 4, M = 1, Gaou = 0.5
TPRB
Greedy
Round robin
0.29
0.28
0.38
0.92
0.95
0.47
3.04
2.44
26.14
8.85
11.31
0.15
20.81
21.99
0.41
N = 4, M = 1, Gaou = 0.7
TPRB
Greedy
Round robin
0.32
0.38
0.48
0.97
0.99
0.95
3.90
9.01
25.37
6.70
2.33
0.18
20.54
11.11
0.96
N = 4, M = 1, Gaou = 1
TPRB
Greedy
Round robin
0.58
0.59
0.61
1.60
1.51
1.40
4.73
5.13
19.11
5.35
4.84
0.57
20.05
20.53
1.19
N = 4, M = 1, Gaou = 3
TPRB
Greedy
Round robin
0.82
1.22
1.22
1.99
2.77
2.81
3.83
8.36
7.89
6.83
2.59
2.84
20.49
9.95
17.31
N = 3, M = 1, Gaou = 0.5
TPRB
Greedy
Round robin
0.29
0.28
0.38
0.87
0.89
0.48
3.87
3.63
28.19
6.75
7.27
0.07
17.16
17.67
0.21
N = 3, M = 1, Gaou = 0.7
TPRB
Greedy
Round robin
0.34
0.39
0.48
0.93
0.95
0.92
4.74
6.84
24.99
5.33
3.38
0.20
17.76
13.37
0.71
N = 3, M = 1, Gaou = 1
TPRB
Greedy
Round robin
0.56
0.58
0.59
1.42
1.35
1.36
5.90
6.67
19.11
4.09
3.50
0.57
15.59
16.31
1.48
N = 3, M = 1, Gaou = 3
TPRB
Greedy
Round robin
0.87
1.18
1.12
1.90
2.51
2.51
4.04
4.55
2.71
6.42
5.59
10.06
19.51
16.83
27.34
3735
Wakayama and Zabinsky
7
CONCLUSIONS
In this paper, we addressed the dynamic task prioritization problem for an active sonar tracking system.
Finite-state Markov chains that capture the dynamical behavior of tracks’ AOUs are derived from the sonar
system simulations. The task prioritization problem is established as a restless bandit problem, which is
solved by the primal-dual index heuristic algorithm based on a first-order relaxation to form the TPRB
scheme. Extensive simulation results illustrate the superior performance of the proposed scheme over the
conventional round robin approach in terms of track fragmentation especially when the system is overloaded
with a large number of tracking tasks and/or has a stringent AOU requirement for the tracking tasks.
REFERENCES
Bar-Shalom, Y., P. K. Willett, and X. Tian. 2011. Tracking and Data Fusion: A Handbook of Algorithms.
Storrs, CT: YBS Publishing.
Bertsimas, D., and J. Niño-Mora. 2000. “Restless Bandits, Linear Programming Relaxations and a PrimalDual Heuristic”. Operations Research 48:80–90.
Gittins, J. 1979. “Bandit Processes and Dynamic Allocation Indicies (with discussion)”. Journal of Royal
Statistical Society 41 (2): 148–177.
Grimmett, D., and S. Coraluppi. 2006. “Multistatic Active Sonar System Interoperability, Data Fusion, and
Measures of Performance”. Technical Report NURC-FR-2006-004, NATO Undersea Research Centre.
LP Solve 2015. “LP Solve Reference Guide Menu”. Accessed Feb. 1, 2015. http://lpsolve.sourceforge.net.
Mahajan, A., and D. Teneketzis. 2008. “Multi-Armed Bandit Problems”. In Foundations and Applications
of Sensor Management, edited by A. O. H. III, D. A. Castanon, D. Cochran, and K. Kastella, 121–151.
Springer-Verlag.
MSPM 2010. “Multistatic Sonar Performance Model”. [software] (version 1.4). Signal Systems Corporation,
Maryland.
Wakayama, C. Y., Z. B. Zabinsky, R. Plate, and J. Hoff. 2015. “Dynamic Adaptive Ping Scheduling
for Monostatic Active Sonar Systems in Convergence Zone Environments”. U.S. Navy Journal of
Underwater Acoustics (accepted).
Whittle, P. 1998. “Restless Bandits: Activity Allocation in a Changing World”. Journal of Applied Probability 25:287–298.
AUTHOR BIOGRAPHIES
CHERRY WAKAYAMA is a Research Engineer in the Maritime Systems Division at SPAWAR Systems
Center Pacific. She is particularly involved in research and development concerning multistatic active sonar
information fusion, target tracking and sensor management. She holds M.S. and Ph.D. degrees in Electrical
Engineering from the University of Washington. Her research interests include dynamical systems, optimal
control theory and applications of operations research. Her e-mail address is wakayama@spawar.navy.mil.
ZELDA B. ZABINSKY is a Professor in the Department of Industrial and Systems Engineering at the
University of Washington, with adjunct appointments in the departments of Electrical Engineering, Mechanical Engineering, and Civil and Environmental Engineering. She is an IIE Fellow. Professor Zabinsky’s
research interests are in global optimization under uncertainty for complex systems. Her book, Stochastic
Adaptive Search in Global Optimization, describes research on theory and practice of algorithms useful for solving problems with multimodal objective functions in high dimension. Her email address is
zelda@u.washington.edu.
3736