Nothing Special   »   [go: up one dir, main page]

Academia.eduAcademia.edu

Simulation-driven task prioritization using a restless bandit model for active sonar missions

2015, Winter Simulation Conference

Proceedings of the 2015 Winter Simulation Conference L. Yilmaz, W. K. V. Chan, I. Moon, T. M. K. Roeder, C. Macal, and M. D. Rossetti, eds. SIMULATION-DRIVEN TASK PRIORITIZATION USING A RESTLESS BANDIT MODEL FOR ACTIVE SONAR MISSIONS Cherry Y. Wakayama Zelda B. Zabinsky Maritime Systems Division SPAWAR Systems Center Pacific San Diego, CA 92152, USA Industrial & Systems Engineering Department University of Washington Seattle, WA 98195, USA ABSTRACT We consider a task prioritization problem of an active sonar tracking system when available ping resources may not be sufficient to sustain all tracking tasks at any particular time. In this problem, the time-varying conditions of a tracking task are represented by a finite-state discrete-time Markov decision process. The objective is to find a policy which decides at each time interval which tracking tasks to perform so as to maximize the aggregate reward over time. This paper addresses the derivation of the Markov chain parameters from the sonar tracking system simulations, the establishment of task prioritization as a restless bandit (TPRB) problem, and the TPRB policy obtained by a primal-dual index heuristic based on a first-order linear programming relaxation to the TPRB problem. The superior performance of the resulting TPRB policy is demonstrated using Monte Carlo simulations on various multi-target scenarios. 1 INTRODUCTION Active sonar systems detect underwater objects of interest by transmitting acoustic waveforms, or pings, from sources and detecting the echoes with receivers. Active sonar systems are especially employed in military applications, and can carry out a variety of tasks, such as surveillance and multi-target tracking (Grimmett and Coraluppi 2006). This paper describes an active sonar system composed of spatially distributed sonar nodes or co-located source/receiver pairs for multi-target tracking as illustrated in Figure 1. Since the sonar tracking system contains numerous transmit/receive nodes, and the number of simultaneous transmits from multiple sources is limited due to potential interference and communications and processing constraints, effective sonar resource management is required to use the system capabilities efficiently. There are generally two related aspects of resource management for a sonar tracking system: scheduling and task prioritization. Scheduling decides how to commit resources (sources, waveforms and time slots) between a given set of tasks so as to best accomplish the tasks. An adaptive ping scheduling approach for active sonar systems for multi-objective anti-submarine warfare (ASW) area search and tracking missions is developed by Wakayama et al. (2015). However, in certain operating conditions, there may not be enough resources to perform all the required tasks without compromising quality, e.g., the number of tracking tasks on hand is greater than the number of orthogonal waveforms that the system can transmit simultaneously without interference. Task prioritization determines the set of active tasks to focus on in such instances, and becomes critical to the performance of the sonar resource management and to subsequent mission success. In this paper, we study the dynamic task prioritization problem associated with the allocation of pings to tracking tasks over time. The dynamic task prioritization problem fits naturally into the framework of multi-armed bandits (Mahajan and Teneketzis 2008). In this problem, tracking tasks may be viewed as bandits. At each ping interval, ping(s) are allocated to selected bandit(s) to generate reward. In particular, this problem is characterized as a restless bandit problem because track quality deteriorates when the tracking task is not 978-1-4673-9743-8/15/$31.00 ©2015 IEEE 3725 Wakayama and Zabinsky Figure 1: Multi-target tracking scenario for an active sonar system. being processed (Whittle 1998). With the assumption of the utilization of a multi-target tracker in the sonar tracking system, we define a tracking task as a target which is previously detected and associated in the tracker. The tracking task states represent the quantized areas of uncertainty (AOU) owing to track quality during the track updates in the tracker. We assume that the tracking task states are perfectly observable with the utilization of a tracker. We also assume that the tracking task states evolve independently of each other. At each instance of time or ping interval, tasks are selected to which pings must be allocated, and the sonar tracking system collects a reward as a function of the states. When a track is selected for prioritization (i.e., a ping is allocated for the selected track), information on its position and velocity is obtained if the ping results in a detection, and hence it may result in an improved AOU state (smaller AOU results in faster target acquisition time). For the tracks that are not selected, information is being lost, because the targets will certainly be performing evasive maneuvering. The tracking tasks continue to change states whether they are being selected or not. Each tracking task has potentially different state space, and the state transition rules for each task with each prioritization decision will be different in general. Considering complexity in active sonar tracking systems, the Markovian transition rules with respect to the binary prioritization decisions on a tracking task are derived from sonar tracking system simulations. The complex dynamics of a sonar tracking system are evident from maneuvering targets, and imperfect and noisy sensors whose detection probability is dependent on acoustic environment conditions, target states and waveform characteristics. In this instance, simulation is very effective in modeling and capturing dynamic effects of a sonar tracking system. Monte Carlo track samples are generated and the statistics about the AOU state evolutions associated with a tracking task are obtained by analyzing the resulting simulated data sets. Extracting efficient high-level models from complex dynamic processes is highly desirable for real-time control applications such as dynamic task prioritization for sonar resource management. The estimated Markov transition matrices are utilized to formulate the task prioritization as a restless bandit (TPRB) problem. The TPRB problem is then solved using a primal-dual heuristic index algorithm based on a first-order linear programming relaxation (Bertsimas and Niño-Mora 2000). The approach to solving the TPRB problem is referred to as the TPRB scheme in this paper. The TPRB scheme provides a feasible policy with a guarantee for its suboptimality, and has been demonstrated numerically to have excellent performance. The tracker provides updates on tracking tasks at regular intervals, and the TPRB 3726 Wakayama and Zabinsky problem is solved repeatedly with an updated initial condition to derive task prioritization policy throughout the mission duration. This paper proceeds as follows. Section 2 describes a general active sonar tracking system model and its submodules focusing on tracking tasks. Section 3 discusses the modeling of individual tracking task dynamics using a finite-state discrete-time Markov chain from the simulated data sets. Section 4 presents a restless bandit model formulation for the dynamic task prioritization problem. Section 5 explains a solution approach to the TPRB problem based on a first-order linear programming relaxation. Section 6 compares the performance of the resulting TPRB policy with the conventional policies including a round robin policy and a greedy policy on various multi-target scenarios, and Section 7 provides insights and conclusions. 2 SYSTEM MODEL A general active sonar tracking system model is shown in Figure 2, and is used for simulation and analyzing the task prioritization techniques. In this closed-loop architecture, information on targets is obtained by active sonar measurements and processed by the multi-target tracker and the resource management modules (task prioritization and scheduling) to derive effective ping decisions, and the ping decisions are then fed back to the sonar system for improved tracking. In this section, every module of the system model is briefly described. Figure 2: Active sonar tracking system model. Scenario A field of multiple sensor nodes is positioned in an acoustic environment of interest and its operating region is identified. In this study, the scenario is formed with tracking tasks of N targets. The target tracks are sampled from a set of randomly-generated tracks. Tracks are generated with random headings (uniform between −180◦ and 180◦ ), random speeds (uniform between minimum and maximum assumed speeds) and random initial positions (uniform within the operating region). A discrete-time, nearly-constant-velocity model with assumed motion noise is used to update each track for a specified scenario time window. Sonar Performance Model The sonar performance model computes mean signal-to-noise ratios (SNR) or signal excesses (SE) for a given sensor field configuration consisting of sources, receivers and moving targets, and is an essential element of the ping decision derivation and the sonar system simulation. Detection probability predictions can then be computed from the SE values. For this analysis, mean SE values are obtained using Signal Systems Corporation’s Multistatic Sonar Performance Model software (MSPM 2010). 3727 Wakayama and Zabinsky Ping Contact Generator A contact generator is required in order to test and evaluate sensor management techniques. This module provides target contacts for each receiver, given a waveform transmission from a particular source. Target contacts are derived from simulated tracks, and contain information including SNR, bearing, arrival time for ranging, and range-rate if available. They are modeled by obtaining mean SE from the sonar performance model, and adding a random fluctuation term from an assumed Gaussian distribution, along with assumed measurement (bearing, time, and range-rate) errors. Target contacts are generated when the resulting SE is positive. For this study on track prioritization scenarios, it is assumed that false contacts have already been removed by the multi-target tracker and hence they are not included in the ping contact generator. Multi-target Tracker The purpose of the tracker is to estimate the positions and velocities of targets of interest from the detection contacts. This process requires the removal of false contacts, the association of true contacts, and the fusion of contact information. There is a large literature on sensor data fusion and target tracking (Bar-Shalom et al. 2011). An appropriate tracker implementation must balance the tradeoff between complexity and quality depending on the applications. In this paper, a simple Kalman filter based tracker is implemented. The track outputs of the tracker are represented by their expected 2-dimensional positional and velocity state vectors and covariance matrices. Their AOU states are calculated from the 2-dimensional positional covariance matrix. Task Prioritization The task prioritization module dynamically determines the set of active tracks to focus on at any particular ping interval in order to optimize the overall sonar system performance. Task prioritization becomes critical as the number of tracking tasks on hand becomes higher and available resources may not be sufficient to sustain all tracks. The number of active tracking tasks that can be scheduled at each ping time interval is constrained by the number of orthorgonal waveforms available for the sonar system as well as processing and communications limitations. The main focus of this paper is dynamic task prioritization, and an optimization problem formulation for the task prioritization problem is discussed in detail in Section 4. Scheduling The scheduling module together with the task prioritization module control the performance of a sonar tracking system. Given the active tracking tasks selected by the task prioritization module at each ping time interval, the scheduling module provides the ping decisions on which source/waveform pairs are best to ping next. The performance is measured by the instantaneous detection probability and the resulting AOU of each active track. The scheduling strategy is to choose a source/waveform pair for each active tracking task that will result in a minimum AOU and a detection probability of greater than 0.5. 3 TRACKING TASK DYNAMIC MODEL FROM SYSTEM SIMULATION We seek to predict the next AOUs of a given tracking task with respect to the binary prioritization decisions on the task using a Markov chain. Since it is simply not feasible to obtain real-world data, the AOU data required for Markov chain construction are generated from Monte Carlo track samples using the sonar tracking system model (Section 2). We then apply Markov chain calculations to each of the simulated data sets corresponding to each of the prioritization decisions to obtain a finite transition probability matrix with the states representing the quantized AOUs. These Markov chains represent a priori statistics about the tracking task dynamic processes for the task prioritization optimization problem discussed in Section 4. 3728 Wakayama and Zabinsky 3.1 AOU Data Set Generation The sonar tracking system simulation framework for AOU data generation consists of a discrete event simulation model with an embedded optimization module or scheduling module, a sonar performance model, a ping contact generator and a multi-target tracker (Section 2). The user can modify sensor configurations, environment conditions and simulation parameters in the scenario module. A specified number of Monte Carlo tracks are generated with random headings, speeds and initial positions, and assumed motion noise. For each sampled track at each ping time interval, two data pairs, (δ , δ 1 ) and (δ , δ 0 ), are obtained with binary track prioritization decisions, namely active and passive (we identify superscript 1 with active (i.e., the track is selected and a ping is allocated for the track), and superscript 0 with passive (i.e., the track is not selected and is not considered in the ping decision generation)). The AOU of the sampled track just before the prioritization decision is denoted by δ , and calculated from the current predicted positional covariance matrix. With the active decision on the sampled track at any given interval, the scheduling module provides a ping (source/waveform pair) optimized for the track, the ping contact generator provides a target contact (if any) and the multi-target tracker updates its position and velocity estimates from the contact measurement. The active AOU, δ 1 , is then calculated from the updated positional covariance matrix. With the passive decision, no contact is generated. The multi-target tracker propagates its position and velocity estimates and the passive AOU, δ 0 , is calculated. Typically, the AOU grows rapidly in time without successive measurements. Grouping those data pairs according to the active or passive decision results in two data sets, denoted by D1 and D0 respectively. 3.2 Markov Transition Matrices We calculate a state transition matrix from each data set, thereby obtaining two transition matrices denoted by P1 and P0 corresponding to the data sets D1 and D0 respectively. A key step when creating a Markov chain is determining the appropriate quantization of states. To identify the number and quantization intervals, we analyze the data as well as consider the operational requirements of the sonar tracking system. In some tracking applications, it may be desirable to maintain the AOUs of the targets of interest below a certain threshold as dictated by the system requirements. When the AOU becomes greater due to successive misses, a track may be considered lost. In this paper, the quantization of AOU states is done by analyzing detection patterns and resulting AOUs. Histograms are useful in ensuring there are enough data points available to capture the state transitional behavior and establishing the quantization intervals or bin ranges. Once the intervals are created, each of the data pairs are labeled with the intervals to which they correspond. After the data pairs are labeled, the number of transitions from one interval to the next is counted and empirical state transition probabilities are calculated yielding the transition matrices P1 and P0 . The state quantization for the examples in this paper is shown in Table 1. Figure 3 includes three histograms illustrating the distributions of the initial and next AOU states obtained from the simulated data sets. Figure 3a corresponds to the values of δ , and Figures 3b and 3c correspond to the values of δ 1 and δ 0 respectively. It is observed that with the active decision the mass of the distribution shifted towards the smaller AOU states and with the passive decision towards the larger AOU states. Table 1: State Quantization States AOU ranges (km2 ) States AOU ranges (km2 ) States AOU ranges (km2 ) 1 2 3 4 5 (0,0.35] (0.35,0.5] (0.5,0.7] (0.7,1] (1,1.8] 6 7 8 9 10 (1.8,3] (3,5] (5,8] (8,12] (12,17] 11 12 13 14 15 (17,27] (27,40] (40,62] (62,100] > 100 3729 Wakayama and Zabinsky (a) (b) (c) Figure 3: Histograms of the data sets used in the simulation examples: (a) initial AOU data {δ }, (b) next AOU data with the active decision {δ 1 } and (c) next AOU data with the passive decision {δ 0 }. 4 RESTLESS BANDIT FORMULATION In this section, we provide a brief description of key features of the classical multi-armed bandit (MAB) problem. We then formulate the task prioritization as a variant of the classical MAB problem, known as restless bandit problem, and refer to it as the TPRB problem. Multi-armed bandit problems are a class of sequential resource allocation problems concerned with allocating one or more resources among several competing projects (Mahajan and Teneketzis 2008). In the classical MAB problem at each instant of time a single resource is allocated to one of many competing projects. The project to which the resource is allocated changes its state according to a random dynamical process with known statistical descriptions and generates a reward. The remaining projects remain frozen and contribute no reward. Projects are assumed independent. Gittins (1979) showed that an optimal solution is of the index type. The Gittins indices can be computed separately for each project (or bandit) and thus, finding an optimal policy reduces to selecting the project with the highest index. The task prioritization problem can be seen as a variation of the classical MAB problem. In the following we introduce the notation and provide the formal problem definition. We consider a collection of N tracking tasks (bandits), labeled n ∈ N = {1, . . . , N}. At ping interval k, task n can be in one of the finite number of states ikn ∈ Sn . Let akn ∈ {0, 1} be the action for task n at ping time k. If task n in state ikn is selected (i.e., akn = 1), then an active reward R1ik is earned, and its state changes in a Markovian fashion, n according to an active transition probability matrix P1 into state jnk+1 with probability p1ik jk+1 . If the task n n is not selected (i.e., akn = 0), then a passive reward R0ik is received, and its state then changes according to n a passive transition probability matrix P0 into state jnk+1 with probability p0ik jk+1 . The tracking tasks are n n characterized as restless bandits because each track’s AOU state evolves over time even when it is not being selected (Whittle 1998). In the TPRB problem, the system reward represents the value assigned according to the track quality (smaller AOU states correspond to higher track quality). If task n is selected or active, information may be updated depending on whether the ping will result in a detection or not. This probability is captured by P1 . When the state of the n-th task is ikn , we suppose that the active reward is the mean track quality value, which can be represented as (1) R1in = ∑ p1in jn Vin , jn ∈Sn 3730 Wakayama and Zabinsky where Vin is the track quality value assigned for being in state in , which is pre-specified by the system operator. Similarly, the passive reward is defined as R0in = p0in jn Vin . ∑ (2) jn ∈Sn In this formulation, rewards are time discounted by a discount factor β , where 0 < β < 1. At each ping interval k = 0, 1, . . . , K, exactly M tasks must be selected, i.e., ∑Nn=1 akn = M, where 1 ≤ M < N, and M is determined by the number of orthogonal waveforms that the system can transmit simultaneously without interference. Note that in the classical MAB problem, M = 1. Tasks are selected over time according to a Markovian prioritization policy u (which selects the current action as a function of the current state). Let U denote the class of admissible Markovian policies. The prioritization policy u ∈ U is [akn ] ∈ {0, 1}N×K . The TPRB problem consists of finding a prioritization policy that maximizes the total expected discounted reward over an infinite horizon (we assume K is large enough for the infinite-horizon approximation) subject to the constraint on the number of tasks that must be selected and the dynamical constraint of each of the Markov decision processes represented by a tuple < P0 , P1 , R0 , R1 , β >. The total expected discounted reward objective function can be described as " # ∞ max u∈U 5 Z = Eu N akn ∑ ∑ β k Ri k=0 n=1 k n . (3) THE TPRB SCHEME In this section, we solve the TPRB problem mentioned in Section 4 using the primal-dual index heuristic algorithm based on a first-order relaxation to form the TPRB scheme which has been demonstrated to have less complexity and nearly optimal performance (Bertsimas and Niño-Mora 2000). 5.1 A First-order LP Relaxation The first order relaxation as proposed by Whittle states that the original requirement that exactly M tasks must be selected at any ping interval is relaxed to an averaged version: the total expected discounted number of active tasks must be M/(1 − β ) (Whittle 1998). According to Whittle this relaxed version must be interpreted as the problem of controlling optimally N separate Markov decision processes, subject to the binding constraint on the average number of active tasks. In this section Whittle’s relaxation is reformulated using a linear programming model developed in Bertsimas and Niño-Mora (2000). The convenient decision variables (denoted here by xiann ) for a linear programming model for the Markov decision processes are defined as follows. For each n ∈ N , in ∈ Sn and an ∈ {0, 1}, let xiann be the expected total discounted time that task n is in state in and action an is made, when the distribution of the initial state is known and given by a vector α ; i.e., " # ∞ xiann = Eu k=0 where ak Iikn n =  1 0 akn ∑ Ii k n βk , if task n is in state ikn and action akn is taken at interval k, otherwise. (4) (5) The initial states of the tracking tasks are assumed to be known from the output of the tracker, and the vector α is represented as  1 if project n is in state in at interval k = 0, (6) αin = 0 otherwise. 3731 Wakayama and Zabinsky The objective function in (3) can then be translated into max ∑ ∑ Z= ∑ n∈N in ∈Sn an ∈{0,1} Rainn xiann , (7) and the value Z is interpreted as the expected total discounted cost. Applying classical results on the long-run properties of Markov decision processes, the constraints on xiann are given by ∑ xiann − β an ∈{0,1} ∑ ∑ painnjn xiann = α jn , for jn ∈ Sn , n ∈ N , (8) 0, for an ∈ {0, 1}, in ∈ Sn , n ∈ N . (9) in ∈Sn an ∈{0,1} xiann ≥ Now, Whittle’s condition on the average number of active tasks can be written as " # ∞ ∞ M ∑ ∑ xi1n = ∑ Eu ∑ k∑ Ii1kn β k = ∑ Mβ k = 1 − β . k=0 k=0 n∈N i ∈S n∈N in ∈Sn n (10) n The objective function (7) and the constraints (8) – (10) form a first-order LP relaxation model, denoted by (P1 ), of the TPRB problem. 5.2 Primal-Dual Index Heuristic To find each action akn once the xiann values are obtained, we use the primal-dual index heuristic proposed in (Bertsimas and Niño-Mora 2000). The dual of the linear program (P1 ) is given by (D1 ) min ∑ ∑ Y= n∈N jn ∈Sn α jn y jn + M y 1−β subject to yin − β ∑ p0in jn y jn ≥ R0in , for in ∈ Sn , n ∈ N , (11) jn ∈Sn yin − β ∑ p1in jn y jn + y ≥ R1in , for in ∈ Sn , n ∈ N . jn ∈Sn We denote {x̄iann } and {ȳin , ȳ} as an optimal and dual solution pair to the primal problem (P1 ) and its dual problem (D1 ) respectively. We denote γ̄iann as the corresponding optimal reduced cost coefficients which are given by γ̄i0n = ȳin − β ∑ p0in jn ȳ jn − R0in , jn ∈SN γ̄i1n = ȳin − β ∑ p1in jn ȳ jn + ȳ − R1in , (12) jn ∈SN where γ̄i0n , γ̄i1n ≥ 0. γ̄i0n and γ̄i1n can be interpreted as the rates of decrease in the objective function value of the primal problem (P1 ) per unit increase in the value of variables xi0n and xi1n respectively. Given the current states (i1 , . . . , iN ) of the N tasks, the index for each of the task when it is in state in is then computed as (13) ζin = γ̄i1n − γ̄i0n . According to the primal-dual index heuristic, the tasks that have the M smallest indices are set active. In case of ties, tasks with x̄i1n > 0 are set active. 3732 Wakayama and Zabinsky 6 SIMULATION RESULTS In this section we illustrate the performance of the TPRB policy on multi-target scenarios using active sonar tracking system simulations (Section 2). A monostatic active system with 36 nodes distributed in an operational region of approximately 120 × 120 km2 in a 6 × 6 grid with a spacing of 15 km between neighboring nodes is used. The field layout and the probability of detection map associated with a node are shown in Figure 4a. Randomly-generated tracks are implemented to simulate tracking tasks. A 4-target tracking scenario example is illustrated in Figure 4b. The number of tracks vary for different case studies. (a) (b) Figure 4: (a) Probability of detection map of sonar node 17. (b) A 4-target tracking scenario example. For performance comparison we implement traditional task prioritization policies including a round robin policy and a greedy policy in addition to the TPRB policy. In this study, we set M = 1 assuming only one ping from one node can be transmitted at each ping interval. The round robin policy uses a pre-specified sequence for the N tracks to determine the ping decisions, i.e., at each ping time a ping is transmitted from the node that will result in a minimum AOU and a detection probability of greater than 0.5 for the selected track. The greedy policy selects a track with the largest reward, denoted by n∗ , from the N tracks at each ping time, i.e., n∗ = argmaxn∈N (R1n + ∑i∈N ,i6=n R0i ). As currently implemented, the simulation experiments were performed using MATLAB. The linear programming models were solved using LP Solve 5.5 (LP Solve 2015). The process flow of the active sonar tracking system simulation can be summarized as follows: given the number of tracking tasks and their positions and velocities, pings are first generated according to a round robin policy for track initialization. Task prioritization is initialized when at least one detection from each track has been obtained. The task prioritization model (round robin, greedy, or TPRB policy) is solved to select a set of tracking tasks to perform next. Based on the selected tasks, the scheduling model provides the ping decisions using sonar performance model predictions, and pings are generated accordingly. Detections from the pings are processed by the multi-target tracker and the AOU state estimates of the tracking tasks are updated. At this point the optimization procedure iterates. The task prioritization and scheduling models are solved using the updated tracking tasks to generate new ping decisions. This process is repeated until the end of the operational scenario time window. To gain insights into the effect of task prioritization on the tracking performance, a 4-target tracking scenario example (Figure 4b) is presented. In this example, it is desired to maintain tracks with their AOU values below 1 km2 (i.e., Gaou = 1 km2 ). Typically, the goal AOU threshold Gaou will be selected by the 3733 Wakayama and Zabinsky system’s operator. The AOU performance on the tracks with the round robin, greedy and TPRB policies are plotted in Figure 5. To initialize task prioritization, pings are first generated according to a round robin policy. At 5 minutes into the scenario, all the 4 tracks have been initialized and task prioritization is being performed for the duration from 5 minutes to 35 minutes. With each ping the AOU of the selected track decreases if it results in a detection and increases if it results in a miss, while the AOUs of the non-selected tracks increase. The total number of detections over the scenario duration are the same for each of the policies. However, the round robin policy cannot maintain continuous tracks below the goal AOU threshold of 1 km2 . The greedy and TPRB policies strive to maintain 2 continuous tracks below 1 km2 threshold. In this example, the greedy policy schedules a ping for an additional track at 30 minutes yielding an inferior tracking result to the TPRB policy. (a) (b) (c) Figure 5: The AOU trajectories of the tracks in the 4-target scenario example with (a) the round robin policy, (b) the greedy strategy, and (c) the TPRB policy. The tracking performance of the TPRB, greedy and round robin policies is evaluated for N-target scenarios, where N = 3, 4, and Gaou = 0.5, 0.7, 1, 3 km2 . Setting Gaou to a smaller value imposes a more stringent requirement on the sonar tracking system as each tracking task demands more pings. For each scenario, N random tracks are generated. Performance evaluation of each policy for each case study is based on 10 N-target scenarios with 10 trials for each scenario (a total of 100 trials). The other simulation parameters are defined as follows: the ping interval is set to 1 minute. The number of pings at each interval is set to M = 1. The AOU state of each tracking task is quantized into 15 states with the corresponding AOU ranges shown in Table 1. The values associated with being in each state is set to Vi = v for each state i = 1, . . . , 15, where v = 40 for the desirable states (states with their AOU values below Gaou ), v = −1, −2, . . . for each deteriorating state thereafter, with v = −15 for i = 15. The time discounting parameter for the TPRB policy is set to β = 0.6. Each scenario is initiated with a round robin policy, and the task prioritization module is activated when N tracks have been detected. The task prioritization duration is set to 30 minutes. In Table 2 we report the results of our experiments for various values of the parameters. For a given set of parameters, the following quantities are computed for each of the policies: Zaou : mean AOU value of the portions of the output tracks with values below Gaou (in km2 ), nT : mean number of successfully tracked targets with AOU values below Gaou , nΓ : mean number of track fragments, dΓ : mean duration of track fragments (in minutes), and DΓ : mean of the maximum duration of track fragments (in minutes) in each trial. In general, the values of Zaou <= Gaou are of equivalent performance. With nT , dΓ , and DΓ , the higher the values, the more desirable the system performance is. With nΓ , the smaller values are more desirable due to less track fragmentation. It is observed that in general the performance of the greedy and TPRB policies significantly improves the track quality of the prioritized tracks in terms of track fragmentation at the expense of neglecting tracking 3734 Wakayama and Zabinsky tasks that cannot be fulfilled. The mean numbers of successfully tracked targets (nT ) are comparable among all policies except for the cases with a stringent goal AOU threshold (Gaou = 0.5) or an easy goal AOU threshold (Gaou = 3). In the cases with Gaou = 0.5, without any successive detection opportunities on a single track, the round robin policy struggles to bring the AOU values below the threshold. In the case with Gaou = 3 and N = 3, the greedy and TPRB policy is overly conservative by putting too much weights on maintaining quality tracks and their overall performance is inferior to that of the round robin policy, suggesting that the tracking system can provide performance without task prioritization. The value of task prioritization in terms of track quality parameters (nΓ , dΓ , DΓ ) is apparent when the requirement for the tracking tasks exceeds available ping resources. It can be observed that the greedy policy significantly improves track quality over the round robin policy except for the case with Gaou = 3 and N = 3. Although not by much, the greedy policy even outperforms the TPRB policy for the cases with Gaou = 0.5. In all the other cases, the TPRB policy provides at least an equivalent or better performance over the greedy policy in terms of track fragmentation metrics. Since the TPRB policy provides a more consistent performance with limited ping resources, it can be concluded that the TPRB policy is superior to that of the round robin and greedy policies. Table 2: Numerical Experiments Parameters Policies Zaou nT nΓ dΓ DΓ N = 4, M = 1, Gaou = 0.5 TPRB Greedy Round robin 0.29 0.28 0.38 0.92 0.95 0.47 3.04 2.44 26.14 8.85 11.31 0.15 20.81 21.99 0.41 N = 4, M = 1, Gaou = 0.7 TPRB Greedy Round robin 0.32 0.38 0.48 0.97 0.99 0.95 3.90 9.01 25.37 6.70 2.33 0.18 20.54 11.11 0.96 N = 4, M = 1, Gaou = 1 TPRB Greedy Round robin 0.58 0.59 0.61 1.60 1.51 1.40 4.73 5.13 19.11 5.35 4.84 0.57 20.05 20.53 1.19 N = 4, M = 1, Gaou = 3 TPRB Greedy Round robin 0.82 1.22 1.22 1.99 2.77 2.81 3.83 8.36 7.89 6.83 2.59 2.84 20.49 9.95 17.31 N = 3, M = 1, Gaou = 0.5 TPRB Greedy Round robin 0.29 0.28 0.38 0.87 0.89 0.48 3.87 3.63 28.19 6.75 7.27 0.07 17.16 17.67 0.21 N = 3, M = 1, Gaou = 0.7 TPRB Greedy Round robin 0.34 0.39 0.48 0.93 0.95 0.92 4.74 6.84 24.99 5.33 3.38 0.20 17.76 13.37 0.71 N = 3, M = 1, Gaou = 1 TPRB Greedy Round robin 0.56 0.58 0.59 1.42 1.35 1.36 5.90 6.67 19.11 4.09 3.50 0.57 15.59 16.31 1.48 N = 3, M = 1, Gaou = 3 TPRB Greedy Round robin 0.87 1.18 1.12 1.90 2.51 2.51 4.04 4.55 2.71 6.42 5.59 10.06 19.51 16.83 27.34 3735 Wakayama and Zabinsky 7 CONCLUSIONS In this paper, we addressed the dynamic task prioritization problem for an active sonar tracking system. Finite-state Markov chains that capture the dynamical behavior of tracks’ AOUs are derived from the sonar system simulations. The task prioritization problem is established as a restless bandit problem, which is solved by the primal-dual index heuristic algorithm based on a first-order relaxation to form the TPRB scheme. Extensive simulation results illustrate the superior performance of the proposed scheme over the conventional round robin approach in terms of track fragmentation especially when the system is overloaded with a large number of tracking tasks and/or has a stringent AOU requirement for the tracking tasks. REFERENCES Bar-Shalom, Y., P. K. Willett, and X. Tian. 2011. Tracking and Data Fusion: A Handbook of Algorithms. Storrs, CT: YBS Publishing. Bertsimas, D., and J. Niño-Mora. 2000. “Restless Bandits, Linear Programming Relaxations and a PrimalDual Heuristic”. Operations Research 48:80–90. Gittins, J. 1979. “Bandit Processes and Dynamic Allocation Indicies (with discussion)”. Journal of Royal Statistical Society 41 (2): 148–177. Grimmett, D., and S. Coraluppi. 2006. “Multistatic Active Sonar System Interoperability, Data Fusion, and Measures of Performance”. Technical Report NURC-FR-2006-004, NATO Undersea Research Centre. LP Solve 2015. “LP Solve Reference Guide Menu”. Accessed Feb. 1, 2015. http://lpsolve.sourceforge.net. Mahajan, A., and D. Teneketzis. 2008. “Multi-Armed Bandit Problems”. In Foundations and Applications of Sensor Management, edited by A. O. H. III, D. A. Castanon, D. Cochran, and K. Kastella, 121–151. Springer-Verlag. MSPM 2010. “Multistatic Sonar Performance Model”. [software] (version 1.4). Signal Systems Corporation, Maryland. Wakayama, C. Y., Z. B. Zabinsky, R. Plate, and J. Hoff. 2015. “Dynamic Adaptive Ping Scheduling for Monostatic Active Sonar Systems in Convergence Zone Environments”. U.S. Navy Journal of Underwater Acoustics (accepted). Whittle, P. 1998. “Restless Bandits: Activity Allocation in a Changing World”. Journal of Applied Probability 25:287–298. AUTHOR BIOGRAPHIES CHERRY WAKAYAMA is a Research Engineer in the Maritime Systems Division at SPAWAR Systems Center Pacific. She is particularly involved in research and development concerning multistatic active sonar information fusion, target tracking and sensor management. She holds M.S. and Ph.D. degrees in Electrical Engineering from the University of Washington. Her research interests include dynamical systems, optimal control theory and applications of operations research. Her e-mail address is wakayama@spawar.navy.mil. ZELDA B. ZABINSKY is a Professor in the Department of Industrial and Systems Engineering at the University of Washington, with adjunct appointments in the departments of Electrical Engineering, Mechanical Engineering, and Civil and Environmental Engineering. She is an IIE Fellow. Professor Zabinsky’s research interests are in global optimization under uncertainty for complex systems. Her book, Stochastic Adaptive Search in Global Optimization, describes research on theory and practice of algorithms useful for solving problems with multimodal objective functions in high dimension. Her email address is zelda@u.washington.edu. 3736