The development of multi-agent active sensing is relatively recent. Models in the literature address cooperative, self-interested, or adversarial settings. Cooperative settings usually focus on coordination to efficiently cover the environment or to maintain a common belief state. In self-interested and adversarial settings, the agents typically focus on gathering information to anticipate the other agents’ actions and react appropriately, for example to avoid an undesirable state such as a collision or to counter an adversarial attack. In both settings, however, it is interesting to note that the information-gathering process targets not only environmental factors (such as a possible opponent’s location or speed) but also the other agents’ internal states (their intentions, goals, beliefs, etc.).
5.2.1 Cooperative Settings.
To our knowledge, the first models to address multi-agent active perception are proposed by Renoux et al. [58] and Eck and Soh [22], who develop two different models focusing on cases in which agents cannot collaborate in advance to create a joint policy. Renoux et al. [58] represent each agent in the system with a POMDP that reasons over extended belief states, which include the agent’s own beliefs over environmental factors as well as a level-1 nested belief over the other agents in the system. The reward function is based on a model of agent-based relevance, defined in Reference [59] and presented in Equation (8), where \(b_{t}\) is the belief state of the agent before receiving the observation \(o\), \(b_{t+1}\) is the belief state of the agent after receiving the observation \(o\), \(D_{KL}\) is the Kullback–Leibler distance, and \(H\) is the negative entropy. This formulation assumes that an observation is relevant for a given agent if it is new (i.e., it changes the agent’s beliefs significantly, as measured by the Kullback–Leibler distance) or if it makes the agent’s beliefs more precise (as measured by the difference in entropy). The parameters \(\alpha\) and \(\beta\) balance these two (sometimes contradictory) aspects, and the parameter \(\delta\) ensures that the relevance remains positive. Using this formulation and the extended belief state, the agents can proactively weigh the expected impact of an observation on another agent and send the most impactful one. This decentralized information sharing enables decentralized cooperation. To solve the extended POMDP, Renoux et al. [58] transform it into a Belief-MDP, similarly to what has been described for the single-agent case (Section 4.2.3). Later, Renoux et al. [60] use similar extended belief states to extend the POMDP-IR framework into a Communicative POMDP-IR (Com-POMDP-IR). The Com-POMDP-IR extends the prediction actions of the POMDP-IR to incorporate predictions about a human operator’s belief state and uses these prediction actions to optimize a communication strategy. In this case, the resulting POMDP is solved using standard POMDP approaches, similarly to what has been described for the single-agent case (Section 4.2.1).
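The full expression of Equation (8) is given in Reference [59]; as a rough, non-authoritative illustration of the two terms described above, the following Python sketch scores an observation over a discrete belief by linearly combining a novelty term (Kullback–Leibler divergence) and a precision term (difference in negative entropy). The function names, the linear combination, and the parameter values are assumptions made for illustration, not the authors’ implementation.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence D_KL(p || q) between two discrete beliefs."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def neg_entropy(p, eps=1e-12):
    """Negative Shannon entropy: larger values mean a more concentrated belief."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * np.log(p + eps)))

def relevance(b_t, b_t1, alpha=1.0, beta=1.0, delta=0.0):
    """Hypothetical agent-based relevance of an observation o that updated the
    belief from b_t to b_t1: a weighted sum of novelty (KL term) and precision
    gain (difference in negative entropy), offset by delta to stay positive."""
    novelty = kl_divergence(b_t1, b_t)
    precision_gain = neg_entropy(b_t1) - neg_entropy(b_t)
    return alpha * novelty + beta * precision_gain + delta

# Example: an observation that sharpens a uniform belief scores as highly relevant.
b_before = [0.25, 0.25, 0.25, 0.25]
b_after = [0.70, 0.10, 0.10, 0.10]
print(relevance(b_before, b_after, alpha=1.0, beta=0.5, delta=np.log(4)))
```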
Eck and Soh [22] also represent each agent as a POMDP and consider that each agent can request information from its neighbors. When they share information, agents share their entire belief state, and the receiving agent updates its own beliefs according to Equation (9), where \(b\) is the belief state of the agent receiving the information, \(b_{Sh}\) is the shared belief state, \(x,x^{\prime } \in DOM(X)\) are the possible values for a partially observable phenomenon \(X\), and \(w\) is a constant weight that dampens shared information, as is commonly used in the information fusion literature [27]. The resulting model is solved by transforming it into a Knowledge POMDP, as described in Section 4.2.3. The main difference between Reference [58] and Reference [22] lies in the fact that Reference [58] exchanges single observations based on their estimated relevance and updates the planning agent’s belief state using the standard POMDP Bayes rule, while Reference [22] exchanges entire belief states and uses an information fusion approach to merge them. However, both approaches focus on “on-the-go” multi-agent cooperation, where prior synchronization is unnecessary.
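Equation (9) is not reproduced here; as a minimal sketch consistent with the quantities listed above, the snippet below blends the shared belief into the local one with a constant dampening weight and renormalizes over \(DOM(X)\). This weighted-average form is one common choice in the information fusion literature and is an assumption, not necessarily the exact rule of Reference [22].

```python
import numpy as np

def fuse_beliefs(b, b_sh, w=0.5):
    """Hypothetical dampened belief fusion over a discrete phenomenon X.

    b    -- receiving agent's current belief over DOM(X)
    b_sh -- belief shared by the neighbor over DOM(X)
    w    -- constant weight that dampens the shared information

    The shared belief is blended into the local one and the result is
    renormalized over all x' in DOM(X).
    """
    b, b_sh = np.asarray(b, dtype=float), np.asarray(b_sh, dtype=float)
    fused = (1.0 - w) * b + w * b_sh
    return fused / fused.sum()

# Example: a neighbor's sharper belief pulls the local belief toward state 0.
print(fuse_beliefs([0.4, 0.3, 0.3], [0.8, 0.1, 0.1], w=0.3))  # [0.52, 0.24, 0.24]
```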
Lauri et al. [42] propose an alternative approach to multi-agent active perception by extending the \(\rho\)-POMDP approach to decentralized systems. The resulting \(\rho\)Dec-POMDP uses an entropy measure in the reward function to perform active sensing. The authors use an exact algorithm and assume periodic explicit communication to compute a joint belief estimate. Lauri et al. [44] relax the explicit communication assumption and present a heuristic for the \(\rho\)Dec-POMDP, making it possible to solve larger problems. This heuristic uses a final reward, i.e., a reward granted at the end of the finite horizon, which is also based on the Shannon entropy of the joint belief, as in Reference [42]. Because of this final reward, the algorithms in References [44] and [45] still require the explicit computation of the joint belief estimate, which incurs a large memory and computation overhead. Lauri and Oliehoek [43] show that the \(\rho\)Dec-POMDP can be converted into a Dec-POMDP with linear rewards, similarly to the \(\rho\)-POMDP and POMDP-IR equivalence. To do so, the authors introduce individual prediction actions in the \(\rho\)Dec-POMDP model, similar to the prediction actions of the POMDP-IR model. This equivalence allows the use of standard Dec-POMDP solvers and therefore does not require the computation of joint state estimates. Finally, Best et al. [12] introduce the Decentralized Monte-Carlo Tree Search (Dec-MCTS), an online algorithm for asynchronous multi-robot coordination. Even though Dec-MCTS is not limited to active perception scenarios, the authors illustrate its efficiency in an active perception scenario. Coordination is achieved by considering the plans of the other robots, which are communicated during a dedicated communication phase.
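As a small illustration of the entropy-based final reward used in References [42, 44], the sketch below computes a terminal reward from a (flattened) joint belief over the Dec-POMDP state space: the lower the Shannon entropy of the joint belief at the end of the horizon, the higher the reward. The sign convention and the absence of any weighting are assumptions made for readability, not the exact reward of the cited papers.

```python
import numpy as np

def joint_belief_entropy(joint_belief, eps=1e-12):
    """Shannon entropy of a joint belief over the Dec-POMDP state space."""
    p = np.asarray(joint_belief, dtype=float)
    return float(-np.sum(p * np.log(p + eps)))

def final_reward(joint_belief):
    """Hypothetical final reward granted at the end of the finite horizon:
    the negative entropy of the joint belief, so that more certain joint
    state estimates receive a higher reward."""
    return -joint_belief_entropy(joint_belief)

# A concentrated joint belief earns a higher final reward than a uniform one.
print(final_reward([0.9, 0.05, 0.05]))   # ~ -0.39
print(final_reward([1/3, 1/3, 1/3]))     # ~ -1.10
```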
In the approaches above, the active perception process only concerns environmental factors, and mechanisms are implemented to allow the agents to reach a common belief over these factors. However, very little work has been done to include active perception of the other agents’ internal states. To our knowledge, only Best et al. [13] take this approach, extending the Dec-MCTS model with a communication planning algorithm that optimizes a sequence of communication requests. Agents evaluate how the other agents’ beliefs evolve and request that the other agents send them their plans. In doing so, the agents introduce active perception actions over the other agents’ internal states by requesting a plan only when needed.
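A schematic decision rule in this spirit is sketched below: an agent requests another agent’s plan only when that agent’s belief (as predicted locally) has drifted far enough from the belief it held when it last shared a plan, since the cached plan is then likely outdated. The KL-based drift measure and the fixed threshold are illustrative assumptions, not the communication planning algorithm of Reference [13].

```python
import numpy as np

def should_request_plan(belief_at_last_exchange, predicted_current_belief,
                        threshold=0.2, eps=1e-12):
    """Schematic trigger for an active communication request: request the other
    agent's plan when its predicted belief has drifted (in KL divergence) by
    more than `threshold` since the last plan exchange."""
    p = np.asarray(predicted_current_belief, dtype=float)
    q = np.asarray(belief_at_last_exchange, dtype=float)
    drift = float(np.sum(p * np.log((p + eps) / (q + eps))))
    return drift > threshold

# Example: little drift -> keep the cached plan; large drift -> request a new one.
print(should_request_plan([0.5, 0.5], [0.55, 0.45]))  # False
print(should_request_plan([0.5, 0.5], [0.95, 0.05]))  # True
```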