3.1 Presentation of the Model
The model we present, as a proof-of-concept, is an implementation of the PCM principles close to the implementation we introduced in Reference [2]. The approach has similarities with Reference [79], but instead of considering the expectation of a reward function for multiple agents, we consider a mean of free-energy quantities within each agent, which takes into account the simulation of active inference in other agents. In other words, each agent embeds a multi-agent model or system [85] to simulate others. Agents must thus be able to reverse infer the preferences and the order of ToM of other agents, whereas in Reference [79], it was the policy of other agents that was inferred. More generally, our approach is quite close to the Recursive Modeling Methods that have been proposed in similar contexts, which include mechanisms of inference about the preferences of others and factors of social influence [75, 76, 78]. The innovation is that we integrate an explicit model of the three-dimensional subjective perspective of consciousness into the process, which performs functions ascribed to consciousness based on viewpoint-dependent subjective parameters (see Section 2.2 above). The model entails affective and epistemic (curiosity) drives based on projective geometrical mechanisms, and is applied to control virtual humans in virtual environments.
We extend the previous version of the model [2] with more advanced inference capacities, so that agents can infer preferences and ToM capacities in others, based on retrospective or prospective simulations of their behaviours, in a recursive manner. The predictions yielding the best predictive power are used to update beliefs, in a manner that considers not only the emotion expression and orientation of other agents but also their relative behaviours of approach and avoidance, as indicators of interest labeled with affective valence (see Figure 1).
Each agent \(A_i\) computes projections about itself as subject \(S\), and about other agents \(A_j\), using the same basic processing pipeline. For a given state or move \(m_t\) evaluated by the agent, the agent computes a projective chart \(\psi(m_t)\), corresponding to the FoC it attributes to a given agent, including itself. The perceived value \(\mu\) and the uncertainty with respect to sensory evidence \(\sigma\) (given the current state of the agent) are computed based on \(\psi(m_t)\) and the preferences attributed by \(A_i\) to the agent under consideration. These parameters are used to define a parametric probability distribution \(P(\mu,\sigma)\), which is compared to an ideal distribution \(P(\mu_0,\sigma_0)\) through the Kullback-Leibler divergence (DKL). This yields a cost function that is sensitive to divergence from both preferences and uncertainty. Emotions are also expressed by the agents accordingly (not indicated in Figure 1). The process is repeated recursively to assess successive moves, according to the depth of processing used by the agent (large round arrow, top right in Figure 1). The algorithm entails a Multi-agent System (MAS) embedded within each agent. Multiple alternate sequences of moves \(M\) are computed, to define a series of anticipated states. The sequence of moves that the agent retains is the one that minimizes its overall FE, whether or not it takes into account anticipations about other agents' states. The first move of the sequence is chosen by the agent as its actual move \(m(S)\). That actual move controls the state of the associated virtual human \(VH(S)\). The agent then takes as inputs the observed states in the world, including those of other agents (locations, orientations, emotion expressions). If those states diverge above a certain threshold \(\theta\) from the anticipated states, then a mechanism of reverse inference is triggered. Otherwise, the agent keeps computing projections based on its current beliefs, including preferences, ToM parameters, and more generally, states (locations, orientations, and emotion expressions of others). The mechanism of reverse inference tests different hypotheses about parameters such as the preferences attributed to others and the order of ToM used by others. It runs the same recursive algorithm used by the agent to simulate new projections, and retains the parameters that best explain the observed states to update its beliefs.
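As a rough illustration, the following Python sketch shows the shape of this evaluation loop; it is a minimal sketch, not the implementation of Reference [2]. It assumes Gaussian forms for the distributions, and `appraise` and `reverse_infer` are hypothetical hooks onto the projective machinery (in the full model, the appraisal of a move depends on the whole anticipated state, which is elided here).

```python
import numpy as np

def dkl_gaussian(mu, sigma, mu0, sigma0):
    # Gaussian stand-in for DKL(P(mu, sigma) || P(mu0, sigma0)).
    return np.log(sigma0 / sigma) + (sigma**2 + (mu - mu0)**2) / (2 * sigma0**2) - 0.5

def best_first_move(appraise, moves, depth, mu0=0.95, sigma0=0.05):
    """Score move sequences recursively and return the first move of the best one.

    appraise(m) -> (mu, sigma): value and uncertainty read off the chart psi(m).
    """
    def seq_cost(m, d):
        mu, sigma = appraise(m)
        cost = dkl_gaussian(mu, sigma, mu0, sigma0)
        if d > 1:  # recurse over anticipated successive moves
            cost += min(seq_cost(m2, d - 1) for m2 in moves)
        return cost
    return min(moves, key=lambda m: seq_cost(m, depth))

def check_anticipation(predicted, observed, theta, reverse_infer):
    # Trigger reverse inference when observations diverge from anticipations.
    if np.linalg.norm(np.asarray(predicted) - np.asarray(observed)) > theta:
        reverse_infer(observed)  # test hypotheses on preferences and ToM order
```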
When referring to active inference, we mean the process of inferring and acting according to inference recursively, which can be summarized as follows. Let \(S\) be the space of sensory inputs, let \(\Gamma\) be the space of states that the agent can be in, and let \(M\) be the set of actions the agent can perform. In the inference step, a state \(\gamma\in\Gamma\) is induced by a sensory input \(h\) by minimizing a cost function \(c: S\times\Gamma\rightarrow\mathbb{R}\), and during the action-selection step, the subject chooses an action according to a second cost function \(c_1: \Gamma\times M\rightarrow\mathbb{R}\), which in turn induces a change at the level of the sensory input, since the environment reacts to this action.
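Schematically, and assuming finite state and move sets with generic cost functions `c` and `c1` (a sketch of the cycle, not the paper's implementation), one perception-action loop can be written as:

```python
def active_inference_cycle(h, states, moves, c, c1, environment, steps=10):
    """Run `steps` perception-action cycles from sensory input h.

    c(h, gamma): cost of explaining input h by state gamma (inference step).
    c1(gamma, m): cost of move m given the inferred state (action selection).
    environment(m): the world's reaction to the move, yielding new input.
    """
    for _ in range(steps):
        gamma = min(states, key=lambda g: c(h, g))   # inference step
        m = min(moves, key=lambda a: c1(gamma, a))   # action-selection step
        h = environment(m)                           # environment reacts
    return h
```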
In our setting, we consider a collection of entities, \(E\), constituted of objects and agents. Agents express emotions and can infer and act according to their own preferences and those ascribed to others, with respect to a situation, while objects cannot act. When singling out an agent, for example, when making explicit how active inference works for this agent, we will call it a subject. The space of agents will be denoted \(A\). An agent \(a\in A\) can express a positive emotion \(e_{+}\in [0,1]\) and a negative emotion \(e_{-}\in [0,1]\). The space of sensory inputs of a subject is constituted of the configurations of the other entities in the ambient space and the emotions that agents express. The space of states consists of the preferences the subject can have for other entities when it only performs ToM-0 (that is, no ToM in our context), and of the preferences attributed to other agents for higher orders of ToM. Subjects act in two ways: they can move and express emotions.
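For concreteness, these structures can be held in simple containers; the field names below are illustrative, not those of the original code.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Entity:
    position: np.ndarray            # configuration X_e in the ambient space R^3

@dataclass
class Agent(Entity):
    e_plus: float = 0.0             # expressed positive emotion, in [0, 1]
    e_minus: float = 0.0            # expressed negative emotion, in [0, 1]
    preferences: dict = field(default_factory=dict)  # entity -> p in [0, 1]
    tom_order: int = 0              # order of ToM the agent operates at
```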
The details of the model we use below are presented in Reference [2].
The preference for an entity is a real number in \([0,1]\) denoted as \(p\). Every subject, \(s\), has an embodied perspective on the Euclidean ambient space that corresponds to a choice of a projective transformation that we denote as \(\psi_s\); we will call it the projective chart associated with the agent. The quantity that links perspective taking and the pleasantness of a situation is the perceived value \(\mu\), which is computed for each entity \(e\in E\) as
\[ \mu = \left(\frac{v_p}{v_{tot}}\right)^{1/4} p + \left(1-\left(\frac{v_p}{v_{tot}}\right)^{1/4}\right) q_n, \]
where \(v_p\) is the perceived volume of the entity in the total FoC of the subject of volume \(v_{tot}\). The perceived value \(\mu\) is an average of the preference for the entity and a reference preference \(q_n\), weighted by the relative perceived volume of the entity; the power \(1/4\) on the volume is taken to match documented psychophysical laws (see Reference [2] for a psychophysical and computational justification of this variable).
The subject also computes an uncertainty with respect to sensory evidence, denoted \(\sigma\), which increases with the eccentricity of the entity relative to the point of view of the subject, and with its distance. In other words, there is more certainty about entities that actually appear, or would be expected by imagination to be, in front of and close to the subject.
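A direct transcription of these two quantities follows; the form of \(\sigma\) below is only an illustrative monotone function of eccentricity and distance, the exact expression being given in Reference [2].

```python
def perceived_value(p, q_n, v_p, v_tot):
    # mu = w * p + (1 - w) * q_n, with w the relative perceived volume to the power 1/4
    w = (v_p / v_tot) ** 0.25
    return w * p + (1 - w) * q_n

def sensory_uncertainty(eccentricity, distance, k_e=0.5, k_d=0.1, sigma_min=0.05):
    # Illustrative sigma: smallest for entities in front of and close to the subject
    return sigma_min + k_e * eccentricity + k_d * distance
```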
The subject is driven toward an ideal with high perceived value and low uncertainty. To compute the divergence from this ideal, the perceived value and uncertainty are associated with a probability distribution, \(Q(\cdot\vert \mu, \sigma)\in \mathbb {P}([0,1])\), centered on \(\mu\) and of “width” \(\sigma\). This divergence is computed with the Kullback-Leibler divergence of \(Q\) from the ideal distribution \(P\), narrowly centered on values close to 1. Let us recall that for any two probability distributions \(P,Q\in \mathbb {P}(\Omega)\) over a space \(\Omega\), with \(dQ= f\,dP\),
\[ D_{KL}(Q\,\Vert\, P)=\int_{\Omega} f\log f\, dP. \]
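Numerically, with \(Q\) and \(P\) discretized on \([0,1]\) (the Gaussian shapes below are stand-ins for the actual parametric family used in the model), the divergence can be computed as:

```python
import numpy as np

def dkl(q, p, eps=1e-12):
    # Discrete D_KL(Q || P) = sum_i q_i log(q_i / p_i), with f = dQ/dP on a grid
    q, p = q / q.sum(), p / p.sum()
    return float(np.sum(q * np.log((q + eps) / (p + eps))))

x = np.linspace(0.0, 1.0, 201)
bump = lambda m, s: np.exp(-0.5 * ((x - m) / s) ** 2)  # unnormalized density on [0, 1]

# Q centered on a mediocre perceived value with high uncertainty,
# P an ideal narrowly centered on values close to 1
print(dkl(bump(0.4, 0.2), bump(0.95, 0.05)))
```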
Let us now detail the active inference cycles of subjects with ToM of order 0 (ToM-0) to ToM of order 2 (ToM-2). Here, we shall not focus on the inference part of the process nor on emotion expression, but rather on how agents select their moves; one can refer to Reference [2] for a detailed presentation of how preferences are updated and emotions are expressed.
The preferences of a subject with ToM-0 for the other entities are encoded in a vector \((q_e, e\in E)\). The configuration of an entity, \(e\), is a subset of \(\mathbb{R}^3\) denoted as \(X_e\subseteq \mathbb{R}^3\), and the collection of configurations will be denoted as \(X\). The subject chooses its move \(m\) from a set of moves \(M\) by minimizing the following average of Kullback-Leibler divergences,
\[ C_0(m)=\frac{1}{\vert E\vert}\sum_{e\in E} D_{KL}\big(Q(\cdot\vert \mu_e(m),\sigma_e(m))\,\Vert\, P\big). \]
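A sketch of the resulting ToM-0 move selection follows; `appraise(m, e)` is a hypothetical hook returning the \((\mu_e(m), \sigma_e(m))\) induced by the chart \(\psi(m)\), and `divergence` any implementation of the DKL from the ideal, such as the one above.

```python
import numpy as np

def c0(m, entities, appraise, divergence):
    # C_0(m): average divergence from the ideal over all perceived entities
    return float(np.mean([divergence(*appraise(m, e)) for e in entities]))

def choose_move_tom0(moves, entities, appraise, divergence):
    # The retained move is the one minimizing C_0
    return min(moves, key=lambda m: c0(m, entities, appraise, divergence))
```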
When the subject performs ToM-1 (ToM of order 1), it has a preference matrix \((p_{sae}\in [0,1], a\in A, e\in E)\) that encodes the preferences that agents have with respect to other entities according to the subject. The true preferences of the subject, i.e., the preference vector of \(s\), are \(p_{ss.}\). Agents may be influenced by other agents in the way they infer preferences and in the way they act; this is encoded by the influence vectors on preferences, \((J^p_{se}, e\in E)\), and on moves, \((J_{se}^m, e\in E)\), respectively. Subjects with ToM-1 can predict the moves of the other agents assuming that these have ToM of order 0; in fact, they cannot assume that the other agents have a higher order of ToM, or else it would contradict the fact that the subject has ToM-1. The number of steps into the future up to which the subject can predict the moves of the other agents is called the depth of processing and denoted as \(dp\). At step 0 of the prediction, the subject attributes to another agent, \(a\), the preference vector \(\tilde{q}^0_e=p_{sae}\) for entities \(e\in E\); and the position of the entities is \(X^0\). At step \(k\lt n\), the predicted positions, \(X^k\), the expressed emotions, \(e^k\), and the preference vectors, \(\tilde{q}^k\), are used to predict the displacement, preference update, and emotion expression of the other agents, by applying active inference for ToM-0 as described in the previous paragraph. Furthermore, the subject also holds an updated version of its preference matrix, \(p^k\). The subject then chooses its move \(m^{k+1}\) by minimizing a cost function \(C_1(m)\), for \(m\in M\), where \(Y^{k}\) is the configuration of the entities that are not the subject at step \(k+1\), and \(Y^k_s\) is \(X^k_s\). The notation \(Y^k_m\) serves to recall that the configuration of the entities depends on the move the subject decides to make, through \(Y^k_{s,m}\). Here, the perceived values and uncertainties are computed, for any agent \(a\in A\) and entity \(e\in E\), as in the ToM-0 case. One can remark that \(C_1\) is in fact a weighted mean of several \(C_0\).
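Since the text specifies only that \(C_1\) is a weighted mean of several \(C_0\), the sketch below assumes a normalized combination of the subject's own cost with the costs simulated for the other agents through the influence vector \(J^m\); the exact weighting is given in Reference [2].

```python
def c1(m, subject, others, c0_of, influence_m):
    """Sketch of C_1(m) as a weighted mean of C_0 terms.

    c0_of(agent, m): C_0 evaluated from the perspective attributed to `agent`.
    influence_m[a]: influence weight J^m of agent a on the subject's moves.
    """
    total, weight = c0_of(subject, m), 1.0
    for a in others:
        total += influence_m[a] * c0_of(a, m)
        weight += influence_m[a]
    return total / weight
```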
From this prediction \(n\) steps into the future, the subject chooses the best sequence of moves, which we assimilate to a path, \((m^{k*}_s, k\in [0,n])\), in a set of paths, \(\mathcal {P}\), by minimizing
\[ C(\pi)=\sum_{k=1\ldots n} a_k\, C_1(m^k_{\pi}),\qquad \pi\in\mathcal{P}, \]
where \(\sum _{k=1\ldots n} a_k=1\), and the \(a_k\) are chosen here to be \(a_k=\frac{1}{n}\). The best move to make for the subject is the first move of the best path.
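With uniform weights \(a_k=1/n\), the path selection amounts to the following enumeration; exhaustive enumeration is only viable for small \(n\) and move sets, and `step_cost(k, m)` stands for the cost of move \(m\) at prediction step \(k\).

```python
from itertools import product

def best_first_move_over_paths(moves, n, step_cost):
    # Enumerate all paths of length n and average their per-step costs (a_k = 1/n)
    def path_cost(path):
        return sum(step_cost(k, m) for k, m in enumerate(path, start=1)) / n
    best_path = min(product(moves, repeat=n), key=path_cost)
    return best_path[0]  # the subject executes only the first move of the best path
```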
For a subject that has ToM-2, the same procedure as for a subject with ToM-1 holds. The subject can simulate the behaviours of the other agents with respect to the degree of ToM it attributes to them, \((d_a, a\in A)\). From these simulations, it can decide on the best sequence of moves to make. To do so, one should consider that the subject has a preference tensor \((h_{sabe}, a\in A, b\in A, e\in E)\) and influence matrices \((I^p_{sab}, a\in A, b\in A)\) and \((I^m_{sab}, a\in A, b\in A)\). The case we consider is simpler, as we restrict the preference tensor \(h\) to a preference matrix \(p\), as in ToM-1, by setting \(h_{sabe}=p_{sbe}\). When the subject starts its prediction of the behaviour of the other agents, i.e., at step 0 of the prediction, the influence vectors of an agent \(a\) believed by the subject to have ToM-1 are defined as \(\tilde{J}^{0}_{a.}= I_{sa.}\). The cost function \(C_2\) for the choice of the action of the subject at step \(k\) of the prediction is a mean of the cost functions of the other agents, depending on the degree of ToM that is attributed to them. We do not enter into more detail on how \(C_2\) is computed, nor on the cost functions for higher orders of ToM; they are computed recursively, as sketched below. In the experiment we consider, we assumed that the agents are not influenced in their actions by how they believe other agents would feel as a result; what makes the difference between a subject with ToM-1 and one with ToM-2 is how it predicts the behaviour of the other agents, attributing to them ToM of order 0, or of orders 0 to 1, respectively. In the main experiment of this report, we only focus on inference about ToM order with fixed preferences. In supplementary simulations, we illustrate (see Section 5.6) how our model can tackle situations in which agents simultaneously perform inferences about others' ToM order and preferences.
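The recursion can be summarized as follows. This is a schematic reading of the text, in which uniform averaging over agents is an assumption, and `attributed_order[a]` stands for the degree \(d_a\) the subject attributes to agent \(a\).

```python
def cost_tom(k, agent, m, others, attributed_order, c0_of):
    """Schematic C_k: ToM-0 reduces to C_0; higher orders average the agent's
    own cost with the costs obtained by simulating each other agent at the
    ToM order attributed to it (which is strictly lower, so recursion ends)."""
    if k == 0:
        return c0_of(agent, m)
    simulated = [cost_tom(attributed_order[a], a, m,
                          [b for b in others if b is not a] + [agent],
                          attributed_order, c0_of)
                 for a in others]
    return (c0_of(agent, m) + sum(simulated)) / (1 + len(simulated))
```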
3.2 Inverse Inference for ToM and Preferences
A subject with ToM-2 can attribute ToM-0 or ToM-1 to another agent, and for the subject to truly be able to perform ToM-2, it must be able to correctly attribute to the other agent the order of ToM at which it truly operates. To do so, the subject must inverse infer the degree of ToM by analysing the behaviour of the agent.
When the prediction of the subject with respect to the actions of an agent diverges too much from its observed actions, it can start doubting its beliefs about the parameters it previously used to model the other agent's behaviour. It can then find better suited parameters. Here, the parameters being considered are the preferences and the order of ToM attributed to the other agents. To do so, the subject uses a measure of divergence of the predictions it made about the action of the other agent, \(a^p\), from the real action it has observed and memorized, \(a\).
Let us consider the following example: at time \(t\), the subject predicts the action \(a^{p}(t)\) with respect to the information it holds about the preferences, \(p(t)\), and the order of ToM, \(d(t)\), of the other agent. If the divergence, \(f(a^p(t), a(t))\), is too large, then the subject will update the parameters of its model of the agent, i.e., \(p\) and \(d\), to increase predictive power by minimizing this divergence,
\[ (p^*, d^*)=\operatorname*{arg\,min}_{p,\,d}\ f\big(a^p(t;p,d),\, a(t)\big). \]
The experiment we use below as a proof-of-concept is a two-choice simulation scenario. In this scenario, both the subject and another agent try to reach one of two vending machines, one being intrinsically more attractive than the other. In the main experiment, both agents also try to avoid running into each other (they have negative preferences toward each other). In the supplementary experiment, the other agent may have negative, neutral, or positive preferences toward the subject. The subject cannot see the other agent until near the end of the experimental trial (except at the very beginning). The experiment is divided into two trials. In the first trial, in both experiments, a subject with ToM-2 assumes by default that the other agent is also trying to avoid the subject. It is expected that if the subject encounters the other agent, it will learn from its mistake. In the main experiment, it will revise the order of ToM attributed to the other agent (the only parameter it tries to infer in this experiment). In the supplementary experiment, it will revise both the order of ToM and the preference attributed to the other agent toward the subject. Let us now explain how our agents revise their beliefs when confronted with evidence of misprediction.
A subject \(s\) models the order of ToM of an agent by a random variable, denoted \(D\), that takes the two values 0 and 1. It holds a prior law for \(D\), \((p_1,p_0)\in \mathbb{P}(\lbrace 0,1\rbrace)\), which can be parameterized by \(p_1\). This probability distribution plays the role of the belief the subject has about the order of ToM of the other agent. In the simulation scenario, the subject knows where the agent is at time 0, but until the end of the trial, it does not have confirmation of its position. The subject speculates on the position of the agent at each time until it sees it (or not), and gathers new information on its position if it eventually sees it. To do so, it keeps in memory the predicted positions \((x_t^0, x_t^1)\) of the agent at each time \(t\), assuming respectively that the agent performs ToM of order 0 or 1. At time \(t+1\), if it does not see the agent, then it predicts \(x_{t+1}^0\) from the predicted position \(x_{t}^0\), assuming that the agent acts as an agent with ToM of order 0, and it predicts \(x_{t+1}^1\) from \(x_t^1\), assuming the agent acts as an agent with ToM of order 1.
Therefore, at each time \(t\) the subject predicts two positions, \(x^0_t\) and \(x_t^1\), for the agent. When the subject can attest the real position of the agent, it can confront it with the predicted positions. For example, if this assessment occurs at time \(t_0\), then the subject can confront its predictions with the true position of the agent by considering \(\vert x^0_{t_0}-x_{t_0}\vert\) and \(\vert x^1_{t_0} -x_{t_0}\vert\), respectively the distances between the predicted and true positions of the other agent when the subject assumes that the agent has ToM of order 0 versus order 1.
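A sketch of this dual bookkeeping follows, with positions taken as scalars for simplicity; `predict_tom0` and `predict_tom1` are hypothetical one-step predictors implementing the ToM-0 and ToM-1 simulations, respectively.

```python
def track_hypotheses(x0, x1, predict_tom0, predict_tom1, observe, horizon):
    """Propagate two predicted positions for the other agent, one per ToM
    hypothesis, until its true position can be observed (if ever).

    observe(t) -> true position at time t, or None while the agent is unseen.
    Returns (|x0 - x|, |x1 - x|, t) at first sighting, or None.
    """
    for t in range(horizon):
        x_true = observe(t)
        if x_true is not None:
            return abs(x0 - x_true), abs(x1 - x_true), t
        x0 = predict_tom0(x0)  # next position assuming ToM of order 0
        x1 = predict_tom1(x1)  # next position assuming ToM of order 1
    return None
```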
The subject computes
\[ p_1\,\vert x^1_{t_0}-x_{t_0}\vert + p_0\,\vert x^0_{t_0}-x_{t_0}\vert \]
as a metric for the consistency of its predictions; this metric can naturally be extended to the case where both theory of mind and preferences are inferred. If this value is too high, then the subject starts to doubt its priors and will look for the \(p_1\) that minimizes the previous quantity,
\[ {p_1}_{t_0}^{*}=\operatorname*{arg\,min}_{p_1\in[0,1]}\ \big(p_1\,\vert x^1_{t_0}-x_{t_0}\vert + (1-p_1)\,\vert x^0_{t_0}-x_{t_0}\vert\big). \]
One shows that this problem is the same as finding the minimum \(d^*\),
\[ d^{*}=\operatorname*{arg\,min}_{d\in\lbrace 0,1\rbrace}\ \vert x^d_{t_0}-x_{t_0}\vert, \]
as \({p_1}_{t_0}^*=\delta (d^*)\), where \(\delta (d^*)\) equals 1 on \(d^*\) and 0 on the complementary.
In the supplementary experiment, the subject also attempts to infer the preference the other agent may have toward it, by comparing predicted and observed outcomes in terms of position, orientation, and emotion expression. For instance, if the subject encounters the other agent and the other agent expresses more positive emotion than expected, along with a different pattern of approach versus avoidance, then the subject may infer that the other agent might actually have positive preferences toward it, and adjust its strategic behaviour accordingly. Note that in the supplementary experiment, in the interest of the simulation, the subject was capable of up to ToM-3 and the other agent of up to ToM-2.