1 Introduction

In sequential recommendation problems (Ye et al. 2018, 2019), where the system needs to recommend multiple items to the user while responding to the user’s feedback, there are multiple decisions to be made in sequence. For example, in our application of program recommendation to taxi drivers on a large-scale ride-hailing platform, the system recommends a personalized driving program to each driver; a program consists of multiple steps, where each step is recommended according to how the previous steps were followed. Therefore, recommending the program steps is a sequential decision problem, and it can be naturally tackled by reinforcement learning (RL) (Sutton and Barto 2018).

As a powerful tool for learning decision-making policies, RL learns from interactions with the environment via trial and error (Sutton and Barto 2018). In digital worlds where interactions with the environment are feasible and cheap, it has achieved remarkable results (e.g., Mnih et al. 2015; Silver et al. 2016; Brown and Sandholm 2017; OpenAI et al. 2019). When it comes to real-world applications, however, physical environments are far less convenient than digital ones. It is not practical to interact with the real-world environment directly for training the policy, because of the high interaction cost, the potentially unbearable risk, and the huge number of interactions required by current RL techniques. A recent study (Shi et al. 2018) disclosed a viable option for conducting RL on real-world tasks, which is to estimate a virtual environment from past data. Once a virtual environment is built, the RL process can be made more efficient by interacting with it, and the physical cost of interacting with the real-world environment can be avoided as well.

Environment estimation can be done by treating the environment as a policy that responds to interactions, and employing imitation learning methods (Schaal 1999; Argall et al. 2009) to learn the environment policy from past data, which has drawn a lot of attention recently (Chen et al. 2019). Compared with using supervised learning, i.e., behavioral cloning, to learn the environment policy, a more promising solution in Shi et al. (2018) is to formulate environment policy learning as an interactive process between the environment and the system in it. The advantage of such a setting is that it generalizes better when evaluating a new system policy, especially when the environment policy changes over time and the distribution of newly collected data shifts as well (Zhao et al. 2020). Take a commodity recommendation system as an example: the user and the platform can be viewed as two agents interacting with each other, where the user agent views the platform as the environment and the platform agent views the user as the environment. From this multi-agent view, Shi et al. (2018) proposed a multi-agent adversarial imitation learning (MAIL) method, extending the generative adversarial imitation learning (GAIL) framework (Ho and Ermon 2016), to learn the two policies simultaneously by beating a discriminator that aims to distinguish the generated interaction data from the real interaction data.

However, the MAIL method (Shi et al. 2018) assumes that the whole world consists of only two agents. In reality, users receive much more information from the real world than is recorded in the data. Therefore, it is still quite challenging to build a realistic environment in real-world applications, since the real-world scenario is too complex to offer a fully observable environment, which means that there may exist unobservable variables that implicitly affect the interaction. As shown in Fig. 1, in the classical setting, the next state in an MDP depends on the previous state and the executed action. In most real-world scenarios, however, the next state can be additionally influenced by hidden variables, which makes the state only partially observable. If we follow the assumption of a fully observable world, the estimation would be misled by spurious associations in the data, which are commonly caused by the hidden variables. Thus, it is essential to take such hidden variables into consideration.

Fig. 1

Illustration of the graph structure and the collected data (a) in the classical environment, which is assumed to be fully observable, and (b) in the more realistic environment with unobserved variables

Hidden-state problems arise in many real-world decision tasks: the state of the environment is only incompletely known to the learning agent. Partially observable MDPs (POMDPs) (Singh et al. 1994) are an appropriate model for such problems. Most previous approaches have combined computationally expensive state estimation techniques with learning control (Kaelbling et al. 1998; Pineau et al. 2003; Cassandra et al. 2005). In control theory, it is widely accepted that learning a model of the environment, known as system identification, is useful for policy control in such cases. There has been some work on learning discrete-state models for partially observable environments (Sallans 1999), but in many real-world applications the state of the environment is high-dimensional and continuous, and little work has been done in this promising setting. In this study, we use reinforcement learning to learn a continuous-state environment model for partially observable tasks.

To incorporate hidden variables into environment estimation, we propose a partially-observed multi-agent environment estimation method, named POMEE. First, we formulate two representative policies, the agent policy \(\pi _a\) and the environment policy \(\pi _e\). Then, in order to simulate the effect of hidden variables, we add a hidden agent \(\pi _h\) into the interaction. According to the influence relationship, the hidden agent \(\pi _h\) interacts with the other two agents. Based on this formulation, we learn the policies of all three agents using only the interaction data between \(\pi _a\) and \(\pi _e\). Since the hidden variables are unobservable, we propose two techniques to learn their policy: the partially-observed environment model and the compatible discriminator, under the framework of GAIL (Ho and Ermon 2016). As the training converges, the partially-observed environment is generated.

Based on the virtual environment, RL algorithms can be used to optimize the agent policy. Policy optimization mainly includes two steps: policy evaluation and policy control. In policy evaluation, the performance of the policy is evaluated according to the reward function in the simulator. When hidden variables exist, the response of the environment can be additionally influenced by them, so the causal relationship between actions and responses must be depicted accurately in the simulator. Moreover, since real-world applications prefer online A/B testing to evaluate policy improvements (Agarwal et al. 2016), it is more desirable to build a causal reward function in the simulation environment. Generally, causal modeling aims at modeling the uplift, which is the incremental impact of an action or a treatment on an individual unit.

Recently, many studies have learned uplift models with tree-based methods (Hansotia and Rukstales 2002; Radcliffe and Surry 2011; Rzepakowski and Jaroszewicz 2012; Athey and Imbens 2015; Wager and Athey 2018), which demonstrate stable performance on many tasks. In this paper, to learn an uplift model compatible with the simulation environment, we propose a deep uplift inference network model, named DUIN, to infer the uplift of each action. By analogy with the experimental setting of online A/B testing, the DUIN model has two output branches: the control branch and the treatment branch. The control branch is trained to predict the potential outcome of the control action. The treatment branch is trained to infer the uplift behind the observed outcome. When the model converges, the output of the treatment branch approaches the true uplift value.

By implementing the environment policy in the DUIN structure, we propose a POMEE with uplift inference approach, named POMEE-UI, to build a partially-observed environment with a causal reward mechanism. First, we use randomized trial data to train the environment policy following the DUIN optimization. Then, the parameters of the treatment branch of the environment policy are kept fixed during the POMEE training process. In this way, the learned environment policy can generate simulated data similar to the real data, and meanwhile offer the uplift value of each action as a causal reward.

To verify the effectiveness of POMEE-UI, we first use an artificial environment abstracted from the real-world application to conduct toy experiments. Then, we apply POMEE-UI to a large-scale recommender system for ride-hailing driver programs in Didi Chuxing. The results of the toy experiments show that the environment learned by POMEE-UI not only exhibits reliable causality, but also restores the real policy functions well. In the real-world application, POMEE-UI achieves promising performance on both simulation evaluation and offline policy optimization. Finally, based on the virtual environment built by POMEE, a recommender policy is optimized and deployed online for A/B testing. The results of the online experiment further demonstrate the effectiveness of applying our method to real-world applications. The contributions of this work are summarized as follows:

(1) We propose a novel environment estimation method, POMEE, to tackle the real-world situation where the state of the environment is partially observable.

(2) By treating the hidden variables as a hidden policy, we formulate the hidden effect into a multi-agent interactive environment. We define the partially-observed environment model and the compatible discriminator to learn the policies effectively.

(3) We propose a novel deep uplift inference network (DUIN) model to learn the uplift effectively. Thanks to its flexibility across settings, it takes deep neural networks a step further in uplift modeling.

(4) By implementing the environment policy in the DUIN structure, we propose the POMEE-UI approach to build a partially-observed environment with uplift inference. A general, feasible and reliable pipeline is built that enables RL to unleash its sequential decision-making power in real-world applications.

(5) We deploy the proposed framework in the program recommender system of a large-scale ride-hailing platform and achieve significant improvements in the test phase.

The rest of this paper is organized as follows: we introduce the background in Sect. 2 and the proposed method POMEE in Sect. 3. The DUIN model and the POMEE-UI approach are proposed in Sect. 4. We describe the application of POMEE-UI to the driver program recommendation system in Sect. 5. Experiment results are reported in Sect. 6. Finally, we conclude the paper in Sect. 7.

2 Background

2.1 Reinforcement learning

The problem to be tackled by Reinforcement Learning (RL) can usually be represented by a Markov decision process (MDP) quintuple \((S, A, T, R, \gamma )\), where S is the state space, A is the action space, \(T:S \times A \mapsto S\) is the state transition model, \(R:S \times A \mapsto {\mathbb {R}}\) is the reward function, and \(\gamma\) is the discount factor of the cumulative reward. Reinforcement learning aims to optimize a policy \(\pi :S \mapsto A\) to maximize the expected \(\gamma\)-discounted cumulative reward \({\mathbb {E}}_\pi \left[\varSigma _{t=0}^T\gamma ^tr_t \right]\) by enabling the agent to learn from interactions with the environment. The agent observes state s from the environment, executes the action a given by \(\pi\) in the environment, then observes the next state and obtains the reward r, until the terminal state is reached. Consequently, the goal of RL is to find the optimal policy

$$\begin{aligned} \pi ^\star = \mathop {\arg \max }_\pi {\mathbb {E}}_\pi \left[\varSigma _{t=0}^T \gamma ^tr_t \right] , \end{aligned}$$
(1)

of which the expected cumulative reward is the largest.
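For concreteness, the return in Eq. (1) can be estimated by Monte-Carlo rollouts. The sketch below is a minimal illustration in Python, assuming a hypothetical `env`/`policy` interface (not the environments or policies used in this paper):

```python
import numpy as np

def rollout_return(env, policy, gamma=0.99, max_steps=1000):
    """Run one episode and return its discounted cumulative reward."""
    s = env.reset()
    rewards, done, t = [], False, 0
    while not done and t < max_steps:
        a = policy(s)                 # pi: S -> A
        s, r, done = env.step(a)      # environment transition and reward
        rewards.append(r)
        t += 1
    # discounted return: sum_t gamma^t * r_t
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Monte-Carlo estimate of E_pi[ sum_t gamma^t r_t ] over several episodes:
# value_estimate = np.mean([rollout_return(env, policy) for _ in range(100)])
```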

Partially observable Markov decision process The POMDP framework is general enough to model a variety of real-world sequential decision-making problems. The general framework of Markov decision processes with incomplete information was described by Astrom (1965) in the case of a discrete state space, and it was further studied in the operations research community, where the acronym POMDP was coined. It was later adapted for problems in artificial intelligence and automated planning by Kaelbling et al. (1998). A discrete-time POMDP can be formally described as a 7-tuple \((S, A, T, R, \Omega , O, \gamma )\), where S is a set of states, A is a set of actions, T is a set of conditional transition probabilities \(T(s' | s, a)\) for the state transition \(s \rightarrow s'\), \(R:S \times A \mapsto {\mathbb {R}}\) is the reward function, \(\Omega\) is a set of observations, O is a set of conditional observation probabilities \(O(o|s', a)\), and \(\gamma \in \left[ 0, 1\right]\) is the discount factor.

At each time period, the environment is in some state \(s \in S\). The agent chooses an action \(a \in A\), which causes the environment to transition to state \(s' \in S\) with probability \(T(s' | s, a)\). At the same time, the agent receives an observation \(o \in \Omega\), which depends on the new state of the environment with probability \(O(o | s', a)\). Finally, the agent receives a reward \(r = R(s, a)\). Then the process repeats. The goal is for the agent to choose actions at each time step that maximize its expected future discounted reward, which is the same as the goal of the MDP defined in Eq. (1).

Imitation learning Learning a policy directly from expert demonstrations has been proven very useful in practice, and has significantly improved performance in a wide range of applications (Ross et al. 2011). There are two traditional imitation learning approaches: behavioral cloning, which trains a policy by supervised learning over state-action pairs of expert trajectories (Pomerleau 1991), and inverse reinforcement learning (Russell 1998), which learns a cost function that prioritizes the expert trajectories over others. Generally, common imitation learning approaches can be unified under the following formulation: training a policy \(\pi\) to minimize the loss function \(l(s, \pi (s))\) under the discounted state distribution of the expert policy, \(P_{\pi _e}(s) = (1-\gamma )\varSigma _{t=0}^T \gamma ^t p(s_t)\). The objective of imitation learning is represented as

$$\begin{aligned} \pi = \mathop {\arg \min }_{\pi } {\mathbb {E}}_{s\sim P_{\pi _e}} \left[l(s, \pi (s)) \right] . \end{aligned}$$
(2)
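As a minimal illustration of Eq. (2), behavioral cloning instantiates the loss \(l(s, \pi (s))\) as a supervised loss over expert state-action pairs. The following PyTorch sketch uses randomly generated tensors in place of real expert demonstrations and illustrative network sizes; it shows the baseline that GAIL (below) improves upon, not the method proposed in this paper:

```python
import torch
import torch.nn as nn

# hypothetical expert demonstrations: states (N, 8), continuous actions (N, 2)
expert_states = torch.randn(1024, 8)
expert_actions = torch.randn(1024, 2)

policy = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 2))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

for _ in range(200):
    # minimize E_{s ~ expert}[ l(s, pi(s)) ] with l chosen as the squared error
    pred = policy(expert_states)
    loss = ((pred - expert_actions) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```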

2.2 Environment estimation

Reinforcement learning relies on an environment. However, when it comes to real-world applications, it is not practical to interact with the real-world environment directly to optimize the policy, because of the low sampling efficiency and the high-risk uncertainty, for example in online recommendation in e-commerce and in medical diagnosis. A viable option is to build a virtual environment (Shi et al. 2018) for offline policy training. As a result, the training process can be made more efficient by interacting with the virtual environment, and the interaction cost can be avoided as well.

Generative adversarial nets Generative adversarial networks (GANs) (Goodfellow et al. 2014) and their variants are rapidly emerging unsupervised machine learning techniques. GANs involve training a generator G and a discriminator D in a two-player zero-sum game:

$$\begin{aligned} \mathop {\arg \min }_G \mathop {\arg \max }_{D\in (0,1)} {\mathbb {E}}_{x \sim p_E} \left[ \log D(x) \right] + {\mathbb {E}}_{z\sim p_z} \left[\log (1-D(G(z))) \right] , \end{aligned}$$
(3)

where \(p_z\) is some noise distribution. In this game, the generator learns to produce samples (denoted as x) from a desired data distribution (denoted as \(p_E\)). The discriminator is trained to classify the real samples and the generated samples by supervised learning, while the generator G aims to minimize the classification accuracy of D by generating samples like real ones. In practice, the discriminator and the generator are both implemented by neural networks and updated alternately in a competitive way. The training process of GANs can be seen as searching for a Nash equilibrium in a high-dimensional parameter space, so it has a very strong data representation ability. Recent studies (Menick and Kalchbrenner 2018) have shown that GANs are capable of generating faithful real-world images, demonstrating their applicability in modeling complex distributions.

Generative adversarial imitation learning GAIL (Ho and Ermon 2016) has recently become a popular imitation learning method. It allows the policy to interact with the environment, but provides no reward signal. It was proposed to avoid the shortcomings of traditional imitation learning, such as the instability of behavioral cloning and the complexity of inverse reinforcement learning. It adopts the GAN framework to learn a policy (i.e., the generator G) with the guidance of a reward function (i.e., the discriminator D), given expert demonstrations as real samples. GAIL formulates an objective function similar to that of GANs, except that here \(p_E\) stands for the expert’s joint distribution over state-action pairs:

$$\begin{aligned} \mathop {\arg \min }_{\pi } \mathop {\arg \max }_{D\in (0,1)} {\mathbb {E}}_{\pi } \left[ \log D(s,~a) \right] + {\mathbb {E}}_{\pi _E} \left[\log (1-D(s,~a)) \right] - \lambda H(\pi ) , \end{aligned}$$
(4)

where \(H(\pi ) \triangleq {\mathbb {E}}_{\pi } \left[-\log \pi (a|s) \right]\) is the entropy of policy \(\pi\).

GAIL allows the agent to execute the policy in the environment and update it with policy gradient methods (Schulman et al. 2015). The policy is optimized to maximize the similarity, measured by D, between the policy-generated trajectories and the expert trajectories. Similar to Eq. (2), the policy \(\pi\) is updated to minimize the loss function

$$\begin{aligned} l(s, \pi (s)) = {\mathbb {E}}_{\pi } \left[ \log D(s, a) \right] - \lambda H(\pi ) \cong {\mathbb {E}}_{\tau _i} \left[\log \pi (a|s)Q(s,a) \right] -\lambda H(\pi ). \end{aligned}$$
(5)

where \(Q(s, a) = {\mathbb {E}}_{\tau _i} \left[\log (D(s, a))| s_0=s, a_0=a \right]\) is the state-action value function. The discriminator is trained to predict the conditional distribution \(D(s,a)=p(y|s,a)\), where \(y \in \{\pi _E, \pi \}\). In other words, D(s, a) is the likelihood that the pair (s, a) comes from \(\pi\) rather than from \(\pi _E\). GAIL has been proven to achieve similar theoretical and empirical results as IRL (Finn et al. 2016) while being more efficient.
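The policy update implied by Eq. (5) is a policy-gradient step in which \(\log D(s, a)\) plays the role of the return signal. The sketch below is a simplified REINFORCE-style illustration with a discrete action space and a hypothetical discriminator `disc(states, actions)`; it omits the entropy bonus and the trust-region machinery (the paper relies on TRPO):

```python
import torch
import torch.nn as nn

ds, n_actions = 8, 4   # illustrative state dimension and discrete action count
policy = nn.Sequential(nn.Linear(ds, 64), nn.Tanh(), nn.Linear(64, n_actions))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def gail_policy_step(states, actions, disc):
    """One surrogate step of Eq. (5): minimize E[ log pi(a|s) * Q(s,a) ],
    where Q(s,a) = log D(s,a) is supplied by the (hypothetical) discriminator."""
    logp = torch.log_softmax(policy(states), dim=-1)
    logp_a = logp.gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q = torch.log(disc(states, actions) + 1e-8)    # Q(s, a) = log D(s, a)
    loss = (logp_a * q).mean()                          # surrogate of Eq. (5)
    opt.zero_grad()
    loss.backward()
    opt.step()
```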

Recently, the multi-agent extension of GAIL (Shi et al. 2018) has been proven effective for building a virtual environment. A subset of the work in this paper has been published previously (Shang et al. 2019). That publication proposed an environment reconstruction method to virtualize a real-world recommendation environment with a response model. In this paper, a causal uplift model is additionally designed to learn a more reliable environment model for better policy optimization. We have also revamped the exposition of our environment generation method from the POMDP perspective.

2.3 Causal inference and uplift modeling

Uplift modeling refers to the set of techniques used to model the incremental impact of an action or a treatment on a customer outcome. For example, a manager at an e-business company could be interested in estimating the effect of sending an advertising e-mail to different customers on their probability of clicking the links to promotional ads. With that information at hand, the manager is able to target potential customers efficiently.

Uplift modeling is both a causal inference and a machine learning problem (Gutierrez and Gérardy 2017). It is a causal inference problem because one needs to estimate the difference between two outcomes that are mutually exclusive for an individual (either a user receives a promotional e-mail or does not receive it). To overcome this counterfactual nature, uplift modeling crucially relies on randomized experiments. Uplift modeling is also a machine learning problem, as one needs to train different models and select the one that yields the most reliable uplift prediction according to some performance metrics. More background knowledge can be found in “Appendix A.1 and A.2”.

The most popular methods for uplift modeling in the literature remain the tree-based ones (see Hansotia and Rukstales 2002; Radcliffe and Surry 2011; Rzepakowski and Jaroszewicz 2012; Athey and Imbens 2015; Wager and Athey 2018). However, little work (Johansson et al. 2016) has been done to exploit the strong representation ability of deep neural networks for uplift modeling. In this paper, we take a further step and use deep neural networks for uplift modeling, in a way that is also compatible with the training process of environment estimation.

3 Partially-observed multi-agent environment estimation

To estimate environments in which hidden states exist, we propose a novel partially-observed multi-agent environment estimation (POMEE) method.

3.1 Formulation

In this study, by treating the hidden variables as a hidden policy, we formulate the partially-observed environment estimation as follows:

Partially-observed multi-agent environment.

  • Observable agent A: known as the policy agent, denoted as \(\pi _a\) with observation \(o_A\) as input and action \(a_A\) as output.

  • Observable agent E: known as the environment, denoted as \(\pi _e\) with observation \(o_E\) as input and action \(a_E\) as output.

  • Unobservable agent H: known as the hidden variables, denoted as \(\pi _h\) with observation \(o_H\) as input and action \(a_H\) as output. It plays the role of a hidden effect in the sequential interactions between the policy agent A and the environment agent E.

Simulation of interaction.

  • Start of the simulation trajectory: Given \(o_A\) (sampled from the initial states in the historical data) as the observation of agent A, it takes an action \(a_A = \pi _a(o_A)\).

  • Hidden effect for the action: the observation \(o_H\) of agent H is formatted as the concatenation \(o_H = <o_A, a_A>\), and the action \(a_H = \pi _h(o_H)\) has the same format as \(a_A\).

  • Hidden effect for the environment response: the observation \(o_E\) of agent E is formatted as the concatenation \(o_E = <o_A, a_A, a_H>\), and its action is \(a_E = \pi _e(o_E)\), which is used to move forward to the new state for the next step.

Goal We assume that the true policies \(\pi _e^\star\) and \(\pi _h^\star\) behind the observed trajectories are fixed during a trajectory. The objective is to use only the observable interactions, that is, trajectories \(\tau _{real} = \{(o_A, a_A, a_E)\}\), to imitate the policies \(\pi _a\) and \(\pi _e\), together with recovering the hidden effect of H by inferring the hidden policy \(\pi _h\).

3.2 Objective function

The objective function of multi-agent imitation learning is defined analogously to Eq. (2):

$$\begin{aligned} (\pi _a, \pi _e, \pi _h) = \mathop {\arg \min }_{(\pi _a,\, \pi _e,\, \pi _h)} {\mathbb {E}}_{o_A\sim P_{\tau _{real}}} \left[ L(o_A, a_A, a_E) \right] , \end{aligned}$$
(6)

where \(a_A,~a_E\) depend on the three policies. By adopting the GAIL framework, according to Eq. (5), we obtain the imitation loss for environment estimation as

$$\begin{aligned} L(o_A, \pi _a, \pi _h, \pi _e) = {\mathbb {E}}_{\pi _a,\, \pi _h,\, \pi _e} \left[\log D(o_A, a_A, a_E) \right]-\lambda \varSigma _{\pi \in \{\pi _a,\, \pi _h,\, \pi _e\}}H(\pi ). \end{aligned}$$
(7)

We observe that \(\pi _a\) is independent of \(\pi _h\) and \(\pi _e\) given \(o_A\) and \(a_A\); then, using the conditional independence rule, \(D(o_A, a_A, a_E)\) under the GAIL framework can be decomposed as

$$\begin{aligned} \begin{aligned} D(o_A, a_A, a_E)&= p(\pi _a, \pi _h, \pi _e | o_A,a_A,a_E) \\&= p(\pi _a |o_A,a_A,a_E)~p( \pi _h, \pi _e|o_A,a_A,a_E) \\&= p(\pi _a |o_A,a_A)~p( \pi _h, \pi _e|o_A,a_A,a_E) \\&= D_a(o_A, a_A)~D_{he}(o_A,a_A, a_E) . \end{aligned} \end{aligned}$$
(8)

where \(D_a(o_A, a_A)\) denotes the imitation term of policy \(\pi _a\), and \(D_{he}(o_A,a_A, a_E)\) denotes the imitation term of policies \(\pi _h\) and \(\pi _e\). Combining Eqs. (7) and (8), we can decompose the loss function as

$$\begin{aligned} \begin{aligned} L(o_A,\pi _a, \pi _h, \pi _e) =&~{\mathbb {E}}_{\pi _a, \pi _h, \pi _e} \left[\log D_a(o_A, a_A) D_{he}(o_A,a_A, a_E)\right] - \lambda \varSigma _{\pi \in \{\pi _a, \pi _h, \pi _e\}}H(\pi ) \\ =&~{\mathbb {E}}_{\pi _a} \left[\log D_a(o_A, a_A) \right] - \lambda H(\pi _a) \\&\quad +{\mathbb {E}}_{\pi _h, \pi _e} \left[\log D_{he}(o_A,a_A, a_E) \right] -\lambda \varSigma _{\pi \in \{\pi _h, \pi _e\}}H(\pi )\\ =&~l(o_A, \pi _a(o_A)) + l((o_A, a_A),\, \pi _e \circ \pi _h((o_A,a_A))) \end{aligned} \end{aligned}$$
(9)

which indicates that the optimization can be decomposed into optimizing the policy \(\pi _a\) and the joint policy \(\pi _{he} = \pi _e \circ \pi _h\) individually by minimizing the loss functions

$$\begin{aligned} \begin{aligned} l(o_A, \pi _a(o_A))&= {\mathbb {E}}_{\pi _a} \left[\log D_a(o_A, a_A) \right] - \lambda H(\pi _a) \\&\cong {\mathbb {E}}_{\tau _i} \left[\log \pi _a(a_A|o_A)Q(o_A,a_A) \right] -\lambda H(\pi _a) , \end{aligned} \end{aligned}$$
(10)

where \(Q(o_A, a_A) = {\mathbb {E}}_{\tau _i} \left[\log (D(o_A, a_A))| o_0=o_A, a_0=a_A \right]\) is the state-action value function of \(\pi _a\), and

$$\begin{aligned} \begin{aligned} l((o_A, a_A), \pi _{he}((o_A,a_A))) =&~{\mathbb {E}}_{\pi _h, \pi _e} \left[\log D_{he}((o_A,a_A), a_E) \right] - \lambda \varSigma _{\pi \in \{\pi _h,\pi _e\}} H(\pi ) \\ \cong&~{\mathbb {E}}_{\tau _i} \left[\log \pi _{he}(a_E|o_A, a_A)Q(o_A,a_A, a_E) \right] - \lambda \varSigma _{\pi \in \{\pi _h,\pi _e\}} H(\pi ) , \end{aligned} \end{aligned}$$
(11)

where \(Q(o_A, a_A, a_E) = {\mathbb {E}}_{\tau _i} \left[\log (D((o_A, a_A), a_E))| o_0=o_A, a_{A0}=a_A, a_{E0}=a_E \right]\) is the state-action value function of \(\pi _{he}\).

Based on this result, we propose the partially-observed environment model and the compatible discriminator to achieve the goal of imitating the policies of agents A and E together with the hidden agent H, thus obtaining the POMEE approach.

3.3 Partially-observed environment model

In this study, the interaction between agent A (known as the policy agent) and agent E (known as the environment) can be observed, while the policy and data of agent H (known as the hidden variables) are unobservable.

Based on the decomposition result of the objective function, we combine the hidden policy \(\pi _h\) with the observable policy \(\pi _e\) into a joint policy, named \(\pi _{he} = \pi _e \circ \pi _h\). Under the GAIL framework, together with the policy \(\pi _a\), the generator is formalized as an interactive environment of two policies, as shown at the top of Fig. 2. The joint policy can be expressed as

$$\begin{aligned} \pi _{he} (o_A, a_A) = \pi _e(o_A, a_A, \pi _h(o_A, a_A)) \end{aligned}$$
(12)

in which the input \((o_A, a_A)\) and the output \(a_E\) are both observable in the historical data. Therefore, we can use imitation learning methods to train these two policies by imitating the observed interactions.
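In code, the joint policy of Eq. (12) is simply a composition of two networks, with the hidden policy’s action concatenated into the environment policy’s input. The following sketch uses illustrative dimensions and multi-layer perceptrons; it is not the exact architecture of the paper:

```python
import torch
import torch.nn as nn

def make_mlp(in_dim, out_dim, hidden=64):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh(),
                         nn.Linear(hidden, hidden), nn.Tanh(),
                         nn.Linear(hidden, out_dim))

d_oA, d_aA, d_aH, d_aE = 8, 2, 2, 2          # illustrative dimensions
pi_h = make_mlp(d_oA + d_aA, d_aH)           # hidden policy, input o_H = <o_A, a_A>
pi_e = make_mlp(d_oA + d_aA + d_aH, d_aE)    # environment policy, input o_E = <o_A, a_A, a_H>

def pi_he(o_A, a_A):
    """Joint policy of Eq. (12): pi_he(o_A, a_A) = pi_e(o_A, a_A, pi_h(o_A, a_A))."""
    a_H = pi_h(torch.cat([o_A, a_A], dim=-1))
    return pi_e(torch.cat([o_A, a_A, a_H], dim=-1))
```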

The policies in the generator are updated alternately in each training step: first, the joint policy \(\pi _{he}\) is updated with the imitation reward \(r^{he}\) given by the discriminator; second, the policy \(\pi _a\) is updated with the corresponding reward \(r^a\), also given by the discriminator. Though there is no explicit update step for the hidden policy \(\pi _h\), it is implicitly inferred through these two steps. Intuitively, the generated hidden policy \(\pi _h\) is a by-product of optimizing the policies \(\pi _a\) and \(\pi _{he}\) towards the truth, and consequently it can recover the real hidden effect to some extent. To make the training process more stable, we employ TRPO (Schulman et al. 2015) to update the two policies.

Fig. 2

The generator and the discriminator in POMEE. The multi-agent interactive environment plays the role of the generator and can generate simulated interaction data. The discriminator is designed to be compatible with classifying the state-action pairs of both the policy \(\pi _a\) and the joint policy \(\pi _{he}\)

3.4 Compatible discriminator

In most generative adversarial learning frameworks, there is only one task to model and learn in the generator. In this study, it is essential to simulate and learn different reward functions for the two policies \(\pi _{a}\) and \(\pi _{he}\) contained in the generator.

We design a discriminator compatible with two classification tasks. As Fig. 2 illustrates, one task is designed to classify the real and generated state-action pairs of \(\pi _{a}\), while the other is to classify the state-action pairs of \(\pi _{he}\). Correspondingly, the discriminator has two kinds of input: the state-action pair \((o_A,~a_A,~a_E)\) of policy \(\pi _{he}\) and the zero-padded state-action pair \((o_A,~a_A,~{\mathbf {0}})\) of policy \(\pi _a\). This setting indicates that the discriminator splits not only the policy \(\pi _{he}\)’s state-action space, but also the policy \(\pi _a\)’s state-action space. The loss function of each task is defined as

$$\begin{aligned} E_{\tau _{sim}} \left[\log (D_\sigma (o_A,a_A,a_E)) \right]+E_{\tau _{real}} \left[\log (1-D_\sigma (o_A,a_A,a_E))\right] \end{aligned}$$
(13)

for \(\pi _{he}\), and

$$\begin{aligned} E_{\tau _{sim}} \left[\log (D_\sigma (o_A,a_A,{\mathbf {0}})) \right]+E_{\tau _{real}} \left[\log (1-D_\sigma (o_A,a_A,{\mathbf {0}})) \right] \end{aligned}$$
(14)

for policy \(\pi _a\).

The output of the discriminator is the probability that the input pair comes from the real data distribution. The discriminator is trained by supervised learning, labeling the real state-action pairs as 1 and the generated fake state-action pairs as 0. It is then used as a reward provider for the policies during simulated interactions. The reward function for policy \(\pi _{he}\) can be written as:

$$\begin{aligned} r^{he} = -\log (1-D(o_A,~a_A,~a_E)) , \end{aligned}$$
(15)

and the reward function for policy \(\pi _a\) is

$$\begin{aligned} r^a = -\log (1-D(o_A,~a_A,~{\mathbf {0}})) . \end{aligned}$$
(16)
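The sketch below illustrates the compatible discriminator: one network handles both kinds of input, with the \(a_E\) slot zero-padded for \(\pi _a\)’s pairs, and the trained network then provides the rewards of Eqs. (15) and (16). It is a simplified, assumption-laden version (a standard binary cross-entropy with real pairs labeled 1 and simulated pairs labeled 0, as described above; shapes and training details are illustrative):

```python
import torch
import torch.nn as nn

d_oA, d_aA, d_aE = 8, 2, 2    # illustrative dimensions
disc = nn.Sequential(nn.Linear(d_oA + d_aA + d_aE, 64), nn.Tanh(),
                     nn.Linear(64, 64), nn.Tanh(),
                     nn.Linear(64, 1), nn.Sigmoid())   # D in (0, 1): P(pair is real)
bce = nn.BCELoss()
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-3)

def pad_zero(x):
    """Zero out the a_E part of a concatenated (o_A, a_A, a_E) batch (Eq. (14))."""
    x = x.clone()
    x[:, d_oA + d_aA:] = 0.0
    return x

def disc_step(real, sim):
    """One compatible update on batches of real / simulated (o_A, a_A, a_E) triples."""
    loss = 0.0
    for batch_real, batch_sim in ((real, sim),                        # task for pi_he, Eq. (13)
                                  (pad_zero(real), pad_zero(sim))):   # task for pi_a,  Eq. (14)
        labels = torch.cat([torch.ones(len(batch_real), 1),
                            torch.zeros(len(batch_sim), 1)])
        preds = disc(torch.cat([batch_real, batch_sim]))
        loss = loss + bce(preds, labels)
    opt_d.zero_grad()
    loss.backward()
    opt_d.step()

def reward_he(o_A, a_A, a_E):   # Eq. (15)
    return -torch.log(1.0 - disc(torch.cat([o_A, a_A, a_E], dim=-1)) + 1e-8)

def reward_a(o_A, a_A):         # Eq. (16)
    zeros = torch.zeros(o_A.shape[0], d_aE)
    return -torch.log(1.0 - disc(torch.cat([o_A, a_A, zeros], dim=-1)) + 1e-8)
```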

3.5 Simulation

We simulate interactions in the generator module. A simulated trajectory is generated as follows. First, we randomly sample one trajectory from the observed data and set its first observation as the initial observation \(o_0^A\). Then we use the two policies \(\pi _a, \pi _{he}\) to generate a whole trajectory starting from \(o_0^A\). Given the observation \(o_t^A\) as the input of \(\pi _a\), the action \(a_t^A\) is obtained. The action \(a_t^E\) is then obtained from the joint policy \(\pi _{he}\) with the concatenation \(<o_t^A, a_t^A>\) as input. Next, we compute the imitation rewards \(r_t^{he}\) by Eq. (15) and \(r_t^a\) by Eq. (16), which are used for updating the policies in the adversarial training step. Finally, we obtain the next observation \(o_{t+1}^A\) from \(o_t^A\) and \(a_t^E\) by the predefined transition dynamics. This step is repeated until a terminal state is reached, and a fake trajectory is generated.
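The rollout described above can be sketched as follows, reusing the hypothetical `pi_a`, `pi_he`, `reward_he` and `reward_a` objects from the previous sketches; the transition function is task-specific and left abstract, and a fixed horizon stands in for the terminal-state check:

```python
def simulate_trajectory(o_A, pi_a, pi_he, transition, horizon=30):
    """Generate one simulated trajectory from an initial observation o_A
    (sampled from the real data) and attach the imitation rewards."""
    trajectory = []
    for _ in range(horizon):                 # until a terminal state in practice
        a_A = pi_a(o_A)                      # agent action
        a_E = pi_he(o_A, a_A)                # environment response via the joint policy
        r_he = reward_he(o_A, a_A, a_E)      # imitation reward for pi_he, Eq. (15)
        r_a = reward_a(o_A, a_A)             # imitation reward for pi_a,  Eq. (16)
        trajectory.append((o_A, a_A, a_E, r_a, r_he))
        o_A = transition(o_A, a_E)           # predefined, task-specific dynamics
    return trajectory
```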

Algorithm 1

3.6 POMEE algorithm

Based on the partially-observed environment model and the compatible discriminator, we propose the POMEE method to achieve the goal of estimating an environment with hidden variables from the observed data.

Algorithm 1 shows the details of POMEE. The whole algorithm adopts the generative adversarial training framework. In each iteration, the generator first simulates interactions using policies \(\pi _a\) and \(\pi _{he}\) to collect the trajectory set \(\tau _{sim}\), corresponding to Line 5 to Line 15. Then the policies \(\pi _{a}\) and \(\pi _{he}\) are updated in turn using TRPO with the generated trajectories \(\tau _{sim}\) in Line 16. After K generator steps, the compatible discriminator is trained in two steps as shown in Line 18. The predefined transition dynamics in Line 11 depends on the specific task. In this way, the algorithm can effectively imitate the policies behind the observed interactions and recover the hidden variables beyond the observations.

4 Partially-observed environment estimation with uplift inference

In reinforcement learning, the environment model mainly consists of two parts: the state transition dynamics and the reward function. The POMEE approach introduced in the previous section achieves the modeling of the transition dynamics. In this section, we introduce a novel uplift model to build the reward function in the simulation environment. It is important to consider the causality between rewards and actions when hidden variables exist in the environment. Only when the causal effect of different actions is accurately depicted can policy optimization based on the simulator make sense. An illustration of the importance of uplift modeling can be seen in “Appendix A.3”.

To learn a causal reward function in the virtual environment, we propose a novel deep uplift inference network model, DUIN, that fits into the training process of POMEE. In addition, the DUIN model can be flexibly applied to binary-treatment and multi-treatment settings, as well as to classification and regression tasks.

4.1 DUIN model structure

Uplift modeling is generally based on randomized trial experiments. Given the data of the control and treatment groups, and deriving the variant Eq. (17) from Eq. (24) in “Appendix A.1”, we propose the DUIN model, trained on the randomized experiment data, to infer the uplift. Figure 3 illustrates the detailed structure of this model. The inputs of this network are the observation X, fed into the input layer, and the treatment indicator t, fed into an intermediate layer. The output is the predicted potential outcome under X and t. We use supervised learning to train this model. We have the following relationship regarding uplift inference:

$$\begin{aligned} {\mathbb {E}} \left[Y_i(t)|X_i \right] = {\mathbb {E}} \left[Y_i(0)|X_i \right] + \tau _t(X_i) . \end{aligned}$$
(17)
Fig. 3

The model structure of the Deep Uplift Inference Network (DUIN) under the multi-treatment setting; it reduces to the binary setting when \(n = 1\). The observation X and the treatment t are fed as input, and the potential outcome Y is the output. The uplift is output through an intermediate layer

The whole network consists of two modules: the representation module and the inference module. The representation module is trained to learn high-level features that can effectively represent the potential outcome space. Based on the high-level features, the inference module is trained to predict the outcome. The inference module splits into two branches: the control branch and the treatment branch. The output of the control branch is the outcome if not treated, corresponding to \({\mathbb {E}} \left[Y_i(0)|X_i \right]\) in Eq. (17). The output of the treatment branch is the uplift estimate for treatment t, corresponding to \(\tau _t(X_i)\). The two branches are merged by adding their outputs as in Eq. (17), and the merged output becomes the outcome of treatment t, corresponding to \({\mathbb {E}}[Y_i(t)|X_i]\) in Eq. (17).
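A minimal PyTorch-style sketch of this two-branch structure is given below; layer sizes and the use of a single linear layer per branch are illustrative assumptions, and the actual architecture is the one depicted in Fig. 3:

```python
import torch
import torch.nn as nn

class DUIN(nn.Module):
    """Representation module + control branch + treatment branch (Eq. (17))."""
    def __init__(self, d_x, n_treatments, hidden=64):
        super().__init__()
        self.repr = nn.Sequential(nn.Linear(d_x, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.control = nn.Linear(hidden, 1)               # E[Y(0) | X]
        self.treatment = nn.Linear(hidden, n_treatments)  # uplift tau_t(X), one per treatment

    def forward(self, x, t):
        """x: (N, d_x) observations; t: (N,) long treatment index, 0 = control."""
        h = self.repr(x)
        y0 = self.control(h)                              # control-branch outcome
        uplift = self.treatment(h)                        # uplift vector u_n(x)
        # mask e_t: zero vector for control, one-hot of the chosen treatment otherwise
        e_t = torch.zeros_like(uplift)
        treated = t > 0
        e_t[treated, t[treated] - 1] = 1.0
        y = y0 + (e_t * uplift).sum(dim=1, keepdim=True)  # E[Y(t) | X] as in Eq. (17)
        return y, y0, uplift
```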

4.2 DUIN optimization method

We use a supervised alternating optimization approach to train the DUIN model. We train the control branch together with the representation module on the control group data. Similarly, we train the treatment branch together with the representation module on the treatment group data. The objective function can be formulated as

$$\begin{aligned} l(x, t, y, \theta , \omega _0, \omega _1) = L \left(y,~ \hat{y_0}(x, \theta , \omega _0) + e_t * u_n(x, \theta , \omega _1 )\right) , \end{aligned}$$
(18)

where y is the ground-truth outcome, \(\hat{y_0}\) is the predicted outcome under the observation x with no treatment, \(u_n\) is the uplift vector under n different treatments, and \(e_t\) is a mask row vector with the tth bit set to 1. Specifically, \(e_t\) is a zero vector when the treatment is the control (not treated). The loss function L can be either a regression loss, e.g., MSE or RMSE, or a classification loss, e.g., the logarithmic loss.

The whole training process of DUIN is shown in Algorithm 2. In each iteration, we update the parameters \(\theta , \omega _0\) of the control branch for K steps, and then update the parameters \(\theta , \omega _1\) of the treatment branch for the same K steps. Experimental results show that a smaller K yields better generalization and faster convergence under ideal conditions. As the model converges, the representation module and the treatment branch can be used as an uplift inference module. Intuitively, the uplift inference in DUIN fits the residual between the control outcome and the treatment outcome.
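A simplified sketch of this alternating scheme, continuing the DUIN sketch above (batching, early stopping, and the exact bookkeeping of Algorithm 2 are omitted; the MSE loss is one of the choices allowed by Eq. (18)):

```python
import torch

def train_duin(model, x, t, y, epochs=100, k=1, lr=1e-3):
    """Alternate K control-branch steps and K treatment-branch steps.
    x: (N, d_x) observations, t: (N,) treatment indices (0 = control), y: (N, 1) outcomes."""
    mse = torch.nn.MSELoss()
    ctrl_params = list(model.repr.parameters()) + list(model.control.parameters())
    trt_params = list(model.repr.parameters()) + list(model.treatment.parameters())
    opt_ctrl = torch.optim.Adam(ctrl_params, lr=lr)   # updates (theta, omega_0)
    opt_trt = torch.optim.Adam(trt_params, lr=lr)     # updates (theta, omega_1)
    ctrl, trt = (t == 0), (t > 0)
    for _ in range(epochs):
        for _ in range(k):            # control group: the prediction reduces to y0
            y_hat, _, _ = model(x[ctrl], t[ctrl])
            loss = mse(y_hat, y[ctrl])
            opt_ctrl.zero_grad(); loss.backward(); opt_ctrl.step()
        for _ in range(k):            # treatment group: y0 + e_t * u_n as in Eq. (18)
            y_hat, _, _ = model(x[trt], t[trt])
            loss = mse(y_hat, y[trt])
            opt_trt.zero_grad(); loss.backward(); opt_trt.step()
```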

Algorithm 2

4.3 POMEE with uplift inference

By implementing the environment policy \(\pi _e\) in the DUIN structure, we propose POMEE with uplift inference (POMEE-UI), as shown in Algorithm 3. Based on the POMEE framework, it achieves the simulation of transition dynamics in a partially-observed environment. At the same time, due to the DUIN structure of the environment policy, a reward function with causality is also constructed. The integrated environment model can thus be more reliable for policy evaluation.

The computation graph of the environment policy is shown in Fig. 4. By analogy with the DUIN structure, the environment policy \(\pi _e\) also contains a representation module and an inference module. In the inference module, the output of the treatment branch \(u_E\) is the uplift value of action \(a_A\) under observation \(o_A\). The output of the control branch \(a_{E0}\) is the potential outcome of the environment under no treatment. The final output of the environment policy \(a_E\) is calculated as \(a_{E0}\) plus \(u_E\), which can be used to simulate the state transition process. The treatment branch acts as a reward function in the environment, of which the output \(u_E\) can be used as a reward for policy evaluation. In addition, considering the interaction relationship of the partially-observed environment, the output of the hidden policy \(a_H\) is fed into the control branch by concatenating it with the output of the representation module. Due to the unobservability of the hidden policy, the placeholder for \(a_H\) is fed with a zero vector during the DUIN training process.

Fig. 4

The computation graph of the environment policy \(\pi _e\) implemented in the DUIN structure. The treatment action \(a_A\) is fed into the treatment branch and the hidden action \(a_H\) is fed into the control branch. The outputs \(a_E\) and \(u_E\) are the response action and the estimated uplift value, respectively

Algorithm 3

Algorithm 3 describes the training process of POMEE-UI. First, a DUIN-style environment policy model \(\pi _e\) is trained on the randomized trial dataset \(D_{rand}\). Second, the representation module and the treatment branch of \(\pi _e\) remain fixed as an uplift model, and the parameters \(\theta , \omega _1\) are set to be untrainable in the following step. Finally, the POMEE training process is carried out on the observed dataset \(D_{real}\). In other words, only the parameters \(\omega _0\) of the control branch in \(\pi _e\) are updated during the POMEE training. In addition, since the hidden action \(a_H\) is not observable, it is initialized as a zero vector during the DUIN training of \(\pi _e\) in the first step (Line 1).
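Continuing the DUIN sketch, the freezing in the second step amounts to marking \(\theta\) and \(\omega _1\) as untrainable, so that only the control branch \(\omega _0\) of \(\pi _e\) is updated while POMEE runs on \(D_{real}\) (assuming `pi_e` is a DUIN-style module as sketched above):

```python
# Freeze the representation module (theta) and the treatment branch (omega_1):
# together they act as a fixed uplift / reward model inside the virtual environment.
for p in pi_e.repr.parameters():
    p.requires_grad = False
for p in pi_e.treatment.parameters():
    p.requires_grad = False
# pi_e.control (omega_0) stays trainable and is the only part of pi_e
# updated during the subsequent POMEE training on D_real.
```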

5 Application in driver program recommendation

5.1 Driver program recommendation

We have witnessed a rapid development of on-demand ride-hailing services in recent years. In this economic pattern, the platform often needs to recommend programs to drivers, aiming to help them complete more orders. Specifically, the platform selects an appropriate program to recommend to each driver every day, and then adjusts the program content according to the driver’s feedback behavior. This is a typical sequential recommendation task and can be naturally tackled by reinforcement learning (Qin et al. 2020). However, the behavior of drivers is influenced not only by the recommended programs but also by other unobservable factors, such as responses to special events; that is, hidden variables exist in this application scenario. In order to optimize the recommender policy, it is essential to take the potential influence of hidden factors into account when recommending programs.

Traditional reinforcement learning approaches are applied to these problems without exploring the impact of hidden variables, which consequently degrades learning performance. Thus, a more adaptive approach, such as the POMEE-UI proposed in this paper, is desirable to tackle these problems.

In this paper, we propose a general pipeline for applying reinforcement learning to optimize a policy in a real-world application based on historical data. First and foremost, we build a virtual environment, namely a simulator, to precisely recover the transition dynamics and reward mechanism of the real-world environment from historical data. We then apply RL algorithms to optimize the system policy by interacting with the virtual environment. Such a simulator-based RL method can be very efficient, incurring no interaction cost with the real-world environment. A more detailed illustration of the pipeline can be seen in Fig. 15 in “Appendix A.4”.

5.2 POMEE-UI based driver program recommendation

As for the driver program recommendation, we apply POMEE-UI to build a virtual environment with hidden variables from historical data. As shown in Fig. 5, there are three agents in the environment, representing the driver policy \(\pi _d\), the platform policy \(\pi _p\) and the hidden policy \(\pi _h\). We can see that the driver policy and the platform policy are each other’s environment from the perspective of the MDP. From the platform’s point of view, its observation is the driver’s response, and its action is the program recommended to the driver. Correspondingly, from the driver’s point of view, its observation is the platform’s recommended program, and its action is the driver’s response to the platform. The hidden variables are modeled as a hidden policy according to POMEE, so as to exert a dynamic effect at each time step.

Fig. 5

The POMEE-UI framework applied in the driver program recommendation. While real-world data only collects the interactions between the drivers and the Didi Chuxing platform, the virtual environment contains three policies simulating the drivers, the platform, and the hidden variables

Data preparation Based on the real-world scenario, we integrate the historical data and construct historical trajectories \(D_{hist} = \left\{ \tau _1, \ldots , \tau _i, \ldots , \tau _n \right\}\) representing the trajectories of n drivers. Each trajectory \(\tau _i = \left\{o_0^P, a_0^P, a_0^D, o_1^P, \ldots , o_t^P, a_t^P, a_t^D, o_{t+1}^P, \ldots , o_T^P \right\}\) represents the T steps of observable interactions between the driver \(d_i\) and the platform system. For the DUIN training, we collect randomized trial data \(D_{rand} = \left\{ \left(o^P, a^P, a^D\right) \right\}\) from the recommender system.

Definition of policies According to the interaction among agents in this application, the observation and action of each agent policy are defined as follows:

  • platform policy \(\pi _p\): The observation \(o_t^P\) consists of the driver’s static characteristics (using real data) and the simulated response behavior \(a_{t-1}^D\). The action \(a_t^P\) is the program information recommended for the driver, represented as a 2-tuple of integers (T, M), where T indicates the target and M is the amount of bonus for achieving the target.

  • hidden policy \(\pi _h\): The observation \(o_t^H\) consists of \(o_t^P\) and \(a_t^P\). The action \(a_t^H\) is the same format as \(a_t^P\).

  • driver policy \(\pi _d\): The observation \(o_t^D\) consists of \(o_t^P\), \(a_t^P\) and \(a_t^H\). The action \(a_t^D\) is the simulated driver’s behavior at the current step, which indicates the completion degree of the recommended program \(a_t^P\).

Analogously to the POMEE-UI method, we implement the driver policy \(\pi _d\) in the DUIN structure, and further combine the policies \(\pi _h, \pi _d\) into a joint policy. We then apply POMEE-UI to train \(\pi _d\) and \(\pi _h\). Afterwards, the partially-observed environment of driver program recommendation is reconstructed.

5.3 RL in the virtual environment

Once the virtual environment is built, we can perform RL efficiently to optimize the policy \(\pi _p\) by interacting with the environment. The challenge with simulated training is that even the best available simulators do not perfectly capture reality, which is often called the “reality gap”. Models trained purely on static data fail to generalize to the real world, as there is a discrepancy between simulated and real environments in terms of some physical properties. A number of related works have sought to address the reality gap in robotics, such as domain adaptation (Tzeng et al. 2016) and randomization of simulated environments (Sadeghi and Levine 2016), but they have not been verified in real-world environments.

In this work, we design the following mechanisms to try to close this gap in our application. With the uplift model embedded in the virtual environment, we can design the recommendation reward with uplift values, which have a sound causal relationship with the recommended program. In addition, due to the simulated hidden variables in the environment, the reinforcement learning approach can learn a more robust policy with improved performance in the real world.

6 Experiments

In this section, we conduct two groups of experiments to verify the effectiveness of the proposed POMEE-UI method. The first is a group of toy experiments in which a rule-based environment is designed; the second is a real-world application of driver program recommendation at Didi Chuxing.

6.1 Toy experiments

We first design an artificial environment to verify the effectiveness of the proposed method POMEE-UI. However, it is rather difficult to design a single artificial environment that can verify both the hidden effects and the uplift learning performance. Considering that the uplift model produced by the DUIN training remains fixed during the subsequent POMEE training in POMEE-UI, we first design a randomized trial experiment to evaluate the learning performance of the uplift model independently. We then validate the policy simulation effects of POMEE-UI in a well-defined artificial environment.

6.1.1 DUIN on synthetic data

We separately design an artificial randomized trial dataset to verify the effectiveness of the DUIN model. All function rules and parameter values are designed to mimic the real-world environment. Three rule-based functions are defined: the artificial control outcome function \(f^C\), the artificial uplift function \(f^U\), and the artificial treatment outcome function \(f^T = f^C + f^U\), as in Eq. (17). We compare DUIN with two meta-algorithms of uplift modeling (Künzel et al. 2019):

  • S-Learner the treatment is included as a feature similar to the observation features to estimate a combined outcome function. It is a “single” response estimator.

  • T-Learner the control response estimator and the treatment response estimator are learned separately, “T” being short for “two”.

  • DUIN the uplift modeling method proposed in this paper.

Fig. 6

Illustration of the uplift toy experiment settings. The \(f^C\) function is the control outcome function. The uplift function \(f^U\) is defined to mimic the uplift under different conditions as shown in Fig. 14 in “Appendix A.3”. The treated function \(f^T\) is defined by adding the uplift function \(f^U\) to the control function \(f^C\)

Rule-based artificial randomized trial data The observation is simplified to a two-dimensional vector, and the treatment is binary (0 or 1). We first sample individual units randomly from the observation space, and then randomly assign each unit to 0 for control or 1 for treatment. Based on the observation and the treatment action, we generate the simulation data by the following rule-based outcome functions.

Denote the observation as \((x_1,~x_2)\), and constrain \(x_1, x_2\) between −1 and 1. Figure 6 illustrates the three function spaces. The treated function \(f^T\) is represented as \(f^T = f^C + f^U\). The controlled function \(f^C\) is defined as a hemispherical surface with radius 1 above the XOY plane. It can be formulated as

$$\begin{aligned} f^C = \max \left( 0, \sqrt{1-x_1^2-x_2^2}\right) . \end{aligned}$$

The uplift function, \(f^U\), is defined as a weighted combination of two two-dimensional Gaussian functions. The formulation is

$$\begin{aligned} f^U = \frac{3}{4}\left( \frac{\exp \left( -\frac{1}{2} \left( {\mathbf {x}} - \mathbf {\mu _1}\right) ^T \varSigma ^{-1} \left( {\mathbf {x}} - \mathbf {\mu _1}\right) \right) }{2\pi \sqrt{\left| \varSigma \right| }}\right) - \frac{1}{2}\left( \frac{\exp \left( -\frac{1}{2} \left( {\mathbf {x}} - \mathbf {\mu _2}\right) ^T \varSigma ^{-1} \left( {\mathbf {x}} - \mathbf {\mu _2}\right) \right) }{2\pi \sqrt{\left| \varSigma \right| }}\right) , \end{aligned}$$

where

$$\begin{aligned} \mu _1 = \begin{pmatrix} \tfrac{1}{3} \\ \tfrac{1}{3} \end{pmatrix} ,~ \mu _2 = \begin{pmatrix} -\tfrac{1}{3} \\ -\tfrac{1}{3} \end{pmatrix} ,~ \varSigma = \begin{pmatrix} \tfrac{1}{16} &{}\quad 0 \\ 0 &{}\quad \tfrac{1}{16} \end{pmatrix} . \end{aligned}$$
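These rules translate directly into a data-generating script. The sketch below (plain numpy, with an illustrative sample size) reproduces \(f^C\), \(f^U\) and the randomized treatment assignment used to build the synthetic trial data:

```python
import numpy as np

rng = np.random.default_rng(0)

def f_control(x):
    """Hemispherical control outcome f^C over the unit disc."""
    return np.maximum(0.0, np.sqrt(np.clip(1.0 - x[:, 0]**2 - x[:, 1]**2, 0.0, None)))

def f_uplift(x):
    """Weighted combination of two 2-D Gaussians, f^U."""
    def gauss(z, mu, var):
        d = z - mu
        return np.exp(-0.5 * np.sum(d * d, axis=1) / var) / (2.0 * np.pi * var)
    return (0.75 * gauss(x, np.array([1/3, 1/3]), 1/16)
            - 0.5 * gauss(x, np.array([-1/3, -1/3]), 1/16))

# randomized trial: sample observations in [-1, 1]^2 and assign treatment at random
n = 10000
x = rng.uniform(-1.0, 1.0, size=(n, 2))
t = rng.integers(0, 2, size=n)            # 0 = control, 1 = treatment
y = f_control(x) + t * f_uplift(x)        # f^T = f^C + f^U for treated units
```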

Results Uplift evaluation differs drastically from traditional machine learning model evaluation, because the ground-truth uplift is not observable. Here, we use the Qini curve/coefficient (Radcliffe 2007) and \(Q^{TO}\) (Athey and Imbens 2015) to evaluate uplift models under the binary treatment setting. The Qini coefficient is an indicator that measures the ranking performance of the causal effect estimated by a model: the larger the Qini coefficient, the better the performance. \(Q^{TO}\) is a measure similar to MSE in supervised learning, obtained by exploiting the transformation of the potential outcome: the smaller \(Q^{TO}\), the better the performance. A detailed introduction to the two uplift evaluation metrics can be found in “Appendix A.4”.

The Qini curves of the three models are shown in Fig. 7, which demonstrate the quality of the uplift ranking inferred by the causal models. Although the rule-based setting is simple, the DUIN model significantly outperforms the other models on both metrics. The area under the uplift curve of the DUIN model is significantly larger than those of the S-Learner and T-Learner methods, and its curve almost coincides with the Optimal one.

Fig. 7

Qini curves of three models evaluated on testing dataset: S-Learner, T-Learner and DUIN. Besides, the ground truth, as the Optimal model, and the random baseline are also plotted in this figure

The quantitative metrics are shown in Table 1. The Qini coefficient is the area between the Qini curve and the random curve, and \(Q^{TO}\) is a measure similar to MSE in supervised learning. Consistent with the Qini curves, the DUIN model has a larger Qini coefficient and a smaller \(Q^{TO}\) than the other models. Furthermore, the gap between the DUIN model and the ground truth is very small, which indicates the strong causality of the DUIN model.

Table 1 Comparison of the Qini coefficient and \(Q^{TO}\) of three models

The uplift function spaces learned by the three models are shown in Fig. 8. The uplift function space inferred by DUIN is very close to the defined one shown in Fig. 6, while those learned by the S-Learner and T-Learner approaches deviate severely and are not smooth, which indicates a higher variance. These results further demonstrate the ability of the DUIN model to infer the uplift function precisely and smoothly.

Fig. 8

Uplift function spaces learned by three different methods: S-Learner, T-Learner and DUIN-Model

6.1.2 Artificial environment for POMEE-UI

We hand-craft an artificial environment with deterministic rules, consisting of the artificial platform policy \(\pi _p\), the artificial driver policy \(\pi _d\), and the artificial hidden policy \(\pi _h\). As before, all function rules and parameter values are designed to mimic the real-world environment. We use POMEE and POMEE-UI to learn the policies and compare them with the real ones. Additionally, we run the MAIL and MAIL-UI methods, which do not model hidden variables, as a comparison.

Description of the artificial environment Similar to the interaction in the driver program recommendation, we define a triple-agent environment to simulate a partially observable Markov decision process (POMDP). A schematic drawing of this toy experiment is shown in Fig. 9. In this POMDP, the key variable v (denoting the driver’s response) is affected by all three policies at each time step. The policy \(\pi _d\) imposes an intrinsic evolution trend on the variable v with a period of 7 time steps, as defined in Eq. (22). The policy \(\pi _p\) has a positive effect on v if the value of v is below the green line, and no effect otherwise. Conversely, the policy \(\pi _h\) has a negative effect on v if the value of v is above the blue line, and no effect otherwise. The green and blue lines can be seen as the thresholds at which \(\pi _p\) and \(\pi _h\) take effect on the evolution of v. Here the policy \(\pi _h\) plays the role of the hidden variables in this environment, whose effect on the interaction is not observed.

Fig. 9

Schematic drawing of the interaction in the toy environment: t represents the time step and v is a variable affected by all three policies. TP and TH are the thresholds for the policies taking effect, and V(t) describes the intrinsic evolution trend of the artificial driver policy \(\pi _d\)

POMDP definition All the hyperparameters in the following rule-based functions are selected randomly from an appropriate range of values.

The observation o is a tuple (tw, r, v), in which \(tw \in \{1, 2, \ldots , 7\}\) is the time step in one period, r is a static factor used to differentiate the effect of each agent, and v is the key variable in the interaction process. The initial value \(v_0\) is sampled from a uniform distribution \(U(9-wave, 9+wave), wave = 1.2\), where wave denotes the sampling range of \(v_0\). We add the static factor \(r = 1-0.5\times \frac{v_0-9}{wave}\) to the state to make the episodes generated by this setting more diverse.

The action is defined as the output of the deterministic policy. The thresholds of the green line TP and the blue line TH are 10 and 8, respectively. We define the deterministic policy rule of each agent as follows:

$$\begin{aligned}&a_p = \pi _p(tw, r, v) = \max \left(0, \min \left(1, r\times \left(TP - v \right) \times \frac{tw}{7} \right) \right) , \end{aligned}$$
(19)
$$\begin{aligned}&\quad a_h = \pi _h(tw, r, v, a_p) = \max \left(-1, \min \left(0, r\times \left(TH - v - \frac{a_p}{2}\right) \times \frac{tw}{7} \right) \right) ,\end{aligned}$$
(20)
$$\begin{aligned}&\quad a_d = \pi _d(tw, r, v, a_p, a_h) = \varDelta V(tw) + a_p +a_h . \end{aligned}$$
(21)

where

$$\begin{aligned} \varDelta V(tw) = {\left\{ \begin{array}{ll} 1 &{} \text {if tw = 5;}\\ -1 &{} \text {else if tw = 7;}\\ 0 &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
(22)

The transition dynamics is simply defined as \(v_{t+1} = v_t + a_d^t\); r is a constant once initialized, and tw is a timestamp indicator cycling through the sequence \(\left[ 1, 2, \ldots , 7 \right]\). In this experiment, we set the trajectory length T to 8.

By running the defined rules in the toy environment, we collect many episodes as training data \(D_{real} = \left\{\left(o_p^0, a_p^0, a_d^0, o_p^1, \ldots , o_p^T\right)\right\}\). By randomly sampling from the observation space and the platform action space, we generate a randomized trial dataset \(D_{rand} = \left\{\left(o_p, a_p, a_d\right)\right\}\). Based on these two datasets, we can run the comparative algorithms to verify the effectiveness of POMEE-UI.
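The rules of Eqs. (19)–(22) and the transition dynamics can be transcribed directly; the sketch below indicates how such training episodes could be generated, with the hidden action dropped from the recorded triple to mirror the partially-observed setting (the episode format here is a simplified stand-in for \(D_{real}\)):

```python
import random

rng = random.Random(0)
TP, TH, WAVE, T_LEN = 10.0, 8.0, 1.2, 8

def delta_v(tw):                       # Eq. (22): intrinsic trend of pi_d over the 7-step period
    return 1.0 if tw == 5 else (-1.0 if tw == 7 else 0.0)

def pi_p(tw, r, v):                    # Eq. (19): platform policy, pushes v up when v < TP
    return max(0.0, min(1.0, r * (TP - v) * tw / 7.0))

def pi_h(tw, r, v, a_p):               # Eq. (20): hidden policy, pulls v down when v is high
    return max(-1.0, min(0.0, r * (TH - v - a_p / 2.0) * tw / 7.0))

def pi_d(tw, r, v, a_p, a_h):          # Eq. (21): driver response
    return delta_v(tw) + a_p + a_h

def generate_episode():
    v = 9.0 + rng.uniform(-WAVE, WAVE)           # initial value v_0
    r = 1.0 - 0.5 * (v - 9.0) / WAVE             # static factor
    episode, tw = [], 1
    for _ in range(T_LEN):
        a_p = pi_p(tw, r, v)
        a_h = pi_h(tw, r, v, a_p)                # hidden action: never recorded
        a_d = pi_d(tw, r, v, a_p, a_h)
        episode.append(((tw, r, v), a_p, a_d))   # only the observable triple is kept
        v = v + a_d                              # transition: v_{t+1} = v_t + a_d^t
        tw = tw % 7 + 1                          # tw cycles through 1..7
    return episode
```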

Implementation details We train four methods on this artificial environment: POMEE, POMEE-UI, MAIL and MAIL-UI. The main difference between the POMEE-type and MAIL-type methods is that MAIL and MAIL-UI do not model a hidden policy. The main change of MAIL-UI and POMEE-UI with respect to MAIL and POMEE is that the environment policy is implemented with the DUIN structure and the training process follows Algorithm 3. We aim to compare the similarity between the generated policies and the defined rules.

In detail, each policy or module is embodied by a neural network with 2 hidden layers and combined sequentially into a joint policy network as illustrated in Fig. 2. Each hidden layer has 64 neurons activated by \(\tanh\) functions. To keep the model complexity comparable, the joint policy networks of the four methods have the same number of hidden layers. The discriminator network adopts the same structure as each policy network. Unlike standard GAN training, we perform \(K = 3\) generator steps per discriminator step, and sample \(N = 200\) trajectories per generator step. The details of the training process are described in the previous sections.
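As a rough sketch of the network structure just described (not the authors' exact code), one policy module with 2 hidden layers of 64 tanh units could be written in PyTorch as follows; the class name and the way the modules are chained are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PolicyModule(nn.Module):
    """One policy/module network: 2 hidden layers with 64 tanh units each."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, act_dim),
        )

    def forward(self, obs):
        return self.net(obs)

# Modules are composed sequentially into a joint policy (cf. Fig. 2):
# each module receives the observation plus the actions produced so far.
pi_p = PolicyModule(obs_dim=3, act_dim=1)       # observes (tw, r, v)
pi_d = PolicyModule(obs_dim=3 + 1, act_dim=1)   # additionally receives a_p
```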

Results The generated policy functions trained by these four methods are shown in Fig. 10. First, from the perspective of the two observable policies, the policy function maps of \(\pi _p\) and \(\pi _d\) produced by the POMEE-type methods are both more similar to the real function spaces than those produced by the MAIL-type methods, as shown in Fig. 10a, b. The MAIL-type methods produce sharp local distortions when r is large. We believe this is because the hidden variables have a greater impact on the interaction as r increases, so the unobservable bias becomes too large to be neglected.

Fig. 10 Visualization and comparison of policy functions, with \(r=1.3\). More visualizations with various r values are presented in Appendix A.4

Additionally, compared with the basic methods MAIL and POMEE, the MAIL-UI and POMEE-UI methods restore the policy function spaces more realistically. In particular, MAIL-UI significantly alleviates the distortion in the policy function space learned by MAIL, which suggests that implementing the environment policy with a causal DUIN model can alleviate the hidden bias to some extent during learning.

We then further compare the similarity between the hidden policies generated by the POMEE-type methods and the true policy \(\pi _h\). As shown in Fig. 10c, the generated hidden policies capture the threshold effects well and roughly match the real function map, even though this is difficult when the variables are fully unobservable. Similar to the results for \(\pi _p\) and \(\pi _d\), the hidden policy learned by POMEE-UI is closer to the real policy \(\pi _h\) than that learned by POMEE. These results show the potential of using observational data to infer the hidden effect model.

6.2 Experiments on real-world applications

Similar to the toy experiments in the previous subsection, the experiments on real-world application data proceed in three steps: First, we evaluate the learning performance of the DUIN model on the randomized trial data collected from the real application system. Then, we apply POMEE-UI and several comparative methods to the real-world application data, and evaluate their simulation and policy-optimization performance. Finally, we deploy a recommender policy online, optimized in the POMEE-based environment, and report the results of an online A/B test at the end.

6.2.1 DUIN on real-world data

We apply DUIN to the real-world randomized trial dataset collected from the real-world recommender system. The dataset contains 1.16 million recommendation record samples. Although such a large dataset can unleash the power of deep models, it contains considerable noise, and substantial randomness lies behind the observed outcomes. Inferring the uplift effect from such real-world data therefore remains very challenging.

As a comparison, we run the Causal Forest method (Wager and Athey 2018), a popular algorithm for uplift modeling in observational studies, on this real-world dataset. The Qini-Coefficient and \(Q^{TO}\) are used to evaluate the performance of the two models.
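For reference, the Qini-Coefficient is an area-based metric for ranking quality in uplift modeling. The sketch below shows one common, simplified way to compute it (not necessarily the exact variant used in the paper, and omitting \(Q^{TO}\)).

```python
import numpy as np

def qini_curve(uplift_pred, outcome, treated):
    """Cumulative incremental gain when ranking samples by predicted uplift.
    `treated` is a 0/1 indicator array; `outcome` is the observed response."""
    order = np.argsort(-uplift_pred)
    y, w = outcome[order], treated[order]
    cum_t_out = np.cumsum(y * w)             # outcomes among treated so far
    cum_c_out = np.cumsum(y * (1 - w))       # outcomes among control so far
    cum_t, cum_c = np.cumsum(w), np.cumsum(1 - w)
    # Incremental gain: treated outcomes minus rescaled control outcomes
    return cum_t_out - cum_c_out * cum_t / np.maximum(cum_c, 1)

def qini_coefficient(uplift_pred, outcome, treated):
    gain = qini_curve(uplift_pred, outcome, treated)
    random_gain = gain[-1] * np.arange(1, len(gain) + 1) / len(gain)
    return np.trapz(gain - random_gain) / len(gain)   # area between curve and random baseline
```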

Implementation details When training the DUIN model, we find that the frequency of alternate optimization, i.e., the number of learning steps K in one alternate round in Algorithm 2, can affect the model performance to some extent. The model trained with \(K = 5\) achieves better performance and stability than that trained with \(K=1\). We believe that a lower frequency of alternation, i.e., a larger value of K, helps the model eliminate the influence of noise and randomness.
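Algorithm 2 itself is not reproduced here; the snippet below only illustrates the generic alternating-optimization pattern that the hyperparameter K controls (K gradient steps on one component before switching to the other), under the assumption of two PyTorch optimizers and loss closures.

```python
def alternate_train(opt_a, opt_b, loss_a_fn, loss_b_fn, rounds, K=5):
    # Generic alternating optimization: K learning steps per component per round.
    for _ in range(rounds):
        for _ in range(K):
            opt_a.zero_grad()
            loss_a_fn().backward()
            opt_a.step()
        for _ in range(K):
            opt_b.zero_grad()
            loss_b_fn().backward()
            opt_b.step()
```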

Results The Qini curves of the Causal Forest model and the DUIN model are shown in Fig. 11. The Causal Forest model trained on the real-world dataset yields only a very small improvement over the random model. The DUIN model performs better overall, despite weaker performance in the middle part of the curve.

Fig. 11 Qini curves of the two models evaluated on the testing dataset: the Causal Forest model and the DUIN model. The Qini curve of the random model is also plotted as a baseline for comparison

The values of the Qini-Coefficient and \(Q^{TO}\) metrics are listed in Table 2. The Qini-Coefficient of the DUIN model is larger than that of the Causal Forest model, indicating a better ability to rank uplift. The \(Q^{TO}\) of the DUIN model is smaller than that of the Causal Forest model, indicating a lower estimation error of the uplift value. These results further demonstrate the ability of the DUIN model to infer the uplift effect.

Table 2 Comparison of Qini-Coefficients and \(Q^{TO}\) on real-world data by two uplift models: the Causal Forest model and the DUIN model

6.2.2 Real-world experiment for POMEE-UI

In this part, we apply POMEE-UI to a real-world application of driver program recommendation, as introduced in Sect. 5.1. We first use historical data to build different virtual environments with six comparative methods. We then evaluate these environments with various statistical measures. Finally, we train different recommender policies in these environments with the same training method, and evaluate these policies in offline and online environments. Specifically, we include six methods in our comparison:

  • SUP Supervised learning of the driver policy from historical state-action pairs, i.e., behavioural cloning;

  • GAIL GAIL applied to learn the driver policy, treating the historical record of program recommendation as a static environment;

  • MAIL Multi-agent adversarial imitation learning, without modeling the hidden variables;

  • MAIL-UI A MAIL-type method in which the environment policy is implemented with the DUIN structure; its main difference from POMEE-UI is that it does not model the hidden variables, just as MAIL compared to POMEE;

  • POMEE The proposed method described in Algorithm 1;

  • POMEE-UI The proposed method described in Algorithm 3.

We evaluate the models by different statistical metrics.

Log-likelihood of real data on models We evaluate the learned policy distributions of the six models by the mean log-likelihood (MLL) of real state-action pairs on both the training set and the testing set. As shown in Table 3, the models trained by the POMEE-type methods achieve the highest mean log-likelihood on both data sets. Since the evaluation is performed on individual state-action pairs, the behavioural cloning method SUP achieves better performance than the MAIL-type methods. Meanwhile, the POMEE-type methods improve significantly over the MAIL-type methods, which indicates the positive influence of our hidden-variable setting.

Table 3 Comparison of mean log-likelihood by six different methods on the real-world test set
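As a note on how the MLL metric can be computed, the following sketch assumes a stochastic policy with a Gaussian action head; the actual policy parameterization in the paper may differ.

```python
import torch

def mean_log_likelihood(policy, states, actions):
    # Mean log-likelihood of real state-action pairs under a learned policy.
    # Assumes `policy(states)` returns the mean and log-std of a Gaussian.
    mean, log_std = policy(states)
    dist = torch.distributions.Normal(mean, log_std.exp())
    return dist.log_prob(actions).sum(dim=-1).mean().item()
```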

Correlation of key factor trends Another important measurement of generalization performance is the trend of drivers’ responses. We use the trend lines of two indicators to compare the different simulators: the number of Finished Orders (FOs) and the Total Driver Incomes (TDIs). As above, we apply each simulator to subsequent testing data and simulate the trends of FOs and TDIs. We then calculate the Pearson correlation coefficient (PCC) between each simulated trend line and the real one. As shown in Table 4, the simulated trend lines of the two indicators by POMEE and MAIL achieve high correlations with the real ones, with Pearson correlation coefficients of approximately 0.8, while SUP and GAIL, trained directly on static data, perform worse in this evaluation. Although the PCC of the MAIL-UI and POMEE-UI methods is not the highest, these two methods still perform decently on this metric.

Table 4 Comparison of Pearson correlation coefficients on FOs and TDIs trend lines by six different methods
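The PCC computation itself is straightforward; the sketch below assumes the trend lines are per-day means of an indicator such as FOs (the aggregation granularity is our assumption, not stated in the paper).

```python
import numpy as np

def daily_trend(values, day_index):
    # Aggregate per-record values into a per-day mean trend line
    return np.array([values[day_index == d].mean() for d in np.unique(day_index)])

def trend_pcc(sim_values, real_values, day_index):
    # Pearson correlation coefficient between simulated and real trend lines
    sim_trend = daily_trend(sim_values, day_index)
    real_trend = daily_trend(real_values, day_index)
    return np.corrcoef(sim_trend, real_trend)[0, 1]
```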

Distribution of driver response To further compare the generalization performance of the models, we apply the built simulators to subsequent program recommendation records. We simulate the drivers’ responses using the real program records on the testing data, then compare the simulated distribution of drivers’ responses with the real distribution, using FOs as the indicator. Figure 12 shows the error of the FOs distributions simulated by the six simulators. The distributions simulated by SUP and GAIL are clearly biased when FOs are low. The reason is that these two methods use static real data directly to build the simulators, which limits their generalization, and lower FOs, especially zero, imply higher uncertainty. The FOs distribution produced by POMEE is closer to the real one than that produced by MAIL, where the hidden-variable setting explicitly makes a difference; the same holds for POMEE-UI versus MAIL-UI. In addition, the FOs distributions produced by MAIL-UI and POMEE-UI are respectively more realistic than those produced by MAIL and POMEE, which also shows the effect of the DUIN structure.

Fig. 12 Error of the FOs distribution generated by six different methods on testing data. The Y-axis is the error of the FOs distribution between the simulation and the real data. The original FOs distribution is presented in Appendix C
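A simple way to produce the kind of per-value distribution error plotted in Fig. 12 is sketched below; the exact binning used in the figure is not specified, so integer FOs bins are assumed.

```python
import numpy as np

def fos_distribution_error(sim_fos, real_fos):
    # Signed per-value error between simulated and real FOs distributions,
    # assuming FOs are binned by integer count.
    max_fo = int(max(sim_fos.max(), real_fos.max()))
    bins = np.arange(max_fo + 2)
    sim_hist, _ = np.histogram(sim_fos, bins=bins, density=True)
    real_hist, _ = np.histogram(real_fos, bins=bins, density=True)
    return sim_hist - real_hist
```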

Policy evaluation results in offline environments In this part, we evaluate the effect of the different simulators on policy optimization. First, we use the policy gradient method TRPO (Schulman et al. 2015) to optimize a recommender policy in each simulator. Then, using the testing data, we build four virtual environments for policy evaluation, named EvalEnv-MAIL, EvalEnv-MAIL-UI, EvalEnv-POMEE and EvalEnv-POMEE-UI respectively. Given these four environments, we execute the optimized policies under a constrained budget and compare the improvement of mean FOs. We expect that a simulator built by the SUP or GAIL method, being trained on static data, would produce a policy that performs poorly in the real environment.
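The offline evaluation step can be pictured as rolling out each optimized policy in every evaluation environment and averaging the obtained FOs. The gym-style interface below is an assumption, and the budget constraint is omitted for brevity.

```python
def evaluate_mean_fos(policy, eval_env, n_episodes=100):
    # Average per-episode FOs of one policy in one evaluation environment.
    total = 0.0
    for _ in range(n_episodes):
        obs, done = eval_env.reset(), False
        while not done:
            obs, reward, done, _ = eval_env.step(policy(obs))
            total += reward                 # reward interpreted as FOs here
    return total / n_episodes
```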

As shown in Fig. 13, the policy \(\pi _{POMEE-UI}\) optimized in the simulator built by POMEE-UI achieves the best performance in all environments, while the policies \(\pi _{SUP}\) and \(\pi _{GAIL}\) perform poorly in these environments. The improvement of \(\pi _{POMEE}\) over \(\pi _{MAIL}\) further verifies that training in a virtual environment with hidden variables can bring better performance to traditional reinforcement learning. Compared with MAIL and POMEE, the improvements of MAIL-UI and POMEE-UI demonstrate that an uplift model, used as the reward function in a simulator, can improve policy optimization more than a response model. Additionally, the policies \(\pi _{SUP}\) and \(\pi _{GAIL}\) show a significant degradation in EvalEnv-POMEE and EvalEnv-POMEE-UI that does not appear in EvalEnv-MAIL and EvalEnv-MAIL-UI, which also indicates that an environment built with hidden-variable modeling recovers the real environment more precisely.

Fig. 13 Comparison of the performance of different policies trained in different simulators, evaluated in four evaluation environments: EvalEnv-MAIL, EvalEnv-MAIL-UI, EvalEnv-POMEE and EvalEnv-POMEE-UI. The Y-axis is the mean FOs obtained by executing the different policies. The Data default is the mean FOs in the real testing data. The Simulated default is the mean FOs of the original simulation in each evaluation environment

Fig. 14 Illustration of the uplift value under different observation types. The vertical direction represents the potential outcome when treatment is applied, and the horizontal direction represents the potential outcome under control

Policy evaluation results in online A/B tests We further conduct online A/B tests to evaluate the effect of the policy \(\pi _{POMEE}\). The online tests are conducted in three cities of different scales. The drivers in each city are divided randomly into two groups of equal size, namely the control group and the treatment group. The programs for the drivers in the control group are recommended by an existing recommendation policy, which serves as the baseline; the drivers in the treatment group receive recommendations from \(\pi _{POMEE}\). The results of the online A/B tests are shown in Table 5. The policy \(\pi _{POMEE}\), optimized in the simulator built by the proposed POMEE method, achieves significant improvements on FOs and TDIs in all three cities, with overall improvements of 11.74% and 8.71%, respectively.

Table 5 Results of online A/B tests on the platform of Didi Chuxing
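The improvements reported in Table 5 are relative lifts of the treatment group over the control group; a generic way to compute such a lift is shown below (the exact aggregation across cities is not specified in the paper, so a pooled mean is assumed).

```python
import numpy as np

def relative_improvement(treatment_values, control_values):
    # Relative improvement (%) of the treatment group mean over the control group mean
    return 100.0 * (np.mean(treatment_values) / np.mean(control_values) - 1.0)
```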

7 Conclusion

This paper explores how to estimate a partially observable environment with uplift inference from past data. We first propose the POMEE method following the generative adversarial training framework: we design the partially observed environment model as a key part of the generator and make the discriminator compatible with two different classification tasks, so as to guide the imitation of each policy precisely. To build a causal reward function in the virtual environment, we then propose a novel DUIN model to learn the uplift effect of each action. By implementing the environment policy with the DUIN structure, we obtain the POMEE-UI approach, which estimates the partially observable environment with an uplift inference module. Further, we apply POMEE-UI to build a virtual environment of the driver program recommendation system on a large-scale ride-hailing platform, which is a highly dynamic and partially observable environment. Experimental results verify that the policies generated by POMEE-UI are very similar to the real ones and generalize better in various aspects. Furthermore, the simulators built by the POMEE-type methods produce better policies with common RL training methods. It is worth noting that the proposed POMEE-UI method can be used not only in this task, but also in many other real-world partially observable environments.