1 Introduction

Users expect varied outcomes from their web experiences. Enterprises aim to create digital experiences that not only cater to user intent but also improve their own business metrics. Given the variety of content, manual creation of customized and adaptive experiences is infeasible. Session-based future path prediction is therefore necessary to understand user needs, pre-fetch future content, and adapt future experiences. Users’ content creation and consumption patterns define their intent and needs, and their web tracks, i.e., the paths users take during their web journeys, are an essential ingredient for defining their interests and goals. In this work, we aim to build user intent models that combine consumption patterns with website footprints to predict the potential user path and content needs.

Extensive studies have been conducted in the related space of recommender systems using traditional [2], deep-learning [6, 7], and reinforcement-learning [8] based techniques on both historic user-item interactions and session behavior. While the webpages visited earlier in a session capture a user’s local preferences, this work shows that instantaneous global content preferences can further assist in understanding the future behavior of users. We describe one such scenario in Fig. 1a. Specifically, we present a Deep Reinforcement Learning (RL) System based on Local and Global preferences (DRS-LaG). Given the content and analytics of the webpages visited so far in a user session, our agent predicts the future preferences of the user. The model is trained on offline logs of a sports news website. Through offline evaluations, we show how the proposed model can be used to predict the next page the user will visit. Our online evaluation shows how the predictions can be used to adapt the future experiences of users. RL allows our system to handle the dynamic user preferences of the news domain, while also incorporating expected future rewards when deployed in an online environment.

Fig. 1. Internal workings of the proposed DRS-LaG framework.

2 DRS-LaG: Proposed Framework

Problem Formulation: We define an agent that models a user’s session-level behavior to predict the next webpage the user visits, based on the content and instantaneous analytics of the webpages visited in the current session. Since the predictions capture user preferences, they can then be recommended to the user or used to adapt future webpage experiences. At each timestep, the user (environment) provides feedback on the actions taken by the agent in the form of rewards. The agent is trained on offline session-level logs extracted from a sports news website. We illustrate this setup in Fig. 1b.

The task is modeled as a Markov Decision Process (MDP) with the tuple (\(\mathcal {S}\), \(\mathcal {A}\), \(\mathcal {P}\), \(\mathcal {R}\), \(\gamma \)): (1) State space \(\mathcal {S}\): captures the current local and global content preferences, (2) Action space \(\mathcal {A}\): the set of all webpages, (3) Transition probabilities \(\mathcal {P}\): the probability \(p(s'|s,a)\) of moving to state \(s'\) by taking action a in state s, (4) Rewards \(\mathcal {R}\): the feedback received by the agent after taking a particular action, and (5) \(\gamma \): the discount factor for future rewards in the current user session. The goal is to learn a policy \(\pi : \mathcal {S} \rightarrow \mathcal {A}\) that maximizes the cumulative reward of the system.
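To make the objective concrete, the sketch below computes the discounted return that the policy is trained to maximize over a single session; the per-step rewards and the value of \(\gamma\) are placeholders, not values from the paper.

```python
# Minimal sketch: discounted cumulative reward for one user session.
# `rewards` is a hypothetical list of per-step rewards r(s_t, a_t); gamma is the discount factor.
def discounted_return(rewards, gamma=0.95):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # G_t = r_t + gamma * G_{t+1}
    return g

print(discounted_return([3.0, 0.0, 3.0], gamma=0.95))  # example session with three steps
```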

To deal with the dynamic action space, we use a model-free Deep Q-Learning approach. Figure 1c shows our architecture. Given a state-action pair, the network outputs the corresponding Q-value \(Q(s,a)\). The optimal Q-value \(Q^{*}(s,a)\) should follow the Bellman equation [1]: \(Q^{*}(s,a) = E_{s'}[r + \gamma \max_{a'} Q^{*}(s',a') \mid s,a]\), where r is the corresponding reward for the given state-action pair.
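As a rough illustration of the architecture in Fig. 1c and the Bellman target, the sketch below scores a concatenated state-action vector with a small feed-forward head. The layer sizes, the use of Keras, and the state/action dimensions are our own assumptions, not the paper’s exact configuration.

```python
import tensorflow as tf

# Hypothetical dimensions for the state and action vectors produced by the encoders described below.
STATE_DIM, ACTION_DIM = 128, 562

# Q-network: maps a concatenated (state, action) vector to a scalar Q(s, a).
q_net = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(STATE_DIM + ACTION_DIM,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(1),
])

def bellman_target(reward, next_state, candidate_actions, q_target, gamma=0.95):
    """r + gamma * max_a' Q_target(s', a'), with the max taken over a pool of candidate actions."""
    s = tf.repeat(next_state[None, :], tf.shape(candidate_actions)[0], axis=0)
    q_vals = q_target(tf.concat([s, candidate_actions], axis=1))
    return reward + gamma * tf.reduce_max(q_vals)
```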

Actions: Representing Webpages. The agent actions correspond to various webpages or URLs on the given website. Given the current state of a user, the agent returns a set of plausible webpages, using both the local and global content preferences. We hence represent webpages using both the content and the corresponding instantaneous analytics.

Webpage Content: The webpage text content is represented using the Universal Sentence Encoder [3]. We leverage the pre-trained model from TensorFlow Hub, which returns a \(d_{C}=512\) dimensional representation for a given input.
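A minimal sketch of obtaining the 512-d content embedding via TensorFlow Hub; the specific module handle/version is an assumption, and the page texts are hypothetical.

```python
import tensorflow_hub as hub

# Assumed module handle; any Universal Sentence Encoder variant returning 512-d vectors would do.
use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

page_texts = ["Live scores and match report ...", "Transfer news roundup ..."]  # hypothetical page text
content_embeddings = use(page_texts)  # shape: (num_pages, 512)
```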

Instantaneous Webpage Analytics: Incorporating analytics allows the agent to better predict the future content preferences of the users, while also catering to business objectives. We divide the time scale into fixed-sized intervals. Consider a set of k analytics KPIs, such as the number of views and the number of exits. During training and subsequent testing, we track the KPIs for all webpages seen so far. The analytics representation is obtained by combining the values of the most recent \(d_{A}\) time intervals for each of the k KPIs, resulting in a \(d_Ak\)-dimensional vector. The final representation of an action is computed by concatenating the content and analytics representations of the webpage, yielding a \((d_C+d_Ak)\)-dimensional vector.
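The action representation is thus a simple concatenation. The sketch below assumes \(d_A=50\) recent intervals per KPI, a rolling KPI history kept per webpage, and hypothetical KPI names.

```python
import numpy as np

D_A = 50  # number of most recent time intervals kept per KPI (assumed)

def action_representation(content_vec, kpi_history):
    """content_vec: (512,) USE embedding; kpi_history: dict mapping KPI name -> per-interval values.
    Assumes at least D_A values are available for each KPI (padding omitted for brevity)."""
    analytics = np.concatenate([np.asarray(v[-D_A:], dtype=np.float32) for v in kpi_history.values()])
    return np.concatenate([content_vec, analytics])  # shape: (512 + D_A * k,)

# Hypothetical usage with k = 2 KPIs.
vec = action_representation(np.zeros(512, dtype=np.float32),
                            {"views": [4, 7, 9] * 20, "exits": [1, 0, 2] * 20})
```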

States: Historic Action Sequence. The current state must capture the session-level preference of the users. Hence, we aggregate the representations of all the historically visited webpages in the current session to define the state of the user. DRS-LaG uses two LSTM networks to combine the historic content and action analytics.
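A rough sketch of such a two-LSTM state encoder: one LSTM runs over the content embeddings of the pages visited so far and one over their analytics vectors, and the final hidden states are concatenated. The hidden sizes and the Keras formulation are our own assumptions.

```python
import tensorflow as tf

D_C, D_ANALYTICS = 512, 50  # per-page content and analytics dimensions, as assumed above

content_seq = tf.keras.layers.Input(shape=(None, D_C))            # content embeddings of visited pages
analytics_seq = tf.keras.layers.Input(shape=(None, D_ANALYTICS))  # analytics vectors of visited pages

h_content = tf.keras.layers.LSTM(64)(content_seq)
h_analytics = tf.keras.layers.LSTM(64)(analytics_seq)
state = tf.keras.layers.Concatenate()([h_content, h_analytics])   # session-level state representation

state_encoder = tf.keras.Model([content_seq, analytics_seq], state)
```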

Defining Rewards. At each timestep, the agent receives a reward from the user, based on the action chosen in the given state. The complete reward for a given state-action pair \(r(s,a)\) combines the prediction and instantaneous analytics rewards: \(r(s,a) = r_P(s,a) + \sum_{i=1}^{k} r_A^i(a)\), where \(r_P(s,a)\) is the prediction reward, indicating whether the corresponding webpage was visited by the user in the offline data logs, and \(r_A^i(a)\) is the instantaneous analytics reward of action a with respect to KPI i.
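A small sketch of this reward, using the values reported in Sect. 3 (prediction reward of 3 for a correct prediction, analytics reward equal to the KPI change over recent intervals); any scaling beyond what is stated is our assumption.

```python
def reward(predicted_correct, kpi_deltas, r_correct=3.0):
    """r(s, a) = r_P(s, a) + sum_i r_A^i(a).

    predicted_correct: whether the chosen webpage matches the one in the offline log.
    kpi_deltas: per-KPI change in value over the recent intervals for the chosen webpage.
    """
    r_p = r_correct if predicted_correct else 0.0
    return r_p + sum(kpi_deltas)

reward(True, [12.0])   # correct prediction, views increased by 12 over the window
```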

Learning Stage. The training procedure is summarized in Algorithm 1. First, experience replay [4] and a target network [5] are used to stabilize training. Second, at each timestep, apart from considering the actual action from the data, we also sample N negative actions from the webpage pool P. This is necessary because the offline logs contain only positive samples for next-page prediction; moreover, it allows the agent to explore the instantaneous analytics values of webpages beyond those seen in the current session. Third, the model is trained using the Bellman equation. Fourth, we skip the replay-memory update for the first few webpages in every session, since the initial webpages are insufficient to capture the context; this detail is omitted from Algorithm 1 for simplicity. Finally, since the model is trained on instantaneous analytics values, we update both the webpage pool and the analytics values after each episode. A simplified sketch of this training loop is shown below.
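This sketch reuses the q_net, bellman_target, and reward sketches above, and assumes hypothetical helpers (a webpage pool of action vectors, a kpi_delta lookup, and a pool of candidate actions for the max in the Bellman target); it omits the Algorithm 1 details not given in the text.

```python
import random
from collections import deque

import numpy as np
import tensorflow as tf

N_NEG, GAMMA, BATCH, SYNC_EVERY = 2, 0.95, 16, 1000
replay = deque(maxlen=5000)                      # experience replay buffer [4]
target_net = tf.keras.models.clone_model(q_net)  # target network [5]
target_net.set_weights(q_net.get_weights())
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
step_count = 0

def store_transitions(state, pos_action, next_state, webpage_pool, kpi_delta):
    """Store the logged (positive) action plus N sampled negatives for one timestep."""
    negatives = random.sample(webpage_pool, N_NEG)
    for action, is_positive in [(pos_action, True)] + [(a, False) for a in negatives]:
        r = reward(is_positive, kpi_deltas=[kpi_delta(action)])  # kpi_delta: hypothetical KPI lookup
        replay.append((state, action, r, next_state))

def replay_update(candidate_actions):
    """One mini-batch Q-learning update with the Bellman target and periodic target sync."""
    global step_count
    if len(replay) < BATCH:
        return
    batch = random.sample(replay, BATCH)
    x = tf.constant(np.stack([np.concatenate([s, a]) for s, a, _, _ in batch]), dtype=tf.float32)
    y = tf.constant([r if s_next is None else
                     float(bellman_target(r, tf.constant(s_next, tf.float32), candidate_actions,
                                          target_net, GAMMA))
                     for _, _, r, s_next in batch], dtype=tf.float32)
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(y - tf.squeeze(q_net(x), axis=-1)))
    grads = tape.gradient(loss, q_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_net.trainable_variables))
    step_count += 1
    if step_count % SYNC_EVERY == 0:
        target_net.set_weights(q_net.get_weights())  # sync target network every SYNC_EVERY updates
```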

Test Stage - Offline: Given the state, the model is asked to predict the next webpage the user will visit. Keeping \(\gamma =0\), the model is trained using Algorithm 1 to incorporate only the immediate reward, as appropriate for next-page prediction. The test data is parsed in the same way as during training. At every timestep, recall is computed from the predictions of the trained model \(Q_M(s,a)\) and the actual action in the offline logs.

Online: We also evaluate our framework in an online simulated environment. Given the complexity of setting up an online evaluation, we follow prior work [8] and use a framework that effectively simulates the real-time environment, with the capability to provide immediate feedback given a state and an action. We split our data into two halves, train this simulator on the first half, and keep the second half for training the agent. The simulator architecture is the same as in Fig. 1c and is trained to predict only the immediate feedback. The performance of our model in the offline setting attests to the fidelity of the simulated environment.
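A rough sketch of how such an online evaluation could look: at each step the agent acts greedily over the current webpage pool, the simulator returns the immediate feedback, and the session-average reward is tracked. The simulator interface (simulator.step) and the pool layout are assumptions.

```python
import numpy as np
import tensorflow as tf

def run_session(agent_q, simulator, initial_state, webpage_pool, session_length=10):
    """Roll out one simulated session and return the average per-step reward.
    webpage_pool: (num_pages, action_dim) array of action representations (assumed layout)."""
    state, total = initial_state, 0.0
    for _ in range(session_length):
        # Score every candidate webpage with the agent's Q-network and act greedily.
        s = np.repeat(state[None, :], len(webpage_pool), axis=0)
        q_vals = agent_q(tf.constant(np.concatenate([s, webpage_pool], axis=1), dtype=tf.float32))
        action = webpage_pool[int(tf.argmax(tf.squeeze(q_vals, -1)))]
        # The simulator (trained on held-out logs) provides the immediate feedback and next state.
        r, state = simulator.step(state, action)
        total += r
    return total / session_length
```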

3 Experiments

Dataset: The experiments are based on a snapshot of a sports website. The clickstream is gathered using an enterprise analytics tool deployed on the website. The data consists of 37,667 user sessions. We maintain a temporal order in the paths based on the timestamps associated with each session. The minimum path length is 3 and the maximum is 50. The data contains 1,599 unique URLs. The first 33,900 paths are kept for training, the next 1,883 paths for validation, and the last 1,884 paths for testing.

Fig. 2. Training progress and the impact of the number of negative samples for DRS-LaG.

Hyperparameters: Content representations are 512-dimensional and instantaneous analytics representations are 50-dimensional. The batch size is 16, the learning rate for the Adam optimizer is 0.01, the number of negative samples is 2, the interval size is 5 s, and the replay buffer holds 5000 transitions. The weights are copied to the target network after every 1000 replay iterations. The prediction reward is set to 3 for a correct prediction and 0 otherwise, while the analytics reward is the total change in the KPI value over the past 50 intervals. The number of views is used as the KPI for all experiments. These parameters are tuned on the validation dataset. Once tuned, the models are trained on the combined training and validation data for evaluation on the test data.
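For reference, the stated hyperparameters could be collected into a single configuration; the dictionary keys below are our own naming, the values are those reported above.

```python
# Hyperparameters as reported above (key names are our own).
CONFIG = {
    "content_dim": 512,
    "analytics_dim": 50,
    "batch_size": 16,
    "learning_rate": 0.01,          # Adam optimizer
    "num_negative_samples": 2,
    "interval_size_sec": 5,
    "replay_buffer_size": 5000,
    "target_sync_every": 1000,      # replay iterations between target-network syncs
    "prediction_reward": 3,         # 0 if the prediction is incorrect
    "analytics_reward_window": 50,  # intervals over which the KPI change is summed
    "kpi": "views",
}
```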

Training Progress: Fig. 2a visualizes the training progress of DRS-LaG. We track two metrics: (1) P Q-values: the Q-values of the actual actions taken from the data, and (2) N Q-values: the average Q-values of the negative webpages, sampled uniformly from the action pool at each timestep. As expected, the two curves of cumulative log values diverge as training proceeds.

Offline Results: We use two metrics, Recall@20 and Recall@40: the percentage of timesteps at which the correct webpage visited by the user appears in the top 20 and top 40 webpages returned by the model, respectively.
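A small sketch of how Recall@K can be computed from the model’s scores over the webpage pool; the scoring model and the data layout are assumptions.

```python
import numpy as np

def recall_at_k(scores_per_step, true_index_per_step, k=20):
    """scores_per_step: list of score arrays over the webpage pool (one per timestep);
    true_index_per_step: index of the webpage actually visited at each timestep."""
    hits = 0
    for scores, true_idx in zip(scores_per_step, true_index_per_step):
        top_k = np.argsort(scores)[::-1][:k]   # indices of the k highest-scoring webpages
        hits += int(true_idx in top_k)
    return hits / len(true_index_per_step)
```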

DRS-LaG is trained to predict only the immediate reward at every timestep by keeping \(\gamma =0\). We compare against the following baselines. Random ignores the current state and returns a random set of webpages at every timestep. Majority returns a list of the most-viewed webpages at every timestep; it can be a strong baseline in hierarchical website environments. W-Avg-c combines only the content representations of the past webpages in the current session, using an exponentially-decaying weighted average, to predict the future path. Given the dynamic nature of websites, instead of predicting a softmax over all webpages, W-Avg is trained to predict, given the historic webpages and a plausible next webpage, a score that the plausible webpage will be visited next; at test time, the model returns the webpages with the highest scores. Similarly, LSTM-c uses a Long Short-Term Memory recurrent network to capture the historic webpage content. DRS-LaG: \(r_{A}=0\) is trained with both local and global representations, like DRS-LaG, but without the analytics reward.

Table 1 shows the results. W-Avg-c performs similarly to Majority, failing to capture the local context or preferences of the users. LSTM-c shows improvements by using a recurrent network to combine the historic content visited by the user. With the capability to incorporate both local and global content preferences, DRS-LaG: \(r_A=0\) outperforms the baseline methods. Using the analytics reward \(r_A\), DRS-LaG further improves performance, attesting to the utility of our approach. We analyze the sensitivity of DRS-LaG in the offline evaluation task to the number of negative examples sampled at each timestep in Fig. 2b. If the number is too low, the model may end up learning nothing, predicting a high score for every webpage. If the number is too high, the model may treat some in-context webpages as negatives, again countering its own learning mechanism. We empirically identify the value 2 for our experiments (see Fig. 2b).

Table 1. Performance based on the offline logs for the next-page prediction task.
Fig. 3. Performance comparison on our online test based on average reward in a session.

Online Results: These experiments evaluate the model in a simulated environment, as if it were deployed to recommend webpages or adapt the future experiences of users. We observe the average reward in a session to evaluate the models, for session lengths of 5, 10, 15, and 20. To incorporate cumulative future rewards, DRS-LaG is trained with \(\gamma =0.95\). Random, Majority, and LSTM are implemented in the same manner as before. DRS-LaG-c and DRS-LaG: \(r_A=0\) are trained similarly to DRS-LaG; however, the former considers only the content (local preference) and the latter keeps \(r_A=0\). The results of our online experiments are plotted in Fig. 3. DRS-LaG-c outperforms LSTM, which is trained only to predict the immediate feedback, attesting to the utility of reinforcement learning; this observation is more evident in longer sessions. DRS-LaG: \(r_A=0\) and DRS-LaG further improve the performance.

4 Conclusion

We presented the DRS-LaG framework, with the objective of improving user web experiences while simultaneously catering to analytics KPIs. Using deep RL, our model incorporates both local and global content preferences. We showed that the proposed method effectively predicts user behavior in a dynamic web environment in both offline and online setups.