1 Introduction

Nowadays, search engines have become an indispensable tool for users to accomplish various information seeking tasks, e.g., finding particular Web pages, locating target resources, or accessing information on certain topics. However, it is often difficult for users to formulate a query that properly describes their information needs and quickly finds the right results. Most users reformulate their queries several times before their needs are satisfied. According to our analysis, more than 60% of search sessions contain query reformulation. Therefore, how to assist users in better formulating queries has become a valuable and challenging problem for modern search engines.

Fig. 1 Distribution of query reformulation on Yandex query logs

Query recommendation has been widely recognized as a key technique for alleviating users’ reformulation burden and improving the usability of search engines. A major approach to query recommendation is relevant query recommendation, which focuses on providing alternative queries similar to a user’s initial query. Collective user behaviors are usually employed to help find similar queries. For example, queries that share common terms (Wen et al. 2001), lead to clicks on the same URLs (Beeferman and Berger 2000; Li et al. 2008; Mei et al. 2008), or occur in the same search sessions (Zhang and Nasraoui 2006; Boldi et al. 2008) are often considered similar.

However, the ultimate goal of query recommendation is to help users reformulate queries so that they can acquire their desired information successfully and quickly. Merely recommending similar queries, as relevant query recommendation approaches do, does not directly serve this goal. For example, given a user’s initial query “company name original seven”, which aims to find “which company made the original seven movie”, the possible recommendations may include “company name original 7”, “original seven company” and “seven movie”. All three recommended queries are similar to the user’s initial query, but only the last one returns search results satisfying the user’s need. If the user clicks on the first two recommendations, he/she may feel frustrated since no satisfying results can be found through them.

Therefore, in this paper, we argue that it is more important and beneficial to directly recommend queries with high utility, i.e., queries that can better satisfy users’ information needs. Intuitively, a query should satisfy two conditions to be useful: (1) its search results should be attractive to the user, so that she would click the results for further inspection; (2) the clicked search results should actually satisfy the user by being relevant to her original information needs. The query utility can therefore be specified by two components. One is related to the attractiveness of the query’s search results, referred to as perceived utility. The other is related to the satisfaction derived from the clicked search results, referred to as posterior utility. The query utility is then defined as the product of the two components. We will show that under this definition, the query utility stands for the expected satisfaction users obtain from the search results of the query with respect to their original information needs.

By recommending high utility queries, we explicitly emphasize users’ post-click satisfaction, i.e., whether users’ information needs will be satisfied after clicking the recommendation. We argue that users’ post-click satisfaction reflects the true effectiveness of query recommendation. It is worth noting that although the concept of “utility” has been used in the query recommendation literature (Jain et al. 2011; Anagnostopoulos et al. 2010), those works are significantly different from ours. Both studies define a global utility function over the recommendation set, which emphasizes either the diversity (Jain et al. 2011) or the expected click-through rate (Anagnostopoulos et al. 2010) of the recommendations. Neither defines and learns query utility with respect to users’ post-click satisfaction as we do.

The central challenge in our recommendation problem is how to learn query utility with respect to users’ original information needs. The basic idea is that users’ collective search behaviors, especially their sequential reformulation and click behaviors within search sessions, embed rich information on the usefulness of queries. Therefore, we propose a novel dynamic Bayesian network, referred to as the Query Utility Model (QUM), to infer query utility by simultaneously modeling users’ reformulation and click behaviors. With the learned query utility in hand, we can recommend high utility queries that help users better accomplish their search tasks.

To evaluate the performance of our approach, we compare it with state-of-the-art query recommendation approaches on a publicly released query log. We propose two evaluation metrics, namely Query Relevant Ratio (QRR) and Mean Relevant Document (MRD), to evaluate the effectiveness of recommendations with respect to users’ post-click satisfaction. The experimental results show that, by recommending high utility queries, our approach is far more effective in helping users find relevant search results and thereby satisfy their information needs.

The main contributions of this paper are summarized as follows:

  1. We propose to recommend queries with respect to utility rather than relevance, taking into account both attractiveness and post-click satisfaction in the utility definition;

  2. We introduce a dynamic Bayesian network (i.e., QUM) which can mine query utility from users’ reformulation and click behaviors;

  3. We propose two novel metrics (i.e., QRR and MRD) to evaluate the performance of different methods in recommending high utility queries, and conduct empirical studies to demonstrate the effectiveness of our approach.

1.1 Comparison with previous publication

This article extends a preliminary study published in Zhu et al. (2012) with the following novel contributions, which allow us to give a complete picture of our algorithmic solutions:

  1. We provide more precise definitions of query utility, and correct potential flaws in the formulation of the query utility model.

  2. We provide details on the maximum likelihood learning of the query utility model to ease the reproduction of our approach.

  3. We add extensive experiments to evaluate the recommendation performance according to query difficulty, and to study the impact of the hyper-parameters of the model.

  4. We add a new section to discuss the application of our model in real-world scenarios as well as how to address new/rare queries in practice.

In the remainder of this paper, Sect. 2 reviews the related work. Section 3 presents our proposed approach and its parameter estimation in detail. In Sect. 4, we conduct empirical experiments based on a publicly available data set to evaluate the performance of the proposed approach; Sect. 4.7 discusses the application of our approach in practice. Finally, Sect. 5 concludes and discusses some ideas for future work.

2 Related work

In this section, we review two research topics relevant to our work: query recommendation and click models.

2.1 Query recommendation

Query recommendation has been employed as a key technique by modern industrial search engines. While most early query recommendation methods explore document information, query log data has been widely used in recent work. Query click-through (Beeferman and Berger 2000; Li et al. 2008; Mei et al. 2008) and search sessions (Zhang and Nasraoui 2006; Boldi et al. 2008) are among the most widely used types of information in query logs.

A major approach to query recommendation is relevant query recommendation, which focuses on providing alternative queries similar to the user’s initial query. Collective user behaviors are usually employed to help find similar queries. For example, Wen et al. (2001) attempted to find similar queries by clustering queries in query logs based on both query content information and user click-through data. Beeferman and Berger (2000) applied an agglomerative clustering algorithm over the click-through bipartite graph to identify related queries for recommendation. Li et al. (2008) recommended related queries by computing the similarity between queries based on a query-URL vector model and leveraging a hierarchical agglomerative clustering method to rank similar queries. Ma et al. (2008) built two bipartite graphs (i.e., user-query and query-URL bipartite graphs) based on the click-through data and then developed a two-level query recommendation method. Zhang and Nasraoui (2006) first proposed to model users’ sequential querying behaviors as a query graph and calculate query similarity based on search sessions for recommendations. Boldi et al. (2008) further introduced the concept of the query-flow graph by aggregating the session information of users, and then performed a random walk on this graph to find relevant queries.

A number of studies have been conducted to take into account diversity in query recommendation. Mei et al. (2008) boosted long tail queries to enhance recommendation diversity by using a hitting time approach based on the query-URL bipartite graph. Guo et al. (2010) introduced a structured recommendation approach to recommend diverse queries that satisfy both users’ exploratory interests and search interests. Zhu et al. (2011) explicitly addressed the diversity in query recommendation based on an extended manifold ranking method.

Recently, some studies have considered utility in query recommendation (Jain et al. 2011; Anagnostopoulos et al. 2010; Ozertem et al. 2012), formalizing query recommendation as the problem of optimizing a global utility function. However, the utility defined in these works is largely different from ours. For example, in Jain et al. (2011), the utility actually refers to the diversity of the recommendations. In Anagnostopoulos et al. (2010), the query utility is defined as the probability that users will be satisfied by its search results, and is simply calculated from its click-through rate. Obviously, this utility definition is independent of users’ initial information needs, whereas ours depends on them. Moreover, we will show later that their utility is actually part of our model, and cannot reveal the true utility of a query alone. In Ozertem et al. (2012), the query utility is defined as the difference in discounts of clicked documents between a reformulated query and the original query; again, this definition only captures click-through information without considering post-click behaviors. In Zhu et al. (2013), the query utility is defined over click-through and reformulation behaviors without a specific form as in this paper, and an absorbing random walk over the session-flow graph is employed to estimate the query utility for recommendation.

2.2 Click model

Our model is inspired by click models for sequential user behaviors in Web search, based on the analogy between result examination/click behaviors and query reformulation/click behaviors. Therefore, we also briefly review related work on click models.

Click models (Chuklin et al. 2015) were proposed to tackle the task of estimating unbiased document relevance from query logs for Web search ranking. Granka et al. (2004) first noticed the position bias in their eye-tracking experiment, i.e., a document appearing at a higher position is more likely to be clicked than documents at lower positions. Richardson et al. (2007) introduced a multiplicative factor to increase the relevance of documents at lower positions. After that, Craswell et al. (2008) first formalized this idea as the examination hypothesis, which assumes that a user clicks a document if and only if it is both examined and relevant. A variety of variants (Dupret and Piwowarski 2008; Hu et al. 2011; Liu et al. 2014) build on the examination hypothesis with different derivations of the examination probability. For example, Dupret and Piwowarski (2008) extended the examination hypothesis and proposed the user browsing model (UBM), which assumes that the examination event depends not only on the current position but also on the previously clicked position. Hu et al. (2011) assumed that users’ clicks are also affected by intent bias, and proposed to characterize the diversity of search intents. Liu et al. (2014) proposed a two-stage examination model based on eye-tracking/click-through data, and found that the two-stage examination behaviors can be predicted from mouse movement behavior.

Craswell et al. (2008) further introduced the cascade hypothesis, which assumes that users examine search results in strictly sequential order without skips, and that the first document is always examined. The cascade model (Craswell et al. 2008) combined the examination hypothesis and the cascade hypothesis, and further constrained that users stop and abandon the search session as soon as a relevant document is clicked. However, this model is too restrictive and cannot deal with search sessions containing multiple clicks. To alleviate this limitation, several variants (Guo et al. 2009a, b; Chapelle and Zhang 2009) were proposed. The dependent click model (DCM) (Guo et al. 2009b) used position-dependent parameters as conditional probabilities of examining the next document after a click, generalizing the cascade model to sessions with multiple clicks. The click chain model (CCM) (Guo et al. 2009a) assumed that the probability of examining the current position also depends on the relevance at the previous position. The dynamic Bayesian network (DBN) model (Chapelle and Zhang 2009) distinguished perceived relevance from actual relevance, and assumed that a user continues to examine the next document if she is unsatisfied with the clicked document. There are also models that do not employ the cascade hypothesis, such as the session utility model (SUM) (Dupret and Liao 2010), the pure relevance model (PRM) (Srikant et al. 2010), the federated click model (FCM) (Chen et al. 2012; Chuklin et al. 2014), the vertical click model (VCM) (Wang et al. 2013), the task-centric click model (TCM) (Zhang et al. 2011), and the partially sequential click model (PSCM) (Wang et al. 2015).

The most closely related work to our QUM is the DBN model proposed in Chapelle and Zhang (2009). However, our approach differs from DBN in two significant ways: (1) the problem studied and the data set are completely different. Their goal is to estimate the relevance of URLs for Web search ranking based on the search results and clicks given a query, while we estimate the utility of queries for query recommendation based on the reformulations and clicks given an initial query. (2) The model is also different. In DBN, the satisfaction event depends only on the current state, while in our case we assume it depends on all previous states.

3 Our approach

In our work, we propose to recommend high utility queries, i.e., queries that can better satisfy users’ information needs. The major problem is how to learn query utility with respect to users’ original information needs. In this section, we first examine a typical user search session to show what reveals query utility, and give the definition of query utility. We then introduce a novel dynamic Bayesian network, referred to as the Query Utility Model (QUM), to infer query utility by simultaneously modeling users’ sequential reformulation and click behaviors. Finally, we show how to estimate the parameters of our model and infer the query utility.

3.1 Search session and query utility

Here we investigate search sessions to see what makes a query useful for users according to their original information needs. A search session, or session, is defined as the sequence of user search actions (such as submitting a query, clicking on a search result URL and reformulating a query) within a time limit for a particular search task. A typical search process in a session is illustrated in Fig. 2. Given some information need, a user submits an initial query to the search engine and obtains some returned results. Documents relevant to the user’s information need are shown in red, others in black. The user then goes through the search results and clicks the second and third results for further inspection (denoted by ticks before the results), as they seem relevant. The user obtains some useful information from the third result but is not yet satisfied. Therefore, the user reformulates a second query and submits it to the search engine. This time, however, the user finds no interesting results and performs no clicks. Therefore, the user issues a third query. From the returned results, the user finds two possibly relevant results, i.e., the first and third, and clicks them for further investigation. The user has then accumulated enough information and accomplishes the search task.

Fig. 2 A typical user search session. Tick ’\(\surd\)’ denotes a user click on a document, and red color denotes the relevance of a document

From the above search session, we can see that the first and third queries are useful with respect to the user’s original information needs. It shows that a useful query should satisfy two conditions: (1) The query’s search results should be attractive to the user, so that he/she clicks the results for further inspection, as with the first and third queries. If the search results do not attract the user to click, as with the second query, the query is not useful even if its results contain relevant documents. (2) The clicked search results should actually satisfy the user to some extent by providing information relevant to his/her original needs, e.g., the first and third results under the third query. Otherwise, the user obtains no useful information after clicking, e.g., the second result of the first query.

Based on the above analysis, we further divide the utility of a query into the following two components.

  • Perceived utility, which refers to the attractiveness of the query’s search results perceived by the user before investigating the content of these results. The higher the perceived utility a query has, the more likely the user would click the returned results.

  • Posterior utility, which refers to the satisfaction of the user from the clicked search results given his/her original needs. The higher the posterior utility a query has, the more likely the user would obtain useful information from the clicked results.

The query utility is then defined as the product of the two components, since the first one models the click likelihood while the latter models the post-click satisfaction.
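Loosely speaking (the formal model follows in Sect. 3.2), this product reads as an expected satisfaction:

$$\begin{aligned} U(q) = \underbrace{P(\mathrm {click} \mid q)}_{\text {perceived utility}} \times \underbrace{P(\mathrm {satisfied} \mid \mathrm {click},\, q)}_{\text {posterior utility}}. \end{aligned}$$

For instance, under this informal reading, a query whose results attract a click with probability 0.6 and, once clicked, satisfy the user with probability 0.4 would have utility \(0.6 \times 0.4 = 0.24\); the numbers here are purely illustrative.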

The remaining problem is how to learn the query utility automatically. From the above search session, we can see that the problem would be quite simple if we observed all users’ search behaviors and the exact relevance labels of the search results. However, true relevance labels are not available in real search logs, which makes the learning problem non-trivial. Without loss of generality, we assume that a user is more likely to click the search results of a query if she deems them relevant to her information needs. We also assume that the user acquires some useful information if the clicked results are relevant to her needs, and chooses to reformulate the next query if she has not yet been satisfied. Based on these assumptions, we are able to infer query utility from the collective search sessions automatically.

3.2 Query utility model

Based on the above analysis, we now introduce a novel dynamic Bayesian network, referred to as the Query Utility Model (QUM), to learn query utility from users’ sequential search behaviors. Suppose there are N search sessions starting from the same information needs, and a total of T distinct reformulated queries occurring in these N sessions, denoted by \(q_t\) \((1\le t \le T)\).

For a particular search session, we define four binary random variables, \(R_i\), \(C_i\), \(A_i\) and \(S_i\) to model reformulation, click, attractiveness and satisfaction events at the i-th reformulation position:

  • \(R_i\): whether there is a reformulation at position i;

  • \(C_i\): whether the user clicks on some of the search results of the reformulation at position i;

  • \(A_i\): whether the user is attracted by the search results of the reformulation at position i;

  • \(S_i\): whether the user’s information needs have been satisfied at position i;

where the first two events are observable from search sessions and the last two events are hidden.

Figure 3 plots the graphical model of QUM for a particular search session. The full model specification that accompanies Fig. 3 is as follows:

$$\begin{aligned} P(C_i=1 \mid R_i=1, A_i=1) = 1, \end{aligned}$$
(1)
$$\begin{aligned} P(A_i=1) = \alpha _{\phi (i)}, \end{aligned}$$
(2)
$$\begin{aligned} P(S_i=1 \mid C_{1:i}) = 2\sigma \left( \sum _{k=1}^i{\beta _{\phi (k)}\cdot I(C_k=1)}\right) -1, \end{aligned}$$
(3)
$$\begin{aligned} P(R_i=1 \mid R_{i-1}=1, S_{i-1}=1) = 0. \end{aligned}$$
(4)

where \(\phi (i)\) denotes the index of the query at the position i in a search session.

Fig. 3 Graphical model representation of the query utility model. Observed click and reformulation variables are shaded

The above equations describe our model as follows. The user reformulates queries sequentially, and stops once her information needs have been satisfied. There are clicks on the search results of a reformulated query if and only if the user issues the reformulation and is attracted by its search results (1). The probability of attraction depends only on the search results of the reformulated query (2), and is controlled by the parameter \(\alpha _{\phi (i)}\ge 0\), i.e., the perceived utility of the query at position i. After the user clicks and inspects the search results, there is a certain probability that she is satisfied at the current position. The probability of satisfaction at position i depends on how much utility she has accumulated from the previously clicked search results (3). Note that \(C_{1:i}\) denotes the vector \((C_1,C_2,\ldots ,C_i)\), \(\beta _{\phi (k)}\ge 0\) denotes the posterior utility of the query at position k, \(I(C_k=1)\) is an indicator function, and \(\sigma (x)\) is the logistic function:

$$\begin{aligned} \sigma (x) = \frac{1}{1+e^{-x}}. \end{aligned}$$
(5)

Once the user is satisfied, she will not reformulate the next query (4).
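To make the generative process behind Eqs. (1)–(4) concrete, the following minimal sketch simulates a single search session under QUM. The \(\alpha\) and \(\beta\) values and the identity mapping for \(\phi\) are illustrative assumptions, not learned quantities.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def simulate_session(alphas, betas):
    """Simulate one search session under QUM (Eqs. 1-4).

    alphas[t]: perceived utility of query t (its click probability).
    betas[t]:  posterior utility of query t (its satisfaction weight).
    Returns the observed sequence of (query, clicked) pairs.
    """
    session, accumulated = [], 0.0
    for t in range(len(alphas)):  # phi(i) = i here: the i-th reformulation is query i
        clicked = random.random() < alphas[t]  # Eqs. (1)-(2): click iff attracted
        session.append((t, clicked))
        if clicked:
            accumulated += betas[t]  # utility accumulates over clicked queries
        if random.random() < 2 * sigmoid(accumulated) - 1:  # Eq. (3): satisfied?
            break  # Eq. (4): a satisfied user stops reformulating
    return session

print(simulate_session(alphas=[0.7, 0.2, 0.8], betas=[0.5, 0.1, 1.2]))
```

Note that with no clicks the accumulated utility is zero and the satisfaction probability \(2\sigma (0)-1\) is exactly zero, so the simulated user never stops without having clicked, matching the model's assumptions.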

In our model, there are two parameters \(\alpha _t\) and \(\beta _t\) \((1\le t \le T)\) related to the utility of a query. The first models the perceived utility, as it reflects the attractiveness of a query’s search results. The second models the posterior utility, since it captures the satisfaction of a user from the clicked search results. As aforementioned, we define the utility of a query as the product of the two components:

$$\begin{aligned} u_t := \alpha _t\times \beta _t. \end{aligned}$$
(6)

Note that in QUM, \(\alpha _t\) also denotes the probability that users click the search results of query \(q_t\) (i.e., the click likelihood), and \(\beta _t\) models the post-click satisfaction. Based on the above definition, we can see that \(u_t\) actually stands for the expected satisfaction users obtain from the search results of the query with respect to their original information needs.

To the best of our knowledge, this is the first dynamic Bayesian network that attempts to model query utility from users’ sequential search behaviors. To simplify the query utility model, in this work we only consider whether the search results of a reformulated query received any clicks, without specifying which URL was clicked. The utility model could be further extended to capture the specific clicked URLs for finer-grained modeling. Moreover, we regard a search session that terminates with the user clicking some results as a successful session, and otherwise as a failure. This assumption may not always hold, since users may also stop searching because of impatience even after clicking some results. However, even under this simple assumption, our model captures query utility very well, as shown in the experimental section. In future work, we may further improve the model by introducing a persistent parameter to capture impatience.

3.3 Parameter estimation

We use maximum likelihood estimation to estimate the model parameters \(\alpha =\{\alpha _t|1\le t \le T\}\) and \(\beta =\{\beta _t|1\le t \le T\}\). For a specific search session with M reformulations (note that the initial query itself is not modeled), the joint probability of all the variables is given by:

$$\begin{aligned} P(C_{1:M},R_{1:M},A_{1:M},S_{1:M}) = \prod _{i=1}^M P(C_i \mid A_i,R_i) \cdot P(R_i \mid R_{i-1},S_{i-1}) \cdot P(A_i) \cdot P(S_i \mid C_{1:i}), \end{aligned}$$
(7)

where

$$\begin{aligned} P(C_i|R_i,A_i)&=(R_i\cdot A_i)^{C_i} \cdot (1-R_i\cdot A_i)^{1-C_i}, \end{aligned}$$
(8)
$$\begin{aligned} P(R_i|R_{i-1},S_{i-1})&=(R_{i-1} \cdot (1-S_{i-1}))^{R_i} \nonumber \\&\quad \cdot (1-R_{i-1} \cdot (1-S_{i-1}))^{1-R_i},\end{aligned}$$
(9)
$$\begin{aligned} P(A_i)&={\alpha _{\phi (i)}}^{A_i} \cdot (1-\alpha _{\phi (i)})^{1-A_i},\end{aligned}$$
(10)
$$\begin{aligned} P(S_i|C_{1:i})&=\left( 2\sigma \left( \sum _{k=1}^i{\beta _{\phi (k)}\cdot I(C_k=1)}\right) -1\right) ^{S_i}\nonumber \\&\quad \cdot \left( 2-2{\sigma \left( \sum _{k=1}^i{\beta _{\phi (k)}\cdot I(C_k=1)}\right) }\right) ^{1-S_i}. \end{aligned}$$
(11)

Note that \(R_i\) is deterministic within an observed search session: \(R_i=1\) for all \(1 \le i \le M\). By substituting Eqs. (8)–(11) into Eq. (7), we obtain the joint probability as follows:

$$\begin{aligned} P(C_{1:M},R_{1:M},A_{1:M},S_{1:M}) = \prod _{i=1}^{M} {\alpha _{\phi (i)}}^{A_i}\cdot (1-\alpha _{\phi (i)})^{1-A_i} \cdot \left( 2\sigma \left( \sum _{k=1}^{i}{\beta _{\phi (k)}\cdot I(C_k=1)}\right) -1\right) ^{S_i} \cdot \left( 2-2\sigma \left( \sum _{k=1}^{i}{\beta _{\phi (k)}\cdot I(C_k=1)}\right) \right) ^{1-S_i}. \end{aligned}$$
(12)

Therefore, the overall log-likelihood of the N search sessions can be written as:

$$\begin{aligned} \mathcal {L} = \sum _{j=1}^{N} \sum _{i=1}^{M} A_i^j\cdot \log {(\alpha _{\phi _j(i)})} + (1-A_i^j)\cdot \log (1-\alpha _{\phi _j(i)}) + S_i^j\cdot \log \left( 2\sigma \left( \sum _{k=1}^{i}{\beta _{\phi _j(k)}\cdot I(C_k^j=1)}\right) -1\right) + (1-S_i^j)\cdot \log \left( 2-2\sigma \left( \sum _{k=1}^i\beta _{\phi _j(k)}\cdot I(C_k^j=1)\right) \right) , \end{aligned}$$
(13)

where the click events \(C_i\) are observed in the search sessions. If \(C_i=1\), we can infer that the results of the reformulation at position i are attractive (i.e., \(A_i=1\)); otherwise, \(A_i=0\). Besides, based on our assumption, we have \(S_{1:M-1}=0\) and \(S_M=1\) if the search session ends with the user clicking some results (i.e., a successful search session); otherwise, we have \(S_{1:M}=0\).
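In code, recovering the hidden variables from an observed session is straightforward. A minimal sketch, assuming a hypothetical session format of (query_id, clicked) pairs plus a success flag:

```python
def label_session(session, success):
    """Infer the hidden A_i and S_i from an observed session.

    session: list of (query_id, clicked) pairs, one per reformulation.
    success: True iff the session ended with a click (a successful session).
    """
    A = [1 if clicked else 0 for _, clicked in session]  # C_i = 1 implies A_i = 1
    S = [0] * len(session)
    if success:
        S[-1] = 1  # satisfied only at the final reformulation
    return A, S
```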

By maximizing the log-likelihood function in Eq. (13), we can easily obtain the estimate of the perceived utility

$$\begin{aligned} \alpha _t = \frac{\sum _{j=1}^{N} \sum _{i=1}^{M} A_i^j\cdot I(\phi _j(i)=t)}{\sum _{j=1}^{N} \sum _{i=1}^{M} I(\phi _j(i)=t)} = \frac{\sum _{j=1}^{N} \sum _{i=1}^{M} I(C_i^j=1)\cdot I(\phi _j(i)=t)}{\sum _{j=1}^{N} \sum _{i=1}^{M} I(\phi _j(i)=t)}. \end{aligned}$$
(14)

Note here \(\phi _j(i)\) denotes the index of the query at position i in the j-th search session. From this result we can see that the perceived utility \(\{\alpha _t\}_{t=1}^T\) captures the probability that users click the search results of query \(q_t\) given their original information needs.
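Since Eq. (14) is simply a per-query click frequency over all sessions sharing the initial query, it can be computed in one pass. A minimal sketch, using the same hypothetical session format as above:

```python
from collections import defaultdict

def estimate_alpha(sessions):
    """Eq. (14): fraction of the occurrences of query q_t that received a click."""
    clicks, counts = defaultdict(int), defaultdict(int)
    for session in sessions:
        for query_id, clicked in session:
            counts[query_id] += 1
            clicks[query_id] += int(clicked)
    return {t: clicks[t] / counts[t] for t in counts}
```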

The remaining problem is to estimate the posterior utility \(\{\beta _t\}_{t=1}^T\). We now consider maximizing \(\mathcal {L}\) subject to the inequality constraints \(\beta _t \ge 0\) \((1 \le t \le T)\). The new objective function is:

$$\begin{aligned} \Lambda = \mathcal {L} + \sum _{t=1}^{T}{\lambda _t \beta _t} - \mu \Vert \beta \Vert _2^2, \end{aligned}$$
(15)

where \(\lambda _t\) is the Lagrangian coefficient and \(\mu\) is the regularization parameter. Note that we impose an L2-norm penalty on the objective function to avoid over-fitting.

An optimal solution to this problem should satisfy the Karush–Kuhn–Tucker (KKT) optimality conditions (Kuhn and Tucker 1951):

$$\begin{aligned} \left\{ \begin{aligned}&\beta _t \ge 0, \\&\lambda _t \ge 0, \\&\lambda _t \cdot \beta _t = 0. \end{aligned} \right. \end{aligned}$$
(16)

As in Tognola and Bacher (1999), we convert the inequality constraints \(\beta _t \ge 0\) \((1 \le t \le T)\) to equality constraints by introducing slack variables \(z_t\) \((1 \le t \le T)\) which satisfy

$$\begin{aligned} \beta _t - z_t = 0, \end{aligned}$$
(17)

with \(z_t \ge 0\). Since \(\lambda _t\) and \(z_t\) must be non-negative, we write them in quadratic form, i.e., \(\lambda _t^2\) and \(z_t^2\).

Applying these steps to the optimality conditions, we get the following transformed optimality conditions:

$$\left\{ \begin{array}{l} \dfrac{\partial }{\partial \beta _t}\left( \mathcal {L}+ \sum _{t=1}^{T}{\lambda _t^2 \beta _t} - \mu \Vert \beta \Vert _2^2\right) =0,\\ -\beta _t + z_t^2=0,\\ \lambda _t^2 \beta _t=0. \end{array} \right.$$
(18)

The solution of the above optimization problem is achieved by a Newton–Raphson algorithm. The partial derivatives with respect to all the variables \(\beta _t\), \(\lambda _t\) and \(z_t\) are as follows:

$$\begin{aligned} \left[ \begin{array}{ccc} \dfrac{\partial ^2\mathcal {L}}{\partial \beta _t^2}-2\mu & 0 & 2\lambda _t \\ -1 & 2z_t & 0\\ \lambda _t^2 & 0 & 2\lambda _t\beta _t \end{array} \right] \left[ \begin{array}{c} \varDelta \beta _t\\ \varDelta z_t\\ \varDelta \lambda _t \end{array} \right] = - \left[ \begin{array}{c} \dfrac{\partial \mathcal {L}}{\partial \beta _t}-2\mu \beta _t+\lambda _t^2\\ -\beta _t + z_t^2\\ \lambda _t^2 \beta _t \end{array} \right] . \end{aligned}$$
(19)

Solving the above linear system of equations (19), we obtain the update rules:

$$\begin{aligned} \left\{ \begin{aligned} \varDelta \beta _t&=\dfrac{-\beta _t \frac{\partial \mathcal {L}}{\partial \beta _t} + 2\mu \beta _t^2}{\beta _t \frac{\partial ^2\mathcal {L}}{\partial \beta _t^2} - 2\mu \beta _t - \lambda _t^2},\\ \varDelta z_t&= \dfrac{\beta _t - z_t^2 + \varDelta \beta _t}{2z_t},\\ \varDelta \lambda _t&= \dfrac{-\lambda _t^2 \beta _t - \lambda _t^2 \varDelta \beta _t}{2\lambda _t \beta _t}. \end{aligned} \right. \end{aligned}$$
(20)
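For a compact reference implementation, the sketch below maximizes the \(\beta\)-dependent part of Eq. (13) with the L2 penalty of Eq. (15), delegating the constrained optimization to an off-the-shelf bounded solver (scipy's L-BFGS-B) rather than reproducing the Newton–Raphson updates of Eq. (20); query ids are assumed to be integers in \([0, T)\), as in the earlier sketches, and the default \(\mu\) is an arbitrary illustrative value.

```python
import numpy as np
from scipy.optimize import minimize

def estimate_beta(sessions, successes, T, mu=2.0):
    """Penalized MLE of the posterior utilities beta, subject to beta >= 0.

    sessions:  list of sessions, each a list of (query_id, clicked) pairs.
    successes: list of bools, True iff the session ended satisfied.
    """
    def neg_objective(beta):
        ll = 0.0
        for session, success in zip(sessions, successes):
            acc = 0.0
            for i, (t, clicked) in enumerate(session):
                if clicked:
                    acc += beta[t]  # accumulated utility from clicks, Eq. (3)
                p_sat = 2.0 / (1.0 + np.exp(-acc)) - 1.0
                if success and i == len(session) - 1:  # S_i = 1 at the last step
                    ll += np.log(max(p_sat, 1e-12))
                else:  # S_i = 0 elsewhere
                    ll += np.log(max(1.0 - p_sat, 1e-12))
        return -(ll - mu * np.dot(beta, beta))  # L2 penalty as in Eq. (15)

    result = minimize(neg_objective, x0=np.full(T, 0.1),
                      bounds=[(0.0, None)] * T, method="L-BFGS-B")
    return result.x
```

The bound constraints here play the role of the slack variables \(z_t\) in the derivation above; the optimum of the penalized likelihood is the same, only the numerical route differs.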

3.4 High utility recommendation

Based on the above query utility model, we finally obtain our high utility query recommendation approach. Given the original information need, typically represented by an input query, we collect all the search sessions starting from that query to approximate the search behaviors arising from the information need. All queries in these sessions other than the input query become recommendation candidates. We then apply our query utility model to infer the utilities of the candidates with respect to the original information need. Finally, we rank all candidates by their utilities and present the top-ranked ones to users.
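A minimal sketch of this recommendation step, reusing the hypothetical estimators above (query ids are integers in \([0, T)\)):

```python
def recommend(sessions, successes, T, top_k=10):
    alpha = estimate_alpha(sessions)               # perceived utility, Eq. (14)
    beta = estimate_beta(sessions, successes, T)   # posterior utility, Eq. (15)
    utility = {t: alpha.get(t, 0.0) * beta[t] for t in range(T)}  # u_t, Eq. (6)
    return sorted(utility, key=utility.get, reverse=True)[:top_k]
```

In practice the input query itself would be excluded from the ranked list before presenting the top candidates.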

4 Experiments

In this section, we empirically evaluate the effectiveness of our proposed high utility recommendation method. We first introduce the dataset, the baseline methods used for comparison, and the evaluation metrics. We then conduct extensive experiments to compare the performance of our approach with state-of-the-art query recommendation methods. Finally, we study the effect of the parameters of our proposed model.

4.1 Dataset

Our experiments are based on a publicly available dataset, namely the UFindIt log data, which was collected over a period of 6 months. The dataset contains 40 initial queries representing 40 original information needs; these are real queries selected from sites such as wiki.answers.com and Yahoo! Answers. The selection criterion is that a query clearly expresses users’ information needs while such needs cannot be simply satisfied by directly issuing the query to a search engine. Some examples are shown in Fig. 4. Users are then required to search and find right answers (i.e., relevant results) according to these information needs. All the users’ search sessions are logged in this process. For each session, the following details are recorded: the queries, clicked URLs, click positions, browsing trails, and the time-stamps of all the events. Ground-truth (relevance) labels are then annotated on users’ submitted search results, indicating whether each result is a right answer to the original information need.

We process the data by discarding interleaved sessions, in which participants search for multiple information needs in one search process. We also remove sessions with no reformulations and sessions that start without a query. After processing, we obtain 1298 search sessions, 1086 distinct queries and 1555 distinct clicked URLs. For each initial query, the average number of search sessions is 32 and the average number of distinct reformulated queries is 26.

For the experiments, we take all 40 initial queries for testing, and evaluate the quality of the top 10 recommendations for each query based on the ground-truth labels from the UFindIt dataset. Note that for each test query, we use all its search sessions in the log data to infer the parameters of our model for recommendation.

Fig. 4 Some examples of test queries from the UFindIt collection

4.2 Baseline methods

To evaluate the performance of our QUM method, we compare it with four baseline query recommendation methods. These baseline methods can be grouped into two categories, namely frequency-based methods and graph-based methods:

Frequency-based methods:

  • Adjacency (ADJ): Given a test query q, the most frequent queries adjacent to q in the same sessions are recommended to users (Jones et al. 2006).

  • Co-occurrence (CO): Given a test query q, the most frequent queries co-occurring with q in the same sessions are selected as recommendations (Huang et al. 2003).

Graph-based methods:

  • Query-Flow Graph (QF): This is a state-of-the-art query recommendation method proposed by Boldi et al. (2008). In QF, a query-flow graph is first constructed based on collective search sessions, and a random walk is then performed on this graph for query recommendation.

  • Click-through Graph (CT): This is another state-of-the-art query recommendation method proposed by Mei et al. (2008). In CT, a query-URL bipartite graph is first constructed by mining query logs, a random walk is then performed on this graph and the hitting time is leveraged as a measure to select queries for recommendation.

Besides, to further investigate the influence of the two component utilities (i.e., perceived utility and posterior utility) in our QUM method, we also use each of them separately to recommend queries, yielding two additional baselines, namely the Perceived Utility method (PCU) and the Posterior Utility method (PTU). Note that the PCU method is similar in spirit to the model proposed in Anagnostopoulos et al. (2010), which defines query utility by click-through rate.

4.3 Evaluation metrics

In this paper, we propose to recommend high utility queries to directly pursue the goal of query recommendation, i.e., assisting users to reformulate queries so that they can acquire their desired information successfully and quickly. Correspondingly, the evaluation should also reflect this goal, i.e., whether users will be satisfied by the recommendations with respect to their original information needs. Traditional evaluation methods for relevant query recommendation (e.g., precision and recall; He et al. 2009) no longer fit our scenario since they do not directly target high utility query recommendation. Traditional evaluation of relevant query recommendation is usually based on relevance labels manually assigned to recommended queries (Cao et al. 2008). However, a query labeled as relevant may not necessarily help users find target results, so such evaluation cannot assess the true effectiveness of query recommendation. This is the particular reason why some larger datasets (e.g., the Yandex relevance prediction challenge dataset) cannot be used to evaluate utility-based recommendation even though they also contain reformulations and relevance labels.

Fortunately, the UFindIt log data provides ground-truth (relevance) labels on users’ submitted search results indicating whether this result is a right answer of the original information need. We can rely on these labels to evaluate the utility of the recommended queries. Note that none of these labels are used in our training process. Specifically, we define two evaluation metrics, namely the Query Relevant Ratio (QRR) and the Mean Relevant Document (MRD), to measure the quality of the recommendations.

For a specific information need, the metric QRR is defined as:

$$\begin{aligned} QRR_{q_0}(q)=\frac{RQ_{q_0}(q)}{N_{q_0}(q)}, \end{aligned}$$
(21)

where \(RQ_{q_0}(q)\) denotes the total frequency of query q with relevant results submitted by users with respect to the original search need \(q_0\), and \(N_{q_0}(q)\) denotes the total frequency of query q issued by users with respect to the original search need \(q_0\). This metric measures the probability that a user finds relevant results when she uses query q for her search task given the original query \(q_0\). A higher QRR means that a user will be more likely to find useful results with respect to the original information needs.

Moreover, for a specific information need, the metric MRD is defined as:

$$\begin{aligned} MRD_{q_0}(q)=\frac{RD_{q_0}(q)}{N_{q_0}(q)}, \end{aligned}$$
(22)

where \(RD_{q_0}(q)\) denotes the total frequency of relevant results submitted by users when they use query q for their search tasks with respect to the original search need \(q_0\), and \(N_{q_0}(q)\) denotes the total frequency of query q issued by users with respect to the original search need \(q_0\). This metric measures the average number of relevant results a user finds when she uses query q for her search task given the original query \(q_0\). A higher MRD means that a user will find more relevant results according to the original information needs.
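Both metrics are simple frequency ratios over the logged occurrences of q. A minimal sketch, assuming a hypothetical format where each occurrence of q for the need \(q_0\) carries the number of relevant results the user submitted:

```python
def qrr_mrd(relevant_counts):
    """relevant_counts: one int per time q was issued for need q0,
    the number of relevant results the user submitted with q."""
    n = len(relevant_counts)
    qrr = sum(1 for r in relevant_counts if r > 0) / n  # Eq. (21)
    mrd = sum(relevant_counts) / n                      # Eq. (22)
    return qrr, mrd

print(qrr_mrd([2, 0, 1]))  # QRR = 2/3, MRD = 1.0
```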

4.4 Overall evaluation results

Figure 5a, b show the performance of the top recommendations from different methods under the QRR and MRD metrics, respectively. From Fig. 5, we can see that the two frequency-based methods, ADJ and CO, perform poorly under both metrics. This shows that simply considering the queries most frequently adjacent to or co-occurring with the given query in the same session (which are usually highly relevant) cannot guarantee recommending useful queries that satisfy users’ information needs. The two graph-based methods, QF and CT, perform better than the frequency-based methods, indicating that by leveraging the local relationships (i.e., either the co-click or the reformulation relationship) between query pairs to collectively reveal the global relationships between queries, we are able to find better query recommendations.

Fig. 5 Comparison of the performance of all approaches (ADJ, CO, QF, CT, PCU, PTU, QUM) in terms of QRR and MRD

The PCU method, which relies only on queries’ perceived utility for recommendation, performs comparably to the two graph-based methods. As noted above, the PCU method recommends queries according to the expected click-through rate on their search results (i.e., perceived utility). Since users are more likely to click search results that they deem relevant, the perceived utility implicitly conveys the utility information of the queries. However, only after inspecting the content of the clicked results can users decide whether the results are truly relevant. Therefore, a useful query may have high perceived utility, but a query with high perceived utility is not necessarily highly useful. This explains why the PCU method works well at the top 5 positions but performs poorly at the top 10 positions compared with the graph-based methods.

Moreover, the PTU method, which uses queries’ posterior utility for recommendation, outperforms the above baseline methods under both metrics. This indicates that by simultaneously considering users’ reformulation and click behaviors, the learned posterior utility can better reflect the usefulness of queries. Furthermore, the PTU method consistently outperforms the PCU method, which further demonstrates the importance of taking users’ reformulation behaviors (i.e., the satisfaction event) into account when capturing query utility.

Finally, as we can see from Fig. 5, our QUM method performs better than all the baseline recommendation methods. We conduct t-tests (\(p \le 0.05\)) over the results and find that the performance improvements are significant compared with both the frequency-based and graph-based baselines. This shows that by leveraging our query utility model, we are able to infer and recommend high utility queries for users. Meanwhile, the QUM method also consistently outperforms the two component utility methods (i.e., PCU and PTU). This demonstrates that to recommend high utility queries, we need to find queries that not only attract users to click their search results, but also provide useful information in the clicked results. The two aspects are complementary and neither can be ignored for high utility recommendation.

4.5 Evaluation according to query difficulty

To further investigate the performance of our QUM method in high utility query recommendation, we break down the comparison among different approaches according to query difficulty. Based on the UFindIt log data, we separate the test queries into three groups (easy, medium, hard) by their difficulty. The evaluation results under the QRR and MRD metrics are shown in Table 1, where the percentages in parentheses are the relative improvements of our QUM method over the corresponding methods.

From the results we can see that, not surprisingly, the performance of all the recommendation methods declines as query difficulty increases. However, our QUM method consistently outperforms all the other methods at every difficulty level. Interestingly, as query difficulty increases, the improvements of our QUM method over the other recommendation methods also increase. For example, on easy queries, the frequency-based and graph-based methods score above 0.6 and 0.8 under MRD@10, respectively, while our QUM method reaches 0.844. It seems that for easy queries, most recommendation methods work well, since popular or relevant recommendations can already help users find the right results.

However, on hard queries, the performance of the two frequency-based methods (i.e., ADJ and CO) drops to 0.284 and 0.34 in terms of MRD@10, respectively, and that of the two graph-based methods (i.e., QF and CT) drops to 0.414 and 0.424, respectively, while our QUM method reaches 0.504. The relative improvements of QUM over the frequency-based and graph-based methods are above 40% and around 20%, respectively. This shows that when users’ search tasks are difficult, it is more beneficial to directly recommend high utility queries beyond merely relevant ones. Similar results can be observed under MRD@5, QRR@5, and QRR@10.

Table 1 Performance comparison of different methods under different query difficulty levels

4.6 Impact of parameter μ

There is one parameter \(\mu\) in our QUM model, which denotes the weight of the regularization over the posterior utility \(\{\beta _t\}_{t=1}^T\). In this section, we discuss the impact of \(\mu\) on the performance of our QUM method. We varied \(\mu\) from 0 to 10 with a step size of 1, and evaluated the performance of our QUM method in terms of QRR and MRD, respectively. The results are depicted in Fig. 6a, b.

From Fig. 6, we observe that our QUM method achieves its lowest performance when \(\mu =0\), and performs better when \(\mu >0\). Moreover, our approach reaches its best performance when \(\mu\) is set between 1 and 4, while the performance slightly decreases as \(\mu\) increases further. This illustrates that imposing the L2-norm penalty on the objective function plays a positive role. Generally speaking, the impact of \(\mu\) on the performance of QUM is relatively stable when \(\mu\) is small, e.g., between 1 and 4.

Fig. 6 The impact of parameter \(\mu\) on the performance of the QUM method in terms of QRR and MRD

4.7 Discussion

In this paper, we conducted experiments and evaluation based on the UFindIt log data for the following reasons. On one hand, it is a publicly available data set that enables controlled, yet realistic and reproducible studies. On the other hand, it conveys the information necessary for evaluating the true effectiveness of query recommendation, i.e., explicit original information needs and relevance labels on users’ clicked search results with respect to those needs. Other public query logs, such as AOL, cannot directly support this kind of evaluation.

In fact, our approach can be directly applied in real scenarios, e.g., query recommendation in search engines based on real search logs. With richer session information from large-scale log data, we can expect to infer query utility more accurately. However, some challenges arise when applying the recommendation approach to real search logs. For example, detecting sessions in real query logs is not trivial. The problem of query session detection has been studied extensively (Boldi et al. 2008; He and Göker 2000), where splitting sessions by a timeout threshold (e.g., 30 min) is commonly employed and empirically proven effective. However, session detection is out of the scope of this paper. Meanwhile, real log data may contain noise that affects recommendation performance; this can be alleviated by data-cleaning pre-processing.
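As a reference point for the timeout heuristic mentioned above, a minimal sketch that splits one user's time-ordered query events into sessions with a 30-minute gap threshold (the event format is hypothetical):

```python
from datetime import timedelta

def split_sessions(events, timeout=timedelta(minutes=30)):
    """events: list of (timestamp, query) pairs sorted by timestamp."""
    sessions, current = [], []
    for ts, query in events:
        if current and ts - current[-1][0] > timeout:
            sessions.append(current)  # gap too large: close the session
            current = []
        current.append((ts, query))
    if current:
        sessions.append(current)
    return sessions
```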

A potential concern when applying the proposed recommendation approach in real-world settings is the availability of reformulations for a particular query. For example, if a query has never appeared in a search session (e.g., a rare/new query), we cannot know its reformulations or estimate their utilities in advance. Such rare/new queries are the most difficult problem in practice for most query recommendation methods. A possible solution is to first find candidate queries (which have session information) with intent similar to the rare/new query, based on certain similarity metrics (e.g., query string similarity or search result similarity). We can then rely on the search sessions of these candidate queries to estimate the utility of the possible reformulations, and produce recommendations for the original rare/new query with respect to both utility and intent similarity. We leave this as future work.

5 Conclusions

In this paper, we argue that we should recommend high utility queries rather than only relevant queries, to directly achieve the ultimate goal of query recommendation, i.e., to assist users in reformulating queries so that they can acquire their desired information successfully and quickly. We formally define query utility and identify its two components (i.e., perceived utility and posterior utility). We then propose a novel dynamic Bayesian network, QUM, to infer query utility from users’ search behaviors for recommendation. We evaluate the performance on a real query log, and the experimental results show that the proposed approach outperforms state-of-the-art baselines in providing useful recommendations.

For future work, one major direction is to improve our generative model to better capture query utility. We may extend the model to capture the specific clicked URLs for finer-grained modeling. Besides, we may introduce a persistent parameter into our model to capture the case where the user stops searching out of impatience. As mentioned above, we may also make use of the search sessions of similar queries in QUM to address the rare/new query recommendation problem. Moreover, it would be interesting to take diversity into account in our model to reduce redundancy in a set of recommended high utility queries.