1 Introduction

Multi-label classification (MLC) is a machine learning problem in which models are sought that assign a subset of labels (classes) to each instance, unlike conventional (single-label) classification, which involves predicting only a single class. Multi-label classification problems are ubiquitous and occur naturally, for instance, in assigning keywords to a paper, tags to resources in a social network, objects to images or emotional expressions to human faces.

In general, the problem of multi-label learning comes with two fundamental challenges. The first refers to the computational complexity of the algorithms: if the number of labels is large, a complex approach may not be applicable in practice. Therefore, the scalability of algorithms is a key issue in this field. The second is related to the very nature of multi-label data. Not only is the number of labels typically large, but each instance also belongs to a variable-sized subset of labels simultaneously. Moreover, and perhaps even more importantly, the labels will normally not occur independently of each other; instead, there are statistical dependencies between them. From a learning and prediction point of view, these relationships constitute a promising source of information, in addition to that coming from the mere description of the instances. Thus, it is hardly surprising that research on MLC has focused very much on the design of new methods that are able to detect, and benefit from, interdependencies among labels.

Several approaches have been proposed in the literature to cope with MLC. Firstly, researchers tried to adapt and extend different state-of-the-art binary or multi-class classification algorithms (Elisseeff and Weston 2005; McCallum 1999; Zhang and Zhou 2007). Secondly, they analyzed label dependence in more depth and attempted to design new approaches that exploit label correlations (Dembczyński et al. 2012). In this regard, two kinds of label dependence have been formally distinguished, namely conditional dependence (Dembczyński et al. 2010; Montañés 2011, 2014; Read et al. 2011) and marginal (unconditional) dependence (Cheng and Hüllermeier 2009). Pairwise relations (Elisseeff and Weston 2005), relations in sets of different sizes (Read et al. 2011; Tsoumakas and Vlahavas 2007) and relations over the whole set of labels (Cheng and Hüllermeier 2009; Montañés 2011) have also been exploited.

Regarding conditional label dependence, the approach called probabilistic classifier chains (PCC) has aroused great interest in the multi-label community, since it offers the excellent property of being able to estimate the joint conditional distribution of the labels. However, the original PCC algorithm (Dembczyński et al. 2010) suffers from high computational cost, since it performs an exhaustive search as its inference strategy to obtain optimal solutions in terms of a given loss function. Several efforts using different search and sampling strategies are currently being made to overcome this drawback, including uniform-cost search (Dembczynski et al. 2012), beam search (Kumar et al. 2012, 2013) and Monte Carlo sampling (Dembczynski et al. 2012; Read et al. 2014). All of these algorithms successfully estimate an optimal solution as reached by the original PCC (Dembczyński et al. 2010), while reducing the computational cost in terms of both the number of candidate solutions explored and execution time. The main contribution of this paper is to propose an alternative based on a heuristic search strategy. In particular, the proposal consists of obtaining admissible heuristics for the well-known A* algorithm (Hart et al. 1968). In this respect, we have already started to fill this gap in the literature with a recently published preliminary work (Mena et al. 2015), concluding that the proposal guarantees not only optimal predictions in terms of subset 0/1 loss, but also that it explores fewer nodes than all previous methods that also provide optimal predictions. Unfortunately, after studying this heuristic in depth, two main drawbacks can be stated: (i) it can only be defined for linear models and (ii) its computational cost is moderately high. In this direction, the goal of this paper is twofold. On the one hand, this work goes further by defining a family of heuristics through a parameter that controls the trade-off between the number of nodes explored and the cost of computing the heuristic; moreover, a special value of the parameter leads to a heuristic suitable for non-linear models. On the other hand, this work also studies and analyzes the situations in which the cost of computing the heuristic pays off within the overall computational cost, showing steady behavior with regard to other algorithms. All these methods are analyzed and experimentally compared over a wide range of multi-label datasets.

The rest of the paper is organized as follows. Section 2 formally describes the multi-label framework and the principles of PCC. Section 3 describes and discusses the properties and behavior of the different state-of-the-art approaches for inference in PCCs. Section 4 details the heuristic search framework and defines admissible heuristics for the A* algorithm. Exhaustive experiments are shown and discussed in Sect. 5. Finally, Sect. 6 offers some conclusions and includes new directions for future work.

2 Probabilistic classifier chains in multi-label classification

This section formally describes the MLC task and the PCC methods.

2.1 Formal settings of multi-label classification and loss functions

Let \(\mathcal {L}=\{\ell _1, \ell _2, \ldots , \ell _m\}\) be a finite and non-empty set of m labels and \(S = \{(\varvec{x}_1, \varvec{y}_1),\ldots ,(\varvec{x}_n, \varvec{y}_n)\}\) a training set independently and randomly drawn according to an unknown probability distribution \({\mathbf {P}}({\mathbf {X}},{\mathbf {Y}})\) on \(\mathcal {X}\times \mathcal {Y}\), where \(\mathcal {X}\) and \(\mathcal {Y}\) are the input and the output space, respectively. The former is the space of instance descriptions, whereas the latter is given by the power set \(\mathcal {P}(\mathcal {L})\) of \(\mathcal {L}\). To ease notation, we define \(\varvec{y}_{i}\) as a binary vector \(\varvec{y}_{i}= (y_{i,1}, y_{i,2}, \ldots , y_{i,m} )\) in which \(y_{i,j}=1\) indicates the presence (relevance) and \(y_{i,j}=0\) the absence (irrelevance) of \(\ell _j\) in the labeling of \(\varvec{x}_{i}\). Hence, \(\varvec{y}_{i}\) is the realization of a corresponding random vector \({\mathbf {Y}}=(\mathbf {Y_1}, \mathbf {Y_2},\ldots , \mathbf {Y_m})\). Using this convention, the output space can also be defined as \(\mathcal {Y}=\{0,1\}^{m}\). The goal in MLC is to induce from S a hypothesis \(\varvec{f}: \mathcal {X}\longrightarrow \mathcal {Y}\) that minimizes the risk in terms of a certain loss function \(L(\cdot )\) when it provides a vector of relevant labels \(\varvec{y}=\varvec{f}(\varvec{x})=(f_1(\varvec{x}),f_2(\varvec{x}),\ldots ,f_m(\varvec{x}))\) for an unlabeled query instance \(\varvec{x}\). This risk can be defined as the expected loss over the joint distribution \({\mathbf {P}}({\mathbf {X}},{\mathbf {Y}})\), that is,

$$\begin{aligned} R_L(\varvec{f})=\mathbb {E}_{{\mathbf {X}},{\mathbf {Y}}}L({\mathbf {Y}},\varvec{f}({\mathbf {X}})). \end{aligned}$$
(1)

Therefore, denoting by \({\mathbf {P}}(\varvec{y}\, | \,\varvec{x})\) the conditional distribution of \({\mathbf {Y}}=\varvec{y}\) given \({\mathbf {X}}=\varvec{x}\), the so-called risk minimizer \(\varvec{f}^*\) can be expressed by

$$\begin{aligned} \varvec{f}^*(\varvec{x})=\arg \min _{\varvec{f}}\sum _{\varvec{y}\in \mathcal {Y}} {\mathbf {P}}(\varvec{y}\, | \,\varvec{x})L(\varvec{y},\varvec{f}(\varvec{x})). \end{aligned}$$
(2)

Note that the conditional distribution \({\mathbf {P}}(\varvec{y}\, | \,\varvec{x})\) has different properties that are crucial for optimizing different loss functions. In this respect, the strategy a given MLC algorithm follows for modeling label dependence determines the loss function it optimizes. Unfortunately, for most algorithms it is far from obvious which loss function they actually attempt to optimize.

With regard to loss functions, several performance measures have been used for evaluating MLC. The most specific ones are the subset 0/1 loss and the Hamming loss, but there exist other measures borrowed from other research fields, such as \(F_1\) or the Jaccard index. Here, we will focus just on the subset 0/1 loss, since it is the measure PCCs are able to optimize. The subset 0/1 loss checks whether the predicted and relevant label subsets are equal or not, and is defined by

$$\begin{aligned} L_{0/1}(\varvec{y},\varvec{f}(\varvec{x}))=[\![ \varvec{y}\ne \varvec{f}(\varvec{x}) ]\!], \end{aligned}$$
(3)

where \([\![ \cdot ]\!]\) denotes the indicator function, which is 1 if its argument is true and 0 otherwise.

For this evaluation measure, it is sufficient to take the mode of the entire joint conditional distribution. Formally, the risk minimizer adopts the following simplified form

$$\begin{aligned} \varvec{f}^*(\varvec{x})=\arg \max _{\varvec{y}\in \mathcal {Y}} {\mathbf {P}}(\varvec{y}\, | \,\varvec{x}). \end{aligned}$$
(4)

2.2 Probabilistic classifier chains

PCCs (Dembczyński et al. 2010), like CC (Read et al. 2009, 2011), are based on learning a chain of classifiers. These methods take an order over the label set and train a probabilistic binary classifier to estimate \({\mathbf {P}}(y_j\, | \,\varvec{x}, y_1, \ldots , y_{j-1})\) for each label \(\ell _j\) following this order. Hence, the probabilistic model obtained for predicting label \(\ell _j\), denoted by \(f_j\), is of the form

$$\begin{aligned} f_{j}: \, \mathcal {X}\times \{0,1\}^{j-1} \longrightarrow [0,1]. \end{aligned}$$
(5)

The training data for this classifier is the set \(S_j=\{(\overline{\varvec{x}}_{1}, y_{1,j}), \ldots , (\overline{\varvec{x}}_{n},y_{n,j})\}\), where \(\overline{\varvec{x}}_{i}=(\varvec{x}_i, y_{i,1}, \ldots , y_{i,j-1})\); that is, the features are \(\varvec{x}_{i}\) supplemented by the relevance of the labels \(\ell _1, \ldots , \ell _{j-1}\) preceding \(\ell _j\) in the chain, and the target is the relevance of the label \(\ell _j\).
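To make the training stage concrete, the following is a minimal sketch of how the chain of probabilistic binary classifiers can be built with off-the-shelf logistic regression; the function name and the array-based representation of S are our own illustrative choices, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_pcc(X, Y):
    """Train one probabilistic binary classifier per label (Eq. 5).

    X: (n, p) array of instance descriptions.
    Y: (n, m) binary array; column j holds the relevance of label j+1.
    Returns the chain [f_1, ..., f_m], where f_j is trained on the set
    S_j of the text: x_i supplemented by the true values of the j-1
    preceding labels. Assumes every label column contains both classes.
    """
    n, m = Y.shape
    chain = []
    for j in range(m):
        Xj = np.hstack([X, Y[:, :j]])   # features plus preceding labels
        chain.append(LogisticRegression().fit(Xj, Y[:, j]))
    return chain
```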

In the testing stage of the methods based on learning a chain of classifiers, the goal is to perform inference for each instance, which consists of estimating the risk minimizer for a given loss function over the estimated entire joint conditional distribution. The idea revolves around repeatedly applying the general product rule of probability to the joint distribution of the labels \({\mathbf {Y}}=(\mathbf {Y_1}, \mathbf {Y_2},\ldots , \mathbf {Y_m})\), that is, computing

$$\begin{aligned} {\mathbf {P}}(\varvec{y}\, | \,\varvec{x})=\prod _{j=1}^m{\mathbf {P}}(y_j\, | \,\varvec{x}, y_1, \ldots , y_{j-1}). \end{aligned}$$
(6)

Before analyzing this issue in the next section, note that from a theoretical point of view this expression holds for any order of the labels. In practice, however, these methods are label-order dependent for several reasons. On the one hand, it is not possible to ensure that the models obtained in the training stage perfectly estimate the joint conditional probability \(P(\varvec{y}\, |\,\varvec{x})\). On the other hand, predicted values rather than true values are successively used in the testing stage. This is more serious if the largest errors occur at the beginning of the chain, since prediction errors are successively propagated along it (Montañés 2014; Senge et al. 2012a, b). In any case, in this paper we assume the order of the labels in the chain to be given, since the goal is to analyze the performance of the methods without taking into account the effect of different orders. Hence, we do not include any study of which order might be best.

Before going into the detailed description of the state-of-the-art inference approaches and of our proposal, note that the training phase is common to all of them; thus, the models \(f_j\) induced by the binary classifiers will be the same. In what follows, we will therefore focus just on the testing stage.

3 Inference in probabilistic classifier chains

First of all, in order to facilitate the explanation and analysis of the inference approaches in this section, let us view the different inference procedures as different manners of exploring a probability binary tree. In such a tree, the k-th node of level \(j<m\) with \(k\le 2^j\) is labeled by \(y_j^k=(v_1,v_2,\ldots , v_j)\) with \(v_i\in \{0,1\}\) for \(i=1, \ldots , j\). This node has two children, respectively labeled \(y_{j+1}^{2k-1}=(v_1,v_2,\ldots , v_j, 0)\) and \(y_{j+1}^{2k}=(v_1,v_2,\ldots , v_j,1)\), with marginal joint conditional probabilities \({\mathbf {P}}(y_1=v_1, \ldots , y_j=v_j, y_{j+1}=0\, | \,\varvec{x})\) and \({\mathbf {P}}(y_1=v_1, \ldots , y_j=v_j, y_{j+1}=1\, | \,\varvec{x})\). The weights of the edges between the parent and its children are respectively \({\mathbf {P}}(y_{j+1}=0\, | \,\varvec{x}, y_1=v_1, \ldots , y_j=v_j)\) and \({\mathbf {P}}(y_{j+1}=1\, | \,\varvec{x}, y_1=v_1, \ldots , y_j=v_j)\), which are respectively estimated by \(1- f_{j+1}(\varvec{x}, v_1, \ldots , v_j)\) and \(f_{j+1}(\varvec{x}, v_1, \ldots , v_j)\). The marginal joint conditional probability of each child is computed by the product rule of probability: \({\mathbf {P}}(y_1=v_1, \ldots , y_j=v_j, y_{j+1}=0\, | \,\varvec{x})={\mathbf {P}}(y_{j+1}=0\, | \,\varvec{x}, y_1=v_1, \ldots , y_j=v_j)\cdot {\mathbf {P}}(y_1=v_1, \ldots , y_j=v_j\, | \,\varvec{x})\), and analogously for \(y_{j+1}=1\). The root node is labeled by the empty set. Figure 1 illustrates this construction.

Fig. 1 A generic node and its children of the probability binary tree. The top part of each node contains the combination of labels and the bottom part the joint probability of that combination. The edges are labeled with the conditional probabilities

Several approaches have been proposed for inference in PCCs. The first method proposed was based on greedy search (GS), which is an integral part of the original CC method (Read et al. 2009). Its successor is exhaustive search (ES), as used in the PCC method (Dembczyński et al. 2010). The \(\epsilon \)-approximate (\(\epsilon \)-A) algorithm (Dembczynski et al. 2012) is a uniform-cost (UC) search algorithm that can output optimal predictions in terms of subset 0/1 loss and significantly reduces the computational cost of ES. A more recent approach based on beam search (BS) (Kumar et al. 2012, 2013) presents good behavior both in terms of performance and computational cost. Finally, Monte Carlo sampling (Dembczynski et al. 2012) is an appealing and simpler alternative (Dembczynski et al. 2012; Read et al. 2014) to overcome the high computational cost of ES.

This section deals with the particularities of these inference methods, except for the Monte Carlo sampling approaches. We have already studied the Monte Carlo approaches (Dembczynski et al. 2012; Read et al. 2014) and compared them with the \(\epsilon \)-A algorithm and BS techniques (Mena et al. 2015). The conclusions of that work were that, although they are well suited to the minimization of other losses, e.g., Hamming loss or the example-based F-measure, (i) they need to explore many more nodes to get close to the optimum, despite the fact that their performance can sometimes be better, (ii) their cost grows with the size of the sample drawn and (iii) they are quite slow even for small sample sizes. Hence, we do not consider them competitive methods in the present work, although they can sometimes be appealing.

3.1 Greedy search

At the testing stage, the GS strategy, originally called CC (Read et al. 2009, 2011), provides an output \(\hat{\varvec{y}}=(\hat{y}_1, \ldots , \hat{y}_m)\) for a new unlabeled instance \(\varvec{x}\) by successively querying each classifier \(f_j\), which estimates the conditional probability \({\mathbf {P}}(y_j\, | \,\varvec{x}, y_1, \ldots , y_{j-1})\). This means exploring just one node at each level j. Given that only the two children of the node explored at level j are considered, and that both children share the same parent, their marginal joint conditional probabilities differ only in the marginal conditional probability \({\mathbf {P}}(y_j\, | \,\varvec{x}, y_1, \ldots , y_{j-1})\). Thus, the path selected is that of the child with the highest marginal conditional probability \({\mathbf {P}}(y_j\, | \,\varvec{x}, y_1, \ldots , y_{j-1})\), and the prediction for an instance \(\varvec{x}\) is of the form

$$\begin{aligned} \hat{\varvec{y}} = ( \, f_1(\varvec{x}), \, f_2(\varvec{x}, f_1(\varvec{x})), \, f_3(\varvec{x}, f_1(\varvec{x}), f_2(\varvec{x}, f_1(\varvec{x}))),\ldots ). \end{aligned}$$
(7)

Figure 2a shows the path followed by an instance using this strategy. In this example, only the right node is explored at each level. The optimal solution, which ends at the sixth leaf, is not reached; the method instead ends at the last leaf.
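As an illustration, GS reduces to the following loop over the chain (a minimal sketch assuming a chain trained as in the sketch of Sect. 2.2; the function name is ours, and column 1 of predict_proba is taken to be the probability of class 1 for 0/1-encoded labels):

```python
import numpy as np

def greedy_search(chain, x):
    """GS/CC inference (Eq. 7): at each level, follow the child with the
    highest marginal conditional probability P(y_j | x, y_1..y_{j-1})."""
    y = []
    for fj in chain:
        xj = np.hstack([x, y]).reshape(1, -1)
        p1 = fj.predict_proba(xj)[0, 1]   # P(y_j = 1 | x, y_1..y_{j-1})
        y.append(1 if p1 >= 0.5 else 0)
    return y
```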

Fig. 2 An example of the paths followed by an instance using a greedy search (CC) and b exhaustive search (PCC). The dotted arrows show the path followed by the algorithm

Concerning the optimization of the subset 0/1 loss, a rigorous analysis (Dembczynski et al. 2012) establishes bounds on the performance of GS, showing that the method performs poorly for this loss even though it tends to optimize it.

3.2 Exhaustive search

Unlike GS, which explores only a single label combination, ES estimates the entire joint conditional distribution \({\mathbf {P}}(\cdot \, | \, \varvec{x})\) for a new unlabeled instance \(\varvec{x}\), thereby providing Bayes optimal inference. It explores all possible paths in the tree: it computes \({\mathbf {P}}(\varvec{y} \, | \, \varvec{x})\) and \(L(\varvec{y},\varvec{f}(\varvec{x}))\) for every label combination \(\varvec{y}= (y_{1}, y_{2}, \ldots , y_{m})\) and outputs the combination \(\hat{\varvec{y}}=(\hat{y}_1, \ldots , \hat{y}_m)=\varvec{f}^*(\varvec{x})\) with minimum risk for the given loss \(L(\cdot ,\cdot )\). By doing so, it generally improves performance, since it perfectly estimates the risk minimizer, albeit at a higher computational cost, as it comes down to summing over an exponential number (\(2^m\)) of label combinations for each query instance.

Figure 2b illustrates this approach where, by exploring all paths, the optimal solution is always reached.
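For completeness, a direct transcription of ES under the same chain representation as in the previous sketches might look as follows (the function name is ours); it scores all \(2^m\) combinations via Eq. (6), which is exactly what makes it intractable for large m:

```python
import itertools
import numpy as np

def exhaustive_search(chain, x):
    """ES sketch: score all 2^m label combinations via the product rule
    (Eq. 6) and return the mode of the estimated joint distribution,
    i.e. the risk minimizer for subset 0/1 loss."""
    m = len(chain)
    best_p, best_y = -1.0, None
    for y in itertools.product((0, 1), repeat=m):
        p = 1.0
        for j, fj in enumerate(chain):
            xj = np.hstack([x, y[:j]]).reshape(1, -1)
            p1 = fj.predict_proba(xj)[0, 1]
            p *= p1 if y[j] == 1 else 1.0 - p1
        if p > best_p:
            best_p, best_y = p, list(y)
    return best_y
```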

3.3 \(\epsilon \)-approximate algorithm

The \(\epsilon \)-approximate (\(\epsilon \)-A) algorithm (Dembczynski et al. 2012) arises as an alternative to the high computational cost of ES and the poor performance of GS. In terms of the probability tree defined above, it expands only the nodes whose marginal joint conditional probability exceeds the threshold \(\epsilon =2^{-k}\) with \(1\le k\le m\) (notice that \(\epsilon =0\) and all values of \(\epsilon \) between 0 and \(2^{-m}\) are in fact the same case as \(\epsilon =2^{-m}\)). For a node at level j, which deals with label \(\ell _j\), and an unlabeled instance \(\varvec{x}\), this marginal joint conditional probability is

$$\begin{aligned} {\mathbf {P}}(y_1, \ldots , y_j\, | \, \varvec{x})=\prod _{i=1}^j{\mathbf {P}}(y_i\, | \,\varvec{x}, y_1, \ldots , y_{i-1}), \end{aligned}$$
(8)

where \({\mathbf {P}}(y_i\, | \,\varvec{x}, y_1, \ldots , y_{i-1})\) is estimated by \(f_i(\varvec{x},y_1,\ldots , y_{i-1})\).

The nodes are expanded in the order established by this probability, calculating the marginal joint conditional probability of their children. Thus, the algorithm does not follow a single path; rather, it may switch from one path to another depending on the marginal joint conditional probabilities. In the end, two situations may arise: (i) the node expanded is a leaf or (ii) no remaining node exceeds the threshold. In the former case, the prediction for the unlabeled instance \(\varvec{x}\) is the combination \(\hat{\varvec{y}}=(\hat{y}_1, \ldots , \hat{y}_m)\) of the leaf reached (see Fig. 3a). Conversely, if situation (ii) takes place, GS is applied to the nodes whose children do not exceed the threshold, and the prediction \(\hat{\varvec{y}}=(\hat{y}_1, \ldots , \hat{y}_m)\) for the unlabeled instance \(\varvec{x}\) is the one with the highest entire joint conditional probability \({\mathbf {P}}(y_1, \ldots , y_m \, | \, \varvec{x})\) (see Fig. 3b, c).

The parameter \(\epsilon \) plays an important role in the algorithm. The particular case of \(\epsilon =0\) (or any value in the interval [0, \(2^{-m}\)], that is, \(k=m\)) is of special interest, since the algorithm then performs a UC search that always finds the optimal solution. Figure 3a illustrates this situation.

Conversely, the method approaches GS as \(\epsilon \) grows, becoming exactly GS in the case of \(\epsilon =0.5\) (or equivalently \(\epsilon =2^{-1}\), that is, \(k=1\)). This is so because in this case two situations are possible: (i) only one node has a marginal joint conditional probability greater than \(\epsilon \), in which case the algorithm follows one path, or (ii) no node has a marginal joint conditional probability greater than \(\epsilon \), in which case GS is applied from there to the bottom of the tree. Figure 3b shows an example of this particular case.

Notice that this method provides an optimal prediction if the entire joint conditional probability of the corresponding label combination is greater than \(\epsilon \). The interpretation of the method for a generic value of \(\epsilon =2^{-k}\) is that it guarantees reaching a partially optimal solution at least down to the k-th level of the tree. Figure 3c shows the particular case of \(\epsilon =0.25\).

Consequently, this algorithm estimates the risk minimizer for the subset 0/1 loss to a greater or lesser extent, depending on the value of \(\epsilon \). Moreover, a theoretical analysis of this estimation (Dembczynski et al. 2012) allows bounding its goodness as a function of the number of iterations, which in turn depends on \(\epsilon \).
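The following sketch captures the behavior just described, namely best-first expansion above the threshold with a greedy completion of the surviving prefixes otherwise; it is our own illustrative rendering under the chain representation of the previous sketches, not the authors' implementation:

```python
import heapq
import numpy as np

def p_true(fj, x, prefix):
    """P(y_j = 1 | x, y_1..y_{j-1}) from a fitted probabilistic classifier."""
    return fj.predict_proba(np.hstack([x, prefix]).reshape(1, -1))[0, 1]

def eps_approximate(chain, x, eps):
    """epsilon-A sketch: best-first expansion of nodes whose marginal joint
    probability g exceeds eps; prefixes whose children fall below the
    threshold are completed greedily and the best completion wins."""
    m = len(chain)
    heap = [(-1.0, [])]                     # (-g, label prefix)
    fallback = []
    while heap:
        neg_g, prefix = heapq.heappop(heap)
        g = -neg_g
        if len(prefix) == m:
            return prefix                   # case (i): a leaf was expanded
        p1 = p_true(chain[len(prefix)], x, prefix)
        kept = False
        for v, pv in ((1, p1), (0, 1.0 - p1)):
            if g * pv > eps:
                heapq.heappush(heap, (-(g * pv), prefix + [v]))
                kept = True
        if not kept:
            fallback.append((g, prefix))
    best_g, best_y = -1.0, None             # case (ii): greedy completions
    for g, y in fallback:
        y = list(y)
        while len(y) < m:
            p1 = p_true(chain[len(y)], x, y)
            g *= max(p1, 1.0 - p1)
            y.append(1 if p1 >= 0.5 else 0)
        if g > best_g:
            best_g, best_y = g, y
    return best_y
```

With eps=0 this reduces to UC search, and with eps=0.5 it reproduces GS, matching the two particular cases discussed above.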

Fig. 3 Several examples of the paths followed by the \(\epsilon \)-A algorithm for different values of \(\epsilon \). Nodes marked with a cross have a marginal joint conditional probability lower than \(\epsilon \) and, hence, are not explored any further. The dotted arrows show the path followed by the algorithm. The solid arrows indicate the path followed when the marginal joint conditional probability does not exceed \(\epsilon \) and, hence, GS is applied from that node to the bottom of the tree. a \(\epsilon \)-A algorithm with \(\epsilon =0\) (\(k=m\)), b \(\epsilon \)-A algorithm with \(\epsilon =0.5\) (\(k=1\)) and c \(\epsilon \)-A algorithm with \(\epsilon =0.25\) (\(k=2\))

Fig. 4 An example of the paths followed by an instance using beam search (BS) with \(b=2\). Nodes marked with a cross do not have one of the highest marginal joint conditional probabilities and, hence, are not explored any further. The dotted arrows show the path followed by the algorithm

3.4 Beam search

Beam search (BS) (Kumar et al. 2012, 2013) also explores more than one path in the probability tree. The method includes a parameter b, called the beam width, that limits the number of label combinations explored. The idea is to keep b candidate sets of labels at each level of the tree. Hence, depending on this value, a certain number of the top levels are exhaustively explored, particularly a total of \(k^*-1\) levels, \(k^*\) being the lowest integer such that \(b<2^{k^*}\). Then, only b possibilities are explored at each of the remaining levels. The combinations explored from level \(k^*\) to the bottom are those with the highest marginal joint conditional probability seen thus far. For a node at level j and an unlabeled instance \(\varvec{x}\), this marginal joint conditional probability is the same as for the \(\epsilon \)-A algorithm. Hence, such a probability is

$$\begin{aligned} {\mathbf {P}}(y_1, \ldots , y_j\, | \, \varvec{x})=\prod _{i=1}^j{\mathbf {P}}(y_i\, | \,\varvec{x}, y_1, \ldots , y_{i-1}), \end{aligned}$$
(9)

where \({\mathbf {P}}(y_i\, | \,\varvec{x}, y_1, \ldots , y_{i-1})\) is estimated by \(f_i(\varvec{x},y_1,\ldots , y_{i-1})\).

In the end, the algorithm outputs \(\hat{\varvec{y}}=(\hat{y}_1, \ldots , {\hat{y}}_m)\) with the highest entire joint conditional probability \({\mathbf {P}}(y_1, \ldots , y_m \, | \, \varvec{x})\).

BS differs from GS in that (i) BS explores more than one combination, whereas GS explores just one, and (ii) the probability used to decide which path to follow in the tree differs between the two methods. Concerning (ii), both use the marginal joint conditional probability \({\mathbf {P}}(y_1, \ldots , y_j | \varvec{x})\), but in the case of GS this is equivalent to using the marginal conditional probability \({\mathbf {P}}(y_j \, | \,\varvec{x}, y_1, \ldots , y_{j-1})\), since the two nodes explored at each level of the tree have the same parent, as explained before. In the case of BS, however, the b nodes explored need not have the same parent, even for \(b=2\); and if \(b>2\), it is impossible for all of them to share the same parent in a binary tree.

Let us consider the case of \(b=1\). In this case, BS expands just one node at each level, the one with the highest marginal joint conditional probability, which coincides with the highest marginal conditional probability; so BS with \(b=1\) follows exactly the path followed by GS. Also, if \(b=2^m\), BS performs an ES. Hence, BS encapsulates both GS and ES by taking \(b=1\) and \(b=2^m\), respectively. This makes it possible to control the trade-off between computational cost and performance of the method by tuning b between 1 and \(2^m\).

As a final remark, the fact that BS considers marginal joint conditional probabilities makes the method tend to estimate the risk minimizer for the subset 0/1 loss. Kumar et al. (2012, 2013) do not include any theoretical analysis of the goodness of this estimation, but they empirically show that for moderate values of b (\(b<15\)), the risk minimizer provided by the method converges to the one obtained using ES.

Figure 4 shows an example of the paths explored by the BS algorithm when \(b=2\).
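A minimal sketch of BS under the same chain representation as in the previous sketches (the function name is ours):

```python
import numpy as np

def beam_search(chain, x, b):
    """BS sketch: keep, at each level, the b label prefixes with the
    highest marginal joint conditional probability (Eq. 9)."""
    beam = [(1.0, [])]                      # (g, prefix)
    for fj in chain:
        candidates = []
        for g, prefix in beam:
            xj = np.hstack([x, prefix]).reshape(1, -1)
            p1 = fj.predict_proba(xj)[0, 1]
            candidates.append((g * p1, prefix + [1]))
            candidates.append((g * (1.0 - p1), prefix + [0]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:b]               # prune to the beam width
    return max(beam, key=lambda c: c[0])[1]
```

Setting b=1 reproduces GS and b=2**m reproduces ES, in line with the encapsulation property noted above.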

4 A* algorithm for inference in PCC

The A* algorithm is the most widely known form of best-first search (Pearl 1984), in which the best node according to an evaluation function e is expanded at each iteration. The particularity of A* is that e can be seen as a function E of two other functions g and h, \(e(k)=E(g(k),h(k))\), where g(k) evaluates the cost of reaching node k from the root and h(k) estimates the cost of reaching a solution (a leaf) from k. Hence, e(k) evaluates the total cost of reaching a solution from the root through k. In general, it is possible to compute the known information (g) exactly, but the unknown information must be estimated through a heuristic (h). To obtain an optimal solution, h must not overestimate the actual cost of reaching a solution, that is, it must be an admissible heuristic. Such heuristics are optimistic, because they never estimate the cost of reaching a solution to be more than it actually is. Moreover, the A* algorithm is optimally efficient for any given heuristic, because no other optimal algorithm using the same heuristic is guaranteed to expand fewer nodes than A*.

4.1 Building an admissible heuristic

In order to adapt A* for inference in PCCs, we must take into account that we deal with probabilities instead of costs. So, (1) A* must select the node with the highest estimated probability, (2) h must not underestimate the probability from the node to a leaf, that is, h must be an admissible heuristic, and (3) E must be the product function, \(e=g \cdot h\). Under these conditions, e provides an estimate of the entire joint conditional probability \({\mathbf {P}}(y_1, \ldots , y_m \, | \, \varvec{x})\) for optimizing the subset 0/1 loss. In order to derive g and h, note that the product rule of probability applied to the joint distribution of the labels \({\mathbf {Y}}=(\mathbf {Y_1}, \mathbf {Y_2},\ldots , \mathbf {Y_m})\) can be rewritten as

$$\begin{aligned} {\mathbf {P}}(y_1, \ldots , y_m \, | \,\varvec{x})= & {} {\mathbf {P}}(y_1, \ldots , y_j\, | \,\varvec{x}) \nonumber \\&\times \, {\mathbf {P}}(y_{j+1}, \ldots , y_m \, | \,\varvec{x}, y_1, \ldots , y_j). \end{aligned}$$
(10)

Hence, for a node at level j, let us take g to be the marginal joint conditional probability \({\mathbf {P}}(y_1, \ldots , y_j \, | \, \varvec{x})\) and h a heuristic that does not underestimate the marginal joint conditional probability \({\mathbf {P}}(y_{j+1}, \ldots , y_m \, | \,\varvec{x}, y_1, \ldots , y_j)\). Remember that the values of \(y_1,\ldots , y_j\) are known at level j.

Before discussing the heuristic h proposed here, notice that g is the same marginal joint conditional probability that both the \(\epsilon \)-A algorithm and BS compute to select the nodes to be expanded. Moreover, \(\epsilon \)-A with \(\epsilon =0\) is not only equivalent to UC search, but also to A* with the constant heuristic \(h=1\). This heuristic is admissible too, since no probability is greater than 1. On the other hand, it is also the worst admissible heuristic, since any other admissible heuristic dominates it, and consequently the A* algorithm using such a heuristic will never expand more nodes than the A* algorithm using \(h=1\).

Let us now discuss the heuristic proposed in this paper. Since our heuristic must not underestimate the marginal joint conditional probability \({\mathbf {P}}(y_{j+1}, \ldots , y_m \, | \,\varvec{x}, y_1, \ldots , y_j)\) in order to be admissible, it is quite straightforward to take the maximum value of this probability, obtaining the optimal heuristic \(h^*\):

$$\begin{aligned} h^*=&\max _{(y_{j+1}, \ldots , y_{m})\in \{0,1\}^{m-j}} \ \ {\mathbf {P}}(y_{j+1}, \ldots , y_m \, | \,\varvec{x}, y_1, \ldots , y_j)\nonumber \\ =&\max _{\mathop {(y_{j+1}, \ldots , y_{m})}\limits _{\in \{0,1\}^{m-j}}}\prod _{i=j+1}^m{\mathbf {P}}(y_{i} \, | \,\varvec{x}, y_1, \ldots , y_j, y_{j+1}, \ldots , y_{i-1}). \end{aligned}$$
(11)

However, obtaining this maximum amounts to applying an ES over the set of labels \(\{\ell _{j+1}, \ldots , \ell _m\}\). Hence, this optimal heuristic is not computationally applicable, and we need a tractable heuristic in exchange for renouncing optimality. This leads to designing a heuristic \(\hat{h}\) that is also admissible, but less dominant than \(h^*\) (\(h^*\prec \hat{h}\)). For this purpose, let \((\overline{y}_{j+1}, \ldots , \overline{y}_m)\) be the values that define \(h^*\), that is

$$\begin{aligned} h^*=\prod _{i=j+1}^m{\mathbf {P}}(y_{i} \, | \,\varvec{x}, y_1, \ldots , y_j, \overline{y}_{j+1}, \ldots , \overline{y}_{i-1}), \end{aligned}$$
(12)

and notice that the values \((\overline{y}_{j+1}, \ldots , \overline{y}_m)\) that maximize the product do not have to maximize each individual term; then

$$\begin{aligned}&{\mathbf {P}}(y_{i} \, | \,\varvec{x}, y_1, \ldots , y_j, \overline{y}_{j+1}, \ldots , \overline{y}_{i-1}) \nonumber \\&\quad \le \max _{\mathop {(y_{j+1}, \ldots , y_{i-1})}\limits _{\in \{0,1\}^{i-1-j}}} {\mathbf {P}}(y_{i} \, | \,\varvec{x}, y_1, \ldots , y_j, y_{j+1}, \ldots , y_{i-1}). \end{aligned}$$
(13)

Hence, by defining \(\hat{h}\) as

$$\begin{aligned} \hat{h}=\prod _{i=j+1}^m \max _{\mathop {(y_{j+1}, \ldots , y_{i-1})}\limits _{\ \in \{0,1\}^{i-1-j}}} {\mathbf {P}}(y_i\, | \,\varvec{x}, y_1, \ldots , y_j, y_{j+1}, \ldots , y_{i-1}). \end{aligned}$$
(14)

It is easy to deduce that \(\hat{h}\) is admissible and less dominant than \(h^*\) (\(h^* \prec \ \hat{h})\). But again, \(\hat{h}\) is not computationally applicable in general. However, this is no longer the case if we restrict ourselves to linear models, for instance using logistic regression. Remember that \({\mathbf {P}}(y_i\, | \,\varvec{x}, y_1, \ldots , y_{i-1})\) is estimated through the model \(f_i(\varvec{x}, y_1, \ldots , y_{i-1})\), using a sigmoid function to transform the output of \(f_i(\varvec{x}, y_1, \ldots , y_{i-1})\) into a probability. Particularly,

$$\begin{aligned} {\mathbf {P}}(y_i=1\, | \,\varvec{x}, y_1, \ldots , y_{i-1})= & {} \frac{1}{1+\exp (-f_i(\varvec{x}, y_1, \ldots , y_{i-1}))} \end{aligned}$$
(15)
$$\begin{aligned} {\mathbf {P}}(y_i=0\, | \,\varvec{x}, y_1, \ldots , y_{i-1})= & {} 1-\frac{1}{1+\exp (-f_i(\varvec{x}, y_1, \ldots , y_{i-1}))} . \end{aligned}$$
(16)

In order to obtain the maximum between \({\mathbf {P}}(y_i=1\, | \,\varvec{x}, y_1, \ldots , y_{i-1})\) and \({\mathbf {P}}(y_i=0\, | \,\varvec{x}, y_1, \ldots , y_{i-1})\), we first need the maximum of each term on its own. According to the above expressions of both probabilities, \({\mathbf {P}}(y_i=1\, | \,\varvec{x}, y_1, \ldots , y_{i-1})\) is maximal when \(f_i(\varvec{x}, y_1, \ldots , y_{i-1})\) is maximal, and analogously \({\mathbf {P}}(y_i=0\, | \,\varvec{x}, y_1, \ldots , y_{i-1})\) is maximal when \(f_i(\varvec{x}, y_1, \ldots , y_{i-1})\) is minimal. Hence, let us focus on obtaining the maximum and the minimum of \(f_i(\varvec{x}, y_1, \ldots , y_{i-1})\).

From now on, let us assume that \(f_i\) is a linear model; then \(f_i\) adopts the following form

$$\begin{aligned} f_i(\varvec{x}, y_1, \ldots , y_{i-1})=\langle \varvec{w}_{x}^i , \varvec{x} \rangle + \langle \varvec{w}_{y}^i, (y_1, \ldots , y_{i-1}) \rangle +\beta ^i, \end{aligned}$$
(17)

Splitting the second term into the known part (from \(\ell _1\) to \(\ell _j\)) and the unknown part (from \(\ell _{j+1}\) to \(\ell _{i-1}\)) leads to

$$\begin{aligned} f_i(\varvec{x}, y_1, \ldots , y_j, y_{j+1}, \ldots , y_{i-1})=\langle \varvec{w}_{x}^i , \varvec{x}\rangle +\sum _{k=1}^{j}w_{y,k}^iy_k+\sum _{k=j+1}^{i-1}w_{y,k}^iy_k+\beta ^i. \end{aligned}$$
(18)

Since \(\varvec{x}\) is given and \(y_{1},\ldots ,y_{j}\) are fixed, the second summation contains the variables over which the maximum and the minimum must be obtained. Let \(C_i\) be the part of \(f_i\) that is constant with regard to this optimization, that is,

$$\begin{aligned} C_i(\varvec{x}, y_1, \ldots , y_j)=\langle \varvec{w}_{x}^i , \varvec{x}\rangle +\sum _{k=1}^{j}w_{y,k}^iy_k+\beta ^i. \end{aligned}$$
(19)

Then \(f_i\) can be rewritten as

$$\begin{aligned} f_i(\varvec{x}, y_1, \ldots , y_j, y_{j+1}, \ldots , y_{i-1})=C_i(\varvec{x}, y_1, \ldots , y_j)+\sum _{k=j+1}^{i-1}w_{y,k}^iy_k. \end{aligned}$$
(20)

Consequently, the function whose maximum must be obtained is

$$\begin{aligned} \sum _{k=j+1}^{i-1}w_{y,k}^iy_k=w_{y,j+1}^iy_{j+1}+\ldots +w_{y,i-1}^iy_{i-1}. \end{aligned}$$
(21)

Let us now denote by \(K_{i,j}^+\) and \(K_{i,j}^-\) the sets of indices of the non-negative and negative coefficients \(w_{y,k}^i\) with \(j+1\le k \le i-1\), that is,

$$\begin{aligned} \begin{array}{l} K_{i,j}^+=\{k \ | \ j+1\le k \le i-1, \ w_{y,k}^i\ge 0\}\\ K_{i,j}^-=\{k \ | \ j+1\le k \le i-1, \ w_{y,k}^i<0\}.\\ \end{array} \end{aligned}$$
(22)

Hence, (21) is maximum when \(y_k\) for \(j+1\le k\le i-1\) are

$$\begin{aligned} y_k=\left\{ \begin{array}{ll} 1 &{} \quad \text{ if } k\in K_{i,j}^+ \\ 0 &{} \quad \text{ if } k\in K_{i,j}^- \\ \end{array} \right. \end{aligned}$$
(23)

and (21) is minimum when \(y_k\) for \(j+1\le k\le i-1\) are

$$\begin{aligned} y_k=\left\{ \begin{array}{ll} 1 &{} \quad \text{ if } k\in K_{i,j}^- \\ 0 &{} \quad \text{ if } k\in K_{i,j}^+ \\ \end{array} \right. \end{aligned}$$
(24)

Hence,

(i) Let \(y_{j+1}^{i,1}, \ldots , y_{i-1}^{i,1}\in \{0,1\}\) be the values that maximize (21), that is, the values that maximize \(f_i(\varvec{x}, y_1, \ldots , y_j, y_{j+1},\ldots ,y_{i-1})\) and hence make \({\mathbf {P}}(y_i=1\, | \,\varvec{x}, y_1, \ldots , y_j, y_{j+1}, \ldots , y_{i-1})\) maximal, and

(ii) let \(y_{j+1}^{i,0}, \ldots , y_{i-1}^{i,0}\in \{0,1\}\) be the values that minimize (21), that is, the values that minimize \(f_i(\varvec{x}, y_1, \ldots , y_j, y_{j+1}, \ldots , y_{i-1})\) and hence make \({\mathbf {P}}(y_i=0\, | \,\varvec{x}, y_1, \ldots , y_j, y_{j+1}, \ldots , y_{i-1})\) maximal.

Hence, \(\max _{\mathop {(y_{j+1}, \ldots , y_{i-1}) }\limits _{\in \{0,1\}^{i-1-j}}} {\mathbf {P}}(y_i\, | \,\varvec{x}, y_1, \ldots , y_j, y_{j+1}, \ldots , y_{i-1})\) will be

$$\begin{aligned} \max _{v \in \{0,1\}}\{{\mathbf {P}}(y_i=v\, | \,\varvec{x}, y_1, \ldots , y_j, y_{j+1}^{i,v}, \ldots , y_{i-1}^{i,v}) \}. \end{aligned}$$
(25)

Notice that \(y_k^{i,1}=1-y_k^{i,0}\) for \(j+1\le k\le i-1\) and that, according to the above definition of the sigmoid function, if \(f_i(\varvec{x}, y_1, \ldots , y_j, y_{j+1}^{i,1}, \ldots , y_{i-1}^{i,1})\ge -f_i(\varvec{x}, y_1, \ldots , y_j, y_{j+1}^{i,0}, \ldots , y_{i-1}^{i,0})\), then the maximum of \({\mathbf {P}}(y_i=v\, | \,\varvec{x}, y_1, \ldots , y_j, y_{j+1}^{i,v}, \ldots , y_{i-1}^{i,v})\) is reached at \(v=1\), and otherwise at \(v=0\).

Therefore, the final expression for the heuristic will be

$$\begin{aligned} \hat{h}=\prod _{i=j+1}^m \max _{v \in \{0,1\}}\{{\mathbf {P}}(y_i=v\, | \,\varvec{x}, y_1, \ldots , y_j, y_{j+1}^{i,v}, \ldots , y_{i-1}^{i,v}) \}. \end{aligned}$$
(26)

Remember that \(h^*\) and \(\hat{h}\) differ only in the values of \(y_{j+1}, \ldots , y_{i-1}\). In the former, the same values are shared by all the factors of the product, whereas in the latter these values depend on each term i of the product, and hence can differ, which prevents \(\hat{h}\) from being an optimal heuristic. On the other hand, the cost of computing \(\hat{h}\) is polynomial, unlike that of \(h^*\), which is exponential. Figure 5 exemplifies \(\hat{h}\). Let us focus on the root node of the subtree in Fig. 5. The marginal joint conditional probability up to this node is \(g=0.6\). For computing \(\hat{h}\), we first evaluate the maximum marginal conditional probabilities \(P(y_2 \, | \,\varvec{x}, y_1=1)\) provided by \(f_2(\varvec{x}, y_1)\) and \(P(y_3 \, | \,\varvec{x}, y_1=1, y_2)\) provided by \(f_3(\varvec{x},y_1,y_2)\), which are 0.6 and 1.0, respectively. Then we take their product, applying the product rule of probability to estimate the marginal joint conditional probability from that node to a leaf. Notice that the maximum at each level does not correspond to the same branch of the tree: the maximum in the first level corresponds to \(y_2=1\), whereas the maximum in the second level is obtained by \(P(y_3 \, | \,\varvec{x}, y_1=1, y_2=0)\) when \(y_3=1\), fixing \(y_2=0\). This is what makes the heuristic \(\hat{h}\) differ from the optimal \(h^*\).
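The computation of \(\hat{h}\) for linear models, following Eqs. (19)-(26), can be sketched as follows. The (w_x, w_y, beta) tuple representation of the linear models \(f_i\) is an illustrative assumption (for a fitted scikit-learn LogisticRegression such tuples can be read off coef_ and intercept_); the sketch also anticipates the depth parameter d introduced in Sect. 4.2 below, with d=None playing the role of the full heuristic:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def h_hat(models, x, prefix, d=None):
    """Admissible heuristic of Eq. (26) for linear chain models.

    models[i] = (w_x, w_y, beta) for the linear score of label i+1,
    where w_y holds the weights of the i preceding labels.
    prefix holds the known labels y_1..y_j; d truncates the product to
    d factors (the h^d of Sect. 4.2; d=None corresponds to h^inf).
    """
    j, m = len(prefix), len(models)
    last = m if d is None else min(m, j + d)
    prefix = np.asarray(prefix, dtype=float)
    h = 1.0
    for i in range(j, last):                  # factors for labels j+1..last
        w_x, w_y, beta = models[i]
        C = w_x @ x + w_y[:j] @ prefix + beta          # known part, Eq. (19)
        w_free = w_y[j:i]                              # unknown labels' weights
        s_max = C + w_free[w_free > 0].sum()           # y_k = 1 on K+, Eq. (23)
        s_min = C + w_free[w_free < 0].sum()           # y_k = 1 on K-, Eq. (24)
        h *= max(sigmoid(s_max), 1.0 - sigmoid(s_min))  # Eq. (25)
    return h
```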

Fig. 5 An example of the computation of the heuristic \(\hat{h}\)

Fig. 6 An example of A* using the heuristic \(\hat{h}\). The dotted arrows show the path followed by the algorithm. The values of g are given inside each node

Figure 6 shows the path followed by A* using \(\hat{h}\). It reaches the optimal leaf, as theoretically expected. Comparing this graph with the one in Fig. 3a, which illustrates \(\epsilon \)-A with \(\epsilon =0\), and taking into account the dominance properties of the heuristics, \(\epsilon \)-A with \(\epsilon =0\) (equivalent to A* using \(h=1\), or UC search) explores more nodes than A* using \(\hat{h}\).

As a final remark, the A* algorithm using \(\hat{h}\) perfectly estimates the risk minimizer for the subset 0/1 loss, just as \(\epsilon \)-A with \(\epsilon =0\) and ES do. Moreover, A* with \(\hat{h}\) expands the same number of nodes or fewer, since \(\hat{h}\) dominates the heuristic \(h=1\) (\(\hat{h}\prec h=1\)). Obviously, computing \(\hat{h}\) is more costly than computing \(h=1\), that is, than applying just a UC search. The question is then whether this additional computing time is compensated by the theoretical guarantee of expanding fewer nodes.

4.2 A general admissible heuristic for trading off the number of nodes explored and the computing time

The previous section pointed out that both \(\hat{h}\) and \(h=1\) are admissible heuristics, and that using the former theoretically guarantees that the A* algorithm explores fewer nodes than using the latter. But the cost of computing \(\hat{h}\) can be high in comparison to just considering the constant heuristic \(h=1\). This trade-off motivates the introduction of a parameter d that limits the depth of the heuristic. Hence, for a node at level j and a value of d with \(0\le d \le m-j\), only the terms \({\mathbf {P}}(y_i\, | \,\varvec{x}, y_1, \ldots , y_j, y_{j+1}^{i}, \ldots , y_{i-1}^{i})\) of \(\hat{h}\) corresponding to levels \(j+1\) to \(j+d\) are evaluated using \(f_i\), whereas the terms for levels \(j+d+1\) to m are replaced by the constant 1, that is,

$$\begin{aligned} \hat{h}^d=\prod _{i=j+1}^{j+d} {\mathbf {P}}(y_i\, | \,\varvec{x}, y_1, \ldots , y_j, y_{j+1}^{i}, \ldots , y_{i-1}^{i}) \cdot \prod _{i=j+d+1}^m 1, \end{aligned}$$
(27)

or equivalently

$$\begin{aligned} \hat{h}^d=\prod _{i=j+1}^{j+d} {\mathbf {P}}(y_i\, | \,\varvec{x}, y_1, \ldots , y_j, y_{j+1}^{i}, \ldots , y_{i-1}^{i}). \end{aligned}$$
(28)

It is clear that \(\hat{h}\) dominates \(\hat{h}^d\) (\(\hat{h}\prec \hat{h}^d\)) and that \(\hat{h}^d\) remains admissible. In turn, \(\hat{h}^d\) dominates \(h=1\) (\(\hat{h}^d\prec h=1\)). Hence, using the heuristic \(\hat{h}^d\) with \(0\le d \le m-j\) and \(1\le j \le m-1\), the A* algorithm is expected to explore at least as many nodes as with \(\hat{h}\), but at most as many as with \(h=1\). In fact, \(\hat{h}^d\) encapsulates both heuristics, which are obtained at the extreme values of d. On the one hand, taking the maximum value of d leads to \(\hat{h}^{m-j}\) with \(1\le j \le m-1\) (we will denote this heuristic by \(\hat{h}^\infty \)), which is in fact the heuristic \(\hat{h}\) detailed in Sect. 4.1. On the other hand, taking the minimum value of d leads to \(\hat{h}^0\), which is indeed the heuristic \(h=1\). In general, \(\hat{h}^{d_1}\) dominates \(\hat{h}^{d_2}\) if \(d_1>d_2\). However, the computational time of evaluating \(\hat{h}^d\) increases with d. Hence, by tuning d adequately, one can strike a balance between the number of nodes explored and the time spent evaluating the heuristic.

The case of \(d=1\) is of special interest, since the resulting heuristic is also valid for non-linear models \(f_j\). The form of the heuristic is

$$\begin{aligned} \hat{h}^1=\prod _{i=j+1}^{j+1} {\mathbf {P}}(y_i\, | \,\varvec{x}, y_1, \ldots , y_j, y_{j+1}^{i}, \ldots , y_{i-1}^{i}) \cdot \prod _{i=j+2}^m 1, \end{aligned}$$
(29)

which simplifies to

$$\begin{aligned} \hat{h}^1= {\mathbf {P}}(y_{j+1}\, | \,\varvec{x}, y_1, \ldots , y_j). \end{aligned}$$
(30)

As can be seen, the heuristic \(\hat{h}^1\) just involves evaluating the model \(f_{j+1}\) at \((\varvec{x}, y_1, \ldots , y_j)\), that is, at the known label values, hence without imposing any restriction on the linearity of the models.

Figure 7 shows an example of the path followed by the A* algorithm when \(\hat{h}^1\) is used as heuristic. As can be seen, the A* algorithm explores more nodes using this heuristic than using \(\hat{h}\), but the time spent evaluating the heuristic is considerably reduced, as we will show later in the experiments.

Fig. 7 An example of A* using the heuristic \(\hat{h}^1\). The dotted arrows show the path followed by the algorithm. The values of g are given inside each node

4.3 Implementation details

Here we significantly extend the results of A* using \(\hat{h}\) presented in the preliminary work (Mena et al. 2015). The results differ, however, because we have improved the implementation of the algorithms to make them more efficient. In fact, the time results reported there, when contrasted with the number of nodes explored, were what made us reconsider the implementation of the algorithms. Algorithm 1 shows the pseudocode of A* using \(\hat{h}^d\); a compact functional sketch is also given after the list below.

Algorithm 1 Pseudocode of A* using \(\hat{h}^d\)

In particular, the differences between this version and that of the preliminary work (Mena et al. 2015) are that now:

1. The models \(f_i\) (parameters \(\varvec{w}_{x}^i\), \(\varvec{w}_{y}^i\) and \(\beta ^i\)) are ordered according to the label order of the chain.

2. All repeated computations are performed just once and stored:

   (a) The evaluation of \(\langle \varvec{w}_x^i , \varvec{x} \rangle \) for each linear model \(f_{i}\), which is common to all nodes corresponding to \(f_{i}\), is computed once and saved in the variable WX (line 2).

   (b) The sets \(K^{+}\) and \(K^{-}\), that is, the best values of \(y_{k}\) for the unknown part of the heuristic, are computed only once, before starting the main loop of A\(^{*}\) (lines 3–8).

3. The open set Q stores tuples of four elements: label combination, level, e and g. Q is kept in a resizable vector whose positions are reused; the left child is stored in the position of its parent (line 21). Notice that parent information is not required to obtain the final solution, since Q contains all the required information for each node.

4. The design of Q allows us to optimize the function Max, the operation that obtains the best node to expand. Max only needs to evaluate the first elements (\(1\ldots last\)) of Q (line 13), because the rest of the vector is unused.

5. Algorithm 1 contains several auxiliary functions to make the pseudocode shorter and easier to understand. However, in the actual program all these functions were coded inline, which makes the code faster by avoiding calls to functions with large parameters, like Heuristic.
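As announced above, the following compact functional sketch may help fix ideas; it deliberately omits the optimizations just listed (cached WX, reused vector positions, inlining) and assumes the linear-model representation of the h_hat sketch in Sect. 4.1:

```python
import heapq
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def a_star_pcc(models, x, heuristic):
    """A* inference sketch for PCC under subset 0/1 loss.

    models: linear chain models as (w_x, w_y, beta) tuples, as in the
    h_hat sketch of Sect. 4.1. heuristic(models, x, prefix) is any
    admissible h, e.g. h_hat with some depth d, or (lambda *a: 1.0)
    for the constant h = 1 (i.e. uniform-cost search).
    """
    m = len(models)
    open_set = [(-1.0, 1.0, [])]     # (-e, g, prefix); negate for max-heap
    while open_set:
        _, g, prefix = heapq.heappop(open_set)
        if len(prefix) == m:
            return prefix            # admissible h => this leaf is the mode
        w_x, w_y, beta = models[len(prefix)]
        p1 = sigmoid(w_x @ x + w_y @ np.asarray(prefix, dtype=float) + beta)
        for v, pv in ((1, p1), (0, 1.0 - p1)):
            child = prefix + [v]
            e = g * pv * heuristic(models, x, child)   # e = g * h
            heapq.heappush(open_set, (-e, g * pv, child))
    return None
```

For instance, `a_star_pcc(models, x, lambda mo, xx, pr: h_hat(mo, xx, pr, d=1))` would run A*(\(\hat{h}^1\)).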

4.4 Complexity analysis

Despite the theoretical optimality properties of the A* algorithm, it might not be useful for some problems, because it may explore a number of nodes that is exponential in the depth of the tree [see Pearl (1984) and Russell and Norvig (2003)]. Only under certain conditions on the heuristic is it possible to obtain an algorithm with a theoretical complexity less than exponential. This is the case if one is able to prove that the heuristic h satisfies the following condition at any level j with regard to the optimal heuristic \(h^*\) (Russell and Norvig 2003)

$$\begin{aligned} |h^*(j)-h(j)|\le \mathcal {O} (\log h^*(j)). \end{aligned}$$
(31)

In general, this error is at least linear with respect to the path cost, hence leading to exponential growth. However, by taking care to design good heuristics, one can obtain large gains relative to providing no heuristic information at all. That is the case with our heuristic \(\hat{h}^d\) with regard to the heuristic \(h=1\), which provides no heuristic information despite already being an improvement over the ES.

In what follows, let us focus on the heuristic \(\hat{h}^\infty \), since it is the most dominant among the heuristics proposed. According to the results where the noise increases (see Figs. 9, 10 of Sect. 5.3), \(\hat{h}^\infty \) does not clearly exhibit the exponential growth that the rest of the algorithms show, at least for the datasets considered. This behavior of \(\hat{h}^\infty \) motivates a deeper analysis of other situations. In this sense, let us consider a theoretical case where the difference between \(\hat{h}^\infty \) and \(h^*\) is maximal. For instance, this situation occurs (see Fig. 8) when the first level has probability 0 in the left branch and 1 in the right branch, all the probabilities of the left subtree are 0 (for the left branches) and 1 (for the right branches), and all the probabilities of the right subtree are 1/2 (for all the branches). In this case, the error can be bounded as follows for any level j

$$\begin{aligned} |h^*(j)-\hat{h}^\infty (j)|\le |1/2^{j-1}-1|. \end{aligned}$$
(32)
Fig. 8 An example of a worst case for the heuristic \(\hat{h}\)

This bound tends to 1 as the level j grows; hence, in this case, it is not possible to guarantee that the theoretical complexity is less than exponential. However, such extreme cases are unlikely to occur. Remember that the probabilities are computed from the models, in which the description \(\varvec{x}\) of the examples plays an important role. This description is the same when obtaining the probabilities of all the nodes of the same level, so these probabilities will probably not differ from each other as much as in the extreme case considered.

Fig. 9 Number of nodes expanded and computational time employed (ms) by the A* algorithm with the heuristic \(\hat{h}\) for different values of the parameter d (\(1, 2, 3, \infty \)) and by the \(\epsilon \)-A with \(\epsilon =0\), for different percentages of noise, on datasets with few labels (from 5 to 7). a Image (5), b emotions (6), c scene (6), d flags (7) and e reuters (7)

Fig. 10 Number of nodes expanded and computational time employed (ms) by the A* algorithm with the heuristic \(\hat{h}\) for different values of the parameter d (\(1, 2, 3, \infty \)) and by the \(\epsilon \)-A with \(\epsilon =0\), for different percentages of noise, on datasets with more labels (from 14 to 159). a Yeast (14), b slashdot (22), c medical (45), d enron (53), e mediamill (101) and f bibtex (159)

5 Experiments

This section presents the experiments carried out with all the approaches. Before discussing the experimental results in Sects. 5.2 and 5.3, let us describe in Sect. 5.1 the common settings of all experiments: datasets, learning algorithms and parameter selection procedures. Section 5.2 reports the results over the benchmark datasets, whereas Sect. 5.3 reports the results of the methods on more complex problems, namely those that include noise.

5.1 Settings

The experiments were performed over several benchmark multi-label datasets whose main properties are shown in Table 1. As can be seen, there are significant differences in the number of attributes, instances and labels. The cardinality, that is, the average number of labels per instance, varies from 1.07 to 4.27. Concerning the number of labels, some datasets have just 5, 6 or 7 labels, whereas others have more than 100; one of them even has almost 400 labels.

Table 1 Properties of the datasets

The approaches for inference in PCC compared with our proposal were those discussed throughout the paper, except for ES: no experiment was carried out with the ES method due to its computational cost. Hence, the methods compared with the A* algorithm with the heuristic \(\hat{h}\) for different values of the parameter d (\(1, 2, 3, \infty \)) were GS, the \(\epsilon \)-A algorithm for different values of \(\epsilon \) (.0, .25, .5) and BS for different values of the beam width b (1, 2, 3, 10). Remember that the \(\epsilon \)-A algorithm with \(\epsilon =0.5\) is equivalent to GS and to BS with \(b=1\).

The results will be presented in terms of the example-based subset 0/1 loss estimated by means of a 10-fold cross-validation.

The base learner employed to obtain the binary classifiers that compose all these multi-label models was logistic regression (Lin et al. 2008) with probabilistic output. The regularization parameter C was established for each individual binary classifier by performing a grid search over the values \(C \in \{10^{-3}, 10^{-2}, \ldots , 10^{3} \}\), optimizing the Brier loss estimated by means of a balanced 2-fold cross-validation repeated 5 times. The Brier loss (Brier 1950) is a proper scoring rule that measures the accuracy of probabilistic predictions, such as those produced by logistic regression. Its expression is

$$\begin{aligned} \frac{1}{n}\sum _{i=1}^n(\hat{p_i}-a_i)^2, \end{aligned}$$
(33)

where, for an instance i, \(\hat{p}_i\) is the predicted probability of a certain label and \(a_i\) is the actual value of the label (0 or 1).
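For reference, Eq. (33) translates directly into code (the function name is ours):

```python
import numpy as np

def brier_loss(p_hat, a):
    """Brier loss of Eq. (33): mean squared difference between the
    predicted probabilities p_hat and the actual 0/1 labels a."""
    p_hat, a = np.asarray(p_hat, dtype=float), np.asarray(a, dtype=float)
    return float(np.mean((p_hat - a) ** 2))
```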

Table 2 Subset 0/1 loss for the different methods
Table 3 Number of explored nodes for the different methods
Table 4 Average prediction time (in seconds) per example for all methods

5.2 Results over benchmark datasets

Tables 2, 3 and 4 respectively show the subset 0/1 loss, the number of nodes explored and the computational time (in seconds), averaged per test instance, for the different methods compared.

Before discussing the results in the tables, remember that only the A* algorithm using \(\hat{h}^d\) (and consequently also \(\epsilon \)-A with \(\epsilon =0\), since it is the A* algorithm with \(h=1\), that is, \(\hat{h}^0\)) provides Bayes optimal inference, as ES does. This means that these methods always predict the label combination with the highest joint conditional probability. Even so, the other methods, which may predict label combinations with lower joint conditional probability for some examples, obtain better subset 0/1 scores in a few cases. This is due to several reasons, mainly (a) the relatively small size of the test sets, and (b) the fact that the models \(f_j\) obtained to estimate the joint conditional probability \(P(\varvec{y}\, |\,\varvec{x})\) do not usually return perfect estimates. Theoretically, under ideal conditions (large test sets and perfect models), the A* algorithm using \(\hat{h}^d\) would obtain the best scores. In general, the performance of the \(\epsilon \)-A algorithm decreases as the value of \(\epsilon \) increases, and the performance of the BS method increases with b. The BS method reaches stability for low values of the beam width b; it even converges to the performance of the A* algorithm, but at the cost of exploring many more nodes.

With regard to the number of nodes explored (see Table 3), GS (equivalent to the \(\epsilon \)-A algorithm with \(\epsilon =0.5\) and to BS(1)) is the method that explores the fewest nodes, since it only follows one path in the tree. In fact, this number corresponds to the number of labels (plus one if the root is counted as an explored node). It is followed by the A* algorithm using our heuristic \(\hat{h}^d\) for any value of d, although the latter is exceeded by the \(\epsilon \)-A algorithm with \(\epsilon >0\) and by BS with \(b>1\) in some cases. However, remember that neither GS nor the \(\epsilon \)-A algorithm with \(\epsilon >0\) nor BS with any value of the beam width b guarantees reaching an optimal solution, as the A* algorithm using the heuristic \(\hat{h}^d\) does. So it is not surprising that they explore fewer nodes. What is actually surprising is that in most cases they explore more nodes without even reaching an optimal solution.

Let us now focus on the methods that provably reach the optimum (the A* algorithm with the heuristic \(\hat{h}^d\) and the \(\epsilon \)-A algorithm with \(\epsilon =0\)). As shown theoretically, the A* algorithm with the heuristic \(\hat{h}^\infty \) explores the fewest nodes, followed by the same algorithm with the heuristic \(\hat{h}^3\), then \(\hat{h}^2\), then \(\hat{h}^1\), and finally the \(\epsilon \)-A algorithm with \(\epsilon =0\).

The computational time is expected to grow with the number of nodes explored, but Table 4 shows that this is not always the case. It holds for GS, the \(\epsilon \)-A algorithm and the BS technique, but the A* algorithm takes more time in spite of exploring considerably fewer nodes. The reason is the time spent computing the heuristic. In this respect, the A* algorithm becomes faster as the parameter d decreases, although this implies exploring more nodes. Hence, among the methods that reach the optimum, the \(\epsilon \)-A algorithm with \(\epsilon =0\) is the fastest, followed by A* using the heuristic \(\hat{h}^1\), then \(\hat{h}^2\) and so on, ending with the heuristic \(\hat{h}^\infty \).

In conclusion, the A* algorithm with the heuristic \(\hat{h}^1\) can be considered a good alternative for guaranteeing optimal performance in terms of subset 0/1 loss while balancing the trade-off between the number of nodes explored and the computational time. Besides, it is also applicable to non-linear models.

5.3 Results with noise

We have also performed experiments adding noise to the datasets in order to analyze in more depth the power of the A* algorithm for inference in PCCs. This study arises from the fact that the number of nodes explored by A* is quite low in comparison with the number of labels (the depth of the tree). This is so even when using the heuristic \(h=1\) (\(\epsilon \)-A with \(\epsilon =0\)), which does not provide information as other heuristics, for instance \(\hat{h}\), do. This means that the A* algorithm performs little backtracking in the tree; hence, the probabilities provided by the models may not be very close to 0.5, and may even be close to 0 or 1. By adding noise, we expect the probabilities to be closer to 0.5, making the problem more complex, so that more informative heuristics can show their strength.

Several levels of noise, in terms of percentages, were added to the datasets. No noise was added to the description of the examples, that is, to the \(\varvec{x}\) part; instead, the noise was introduced only in the labels of the examples, that is, in the \(\varvec{y}\) part. In this sense, a percentage of the label values over the whole dataset were flipped from relevant to irrelevant and vice versa. This means that for a given example the relevance of all of its labels, only some of them, or even none of them may have changed. The percentage of noise ranges from 0 to 25 % for datasets with at most 22 labels. However, the experiments did not finish in a reasonable time over this whole range for the datasets with 45 labels or more. In particular, the maximum percentages considered for medical (45 labels), enron (53 labels), mediamill (101 labels) and bibtex (159 labels) were 18, 10, 8 and 4 %, respectively.

Figures 9 and 10 show, for datasets with few labels (from 5 to 7) and for datasets with more labels (from 14 to 159), respectively, the number of nodes expanded and the computational time employed by the A* algorithm with the heuristic \(\hat{h}\) for different values of the parameter d (\(1, 2, 3, \infty \)) and by \(\epsilon \)-A with \(\epsilon =0\), all for different percentages of noise.

Obviously, both the number of nodes explored and the computational time increase as the percentage of noise increases. The number of nodes explored decreases as d increases, as theoretically expected. In this respect, the number of nodes expanded using the heuristic \(\hat{h}^\infty \) remains quite steady as the noise increases, in comparison to the rest of the heuristics. In fact, with these other heuristics, the number of nodes rockets beyond a certain percentage of noise. This is so in general, but it happens especially for datasets with more than 14 labels.

Concerning the computational time for datasets with few labels, A* using \(\hat{h}^d\) with \(d>1\) is steadier than A* using \(\hat{h}^1\) or \(\epsilon \)-A with \(\epsilon =0\) as the noise increases. In fact, these last two approaches are the best options, \(\hat{h}^1\) being clearly better than \(\epsilon \)-A with \(\epsilon =0\). However, for high percentages of noise the other options begin to show their strength. Among the datasets with few labels, only flags presents a different behavior: there, the computational time behaves as for the datasets with more labels, coinciding with the behavior of the number of nodes. So, the computational time increases as d decreases, and \(\hat{h}^\infty \) clearly becomes the best option. In any case, we are measuring the computational time in milliseconds, so the worth of the heuristics becomes crucial for datasets with more than 22 labels.

6 Conclusions

This paper proposes a family of admissible heuristics for the A* algorithm for performing inference in PCCs. Providing admissible heuristics leads to obtaining optimal predictions in terms of subset 0/1 loss while exploring fewer nodes than previous approaches that also produce optimal predictions. The family of heuristics (\(\hat{h}^d\)) is defined by a parameter d which determines the depth in the tree down to which the heuristic is evaluated and, hence, controls the trade-off between the number of nodes explored and the computational time spent evaluating the heuristic. In this manner, the algorithm A*(\(\hat{h}^d\)) explores fewer nodes but spends more time reaching a solution as d increases. So far, only uniform-cost search (or \(\epsilon \)-A(\(\epsilon =.0\)), which in turn is the particular case A*(\(\hat{h}^0\))) had been shown to provide optimal subset 0/1 loss, unlike other approaches such as GS, \(\epsilon \)-A(\(\epsilon >0\)) or BS, which only estimate it. The algorithm A*(\(\hat{h}^d\)) with \(d>1\) is limited to linear models, but this is not a major drawback, because employing linear models is quite common. This is usually the case for MLC problems with many labels, in which the examples of some classes may be scarce, since using non-linear models in such problems can lead to overfitting. Only the particular case of \(d=1\) is also suitable for non-linear models, since it just involves evaluating the models on all the known label values.

In the experiments performed over the benchmark datasets, the algorithm A*(\(\hat{h}^1\)) and uniform-cost search (\(\epsilon \)-A(\(\epsilon =.0\)) or A*(\(\hat{h}^0\))) are better choices than A*(\(\hat{h}^d\)) for \(d>1\), since the computational cost of evaluating the heuristic does not seem to compensate for the reduction in nodes explored. In this respect, adding noise to the benchmark datasets, and hence obtaining more complex problems, allows us to show the strength of A*(\(\hat{h}^d\)) for \(d>1\). In this sense, the behavior of A*(\(\hat{h}^\infty \)) remains quite steady as the percentage of noise increases, unlike A*(\(\hat{h}^d\)) for the rest of the values of d, including \(d=0\) (\(\epsilon \)-A(\(\epsilon =.0\))). Furthermore, \(\hat{h}^1\) remains the best alternative only for datasets with a small number of labels.

In our opinion, heuristic search is a promising line of research for inference in PCCs. As future work, new admissible heuristics can be devised, keeping in mind that we should look for a trade-off between their computational complexity and the quality of their estimations.