1 Introduction

Hierarchies (taxonomies) are among the most common and useful data structures for organizing large volumes of data in various application domains, including bioinformatics for organizing gene sequences, ImageNet for organizing images and Yahoo! web directories for organizing web documents. Given a hierarchy of classes, the hierarchical classification (HC) problem deals with learning models that utilize (or ignore) the hierarchical structure to automatically classify unlabeled test instances (examples) into relevant classes (categories). Over the years, HC has gained immense interest among researchers, as is evident from various large-scale online prediction challenges such as LSHTC (web-page document classification), BioASQ (medical abstract classification) and ILSVRC (large-scale image classification).

In the past, various methods have been developed to improve HC performance [2]. One of the simplest is to learn a binary one-vs-rest classifier for each of the leaf categories, ignoring the hierarchical relationships; this method is referred to as flat classification. Other methods make use of the hierarchy (see Sect. 2) during the learning and prediction process. Hierarchies provide useful structural relationships (such as parent–child and sibling) among different classes that can be exploited for learning more general classification models. Previously, researchers have demonstrated the usefulness of hierarchies for classification and have obtained promising results [3,4,5,6,7,8,9,10,11,12]. However, in many situations the hierarchies used for learning models contain inconsistent nodes, resulting in excessive error propagation. In such cases, HC approaches are outperformed by flat classifiers that completely ignore the hierarchy [13, 14].

Flat classifiers, though effective in some cases, suffer from two major issues: (i) During the prediction phase, flat classifiers invoke the models for all leaf categories and are considerably slower than top-down HC approaches, in which only the models on the relevant path are invoked. (ii) For large-scale HC problems, it is challenging to learn classification models that can discriminate between a large number of classes. This problem is worse for datasets with skewed class distributions, where many classes have very few examples for training (the rare categories problem) [15]; large-scale datasets show a power-law distribution of examples per category [16]. Considering these issues, the focus of this paper is to improve top-down HC approaches, which are computationally feasible for large-scale datasets and handle the imbalance problem by utilizing structural relationship information.

The main drawback of top-down HC approaches, and the one that contributes most to their inferior classification performance, is error propagation: the compounding of errors from misclassifications at higher levels that cannot be rectified at the lower levels. This problem can be alleviated to a certain extent by restructuring (modifying) the hierarchy to remove the set of inconsistent nodes that cause performance deterioration. In this paper, our main contribution is the development of data-driven approaches for removing inconsistencies in the expert-defined (original) hierarchy, leading to a hierarchy that achieves higher classification performance irrespective of the HC approach used for training. We propose a flattening approach in which inconsistent nodes are selectively removed from the hierarchy. The criterion for flattening a node is based on the optimal regularized risk minimization objective value attained by the model trained for that node, evaluated on a separate validation set. If the objective value for a node n is above a certain threshold, then we flatten n, i.e., we remove n from the hierarchy and add its children to n's parent node. Based on the strategy adopted for identifying inconsistent nodes, we propose two different approaches for inconsistent node flattening (INF): (i) a local approach (Level-INF) that computes a level-wise cutoff threshold for each level in the hierarchy and (ii) a global approach (Global-INF) that computes a single threshold for the entire hierarchy.

Fig. 1 Various hierarchical structures (b)–(f) obtained after flattening some of the nodes (or levels) from the original hierarchy shown in (a). 'LN' denotes the leaf node to which examples are assigned. Internal nodes in each level of the original hierarchy are represented with a different symbol and color for ease of understanding. a Expert-defined (original) hierarchy. b Flat hierarchy. c Top-level flattened (TLF) hierarchy. d Bottom-level flattened (BLF) hierarchy. e Multiple-level flattened (MLF) hierarchy. f Inconsistent node flattened (INF) hierarchy

Experimental comparisons of the top-down HC approach on our proposed modified hierarchy show statistically significant performance improvement over the baseline (expert-defined) hierarchy and other competitive methods for hierarchy modification [17, 18]. We also performed a detailed analysis to show that the reduction in misclassification error at higher levels with our proposed hierarchy modification approach leads to reduced error propagation and hence better classification performance. In comparison with flat classification, our approach is more accurate for classes with few training instances (rare categories) and is computationally efficient during the prediction phase for large-scale HC problems.

To summarize, the main contributions of our work are as follows:

  1. We propose local and global methods for flattening inconsistent nodes in the hierarchy that give better classification performance than the baseline expert-defined hierarchy and other competing hierarchy modification approaches [17, 18], using a probabilistic classifier, logistic regression (LR), and a large-margin classifier, support vector machine (SVM).

  2. We perform a case study and experiments on a wide range of image and text datasets with varying characteristics to show the strength and effectiveness of our proposed modification approaches.

  3. We perform level-wise error analysis to compare the error propagation with the modified and the baseline hierarchies.

  4. We analyze the performance improvement achieved with the proposed modified hierarchy over the flat method for varying distributions of training examples per class.

  5. We perform a study that combines the flat and top-down model predictions in an ensemble setting (hybrid prediction).

2 Literature review

There has been a large body of research focusing on the HC problem. Besides completely ignoring the hierarchy (i.e., flat classifiers), one class of HC methods solves various local subproblems, training individual classifier(s) for each of the nodes (or parent nodes) in the hierarchy, or learning classifiers for each of the levels. These methods are referred to as local classification because only local structural relationship information is used when training the classifiers. To predict the label of an instance, top-down local hierarchical methods proceed by selecting the most relevant node at the topmost level and then recursively selecting the best child node until a leaf category is reached, which is the final predicted label. Local approaches are popular due to their computational benefits [2, 6]. Contrary to local classification, global classification methods [3, 19] learn a single complex model over all the nodes in the hierarchy and are computationally more expensive than flat and local methods. Therefore, in this paper we focus on top-down local classification methods for training models and predicting labels.

Some of the earlier studies focus on exploiting hierarchies among categories for the purpose of classification [5, 20,21,22], but the number of categories was limited to a few hundred. One of the early breakthroughs in hierarchical text categorization was by Koller et al. [6]. This approach used a divide-and-conquer paradigm for solving the HC problem that can easily be adapted to large-scale settings. Following this, numerous approaches have been developed to improve HC performance on large datasets, in terms of both accuracy and scalability. Liu et al. [23] studied the classification performance on the Yahoo! web taxonomy using an SVM-based method that scales to millions of categories. Gopal et al. [3] used a regularization term within the optimization function to capture the parent–child relationships in the hierarchy. This approach, referred to as HR-LR and HR-SVM, shows improved classification performance, but the training procedure for large-scale problems requires a distributed implementation and a map-reduce supported infrastructure. Xue et al. [12] proposed a two-stage approach. For each test document, the first stage retrieves a set of candidate categories based on similarity to the test document. The second stage then builds a classifier on the hierarchy restricted to the categories fetched in the first stage and classifies the test document using the restricted hierarchy. Although pruning reduces the hierarchy to a manageable size, one severe drawback of this approach is having to train a different classifier for each test document, which is expensive. Other works in the field of HC can be found in the detailed survey by Silla et al. [2].

2.1 Hierarchy modification

Most HC approaches rely on hierarchical relationships for learning complex models to improve the classification performance. However, the performance can be negatively impacted if the hierarchy used for learning models is inconsistent. Therefore, it is of utmost importance to generate an improved hierarchical representation that is suitable for classification prior to learning models. Inconsistencies in the hierarchy arise for the following reasons:

  1. Hierarchies are designed for efficient search and navigation without considering HC performance.

  2. Hierarchical groupings of categories are done based on semantics, whereas classification depends on data characteristics such as term frequency.

  3. Multiple hierarchies are possible for the same dataset (such as the SCOP and CATH [24] hierarchies for protein structures). However, there is no consensus regarding which hierarchy is better for classification.

  4. Consistent hierarchy design for a dataset with a large number of categories is prone to errors.

Several approaches that restructure the hierarchy have been developed in the past. Level flattening [17] is one of the approaches used in earlier works on hierarchy modification, where some of the levels are flattened (removed) from the original hierarchy prior to learning models. Based on the levels that are flattened, various methods of modification exist. Top-level flattening (TLF), as shown in Fig. 1c, modifies the hierarchy by flattening the top level of the original hierarchy. Model learning and prediction for the flattened hierarchy are done in the same fashion as for top-down methods. Bottom-level flattening (BLF) and multiple-level flattening (MLF), shown in Fig. 1d, e, are similar methods of hierarchy modification where the bottom and multiple levels are removed, respectively. As done in Wang et al. [17], we removed the first and third levels when evaluating the MLF approach.

Babbar et al. [18] proposed a maximum-margin-based strategy for hierarchy modification. This method selectively removes some of the inconsistent nodes in the hierarchy based on margins rather than removing complete levels. Hierarchy modification using this approach, shown in Fig. 1f, is beneficial for classification and has been theoretically justified [25]. We followed a similar approach for hierarchy modification. However, our method differs in the following two aspects: (i) We developed a more systematic and elegant approach for selecting the threshold used to identify inconsistent nodes for flattening. Our approach is based on deviation from the mean, which is empirically tuned for each dataset. (ii) We also considered a global perspective of the hierarchy (Global-INF) for threshold selection. This is a more intuitive and realistic measure for threshold selection because it prevents excessive flattening of nodes based solely on local decisions, thereby retaining the benefits of leveraging the hierarchy during the model learning and classification process, especially for rare categories (see Sect. 6 for justification).

Hierarchy modification using supervised learning approaches has also been proposed in the literature [15, 21], where the hierarchy is gradually modified to achieve a better hierarchy that improves the classification performance. These methods incur an expensive evaluation cost after each modification, making them computationally infeasible for large-scale settings. Hence, we do not compare our approach to these methods. Other competitive methods that involve restructuring the hierarchy were developed by us and appear in an arXiv publication [26].

3 Methods

Table 1 summarizes the common notations used in this paper. We use bold letters to indicate vector variables.

Table 1 Notation description

3.1 Hierarchical classification

Given a hierarchy \(\mathcal {H}\), we train a binary one-vs-rest classifier for each node \(n\in \mathcal {N}\) to discriminate its positive examples from the examples of other nodes (i.e., negative examples) in the hierarchy. We followed the 'inclusive policy' for training classifiers, where all the descendant categories of node n (including the node itself) are considered positive examples and the remaining categories negative examples [27]. In this paper, we explore two choices for training the model, namely logistic regression (LR) and support vector machine (SVM) [3]. The LR model uses the logistic loss to minimize the empirical risk, whereas the SVM model uses the hinge loss. Both models are coupled with an l2-norm regularization term (denoted by \(||\cdot ||_{2}^{2}\)) to control the model complexity and prevent overfitting. The objective function \(f_n\) for training a model corresponding to node n using the LR and SVM objectives is provided in Eqs. (1) and (2), respectively.

$$\begin{aligned} f_n= & {} \min _{\varvec{\Theta }_\mathbf{n}}\Bigg [C\sum _{i=1}^{N}\log \left( 1+\exp \left( -y_{i}^{n}{\varvec{\Theta }_\mathbf{n}}^{T}\mathbf {x}_i\right) \right) \nonumber \\&+\frac{1}{2}\left\| {\varvec{\Theta }_\mathbf{n}}\right\| _{2}^{2}\Bigg ] \end{aligned}$$
(1)
$$\begin{aligned} f_n= & {} \min _{\varvec{\Theta }_\mathbf{n}}\Bigg [C\sum _{i=1}^{N}\left| 1 - y_i^n{\varvec{\Theta }_\mathbf{n}}^{T}\mathbf {x}_i\right| _{+}+\frac{1}{2}\left\| {\varvec{\Theta }_\mathbf{n}}\right\| _{2}^{2}\Bigg ] \end{aligned}$$
(2)

where \(|a|_+\) denotes the hinge loss given by max(0, a).

For each node n, we solve Eq. (1) [or Eq. (2)] to obtain the optimal weight vector denoted by \({\varvec{\Theta }}_{n}\). The complete set of parameters for all the nodes \(\left\{ {\varvec{\Theta }}_{n}\right\} _{n\in \mathcal {N}}\) constitutes the learned model for the hierarchical top-down classifier.
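To make the per-node training loop concrete, the sketch below trains one l2-regularized LR model per node. It is a minimal illustration rather than our actual implementation (which uses a modified liblinear); the `descendants` map and the use of scikit-learn's LogisticRegression, whose parameter C plays the role of the misclassification penalty in Eq. (1), are illustrative assumptions.

```python
from sklearn.linear_model import LogisticRegression

def train_node_models(nodes, X, labels, descendants, C=1.0):
    """Train an l2-regularized one-vs-rest LR model for every node.

    descendants[n] is assumed to hold the set of leaf categories under
    node n (including n itself if it is a leaf), implementing the
    'inclusive policy' for choosing positive examples.
    """
    models = {}
    for n in nodes:
        # +1 if the example's leaf label lies below node n, -1 otherwise
        y = [1 if lbl in descendants[n] else -1 for lbl in labels]
        models[n] = LogisticRegression(C=C).fit(X, y)  # Eq. (1) objective
    return models
```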

For a test example with feature vector \({\widehat{\mathbf{x}}}_i\), the top-down classifier predicts the class label \(\widehat{y}_i\in \mathcal {L}\) as shown in Eq. (3). Essentially, the algorithm starts at the root node and recursively selects the best child node until it reaches a terminal node belonging to the set of leaf categories \(\mathcal {L}\).

$$\begin{aligned} \widehat{y}_i=\left\{ \begin{array}{l}\mathbf {initialize}\quad p:=\mathrm{root}\\ \mathbf {while}\ p\notin \mathcal {L}\\ \quad p:=\mathbf {argmax}_{q\in \mathcal {C}(p)}{\varvec{\Theta }}_{q}^{T}{\widehat{\mathbf{x}}}_i\\ \mathbf {return}\ p \end{array}\right\} \end{aligned}$$
(3)

Although in this paper we have explored two of the most widely used classifiers (LR and SVM), other binary or multi-class classifiers [28,29,30] can also be used to perform HC.
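As a concrete illustration of Eq. (3), the recursion can be sketched in a few lines; the `children`, `theta` and `leaves` structures are hypothetical stand-ins for the learned top-down model.

```python
import numpy as np

def top_down_predict(x, root, children, theta, leaves):
    """Follow the best-scoring child from the root until a leaf is reached.

    children : dict mapping each internal node to its list of child nodes
    theta    : dict mapping each node q to its learned weight vector
    leaves   : set of leaf categories (the label space L)
    """
    p = root
    while p not in leaves:
        # argmax over the children of p of the linear score theta_q^T x
        p = max(children[p], key=lambda q: float(np.dot(theta[q], x)))
    return p
```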

3.2 Inconsistent node flattening

Motivation: Gao et al. [25] showed theoretically that for any classifier that correctly classifies m random input–output pairs using a set \(\mathcal {D}\) of decision nodes with margins {\(\gamma _{n}\), \(\forall n \in \mathcal {D}\)}, the generalization error is, with probability greater than (\(1 - \zeta \)), bounded by the expression shown in Eq. (4).

$$\begin{aligned}&\frac{130 R^{2}}{m}\Bigg [\sum _{n\in \mathcal {D}}\left( \frac{1}{\gamma _{n}^{2}}\right) \log (4em)\log (4m)\nonumber \\&\quad +\mathcal {\left| D\right| }\log (2m)-\log \left( \frac{2}{\zeta }\right) \Bigg ] \end{aligned}$$
(4)

where R is the radius of the ball containing the distribution’s support.

This bound suggests two significant strategies for reducing the generalization error in our approach: (i) increase the margin \(\gamma _{n}\) of the learned model at each node \(n \in \mathcal {N}\) in the hierarchy, or (ii) decrease the number of decision nodes \(|\mathcal {D}|\) involved in making a prediction. To achieve optimum classification performance, we need to balance the trade-off between the margin \(\gamma _n\) and the number of decision nodes \(|\mathcal {D}|\). The two extreme cases of learning hierarchical classifiers are flat and top-down methods. Flat classifiers make a single decision (i.e., \(|\mathcal {D}|\) = 1), but the margin \(\gamma _{n}\) is presumably small due to the large number of leaf categories that need to be distinguished, which makes it difficult to obtain a large margin. Top-down hierarchical classifiers make a series of decisions from the root to a leaf node (i.e., \(|\mathcal {D}| \ge 1\)), but the margin \(\gamma _n\) is larger due to the smaller number of categories that need to be distinguished at each decision node. Motivated by this trade-off, in this work we propose a method that removes some of the inconsistent nodes in the hierarchy \(\mathcal {H}\), thereby increasing the margin \(\gamma _{n}\) of the learned models while minimizing the number of decision nodes needed to classify an unlabeled test instance.

In order to improve the effectiveness of classification, we need to identify these inconsistent nodes and flatten them. We mark a node n within the hierarchy as inconsistent if the value of the objective function \(f_{n}^{*}\) is greater than a chosen threshold value. To get a reliable estimate of \(f_{n}^{*}\), we first train the regularized LR models on a training set locally for each node and then compute the objective function on a separate validation set, disjoint from the training set. The objective value on the validation set for node n is denoted by \(f_{n}^{*}\). We develop the following approaches for setting the flattening threshold.

Level-wise inconsistent node flattening: In this approach, referred to as Level-INF, we select a different threshold \(\tau _{k}\) locally for each level k in the hierarchy. Algorithm 1 presents the level-wise approach that selects inconsistent nodes at each level in a top-down manner. The threshold \(\tau _{k}\) for level k is computed as the sum of the mean and \(\psi _k\) times the standard deviation of the set of values \(\left\{ f_{n}^{*}\right\} _{n\in \mathcal {N}_{k}}\), where \(\psi _k\) is a fitness parameter at level k that is empirically estimated for each dataset based on the \(\left\{ f_{n}^{*}\right\} _{n\in \mathcal {N}_{k}}\) values (see Sect. 6.3) and \(\mathcal {N}_{k}\) represents the set of nodes at level k. All nodes \(n\in \mathcal {N}_{k}\) that satisfy \(f_{n}^{*}>\tau _{k}\) are marked as inconsistent and added to the set of inconsistent nodes denoted by \(\mathcal {I}_\mathrm{L}\). This procedure is repeated for all levels of the hierarchy. Finally, we flatten the nodes in the set \(\mathcal {I}_\mathrm{L}\): we remove each \(n\in \mathcal {I}_\mathrm{L}\) and its corresponding edges, and add edges from the children of n to n's parent node. The modified hierarchy thus obtained is denoted by \(\mathcal {H}_\mathrm{L}\). Using the modified hierarchy, we retrain a top-down classifier.

Table 2 Dataset statistics
Algorithm 1 Level-wise inconsistent node flattening (Level-INF)

Global inconsistent node flattening: Different from the Level-INF approach, which sets a different threshold for each level, the global method shown in Algorithm 2 computes a single threshold value for all levels. The threshold \(\tau \) is computed as the sum of the mean and \(\psi \) times the standard deviation of the set of values \(\left\{ f_{n}^{*}\right\} _{n\in \mathcal {N}}\), where \(\psi \) is a fitness parameter that is empirically estimated for each dataset considering the \(f_n^*\) values of all nodes \(n \in \mathcal {N}\). \(\tau \) is used to identify the set of inconsistent nodes \(\mathcal {I}_\mathrm{G}\) in the hierarchy (i.e., all nodes n with \(f_{n}^{*}>\tau \)). The hierarchy obtained by flattening the nodes in \(\mathcal {I}_\mathrm{G}\) is denoted by \(\mathcal {H}_\mathrm{G}\). Using the modified hierarchy, we retrain a top-down classifier. In this paper, we refer to this approach as Global-INF.

Algorithm 2 Global inconsistent node flattening (Global-INF)
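To make the thresholding and flattening steps concrete, the following is a minimal sketch of the Global-INF procedure (Level-INF applies the same rule per level with its own \(\psi _k\)); the `parent`, `children` and `level` dictionaries are assumed representations of the hierarchy, not the data structures of Algorithms 1 and 2.

```python
import numpy as np

def global_inf(f_star, psi):
    """Return the inconsistent nodes: those with f_n^* > tau = mean + psi * std."""
    vals = np.array(list(f_star.values()))
    tau = vals.mean() + psi * vals.std()
    return {n for n, v in f_star.items() if v > tau}

def flatten_nodes(parent, children, level, inconsistent):
    """Remove each inconsistent node, reattaching its children to its parent."""
    # process top-down so that a removed node's parent pointer is already final
    for n in sorted(inconsistent, key=lambda m: level[m]):
        p = parent.pop(n)
        children[p].remove(n)
        for c in children.pop(n, []):
            parent[c] = p
            children[p].append(c)
    return parent, children
```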

3.3 Hybrid prediction

One of the major issues faced by top-down approaches is that prediction errors made at the higher levels cannot be rectified at the lower levels. In order for the top-down prediction algorithm [given in Eq. (3)] to predict the correct label for a test example and avoid error propagation, all the classifiers on its path must classify correctly. The effect of this error propagation worsens with the depth of the hierarchy, leading to poor performance of top-down models [15]. To overcome this problem, we propose a hybrid approach for prediction that combines the predictions made by flat and top-down classifiers. Essentially, in this method we consider the predictions obtained from both the flat and the top-down classifier and, in case of disagreement in label assignment, choose the more confident prediction based on probability estimates. In our experimental evaluations, the hybrid approach performed better than both the flat and top-down methods.

Let (\(P_i^\mathrm{F}, \widehat{y}_i^\mathrm{F}\)) denote the probability score and predicted label pair obtained from the flat classifier for the i-th test instance, and let (\(P_i^\mathrm{TD}, \widehat{y}_i^\mathrm{TD}\)) denote the corresponding values for the top-down classifier. Since the label for the top-down model is obtained by firing multiple classifiers along the top-down path, we compute the final probability score \(P_i^\mathrm{TD}\) as the product of the probabilities predicted by the classifiers on the path [31], as shown in Eq. (5).

$$\begin{aligned} P_i^\mathrm{TD} = \prod _{k \in \mathcal {A}(\widehat{y}_i^\mathrm{TD})} p_i^k \end{aligned}$$
(5)

where \(p_i^k\) is the probability score obtained for the i-th instance at the k-th level of the selected top-down path, computed as shown in Eq. (6).

$$\begin{aligned} p_i^{k} = \frac{\exp \Big ({{\varvec{\Theta }}_k^T{\widehat{\mathbf{x}}}_i}\Big )}{\exp \Big ({{\varvec{\Theta }}_k^T{\widehat{\mathbf{x}}}_i}\Big ) + \sum _{n \in \xi (k)}\exp \Big ({{\varvec{\Theta }}_n^T{\widehat{\mathbf{x}}}_i}\Big )} \end{aligned}$$
(6)

Similarly, to compute the probability score \(P_i^\mathrm{F}\) for the flat classifier, we normalize the best prediction score over all leaf labels as shown in Eq. (7).

$$\begin{aligned} P_i^\mathrm{F} = \frac{\exp \Big ( {\mathop {\max }\limits _{n \; \in \;\mathcal {L}}}\,{\varvec{\Theta }}_n^T{\widehat{\mathbf{x}}}_i\Big )}{\sum _{n \in \mathcal {L}}\exp \Big ({{\varvec{\Theta }}_n^T{\widehat{\mathbf{x}}}_i}\Big )} \end{aligned}$$
(7)

If \(\widehat{y}_i^\mathrm{F} \ne \widehat{y}_i^\mathrm{TD}\), we obtain the final prediction by assigning the label with the highest probability score, as shown in Eq. (8).

$$\begin{aligned} \text {Final prediction} (\widehat{y}_i)=\left\{ \begin{array}{l} \widehat{y}_i^\mathrm{F}, \quad P_i^\mathrm{F} > P_i^\mathrm{TD}\\ \widehat{y}_i^\mathrm{TD}, \quad \hbox {otherwise} \end{array} \right\} \end{aligned}$$
(8)
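Putting Eqs. (5)–(8) together, the disagreement-resolution step admits a short sketch; the argument names below are illustrative rather than taken from our implementation.

```python
import numpy as np

def hybrid_predict(flat_prob, flat_label, td_path_probs, td_label):
    """Resolve flat vs. top-down predictions by confidence.

    flat_prob     : P_i^F from Eq. (7)
    td_path_probs : per-level probabilities p_i^k along the path, Eq. (6)
    """
    td_prob = float(np.prod(td_path_probs))  # Eq. (5): product over the path
    if flat_label != td_label and flat_prob > td_prob:
        return flat_label  # Eq. (8): the flat label wins when more confident
    return td_label
```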

4 Experimental evaluations

4.1 Datasets

We have used text and image datasets for evaluating the performance of our proposed approaches. Various statistics of the datasets used in our experiments are listed in Table 2. All these datasets are single-labeled, and every example is assigned to a leaf node in the hierarchy. For all text datasets, we have applied the tf-idf transformation with l2-norm normalization to the word-frequency feature vectors. The datasets are described below:

4.1.1 Image datasets

CLEF [32]—Medical images annotated with medical application codes. Each image is represented by 80 features extracted using the local distribution of edges.

DIATOMS [33]—Diatom images created as part of the ADIAC project. Features for each image are created using the various feature extraction techniques mentioned in [33]. Further, we have preprocessed the original dataset by removing the examples that belong to internal nodes.

Fig. 2 Distribution of classes in the DMOZ datasets, showing that the majority of classes are rare categories. a DMOZ-SMALL. b DMOZ-2010. c DMOZ-2012

4.1.2 Text datasets

IPC—A collection of patent documents organized in the International Patent Classification hierarchy.

DMOZ-SMALL, DMOZ-2010 and DMOZ-2012—Web documents organized into various classes using a hierarchical structure. These are subsets of the web pages from the Open Directory Project and were released as part of the LSHTC challenge. For evaluating the DMOZ-2010 and DMOZ-2012 datasets, we used the provided test splits, and prediction scores were obtained using the web portal interface that was used during the competition. The DMOZ datasets are highly imbalanced, with more than 75\(\%\) of the classes belonging to rare categories (i.e., classes having \(\le \) 10 examples), as shown in Fig. 2.

4.2 Evaluation metrics

For evaluating the performance of HC models, we have used the flat and hierarchy-based metrics used in online prediction challenges such as LSHTC and BioASQ. The metrics used in our evaluation are described below:

Flat evaluation measures: The standard set-based metrics [34] Micro-\(F_1\) (\(\mu F_1\)) and Macro-\(F_1\) (M\(F_1\)) are used for evaluating the performance of the various methods. To compute \(\mu F_1\), we sum up the category-specific true positives \((\hbox {TP}_c)\), false positives \((\hbox {FP}_c)\) and false negatives \((\hbox {FN}_c)\) over the different categories and compute the score as:

$$\begin{aligned}&P = \frac{\sum _{c \in \mathcal {L}}\hbox {TP}_c}{\sum _{c \in \mathcal {L}}(\hbox {TP}_c + \hbox {FP}_c)},\nonumber \\&R = \frac{\sum _{c \in \mathcal {L}}\hbox {TP}_c}{\sum _{c \in \mathcal {L}}(\hbox {TP}_c + \hbox {FN}_c)}\nonumber \\&\mu F_1= \frac{2PR}{P + R} \end{aligned}$$
(9)

Unlike \(\mu F_1\), \(\hbox {M}F_1\) gives equal weight to all the categories so that the average score is not skewed in favor of the larger categories. It is defined as follows:

$$\begin{aligned}&P_c = \frac{\hbox {TP}_c}{\hbox {TP}_c + \hbox {FP}_c}, \quad R_c = \frac{\hbox {TP}_c}{\hbox {TP}_c + \hbox {FN}_c} \nonumber \\&\hbox {M}F_1 = \frac{1}{|\mathcal {L}|}\sum _{c \in \mathcal {L}}\frac{2P_cR_c}{P_c + R_c} \end{aligned}$$
(10)

where \(|\mathcal {L}|\) is the number of leaf categories.
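For reference, both scores can be computed directly from per-class counts; the sketch below assumes single-label ground truth and predictions, as in our setting.

```python
from collections import Counter

def micro_macro_f1(y_true, y_pred, classes):
    """Compute micro- and macro-averaged F1 following Eqs. (9) and (10)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for y, yh in zip(y_true, y_pred):
        if y == yh:
            tp[y] += 1
        else:
            fp[yh] += 1
            fn[y] += 1
    # micro: pool the counts over all categories
    TP, FP, FN = sum(tp.values()), sum(fp.values()), sum(fn.values())
    p, r = TP / (TP + FP), TP / (TP + FN)
    micro = 2 * p * r / (p + r)
    # macro: average the per-class F1 scores with equal weight
    f1 = []
    for c in classes:
        pc = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rc = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1.append(2 * pc * rc / (pc + rc) if pc + rc else 0.0)
    return micro, sum(f1) / len(classes)
```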

Hierarchical evaluation measures: Different from flat measures, which penalize every misclassified example equally, hierarchical measures [35] take into consideration the hierarchical distance between the true and predicted labels when evaluating classifier performance. In general, misclassifications that are closer to the actual class are less severe than misclassifications that are farther from the true class with respect to the hierarchy (e.g., an example from the hockey class misclassified as baseball is less severe than one misclassified as cat). The hierarchy-based measures include hierarchical \(F_1\) (\(\hbox {hF}_1\)) [the harmonic mean of hierarchical precision (hP) and hierarchical recall (hR)] and tree-induced error (TE) [22], which are defined as follows:

$$\begin{aligned}&\hbox {hP} = \frac{\sum _{i=1}^{N}|{\mathcal {{A}}}(\widehat{y_i}) \cap {\mathcal {{A}}}({y_i})|}{\sum _{i=1}^{N}|{\mathcal {{A}}}(\widehat{y_i})|},\nonumber \\&\hbox {hR} = \frac{\sum _{i=1}^{N}|{\mathcal {{A}}}(\widehat{y_i}) \cap {\mathcal {{A}}}({y_i})|}{\sum _{i=1}^{N}|{\mathcal {{A}}}({y_i})|} \nonumber \\&\hbox {hF}_1 = \frac{2*\hbox {hP}*\hbox {hR}}{\hbox {hP} + \hbox {hR}} \end{aligned}$$
(11)
$$\begin{aligned}&TE = \frac{1}{N}\sum _{i=1}^{N}\delta (\widehat{y_i}, y_i) \end{aligned}$$
(12)

where \({\mathcal {{A}}}(\widehat{y_i})\) and \({\mathcal {{A}}}({y_i})\) are, respectively, the sets of ancestors of the predicted and true labels, which include the label itself but not the root node, and \(\delta (a, b)\) is the length of the undirected path between categories a and b in the hierarchy graph. For TE, lower scores are better, whereas for all other measures higher scores are better.

Note that for consistent evaluation of the hierarchical measures, we have used the original hierarchy for all methods unless otherwise noted.
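A compact sketch of \(\hbox {hF}_1\) from Eq. (11), assuming the ancestor sets \({\mathcal {{A}}}(\cdot )\) defined above are precomputed:

```python
def hierarchical_f1(y_true, y_pred, ancestors):
    """ancestors[c]: set of ancestors of c, including c, excluding the root."""
    overlap = sum(len(ancestors[yh] & ancestors[y])
                  for y, yh in zip(y_true, y_pred))
    hp = overlap / sum(len(ancestors[yh]) for yh in y_pred)  # hierarchical precision
    hr = overlap / sum(len(ancestors[y]) for y in y_true)    # hierarchical recall
    return 2 * hp * hr / (hp + hr)
```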

4.3 Experimental protocol

In all the experiments, we divided the training dataset into a training set and a small validation set in a 90:10 ratio. Each experiment was run five times with different train and validation splits chosen randomly. Testing is done on an independent held-out dataset as provided by these benchmarks. The model is trained by choosing the misclassification penalty parameter (C) from the set {\(10^{-3}\), \(10^{-2}\), \(10^{-1}\), 1, \(10^{1}\), \(10^{2}\), \(10^{3}\)}. The best parameter is selected using the validation set and is then used to retrain the models on the entire training set; performance is measured on the held-out test set. For the INF methods, we compute and save the \(f^*_n\) value for each node in the hierarchy using the validation set. Setting the threshold as \(\mu + \psi \sigma \) (or \(\mu + \psi _k\sigma \) for the k-th level in the Level-INF approach), we remove the inconsistent nodes, where the best value of the fitness parameter \(\psi \) (or \(\psi _k\)) is computed empirically for each dataset (see Sect. 6.3). All experiments were conducted using a modified version of the liblinear software [36] and were run on ARGO, a research computing cluster provided by the Office of Research Computing (URL: http://orc.gmu.edu) at George Mason University, VA. The source code, data description and experimental settings used in this paper are available at our website.
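The parameter selection loop itself is straightforward; below is a sketch under the assumption of generic `train` and `evaluate` callables (hypothetical helpers, not part of liblinear).

```python
def select_penalty(train, evaluate, X_tr, y_tr, X_val, y_val):
    """Pick the misclassification penalty C by validation performance."""
    grid = [10.0**k for k in range(-3, 4)]  # {1e-3, ..., 1e3}
    best_C, best_score = None, float("-inf")
    for C in grid:
        model = train(X_tr, y_tr, C)           # fit with penalty C
        score = evaluate(model, X_val, y_val)  # e.g., MF1 on the validation set
        if score > best_score:
            best_C, best_score = C, score
    return best_C
```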

5 Comparative approaches

5.1 Top-down hierarchical methods

For all top-down hierarchical baselines, we train a binary one-vs-rest classifier for each node (except the root) in the hierarchy, and predictions are made starting from the root node by recursively selecting the best-scoring child node until a leaf node is reached [see Eq. (3)]. Depending on the hierarchy used during the training and prediction process, we compare with the following baselines.

Top-down (TD): The original hierarchy provided by the domain experts is used for classifier training and label prediction. Based on the model used for training, we have TD-LR and TD-SVM, where we train an LR and an SVM model, respectively.

Level flattening: A modified hierarchy obtained by flattening different level(s) is used instead of the original hierarchy. Depending on the level(s) flattened, we have the top-level flattening (TLF), bottom-level flattening (BLF) and multiple-level flattening (MLF) hierarchies, as discussed in Sect. 2.1.

Maximum-margin-based taxonomy adaptation (MTA): The original hierarchy is modified using the margin value computed at each node in the hierarchy, as described in Babbar et al. [18].

5.2 Flat methods

Logistic regression (LR): We train binary one-vs-rest regularized LR classifiers for each of the leaf categories, ignoring the hierarchical structure. The prediction \(\widehat{y}\) for an unlabeled test instance x is the class with the maximum prediction score across the one-vs-rest classifiers, as shown in Eq. (13).

$$\begin{aligned} \widehat{y} = \mathop {\mathbf{argmax}}\limits _{{n \; \in \;\mathcal {L}}} \,{\varvec{\Theta }}_n^T\mathbf{x} \end{aligned}$$
(13)

Error-correcting output code (ECOC) [37]: This approach combines binary classifiers to exploit correlations and correct errors. Codewords are generated randomly, with bits assigned to represent the hierarchical information between the categories. Experiments were done with codeword lengths varying from 32 to 1024 bits, depending on the dataset. To classify an unlabeled example, its output codeword is compared to the codeword of each category, and the category with the minimum Hamming distance is selected as the class label.
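A minimal sketch of the ECOC decoding step; the `codewords` map and per-bit classifiers are hypothetical placeholders.

```python
import numpy as np

def ecoc_predict(x, codewords, bit_classifiers):
    """codewords: dict class -> 0/1 numpy array; bit_classifiers: one binary
    predictor per codeword bit, each returning 0 or 1 for instance x."""
    output = np.array([clf(x) for clf in bit_classifiers])
    # choose the class whose codeword is nearest in Hamming distance
    return min(codewords, key=lambda c: int(np.sum(codewords[c] != output)))
```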

Fig. 3 Different hierarchical structures (b), (c) obtained from the expert-defined hierarchy (a). a Expert-defined hierarchy. b Level-INF modified hierarchy. c Global-INF modified hierarchy

6 Results and discussion

6.1 Case study

To understand the quality of the different hierarchical structures (expert-defined and flattened) for the newsgroup dataset shown in Fig. 3, we perform top-down HC using an LR model with each hierarchy separately. The dataset has 11,269 training instances, 7505 test instances and 20 classes. We evaluate each hierarchy by randomly selecting five different training and test splits in the same ratio as the original dataset.

The classification results are shown in Table 3. We can see that using these modified hierarchies substantially improves the classification performance in comparison with the baseline expert-defined hierarchy. Comparing the proposed flattening approaches, the classification performance obtained using the Global-INF hierarchy is significantly better than that of the Level-INF hierarchy. This is because the Global-INF method has the flexibility of not removing any node (or removing all nodes) from a given level, resulting in a better set of inconsistent nodes and hence better classification performance. On the contrary, Level-INF removes at least one node from each level, which may not be optimal because all the nodes in a level might be useful for classification.

6.2 Comparison to top-down hierarchical baselines

Performance based on flat metrics: Table 4 (Table 5) shows the \(\mu F_1\) and \(\hbox {MF}_1\) performance comparison of our proposed hierarchy modification approaches with TD-LR (TD-SVM) and competitive hierarchy modification approaches. We have used LR and SVM as the baseline models for evaluation [3]. From the tables we can see that our proposed Global-INF approach consistently outperforms all other approaches for different datasets across all metrics. In fact, for the CLEF and DIATOMS image datasets we see a relative performance improvement of >7\(\%\) in M\(F_1\) scores when comparing Global-INF with the best top-down modification baseline, i.e., MTA. To validate the performance improvement, we conducted pairwise statistical significance tests between our best approach, Global-INF, and the best top-down baseline for all datasets except DMOZ-2010 and DMOZ-2012, where true test labels (and class-wise performance) are not available from the online evaluation. Specifically, we compute the sign test for \(\mu F_1\) [38] and the nonparametric Wilcoxon rank test for M\(F_1\) scores (note that the significance tests are between two approaches for a single run). In Fig. 4 we present the pairwise statistical comparisons of the LR model for the different approaches studied here on the DMOZ-SMALL dataset. Global-INF consistently outperforms the other baseline approaches.

Overall, the results of the statistical tests show that the Global-INF approach significantly outperforms the best baseline (and hence the other baselines) for all the datasets (see Tables 4 and 5).

Table 3 \(\mu F_1\), \(\hbox {MF}_1\) and \(\hbox {hF}_1\) performance comparison of different hierarchical structures on newsgroup dataset for LR model
Table 4 \(\mu F_1\) and \(\hbox {MF}_1\) performance comparison of various top-down hierarchical baselines against our proposed approaches for LR model
Table 5 \(\mu F_1\) and \(M F_1\) performance comparison of various top-down hierarchical baselines against our proposed approaches for SVM model
Fig. 4 P value comparison of different approaches for the LR model on the DMOZ-SMALL dataset. Significant improvements are denoted by dark black color. a \(\mu F_1\). b \(\hbox {MF}_1\)

Table 6 Hierarchical performance comparison of various top-down hierarchical baselines against our proposed approaches over original (expert-defined) and modified hierarchy for LR model
Table 7 Hierarchical performance comparison of various top-down hierarchical baselines against our proposed approaches over original (expert-defined) and modified hierarchy for SVM model

Comparing our two proposed approaches, Global-INF performs better than Level-INF. This is because the Level-INF approach strictly enforces that some nodes be flattened from each level, although their \(f^*_n\) values may be much lower than those of nodes at other levels in the hierarchy, and vice versa. In contrast, the Global-INF approach takes the \(f^*_n\) values of all nodes into consideration while making a decision and hence determines a better set of inconsistent nodes. The MTA approach performs poorly due to issues similar to those of the Level-INF approach. The performance of the level flattening approaches, viz. TLF, BLF and MLF, suffers because these methods remove entire level(s) of the hierarchy without considering whether any node in that level is important for HC. The TD-LR approach has the worst performance because the inconsistent nodes present in the original hierarchy negatively impact the generalization capability of the models learned at the higher levels (see Sect. 6.4), which results in error propagation.

Fig. 5 \(\hbox {MF}_1\) performance comparison of the flat approach (dotted line) against the best top-down approach, Global-INF (solid line), for the LR model with different selections of the threshold (\(\tau \)) for the CLEF and DMOZ-SMALL datasets. Validation data are used for plotting the graph

Fig. 6 Total fan-out (\(\#\) children) at each level for the CLEF and DMOZ-SMALL datasets before and after flattening inconsistent nodes using the Global-INF approach. a CLEF. b DMOZ-SMALL

Performance based on hierarchical metrics: The hierarchical evaluation metrics \(\hbox {hF}_1\) and TE compute errors for misclassified examples based on a defined hierarchy. Table 6 (Table 7) shows the \(\hbox {hF}_1\) and TE scores for all top-down approaches evaluated over the original hierarchy and the modified hierarchy (obtained by flattening). We can see that our proposed approach, Global-INF, outperforms the other approaches because it is able to identify a better set of inconsistent nodes. Comparing the classification performance over the original and modified hierarchies, we can see that most approaches show improved performance on the modified hierarchy. This is because flattening the hierarchy reduces the hierarchical path length for misclassified examples, contributing to the performance improvement.

6.3 Empirical study for threshold \((\tau )\) selection

Figure 5 shows the \(M F_1\) performance comparison of the flat LR approach against our best proposed approach (i.e., Global-INF) with the threshold \((\tau )\) varied over the interval [\(\mu \), \(\mu + 3\sigma \)] (performance deteriorates after \(\mu + 3\sigma \)) with step size 0.1\(\sigma \) for the CLEF and DMOZ-SMALL datasets. We chose these datasets for evaluation because they have different data characteristics. The CLEF dataset is well balanced and does not suffer from the rare categories issue, whereas the DMOZ-SMALL dataset is highly imbalanced and the majority of its classes are rare categories (i.e., having \(\le \)10 examples), as shown in Fig. 2a. To identify the set of inconsistent nodes in the hierarchy, we compare the computed \(f_n^*\) value of each internal node with the chosen threshold \((\tau )\) and mark the node as inconsistent iff \(f_n^* > \tau \). It can be seen from the figure that for the CLEF dataset, performance improves as the threshold \(\tau \) decreases, suggesting that \(\tau \) should be kept small, i.e., removing more internal nodes from the hierarchy (approaching a flat structure) can lead to better results. However, for the DMOZ-SMALL dataset, performance first increases and then decreases, with the maximum achieved at \(\tau = \mu + 1.8\sigma \). This behavior suggests that for imbalanced data distributions with a potentially large number of rare categories, we should generally keep the threshold higher: it helps to leverage the hierarchical information while reducing error propagation by removing inconsistent nodes. The best threshold for a specific dataset can be chosen empirically using a small validation set, as done in this study.

To further understand the behavior of the hierarchy modified using the Global-INF approach, we analyzed the datasets in terms of the level-wise fan-out (\(\#\) children) in the hierarchy before and after removing inconsistent nodes. We can see from Fig. 6 that for both datasets most flattening takes place at the higher levels of the hierarchy, which results in increased fan-out. The reason for inconsistencies at higher levels is the presence of many dissimilar classes beneath each node, which makes it comparatively difficult to learn generalized classifiers, resulting in higher \(f_n^*\) values (i.e., nodes marked for removal).

Table 8 Level-wise error for TD-LR and Global-INF approach

6.4 Level-wise misclassification error

Table 8 shows the level-wise error analysis for TD-LR and our best approach, Global-INF, on the CLEF and DMOZ-SMALL datasets. We can see that at the higher levels, the Global-INF approach misclassifies fewer examples (shown in the #ME column), which results in less error propagation down the levels and hence better overall performance. This experiment supports our hypothesis that the Global-INF approach identifies a better set of inconsistent nodes, which helps in minimizing error propagation. Results for the other top-down baselines lie between these two and are not shown for brevity.

Table 9 \(\hbox {MF}_1\) and \(\hbox {hF}_1\) performance comparison between Global-INF and flat baselines with varying distribution of training examples per class
Fig. 7 Performance improvement of hybrid models over the baseline TD-LR model. For the DMOZ-2010 dataset, the \(\hbox {h}F_1\) score is not available from the online web portal and hence is not reported. a \(\hbox {M}F_1\). b \(\hbox {h}F_1\)

Fig. 8 Prediction runtime comparison (in mins) between the TD, flat and hybrid approaches. Image datasets show only small differences and hence are omitted. a Small datasets. b Large datasets

6.5 Comparison to flat baselines

Table 9 shows the \(\hbox {MF}_1\) and \(\hbox {hF}_1\) performance comparison of our best approach, Global-INF, against the flat baselines, the LR and ECOC approaches. For easier analysis, we have shown the results for datasets separated by varying distributions of training examples. (For evaluating the DMOZ-2010 and DMOZ-2012 datasets, we used a separate held-out dataset because the true test labels are not available from the online evaluation.) We show the results for \(\hbox {MF}_1\) because it gives equal importance to all classes and hence better reflects the results for datasets with skewed distributions. For computing \(\hbox {hF}_1\), we used the original hierarchy for consistent evaluation. Moreover, we perform significance tests on the overall results (denoted by avg.) for each dataset. As we can see from the table, the LR approach outperforms the Global-INF approach for the CLEF, DIATOMS and IPC datasets because these datasets are well balanced and have a smaller number of categories to be discriminated, which are easily separated by linear classifiers. However, for the DMOZ datasets, our Global-INF approach performs better because the hierarchical structure provides useful information for categorizing rare categories, which make up more than 75\(\%\) of the classes in these datasets, as shown in Fig. 2. The ECOC approach has the worst performance because the codewords used in our experiments are chosen randomly, and merging of categories may require nonlinear discriminants instead of the linear classifiers used in this paper.

6.6 Hybrid prediction

Figure 7 shows the hybrid prediction results for the different LR models. It can be seen from the figure that the hybrid method of prediction gives consistently better performance than the individual classifiers (Table 4). For example, the CLEF dataset shows an improvement of \(\sim \)15\(\%\) in M\(F_1\) score with the hybrid model TD-LR + LR compared to the hierarchical model TD-LR. Overall, the Global-INF + LR model has the best performance across all datasets. Although hybrid models perform better, one major drawback of the hybrid approach is its expensive prediction time (shown in Fig. 8), because it invokes the models for all leaf categories (for the flat prediction) as well as the models on the relevant top-down path (for the hierarchical prediction).

Table 10 Total training runtime comparison (in mins) between Global-INF and LR approach

6.7 Computational run time

Although the flat LR approach outperforms the Global-INF approach on some datasets in terms of classification performance, its prediction runtime is significantly higher and can be untenable for large-scale problems [3, 23]. The prediction runtime comparison of the Global-INF and LR approaches, along with the hybrid approach Global-INF + LR, is shown in Fig. 8. As expected, the Global-INF approach has a comparatively lower prediction runtime (up to 4x improvement). The difference is significant for the large-scale datasets (DMOZ-2010 and DMOZ-2012). For completeness, we also report the total training runtime in Table 10. The Global-INF approach has a higher training runtime due to the overhead of retraining classifiers after hierarchy modification and of training one-vs-rest binary classifiers for internal nodes in addition to leaf categories. (Note that for the hybrid approach, Global-INF + LR, the training runtime is the same as for the Global-INF approach.) Nevertheless, both flat and top-down approaches are trivially parallelizable due to the decoupling (i.e., no interactions) between the classifiers learned at different nodes of the hierarchy. For reporting training runtime, we trained classifiers in parallel across multiple compute nodes in the cluster and summed up the time taken at each node. In our experiments, we chose the more expensive one-vs-rest binary classifiers over the comparably cheaper one-vs-sibling binary classifiers because our preliminary experiments showed better results with the one-vs-rest approach. It should also be noted that there is no significant difference between the prediction and training runtimes of the different top-down approaches, and hence we do not report them here.

7 Conclusion and future work

In this paper, we proposed two different approaches for hierarchy modification that restructure the hierarchy by flattening the most prominent set of inconsistent nodes, thereby producing a hierarchy representation better suited for HC. Performance evaluation on a wide range of datasets shows improved classification results with the proposed modified hierarchies because fewer examples are misclassified at the higher levels, resulting in less error propagation. Comparison of our proposed approach with competitive hierarchy modification approaches in the literature showed significant performance improvement, supporting the hypothesis that our approach identifies a better set of inconsistent nodes. We also performed experiments comparing our approach with the flat approach under varying distributions of training examples per category. The results demonstrated the usefulness of leveraging hierarchical information for classifying classes with few training examples. Further, we showed that combining the flat and top-down predictions in an ensemble setting yields much better performance, but at the cost of expensive prediction time.

In the future, we plan to study the effect of our modified hierarchy on various state-of-the-art HC approaches [3].