Hierarchical Multi-Label Classification For Large Scale Data
Hierarchical Multi-Label Classification For Large Scale Data
Hierarchical Multi-Label Classification For Large Scale Data
1) Asst. Prof. Peerapon Vateekul, Ph.D. Dept. of Computer Eng., Chulalongkorn Univ.
2) Prof. Boonserm Kijsirikul, Ph.D. (Mentor) Dept. of Computer Eng., Chulalongkorn Univ.
January 2020
Contact No: TRG5780220
Final Report
1) Asst. Prof. Peerapon Vateekul, Ph.D. Dept. of Computer Eng., Chulalongkorn Univ.
2) Prof. Boonserm Kijsirikul, Ph.D. (Mentor) Dept. of Computer Eng., Chulalongkorn Univ.
Abstract ...............................................................................................................................1
Executive summary ............................................................................................................. 2
1) Introduction to Research ................................................................................................4
2) Literature review ............................................................................................................. 5
2.1) Problem Statement in Hierarchical Multi-Label Classification ...........................5
1.1 Existing Strategies ..............................................................................................6
1.2 Performance Evaluation .....................................................................................7
3) Objective .........................................................................................................................8
4) Research Methodology and Results .............................................................................. 8
4.1) SVM Based Solution...........................................................................................8
4.2) Deep Learning Based Solution ............................................................................19
5) Output ...........................................................................................................................28
5.1) International Journal Publication ..........................................................................28
5.2) Research Utilization and Application ...................................................................28
5.3) Others e.g. national journal publication, proceeding, international conference,
book chapter, patent ....................................................................................................29
6) Reference......................................................................................................................30
Appendix (List of Research Outputs)................................................................................32
Abstract
1
Executive summary
Hierarchical Multi-Label Classification (HMC) is a complex type of classification
problem, where each example can be annotated into more than one label, and the labels
are hierarchically organized. It has received a lot of attentions due to its need in broad
domains of applications. In the text domain, this problem is also known as Hierarchical
Text Categorization (HTC). There are many challenging issues: (1) the size of text corpora
is usually large, e.g., Wikipedia, Dmoz, WIPO, etc. (2) the number of classes in the
hierarchy is large with class dependencies.
This research project aims to propose a novel algorithm for HMC, especially in the
domain of application in “text categorization”. There are many research outputs based on
modern algorithms: Support Vector Machines (SVM) and Deep Learning techniques,
along with label correction strategy to further improve an accuracy. As an output of this
project, there are four outcomes: one internal journal and three internal conference
proceedings. Furthermore, there is also an open-source software publicly available for
non-commercial.
First, we propose a technique based on the flat approach by generating one SVM
per each leaf node in the hierarchy. There are two proposed strategies including “Top-
Level Pruning (TLP)” and “Exclusive Top-Level Training Policy (ETT).” The prior one, TLP,
helps to reduce the number of candidates (leaf-node classifiers) by adding top-level
classifiers to prune unrelated branches resulting in a smaller set of candidates. The latter
strategy, ETT, is a method that uses the top-level data to construct a binary training set.
Comparing to use the whole data for each class node, ETT can generate a smaller and
less imbalanced training set.
Second, we propose a technique called “Label Correction” to correct an unusual
prediction by taking a class correlation into account resulting in F1-improvement. There
are three modules in the proposed system: (i) label correction, (ii) performance
enhancement, and (iii) hierarchical path adjustment. The results show that the model with
our label correction significantly outperforms baseline techniques without any corrections.
Third, we propose a Deep Learning based algorithm called “Level-Based Neural
Networks with Sharing Layer Information (SHL-NN)”. The model is based on a multiclass
classifier per each class level in the hierarchy, which requires a smaller number of
classifiers than the other HMC approaches (top-down and flat). At first, we use Word2Vec
model training from sentences to represent all words in the sentences into word vectors.
Then, we applied the convolutional neural network for features extraction from word
vectors which is sequentially combined to be the same as the word order. Next, the model
is enhanced specifically for HMC by considering information from the previous hierarchical
level (layer) called “Sharing Layer Information”. Finally, we perform the label correction to
fix the prediction.
2
Fourth, it is also based on Deep Learning model by focusing on an improvement of
our previous attempt “SHL-NN” since it is based on a sequence of unsupervised word
embedding vectors, so the performance should be limited. In this work, we propose a
supervised document embedding specifically designed for hierarchical text categorization
based on Autoencoder, which is trained from both words and labels. To enhance the
embedding vectors, the document embedding strategies are invented to utilize the class
hierarchy information in the training process. To transfer the prediction result from the
parent classes, the shared information technique has been improved to be more flexible
and efficient.
3
1) Introduction to Research
The amount of data in the world has been exploding caused by the rise of multimedia,
social media, and the Internet. Particularly, the volume of World Wide Web (WWW)
information is growing annually at minimum rate of 59%; therefore, it is necessary to have
an appropriate indexing scheme to organize the data into categories. However, manual
creation of such scheme is impractical due to the size of data and the costs of human
experts. This is why some machine-learning scientists have proposed a classification
technique to automate the process.
Traditional classification approach assumes that each example should be assigned
to one (and only one) out of two or more classes. However, this scenario digresses from
real-world applications in two aspects. First, each example can belong to many classes
simultaneously (multi-label classification). This aspect commonly appears in many areas.
For example, each movie can belong to more than one genre, e.g., Harry Potter is
adventure and also fantasy movie. Likewise, a book can be categorized into multiple
topics: engineering, technology, science, computer, etc. Second, the class can be
hierarchically ordered (hierarchical classification). When the number of classes is large, it
is helpful to organize them into groups and subgroups. For instance, large collections of
web pages can be handled in conceptual hierarchies such as the one from Figure 1. This
hierarchical arrangement is essential because it allows users to focus only on the level of
details, they are interested in. The task seeking to address both of these issues is called
“hierarchical multi-label classification” (HMC).
Root
Figure 1. Parts of top-level concepts in the Dmoz Open Web Directory project
There are many applications which can benefit from HMC including: gene function
prediction [1-3], image recognition [4], and web repositories/digital libraries (e.g., Dmoz,
Wikipedia1, Yahoo [5, 6], EUROVOC [7], Reuters [8], OHSUMED [9]). In this research,
we focus on an application in “text categorization,” particularly on the data from Wikipedia.
Along with their widespread use, comes the need for automated classification of new
documents to the categories in the hierarchy. As the size of the hierarchy grows and the
1The data sets from Wikipedia (www.wikipedia.org) and the Dmoz Open Web Directory (www.dmoz.org) are available through the Pascal
Challenge on Large Scale Hierarchical Text Classification (LSHTC).
4
number of documents to be classified increases, a number of interesting machine learning
problems arise.
There are 3 main challenges in processing the Wikipedia corpus. First, some classes
in the hierarchy have more than one parent, in which case the mutual relationships have
to be described by a hierarchy structured as “a directed acyclic graph (DAG).”
Unfortunately, most of the previous HMC techniques support only the tree structured
hierarchy, where each class can strictly inherit from one parent. Second, traditional
classification techniques tend to perform low prediction accuracy. In the HMC domain, it
is common to induce a separate binary classifier for each class in hierarchy. As a result,
some of these classifiers have to induce from “imbalanced training sets,” where one class
outnumbers the other – a circumstance known to hurt the performance. Third, the size of
the Wikipedia corpus is so large that it cannot be processed by any classical supervised
algorithms since most of them are memory-resident, while it is impossible to load all the
data into memory at once. Moreover, the computational cost is also prohibitively
expensive. In particular, the largest data set (based on Wikipedia) contains approximately
2 million documents and 1.6 million features. The class hierarchy consists of more than
300,000 categories on 729 levels!
2) Literature review
2.1) Problem Statement in Hierarchical Multi-Label Classification
Multi-Label classification is a domain where an example can be associated to multiple
classes simultaneously. Let 𝑋 ⊆ ℝ% be an example space of n-dimensional features,
and 𝑌 be a finite set of classes. Let 𝐷 be a set of training examples of (+++⃗,𝑥* 𝑦/ ) which
contains a feature vector 𝑥 +++⃗* ∈ 𝑋 and belongs to multiple classes 𝑦/ ⊆ 𝑌. The goal
of multi-label classification is to find the classifier 𝑔: 𝑋 → 26 that maps example 𝑥 to
set of classes 𝑦. This kind of tasks can be encountered in various domains, including
image, text, biology, audio, and video. For example, movie genre classification is multi-
label classification because each movie can belong to more than one genre, e.g., both
adventure and also fantasy movie simultaneously.
Hierarchical Multi-Label classification (HMC) refers to a task that follows a multi-label
assumption, and classes are organized in a hierarchical structure. In addition, if an
example belongs to a class ci, it must belong to all ci’s superclasses. This property is
called “a hierarchical constraint”. Let 𝐻 = (𝐶, 𝛦) be a class hierarchy consisting of a
set of class nodes, 𝐶 ⊆ 𝑌, and a set of edges, 𝛦. Let ≺ represents superclass
relationship, and classes {𝑐? , 𝑐@ } ⊂ 𝐶; 𝑐? ≺ 𝑐@ if and only if 𝑐? is a superclass of
𝑐@ . Denote (𝑥 +++⃗,* 𝑦/ ) as a tuple of example, which +++⃗
𝑥* is a feature vector and 𝑦/ is a
set of classes. Note that 𝑦/ must satisfy the hierarchical constraint if 𝑐? ≺ 𝑐@ and
5
𝑐@ ∈ 𝑦/ then 𝑐? ∈ 𝑦/ . The HMC goal is to find the classifier 𝑔: 𝑋 → 26 that maps
example 𝑥 to set of classes 𝑦 such that 𝑔(𝑥) follows the hierarchical constraint.
Recently this domain has gained a lot of attentions in various applications including
functional genomics, image annotation, and text classification.
Two versions of this task exist. In the mandatory leaf-node problem (MLNP), only the
leaf-node classes are used. By contrast, in the non-mandatory leaf node problem
(NMLNP), an example can be labeled with any class from the given class hierarchy.
Considering the class hierarchy from Figure 2, MLNP permits an example to be labeled
only with a subset of { C1.1, C1.2, C2.1, C2.2.1, C2.2.2 }, but NMLNP allows also the
other class labels (e.g., C1 or C2.2).
In this research, we focus on the hierarchy from Wikepedia. The hierarchical structure
is DAG. Each example can belong to many classes and must be assigned to only classes
at leaf nodes, MLNP. From our preliminary analysis, cyclic relationships are found in the
hierarchy; thus, they are needed to be removed in order to preserve the property of DAG.
1 2
6
1.1.2 Top-Down Approach
This is the most common approach in the HMC domain. One or more classifiers are
trained for each (i) class node or (ii) class level of the hierarchy. This produces a tree of
classifiers in the tree structure or a graph of classifiers in the DAG structure. In the training
process, those classifiers are constructed from “local information”. In the testing process,
each classifier collaborates to predict an example in a top-down fashion.
Sun and Lim [11] applied this strategy to text classification where the class hierarchy
was a plain tree structure. For each class, they induced two SVMs: a local classifier and
a subtree classifier. An example is labeled as Ci by the local classifier, while the latter
one decides whether or not this example should be passed to Ci's subclassifiers.
Seeking to further improve the performance, Secker et al. [12] used several different
induction algorithms for each node of the hierarchy: Naive Bayes, SMO, 3-NN, etc. For
each node, ten classifiers were trained, and the one with the best classification results
was selected. This improved classification accuracy, but the computational costs were
even higher than in the previous attempt.
1.1.3 Big-Bang Approach (Global Approach)
Departing from the idea to induce a separate binary classifier for each node, some
authors preferred to induce one big (global) classification model to cover the entire class
hierarchy. This may have certain advantages: the size of the global classification model
is usually less than the sum of the node classifiers, and mutual interdependencies of the
classes are more easily taken into account.
A good example of the big-bang approach is the one reported by Clare and King [13]
who developed a hierarchical extension to the decision-tree generator C4.5 [14] and
applied it to functional-genomics data (they called their system HC4.5). To give higher
priority to more specific classes, they introduced a mechanism to weigh the entropy
formula accordingly.
Blockeel et al. [1] proposed another attempt called Clus-HMC, which is a decision
tree based algorithm to handle the HMC task. Their algorithm is a hierarchical version of
an earlier predictive clustering tree (PCT) [15]. Ven et al. [3] improved Clus-HMC, so that
it could be used in DAG-specified class hierarchies. An ensemble version of the algorithm,
Clus-HMC-ENS, was proposed by Schietgat et al. [2].
1.2 Performance Evaluation
We used two evaluation metrics to compare each model, F-measures with macro-
and micro-averaging for a multi-label purpose. They are defined as the performance
criteria as shown in Table 1. The macro-averaging is appropriate for measurement when
we want to ignore the overcoming of majority classes. In opposition, micro-averaging is
used to measure its globally counts.
7
A comparison in HTC often use F1 macro to be the main evaluation metric. For
example, the LSHTC challenge on Kaggle, the competition in HTC, also used F1 macro
to be a criterion for finding the winner of the competition.
|C | |C | |C |
Pr = TP (TP + FP )
1
MaPr = |C | å Pri MiPr = å TPi å (TPi + FPi )
i =1 i =1 i =1
|C | |C | |C |
Recall
Re = TP (TP + FN )
1
MaRe = |C | å Rei MiRe = å TPi å (TPi + FNi )
i =1 i =1 i =1
2 |C | 2
( b + 1) ´ Pr ´ Re ( b + 1) ´ MiPr ´ MiRe
1
å
F𝜷
Fb = MaFb = |C | Fb ,i MiFb =
2 2
b ´ Pr + Re i =1 b ´ MiPr + MiRe
3) Objective
- This research project aims to propose a novel hierarchical multi-label classification
(HMC) algorithm which supports large-scale data. It will focus on the domain of
application in “text categorization”.
- The proposed techniques should apply modern machine learning techniques, e.g.,
SVM and Deep Learning techniques.
4) Research Methodology and Results
In this research, there are two main proposed solutions.
- First, it is based on SVM using the flat approach in Section 2.2.1, where it
generates a binary classifier per each class in the hierarchy.
- Second, we further propose a better solution based on Deep Learning techniques
also using the top-down approach in Section 2.2.2. Anyway, it generates only a
multi-label classifier per each level in the hierarchy, so the number of classifiers
in the second solution (Deep Learning) is much less than that in the first solution
(SVM).
4.1)SVM Based Solution
Proposed Model
This work presents a novel flat-based hierarchical text classification that can achieve
promising prediction performance. There are two strategies including “Top-Level Pruning
(TLP)” and “Exclusive Top-Level Training Policy (ETT).” The prior one, TLP, helps to
reduce the number of candidates (leaf-node classifiers) by adding top-level classifiers to
prune unrelated branches resulting in a smaller set of candidates. The latter strategy,
ETT, is a method that uses the top-level data to construct a binary training set. Comparing
to use the whole data for each class node, ETT can generate a smaller and less
imbalanced training set. SVM is a baseline classifier employed in this work since it is
widely used and shows a good performance.
There are five main steps in the proposed method (Figure 3): (i) feature selection,
(ii) top-level pruning, (iii) similarity measure, (iv) training classifiers, and (v) classification.
The first step aims to select a small set of relevant features. The second step builds and
trains a classifier locally for each node in the top level. In the testing process, each
example is classified by these classifiers. If any of them predict it as negative (unrelated),
their corresponding branches with leaf-node classes can be pruned. Third, a cosine
similarity is calculated between the testing example and a vector of the remaining classes
from the previous step giving a set of candidates. The fourth step is to induce a classifier
for each candidate classes and, finally, predict the result.
9
The X2 has a natural value of zero if feature and class are independent. For each
feature, X2 scores are computed for all classes and, then, only the maximum value is
selected as in the following equation:
Next, the features are sorted by their X2 in descending order and removed if their
score cannot pass a predefined threshold. To avoid removing all of the features, the
threshold must be at least the minimum value of the feature used in the training set.
2) Top-Level Pruning: In the text corpus, the number of leaf-node classes in the
hierarchy is usually very large causing a high computational cost and low prediction
accuracy due to noisy classes. To avoid this issue, the method called “Top-Level Pruning
(TLP)” is proposed to apply a top-level classifier to prune its corresponding leaf-node
subclasses if a testing example is predicted as negative. For instance, Figure 4 shows
the original class hierarchy, where the nodes surrounded by a dashed rectangle are
candidate. With our strategy, TLP first induces top-level classifiers for classes c1, c2, and
c3. Next, they are used to classify a testing example. If it is predicted as negative by the
classifier c3, all of c3’s subclasses can be removed from the consideration as shown in
Figure 5. From the prelinary experiment, it showed that these top-level classifiers can give
high prediction accuracy since the training set is large enough and less imbalanced.
Sometimes a testing example is predicted as not belonging to any of the top-level
classifiers. If it happens, there are two possible solutions. The first one is called “TLP root
type (TLP (r))” by including only the root node and discarding all other subclasses in the
prediction (skip the next module). The second option is called “TLP candidate type (TLP
(c))” by ignoring the TLP strategy and using the result based on the similarity score in the
next module. Which option is better will be discussed in the experiment.
10
Figure 5. Example of top-level pruning hierarchy. The nodes surrounded by dashed
rectangles with round corners are leaf nodes which can be candidate classes of test
examples.
3) Similarity Measure: For each testing example, a set of candidate classes which
is likely to be a true class is selected based on a cosine similarity from the subset of leaf
nodes obtained from the prior step. Let A and B are vectors, the cosine similarity is
defined as below. First, a centroid vector for each leaf-node class is computed as an
average of the training documents which belong to that class. Next, a vector of testing
data is compared to all leaf-node vectors using the cosine similarity. Finally, the top
classes with the highest similarity score is chosen as a set of candidates.
4) Training Classifier: After obtaining a set of candidate classes, there are many
possbile methods to construct a binary classifiers. In this work, a novel training policy
called “Exclusive Top-Level Training Policy (ETT)” is proposed, and it is specifically
designed to cooperate with TLP (the second module of our framework). The notations are
defined as follows: the class ci, let FL(ci) be a set of classes in first level of the hierarchy.
ETT is defined as in the following equation:
T+(ci) = *(ci) ∪ T+(Child(ci)) and T-(ci) = T(FL(ci))\ T+(ci)
The ci’s positive training set includes all examples whose most specific class is ci
union with ci’s subclasses. The ci’s negative training set is a set of examples belonging
to the first-level class in the hierarchy which has ci as its subclass excluding ci’s positive
examples. Comparing to other policies, ETT is less imbalanced than EAT and more
effective than EPT since unseen examples (from the different branches) are already
filtered out by our pruning strategy, TPL, at the top level.
5) Classification and Evaluation: In the final step, the candidate classifiers built in
the previous step is used to predict as to whether or not the testing example truly belongs
to them.
11
Results
The experimental data is extracted from the Wikipedia dataset from Large Scale
Hierarchical Text Classification Challenge. Its statististics is illustrated in Table 2 showing
that the number of features is larger than the number of examples. SVM is our baseline
classifier, where the implementation is employed from a publicly available tool, LIBSVM
[18].
Table 2. Statistics of small subset of Wikipedia dataset
#Doc #Classes #Features Max Depth
421 43 6,372 5
The experimental results are demonstrated in Table 3. The winner method is our
approach with an option of “ETT + TLP (r)”. It showed the highest performance in all
metrics. This shows that our proposed strategies can improve the model accuracy
including (1) “Exclusive Top-Level Training Policy (ETT)” and (2) “Top-Level Pruning
(TLP)”
Conclusion
In this work, we present a hierarchical classification framework based on the flat
approach for the text categorization domain, where the number of classes and levels in
the hierarchy can be very large. The proposed method utilizes data at the first level of
the hierarchy in two aspects by adding the Top-Level Pruning classifiers (TLP) and
constructing a training set using Exclusive Top-Level Training policy (ETT). The
experiment was conducted on a subset of the Wikipedia dataset. The result showed that
our method (TLP & ETT) is the winner and it is statistically significant better than baseline
methods.
2 The results reported in this table is summarized from the total results in Table 5 of 16. Phachongkitphiphat, N. and P. Vateekul. An
improvement of flat approach on hierarchical text classification using top-level pruning classifiers. in 2014 11th International Joint Conference on
Computer Science and Software Engineering (JCSSE). 2014..
12
4.1.2) Label Correction Strategy
The main drawback of our previous work in Section 0 is that the top-down approach
always suffers from error propagation and yields such a poor performance of classifiers
at the lower levels. In this work, we present a novel method called “label correction”,
which takes label correlation into consideration and corrects the results of unusual
prediction patterns. The outcome of this work is a publication in an internaltional
conference in Scopus as shown details below [19]:
Ananpiriyakul, T., P. Poomsirivilai, and P. Vateekul. Label Correction Strategy on
Hierarchical Multi-Label Classification. in Machine Learning and Data Mining in Pattern
Recognition. 2014. Cham: Springer International Publishing.
Proposed Model
There are three main modules in the proposed system (Figure 6): (i) label correction,
(ii) performance enhancement, and (iii) hierarchical path adjustment. The first module
aims to correct a prediction label set to match the closest label set in the training data.
Then, the second module ensures that the corrected label sets do not lessen the overall
system performance. There is a selection heuristic to correct only some classes; not all
classes and excluding classes whose performance is decreased after the correction.
Finally, the third module is only applied in the hierarchical domain in order to adjust the
result to satisfy the hierarchical constraint.
1) Label Correction: A set of distinct label sets is extracted from the training data.
Each of them is a vector of +1 (positive) and -1 (negative). Then, a distance (cost)
between each prediction to the distinct label sets is computed. Finally, if the prediction
result does not match to any label sets, it is corrected to match to the closest one (the
smallest cost). There are two choices of how to compute the cost: (i) constant cost and
(ii) score cost. In the first equation below, the cost is the total number of labels with
different sign between a prediction vector and a training label set vector. In the second
equation below, a vector of SVM scores is used instead of a vector of predictions. The
cost is a summation of square root of the absolute score of classes with different sign.
13
Figure 7 illustrates an example of how to compute the cost in the label correction
module. Assume there are 5 classes, L = {A, B, C, D, E}, with 4 distinct label sets,
S = {A, AB, BE, ACE}. Given a testing example x, if a vector of SVM prediction scores is
[-0.2, 0.1, 1.5, -0.8, 0.4], a vector of predictions will be [-1, +1, +1, -1, +1] using a sign
function. Unfortunately, this prediction pattern does not match any label sets, so this result
should be corrected. Using the constant cost, the new prediction is BE with the cost of 1,
while the new result is ACE with the score cost of 0.76. Whether distance cost function
is preferred will be investigated in the experiment.
Class = {A, B, C, D, E}
Distinct Set = {A, AB, BE, ACE}
SVM Score Vector = [-0.2, 0.1, 1.5, -0.8, 0.4]
SVM score
vector -0.2 0.1 1.5 -0.8 0.4 costconstant costscore
distinct set
A [+1,-1,-1,-1,-1] √0.2 √0.1 √1.5 0 √0.4 4 2.62
AB [+1,+1,-1,-1,-1] √0.2 0 √1.5 0 √0.4 3 2.30
BE [-1,+1,-1,-1,+1] 0 0 √1.5 0 0 1 1.22
ACE [+1,-1,+1,-1,+1] √0.2 √0.1 0 0 0 2 0.76
From 1st equation, the lowest costconstant is 1 and the prediction result should be BE.
From 2nd equation, the lowest costscore is 0.76 and the prediction result should be ACE.
2) Performance Enhancement: In the first process, the prediction results are forced
to exactly match one of the distinct label sets. It is possible that the performance of some
classes may be dropped. The performance enhancement module is proposed to
guarantee that this scenario has never happened; therefore, there is always an
improvement in the overall system performance – in terms of F1.
All the steps of this module are conducted on the training data. First, F1-score for
each class is evaluated on the baseline system, SVM-BR (SVM Binary Relevance without
label correction). Second, our label correction (1st module) is applied and, then, F1-score
for each class is reevaluated. Finally, if the performance of any classes is dropped after
using the label correction, this module will keep those classes’ result of the original system
without any correction.
A mechanism behind this module is illustrated via the below figures. Figure 8 shows
F1-score of each class before applying the label correction, where the macro-averaging
F1 is 0.472. After applying our strategy, F1-result for each class is shown in Figure 9.
Figure 9 (a) and (b) are the results without and with applying the performance
enhancement module. In Figure 9 (b), this module excludes classes A and B from the
label correction since their F1-scores are dropped as in Figure 9 (a). Because some
classes are excluded from label correction, the hierarchical constraint may not be
14
satisfied. In this case, the next phase will adjust the result to guarantee the hierarchical
constraint.
A
0.8
B C
0.55 0.62
D E F
0.16 0.12 0.58
A A
0.76 ↓ 0.8
B C B C
0.49 ↓ 0.65 ↑ 0.55 0.65 ↑
D E F D E F
0.44 ↑ 0.56 ↑ 0.62 ↑ 0.44 ↑ 0.56 ↑ 0.62 ↑
(a) F1-score without performance enhancement (b) F1-score with performance enhancement
Figure 9. An example of performance enhancement process. ↑ and ↓ refer to as
increase or decrease in terms of F1 comparing to SVM-BR in Fig. 3. The gray node is a
class, whose F1 is dropped.
3) Threshold Adjustment: To adjust the result along the hierarchical path, there are
three proposed strategies. First, the decreasing strategy is to follow the superclass’
prediction. All subclasses’ results are assigned to be negative when their superclassifier
predicts as negative. Second, the increasing strategy is to depend on the subclass’ result.
If the subclassifier predicts as positive, the results of its ancestors must also be positive.
Third, the voting strategy is whether adjust the result depending on the majority class of
the prediction. On the one hand, if the majority class along the path is negative, the result
is adjusted using the decreasing strategy. On the other hand, the prediction is modified
using the increasing strategy.
Figure 10 illustrates an example of the hierarchical adjustment for the class path {A ,
B, C}. Figure 10 (a) is the result without the adjustment as {Yes, No, Yes}, where the
results of classes B and C disagree and violate the hierarchical constraint. The results of
Figure 10 (b) and (c) are {Yes, No, No} and {Yes, Yes, Yes} adjusted using the decreasing
and increasing strategies. Since most of classifiers predict as “Yes”, the voting strategy
in Figure 10 (d) gives the same result as in the increasing strategy, {Yes, Yes, Yes}.
15
A A
Yes Yes
B C B C
No Yes No Yes
D E F D E F
Yes No Yes No No Yes
B C B C
Yes Yes Yes Yes
D E F D E F
Yes No Yes Yes No Yes
Results
For the HMC domain, we compare three hierarchical path adjustment strategies: (1)
decreasing, (2) increasing, and (3) voting as mentioned. SVM-BR is considered as our
baseline (without label correction). There are 10 HMC datasets, and their results are
provided in Table 4 and
16
Table 5. Table 4 demonstrates that the best hierarchical adjustment is clearly “the
voting strategy”. On the wipo dataset in
17
Table 5, it has significantly increased from SVM-BR (baseline) for 75%.
18
Table 5. F1-results on HMC datasets. ELC stands for “Enhanced Label Correction”. The
number in parentheses is a percentage (%) comparing to SVM-BR. * represents a
significant difference on a paired t-test with α=0.05. The boldface method is a winner
on that dataset.
Method
SVM-BR Decreasing ELC Increasing ELC Voting ELC
Dataset
D0_yeast_GO 0.493 0.502 (+2) 0.511 (+4)* 0.511 (+4)*
D15_scop_GO 0.435 0.499 (+15)* 0.507 (+17)* 0.511 (+18)*
D16_struc_GO 0.313 0.333 (+6)* 0.378 (+21)* 0.377 (+21)*
D17_interpro_GO 0.159 0.180 (13)* 0.182 (+14)* 0.186 (+17)*
diatoms 0.431 0.618 (+43)* 0.644 (+50)* 0.644 (+50)*
enron 0.336 0.339 (+1) 0.347 (+3) 0.347 (+3)
imclef07a 0.668 0.728 (+9)* 0.745 (+11)* 0.745 (+11)*
imclef07d 0.722 0.732 (+1) 0.764 (+6)* 0.764 (+6)*
reuters 0.540 0.564 (+5) 0.574 (+6) 0.574 (+6)
wipo 0.220 0.386 (+75)* 0.386 (+75)* 0.386 (+75)*
Conclusion
In this work, we propose a technique called “Label Correction” to correct an unusual
prediction by taking a class correlation into account resulting in F1-improvement. There
are three modules in the proposed system: (i) label correction, (ii) performance
enhancement, and (iii) hierarchical path adjustment. Comparing to SVM-BR, our method
significantly won on all ten HMC datasets.
19
a neural network is constructed, where its input is a sequence of word embedding vectors
generated from Convolutional Neural Networks (CNN). Also, a training strategy to avoid
imbalance issues is proposed called “the balanced resampling with mini-batch training.”
Furthermore, a label correction strategy is proposed to conform the predicted results from
all networks on different class levels. The outcome of this work is a publication in an
internaltional journal in Scopus (Q4) as shown details below [20]:
Klungpornkun, M. and P. Vateekul, Hierarchical text categorization using level based
neural networks of word embedding sequences with sharing layer information. Walailak
Journal of Science and Technology, 2019. 16(2): p. 121-131.
Proposed Model
Figure 11 displays the brief workflow of overall process from the sentences to the
prediction. At first, we use Word2Vec model training from sentences to represent all words
in the sentences into word vectors. Then, we applied the convolutional neural network for
features extraction from word vectors which is sequentially combined to be the same as
the word order. In our work, we propose a novel level-based neural network to perform
the Hierarchical Text Categorization (HTC) in the prediction process with sharing layer
information in topic A, namely Shared Hidden Layer and Shared Pooling Layer. Finally,
we perform the label correction to fix the prediction as described in topic B and also apply
the training strategy to enhance the accuracy of the NN in the last topic.
20
features. We could interpret that they capture the same short sentence and sum up the
words into a feature. Therefore, with this technique, we use this model on our HTC
architecture to extract the features from sentences.
21
Figure 14. Architecture of Shared Pooling Layer model.
Result
The experiment was conducted on three datasets: WIPO-D, WIPO-C, and Wiki-small.
We examine our novel models in many variations, and the results are showed in
22
Table 6. Among variants of our proposed model, SHL-NN is the winner since it
outperforms others in 2 out 3 datasets.
HR-SVM [21] is a baseline model, an adaptive support vector machine for HMC. To
evaluate the model, we tested using the same training and testing sets with TF-IDF
features. Table 7 shows that our proposed model (SHL-NN) is the best of all micro-
averaging F1, but not in the macro-averaging F1. From our further investigation, we found
that our model performs remarkably well in the top level and the performance drops
quickly in the lower level. It shows that the neural network method still faces the blocking
problem and could be improved in macro-average results if we use more models or
features like bypass techniques for lower hierarchies in further experiments.
23
Table 6 Results of hierarchical text categorization. Boldface result is the winner on that
averaging score.
Macro-Averaging Micro-Averaging
Dataset Model
Precision Recall F1 Precision Recall F1
WIPO-D FC-NN 0.0412 0.0246 0.0287 0.6776 0.3303 0.4440
MLP-NN 0.0379 0.0271 0.0295 0.5548 0.3225 0.4073
SHL-NN 0.0399 0.0285 0.0309 0.5529 0.3222 0.4070
SPL-NN 0.0409 0.0223 0.0266 0.7030 0.3110 0.4311
WIPO-C FC-NN 0.0121 0.0116 0.0110 0.3740 0.3200 0.3376
MLP-NN 0.0167 0.0090 0.0109 0.5816 0.3227 0.4151
SHL-NN 0.0184 0.0099 0.0120 0.5589 0.3219 0.4084
SPL-NN 0.0142 0.0113 0.0118 0.4545 0.3281 0.3809
Wiki-small FC-NN 0.1208 0.1036 0.1035 0.5276 0.5436 0.5352
MLP-NN 0.1623 0.1157 0.1256 0.5820 0.5567 0.5690
SHL-NN 0.1599 0.1008 0.1143 0.6093 0.5374 0.5709
SPL-NN 0.1187 0.1037 0.0997 0.4812 0.5558 0.5155
Boldface result is the winner on that dataset.
Table 7 Comparison with existing methods. Boldface result is the winner on that
averaging score.
Dataset SHL-NN HRSVM
Macro-F1 Micro-F1 Macro-F1 Micro-F1
WIPO-D 0.0309 0.4070 0.0569 0.2081
WIPO-C 0.0120 0.4084 0.0316 0.1074
Wiki-small 0.0620 0.5099 0.2254 0.2126
Conclusion
In this work, we describe a series of experiments on the hierarchical text
categorization problem with a neural network using sharing layer information built on top
of word vectors. From experiments, the child-based label correction is the approach that
prevents the blocking problem and could correct more top levels prediction. Also, the
balanced resampling with mini-batch training is proper for avoiding the imbalanced data
situation for HTC. The proposed method (SHL-NN) can achieve the highest micro-
averaging F1.
4.2.2) Encoded Shared Layer Neural Network (ESL-NN)
In our previous work called Shared Hidden Layer Neural Network (SHL-NN), it has
shown that sharing information between levels can improve a performance of the model.
However, this work is based on a sequence of unsupervised word embedding vectors, so
the performance should be limited. In this work, we propose a supervised document
embedding specifically designed for hierarchical text categorization based on
Autoencoder, which is trained from both words and labels. To enhance the embedding
vectors, the document embedding strategies are invented to utilize the class hierarchy
24
information in the training process. To transfer the prediction result from the parent
classes, the shared information technique has been improved to be more flexible and
efficient. The outcome of this work is a publication in an internaltional conference in
Scopus as shown details below [22]:
Saetia, C. and P. Vateekul. Enhance Accuracy of Hierarchical Text Categorization
Based on Deep Learning Network Using Embedding Strategies. in 2018 15th International
Joint Conference on Computer Science and Software Engineering (JCSSE). 2018.
Proposed Model
The workflow of an overall process from raw documents to their predictions is shown
in Figure 15. First, DocTag2Vec is employed to convert from a raw document to an
embedding vector. Then, the embedding vectors of training documents are fed to the
proposed model called “Encoded Shared Layer Neural Network (ESL-NN)”. Finally, a label
correction is applied to fix an inconsistence prediction in terms of hierarchy.
Result
From the experimental result as shown in Table 8, ESL-NN is the winner and achieves
the best performance in both F1-micro and F1-macro on almost all datasets. As shown
in
26
Table 9, in terms of F1 macro, ESL-NN achieves 14.53% higher than SHL-NN.
Table 8. Performance comparison: ESL-NN and “SHL-NN + OPD-Vec” are ours; while,
“SHL-NN + Word2Vec” and HR-SVM are baselines.
Dataset Score Ours Baselines
ESL-NN SHL-NN + OPD-Vec SHL-NN + Word2Vec HR-SVM
27
Table 9. Percentage of improvements from original methods (columns) to our proposed
methods (rows). In parentheses, we show the number of dataset that the proposed
methods can achieved higher F1 macro than the original method significantly
SHL-NN + OPD-Vec SHL-NN + Word2Vec HR-SVM
ESL-NN 14.53 (3) 204.51 (3) 25.03 (3)
SHL-NN + OPD-Vec 0.00 (3) 166.59 (3) 9.18 (2)
Conclusion
In this work, we aim to propose a novel deep learning network for hierarchical text
categorization called “Encoded Shared Layer Neural Network (ESL-NN).” It is an
extension of our previous work called “Shared Hidden Layer Neural Network (SHL-NN).”
There are two main contributions in this work: (i) a supervised document embedding
strategy and (ii) a novel sharing information technique. The experiment was conducted
on three corpuses: WIPO-C, WIPO-D, and Wiki-small. The macro-F1 results show that
ESL-NN unanimously outperforms all baselines: SHL-NN and HR-SVM.
5) Output
5.1) International Journal Publication
One publication in an international journal in Scopus
- Klungpornkun, M. and P. Vateekul, Hierarchical text categorization using level
based neural networks of word embedding sequences with sharing layer
information. Walailak Journal of Science and Technology, 2019. 16(2): p. 121-
131.
5.2) Research Utilization and Application
We have implemented our HMC solution called HR-SVM [21] along with our label
correction strategy [19]. This project is free for download for non-commercial use at the
link https://sites.google.com/site/hrsvmproject/ as in the figure below.
28
5.3) Others e.g. national journal publication, proceeding, international conference,
book chapter, patent
- Three proceedings in international conferences in Scopus
- Phachongkitphiphat, N. and P. Vateekul. An improvement of flat approach on
hierarchical text classification using top-level pruning classifiers. in 2014 11th
International Joint Conference on Computer Science and Software Engineering
(JCSSE). 2014.
- Ananpiriyakul, T., P. Poomsirivilai, and P. Vateekul. Label Correction Strategy on
Hierarchical Multi-Label Classification. in Machine Learning and Data Mining in
Pattern Recognition. 2014. Cham: Springer International Publishing.
- Saetia, C. and P. Vateekul. Enhance Accuracy of Hierarchical Text Categorization
Based on Deep Learning Network Using Embedding Strategies. in 2018 15th
International Joint Conference on Computer Science and Software Engineering
(JCSSE). 2018.
29
6) Reference
1. Blockeel, H., et al., Decision trees for hierarchical multilabel classification: A case
study in functional genomics. Knowledge Discovery in Databases: Pkdd 2006,
Proceedings, 2006. 4213: p. 18-29.
2. Schietgat, L., et al., Predicting gene function using hierarchical multi-label decision
tree ensembles. Bmc Bioinformatics, 2010. 11: p. -.
3. Vens, C., et al., Decision trees for hierarchical multi-label classification. Machine
Learning, 2008. 73(2): p. 185-214.
4. Stenger, B., et al., Estimating 3D hand pose using hierarchical multi-label
classification. Image Vision Comput., 2007. 25(12): p. 1885-1894.
5. McCallum, A., et al., Improving text classification by shrinkage in a hierarchy of
classes. 1998.
6. Mladenie, D., Machine learning on non-homogeneous, distributed text data, in
Faculty of Computer and Information Science. 1998, University of Ljubljana.
7. Vateekul, P. and M. Kubat, Fast Induction of Multiple Decision Trees in Text
Categorization from Large Scale, Imbalanced, and Multi-label Data, in Proceedings
of the 2009 IEEE International Conference on Data Mining Workshops. 2009, IEEE
Computer Society: Miami, FL. p. 320-325.
8. Lewis, D.D., et al., RCV1: A New Benchmark Collection for Text Categorization
Research. J. Mach. Learn. Res., 2004. 5: p. 361-397.
9. Hersh, W., et al. {OHSUMED}: an interactive retrieval evaluation and new large
test collection for research. in SIGIR '94: Proceedings of the 17th annual
international ACM SIGIR conference on Research and development in information
retrieval. 1994. Dublin, Ireland: Springer-Verlag New York, Inc.
10. Silla, C. and A. Freitas, A survey of hierarchical classification across different
application domains. Data Mining and Knowledge Discovery, 2010.
11. Sun, A. and E.-P. Lim, Hierarchical Text Classification and Evaluation, in
Proceedings of the 2001 IEEE International Conference on Data Mining. 2001,
IEEE Computer Society. p. 521-528.
12. Secker, A.a.D., Matthew N and Freitas, Alex A and Timmis, Jon and Mendao,
Miguel and Flower, Darren R, An experimental comparison of classification
algorithms for hierarchical prediction of protein function. 3rd UK Data mining and
Knowledge Discovery Symposium UKKDD 2007 Canterbury, 2007: p. 13–18.
13. Clare, A., et al., Functional bioinformatics for arabidopsis thaliana. Bioinformatics,
2006(22 (9)): p. 1130-1136.
14. Quinlan, J.R., Induction of Decision Trees. Mach. Learn., 1986. 1(1): p. 81-106.
30
15. Blockeel, H., L. De Raedt, and J. Ramon. Top-down induction of clustering trees.
in Proceedings of the 15th International Conference on Machine Learning. 1998.
Morgan Kaufmann.
16. Phachongkitphiphat, N. and P. Vateekul. An improvement of flat approach on
hierarchical text classification using top-level pruning classifiers. in 2014 11th
International Joint Conference on Computer Science and Software Engineering
(JCSSE). 2014.
17. Yang, Y. and J.O. Pedersen, A Comparative Study on Feature Selection in Text
Categorization, in ICML '97: Proceedings of the Fourteenth International
Conference on Machine Learning. 1997. p. 412-420.
18. Chang, C.-C. and C.-J. Lin, LIBSVM: A library for support vector machines. ACM
Trans. Intell. Syst. Technol., 2011. 2(3): p. 1-27.
19. Ananpiriyakul, T., P. Poomsirivilai, and P. Vateekul. Label Correction Strategy on
Hierarchical Multi-Label Classification. in Machine Learning and Data Mining in
Pattern Recognition. 2014. Cham: Springer International Publishing.
20. Klungpornkun, M. and P. Vateekul, Hierarchical text categorization using level
based neural networks of word embedding sequences with sharing layer
information. Walailak Journal of Science and Technology, 2019. 16(2): p. 121-
131.
21. Vateekul, P., S. Dendamrongvit, and M. Kubat, Improving SVM Performance in
Multi-Label Domains: Threshold Adjustment. International Journal on Artificial
Intelligence Tools, 2013.
22. Saetia, C. and P. Vateekul. Enhance Accuracy of Hierarchical Text Categorization
Based on Deep Learning Network Using Embedding Strategies. in 2018 15th
International Joint Conference on Computer Science and Software Engineering
(JCSSE). 2018.
31
Appendix (List of Research Outputs)
32