Hierarchical Multi-Label Classification For Large Scale Data

Final Report
Project Title: Hierarchical Multi-Label Classification for Large Scale Data
1) Asst. Prof. Peerapon Vateekul, Ph.D. Dept. of Computer Eng., Chulalongkorn Univ.
2) Prof. Boonserm Kijsirikul, Ph.D. (Mentor) Dept. of Computer Eng., Chulalongkorn Univ.
January 2020
Contact No: TRG5780220
Final Report
1) Asst. Prof. Peerapon Vateekul, Ph.D. Dept. of Computer Eng., Chulalongkorn Univ.
2) Prof. Boonserm Kijsirikul, Ph.D. (Mentor) Dept. of Computer Eng., Chulalongkorn Univ.
Project Granted by the Thailand Research Fund

Office of the Higher Education Commission and Chulalongkorn University
Table of Contents
Abstract ...............................................................................................................................1
Executive summary ............................................................................................................. 2
1) Introduction to Research ................................................................................................4
2) Literature review ............................................................................................................. 5
2.1) Problem Statement in Hierarchical Multi-Label Classification ...........................5
1.1 Existing Strategies ..............................................................................................6
1.2 Performance Evaluation .....................................................................................7
3) Objective .........................................................................................................................8
4) Research Methodology and Results .............................................................................. 8
4.1) SVM Based Solution...........................................................................................8
4.2) Deep Learning Based Solution ............................................................................19
5) Output ...........................................................................................................................28
5.1) International Journal Publication ..........................................................................28
5.2) Research Utilization and Application ...................................................................28
5.3) Others e.g. national journal publication, proceeding, international conference,
book chapter, patent ....................................................................................................29
6) Reference......................................................................................................................30
Appendix (List of Research Outputs)................................................................................32
Abstract
Project Code: TRG5780220

Investigator: Asst. Prof. Peerapon Vateekul Ph.D.
E-mail Address: peerapon.v@chula.ac.th
Project Period: May 12, 2014 to Dec 31, 2019
Abstract:
Hierarchical Multi-Label Classification (HMC) is a variant of classification where an
example can belong to two or more classes at the same time, and the classes are mutually
related by generalization/specialization operations. Recently, the task has gained
considerable attention due to the need to organize data with predefined class hierarchy
across various domains of application, such as, web repositories, digital libraries, protein
function prediction, music genre categorization, etc. In text classification, the size of data
is usually very large, particularly on the web, e.g., Wikipedia and Dmoz Open Directory
Project. There are not only millions of documents and words (features), but also the
number of classes in the hierarchy can be thousands of topics. Unfortunately, most
classification algorithms are memory resident and cannot handle very large scale of data.
Moreover, analysis under this scenario often shows low prediction accuracy caused by
the imbalanced training sets, where one class outnumbers the other.
In this project, we aim to propose an algorithm for HMC, especially for text corpora
that are usually large and contain many challenging issues. To achieve best accuracy,
there are several approaches investigated in this work: Support Vector Machine (SVM)
and Deep Learning techniques. Furthermore, additional preprocessing and
postprocessing are also presented in order to improve both accuracy and training time.
Keywords: Hierarchical Multi-Label Classification, Large Scale Data, Support Vector

Machines, Deep Learning
1
Executive summary
Hierarchical Multi-Label Classification (HMC) is a complex type of classification
problem, where each example can be annotated into more than one label, and the labels
are hierarchically organized. It has received a lot of attentions due to its need in broad
domains of applications. In the text domain, this problem is also known as Hierarchical
Text Categorization (HTC). There are many challenging issues: (1) the size of text corpora
is usually large, e.g., Wikipedia, Dmoz, WIPO, etc. (2) the number of classes in the
hierarchy is large with class dependencies.
This research project aims to propose a novel algorithm for HMC, especially in the
domain of application in “text categorization”. There are many research outputs based on
modern algorithms: Support Vector Machines (SVM) and Deep Learning techniques,
along with label correction strategy to further improve an accuracy. As an output of this
project, there are four outcomes: one internal journal and three internal conference
proceedings. Furthermore, there is also an open-source software publicly available for
non-commercial.
First, we propose a technique based on the flat approach by generating one SVM
per each leaf node in the hierarchy. There are two proposed strategies including “Top-
Level Pruning (TLP)” and “Exclusive Top-Level Training Policy (ETT).” The prior one, TLP,
helps to reduce the number of candidates (leaf-node classifiers) by adding top-level
classifiers to prune unrelated branches resulting in a smaller set of candidates. The latter
strategy, ETT, is a method that uses the top-level data to construct a binary training set.
Comparing to use the whole data for each class node, ETT can generate a smaller and
less imbalanced training set.
Second, we propose a technique called “Label Correction” to correct an unusual
prediction by taking a class correlation into account resulting in F1-improvement. There
are three modules in the proposed system: (i) label correction, (ii) performance
enhancement, and (iii) hierarchical path adjustment. The results show that the model with
our label correction significantly outperforms baseline techniques without any corrections.
Third, we propose a Deep Learning based algorithm called “Level-Based Neural
Networks with Sharing Layer Information (SHL-NN)”. The model is based on a multiclass
classifier per each class level in the hierarchy, which requires a smaller number of
classifiers than the other HMC approaches (top-down and flat). At first, we use Word2Vec
model training from sentences to represent all words in the sentences into word vectors.
Then, we applied the convolutional neural network for features extraction from word
vectors which is sequentially combined to be the same as the word order. Next, the model
is enhanced specifically for HMC by considering information from the previous hierarchical
level (layer) called “Sharing Layer Information”. Finally, we perform the label correction to
fix the prediction.
2
Fourth, it is also based on Deep Learning model by focusing on an improvement of
our previous attempt “SHL-NN” since it is based on a sequence of unsupervised word
embedding vectors, so the performance should be limited. In this work, we propose a
supervised document embedding specifically designed for hierarchical text categorization
based on Autoencoder, which is trained from both words and labels. To enhance the
embedding vectors, the document embedding strategies are invented to utilize the class
hierarchy information in the training process. To transfer the prediction result from the
parent classes, the shared information technique has been improved to be more flexible
and efficient.
3
1) Introduction to Research
The amount of data in the world has been exploding caused by the rise of multimedia,
social media, and the Internet. Particularly, the volume of World Wide Web (WWW)
information is growing annually at minimum rate of 59%; therefore, it is necessary to have
an appropriate indexing scheme to organize the data into categories. However, manual
creation of such scheme is impractical due to the size of data and the costs of human
experts. This is why some machine-learning scientists have proposed a classification
technique to automate the process.
Traditional classification approach assumes that each example should be assigned
to one (and only one) out of two or more classes. However, this scenario digresses from
real-world applications in two aspects. First, each example can belong to many classes
simultaneously (multi-label classification). This aspect commonly appears in many areas.
For example, each movie can belong to more than one genre, e.g., Harry Potter is
adventure and also fantasy movie. Likewise, a book can be categorized into multiple
topics: engineering, technology, science, computer, etc. Second, the class can be
hierarchically ordered (hierarchical classification). When the number of classes is large, it
is helpful to organize them into groups and subgroups. For instance, large collections of
web pages can be handled in conceptual hierarchies such as the one from Figure 1. This
hierarchical arrangement is essential because it allows users to focus only on the level of
details, they are interested in. The task seeking to address both of these issues is called
“hierarchical multi-label classification” (HMC).
Root
Arts Business Computers
Movies TV Jobs Investing Internet Software
Figure 1. Parts of top-level concepts in the Dmoz Open Web Directory project
There are many applications which can benefit from HMC including: gene function
prediction [1-3], image recognition [4], and web repositories/digital libraries (e.g., Dmoz,
Wikipedia1, Yahoo [5, 6], EUROVOC [7], Reuters [8], OHSUMED [9]). In this research,
we focus on an application in “text categorization,” particularly on the data from Wikipedia.
Along with their widespread use, comes the need for automated classification of new
documents to the categories in the hierarchy. As the size of the hierarchy grows and the
1The data sets from Wikipedia (www.wikipedia.org) and the Dmoz Open Web Directory (www.dmoz.org) are available through the Pascal
Challenge on Large Scale Hierarchical Text Classification (LSHTC).
4
number of documents to be classified increases, a number of interesting machine learning
problems arise.
There are 3 main challenges in processing the Wikipedia corpus. First, some classes
in the hierarchy have more than one parent, in which case the mutual relationships have
to be described by a hierarchy structured as “a directed acyclic graph (DAG).”
Unfortunately, most of the previous HMC techniques support only the tree structured
hierarchy, where each class can strictly inherit from one parent. Second, traditional
classification techniques tend to perform low prediction accuracy. In the HMC domain, it
is common to induce a separate binary classifier for each class in hierarchy. As a result,
some of these classifiers have to induce from “imbalanced training sets,” where one class
outnumbers the other – a circumstance known to hurt the performance. Third, the size of
the Wikipedia corpus is so large that it cannot be processed by any classical supervised
algorithms since most of them are memory-resident, while it is impossible to load all the
data into memory at once. Moreover, the computational cost is also prohibitively
expensive. In particular, the largest data set (based on Wikipedia) contains approximately
2 million documents and 1.6 million features. The class hierarchy consists of more than
300,000 categories on 729 levels!
2) Literature review
2.1) Problem Statement in Hierarchical Multi-Label Classification
Multi-Label classification is a domain where an example can be associated to multiple
classes simultaneously. Let 𝑋 ⊆ ℝ% be an example space of n-dimensional features,
and 𝑌 be a finite set of classes. Let 𝐷 be a set of training examples of (+++⃗,𝑥* 𝑦/ ) which
contains a feature vector 𝑥 +++⃗* ∈ 𝑋 and belongs to multiple classes 𝑦/ ⊆ 𝑌. The goal
of multi-label classification is to find the classifier 𝑔: 𝑋 → 26 that maps example 𝑥 to
set of classes 𝑦. This kind of tasks can be encountered in various domains, including
image, text, biology, audio, and video. For example, movie genre classification is multi-
label classification because each movie can belong to more than one genre, e.g., both
adventure and also fantasy movie simultaneously.
Hierarchical Multi-Label classification (HMC) refers to a task that follows a multi-label
assumption, and classes are organized in a hierarchical structure. In addition, if an
example belongs to a class ci, it must belong to all ci’s superclasses. This property is
called “a hierarchical constraint”. Let 𝐻 = (𝐶, 𝛦) be a class hierarchy consisting of a
set of class nodes, 𝐶 ⊆ 𝑌, and a set of edges, 𝛦. Let ≺ represents superclass
relationship, and classes {𝑐? , 𝑐@ } ⊂ 𝐶; 𝑐? ≺ 𝑐@ if and only if 𝑐? is a superclass of
𝑐@ . Denote (𝑥 +++⃗,* 𝑦/ ) as a tuple of example, which +++⃗
𝑥* is a feature vector and 𝑦/ is a
set of classes. Note that 𝑦/ must satisfy the hierarchical constraint if 𝑐? ≺ 𝑐@ and
5
𝑐@ ∈ 𝑦/ then 𝑐? ∈ 𝑦/ . The HMC goal is to find the classifier 𝑔: 𝑋 → 26 that maps
example 𝑥 to set of classes 𝑦 such that 𝑔(𝑥) follows the hierarchical constraint.
Recently this domain has gained a lot of attentions in various applications including
functional genomics, image annotation, and text classification.
Two versions of this task exist. In the mandatory leaf-node problem (MLNP), only the
leaf-node classes are used. By contrast, in the non-mandatory leaf node problem
(NMLNP), an example can be labeled with any class from the given class hierarchy.
Considering the class hierarchy from Figure 2, MLNP permits an example to be labeled
only with a subset of { C1.1, C1.2, C2.1, C2.2.1, C2.2.2 }, but NMLNP allows also the
other class labels (e.g., C1 or C2.2).
In this research, we focus on the hierarchy from Wikepedia. The hierarchical structure
is DAG. Each example can belong to many classes and must be assigned to only classes
at leaf nodes, MLNP. From our preliminary analysis, cyclic relationships are found in the
hierarchy; thus, they are needed to be removed in order to preserve the property of DAG.
1 2
1.1 1.2 2.1 2.2
2.1.1 2.1.2 2.2.1 2.2.2

Figure 2. An example of a DAG-structured class hierarchy
1.1 Existing Strategies
Surveying existing solutions to the HMC problem, [10] distinguishes three
fundamental strategies: (1) flat classification, (2) the top-down approach (local classifier),
and (3) the big-bang approach (global classifier).
1.1.1 Flat Approach
This ignores the class hierarchy altogether, and deals only with the leaf-node classes,
whether by a single multi-label classifier or by a set of binary classifiers (a separate one
for each leaf node). The advantage is that this enables the use of traditional machine-
learning techniques such as neural networks, decision trees, or SVM.
This solution can be employed only if leaf-node class label is known for each
example, and if the nature of the application is such that the user can afford the inability
to identify non-terminal classes.
6
1.1.2 Top-Down Approach
This is the most common approach in the HMC domain. One or more classifiers are
trained for each (i) class node or (ii) class level of the hierarchy. This produces a tree of
classifiers in the tree structure or a graph of classifiers in the DAG structure. In the training
process, those classifiers are constructed from “local information”. In the testing process,
each classifier collaborates to predict an example in a top-down fashion.
Sun and Lim [11] applied this strategy to text classification where the class hierarchy
was a plain tree structure. For each class, they induced two SVMs: a local classifier and
a subtree classifier. An example is labeled as Ci by the local classifier, while the latter
one decides whether or not this example should be passed to Ci's subclassifiers.
Seeking to further improve the performance, Secker et al. [12] used several different
induction algorithms for each node of the hierarchy: Naive Bayes, SMO, 3-NN, etc. For
each node, ten classifiers were trained, and the one with the best classification results
was selected. This improved classification accuracy, but the computational costs were
even higher than in the previous attempt.
1.1.3 Big-Bang Approach (Global Approach)
Departing from the idea to induce a separate binary classifier for each node, some
authors preferred to induce one big (global) classification model to cover the entire class
hierarchy. This may have certain advantages: the size of the global classification model
is usually less than the sum of the node classifiers, and mutual interdependencies of the
classes are more easily taken into account.
A good example of the big-bang approach is the one reported by Clare and King [13]
who developed a hierarchical extension to the decision-tree generator C4.5 [14] and
applied it to functional-genomics data (they called their system HC4.5). To give higher
priority to more specific classes, they introduced a mechanism to weigh the entropy
formula accordingly.
Blockeel et al. [1] proposed another attempt called Clus-HMC, which is a decision
tree based algorithm to handle the HMC task. Their algorithm is a hierarchical version of
an earlier predictive clustering tree (PCT) [15]. Ven et al. [3] improved Clus-HMC, so that
it could be used in DAG-specified class hierarchies. An ensemble version of the algorithm,
Clus-HMC-ENS, was proposed by Schietgat et al. [2].
1.2 Performance Evaluation
We used two evaluation metrics to compare each model, F-measures with macro-
and micro-averaging for a multi-label purpose. They are defined as the performance
criteria as shown in Table 1. The macro-averaging is appropriate for measurement when
we want to ignore the overcoming of majority classes. In opposition, micro-averaging is
used to measure its globally counts.
7
A comparison in HTC often use F1 macro to be the main evaluation metric. For
example, the LSHTC challenge on Kaggle, the competition in HTC, also used F1 macro
to be a criterion for finding the winner of the competition.
Table 1. Evaluation metric

Multi-Label Classification
Classification
Macro-Averaging Micro-Averaging
Precision
|C | |C | |C |
Pr = TP (TP + FP )
1
MaPr = |C | å Pri MiPr = å TPi å (TPi + FPi )
i =1 i =1 i =1
|C | |C | |C |
Recall
Re = TP (TP + FN )
1
MaRe = |C | å Rei MiRe = å TPi å (TPi + FNi )
i =1 i =1 i =1
2 |C | 2
( b + 1) ´ Pr ´ Re ( b + 1) ´ MiPr ´ MiRe
1
å
F𝜷
Fb = MaFb = |C | Fb ,i MiFb =
2 2
b ´ Pr + Re i =1 b ´ MiPr + MiRe
3) Objective
- This research project aims to propose a novel hierarchical multi-label classification
(HMC) algorithm which supports large-scale data. It will focus on the domain of
application in “text categorization”.
- The proposed techniques should apply modern machine learning techniques, e.g.,
SVM and Deep Learning techniques.
4) Research Methodology and Results
In this research, there are two main proposed solutions.
- First, it is based on SVM using the flat approach in Section 2.2.1, where it
generates a binary classifier per each class in the hierarchy.
- Second, we further propose a better solution based on Deep Learning techniques
also using the top-down approach in Section 2.2.2. Anyway, it generates only a
multi-label classifier per each level in the hierarchy, so the number of classifiers
in the second solution (Deep Learning) is much less than that in the first solution
(SVM).
4.1)SVM Based Solution
4.1.1) Flat Approach Using Top-Level Pruning Classifiers

In this work, we aim to propose a hierarchical classification technique based on the
flat approach. It generates one classifier, which is “SVM”, per each leaf node in the
hierarchy. Other strategies have been invented to address challenging issues in large-
scale text corpora. The outcome of this work is a publication in an internaltional
conference in Scopus as shown details below [16]:
8
Phachongkitphiphat, N. and P. Vateekul. An improvement of flat approach on
hierarchical text classification using top-level pruning classifiers. in 2014 11th International
Joint Conference on Computer Science and Software Engineering (JCSSE). 2014.
Proposed Model
This work presents a novel flat-based hierarchical text classification that can achieve
promising prediction performance. There are two strategies including “Top-Level Pruning
(TLP)” and “Exclusive Top-Level Training Policy (ETT).” The prior one, TLP, helps to
reduce the number of candidates (leaf-node classifiers) by adding top-level classifiers to
prune unrelated branches resulting in a smaller set of candidates. The latter strategy,
ETT, is a method that uses the top-level data to construct a binary training set. Comparing
to use the whole data for each class node, ETT can generate a smaller and less
imbalanced training set. SVM is a baseline classifier employed in this work since it is
widely used and shows a good performance.
There are five main steps in the proposed method (Figure 3): (i) feature selection,
(ii) top-level pruning, (iii) similarity measure, (iv) training classifiers, and (v) classification.
The first step aims to select a small set of relevant features. The second step builds and
trains a classifier locally for each node in the top level. In the testing process, each
example is classified by these classifiers. If any of them predict it as negative (unrelated),
their corresponding branches with leaf-node classes can be pruned. Third, a cosine
similarity is calculated between the testing example and a vector of the remaining classes
from the previous step giving a set of candidates. The fourth step is to induce a classifier
for each candidate classes and, finally, predict the result.
Figure 3. Proposed flat-based solution
1) Feature Selection: A text corpus is usually described by a large number of

features, so it is necessary to perform a feature selection. In this work, Chi-Square (X2)
[17] is chosen as a criterion as a measure of dependency between feature f and class c.
Let A be the number of times f and c occur at the same time, B be the number of times
f occurs without c, C be the number of times c occurs without f, D is the number of times
neither f nor c occur, and N be the total number of documents. The feature-goodness
measure is defined as:
9
The X2 has a natural value of zero if feature and class are independent. For each
feature, X2 scores are computed for all classes and, then, only the maximum value is
selected as in the following equation:
Next, the features are sorted by their X2 in descending order and removed if their
score cannot pass a predefined threshold. To avoid removing all of the features, the
threshold must be at least the minimum value of the feature used in the training set.
2) Top-Level Pruning: In the text corpus, the number of leaf-node classes in the
hierarchy is usually very large causing a high computational cost and low prediction
accuracy due to noisy classes. To avoid this issue, the method called “Top-Level Pruning
(TLP)” is proposed to apply a top-level classifier to prune its corresponding leaf-node
subclasses if a testing example is predicted as negative. For instance, Figure 4 shows
the original class hierarchy, where the nodes surrounded by a dashed rectangle are
candidate. With our strategy, TLP first induces top-level classifiers for classes c1, c2, and
c3. Next, they are used to classify a testing example. If it is predicted as negative by the
classifier c3, all of c3’s subclasses can be removed from the consideration as shown in
Figure 5. From the prelinary experiment, it showed that these top-level classifiers can give
high prediction accuracy since the training set is large enough and less imbalanced.
Sometimes a testing example is predicted as not belonging to any of the top-level
classifiers. If it happens, there are two possible solutions. The first one is called “TLP root
type (TLP (r))” by including only the root node and discarding all other subclasses in the
prediction (skip the next module). The second option is called “TLP candidate type (TLP
(c))” by ignoring the TLP strategy and using the result based on the similarity score in the
next module. Which option is better will be discussed in the experiment.
Figure 4. Example of an original hierarchy. The nodes surrounded by dashed rectangles

with round corners are leaf nodes which can be candidate classes of test examples.
10
Figure 5. Example of top-level pruning hierarchy. The nodes surrounded by dashed
rectangles with round corners are leaf nodes which can be candidate classes of test
examples.
3) Similarity Measure: For each testing example, a set of candidate classes which
is likely to be a true class is selected based on a cosine similarity from the subset of leaf
nodes obtained from the prior step. Let A and B are vectors, the cosine similarity is
defined as below. First, a centroid vector for each leaf-node class is computed as an
average of the training documents which belong to that class. Next, a vector of testing
data is compared to all leaf-node vectors using the cosine similarity. Finally, the top
classes with the highest similarity score is chosen as a set of candidates.
4) Training Classifier: After obtaining a set of candidate classes, there are many
possbile methods to construct a binary classifiers. In this work, a novel training policy
called “Exclusive Top-Level Training Policy (ETT)” is proposed, and it is specifically
designed to cooperate with TLP (the second module of our framework). The notations are
defined as follows: the class ci, let FL(ci) be a set of classes in first level of the hierarchy.
ETT is defined as in the following equation:
T+(ci) = *(ci) ∪ T+(Child(ci)) and T-(ci) = T(FL(ci))\ T+(ci)
The ci’s positive training set includes all examples whose most specific class is ci
union with ci’s subclasses. The ci’s negative training set is a set of examples belonging
to the first-level class in the hierarchy which has ci as its subclass excluding ci’s positive
examples. Comparing to other policies, ETT is less imbalanced than EAT and more
effective than EPT since unseen examples (from the different branches) are already
filtered out by our pruning strategy, TPL, at the top level.
5) Classification and Evaluation: In the final step, the candidate classifiers built in
the previous step is used to predict as to whether or not the testing example truly belongs
to them.
11
Results
The experimental data is extracted from the Wikipedia dataset from Large Scale
Hierarchical Text Classification Challenge. Its statististics is illustrated in Table 2 showing
that the number of features is larger than the number of examples. SVM is our baseline
classifier, where the implementation is employed from a publicly available tool, LIBSVM
[18].
Table 2. Statistics of small subset of Wikipedia dataset
#Doc #Classes #Features Max Depth
421 43 6,372 5
The experimental results are demonstrated in Table 3. The winner method is our
approach with an option of “ETT + TLP (r)”. It showed the highest performance in all
metrics. This shows that our proposed strategies can improve the model accuracy
including (1) “Exclusive Top-Level Training Policy (ETT)” and (2) “Top-Level Pruning
(TLP)”
Table 3. Performance comparison, where EPT is baseline2. The integer in parentheses

gives the rank as obtained by ANOVA followed by Bonferroni multiple comparison with
=0.05. The boldface is the method with the highest score.
Method Macro F1 Micro F1
EPT (baseline) 0.144 (2) 0.332 (4)
ETT 0.154 (2) 0.439 (4)
ETT + TLP (r) 0.334 (1) 0.688 (1)
ETT + TLP (c) 0.332 (1) 0.656 (1)
Conclusion
In this work, we present a hierarchical classification framework based on the flat
approach for the text categorization domain, where the number of classes and levels in
the hierarchy can be very large. The proposed method utilizes data at the first level of
the hierarchy in two aspects by adding the Top-Level Pruning classifiers (TLP) and
constructing a training set using Exclusive Top-Level Training policy (ETT). The
experiment was conducted on a subset of the Wikipedia dataset. The result showed that
our method (TLP & ETT) is the winner and it is statistically significant better than baseline
methods.
2 The results reported in this table is summarized from the total results in Table 5 of 16. Phachongkitphiphat, N. and P. Vateekul. An
improvement of flat approach on hierarchical text classification using top-level pruning classifiers. in 2014 11th International Joint Conference on
Computer Science and Software Engineering (JCSSE). 2014..
12
4.1.2) Label Correction Strategy
The main drawback of our previous work in Section 0 is that the top-down approach
always suffers from error propagation and yields such a poor performance of classifiers
at the lower levels. In this work, we present a novel method called “label correction”,
which takes label correlation into consideration and corrects the results of unusual
prediction patterns. The outcome of this work is a publication in an internaltional
conference in Scopus as shown details below [19]:
Ananpiriyakul, T., P. Poomsirivilai, and P. Vateekul. Label Correction Strategy on
Hierarchical Multi-Label Classification. in Machine Learning and Data Mining in Pattern
Recognition. 2014. Cham: Springer International Publishing.
Proposed Model
There are three main modules in the proposed system (Figure 6): (i) label correction,
(ii) performance enhancement, and (iii) hierarchical path adjustment. The first module
aims to correct a prediction label set to match the closest label set in the training data.
Then, the second module ensures that the corrected label sets do not lessen the overall
system performance. There is a selection heuristic to correct only some classes; not all
classes and excluding classes whose performance is decreased after the correction.
Finally, the third module is only applied in the hierarchical domain in order to adjust the
result to satisfy the hierarchical constraint.
2. Performance 3. Hierarchical Path

1. Label Correction
Enhancement Adjustment (only HMC)
Figure 6. A process diagram of the proposed system for “Label Correction”
1) Label Correction: A set of distinct label sets is extracted from the training data.
Each of them is a vector of +1 (positive) and -1 (negative). Then, a distance (cost)
between each prediction to the distinct label sets is computed. Finally, if the prediction
result does not match to any label sets, it is corrected to match to the closest one (the
smallest cost). There are two choices of how to compute the cost: (i) constant cost and
(ii) score cost. In the first equation below, the cost is the total number of labels with
different sign between a prediction vector and a training label set vector. In the second
equation below, a vector of SVM scores is used instead of a vector of predictions. The
cost is a summation of square root of the absolute score of classes with different sign.
13
Figure 7 illustrates an example of how to compute the cost in the label correction
module. Assume there are 5 classes, L = {A, B, C, D, E}, with 4 distinct label sets,
S = {A, AB, BE, ACE}. Given a testing example x, if a vector of SVM prediction scores is
[-0.2, 0.1, 1.5, -0.8, 0.4], a vector of predictions will be [-1, +1, +1, -1, +1] using a sign
function. Unfortunately, this prediction pattern does not match any label sets, so this result
should be corrected. Using the constant cost, the new prediction is BE with the cost of 1,
while the new result is ACE with the score cost of 0.76. Whether distance cost function
is preferred will be investigated in the experiment.
Class = {A, B, C, D, E}
Distinct Set = {A, AB, BE, ACE}
SVM Score Vector = [-0.2, 0.1, 1.5, -0.8, 0.4]
SVM score
vector -0.2 0.1 1.5 -0.8 0.4 costconstant costscore
distinct set
A [+1,-1,-1,-1,-1] √0.2 √0.1 √1.5 0 √0.4 4 2.62
AB [+1,+1,-1,-1,-1] √0.2 0 √1.5 0 √0.4 3 2.30
BE [-1,+1,-1,-1,+1] 0 0 √1.5 0 0 1 1.22
ACE [+1,-1,+1,-1,+1] √0.2 √0.1 0 0 0 2 0.76
From 1st equation, the lowest costconstant is 1 and the prediction result should be BE.
From 2nd equation, the lowest costscore is 0.76 and the prediction result should be ACE.
Figure 7. An example of label correction process
2) Performance Enhancement: In the first process, the prediction results are forced
to exactly match one of the distinct label sets. It is possible that the performance of some
classes may be dropped. The performance enhancement module is proposed to
guarantee that this scenario has never happened; therefore, there is always an
improvement in the overall system performance – in terms of F1.
All the steps of this module are conducted on the training data. First, F1-score for
each class is evaluated on the baseline system, SVM-BR (SVM Binary Relevance without
label correction). Second, our label correction (1st module) is applied and, then, F1-score
for each class is reevaluated. Finally, if the performance of any classes is dropped after
using the label correction, this module will keep those classes’ result of the original system
without any correction.
A mechanism behind this module is illustrated via the below figures. Figure 8 shows
F1-score of each class before applying the label correction, where the macro-averaging
F1 is 0.472. After applying our strategy, F1-result for each class is shown in Figure 9.
Figure 9 (a) and (b) are the results without and with applying the performance
enhancement module. In Figure 9 (b), this module excludes classes A and B from the
label correction since their F1-scores are dropped as in Figure 9 (a). Because some
classes are excluded from label correction, the hierarchical constraint may not be
14
satisfied. In this case, the next phase will adjust the result to guarantee the hierarchical
constraint.
A
0.8
B C
0.55 0.62
D E F
0.16 0.12 0.58
Figure 8. F1-score of SVM-BR
A A
0.76 ↓ 0.8
B C B C
0.49 ↓ 0.65 ↑ 0.55 0.65 ↑
D E F D E F
0.44 ↑ 0.56 ↑ 0.62 ↑ 0.44 ↑ 0.56 ↑ 0.62 ↑
(a) F1-score without performance enhancement (b) F1-score with performance enhancement
Figure 9. An example of performance enhancement process. ↑ and ↓ refer to as
increase or decrease in terms of F1 comparing to SVM-BR in Fig. 3. The gray node is a
class, whose F1 is dropped.
3) Threshold Adjustment: To adjust the result along the hierarchical path, there are
three proposed strategies. First, the decreasing strategy is to follow the superclass’
prediction. All subclasses’ results are assigned to be negative when their superclassifier
predicts as negative. Second, the increasing strategy is to depend on the subclass’ result.
If the subclassifier predicts as positive, the results of its ancestors must also be positive.
Third, the voting strategy is whether adjust the result depending on the majority class of
the prediction. On the one hand, if the majority class along the path is negative, the result
is adjusted using the decreasing strategy. On the other hand, the prediction is modified
using the increasing strategy.
Figure 10 illustrates an example of the hierarchical adjustment for the class path {A ,
B, C}. Figure 10 (a) is the result without the adjustment as {Yes, No, Yes}, where the
results of classes B and C disagree and violate the hierarchical constraint. The results of
Figure 10 (b) and (c) are {Yes, No, No} and {Yes, Yes, Yes} adjusted using the decreasing
and increasing strategies. Since most of classifiers predict as “Yes”, the voting strategy
in Figure 10 (d) gives the same result as in the increasing strategy, {Yes, Yes, Yes}.
15
A A
Yes Yes
B C B C
No Yes No Yes
D E F D E F
Yes No Yes No No Yes
(a) A prediction result before adjustment (b) A result of decreasing strategy

A A
Yes Yes
B C B C
Yes Yes Yes Yes
D E F D E F
Yes No Yes Yes No Yes
(c) A result of increasing strategy (d) A result of voting strategy

Figure 10. An example of hierarchical path adjustment. The gray node is a class
adjusted to follow the hierarchical constraint.
Results
For the HMC domain, we compare three hierarchical path adjustment strategies: (1)
decreasing, (2) increasing, and (3) voting as mentioned. SVM-BR is considered as our
baseline (without label correction). There are 10 HMC datasets, and their results are
provided in Table 4 and
16
Table 5. Table 4 demonstrates that the best hierarchical adjustment is clearly “the
voting strategy”. On the wipo dataset in
17
Table 5, it has significantly increased from SVM-BR (baseline) for 75%.
Table 4. A comparison of SVM-BR and 3 label correction strategies in HMC. ELC

stands for “Enhanced Label Correction”. For each cell, X/Y means the numbers of
winner and loser datasets between the method in row to that in column. They must be
significantly different on a paired t-test with α=0.05.
SVM-BR Decreasing ELC Increasing ELC
Decreasing ELC 6/0
Increasing ELC 8/0 6/0
Voting ELC 8/0 6/0 2/0
18
Table 5. F1-results on HMC datasets. ELC stands for “Enhanced Label Correction”. The
number in parentheses is a percentage (%) comparing to SVM-BR. * represents a
significant difference on a paired t-test with α=0.05. The boldface method is a winner
on that dataset.
Method
SVM-BR Decreasing ELC Increasing ELC Voting ELC
Dataset
D0_yeast_GO 0.493 0.502 (+2) 0.511 (+4)* 0.511 (+4)*
D15_scop_GO 0.435 0.499 (+15)* 0.507 (+17)* 0.511 (+18)*
D16_struc_GO 0.313 0.333 (+6)* 0.378 (+21)* 0.377 (+21)*
D17_interpro_GO 0.159 0.180 (13)* 0.182 (+14)* 0.186 (+17)*
diatoms 0.431 0.618 (+43)* 0.644 (+50)* 0.644 (+50)*
enron 0.336 0.339 (+1) 0.347 (+3) 0.347 (+3)
imclef07a 0.668 0.728 (+9)* 0.745 (+11)* 0.745 (+11)*
imclef07d 0.722 0.732 (+1) 0.764 (+6)* 0.764 (+6)*
reuters 0.540 0.564 (+5) 0.574 (+6) 0.574 (+6)
wipo 0.220 0.386 (+75)* 0.386 (+75)* 0.386 (+75)*
Conclusion
In this work, we propose a technique called “Label Correction” to correct an unusual
prediction by taking a class correlation into account resulting in F1-improvement. There
are three modules in the proposed system: (i) label correction, (ii) performance
enhancement, and (iii) hierarchical path adjustment. Comparing to SVM-BR, our method
significantly won on all ten HMC datasets.
4.2) Deep Learning Based Solution

Recently, Deep Learning technique has been showing promising results in many
domains. This has motivated us to propose Deep Learning based algorithms specifically
for the HMC task. Furthermore, prior attempts based on the top-down approach require
a lot of classifiers since it generates one classifier per each class, so it is not suitable for
large-scale corpora. In this section, the approach is based on a multiclass classifier per
each class level in the hierarchy, which requires a smaller number of classifiers than the
tradition top-down one.
4.2.1) Level-Based Neural Networks with Sharing Layer Information (SHL-NN)
One of the prior approaches for HMC is a level-based strategy that induces a
multiclass classifier for each class level independently. However, all prior attempts did not
utilize information from its parent level and employed a bag of words rather than
considered a sequence of words. In this work, we present a novel level-based hierarchical
text categorization with a strategy called “sharing layer information.” For each class level,
19
a neural network is constructed, where its input is a sequence of word embedding vectors
generated from Convolutional Neural Networks (CNN). Also, a training strategy to avoid
imbalance issues is proposed called “the balanced resampling with mini-batch training.”
Furthermore, a label correction strategy is proposed to conform the predicted results from
all networks on different class levels. The outcome of this work is a publication in an
internaltional journal in Scopus (Q4) as shown details below [20]:
Klungpornkun, M. and P. Vateekul, Hierarchical text categorization using level based
neural networks of word embedding sequences with sharing layer information. Walailak
Journal of Science and Technology, 2019. 16(2): p. 121-131.
Proposed Model
Figure 11 displays the brief workflow of overall process from the sentences to the
prediction. At first, we use Word2Vec model training from sentences to represent all words
in the sentences into word vectors. Then, we applied the convolutional neural network for
features extraction from word vectors which is sequentially combined to be the same as
the word order. In our work, we propose a novel level-based neural network to perform
the Hierarchical Text Categorization (HTC) in the prediction process with sharing layer
information in topic A, namely Shared Hidden Layer and Shared Pooling Layer. Finally,
we perform the label correction to fix the prediction as described in topic B and also apply
the training strategy to enhance the accuracy of the NN in the last topic.
Figure 11. Workflow of the HTC model.
1) Sentence to Embedded Vector: The process of feature extraction from CNN in

sentences is shown in Figure 12. From the sentences, Word2Vec generates vectors that
represent the words with a vector size of V. In the representation, each word vector from
the Word2Vec model are sequentially combined into the same word order as in the
original sentences. To apply CNN, the continuous vectors are considered as pictures. In
convolution, each filter has a size of the desired length ´ size of word vector (2 ´ V in
Figure 12) and it convolutes word vectors with max-pooling to extract the data into
20
features. We could interpret that they capture the same short sentence and sum up the
words into a feature. Therefore, with this technique, we use this model on our HTC
architecture to extract the features from sentences.
Figure 12 Feature extraction using CNN on word vectors.
2) Sharing Layer Models: There are two proposed variants:

2.1) Shared Hidden Layer Neural Network (SHL-NN): The architecture is shown
in Figure 13. For each prediction level, we use multi-layer perceptron with inputs from
CNN to produces outputs as a prediction of each level. The hidden layers are shared to
the next layer by concatenating with the features. With this adaptation, we can control the
number of nodes in the neural network when we are confronted with a large hierarchy
and we can utilize trained hidden layers from higher levels which connect to other hidden
layers and produce a more complex MLP.
2.2) Shared Pooling Layer Neural Network (SPL-NN): From Figure 14, the
architecture is similar to SHL-NN. The concept came from CNN for text categorization
using filters to capture the continuous words as features. In hierarchical classification, we
can get more information from the hierarchy. This shared pooling layer will be combined
with other input layers. So, the hierarchy features in the next level will be computed from
this shared information and the input features then are sent to the sharing layer in the
next level and so on.
Figure 13. Architecture of Shared Hidden Layer model.
21
Figure 14. Architecture of Shared Pooling Layer model.
3) Label correction: To correct inconsistencies in the prediction, we applied a label

correction for every model before evaluations. There are 2 approaches:
- Parent-Based Correction: Parent-based correction gives a priority to parent
classes over child classes. Any positive labels that its parent class is negative will
be removed.
- Child-Based Correction: Child-based correction is an inverted method. All ancestor
classes of positive labels are fixed to positive predictions.
4) Training Strategy for Imbalanced Data: Balanced resampling with mini-batch:

Balanced resampling is our strategy to train the model with more balanced mini-batch.
For each batch, we sample with replacement of the low appearance classes to keep the
equality of each class. So, the neural networks are trained without a bias from majority
classes and the model will pay more attention to a few example classes.
Result
The experiment was conducted on three datasets: WIPO-D, WIPO-C, and Wiki-small.
We examine our novel models in many variations, and the results are showed in
22
Table 6. Among variants of our proposed model, SHL-NN is the winner since it
outperforms others in 2 out 3 datasets.
HR-SVM [21] is a baseline model, an adaptive support vector machine for HMC. To
evaluate the model, we tested using the same training and testing sets with TF-IDF
features. Table 7 shows that our proposed model (SHL-NN) is the best of all micro-
averaging F1, but not in the macro-averaging F1. From our further investigation, we found
that our model performs remarkably well in the top level and the performance drops
quickly in the lower level. It shows that the neural network method still faces the blocking
problem and could be improved in macro-average results if we use more models or
features like bypass techniques for lower hierarchies in further experiments.
23
Table 6 Results of hierarchical text categorization. Boldface result is the winner on that
averaging score.
Macro-Averaging Micro-Averaging
Dataset Model
Precision Recall F1 Precision Recall F1
WIPO-D FC-NN 0.0412 0.0246 0.0287 0.6776 0.3303 0.4440
MLP-NN 0.0379 0.0271 0.0295 0.5548 0.3225 0.4073
SHL-NN 0.0399 0.0285 0.0309 0.5529 0.3222 0.4070
SPL-NN 0.0409 0.0223 0.0266 0.7030 0.3110 0.4311
WIPO-C FC-NN 0.0121 0.0116 0.0110 0.3740 0.3200 0.3376
MLP-NN 0.0167 0.0090 0.0109 0.5816 0.3227 0.4151
SHL-NN 0.0184 0.0099 0.0120 0.5589 0.3219 0.4084
SPL-NN 0.0142 0.0113 0.0118 0.4545 0.3281 0.3809
Wiki-small FC-NN 0.1208 0.1036 0.1035 0.5276 0.5436 0.5352
MLP-NN 0.1623 0.1157 0.1256 0.5820 0.5567 0.5690
SHL-NN 0.1599 0.1008 0.1143 0.6093 0.5374 0.5709
SPL-NN 0.1187 0.1037 0.0997 0.4812 0.5558 0.5155
Boldface result is the winner on that dataset.
Table 7 Comparison with existing methods. Boldface result is the winner on that
averaging score.
Dataset SHL-NN HRSVM
Macro-F1 Micro-F1 Macro-F1 Micro-F1
WIPO-D 0.0309 0.4070 0.0569 0.2081
WIPO-C 0.0120 0.4084 0.0316 0.1074
Wiki-small 0.0620 0.5099 0.2254 0.2126
Conclusion
In this work, we describe a series of experiments on the hierarchical text
categorization problem with a neural network using sharing layer information built on top
of word vectors. From experiments, the child-based label correction is the approach that
prevents the blocking problem and could correct more top levels prediction. Also, the
balanced resampling with mini-batch training is proper for avoiding the imbalanced data
situation for HTC. The proposed method (SHL-NN) can achieve the highest micro-
averaging F1.
4.2.2) Encoded Shared Layer Neural Network (ESL-NN)
In our previous work called Shared Hidden Layer Neural Network (SHL-NN), it has
shown that sharing information between levels can improve a performance of the model.
However, this work is based on a sequence of unsupervised word embedding vectors, so
the performance should be limited. In this work, we propose a supervised document
embedding specifically designed for hierarchical text categorization based on
Autoencoder, which is trained from both words and labels. To enhance the embedding
vectors, the document embedding strategies are invented to utilize the class hierarchy
24
information in the training process. To transfer the prediction result from the parent
classes, the shared information technique has been improved to be more flexible and
efficient. The outcome of this work is a publication in an internaltional conference in
Scopus as shown details below [22]:
Saetia, C. and P. Vateekul. Enhance Accuracy of Hierarchical Text Categorization
Based on Deep Learning Network Using Embedding Strategies. in 2018 15th International
Joint Conference on Computer Science and Software Engineering (JCSSE). 2018.
Proposed Model
The workflow of an overall process from raw documents to their predictions is shown
in Figure 15. First, DocTag2Vec is employed to convert from a raw document to an
embedding vector. Then, the embedding vectors of training documents are fed to the
proposed model called “Encoded Shared Layer Neural Network (ESL-NN)”. Finally, a label
correction is applied to fix an inconsistence prediction in terms of hierarchy.
Figure 15. Workflow of the proposed mode “ESL-NN”l
1) Embedding Strategy: In the training DocTag2Vec process, we must select a

suitable class information from the hierarchy to supervise the model. There are two
proposed strategies as shown below:
- Leaf only strategy: It selects only leaf classes that is assigned to the document,
while all non-leaf nodes are not used. The abbreviation of DocTag2Vec using Leaf
only strategy is called “LOD-Vec.”
- Overall path strategy: It selects all classes that assigned to the document. If a
document belongs to a child class, its ancester classes are also included in the
training process. The abbreviation of DocTag2Vec using Overall path strategy is
called “OPD-Vec.”
2) Encoded Shared Layer Neural Network (ESL-NN): The proposed architecture is

shown in Figure 16. The model is developed like our previous work, SHL-NN. An output
of the first level prediction is produced from a multi-layer perceptron with a document
embedding vector. An output is encoded by using a layer of perceptron before it
25
concatenates with a document embedding vector to be a feature to predict classes in the
next level. It is different from SHL-NN that uses the output from hidden layers as shared
information whose disadvantage is the number of hidden nodes will be limited because if
the model has too many hidden nodes, the shared information from SHL-NN will
overwhelm the document embedding feature. On the other hand, the number of hidden
nodes of ESL-NN in the previous level classifier does not affect to the architecture of the
classifier in the next level.
Figure 16. The model architecture of ESL-NN
3) Label Correction: It is applied to fix an inconsistence prediction in terms of

hierarchy. Like in our previous work, “Child-based correction” is chosen that all ancestor
classes of positive labels are fixed to positive predictions.
Result
From the experimental result as shown in Table 8, ESL-NN is the winner and achieves
the best performance in both F1-micro and F1-macro on almost all datasets. As shown
in
26
Table 9, in terms of F1 macro, ESL-NN achieves 14.53% higher than SHL-NN.
Table 8. Performance comparison: ESL-NN and “SHL-NN + OPD-Vec” are ours; while,
“SHL-NN + Word2Vec” and HR-SVM are baselines.
Dataset Score Ours Baselines
ESL-NN SHL-NN + OPD-Vec SHL-NN + Word2Vec HR-SVM
WIPO- F1 macro 0.0836 0.0725 0.0293 0.0572

D F1 micro 0.4512 0.4089 0.3943 0.2085
WIPO- F1 macro 0.0400 0.0355 0.0096 0.0317
C F1 micro 0.4052 0.4008 0.4128 0.1075
Wiki- F1 macro 0.2386 0.2069 0.1121 0.2327
small F1 micro 0.3576 0.3065 0.5763 0.2200
27
Table 9. Percentage of improvements from original methods (columns) to our proposed
methods (rows). In parentheses, we show the number of dataset that the proposed
methods can achieved higher F1 macro than the original method significantly
SHL-NN + OPD-Vec SHL-NN + Word2Vec HR-SVM
ESL-NN 14.53 (3) 204.51 (3) 25.03 (3)
SHL-NN + OPD-Vec 0.00 (3) 166.59 (3) 9.18 (2)
Conclusion
In this work, we aim to propose a novel deep learning network for hierarchical text
categorization called “Encoded Shared Layer Neural Network (ESL-NN).” It is an
extension of our previous work called “Shared Hidden Layer Neural Network (SHL-NN).”
There are two main contributions in this work: (i) a supervised document embedding
strategy and (ii) a novel sharing information technique. The experiment was conducted
on three corpuses: WIPO-C, WIPO-D, and Wiki-small. The macro-F1 results show that
ESL-NN unanimously outperforms all baselines: SHL-NN and HR-SVM.
5) Output
5.1) International Journal Publication
One publication in an international journal in Scopus
- Klungpornkun, M. and P. Vateekul, Hierarchical text categorization using level
based neural networks of word embedding sequences with sharing layer
information. Walailak Journal of Science and Technology, 2019. 16(2): p. 121-
131.
5.2) Research Utilization and Application
We have implemented our HMC solution called HR-SVM [21] along with our label
correction strategy [19]. This project is free for download for non-commercial use at the
link https://sites.google.com/site/hrsvmproject/ as in the figure below.
Figure 17. The website to HR-SVM project, our HMC program.
28
5.3) Others e.g. national journal publication, proceeding, international conference,
book chapter, patent
- Three proceedings in international conferences in Scopus
- Phachongkitphiphat, N. and P. Vateekul. An improvement of flat approach on
hierarchical text classification using top-level pruning classifiers. in 2014 11th
International Joint Conference on Computer Science and Software Engineering
(JCSSE). 2014.
- Ananpiriyakul, T., P. Poomsirivilai, and P. Vateekul. Label Correction Strategy on
Hierarchical Multi-Label Classification. in Machine Learning and Data Mining in
Pattern Recognition. 2014. Cham: Springer International Publishing.
- Saetia, C. and P. Vateekul. Enhance Accuracy of Hierarchical Text Categorization
Based on Deep Learning Network Using Embedding Strategies. in 2018 15th
(JCSSE). 2018.
29
6) Reference
1. Blockeel, H., et al., Decision trees for hierarchical multilabel classification: A case
study in functional genomics. Knowledge Discovery in Databases: Pkdd 2006,
Proceedings, 2006. 4213: p. 18-29.
2. Schietgat, L., et al., Predicting gene function using hierarchical multi-label decision
tree ensembles. Bmc Bioinformatics, 2010. 11: p. -.
3. Vens, C., et al., Decision trees for hierarchical multi-label classification. Machine
Learning, 2008. 73(2): p. 185-214.
4. Stenger, B., et al., Estimating 3D hand pose using hierarchical multi-label
classification. Image Vision Comput., 2007. 25(12): p. 1885-1894.
5. McCallum, A., et al., Improving text classification by shrinkage in a hierarchy of
classes. 1998.
6. Mladenie, D., Machine learning on non-homogeneous, distributed text data, in
Faculty of Computer and Information Science. 1998, University of Ljubljana.
7. Vateekul, P. and M. Kubat, Fast Induction of Multiple Decision Trees in Text
Categorization from Large Scale, Imbalanced, and Multi-label Data, in Proceedings
of the 2009 IEEE International Conference on Data Mining Workshops. 2009, IEEE
Computer Society: Miami, FL. p. 320-325.
8. Lewis, D.D., et al., RCV1: A New Benchmark Collection for Text Categorization
Research. J. Mach. Learn. Res., 2004. 5: p. 361-397.
9. Hersh, W., et al. {OHSUMED}: an interactive retrieval evaluation and new large
test collection for research. in SIGIR '94: Proceedings of the 17th annual
international ACM SIGIR conference on Research and development in information
retrieval. 1994. Dublin, Ireland: Springer-Verlag New York, Inc.
10. Silla, C. and A. Freitas, A survey of hierarchical classification across different
application domains. Data Mining and Knowledge Discovery, 2010.
11. Sun, A. and E.-P. Lim, Hierarchical Text Classification and Evaluation, in
Proceedings of the 2001 IEEE International Conference on Data Mining. 2001,
IEEE Computer Society. p. 521-528.
12. Secker, A.a.D., Matthew N and Freitas, Alex A and Timmis, Jon and Mendao,
Miguel and Flower, Darren R, An experimental comparison of classification
algorithms for hierarchical prediction of protein function. 3rd UK Data mining and
Knowledge Discovery Symposium UKKDD 2007 Canterbury, 2007: p. 13–18.
13. Clare, A., et al., Functional bioinformatics for arabidopsis thaliana. Bioinformatics,
2006(22 (9)): p. 1130-1136.
14. Quinlan, J.R., Induction of Decision Trees. Mach. Learn., 1986. 1(1): p. 81-106.
30
15. Blockeel, H., L. De Raedt, and J. Ramon. Top-down induction of clustering trees.
in Proceedings of the 15th International Conference on Machine Learning. 1998.
Morgan Kaufmann.
16. Phachongkitphiphat, N. and P. Vateekul. An improvement of flat approach on
hierarchical text classification using top-level pruning classifiers. in 2014 11th
(JCSSE). 2014.
17. Yang, Y. and J.O. Pedersen, A Comparative Study on Feature Selection in Text
Categorization, in ICML '97: Proceedings of the Fourteenth International
Conference on Machine Learning. 1997. p. 412-420.
18. Chang, C.-C. and C.-J. Lin, LIBSVM: A library for support vector machines. ACM
Trans. Intell. Syst. Technol., 2011. 2(3): p. 1-27.
19. Ananpiriyakul, T., P. Poomsirivilai, and P. Vateekul. Label Correction Strategy on
Hierarchical Multi-Label Classification. in Machine Learning and Data Mining in
Pattern Recognition. 2014. Cham: Springer International Publishing.
20. Klungpornkun, M. and P. Vateekul, Hierarchical text categorization using level
based neural networks of word embedding sequences with sharing layer
information. Walailak Journal of Science and Technology, 2019. 16(2): p. 121-
131.
21. Vateekul, P., S. Dendamrongvit, and M. Kubat, Improving SVM Performance in
Multi-Label Domains: Threshold Adjustment. International Journal on Artificial
Intelligence Tools, 2013.
22. Saetia, C. and P. Vateekul. Enhance Accuracy of Hierarchical Text Categorization
Based on Deep Learning Network Using Embedding Strategies. in 2018 15th
(JCSSE). 2018.
31
Appendix (List of Research Outputs)
1. [Conference] Phachongkitphiphat, N. and P. Vateekul. An improvement of flat

approach on hierarchical text classification using top-level pruning classifiers. in 2014
11th International Joint Conference on Computer Science and Software Engineering
(JCSSE). 2014.
2. [Conference] Ananpiriyakul, T., P. Poomsirivilai, and P. Vateekul. Label Correction
Strategy on Hierarchical Multi-Label Classification. in Machine Learning and Data
Mining in Pattern Recognition. 2014. Cham: Springer International Publishing.
3. [Journal] Klungpornkun, M. and P. Vateekul, Hierarchical text categorization using
level based neural networks of word embedding sequences with sharing layer
information. Walailak Journal of Science and Technology, 2019. 16(2): p. 121-131.
4. [Conference] Saetia, C. and P. Vateekul. Enhance Accuracy of Hierarchical Text
Categorization Based on Deep Learning Network Using Embedding Strategies. in 2018
15th International Joint Conference on Computer Science and Software Engineering
(JCSSE). 2018.
32

Hierarchical Multi-Label Classification For Large Scale Data

Uploaded by

Copyright:

Available Formats

Hierarchical Multi-Label Classification For Large Scale Data

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Hierarchical Multi-Label Classification For Large Scale Data

Uploaded by

Copyright:

Available Formats

Final Report

Project Title: Hierarchical Multi-Label Classification for Large Scale Data

Project Title: Hierarchical Multi-Label Classification for Large Scale Data

Project Granted by the Thailand Research Fund

Project Code: TRG5780220

Keywords: Hierarchical Multi-Label Classification, Large Scale Data, Support Vector

Arts Business Computers

Movies TV Jobs Investing Internet Software

1.1 1.2 2.1 2.2

2.1.1 2.1.2 2.2.1 2.2.2

Table 1. Evaluation metric

4.1.1) Flat Approach Using Top-Level Pruning Classifiers

Figure 3. Proposed flat-based solution

1) Feature Selection: A text corpus is usually described by a large number of

Figure 4. Example of an original hierarchy. The nodes surrounded by dashed rectangles

Table 3. Performance comparison, where EPT is baseline2. The integer in parentheses

2. Performance 3. Hierarchical Path

Figure 6. A process diagram of the proposed system for “Label Correction”

Figure 7. An example of label correction process

Figure 8. F1-score of SVM-BR

(a) A prediction result before adjustment (b) A result of decreasing strategy

(c) A result of increasing strategy (d) A result of voting strategy

Table 4. A comparison of SVM-BR and 3 label correction strategies in HMC. ELC

4.2) Deep Learning Based Solution

Figure 11. Workflow of the HTC model.

1) Sentence to Embedded Vector: The process of feature extraction from CNN in

Figure 12 Feature extraction using CNN on word vectors.

2) Sharing Layer Models: There are two proposed variants:

Figure 13. Architecture of Shared Hidden Layer model.

3) Label correction: To correct inconsistencies in the prediction, we applied a label

4) Training Strategy for Imbalanced Data: Balanced resampling with mini-batch:

Figure 15. Workflow of the proposed mode “ESL-NN”l

1) Embedding Strategy: In the training DocTag2Vec process, we must select a

2) Encoded Shared Layer Neural Network (ESL-NN): The proposed architecture is

Figure 16. The model architecture of ESL-NN

3) Label Correction: It is applied to fix an inconsistence prediction in terms of

WIPO- F1 macro 0.0836 0.0725 0.0293 0.0572

Figure 17. The website to HR-SVM project, our HMC program.

1. [Conference] Phachongkitphiphat, N. and P. Vateekul. An improvement of flat

You might also like