Discovering and Exploiting Entailment
Relationships in Multi-Label Learning
Christina Papagiannopoulou1 , Grigorios Tsoumakas1 , and
Ioannis Tsamardinos2,3
1
arXiv:1404.4038v2 [cs.LG] 17 Apr 2014
3
Aristotle University of Thessaloniki, Thessaloniki 54124, Greece
cppapagi,greg@csd.auth.gr
2
Computer Science Department, University of Crete, Greece
Institute of Computer Science, Foundation for Research and Technology - Hellas,
N. Plastira 100 Vassilika Vouton, GR-700 13 Heraklion, Crete, Greece
tsamard@ics.forth.gr
Abstract. This work presents a sound probabilistic method for enforcing adherence of the marginal probabilities of a multi-label model to automatically discovered deterministic relationships among labels. In particular we focus on discovering two kinds of relationships among the
labels. The first one concerns pairwise positive entailment: pairs of labels, where the presence of one implies the presence of the other in all
instances of a dataset. The second concerns exclusion: sets of labels that
do not coexist in the same instances of the dataset. These relationships
are represented with a Bayesian network. Marginal probabilities are entered as soft evidence in the network and adjusted through probabilistic
inference. Our approach offers robust improvements in mean average precision compared to the standard binary relavance approach across all 12
datasets involved in our experiments. The discovery process helps interesting implicit knowledge to emerge, which could be useful in itself.
Keywords: multi-label learning, label relationships, Bayesian networks
1
Introduction
Learning from multi-label data has received a lot of attention from the machine
learning and data mining communities in recent years. This is partly due to the
multitude of practical applications it arises in, and partly due to the interesting
research challenges it presents, such as exploiting label dependencies, learning
from rare labels and scaling up to large number of labels [1].
In several multi-label learning problems, the labels are organized as a tree
or a directed acyclic graph, and there exist approaches that exploit such structure [2,3]. However, in most multi-label learning problems, flat labels are only
provided without any accompanying structure. Yet, it is often the case that
implicit deterministic relationships exist among the labels. For example, in the
ImageCLEF 2011 photo annotation task, which motivated the present study, the
2
C. Papagiannopoulou, G. Tsoumakas, I. Tsamardinos
learning problem involved 99 labels4 without any accompanying semantic metadata, among which certain deterministic relationships did exist. For example,
among the 99 labels were several groups of mutually exclusive labels, such as the
four seasons autumn, winter, spring, summer and the person-related labels single
person, small group, big group, no persons. It also included several positive entailment (consequence) relationships, such as river → water and car → vehicle.
Hierarchies accompanying multi-label data model positive entailment via their
is-a edges, but do not model exclusion relationships.
These observations motivated us to consider the automated learning of such
deterministic relationships as potentially interesting and useful knowledge, and
the exploitation of this knowledge for improving the accuracy of multi-label
learning algorithms. While learning and/or exploiting deterministic relationships
from multi-label data is not new [4], little progress has been achieved in this direction since then. Past approaches exhibit weaknesses such as being unsuccesful
in practice [4], lacking formal theoretical grounding [5,6] and being limited to
existing is-a relationships [3].
Given an unlabeled instance x, multi-label models can output a bipartition of
the set of labels into relevant and irrelevant to x, a ranking of all labels according
to relevance with x, marginal probabilities of relevance to x for each label or even
a joined probability distribution for all labels. The latter is less popular due to
the exponential complexity it involves [7]. Among the rest, marginal probabilities
are information richer, as they can be cast into rankings after tie breaking and
into bipartitions after thresholding. They are also important if optimal decision
making is involved in the application at hand, which is often the case.
This work presents a sound probabilistic method for enforcing adherence of
the marginal probabilities of a multi-label model to automatically discovered deterministic relationships among labels. We focus on two kinds of relationships.
The first one concerns pairwise positive entailment: pairs of labels, where presence of one label implies presence of the other in all instances of a dataset. The
second one concerns exclusion: sets of labels that do not coexist at the same
instances of a dataset. These relationships are represented with a Bayesian network. Marginal probabilities are entered as soft evidence in the network and adjusted through probabilistic inference. Our approach offers robust improvement
in mean average precision compared to the standard binary relavance approach
across all 12 datasets involved in our experiments. The discovery process helps
interesting implicit knowledge to emerge, which could be useful in itself.
The rest of this paper is organized as follows. Section 2 introduces our approach. In particular, Section 2.1 discusses the discovery and Section 2.2 the
exploitation of entailment relationships. Section 3 presents related work and contrasts it with our approach. Section 4 presents the empirical work, with Section
4.1 discussing datasets and experimental setup, Section 4.2 presenting samples
of the knowledge discovered by our approach and Section 4.3 discussing comparative prediction results against binary relevance. Finally, Section 5 summarizes
the conclusions of this work and suggests future work directions.
4
http://www.imageclef.org/system/files/concepts_2011.txt
Discovering & Exploiting Entailment Relationships in Multi-Label Learning
2
2.1
3
Our Approach
Discovering Entailment Relationships
Let A and B be two labels with domain {f alse, true}. For simplicity, we will
be using the common shortcut notation a, ¬a, b and ¬b instead of A = true,
A = f alse, B = true and B = f alse respectively. The following four entailment
relationships can arise between the two labels:
1.
2.
3.
4.
a → b and equivalent contrapositive ¬b → ¬a
b → a and equivalent contrapositive ¬a → ¬b
a → ¬b and equivalent contrapositive b → ¬a
¬a → b and equivalent contrapositive ¬b → a
(positive entailment)
(positive entailment)
(exclusion)
(coexhaustion)
Figure 1 presents a contingency table for labels A and B, based on a multilabel dataset with S + T + U + V training examples. Positive entailment corresponds to T = 0 or U = 0, exclusion to S = 0 and coexhaustion to V = 0.
Furthermore, S = V = 0 corresponds to mutually exclusive and completely
exhaustive labels, while T = U = 0 corresponds to equivalent labels.
a
¬a
b
S
T
¬b
U
V
Fig. 1. Contingency table for labels A and B
In this work, we focus on discovering pairwise positive entailment relationships as well as exclusion relationships among two or more labels. For a multilabel dataset with q labels, it is easy to extract all four types of pairwise entailment relationships from the corresponding contingency tables in O(q 2 ) time
complexity. For discovering exclusion relationships among more than two labels,
we follow the paradigm of the Apriori algorithm [8] in order to find all maximal sets of mutually exclusive labels, such that each of them is not a subset of
another. Starting from the pairwise exclusion relationships, we find triplets of
mutual exclusive labels, then quads and so on.
As a toy example, consider the label values of a multi-label dataset with 6
labels that are given in Table 1, where to improve legibility we have used a value
of 1 to represent true and a value of 0 to represent false. Our approach would in
this case extract the positive entailment relationships a → b, a → c, b → c and
d → c, and an exclusion relationship for the set of labels {A, E, F }.
2.2
Exploiting Entailment Relationships
Motivated by the goal of theoretically sound correction of the marginal probabilities Px (λ), obtained by a multi-label model for each label λ given instance
4
C. Papagiannopoulou, G. Tsoumakas, I. Tsamardinos
Table 1. A toy multi-label dataset with 10 samples and 6 labels.
A
B
C
D
E
F
1
1
0
0
1
0
0
0
0
0
1
1
0
1
1
1
0
0
0
0
1
1
0
1
1
1
1
0
1
0
0
1
0
0
0
1
1
0
0
0
0
0
1
1
0
0
1
0
0
0
0
0
0
0
0
1
0
1
0
1
x, according to the background knowledge expressed by the discovered relationships, we propose using a Bayesian network for the representation of these
relationships. This network initially contains as many nodes as the labels, with
each node representing the conditional probability Px (λ) of corresponding label
λ, given instance x, with uniform prior.
We represent the entailment relationship a → b, among labels A and B,
by adding a link from node A to node B. We set the conditional probability
table (CPT) associated with node B to contain probabilities Px (b|a) = 1 and
Px (b|¬a) = 0. This is easily generalized in the case of multiple relationships a1 →
b, . . . , ak → b, to a CPT with Px (b|A1 , . . . , Ak ) = 1 if A1 ∨ . . . ∨ Ak = true and
Px (b|A1 , . . . , Ak ) = 0 otherwise (i.e. when A1 ∨ . . . ∨ Ak = f alse or equivalently
when ¬A1 ∧ . . . ∧ ¬Ak = true). Such a CPT renders node B deterministic.
Our representation assumes that all causes of B have been considered, which is
seldom true for typical multi-label datasets. To deal with this discrepancy, we
add an additional parent of B as leak node, corresponding to a new virtual label
whose value is set to: (i) false in those training examples where B = f alse, (ii)
true in those training examples where B = true and all other parents of B are
false, and (iii) false for the rest of the training examples. Note that for the last
category of examples where B = true and at least one other parent is true, the
value of the leak node could also be set to true instead of false, but the choice of
false should lead to semantically simpler virtual labels that are easier to learn.
Redundant relationships due to the transitivity property of positive entailment
are not represented in the network.
We represent the mutual exclusion relationship among labels A1 , . . . , Ak by
adding a new boolean deterministic node B as common child of all of these labels. We set the CPT of this node to contain probabilities Px (b|A1 , . . . , Ak ) = 1
if one and only one of the parents is true and Px (b|A1 , . . . , Ak ) = 0 otherwise. We
consider that this node is true as observed evidence (B = true). Our representation assumes that labels A1 , . . . , Ak cover all training examples, which usually
is not the case for typical multi-label datasets. We deal with this discrepancy
similarly to the case of positive entailment. In specific, we add an additional
Discovering & Exploiting Entailment Relationships in Multi-Label Learning
5
parent of B as leak node, corresponding to a new virtual label whose value is
set to: (i) true in those training examples where all other parents of B are false,
and (ii) false in all other training examples.
Continuing the toy example of the previous section, Figure 2 shows the network that our approach would construct to represent the discovered knowledge.
Fig. 2. A network that represents labels A, B, C, D, E, F , entailment relationships a →
b, a → c, b → c, d → c and a mutual exclusion relationship among labels A, E and F .
Having constructed the network, we then use any multi-label algorithm that
can provide marginal probability estimates to fit the extended (with virtual
labels corresponding to leak nodes) training set. For an unlabelled instance x,
we first query the multi-label model in order to obtain probability estimates
Px (λ) for each of the labels λ, including virtual labels. These are then entered
into the network as soft (also called virtual) evidence [9]. A probabilistic inference
algorithm is then used to update the probability estimates for each of the labels,
leading to probabilities that are consistent with the discovered relationships.
Table 2 exemplifies the probability correction process. Each column corresponds to a particular label/node. The first row contains arbitrary probability
estimates for an instance. These probabilities violate the relationships we have
discovered. The probability of label C should be larger than that of B, D and
LeakBD, but only the last constraint holds. In addition the probability of B
should be larger than that of A and LeakA, none of which holds. Finally, the
probabilities of A, E, F and LeakEF A should add to 1, which does not hold.
6
C. Papagiannopoulou, G. Tsoumakas, I. Tsamardinos
The third row gives the adjusted probabilities according to our approach, which
are now consistent with the discovered relationships.
Table 2. Marginal probabilities obtained by the multi-label model for each of the
labels, including the virtual ones corresponding to leak nodes (before) and updated
probabilities after probabilistic inference (after) for the network in Figure 2.
Before
After
3
A
LeakA
B
D
LeakBD
C
F
E
LeakEF A
0.4
0.022
0.35
0.082
0.25
0.096
0.6
0.031
0.01
0.05
0.2
0.345
0.3
0.064
0.85
0.85
0.3
0.064
Related Work
The idea of discovering and exploiting label relationships from multi-label data
was first discussed in [4], where relationships were reffered to as constraints. An
interesting general point of [4] was that label constraints can be exploited either
at the learning phase or at a post-processing phase. In addition, it presented four
basic types of constraints, which correspond to the four entailment relationships,
and noted that more complex types of constraints can be represented by combining these basic constraints with logical connectors. For discovering constraints, it
proposed association rule mining, followed by removal of redundant association
rules that were more general than others. For exploiting constraints, it proposed
two post-processing approaches for the label ranking task in multi-label learning.
These approaches correct a predicted ranking when it violates the constraints by
searching for the nearest ranking that is consistent with the constraints. They
only differ in the function used to evaluate the distance between the invalid and
a valid ranking. As they focus on label ranking, these approaches cannot be used
for correcting the marginal probability estimates of the labels. Results on synthetic data with known constraints showed that constraint exploitation can be
helpful, but results on real-world data and automatically discovered constraints
did not lead to predictive performance improvements.
An approach for exploiting label relationships in order to update the marginal
probabilites was presented in [5,6]. For mutually exclusive labels that cover all
training examples, the idea was to keep the highest probability and set the rest
of the marginal probabilities to zero. For mutually exclusive labels that don’t
cover all training examples, this rule was used if the highest probability was
larger than a threshold. For an entailment relation a → b, the idea was to set
the marginal probability of b to that of a when the marginal probability of a is
larger than that of b and larger than 0.5. This post-processing phase approach,
called concept reasoning, leads to marginal probabilities that are consistent with
the label relationships, but lacks a sound probabilistic inference framework.
The exploitation of parent-child relationships of a given hierarchical taxonomy of labels via a Bayesian network structure was proposed in [3]. In particular,
Discovering & Exploiting Entailment Relationships in Multi-Label Learning
7
parent labels were conditioned on their child labels, and the output of the binary
classifier for each label was conditioned on the corresponding label. Given the
classifier outputs as evidence, a probabilistic inference algorithm is then applied
to obtain hierarchically consistent marginal probabilities for each label. Our
approach shares the same principle of using a Bayesian network structure for
enforcing consistency of marginal probabilities, but: (i) constructs the structure
automatically according to relationships discovered from the data, (ii) represents
mutual exclusion relationships in addition to positive entailment, and (iii) builds
additional binary models for virtual labels corresponding to leak nodes in the
deterministic relations represented by the network.
A Bayesian network structure to encode the relationships among labels as
well as between the input attributes and the labels was presented in [10]. The
proposed algorithm, called LEAD, starts by building binary relevance models
and continues by learning a Bayesian network on the residuals of these models.
Then another set of binary models is learned, one for each label, but this time
incorporating the parents of each label according to the constructed Bayesian
network as additional features. For prediction, these models are queried top-down
according to the constructed Bayesian network. LEAD implicitly discovers probabilistic relationships among labels by learning the Bayesian network structure
directly from data, while our approach explicitly discovers deterministic positive entailment and mutual exclusion relationships among labels from the data,
which are then used to define the network structure.
A method for uncovering deterministic causal structures is introduced in
[11]. Similarly to our work, it aims at constructing a Bayesian network out of
automatically discovered deterministic relationships. Important differences are
that it does not consider latent variables, as in our representation of exclusions
and our treatment of unaccounted causes of a label via leak nodes. It therefore
requires relationships to be supported from the full dataset, which limits its
practical usefulness, as rarely such relationships appear in real-world data.
4
Experiments
4.1
Setup
We use the binary relevance (BR) problem transformatiom method for learning
multi-label models, which learns one binary model per label. As our approach
employs probabilistic inference to correct the marginal probability of each label,
we consider important to start with good probability estimates. We therefore use
Random Forest [12] as the learning algorithm for training each binary model,
since it is known to provide relatively accurate probability estimates without
calibration [13]. We use the implementations of BR and Random Forest from
Mulan [14] and Weka [15] respectively. Our approach is also implemented in
Mulan, utilizing the jSMILE library5 .
5
http://genie.sis.pitt.edu/
8
C. Papagiannopoulou, G. Tsoumakas, I. Tsamardinos
We experiment on the 12 multi-label datasets that are shown in Table 3. We
adopt a 10-fold cross-validation process and discuss the results in terms of mean
average precision (MAP) across all labels, as this was also the measure of choice
in the ImageCLEF 2011 challenge that motivated this work. MAP is also the
standard evaluation measure in multimedia information retrieval.
Table 3. A variety of multi-label datasets and their statistics: number of labels, examples, discrete and continuous features, and the mean number of inferring and excluding
constraints that were discovered in the training set.
dataset
Bibtex
Bookmarks
Emotions
Enron
ImageCLEF2011
ImageCLEF2012
IMDB
Medical
Scene
Slashdot
TMC2007
Yeast
variables
source labels examples disc. cont.
[16]
[16]
[17]
url6
[18]
[19]
[20]
[21]
[22]
[20]
[23]
[24]
159
208
6
53
99
94
28
45
6
20
22
14
7395
87856
593
1702
8000
15000
120919
978
2407
3782
28596
2417
1836
2150
0
1001
19540
10000
0
1449
0
0
49060
0
entailment
positive
exclusion
0
11±0
76.2 ± 2.3
0
4.1 ± 0.3
1±0
72
0
1.1±0.3
0
13 ± 15.2 480.7 ± 98.4
0
27.9 ± 0.9 325.4 ± 31.9
0
1.2
277.9 ± 43.5
1001
0
21.6 ± 1.2
0
6.3 ± 1
30.7 ± 7.2
294
0
4±0
1079
0
23.2 ± 1.2
0
0
7.5 ± 1.1
103
3±0
2.3 ± 0.5
We set the minimum support of discovered relationships to just 2 training examples (avoid single-point generalization). For positive entailment, support refers
to the positive training examples of the antecedent label, while for exclusion it
refers to the sum of the positive examples of all participating labels (S and T +U
in Figure 1 respectively). In Section 5, we discuss ideas for improved support
setting. In 3 datasets (Bibtex, Bookmarks, Medical) exclusion discovery did not
finish within a week, while in 3 other datasets (Enron, ImageCLEF2011/2012) a
large number of exclusion rules was discovered that caused memory outage during network construction in jSMILE. We increased the support exponentially (4,
8, 16, ...) until these issues were resolved.
4.2
Relationships
This section discusses the relationships discovered by our approach. At each fold
of the cross-validation, different relationships can be discovered. Table 3 reports
the mean of the discovered relationships across all folds. We only discuss here
those appearing in all folds. The relationships of Yeast and TMC2007 cannot be
discussed, as we were unable to determine their label semantics. Due to space
limitation, we refrain from discussing less noteworthy relationships (Slashdot).
6
http://bailando.sims.berkeley.edu/enron_email.html
Discovering & Exploiting Entailment Relationships in Multi-Label Learning
9
Bibtex and Bookmarks. The labels of the Bibtex and Bookmarks datasets
correspond to tags assigned to publications and bookmarks respectively by users
of the social bookmark and publication sharing system Bibsonomy7 .
Table 4 presents the 11 positive entailments that were found in all folds
for Bibtex. These apparently correspond to a hierarchy relationship between
label statphys23, an international conference on statistical physics8 and the 11
topics of this conference. Table 5 presents the 4 positive entailments that were
discovered in all folds for Bookmarks. The first two most probably belong to a
single user who also used the tags film and kultur whenever he/she used the tag
filmsiveseenrecently. This is, by the way, an example of an unfortunate choice
of tag name, as it involves a time adverb recently, whose meaning changes over
time. The last two are examples of discovered is-a relationships, as paddling isa (water) sport. We conclude that our approach manages to discover positive
entailment relationships of social origin.
Table 4. Positive entailment relationships discovered in the Bibtex dataset.
id
relationship
sup id
1 nonequilibrium → statphys23 68
2
topic1 → statphys23 86
3
topic2 → statphys23 75
4
topic3 → statphys23 151
5
6
7
8
relationship
topic4
topic6
topic7
topic8
→
→
→
→
statphys23
statphys23
statphys23
statphys23
sup id
relationship
sup
62 9 topic9 → statphys23 82
63 10 topic10 → statphys23 130
129 11 topic11 → statphys23 143
73
Table 5. Positive entailment relationships discovered in the Bookmarks dataset.
id
1
2
relationship
filmsiveseenrecently
filmsiveseenrecently
→
→
film
kultur
sup
id
370
370
3
4
relationship
paddling
paddling
→
→
sports
watersports
sup
379
379
A minimum support of 128 and 2048 examples led to a mean number of 76.2
and 1 exclusion relationships per fold in Bibtex and Bookmarks respectively.
Due to space limitations we refrain from reporting the 18 exclusions discovered
in all folds of Bibtex. In Bookmarks the relationship involved the following pair
of labels: {computing; video}. No wonder it did not help improve accuracy.
Emotions. The 6 labels in the Emotions dataset concern 3 pairs of opposite emotions of the Tellegen-Watson-Clark model of mood: (quiet-still, amazedsurprised), (sad-lonely, happy-pleased) and (relaxing-calm, angry-aggresive) that
correspond to the axes of engangement, pleasantness and negative affect respectively. The Tellegen-Watson-Clark model includes a fourth axis that concerns
7
8
http://www.bibsonomy.org/
http://www.statphys23.org/
10
C. Papagiannopoulou, G. Tsoumakas, I. Tsamardinos
positive affect. The single discovered exclusion relationship, concerns the pair of
opposite labels related to engangement: {quiet-still; amazed-surprised}.
Enron. The 53 labels of this dataset are organized into 4 categories: coarse
genre, included/forwarded information, primary topics, which is applicable if
coarse genre Company Business, Strategy, etc is selected, and emotional tone
if not neutral. There are 13 positive entailment relationships by definition, as
there are 13 labels in the primary topics category, which are children of label
Company Business, Strategy, etc.
Table 6 presents the 3 positive entailment relationships that were discovered
in all folds. Relationship 1 is among the 13 positive entailments we already
knew from the description of the labels, as label company image – changing /
influencing is a primary topic and therefore a child of label Company Business,
Strategy, etc. Our approach manages to discover explicit is-a relationships, when
these are present in the training data.
Table 6. Positive entailment relationships discovered in the Enron dataset.
id
relationship
sup
1 company image – changing / influencing → Company Business, Strategy, etc.
63
2
triumph / gloating → Company Business, Strategy, etc.
3
3
triumph / gloating → regulations and regulators (includes price caps) 3
A minimum support of 8 examples led to a mean number of 480.7 exclusion relationships per fold. Only one relationship was present in all folds and
involved the following interesting pair of concepts {Company Business, Strategy,
etc; friendship / affection}. The conclusion is that there is no room for affection
in the business world.
ImageCLEF 2011 and 2012. Labels in these two datasets correspond to 99
and 94 concepts respectively covering a variety of concepts for image annotation.
A difference in the 2012 version of the contest was that concepts being superclasses of other concepts (e.g. Animal, Vehicle and Water) were removed. This is
why in the 2011 version of the dataset 27 positive entailments were found in all
folds (see Table 7), in contrast with the following single one in the 2012 version:
Spider → QuantityNone, with a support of 16 examples. The consequent of this
relationship refers to the number of people that appear in the photo.
A minimum support of 32 and 64 examples led to a mean number of 325.4 and
277.9 exclusion relationships per fold in the 2011 and 2012 version of ImageCLEF
respectively. Due to space limitations we refrain from reporting the 24 and 21
exclusion relationships that were discovered in all folds of the 2011 and 2012
version of ImageCLEF respectively.
Discovering & Exploiting Entailment Relationships in Multi-Label Learning
11
Table 7. Positive entailment relationships discovered in the ImageCLEF2011 dataset.
id
relationship
1 Desert → Outdoor
2 Desert → Day
3 Spring → NeutralLight
4 Flowers → Plants
5
Trees → Plants
6 Clouds → Sky
7
Lake → Outdoor
8
Lake → Water
9
River → Water
sup id
30
30
105
367
890
1104
89
89
130
10
11
12
13
14
15
16
17
18
relationship
Sea
Sea
Cat
Dog
Horse
Horse
Horse
Bird
Insect
→
→
→
→
→
→
→
→
→
Outdoor
Water
Animals
Animals
Outdoor
Day
Animals
Animals
Animals
sup id
222
222
61
211
28
28
28
183
91
relationship
19
Fish → Animals
20
Bicycle → NeutralLight
21
Bicycle → Vehicle
22
Car → Vehicle
23
Train → Vehicle
24
Airplane → Vehicle
25 Skateboard → Vehicle
26
Ship → Vehicle
27
Female → Male
sup
25
61
61
268
59
41
12
79
1254
IMDB. Labels of this dataset correspond to 28 movie genres. Table 8 presents
the 15 exclusion relationships that were found in all folds. Labels Film-Noir,
Game-Show and Talk-Show and are the most frequent ones in these relationships.
We notice some obvious exclusions, such as {War, Reality-TV}, {Game-Show,
Crime} and {Talk-Show, Fantasy}. On the other hand, such relationships could
be a source of inspiration for innovative (or provocative) producers and directors
contemplating unattempted combinations of genres.
Table 8. Exclusion relationships discovered in the IMDB dataset.
{Film-Noir, Game-Show, Adult, News}, {Film-Noir, Adult, Family}, {Game-Show, Horror}
{Film-Noir, Game-Show, Western}, {Film-Noir, Documentary}, {Game-Show, Thriller}
{Film-Noir, Talk-Show, Western}, {Talk-Show, Adventure}, {Game-Show, Crime}
{Game-Show, Adult, Biography}, {Film-Noir, Comedy}, {War, Reality-TV}
{Film-Noir, Western, Reality-TV}, {Talk-Show, Mystery}, {Talk-Show, Action}
Medical. Labels of this dataset correspond to 45 codes/descriptions of the 9th
revision of the International Statistical Classification of Diseases (ICD). Table
9 presents the 3 positive entailment relationships that were found in all folds.
The support of these relationships is quite weak (up to 4 examples), yet it apparently corresponds to valid, yet already known, medical knowledge. For example, the top page returned by Google for the query “hydronephrosis congenital
obstruction of ureteropelvic junction”, contains the following excerpt: Ureteropelvic junction obstruction is the most common pathologic cause of antenatally
detected hydronephrosis. Future work could apply our approach to larger datasets
in search of unknown medical knowledge.
A minimum support of 16 examples led to a mean number of 30.7 exclusion relationships per fold. Only one relationship was present in 9 out of the
10 folds (none in all folds) and involved the following 5 concepts: {753.0 Renal agenesis and dysgenesis; 599.0 Urinary tract infection, site not specified;
596.54 Neurogenic bladder NOS; 780.6 Fever and other physiologic disturbances
of temperature regulation; 493.90 Asthma,unspecified type, unspecified}.
12
C. Papagiannopoulou, G. Tsoumakas, I. Tsamardinos
Table 9. Positive entailment relationships discovered in the Medical dataset.
id
relationship
sup
1 753.21 Congenital obstruction of ureteropelvic junction → 591 Hydronephrosis
4
2 786.05 Shortness of breath
→ 753.0 Renal agenesis and dysgenesis 4
3 787.03 Vomiting alone
→ 753.0 Renal agenesis and dysgenesis 3
Scene. Labels in this dataset correspond to 6 different scenery concepts. The
following 2 exclusion relationships were found in all folds: {Sunset, Fall Foliage,
Beach} and {Sunset, Fall Foliage, Urban}. These relationships do not really
correspond to interesting knowledge, as images with such concept combinations
can be found on the Web. However, only a few of the top 30 images returned
by Google image search do cover all three concepts of these relationships, a fact
that highlights the co-occurence rarity of these concepts. Still, this knowledge
should not be generalized beyond the particular limited collection of 2407 images
from the stock photo library of Corel.
4.3
Results
Tables 10 and 11 present the average MAP of BR across the 10 folds of the
cross-validation: (i) in its standard version, and (ii) with the exploitation of
positive entailment relationships (Table 10) and exlusion relationships (Table
11) via our approach. They also present the percentage of improvement brought
by our approach, the minimum support and the average number of discovered
relationships across the 10 folds of the cross-validation.
Table 10. Mean MAP of BR with and without exploitation of positive entailments via
our approach, along with the percentage of improvement, the minimum support and
the average number of discovered relationships.
dataset
Bibtex
Bookmarks
Enron
ImageCLEF2011
ImageCLEF2012
Medical
Yeast
standard BR
0.2152
0.1474
0.2810
0.2788
0.2376
0.5997
0.4545
±
±
±
±
±
±
±
0.0114
0.0041
0.0476
0.0113
0.0089
0.0768
0.0145
minsup positive entailment impr% #relations
2
2
2
2
2
2
2
0.2168
0.1475
0.2821
0.2871
0.2380
0.6134
0.4617
±
±
±
±
±
±
±
0.0109
0.0041
0.0480
0.0100
0.0084
0.0661
0.0141
0.279
0.068
0.391
2.977
0.168
2.284
1.584
11 ± 0
4.1 ± 0.3
3.8 ± 0.9
27.9 ± 0.9
1.2 ± 0.9
6.3 ± 1
3±0
In Table 10 we notice that in all datasets where positive entailments were
discovered, the exploitation of these relationships led to an increased MAP. Applying the Wilcoxon signed rank test we find a p-value of 0.0156, indicating
that the the improvements are statistically significant. In some datasets, such as
bookmarks, improvements are small, while in others, such as ImageCLEF2011,
improvements are large. The correlation coefficient between the percentage of
Discovering & Exploiting Entailment Relationships in Multi-Label Learning
13
improvement and the number of relationships divided by the number of labels is
0.913, which supports the argument that the more relationships we discover per
label, the higher the improvements in MAP. This was expected to an extend as
MAP is a mean of the average precision across all labels. Focusing only on the
affected labels, we would notice larger improvements.
Table 11. Mean MAP of BR with and without exploitation of exlusions via our approach, along with the percentage of improvement, the minimum support and the
average number of discovered relationships.
dataset
standard BR
minsup
exclusion
impr%
#relations
Bibtex
Bookmarks
Emotions
Enron
ImageCLEF2011
ImageCLEF2012
IMDB
Medical
Scene
Slashdot
TMC2007
Yeast
0.2152
0.1474
0.7163
0.2810
0.2788
0.2376
0.0900
0.5997
0.8139
0.3982
0.3276
0.4545
±
±
±
±
±
±
±
±
±
±
±
±
0.0114
0.0041
0.0339
0.0476
0.0113
0.0089
0.0030
0.0768
0.0146
0.0323
0.0069
0.0145
128
2048
2
8
32
64
2
16
2
2
2
2
0.2117
0.1473
0.7265
0.2573
0.2840
0.2308
0.0938
0.6223
0.8385
0.4452
0.3474
0.4625
±
±
±
±
±
±
±
±
±
±
±
±
0.0126
0.0040
0.0368
0.0476
0.0132
0.0090
0.0026
0.0597
0.0140
0.0422
0.0075
0.0163
-1.626 76.2 ± 2.3
-0.068
1±0
1.424
1.1 ± 0.3
-8.434 480.7 ± 98.4
1.865 325.4 ± 31.9
-2.862 277.9 ± 43.5
4.222 21.6 ± 1.2
3.769 30.7 ± 7.2
3.023
4±0
11.803 23.2 ± 1.2
6.044
7.5 ± 1.1
1.760
2.3 ± 0.5
Bibtex
Enron
ImageCLEF2011
ImageCLEF2012
0.2152
0.2810
0.2788
0.2376
±
±
±
±
0.0114
0.0476
0.0113
0.0089
256
32
128
256
0.2165
0.2816
0.2881
0.2391
±
±
±
±
0.0114
0.0477
0.0129
0.0087
0.604
0.214
3.336
0.631
2.8 ± 0.4
22.2 ± 1.8
56.6 ± 3.3
39.8 ± 1.1
In the upper part of Table 11 we notice that there are 8 datasets where our
approach leads to improvements in MAP, but another 4 where it leads to reductions. The p-value of the Wilcoxon signed rank test is 0.1099, indicating that the
results are statistically insignificant even at the 0.1 level (though marginally).
We argue that one of the main reasons underlying the negative results of our
approach is the large number of spurious exclusion relationships that can be discovered in datasets with a larger number of labels. Indeed, the rarer two labels
are, the higher the probability of being considered as mutually exclusive, irrespectively of their actual semantic relationship. We therefore further experiment
with exponentially increasing minimum support values in Bibtex, Enron and the
ImageCLEF datasets, where a large number of relationships was discovered, until we reach a small set of relationships. The bottom part of Table 11 shows the
results achieved through this process. We see that MAP improvements are now
achieved for Bibtex, Enron and ImageCLEF2012, while the already improved
MAP of ImageCLEF2011 increases. Note that these are not the best achievable
results, both because we only tried a few minimum support values and because
our goal was to discover a small number of relationships. Figure 3 shows the
14
C. Papagiannopoulou, G. Tsoumakas, I. Tsamardinos
MAP and number of exclusion relationships for the minimum support values we
tried in ImageCLEF2011.
Fig. 3. MAP and number of exclusion relationships for the 4 different minimum support
values we tried in ImageCLEF2011.
The number of exclusion relationships divided by the number of labels is
again positively correlated with the percentage of improvement, with the coefficient being 0.770 in this case (based on the bottom part of Table 11 for the 4
datasets and ignoring Bookmarks). The coefficient would probably be higher had
we calculated the actual number of labels involved in the exclusion relationships.
The average improvement offered by exclusion (3.4%) is larger than those offered by positive entailment (1.1%), and so is the average number of relationships
(19.3 vs 8.2). Exclusions typically involve more than two labels, while positive
entailments are pairwise and some of them redundant due to the transitivity
property of positive entailment. A deeper analysis should look at the number of
labels involved in each type of relationship, which we leave as future work.
Table 12 shows results of utilizing both types of relationships. We notice
that in Yeast, Medical and ImageCLEF2012 the combination of both types of
relationships leads to larger improvement than their individual improvement.
The combined improvement is smaller than the sum of individual improvements.
This could be due to labels appearing in both types of relationship. For Enron
and ImageCLEF2011 we notice that the combined improvement is smaller than
the largest of the two individual improvements, that of positive entailment. This
is another indication of spurious exclusions existence. Results on Bibtex were not
obtained due to an error from jSMILE, which we are currently investigating.
5
Summary and Future Work
This work has introduced an approach that discovers entailment relationships
among labels within multi-label datasets and exploits them using a sound probabilistic approach that enforces the adherence of the marginal probability estimates of multi-label learning with the discovered background knowledge.
Discovering & Exploiting Entailment Relationships in Multi-Label Learning
15
Table 12. Mean MAP of BR with and without exploitation of both types of relationships via our approach, along with the percentage of improvement.
dataset
Bookmarks
Enron
ImageCLEF2011
ImageCLEF2012
Medical
Yeast
standard BR
0.1474
0.2810
0.2788
0.2376
0.5997
0.4545
±
±
±
±
±
±
0.0041
0.0476
0.0113
0.0089
0.0768
0.0145
minimum support
positive exlusion
2
2
2
2
2
2
2048
8
32
256
16
2
our approach
0.1474
0.2816
0.2846
0.2393
0.6263
0.4677
±
±
±
±
±
±
0.0040
0.0477
0.0129
0.0085
0.0566
0.0141
impr%
0
0.214
2.080
0.716
4.436
2.904
We believe that our approach can be further extended and improved in a
number of directions. An important issue concerns the statistical validity of the
extracted relationships, especially when based on infrequent labels. We are working on automatically selecting the minimum support per relationship in order
to separate chance artifacts from confident findings, which we expect not only
to improve accuracy results, but also to reduce the complexity of the discovery
process. On the opposite direction, it would be also interesting to investigate
whether approximate relations, where the contingency table frequencies may
not necessarily be zero due to noise, can be exploited with improved results.
Another important direction is the generalization of our approach so as to be
able to discover all types of entailments among any number of labels.
On the empirical part of this work, we intend to apply our approach to
additional datasets and to employ evaluation measures, such as log loss and
squared error, that assess the quality of the predicted probabilities. We also
intend to drill-down the accuracy results and discuss the extend of improvement
for each label involved in one or more relationships. Finally, we also intend
to investigate the effect that the quality of predicted probabilities has on our
approach.
References
1. Tsoumakas, G., Zhang, M.L., Zhou, Z.H.: Introduction to the special issue on
learning from multi-label data. Machine Learning 88(1-2) (2012) 1–4
2. Vens, C., Struyf, J., Schietgat, L., Džeroski, S., Blockeel, H.: Decision trees for
hierarchical multi-label classification. Machine Learning 73(2) (2008) 185–214
3. Barutcuoglu, Z., Schapire, R.E., Troyanskaya, O.G.: Hierarchical multi-label prediction of gene function. Bioinformatics 22(7) (2006) 830–836
4. Park, S.H., Fürnkranz, J.: Multi-label classification with label constraints. In:
ECML PKDD 2008 Workshop on Preference Learning. (2008)
5. Mbanya, E., Hentschel, C., Gerke, S., Liu, M., Nürnberger, A., Ndjiki-nya, P.:
Augmenting bag-of-words - category specific features and concept reasoning. In:
Working Notes of CLEF 2010. (2010)
6. Mbanya, E., Gerke, S., Hentschel, C., Ndjiki-nya, P.: Sample selection, category
specific features and reasoning. In: Working Notes of CLEF 2011. (2011)
16
C. Papagiannopoulou, G. Tsoumakas, I. Tsamardinos
7. Dembczynski, K., Cheng, W., Hüllermeier, E.: Bayes optimal multilabel classification via probabilistic classifier chains. In: Proceedings of the 27th International
Conference on Machine Learning (ICML). (2010)
8. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules in large
databases. In: Proceedings of the 20th International Conference on Very Large
Data Bases. VLDB ’94, San Francisco, CA, USA, Morgan Kaufmann Publishers
Inc. (1994) 487–499
9. Korb, K., Nicholson, A.E.: Bayesian Artificial Intelligence. CRC Press, Inc., Boca
Raton, FL, USA (2003)
10. Zhang, M.L., Zhang, K.: Multi-label learning by exploiting label dependency. In:
Proceedings of the 16th ACM SIGKDD international conference on Knowledge
discovery and data mining. KDD ’10, New York, NY, USA, ACM (2010) 999–1008
11. Baumgartner, M.: Uncovering deterministic causal structures: a boolean approach.
Synthese 170(1) (2009) 71–96
12. Breiman, L.: Random forests. Machine Learning 45(1) (October 2001) 5–32
13. Niculescu-Mizil, A., Caruana, R.: Predicting good probabilities with supervised
learning. In: Proceedings of the 22nd international conference on Machine learning.
ICML ’05, New York, NY, USA, ACM (2005) 625–632
14. Tsoumakas, G., Spyromitros-Xioufis, E., Vilcek, J., Vlahavas, I.: Mulan: A java
library for multi-label learning. Journal of Machine Learning Research (JMLR) 12
(July 12 2011) 2411–2414
15. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The
weka data mining software: An update. SIGKDD Explorations 11 (2009)
16. Katakis, I., Tsoumakas, G., Vlahavas, I.: Multilabel text classification for automated tag suggestion. In: Proceedings of the ECML/PKDD 2008 Discovery
Challenge, Antwerp, Belgium (2008)
17. Trohidis, K., Tsoumakas, G., Kalliris, G., Vlahavas, I.: Multilabel classification of
music into emotions. In: Proc. 9th International Conference on Music Information
Retrieval (ISMIR 2008), Philadelphia, PA, USA, 2008. (2008)
18. Nowak, S., Nagel, K., Liebetrau, J.: The clef 2011 photo annotation and conceptbased retrieval tasks. In: CLEF (Notebook Papers/Labs/Workshop). (2011)
19. Thomee, B., Popescu, A.: Overview of the imageclef 2012 flickr photo annotation
and retrieval task. In Forner, P., Karlgren, J., Womser-Hacker, C., eds.: CLEF
(Online Working Notes/Labs/Workshop). (2012)
20. Read, J., Pfahringer, B., Holmes, G., Frank, E.: Classifier chains for multi-label
classification. Machine Learning 85(3) (2011) 333–359
21. Pestian, J.P., Brew, C., Matykiewicz, P., Hovermale, D.J., Johnson, N., Cohen,
K.B., Duch, W.: A shared task involving multi-label classification of clinical free
text. In: Proceedings of the Workshop on BioNLP 2007: Biological, Translational,
and Clinical Language Processing. BioNLP ’07, Stroudsburg, PA, USA, Association
for Computational Linguistics (2007) 97–104
22. Boutell, M., Luo, J., Shen, X., Brown, C.: Learning multi-label scene classification.
Pattern Recognition 37(9) (2004) 1757–1771
23. Srivastava, A., Zane-Ulman, B.: Discovering recurring anomalies in text reports
regarding complex space systems. In: Proc. 2005 IEEE Aerospace Conference.
(2005) 3853 –3862
24. Elisseeff, A., Weston, J.: A kernel method for multi-labelled classification. In:
Advances in Neural Information Processing Systems 14. (2002)