Complexity Control in Rule Based Models for
Classification in Machine Learning Context
Han Liu1, Alexander Gegov1 and Mihaela Cocea1
Abstract A rule based model is a special type of computational models, which can
be built by using expert knowledge or learning from real data. In this context, rule
based modelling approaches can be divided into two categories: expert based approaches and data based approaches. Due to the vast and rapid increase in data, the
latter approach has become increasingly popular for building rule based models. In
machine learning context, rule based models can be evaluated in three main dimensions, namely accuracy, efficiency and interpretability. All these dimensions are
usually affected by the key characteristic of a rule based model which is typically
referred to as model complexity. This paper focuses on theoretical and empirical
analysis of complexity of rule based models, especially for classification tasks. In
particular, the significance of model complexity is argued and a list of impact factors against the complexity are identified. This paper also proposes several techniques for effective control of model complexity, and experimental studies are reported for presentation and discussion of results in order to analyze critically and
comparatively the extent to which the proposed techniques are effective in control
of model complexity.
Keywords: Machine Learning, Rule Based Models, Model Complexity, Complexity Control, Rule Based Classification
1
Han Liu
School of Computing, University of Portsmouth, Buckingham Building, Lion Terrace, Portsmouth, PO1 3HE, United Kingdom, Email: Han.Liu@port.ac.uk
Alexander Gegov,
School of Computing, University of Portsmouth, Buckingham Building, Lion Terrace, Portsmouth, PO1 3HE, United Kingdom, Email: Alexander.Gegov@port.ac.uk
Mihaela Cocea
School of Computing, University of Portsmouth, Buckingham Building, Lion Terrace, Portsmouth, PO1 3HE, United Kingdom, Email: Mihaela.Cocea@port.ac.uk
2
1 Introduction
A rule based model is a special type of computational models, which can be used
for the purpose of knowledge discovery and predictive modelling. A rule based
model consists of a set of rules, which can be built by using expert knowledge or by
learning from real data. From this point of view, rule based modelling approaches
can be categorized into expert based approaches and data based approaches. Due to
the vast and rapid increase in data, the latter approach of modelling has become
increasingly popular. The data based approach typically involves learning of rules
for building of rule based models. In practice, rule based models can be used for
different tasks such as classification, regression and association. From this point of
view, rules extracted from a rule based model can be categorized into classification
rules, regression rules and association rules. Both classification and regression rules
can be viewed as a special type of association rules, due to the fact that these two
types of rules represent the relationship between multiple independent variables and
a single dependent variable, whereas association rules represent the relationship between multiple independent variables and multiple dependent variables. The main
difference between classification rules and regression rules is that the output attribute on the right hand side must be discrete for the former and continuous for the
latter [1]. Therefore, classification rules are generally used for categorical predictions whereas regression rules are used for numerical predictions.
In machine learning research, rule based models can be evaluated in three main
dimensions namely accuracy, efficiency and interpretability. One of the important
characteristics of rule based models is referred to as model complexity, which usually impacts on the above three main dimensions. As described in [2], complex
models are usually less generalized than simple models, which are likely to result
in overfitting. This problem typically results in loss of accuracy for predictive modelling and decrease of the level of reliability for knowledge extracted from a rule
based model. On the other hand, as analyzed in [3], complex models usually lead to
less efficient prediction on test instances and poor interpretability to people for the
purpose of knowledge discovery. On the basis of the above description, this paper
focuses on theoretical and empirical analysis of complexity of rule based models
and how the complexity can be controlled effectively.
The rest of this paper is organized as follows: Section 2 argues why complexity
control is important for rule based models to be applied in real world. Section 3
identifies a list of impact factors that affect complexity of rule based models as well
as analyze in depth how these identified factors impact on model complexity. Section 4 introduces two main techniques, namely scaling up algorithms and scaling
down data, towards effective complexity control in rule based models. In particular,
scaling up algorithms involves proper use of statistical heuristics for rule generation
and effective assistance from rule simplification, and scaling down data involves
effective pre-processing of training data, which includes feature selection, feature
extraction and attribute discretization. Section 5 describes setup of the experimental
3
studies, and results are presented and discussed critically and comparatively in order
to show the extent to which the techniques used for complexity reduction are effective. Section 6 summaries the contributions of this paper and provides some suggestions for further directions towards advances in this research area.
2 Significance of Complexity Control
As mentioned in Section 1, model complexity usually impacts on accuracy, efficiency and interpretability. This section justifies why it is important to effectively
control the complexity of a rule based model.
As mentioned in [3], rule based models can be used in practice for the purpose
of knowledge discovery and predictive modelling. For the latter purpose, rule based
models are used in a black box manner, which means that the emphasis is on the
mapping from inputs to outputs without interpretation of the reasons, i.e. to predict
the values of the outputs on the basis of the values of the inputs. In this context, rule
based models need to be both accurate and efficient in predicting unseen instances.
On the other hand, for the purpose of knowledge discovery, rule based models are
used in a white box manner which should allow the interpretation of the reasons for
the mapping. In this context, rule based models need to be both accurate and interpretable for people to use knowledge extracted from the models, i.e. to see a list of
causal relationships by going through a set of rules. On the basis of the above description, model complexity can have a significant impact on the accuracy, efficiency and interpretability of rule based models.
In terms of accuracy, on the basis of the same data, more complex models usually
have lower generality than simpler models, which are likely to result in models performing well on the training data but poorly on the testing data. The above case is
commonly known as overfitting. As mentioned in [2], one of the biases that arise
with rule based models is referred to as overfitting avoidance bias [4, 5], which
means that rule learning algorithms prefer simpler rules to more complex rules under the expectation that the accuracy on the training data is lower but that on the
testing data it would be higher.
In terms of efficiency, more complex models are usually less efficient than simpler models in predicting unseen instances. This is because of the fact that predictions by a rule based model are made through checking the rules extracted from the
model [3]. In this context, a model that consists of a large number of rule terms is
considered as a complex model whereas a model that is made up of a small number
of rule terms is considered as a simple model. In the worst case, it always takes
longer to make a prediction using a more complex model than using a simpler
model, if the two models are represented in the same structure [3]. Section 3 will
give more details on complexity analysis in terms of rule representation.
In terms of interpretability, more complex models are usually less interpretable
for people to read and understand knowledge extracted from the rule based models.
4
This is because of the fact that people need to read each of the rules extracted from
a particular model in order to see any causal relationships between the inputs and
the outputs. In this context, a model that consists of a large number of complex rules
is considered as a complex model whereas a model that is made up of a small number of simple rules is considered as a simple model. In other words, a model that
consists of a large number of complex rules is like an article that is made up of a
large number of long paragraphs, which usually makes it difficult and cumbersome
for people to follow.
On the basis of the above description, model complexity needs to be considered
as an important impact on accuracy, efficiency and interpretability, and thus needs
to be controlled effectively.
3 Impact Factors for Model Complexity
Section 2 justified the significance of complexity control for rule based models towards generation of accurate, efficient and interpretable models in practice. This
section identifies a list of impact factors for model complexity and justifies how
these factors would affect the complexity of rule based models. In particular, the
strategy involved in a learning algorithm and the characteristic of a data set are
viewed as two main impact factors as already identified in [6]. Also, ways to impact
on the model complexity are analyzed in the context of rule based classification.
3.1 Learning Strategy
In terms of learning algorithms, the strategy of rule generation usually significantly
affects the model complexity. As mentioned in [7, 8], the generation of classification rules can be divided into two categories: ‘divide and conquer’ [9] and ‘separate
and conquer’ [2]. The former is also referred to as Top-Down Induction of Decision
Trees (TDIDT) due to the fact that this learning approach aims to generate classification rules in the form of a decision tree. The latter is also referred to as covering
approach because of the fact that this approach aims to learn a set of if-then rules
sequentially, each of which covers a subset of training instances that are normally
removed from the current training set prior to the generation of the next rule.
As introduced in [10, 11], Prism, which is a rule induction method that follows
the ‘Separate and Conquer’ approach, is likely to generate fewer and more general
rules than ID3, which is another rule induction method that follows the ‘Divide and
Conquer’ approach. The above phenomenon is due mainly to the strategy of rule
learning. As mentioned in [10], the rule set generated by the TDIDT needs to have
at least one common attribute in order to be represented in the form of a decision
tree. The same also applies to each of the subtrees of a decision tree, which requires
5
to have at least one common attribute represented as the root of the subtree. Due to
this requirement, the TDIDT is likely to generate a large number of complex rules
with many redundant terms such as the replicated subtree problem [10] illustrated
in Fig.1 and thus results in a model with high complexity.
Fig. 1. Cendrowska’s replicated subtree example [3]
3.2 Data Characteristics
As mentioned in Section 3.1, different algorithms involve different strategies of
learning and thus generate rule based models with different levels of complexity. In
this sense, when the same data set is used, different learning algorithms would usually lead to different levels of model complexity. However, for the same algorithm,
data of different size would also usually result in the generation of models with
different levels of complexity. The rest of this subsection justifies the potential correlation between data size and model complexity.
As mentioned earlier, rule learning methods involve the generation of rule based
models. The complexity of a rule based model is determined by the total number of
rule terms, which is dependent upon the number of rules and the average number of
terms per rule. However, the total number of rule terms is also affected by the data
size in terms of both dimensionality (number of attributes) and sample size (number
of instances). For example, a data set has n attributes, each of which has t values,
and its sample contains m instances and covers all possible values for each of the
attributes. In this example, the model complexity would be equal to Σ t i, for i=0, 1,
2…n, in principle, but no greater than m × n in the worst case in practice. This
indicates that a rule based model consists of a default rule, which is also referred to
as the ‘else’ rule, and t i rules, each of which has i terms, for i=0, 1, 2…n respectively. However, each rule usually covers more than one instance and the rule based
model is expected to cover all instances. Therefore, the number of rules from a rule
based model is usually less than the number of instances from a data set. As also
justified above, each rule would have up to n (the number of attributes) terms due
6
to the requirement that each attribute can only appear once comprising one of its
possible values in any of the rules. On the basis of the above description, the complexity of a rule based model is up to the product of dimensionality and sample size
of a data set. In addition, the complexity of each attribute also impacts on the complexity of the rule based model, especially for continuous attributes.
4 Techniques for Control of Model Complexity
Section 3 identified two main impact factors for model complexity – learning algorithms and data characteristics, and analyzed theoretically in what way the two factors impact on the complexity of a rule based model. This section presents several
techniques towards effective control of model complexity. In particular, these techniques follow one of the two approaches namely scaling up algorithms and scaling
down data.
4.1 Scaling Up Algorithms
As introduced in [6], scaling up algorithms for complexity reduction can be
achieved through proper employment of rule generation methods or proper use of
rule pruning algorithms.
In terms of rule generation, the learning approaches can be categorized into divide and conquer and separate and conquer as mentioned in Section 3.1. In particular, examples of the divide and conquer approach include ID3 [12] and C4.5 [9] and
examples of the separate and conquer approach include IEBRG [7] and Prism [10].
ID3 and IEBRG both involve use of information entropy for generation of rules but
with different strategies resulting in rule based models being represented in different
forms and having different levels of complexity. The illustration of these methods
are presented below using the contact-lenses data set [10] retrieved from the UCI
repository [13].
As mentioned in [14], ID3 makes attribute selection based on average entropy,
i.e. ID3 is an attribute oriented learning method and the calculation of entropy is for
a whole attribute on average. In addition, IEBRG makes selection of attribute-value
pairs based on conditional entropy, i.e. IEBRG is an attribute-value oriented learning method and the calculation of entropy is for a particular value of an attribute.
For each of the methods, the detailed illustration can be seen in [14]. As mentioned
in [14], ID3 makes attribute selection based on average entropy, i.e. ID3 is an attribute oriented learning method and the calculation of entropy is for a whole attribute on average. In addition, IEBRG makes selection of attribute-value pairs based
on conditional entropy, i.e. IEBRG is an attribute-value oriented learning method
7
and the calculation of entropy is for a particular value of an attribute. For each of
the methods, the detailed illustration can be seen in [14].
Table 1. Contact-lenses Data Set [14]
age
young
young
young
young
young
young
young
young
pre-presbyopic
pre-presbyopic
pre-presbyopic
pre-presbyopic
pre-presbyopic
pre-presbyopic
pre-presbyopic
pre-presbyopic
presbyopic
presbyopic
presbyopic
presbyopic
presbyopic
presbyopic
presbyopic
presbyopic
prescription
myope
myope
myope
myope
hypermetrope
hypermetrope
hypermetrope
hypermetrope
myope
myope
myope
myope
hypermetrope
hypermetrope
hypermetrope
hypermetrope
myope
myope
myope
myope
hypermetrope
hypermetrope
hypermetrope
hypermetrope
astigmatic
no
no
yes
yes
no
no
yes
yes
no
no
yes
yes
no
no
yes
yes
no
no
yes
yes
no
no
yes
yes
Tear production rate
reduced
normal
reduced
normal
reduced
normal
reduced
normal
reduced
normal
reduced
normal
reduced
normal
reduced
normal
reduced
normal
reduced
normal
reduced
normal
reduced
normal
class
no lenses
soft lenses
no lenses
hard lenses
no lenses
soft lenses
no lenses
hard lenses
no lenses
soft lenses
no lenses
hard lenses
no lenses
soft lenses
no lenses
hard lenses
no lenses
soft lenses
no lenses
hard lenses
no lenses
soft lenses
no lenses
hard lenses
In accordance with the illustration in [14], the complete decision tree generated
is the one as illustrated in Fig.2 and the corresponding if-then rules are represented
as follows: if tear production rate = reduced then class= no lenses; if tear production rate = normal and Astigmatic = yes then class= Hard lenses; if tear production
rate = normal and Astigmatic= no then class= Soft lenses.
Fig. 2. Complete decision tree
8
For IEBRG, the first rule generated is also the same as the one represented above:
if tear production rate = reduced then class= no lenses; this is because the conditional entropy E(tear production rate= reduced)=0 is the minimum and indicates
that there is no uncertainty any more for classifying instances covered by this rule.
The same can be seen from Table 1 as the class is always no lenses, while tear
production rate = reduced. All the subsequent rules are generated in the same way
as the first rule by appending rule terms on the left hand side of the rule by iteratively
selecting the attribute-value pair with the lowest conditional entropy for discriminating different classes. In particular, the rest of the rules generated are the following: if Astigmatic = yes then class= Hard lenses; if Astigmatic = no then class =
Soft lenses.
It can be seen that ID 3 generates a decision tree which contains 3 rules and 5
terms whereas IEBRG generates a set of if-then rules which contains 3 rules and 3
terms. The difference in the resulted complexity is due to the presence of the redundant term generated (tear production rate = normal) in the decision tree illustrated
in Fig.2. The essential reason is that the ID3 method is attribute oriented for measuring uncertainty and must have the current training subset split on a particular attribute at each iteration, whereas IEBRG is attribute-value oriented for measuring
uncertainty and only needs to separate some instances from the current training subset through selection of a particular attribute-value pair at each iteration. On the
basis of the above statement, a rule based model generated by the decision tree
learning approach must have at least one common attribute and the same also applies to each of its subtrees. However, a model generated by the if-then rules learning approach does not have such a constraint, which usually results in a lower level
of complexity than that of a model generated by the other learning approach. Therefore, it has been recommended in the research literature [2, 10, 14] that the if-then
rules learning approach should be used instead of the decision tree learning approach towards generation of simpler rules.
On the other hand, use of pruning algorithms can also manage to reduce the complexity of a rule based model as mentioned above. As introduced in [2], pruning
methods can be categorized into pre-pruning and post-pruning.
While decision tree learning methods are used for rule generation, pre-pruning
aims to stop a particular branch in a tree growing further whereas post-pruning aims
to simplify each of the branches in a tree after the whole tree has been generated. In
particular, the tree needs to be converted into a set of if-then rules before the pruning
action is taken. In addition, post-pruning can also be done through replacing a subtree with a leaf node. A popular method used for pruning of decision trees is referred
to as Reduced Error Pruning (REP) [15] which follows the strategy of post pruning.
While if-then rules learning methods are used for rule generation, pruning is
taken per single rule generated in contrast to tree pruning. In other words, each single rule is pruned prior to the generation of the next rule rather than posterior to the
completion of the generation of a whole rule set. In this context, pre-pruning aims
to stop the specialization of the left hand side of a rule. Post-pruning aims to simplify the left hand side of a rule after its generation has been completed. An example
9
of the methods for pruning of if-then rules is referred to as Jmid-pruning [8], which
follows the strategy of pre-pruning.
Section 5 will report experimentally more detailed analysis of model complexity
controlled through scaling up algorithms in terms of both rule generation and rule
pruning.
4.2 Scaling Down Data
As mentioned in Section 3, the size of data may also affect the complexity of a rule
based model. In other words, if a data set has a large number of attributes with
various values and instances, the generated model is very likely to be more complex.
The dimensionality issue can be resolved by using feature selection techniques
such as Correlation Based Feature Selection (CFS) [16]. In other words, the aim is
to remove those irrelevant attributes and thus make a model simpler. In addition,
the issue can also be resolved through feature extraction methods, such as Principal
Component Analysis (PCA) [17]. In other words, the aim is to transform the data
set to a lower dimensional space through combining existing attributes.
Besides, in some cases, it is also necessary to remove some attribute values as
they may be irrelevant. For example, in a rule based model, an attribute-value pair
may never be involved in any rules as a rule term. In this case, the value of this
attribute can be judged irrelevant and thus removed. In some cases, it is also necessary to merge some values for an attribute in order to reduce the attribute complexity
especially when the attribute is continuous with a large interval. There are some
ways to deal with continuous attributes such as ChiMerge [18] and use of fuzzy
linguistic terms [19].
As analyzed in Section 3.2, dimensionality reduction can effectively reduce the
average number of rule terms per rule. This is because each single rule can have
only up to n rule terms, where n is the data dimensionality. As also analyzed in
Section 3.2, reduction of the complexity for each input attribute can effectively reduce the number of rules. For example, three attribute a, b, c have 2, 3 and 4 values
respectively. In this case, the number of first order rules (with one rule term) is
2+3+4=9; the number of second order rules (with two rule terms) is
2×3+2×4+3×4=26; the number of third order rules (with three rule terms) is
2×3×4=24.
On the basis of the above description, feature selection, feature extraction and
reduction of attribute complexity are all generally effective towards reduction of
model complexity. More detailed experimental results are reported in Section 5 to
analyze the extent to which the model complexity can be effectively controlled
through scaling down data.
10
5 Experimental Studies
This section presents the validation of the proposed techniques mentioned in Section 4 for effective control of model complexity towards advances in model efficiency and interpretability. In particular, the validation includes five parts: rule generation, rule pruning, feature selection, feature extraction and attribute
discretization. The first two parts are in line with scaling up algorithms and the rest
of them are in line with scaling down data.
In terms of rule generation, ID3 is chosen as an example of the divide and conquer approach and IEBRG is chosen as an example of the separate and conquer
approach. This is based on the fact that both methods involve use of information
entropy as the heuristic for uncertainty measure. The rule based models generated
by using the above two methods are compared in terms of model complexity. In
particular, for the purpose of advancing model efficiency, the models generated by
ID3 and IEBRG are compared in terms of the total number of rule terms generated.
This is because the computational complexity of a rule based model in predicting
the class of an unseen instance is typically measured by using the BigO notation
and considering the worst case. As introduced in [3], if a rule based model is represented in the form of a decision tree, then the prediction is made by going through
the tree from the root node to a leaf node in a divide and conquer search. The computational complexity is O (log (n)), where n is the tree size, i.e. the number of nodes
in the tree. If a rule based model is represented in the form of a set of if-then rules,
then the prediction is made by linearly going through the whole rule set until the
firing rule is found. The computational complexity is O (n), where n is the total
number of rule terms in the rule set. In this experimental study, the models generated
by ID3 are all converted from the form of decision trees to the form of if-then rules
in order to make consistent comparisons with models generated by IEBRG. This is
because models represented in different forms cannot be compared consistently, and
it is straightforward to convert from a decision tree to a set of if-then rules but much
more difficult the other way around. On the other hand, for the purpose of advancing model interpretability, the models generated by ID3 and IEBRG are compared
in terms of the number of rules and average number of rule terms per rule in a rule
set. In this context, models generated by ID3 need to be converted from the form of
decision trees to the form of if-then rules with the same reason as mentioned above.
However, in general, models represented in the form of decision trees can be
checked in terms of height and width, i.e. the length of the longest branch of a tree
and the number of branches/leaf nodes respectively. This is because of the fact that
the two popular search strategies are depth first search and breadth first search.
In terms of rule pruning, C4.5 is chosen as an example of the divide and conquer
approach with the use of REP for tree pruning. The decision is based on the fact that
C4.5 is a popular decision tree learning method and REP has been proven effective
in reduction of overfitting of models generated by C4.5 towards improvement of
11
accuracy [15]. In addition, Prism is chosen as an example of the separate and conquer approach with the use of Jmid-pruning for pruning of if-then rules. The decision is based on the fact that Prism is a representative method for learning of if-then
rules and Jmid-pruning has been proven effective in reduction of overfitting of models generated by Prism towards improvement of accuracy [8, 14]. In the case of
decision tree pruning, the comparisons on model complexity are in terms of tree
size, height and width, whereas the comparisons in the case of pruning of if-then
rules are in terms of the total number of rule terms, number of rules and average
number of rule terms per rule.
In terms of feature selection, feature extraction and attribute discretization, CFS,
PCA and ChiMerge are used respectively to assist C4.5 for the purpose of data preprocessing. This is in order to reduce the level of difficulty for rule based modelling
towards effective control of model complexity. In particular, models generated by
C4.5 on the basis of the original data are compared in terms of tree size, height and
width, with those ones generated by the same method on the basis of the processed
version of the data by CFS for feature selection, PCA for feature extraction, and
ChiMerge for attribute discretization.
All parts of the validation mentioned above are undertaken by using data sets
retrieved from the UCI repository and the characteristics of these data sets can be
seen in [13]. The results for the rule generation part are presented in Table 2 and 3.
Table 2. Total number of rule terms
Dataset
vote
zoo
car
breast-cancer
kr-vs-kp
lung-cancer
mushroom
nursery
soybean
splice
tic-tac-toe
trains
contact-lenses
sponge
audiology
ID3
333
12000
54
167
502
52015
6595
85
12257
5815740
174
11048
5
75240
14475
IEBRG
24
5
91
7
9
3
9
846
9
10
605
2
3
4
7
It can be seen from Table 2 that IEBRG outperforms ID3 in 12 out of 15 cases
in terms of the total number of rule terms. The same phenomenon can also be seen
from Table 3. In the three cases that IEBRG performs worse than ID3, the reason is
that IEBRG cannot effectively learn consistent rules from the three data sets car,
nursery and tic-tac-toe. As reported in [20], on the above three data sets, IEBRG
generates a large number of inconsistent rules, each of which has already included
all attributes on its left hand side but still covers instances that belong to different
classes. In this case, the number of terms of an inconsistent rule is exactly the same
12
as the number of attributes of the data set, which is the maximum as analyzed in
Section 3.2, and thus leads to a higher level of model complexity.
On the basis of the above description, methods (e.g. IEBRG) that follow the separate and conquer approach typically generate a smaller number of simpler rules in
comparison with methods (e.g. ID3) that follow the divide and conquer approach,
while the former ones can effectively learn consistent rules with high quality. Therefore, rule based models generated by the former type of methods are generally more
efficient and interpretable.
Table 3. Number of rules and average number of terms per rule
Dataset
vote
zoo
car
breast-cancer
kr-vs-kp
lung-cancer
mushroom
nursery
soybean
splice
tic-tac-toe
trains
contact-lenses
sponge
audiology
ID3
Count(rules)
31
1500
16
43
228
1631
521
20
576
190680
31
565
3
3344
314
Avg(terms)
10.74
8.0
3.38
3.88
21.94
31.89
12.66
4.25
21.28
30.5
5.61
19.55
1.67
22.5
46.1
IEBRG
Count(rules)
10
5
23
7
9
3
9
121
9
10
91
2
3
4
7
Avg(terms)
2.4
1.0
3.96
1.0
1.0
1.0
1.0
6.99
1.0
1.0
6.65
1.0
1.0
1.0
1.0
Table 4. Tree size I- Tree Pruning
Dataset
anneal
breast-cancer
breast-w
car
credit-a
credit-g
diabetes
ecoli
heart-c
heart-h
heart-statlog
hepatitis
ionosphere
vote
segment
unpruned C4.5
72
179
45
186
135
466
43
51
77
47
61
31
35
37
101
pruned C4.5
60
22
27
112
43
64
15
11
21
8
25
1
9
9
59
The results for the rule pruning part are presented in Table 4 and 5 for decision
tree pruning and in Table 6 and 7 for pruning of if-then rules.
13
In terms of tree pruning, it can be seen from Table 4 that the pruned decision tree
has a smaller size than the unpruned decision tree in all cases. The same phenomenon can also be seen from Table 5. This is because of the fact that REP is a postpruning method and the aim is to replace a subtree with a leaf node without affecting
any other branches/subtrees after the whole tree has been generated. In addition, as
also analysed in [2], for a decision tree, pruning one branch does not affect any other
branches normally growing when using either pre-pruning or post-pruning. Therefore, in the context of decision tree learning, if there are any branches taken pruning
actions, the tree is definitely simpler than the one without pruning taken and thus
more efficient and interpretable.
Table 5. Tree complexity analysis I – Tree Pruning
Dataset
anneal
breast-cancer
breast-w
car
credit-a
credit-g
diabetes
ecoli
heart-c
heart-h
heart-statlog
hepatitis
ionosphere
vote
segment
unpruned C4.5
Tree height
13
7
9
6
10
11
10
9
8
7
10
10
12
9
15
Count(leafs)
53
152
23
134
101
359
22
26
46
29
31
16
18
19
51
pruned C4.5
Tree height
12
2
27
6
9
8
6
4
5
4
7
1
5
5
11
Count(leafs)
44
18
14
80
30
47
8
6
14
5
13
1
5
5
30
Table 6. Total number of rule terms by Prism
Dataset
cmc
vote
kr-vs-kp
ecoli
anneal.ORIG
audiology
car
optdigits
glass
lymph
yeast
shuttle
analcatdataasbestos
irish
breast-cancer
Prism without pruning
168
157
368
45
25
173
2
3217
74
13
62
116
8
15
12
Prism with Jmid-pruning
112
77
116
33
44
106
6
1287
79
10
30
12
7
14
11
14
In terms of pruning of if-then rules, it can be seen from Table 6 that the pruned
rule based model is simpler than the unpruned one in 12 out of 15 cases. The similar
phenomenon can also be seen from Table 7. For the three exceptional cases, the
reason could be explained by the fact that for learning of if-then rules, pruning one
rule could affect the generation of all subsequent rules. In other words, taking pruning actions can effectively make the current rule simpler, but may disadvantage
learning of the subsequent rules leading to generation of more complex rules if the
current pruning action is not appropriately taken. In this case, the model accuracy
is also decreased as reported in [8, 14].
Table 7. Number of rules and average number of rule terms by Prism
Dataset
cmc
vote
kr-vs-kp
ecoli
anneal.ORIG
audiology
car
optdigits
glass
lymph
yeast
shuttle
analcatdataasbestos
irish
breast-cancer
Prism without pruning
Count(rules)
Avg(terms)
36
4.67
25
6.28
63
5.84
24
1.88
16
1.56
48
3.60
2
1.0
431
7.46
26
2.85
1.3
10
37
1.68
30
3.87
1.6
5
1.5
10
1.09
11
Prism with Jmid-pruning
Count(rules)
Avg(terms)
25
4.48
15
5.13
21
5.52
1.94
17
3.67
12
38
2.79
3
2.0
197
6.53
3.29
24
10
1.11
20
1.5
12
1.0
5
1.4
11
1.27
11
1.0
Table 8. Tree size II - Feature Selection
Dataset
kr-vs-kp
ionosphere
sonar
mushroom
anneal
waveform
spambase
splice
sponge
cylinder-bands
audiology
lung-cancer
spectf
credit-g
breast-cancer
C4.5
Attribute#
37
35
61
23
39
41
58
62
46
40
70
57
45
21
10
Tree size
82
35
35
30
72
677
379
3707
18
432
62
12
17
466
179
C4.5 with CFS
Attribute#
8
15
20
5
10
16
16
23
4
7
17
9
13
4
6
Tree size
16
31
29
21
70
621
229
555
6
432
59
7
19
30
94
The results for the rest of the parts are presented in Table 8 and 9 for feature
selection, Table 10 and 11 for feature extraction and Table 12 and 13 for attribute
discretization.
15
In terms of feature selection, it can be seen from Table 8 that the tree generated
by using the pre-processed data is simpler than the one generated by using the original data in 13 out of 15 cases. The similar phenomenon can also be seen from Table
9. For the case on the cylinder-bands data set that the same tree is generated after
the data dimensionality is reduced, the reason is typically that C4.5 does not select
any irrelevant attributes for learning of a decision tree, while the data set is not preprocessed, and that the set of attributes removed by CFS does not contain any of the
attributes that are supposed to be selected by C4.5 for generation of the tree. In
addition, the other case on the spectf data set could normally be explained by the
possible reason that there are a few relevant attributes removed posterior to the preprocessing of the data set, which disadvantages the learning of a tree by C4.5.
Table 9. Tree complexity analysis II - Feature Selection
Dataset
kr-vs-kp
ionosphere
sonar
mushroom
anneal
waveform
spambase
splice
sponge
cylinder-bands
audiology
lung-cancer
spectf
credit-g
breast-cancer
C4.5
Tree height
14
12
8
6
13
20
31
9
4
3
14
4
7
11
7
Count(leafs)
43
18
18
25
53
339
190
3597
14
430
37
8
9
359
152
C4.5 with CFS
Tree height
7
10
7
5
12
17
19
11
2
3
10
3
8
5
6
Count(leafs)
9
16
15
17
45
311
115
440
5
430
38
5
10
21
80
Table 10. Tree size III – Feature Extraction
Dataset
vehicle
waveform
spambase
trains
hepatitis
lung-cancer
vowel
sonar
sponge
autos
car
cmc
heart-statlog
dermatology
tic-tac-toe
C4.5
Attribute#
19
41
58
33
20
57
14
61
46
26
7
10
14
35
10
Tree size
207
677
379
11
31
12
277
35
18
88
186
665
61
44
208
C4.5 with PCA
Attribute#
8
35
49
9
17
26
20
31
66
37
16
16
13
72
17
Tree size
165
369
345
3
9
7
241
35
7
61
123
197
21
17
43
16
In terms of feature extraction, it can be seen from Table 10 that the tree generated
by using the pre-processed data set is simpler than the one generated by using the
original data set in 14 out of 15 cases. The similar phenomenon can also be seen
from Table 11. For the case of the sonar data set that the same tree is generated after
the data is transformed by PCA, the reason is typically that C4.5 can very effectively
learn from the data set without the need to transform the data and thus the data
transformation by PCA does not provide any help.
Table 11. Tree complexity analysis III - Feature Extraction
Dataset
vehicle
waveform
spambase
trains
hepatitis
lung-cancer
vowel
sonar
sponge
autos
car
cmc
heart-statlog
dermatology
tic-tac-toe
C4.5
Tree height
17
20
31
3
10
4
11
8
4
8
6
15
10
7
7
Count(leafs)
104
339
190
9
16
8
178
18
14
65
134
437
31
33
139
C4.5 with PCA
Tree height
15
20
16
2
4
4
23
8
3
16
18
14
6
8
14
Count(leafs)
83
185
173
2
5
4
121
18
4
31
62
99
11
9
22
Table 12. Tree size IV - Attribute Discretization
Dataset
anneal
balance-scale
heart-c
heart-h
heart-statlog
labor
sick
tae
liver-disorders
cmc
colic
haberman
glass
weather
hypothyroid
C4.5 with original attributes
72
119
77
47
61
22
72
69
53
665
129
47
59
8
36
C4.5 II with discretised attributes
64
13
71
45
43
13
58
5
3
462
107
15
50
8
76
In terms of attribute discretization, it can be seen from Table 12 that the tree
generated by using the discretized set of attributes is simpler than the one generated
by using the original data set. The similar phenomenon can also be seen from Table
17
13. For the case on the hypothyroid data set, the reason is typically that much information gets lost after inappropriate discretization of continuous attributes. In particular, if the discretization is not appropriate, it is very likely to result in the case
that important patterns cannot be learned from important continuous attributes and
thus more attributes need to be selected for learning of a tree. A similar argumentation is also made in [21].
Table 13. Tree complexity analysis IV- Attribute Discretization
Dataset
anneal
balance-scale
heart-c
heart-h
heart-statlog
labor
sick
tae
liver-disorders
cmc
colic
haberman
glass
weather
hypothyroid
C4.5 with original attributes
Tree height
Count(leafs)
13
53
11
60
46
8
7
29
10
31
5
13
41
11
12
35
9
27
15
417
95
7
4
34
11
30
3
5
10
20
C4.5 with discretised attributes
Tree height
Count(leafs)
10
50
5
7
9
42
6
26
8
22
4
8
11
35
3
3
2
2
9
325
7
82
3
13
35
6
3
5
57
8
6 Conclusion
This paper argued the significance of complexity control for rule based models for
the purpose of knowledge discovery and predictive modelling. In particular, rule
based models need to be more efficient and interpretable. This paper also identified
two main impact factors for model complexity namely learning algorithms and data
characteristics, and also analyzed in what way the two factors impact on the model
complexity. The main contributions of this paper include theoretical analysis of the
proposed techniques for control of model complexity and empirical validation of
these techniques to show the extent to which these techniques are effective towards
generation of more efficient and interpretable models in practice. The results have
been discussed critically and comparatively and indicated that the proposed techniques can effectively manage to reduce the model complexity. On the basis of the
results obtained, the further directions identified for this research area are to investigate in depth how to employ existing methods to achieve scaling up algorithms
and scaling down data, respectively, in more effective ways.
18
References
[1] H. Liu, A. Gegov and F. Stahl, “Categorization and Construction of Rule
Based Systems,” in 15th Inter-national Conference on Engineering
Applications of Neural Networks, Sofia, Bulgaria, 2014.
[2] J. Furnkranz, “Separate-and-Conquer rule learning,” Artificial Intelligence
Review, vol. 13, pp. 3-54, 1999.
[3] H. Liu, A. Gegov and M. Cocea, “Network Based Rule Representation for
Knowledge Discovery and Predictive Modelling,” in IEEE International
Conference on Fuzzy Systems, Istanbul, 2015.
[4] C. Schaffer, “Overfitting Avoidance as Bias,” Machine Learning, vol. 10, p.
153–178, 1993.
[5] D. H. Wolpert, “On Overfitting Avoidance as Bias,” SFI TR, 1993.
[6] H. Liu, M. Cocea and A. Gegov, “Interpretability of Computational Models
for Sentiment Analysis,” in Sentiment Analysis and Ontology Engineering:
An Environment of Computational Intelligence, vol. 639, W. Pedrycz and S.
M. Chen, Eds., Switzerland, Springer, 2016, pp. 199-220.
[7] H. Liu, A. Gegov and F. Stahl, “Unified Framework for Construction of Rule
Based Classification Systems,” in Inforamtion Granularity, Big Data and
Computational Intelligence, vol. 8, W. Pedrycz and S. M. Chen, Eds.,
Springer, 2015, pp. 209-230.
[8] H. Liu, A. Gegov and F. Stahl, “J-measure Based Hybrid Pruning for
Complexity Reduction in Classification Rules,” WSEAS Transactions on
Systems, vol. 12, no. 9, pp. 433-446, 2013.
[9] R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufman, 1993.
[10] J. Cendrowska, “PRISM: an algorithm for inducing modular rules,”
International Journal of Man-Machine Studies, vol. 27, p. 349–370, 1987.
[11] X. Deng, “A covering-based algorithm for classification: PRISM,” SK, 2012.
[12] Q. Ross, “Induction of Decision Trees,” Machine Learning, vol. 1, pp. 81106, 1986.
[13] M. Lichman, “UCI Machine Learning Repository,” University of California,
Irvine, School of Information and Computer Sciences, 2013. [Online].
Available: http://archive.ics.uci.edu/ml. [Accessed 25 June 2015].
[14] H. Liu, A. Gegov and M. Cocea, Rule Based Systems for Big Data: A
Machine Learning Approach, 1 ed., vol. 13, Switzerland: Springer, 2016.
[15] T. Elomaa and M. Kaariainen , “An Analysis of Reduced Error Pruning,”
Journal of Artificial Intelligence Research, vol. 15, no. 1, pp. 163-187, 2001.
[16] M. A. Hall, “Correlation-based Feature Selection for,” Hamilton,
NewZealand, 1999.
19
[17] I. T. Jolliffe, Principal Component Analysis, New York: Springer, 2002.
[18] R. Kerber, “ChiMerge: discretization of numeric attributes,” in Proceedings
of the 10th National Conference on Artificial Intelligence, California, 1992.
[19] T. J. Ross, Fuzzy Logic with Engineering Applications, West Sussex: John
Wiley & Sons Ltd, 2004.
[20] H. Liu and A. Gegov, “Induction of Modular Classification Rules by
Information Entropy Based Rule Generation,” in Innovative Issues in
Intelligent Systems, vol. 623, V. Sgurev, R. Yager, J. Kacprzyk and V. Jotsov,
Eds., Switzerland, Springer, 2016, pp. 217-230.
[21] D. Brain, “Learning from Large Data: Bias, Variance, and Learning Curves,”
2003.