Unified Framework for Control of Machine
Learning Tasks towards Effective and Efficient
Processing of Big Data
Han Liu1, Alexander Gegov2 and Mihaela Cocea3
Abstract Big data can be generally characterised by 5 Vs – Volume, Velocity, Variety, Veracity and Variability. Many studies have focused on using machine learning as a powerful tool for big data processing. In the machine learning context, learning algorithms are typically evaluated in terms of accuracy, efficiency, interpretability and stability. These four dimensions can be strongly related to veracity, volume, variety and variability, respectively, and are impacted by both the nature of learning algorithms and the characteristics of data. This chapter analyses in depth how the quality of computational models can be impacted by data characteristics as well as by the strategies involved in learning algorithms. This chapter also introduces a unified framework for control of machine learning tasks towards appropriate employment of algorithms and efficient processing of big data. In particular, this framework is designed to achieve effective selection of data pre-processing techniques towards selection of relevant attributes, sampling of representative training and test data, and appropriate handling of missing values and noise. More importantly, this framework allows the employment of suitable machine learning algorithms on the basis of the training data provided by the data pre-processing stage, towards the building of accurate, efficient and interpretable computational models.
Keywords: Big Data, Computational Intelligence, Data Mining, Machine Learning,
Data Processing, Predictive Modelling
1 Han Liu
University of Portsmouth, School of Computing, Buckingham Building, Lion Terrace,
PO1 3HE Portsmouth, United Kingdom Email: Han.Liu@port.ac.uk
2 Alexander Gegov
University of Portsmouth, School of Computing, Buckingham Building, Lion Terrace,
PO1 3HE Portsmouth, United Kingdom Email: Alexander.Gegov@port.ac.uk
3 Mihaela Cocea
University of Portsmouth, School of Computing, Buckingham Building, Lion Terrace,
PO1 3HE Portsmouth, United Kingdom Email: Mihaela.Cocea@port.ac.uk
1 Introduction
Big data can generally be characterised by 5 Vs – Volume, Velocity, Variety, Veracity and Variability. In particular, volume generally reflects the space required to store data. Velocity reflects the speed of data transmission and processing, i.e. how effectively and efficiently real-time data is collected and processed on a cloud computing platform. Variety reflects the type of data, i.e. data can be structured or unstructured and can also be in different forms such as text, image, audio and video. Veracity reflects the degree to which data can be trusted. Variability reflects the dissimilarity between different instances in a data set. More details on big data can be found in [1, 2, 3, 4].
In many studies, machine learning has been considered a powerful tool for big data processing. As introduced in [5], the relationship between big data and machine learning is very similar to the relationship between resources and human learning.
In this context, people can learn from resources to deal with new matters. Similarly,
machines can learn from big data to resolve new problems. More details on big data
processing by machine learning can be found in [6, 7, 8, 9, 10, 11, 12].
Machine learning is regarded as one of the main approaches of computational
intelligence [13]. In general, computational intelligence encompasses a set of nature- or biology-inspired computational approaches such as artificial neural networks, fuzzy systems and evolutionary computation. In particular, artificial neural networks are biologically inspired to simulate the human brain in terms of learning through experience. Also, fuzzy systems involve the use of fuzzy logic, which enables computers to understand natural languages [14]. Moreover, evolutionary computation works on the basis of natural selection, learning theory and probabilistic methods, which helps with uncertainty handling [13]. As stated in [15], learning theories, which help in understanding how cognitive, emotional and environmental effects and experiences are processed in the context of psychology, can help make predictions on the basis of previous experience in the context of machine learning.
From this point of view, machine learning is naturally inspired by human learning and can thus be considered a nature-inspired approach. In addition, most machine learning methods involve employing heuristics of computational intelligence, such as probabilistic measures, fuzziness and fitness, towards optimal learning. In particular, C4.5, Naïve Bayes and K nearest neighbor, which are selected for the experimental studies in Section 4, employ such heuristics.
In the machine learning context, learning algorithms are typically evaluated along four dimensions, namely accuracy, efficiency, interpretability and stability, following the concepts of computational intelligence. These four dimensions can be strongly related to veracity, volume, variety and variability, respectively.
Veracity reflects the degree to which data can be trusted, as mentioned above. In practice, data needs to be transformed into information or knowledge for people to use. From this point of view, the accuracy of information or knowledge discovered from data can be highly impacted by the quality of the data, and is thus an effective way of evaluating the degree of trust.
Volume reflects the size of data. In the areas of machine learning and statistics, the data size can be estimated as the product of the data dimensionality and the sample size [16]. An increase in data dimensionality or sample size usually increases the computational costs of machine learning tasks. Therefore, evaluating the volume of particular data is highly related to estimating the memory usage for data processing by machine learning methods.
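For illustration, this estimate can be computed directly; the following sketch assumes each attribute value is stored as an 8-byte double, which is a common but by no means universal representation:

```python
def estimated_size_bytes(n_instances, n_attributes, bytes_per_value=8):
    """Rough volume estimate: data size ~ sample size x dimensionality.
    Assumes each attribute value occupies a fixed number of bytes."""
    return n_instances * n_attributes * bytes_per_value

# e.g. a data set with 327 instances and 12559 continuous attributes,
# as in Section 4, occupies about 33 MB under this assumption
size = estimated_size_bytes(327, 12559)
```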
Variety reflects the format of data, i.e. data types and representation. Typical data types include integer, real, Boolean, string, nominal and ordinal [17]. In machine learning and statistics, data types can simply be divided into two categories: discrete and continuous. On the other hand, data can be represented in different forms, e.g. text, graphs and tables. All the differences mentioned above in terms of data format can impact the interpretability of models learned from the data.
Variability reflects the dissimilarity between different instances in a data set. In
machine learning, the performance of learning algorithms can appear to be highly
unstable due to change of data samples, especially when the data instances are
highly dissimilar to each other. Therefore, the stability of a learning algorithm can
be highly impacted by data variability.
The above four aspects (accuracy, efficiency, interpretability and stability) are also greatly impacted by the selection of machine learning algorithms. For example, data usually needs to be pre-processed by particular algorithms prior to the training stage, which has a particular level of impact on data modelling. Also, inappropriate sampling of training and test data can lead to the building of a poor model and biased estimation of accuracy, respectively. Further, different learning algorithms usually lead to models of different quality being learned from the same training data. In addition, in the context of online learning, velocity, which is related to the learning speed of an algorithm, is an important impact factor for data streams to be processed effectively and efficiently. However, this chapter focuses on offline learning and analyses in depth how the nature of learning algorithms is related to the nature of static data.
This chapter is organized as follows: Section 2 introduces fundamental concepts
of machine learning and how computational intelligence contributes to the design
of learning algorithms. Section 3 presents a framework proposed in a nature inspired
way for control of machine learning tasks towards appropriate employment of learning algorithms and efficient processing of big data. Section 4 reports experimental
studies on employment of learning algorithms and efficient processing of big data
and discusses the obtained results in both quantitative and qualitative terms. Section
5 highlights the contributions of this chapter and suggests further directions towards
advancing this research area by using computational intelligence approaches.
2 Fundamentals of Machine Learning
Machine learning is a branch of artificial intelligence and involves two stages:
training and testing [18]. The first stage aims to learn something from known properties by using learning algorithms and the second stage aims to make predictions
on unknown properties by using the knowledge learned in the first stage. From this
point of view, training and testing are also referred to as learning and prediction,
respectively. In practice, a machine learning task is aimed at building a model,
which is further used to make predictions, through the use of learning algorithms.
Therefore, this task is usually referred to as predictive modelling.
Machine learning can be divided into two main types: supervised learning and unsupervised learning [19], in terms of the form of learning. Supervised learning means learning with a teacher, because all instances in a training set are labelled, which makes the learning outcomes very explicit. In other words, supervised learning is naturally inspired by student learning under the supervision of teachers. In practice, the aim of this type of learning is to build a model by learning from labelled data and then to make predictions on other, unlabelled instances with regard to the value of a predicted attribute. The predicted value of an attribute can be either discrete or continuous. Therefore, supervised learning can be involved in both classification and regression tasks, for categorical prediction and numerical prediction, respectively. In contrast, unsupervised learning means learning without a teacher. This is because all instances in a training set are unlabelled and thus the learning outcomes are not explicit. In other words, unsupervised learning is naturally inspired by student learning without supervision. In practice, the aim of this type of learning is to discover previously unknown patterns in data sets. It includes association and clustering. The former aims to identify correlations between attributes whereas the latter aims to group objects on the basis of their similarity to each other.
According to [18], machine learning algorithms can be put into several categories: decision tree learning, rule learning, instance based learning, Bayesian learning, perceptron learning and ensemble learning. All of these types of learning show characteristics of nature inspiration.
Both decision tree learning and rule learning aim to learn a set of rules on an inductive basis. However, the difference between the two types of learning is that the former generates rules in the form of a decision tree whereas the latter generates if-then rules directly from training instances [1, 20, 21]. The above difference is mainly due to the fact that the former follows the divide and conquer approach [22] and the latter follows the separate and conquer approach [23]. In particular, the divide and conquer approach is naturally similar to the top-down approach of student learning, such as dividing a textbook into several levels: parts, chapters, sections and subsections. The separate and conquer approach is naturally similar to the iterative approach of student learning, i.e. reading through an entire material in the first iteration and then focusing on the more important parts of the material for deeper understanding in subsequent iterations.
Instance based learning generally involves predicting test instances on the basis of their similarity to the training instances; a typical example is K nearest neighbor [24]. This type of learning is also referred to as lazy learning, due to the fact that it does not aim to learn in depth to discover patterns in the data, but just to make as many correct predictions as possible [16]. In other words, this type of learning is naturally similar to the exam-centred approach of student learning, which means that students mainly aim to answer the exam questions correctly, without a deep understanding of the knowledge.
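As a minimal sketch of instance based learning, the following K nearest neighbor classifier predicts a test instance by a majority vote among its k closest training instances; the two-class data set is hypothetical:

```python
import math
from collections import Counter

def knn_predict(train, test_point, k=3):
    """Predict the label of test_point from its k nearest training instances.
    train is a list of (feature_vector, label) pairs."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # sort training instances by distance and keep the k closest
    neighbours = sorted(train, key=lambda inst: dist(inst[0], test_point))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), 'a'), ((1.2, 0.9), 'a'),
         ((5.0, 5.0), 'b'), ((5.1, 4.8), 'b')]
label = knn_predict(train, (1.1, 1.0), k=3)
```

Note that no effort is spent before prediction time, which is exactly why this type of learning is called lazy learning.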
Bayesian learning essentially employs the Bayes theorem [25]. In particular, this type of learning is based on the assumption that all the input attributes are independent of each other given the class. In this context, each attribute-value pair is independently correlated to each of the possible classes, which means that a posterior probability is provided between the attribute-value pair and the class. A popular method of Bayesian learning is Naive Bayes [26]. This type of learning is naturally similar to the prior-knowledge based approach of human reasoning, which means that people make decisions, reasoning and judgments based on the knowledge they have obtained before, towards having the most confident choice.
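The reasoning above can be sketched as a minimal categorical Naive Bayes: each class is scored as its prior multiplied by the per-attribute likelihoods, under the independence assumption. The weather-style data set and the fixed add-one smoothing are illustrative simplifications:

```python
from collections import Counter

def naive_bayes_scores(train, instance):
    """Score each class as prior x product of per-attribute likelihoods,
    assuming attributes are independent given the class."""
    priors = Counter(label for _, label in train)
    scores = {}
    for c in priors:
        rows = [x for x, label in train if label == c]
        score = priors[c] / len(train)  # class prior
        for i, v in enumerate(instance):
            matches = sum(1 for x in rows if x[i] == v)
            score *= (matches + 1) / (len(rows) + 2)  # add-one smoothing
        scores[c] = score
    return scores

train = [(('sunny', 'hot'), 'no'), (('sunny', 'mild'), 'no'),
         (('rainy', 'mild'), 'yes'), (('rainy', 'cool'), 'yes')]
scores = naive_bayes_scores(train, ('rainy', 'mild'))
```

The class with the highest score is taken as the prediction, mirroring the "most confident choice" analogy above.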
Perceptron learning aims to build a neural network topology that consists of a number of layers, each containing a number of nodes, where each node represents a perceptron. Some popular algorithms include backpropagation [17] and probabilistic neural networks [18]. This type of learning is biology inspired, as stated in Section 1. Ensemble learning generally aims to combine different learning algorithms in the training stage, or different computational models in the testing stage, towards improvement of the overall accuracy of predictions. Some popular approaches of ensemble learning include bagging [27] and boosting [28]. This type of learning is naturally similar to the approach of group learning, in which students collaborate on a group assignment.
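Bagging, one of the ensemble approaches mentioned above, can be sketched as bootstrap sampling plus majority voting; the 1-nearest-neighbour base learner and the toy data set below are illustrative choices only:

```python
import random
from collections import Counter

def bagging_predict(train, test_point, base_learner, n_models=11, seed=0):
    """Train base_learner on bootstrap samples of the training data and
    combine the resulting models by majority vote."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):
        # bootstrap: sample with replacement, same size as the original data
        sample = [rng.choice(train) for _ in train]
        model = base_learner(sample)
        votes.append(model(test_point))
    return Counter(votes).most_common(1)[0][0]

def one_nn(sample):
    """Illustrative base learner: 1-nearest-neighbour on the given sample."""
    def predict(p):
        return min(sample,
                   key=lambda inst: sum((x - y) ** 2
                                        for x, y in zip(inst[0], p)))[1]
    return predict

train = [((0.0,), 'a'), ((0.1,), 'a'), ((1.0,), 'b'), ((1.1,), 'b')]
label = bagging_predict(train, (0.05,), one_nn)
```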
In terms of the evaluation of a machine learning task, there are generally two main approaches: cross-validation and the split of data into a training set and a test set. Cross-validation generally means splitting a data set into n disjoint subsets. In this context, there are n iterations in total for the evaluation, and at each iteration one subset is used for testing while the other n-1 subsets are used for training. In other words, each of the n subsets is in turn used as the test set in one of the n iterations, while the rest of the subsets are used together as the training set. In laboratory research, 10-fold cross-validation is the most popular choice, i.e. the original data set is split into 10 subsets. Cross-validation is generally more expensive in terms of computational cost. Therefore, researchers sometimes instead choose to split a data set into a training set and a test set in a specific ratio, e.g. 70% of the data is used as the training set and the rest of the data is used as the test set. This data split can be done randomly or in a fixed way. However, due to the presence of uncertainty in data, the random split of data is more popular among researchers in machine learning and similar areas.
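The n-fold procedure described above can be sketched as follows; the majority-class learner is only a placeholder for a real learning algorithm such as C4.5:

```python
import random
from collections import Counter

def cross_validation_accuracy(data, learner, n_folds=10, seed=0):
    """n-fold cross-validation: each of n disjoint subsets is used once as
    the test set while the remaining n-1 subsets form the training set."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    correct = 0
    for i in range(n_folds):
        train = [inst for j, fold in enumerate(folds) if j != i
                 for inst in fold]
        model = learner(train)
        correct += sum(1 for x, y in folds[i] if model(x) == y)
    return correct / len(data)

def majority_learner(train):
    """Placeholder learner: always predicts the most frequent training label."""
    majority = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: majority
```

For a data set whose labels are all identical the estimated accuracy is trivially 1.0; substituting a real learner gives the learnability measure used in Section 4.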
In this chapter, new perspectives on the two approaches to evaluating machine learning tasks are used in Section 4. In particular, cross-validation is used towards effectively measuring the learnability of an algorithm, i.e. the extent to which the algorithm is suitable for building a confident model on the provided training data. This is in order to help employ suitable learning algorithms for building predictive models on the basis of existing data. The other approach, splitting a data set into a training set and a test set, is adopted towards learning a model that covers highly complete patterns from the training data and evaluating the model accuracy using highly similar but different instances from the test data. This is in order to ensure that the model accuracy evaluated using the test data is trustworthy. Details on the use of these new perspectives are presented in Section 4.
3 Framework for Control of Machine Learning Tasks
This section presents a framework for control of machine learning tasks towards appropriate employment of learning algorithms and effective processing of big data. In particular, the key features of the proposed framework are described in detail. Also, the motivation for developing this framework is justified by analysing the impact of big data on machine learning, i.e. by arguing for the relevance of effective control of machine learning tasks in a big data environment.
3.1 Key Features
A unified framework for control of machine learning tasks was proposed in a nature inspired way in [1] as a further direction. The purpose is to effectively control the pre-processing of data and to naturally employ learning algorithms and the generated predictive models. As mentioned in [1], it is relevant to deal with issues on both the algorithms side and the data side for the improvement of classification performance. In fact, a database is updated daily in real applications, which can result in a gradual increase of the data size and in changes to the patterns that exist in the database. In order to avoid lowering computational efficiency, the size of a sample needs to be determined in an optimal way. In addition, it is also required to avoid any loss of accuracy. From this point of view, sampling is critical not only in terms of the size of a sample but also in terms of its representativeness.
Feature selection/extraction is another critical task in the pre-processing of data. As mentioned in [1], high dimensional data usually results in high computational costs. In addition, such data is also very likely to contain irrelevant attributes, which result in noise and coincidental patterns. In some cases, it is also necessary to effectively detect noise, whether the noise is introduced naturally or artificially. In other words, noise may be introduced into a data set due to mistakes in typing or illegal modifications by hackers. A potential way of noise handling is to use association rules to detect that the value of an attribute is incorrect on the basis of the other attribute-value pairs in the same data instance. Also, appropriate employment of learning algorithms and predictive models is highly required, due to the fact that there are many existing machine learning algorithms, but no effective ways to determine which of them are suitable for a particular data set. Traditionally, the decision is made by experts based on their knowledge and experience. However, it is fairly difficult to judge the correctness of the decision prior to empirical validation. In real applications, it is not realistic to frequently change a decision after it has been confirmed that the chosen algorithms are not suitable.
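The feature selection step discussed above can be sketched as a simple filter that drops attributes whose variance is too low to be informative; the variance threshold is an illustrative assumption, and real pipelines would typically use richer criteria:

```python
def select_features(rows, min_variance=1e-3):
    """Keep only the attributes (columns) whose variance is at least
    min_variance; near-constant attributes carry little information."""
    n = len(rows)
    keep = []
    for j in range(len(rows[0])):
        col = [row[j] for row in rows]
        mean = sum(col) / n
        variance = sum((v - mean) ** 2 for v in col) / n
        if variance >= min_variance:
            keep.append(j)
    # project every instance onto the retained attributes
    return [[row[j] for j in keep] for row in rows], keep

rows = [[1.0, 5.0], [1.0, 7.0], [1.0, 6.0]]  # first attribute is constant
reduced, kept = select_features(rows)
```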
The arguments above outline the necessity of developing a framework for control of machine learning tasks in a nature inspired way. In other words, this framework aims to adopt computational intelligence techniques to control machine learning tasks. In this framework, the actual employment of both learning algorithms and predictive models follows computational intelligence approaches. The suitability of a learning algorithm and the reliability of a model are measured by statistical analysis on the basis of historical records. In particular, each algorithm in the algorithms base, as illustrated in Fig. 1, is assigned a weight which is based on its performance in previous machine learning tasks. The weight of an algorithm is naturally similar to the impact factor of a journal, which is based on its overall citation rate. Following the employment of suitable learning algorithms, each generated model is then also assigned a weight which is based on its performance on the latest version of the validation data in a database. Following the employment of high quality models, a knowledge base is finalised and deployed for real applications, as illustrated in Fig. 1.
Fig. 1. Unified Framework for Control of Machine Learning Tasks [1]
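The weighting of algorithms in the algorithms base can be sketched as follows; the algorithm names and past-accuracy records are hypothetical, and a deployed framework would track more statistics than a simple mean:

```python
def select_algorithm(history):
    """history maps each algorithm name to its accuracies in previous
    machine learning tasks. Each algorithm is weighted by its mean past
    accuracy (analogous to a journal's impact factor) and the highest
    weighted algorithm is employed."""
    weights = {name: sum(accs) / len(accs) for name, accs in history.items()}
    return max(weights, key=weights.get), weights

history = {'C4.5': [0.91, 0.94],
           'NaiveBayes': [0.70, 0.80],
           'KNN': [0.90, 0.97]}
best, weights = select_algorithm(history)
```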
3.2 Justification
As mentioned in Section 1, machine learning algorithms are usually evaluated
against accuracy, efficiency, interpretability and stability. The presence of big data
has deeply affected machine learning tasks in the four aspects mentioned above.
In terms of accuracy, overfitting of the training data can generally be significantly reduced as the size of the data is greatly increased. There is evidence reported in [29] that learning from a large training set can significantly improve the performance of predictive modelling. The evidence is illustrated in Fig. 2, provided by Banko and Brill in 2001 [30], which shows that, for the complex problem of learning automated word disambiguation, performance keeps improving as the size of the training data grows towards billions of words. In particular, each of the four learning algorithms shows an increase of at least 10% in test accuracy as the number of words is increased from 0.3 million to 1 billion. For example, the memory-based algorithm gets its test accuracy increased from 83% to 93%, and the winnow algorithm increases its test accuracy from 75% to 97%. The improvement in learning performance is due to the fact that an increase in data size can usually improve the completeness of the pattern covered. In other words, small data may cover only a small part of a pattern in a hypothesis space. Therefore, overfitting of the training data is likely to result in the case that a learning algorithm builds a model that performs very well on the training data but poorly on the test data. This case occurs especially when the training data covers a pattern that is highly different from the one in the test data. When the size of the data is increased, the training data is likely to cover a pattern that is more similar to the one present in the test data.
Fig. 2. Improvement of word disambiguation by learning from big data [30]
On the other hand, an increase in the size of the data may also increase the chance of noise and coincidental patterns being present in the data. This is due to the fact that a biased improvement in the quantity of data may result in a loss of quality. Also, large training data is likely to cover some patterns which occur with fairly low frequency. This could mean that such patterns covered by the training data are purely coincidental rather than scientifically confident.
The above issues regarding accuracy can be solved through scaling up algorithms
or scaling down data. As specified in [1], the former way is to reduce the bias on
the algorithms side. In particular, the algorithms can be designed to be more robust
against noise and thus avoid being confused by coincidental patterns. In the context
of rule learning, the reduction of bias can be achieved through direct advancement
of rule generation methods or employment of rule simplification algorithms; similar
ways also apply to other types of learning algorithms. The latter way is to reduce
the variance on the data side. In particular, data can be pre-processed through removal of irrelevant attributes by feature selection techniques or through the merger
of redundant attributes by feature extraction techniques. In addition, data can also
be resampled by selecting only those instances that are more representative.
In terms of efficiency, the increase in the size of data is likely to increase the
computational costs in both training and testing stages. In the training stage, it may
slow down the process of building a predictive model by learning from big data. In
the testing stage, the predictive model is likely to have a high level of complexity,
which significantly increases the computational complexity in predicting on unseen
instances. For rule learning algorithms in particular, the presence of big data may result in the generation of a large number of complex rules.
As stressed in [29, 7, 31], the processing of big data needs decomposition, parallelism, modularity and recurrence. In this case, machine learning algorithms which are inflexible and work in a black box manner would fail in dealing with big data. This would immediately happen to algorithms that are quadratically complex (O(n²)) when encountering data with millions of points (instances).
The above issues regarding efficiency can also be resolved through scaling up
algorithms or scaling down data. In the former way, the algorithms can be designed
to have a low level of computational complexity in the training stage and thus be
less affected by the increase in the size of training data. In the context of rule learning, the improvement of efficiency can be achieved through the employment of
pruning algorithms, as some such algorithms can stop the process of rule learning
earlier. In the latter way, the size of data can be reduced through dimensionality
reduction and data sampling. This not only reduces the computational costs in the
training stage, but also results in the generation of simpler models and thus speeds
up the process of predicting on unseen instances in the testing stage.
In terms of interpretability, the increase in the size of data usually decreases the
interpretability. As analyzed in [1, 16], interpretability can be affected by the size
of training data in terms of model complexity. In the context of rule learning, big
data may result in the generation of a large number of complex rules, which would
make it difficult for people to read and understand.
The above issues regarding interpretability can also be solved through scaling up
algorithms or scaling down data. In the former way, the algorithms can be designed
to be robust against noise and irrelevant or redundant attributes. In other words, the
presence of noise and irrelevant/redundant attributes would not result in irrelevant
patterns being learned by the algorithms. In the context of rule learning, algorithms
for rule generation may decide to skip some attributes or attribute-value pairs for
generation of decision trees or if-then rules due to the irrelevance of these attributes
or attribute-value pairs. In addition, the employment of rule simplification methods
also helps improve the interpretability since such employment usually results in the
generation of a smaller number of simpler rules. In the latter way, the data size is
reduced through dimensionality reduction and data sampling as mentioned above.
In particular, as discussed in [1, 16], the reduction of data dimensionality decreases
the maximum length (the maximum number of rule terms) of each single rule. The
data sampling also reduces the maximum number of rules. In this approach, the
interpretability can be improved if the dimensionality reduction and data sampling
are effectively undertaken.
In terms of stability, the increase in the size of data usually leads to the increase
in dissimilarity between different instances, and thus results in high variance in
terms of the performance of learning algorithms when an experiment is repeated
independently on the same data. In other words, big data could generally have high
variability, which results in low stability of performance being shown from repeated
experiments on the same data, especially when unstable algorithms are used. Some
unstable algorithms include neural network learning and rule learning [17, 32].
The above stability issues can also be resolved through scaling up algorithms or scaling down data. As introduced in [33, 34], prediction accuracy in machine learning tasks can be affected by bias and variance. Bias generally means errors originating from the use of statistical heuristics and can be reduced through scaling up algorithms. Variance generally means errors originating from the random processing of data and can be reduced by scaling down data. From this point of view, heuristic based algorithms usually show high bias and low variance on fixed training and test data. In contrast, random algorithms usually show low bias and high variance on fixed training and test data. However, if both the training data and the test data are randomly sampled from the original data, heuristic based algorithms usually appear to be more sensitive to the change of sample and thus perform less stably. In the same situation, random algorithms, in contrast, usually appear to be less sensitive and perform more stably when an experiment is repeated independently on the basis of random sampling of training and test data.
On the basis of the above descriptions, it is highly relevant to develop the unified
framework illustrated in Fig.1 towards effective control of machine learning tasks
in a big data environment. In particular, learning algorithms need to be employed
appropriately by measuring their accuracy, efficiency, interpretability and stability
on the basis of particularly selected data. It is also important to have effective data
pre-processing so that employment of algorithms can be done on the basis of high
quality data provided following the data pre-processing.
4 Experimental Studies
This section presents two case studies on big data. The first case study addresses
the veracity aspect, and is designed to confirm that cross-validation can be used to
measure the learnability of algorithms on particular training data towards effective
employment of learning algorithms for predictive modelling. The second case study
addresses variability, and is designed to show how to measure the data variability
through checking the variance of the performance of a particular algorithm, while
independently repeated experiments are undertaken on the same data. The other two aspects of big data, volume and variety, have been studied in [1, 16] through theoretical analysis and empirical investigation in terms of the efficiency and interpretability of computational models.
4.1 Measure of Learnability
This case study is done using 10 data sets retrieved from the biomedical repository
[35]. The characteristics of these data sets are described in Table 1.
Table 1. Medical data sets

Name          Attribute Types   #Attributes   #Instances   #Classes
ALL-AML       continuous        7130          72           2
DLBCL-NIH     continuous        7400          160          2
lungCancer    continuous        12534         32           2
MLL_Leukemia  continuous        12583         72           3
BCR-ABL       continuous        12559         327          2
E2A-PBX1      continuous        12559         327          2
Hyperdip50    continuous        12559         327          2
MLL           continuous        12559         327          2
T-ALL         continuous        12559         327          2
TEL-AML1      continuous        12559         327          2
In particular, all the selected data sets are highly dimensional and have additional test sets supplied. This selection is in order to support the experimental setup, which employs cross-validation [36] to measure the learnability of particular algorithms on the training data and then employs the suitable algorithms to build models that are evaluated using the test instances. In other words, for each of the selected data sets, the whole training set is provided in order to measure the extent to which a particular algorithm is suitable for building a model on the training set, and the test set is used to evaluate the performance of the model learned by the algorithm. In this setup, the results show the extent to which the learnability of an algorithm, measured by using cross-validation on the training data, can provide a good basis for judging whether the algorithm can build a confident model that performs well on the additional test data. In this case study, C4.5, Naïve Bayes and K nearest neighbor are chosen as the learning algorithms for testing, due to their popularity in real applications. In addition, these three algorithms can also be seen as nature inspired approaches, as justified in Section 2. The results of this experimental study can be seen in Table 2.
Table 2 shows that in almost all cases the learnability of an algorithm measured by cross-validation is effective for judging the suitability of the algorithm for a particular training set, which leads to the expected performance on the corresponding test set. In other words, the results show that if an algorithm is judged to be suitable for a particular training set through measuring its learnability, then the model learned by the algorithm from the training set usually performs well on the additionally supplied test set.
Table 2. Learnability on training data and prediction accuracy on test data

Dataset        C4.5 I  C4.5 II  NB I  NB II  KNN I  KNN II
ALL-AML        93%     100%     70%   71%    88%    97%
DLBCL-NIH      44%     58%      55%   63%    56%    63%
lungCancer     94%     89%      25%   90%    88%    97%
MLL_Leukemia   79%     100%     22%   53%    89%    100%
BCR-ABL        91%     95%      96%   95%    97%    96%
E2A-PBX1       96%     87%      92%   95%    98%    88%
Hyperdip50     91%     88%      81%   80%    94%    98%
MLL            94%     97%      94%   95%    97%    100%
T-ALL          91%     100%     87%   87%    55%    99%
TEL-AML1       95%     95%      76%   76%    98%    98%
Note: C4.5 I means testing the learnability of the algorithm by cross-validation on
the basis of the training data, and C4.5 II means testing the performance of the
predictive model using the additionally supplied test data. The same applies to NB
(Naïve Bayes) and KNN (K nearest neighbor).
On the other hand, when an algorithm is judged to be unsuitable for a particular
training set through cross-validation, the results generally indicate that the model
learned by the algorithm from the training set achieves a low level of classification
accuracy on the additionally supplied test set. In particular, it can be seen on the
DLBCL-NIH data that all three algorithms are judged to be less suitable for the
training set, and the models learned by these algorithms from the training set fail
to perform well on the corresponding test set. A similar case can be seen on the
MLL_Leukemia data, where Naïve Bayes is judged to be unsuitable for the training set
and the model learned by the algorithm fails to perform well on the corresponding
test set.
In addition, there are two exceptional cases, on the lungCancer and T-ALL data.
In the first case, Naïve Bayes is judged to be very unsuitable for the training set,
but the performance on the test set of the model learned by the algorithm from the
training set is very good. In the second case, K nearest neighbor is judged to be
less suitable for the training set, but the actual performance on the test set of
the model learned by the algorithm from the training set is extremely good. In both
cases, this could be because the training set essentially covers the complete
information, whereas the splitting of the training set for the purpose of
cross-validation could result in incompleteness, to which both Naïve Bayes and
K nearest neighbor are quite sensitive. However, when the algorithm learns from the
whole training set, the resulting model covers the complete information from the
training set and thus performs well on the test set.
4.2 Measure of Data Variability
This case study is conducted using 20 data sets retrieved from the UCI [37] and the
biomedical repositories. The characteristics of these chosen data sets are described
in Table 3.
Table 3. Data sets from UCI and Biomedical repositories

Name            Attribute Types       #Attributes  #Instances  #Classes
anneal          discrete, continuous  38           798         6
balance-scale   discrete              4            625         3
car             discrete              6            1728        4
credit-a        discrete, continuous  15           690         2
credit-g        discrete, continuous  20           1000        2
diabetes        discrete, continuous  20           768         2
heart-statlog   continuous            13           270         2
hepatitis       discrete, continuous  20           155         2
ionosphere      continuous            34           351         2
iris            continuous            4            150         3
lymph           discrete, continuous  19           148         4
wine            continuous            13           178         3
zoo             discrete, continuous  18           101         7
sonar           continuous            61           208         2
segment         continuous            19           2310        7
colonTumor      continuous            2001         62          2
DLBCLOutcome    continuous            7130         58          2
DLBCLTumor      continuous            7130         77          2
DLBCL-Stanford  continuous            4027         47          2
lung-Michigan   continuous            7130         96          2
The data sets selected from the UCI repository are all considered small data, as they
are of low dimensionality and sample size, except for the segment data set, which is
considered big data due to its larger sample size. On the other hand, the last five
data sets, selected from the biomedical repository, are all considered big data
because they are of high dimensionality. This selection puts the case study in the
context of data science by means of processing data of different scalability. In
addition, none of these chosen data sets are supplied with additional test sets. The
selection was also made so that both discrete and continuous attributes are present,
in order to investigate how the different types of attributes could impact the data
variability.
On the basis of the chosen data, the experiment on each data set is undertaken by
independently repeating the training-testing process 100 times, on the basis of
random sampling of training and test data in the ratio of 70:30, and checking the
variance of the performance over the 100 repetitions. This experimental setup
measures the extent to which the data is variable, leading to variance in terms of
performance in machine learning tasks. In this context, C4.5, Naïve Bayes and
K nearest neighbor are chosen as learning algorithms for testing the variance, due
to the fact that these algorithms are not stable, i.e. they are sensitive to changes
in the data sample. The results are presented in Table 4.
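The repeated random-sampling setup can be sketched as follows. The sketch uses synthetic overlapping two-class data and a simple hand-rolled nearest neighbor classifier in place of the actual algorithms; all names and data are illustrative assumptions, not the code behind Table 4:

```python
import random
import statistics

def split_70_30(data, rng):
    # Random sampling of training and test data in the ratio 70:30.
    d = data[:]
    rng.shuffle(d)
    cut = int(len(d) * 0.7)
    return d[:cut], d[cut:]

def nn_accuracy(train, test):
    # 1-nearest-neighbour accuracy (squared Euclidean distance).
    def predict(x):
        return min(train, key=lambda xy: sum((a - b) ** 2 for a, b in zip(xy[0], x)))[1]
    return sum(1 for x, y in test if predict(x) == y) / len(test)

# Toy two-class data with overlapping clusters, so accuracy varies per split.
rng = random.Random(42)
data = [((rng.gauss(0, 2), rng.gauss(0, 2)), 0) for _ in range(60)] + \
       [((rng.gauss(3, 2), rng.gauss(3, 2)), 1) for _ in range(60)]

accuracies = []
for _ in range(100):  # 100 independent 70:30 train/test splits
    train, test = split_70_30(data, rng)
    accuracies.append(nn_accuracy(train, test))

# Data variability measured as standard deviation of classification accuracy.
variability = statistics.stdev(accuracies)
print(round(statistics.mean(accuracies), 3), round(variability, 3))
```

A higher standard deviation over the repetitions indicates that purely random sampling exposes the learner to more variable training and test compositions, which is the quantity reported in Table 4.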
Table 4. Data variability measured by standard deviation of classification accuracy

Dataset         C4.5   NB     KNN
anneal          0.007  0.017  0.023
balance-scale   0.028  0.022  0.020
car             0.011  0.019  0.028
credit-a        0.026  0.021  0.030
credit-g        0.027  0.023  0.022
diabetes        0.027  0.028  0.027
heart-statlog   0.044  0.039  0.045
hepatitis       0.046  0.042  0.073
ionosphere      0.031  0.043  0.035
iris            0.030  0.033  0.027
lymph           0.057  0.057  0.055
wine            0.048  0.027  0.054
zoo             0.045  0.068  0.063
sonar           0.057  0.059  0.052
segment         0.010  0.015  0.010
colonTumor      0.094  0.105  0.089
DLBCLOutcome    0.122  0.104  0.109
DLBCLTumor      0.074  0.067  0.072
DLBCL-Stanford  0.133  0.060  0.096
lung-Michigan   0.040  0.041  0.028
It can be seen from Table 4 that, on each data set, the standard deviation of the
classification accuracy over the 100 independently repeated experiments is at a very
similar level regardless of which algorithm is used, except for the DLBCL-Stanford
data set, on which Naïve Bayes displays a much lower level of standard deviation.
On the other hand, when looking across different data sets, the standard deviation
appears to be very different, no matter which of the three algorithms is adopted. In
particular, for the 15 UCI data sets, the standard deviation is lower than 5% in most
cases, or a bit higher than 5% in several cases (e.g. on the lymph and sonar data
sets). In contrast, for the last five data sets, selected from the biomedical
repository, the standard deviation is usually higher than 5% and is even close to or
higher than 10% in some cases (e.g. on the colonTumor and DLBCLOutcome data sets).
An exceptional case occurs on the lung-Michigan data set, which has a standard
deviation lower than 5% no matter which of the three algorithms is used.
In addition, it can also be seen from Table 4 that the data sets that contain only
continuous attributes appear to have a higher standard deviation than the data sets
that contain discrete attributes. Some data sets that contain both discrete and
continuous attributes also appear to have a higher standard deviation than the data
sets that contain only discrete attributes. In fact, the presence of continuous
attributes generally increases the attribute complexity and thus makes the data more
complex, which leads to a potential increase of the data variability.
The results shown in Table 4 generally indicate that attribute complexity, data
dimensionality and sample size impact the size of data, and that data of a larger
size is likely to be of higher variability, leading to higher variance in terms of
performance in machine learning tasks, especially when the training and test data
are sampled on a purely random basis.
5 Conclusion
This chapter has proposed a unified framework, in a nature inspired way, for control
of machine learning tasks in Section 3.1, and the necessity of the proposal has been
justified in Section 3.2 through analysing the impact of big data on machine
learning. Two case studies have been conducted experimentally, following
computational intelligence methodologies, in Section 4. The results from the case
studies also indicate the necessity of the proposed unified framework through the
use of computational intelligence concepts.
The results from the first case study indicate that cross-validation is an effective
way to measure the extent to which an algorithm is suitable for building a predictive
model on the basis of the existing data. In fact, a test set is not actually
available in reality; instead, a set of unseen instances is given for the model to
predict the values of unknown attributes of each particular instance. From this
point of view, the framework proposed in Section 3 is highly relevant for achieving
appropriate employment of learning algorithms on the basis of the existing data.
However, it is difficult to guarantee in reality that the existing data covers the
full population. Therefore, the framework proposed in Section 3 can be modified
further to work towards natural selection of learning algorithms. In other words,
the learnability of an algorithm measured through cross-validation can be used as
its chance of being employed for predictive modelling, towards predicting unseen
instances through natural selection of the predefined classes. Similar ideas have
been applied to voting based classification in [38].
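The idea of treating learnability as a chance of employment can be sketched as fitness-proportionate (roulette-wheel) selection. The learnability scores below are illustrative values under this assumption, not measured results:

```python
import random

# Hypothetical learnability scores measured by cross-validation (illustrative values).
learnability = {"C4.5": 0.93, "NaiveBayes": 0.70, "KNN": 0.88}

def select_algorithm(scores, rng):
    # Roulette-wheel selection: an algorithm's chance of being employed is
    # proportional to its measured learnability.
    names = list(scores)
    total = sum(scores.values())
    r = rng.uniform(0, total)
    cumulative = 0.0
    for name in names:
        cumulative += scores[name]
        if r <= cumulative:
            return name
    return names[-1]

rng = random.Random(0)
picks = [select_algorithm(learnability, rng) for _ in range(10000)]
# Empirical employment frequencies should track the learnability ratios.
freq = {name: picks.count(name) / len(picks) for name in learnability}
print(freq)
```

Under this scheme no single algorithm monopolises model building; a more learnable algorithm is simply employed more often, in the spirit of natural selection.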
On the other hand, the results from the second case study indicate that data can be
of high variability, which could lead to high variance in terms of performance in
machine learning tasks when training and test data are sampled on a purely random
basis. In fact, as described in Section 3.2, when training and test data are sampled
randomly, the algorithms based on statistical heuristics generally display higher
variance than the algorithms with high randomness. However, these heuristics-based
algorithms, such as C4.5, Naïve Bayes and K nearest neighbor, are highly popular in
practical applications. This indicates the necessity of effective and efficient
pre-processing of data prior to the training stage, in order to avoid high variance
due to random sampling. In particular, effective sampling of training and test data
can be achieved through data clustering, in order to ensure that the training
instances are highly similar to the test instances. This is naturally inspired by
the principle of student examination: the exam questions should all cover what the
students actually learned from the learning materials, rather than anything outside
of these sources. In other words, representative sampling of training and test data
would make the model learned from the training data cover more complete patterns,
and the model accuracy evaluated using the test data more trustworthy. On the basis
of the above, clustering-based sampling of training and test data is strongly
recommended as a further direction.
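As a sketch of this recommended direction, clustering-based sampling could, under the assumption of a simple k-means clustering of two-dimensional data, be implemented along the following lines; the function names and data are illustrative only:

```python
import random

def kmeans(points, k, rng, iters=20):
    # Plain k-means on 2-D points; returns a cluster index per point.
    centres = rng.sample(points, k)
    for _ in range(iters):
        labels = [min(range(k),
                      key=lambda c: (p[0] - centres[c][0]) ** 2 + (p[1] - centres[c][1]) ** 2)
                  for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centres[c] = (sum(m[0] for m in members) / len(members),
                              sum(m[1] for m in members) / len(members))
    return labels

def cluster_based_split(points, k=2, ratio=0.7, seed=0):
    # Split 70:30 inside each cluster, so that training and test data
    # cover the same regions of the input space.
    rng = random.Random(seed)
    labels = kmeans(points, k, rng)
    train, test = [], []
    for c in range(k):
        members = [p for p, l in zip(points, labels) if l == c]
        rng.shuffle(members)
        cut = int(len(members) * ratio)
        train += members[:cut]
        test += members[cut:]
    return train, test

# Two well-separated groups of points; each should contribute to both samples.
rng = random.Random(1)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)] + \
         [(rng.gauss(8, 1), rng.gauss(8, 1)) for _ in range(50)]
train, test = cluster_based_split(points)
print(len(train), len(test))
```

Sampling within clusters rather than over the whole data set ensures that every region represented in the test data is also represented in the training data, in line with the examination analogy above.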
References
[1] H. Liu, A. Gegov and M. Cocea, Rule Based Systems for Big Data: A
Machine Learning Approach, 1 ed., vol. 13, Switzerland: Springer, 2016.
[2] “What is Big Data,” SAS Institute Inc, [Online]. Available:
http://www.sas.com/big-data/. [Accessed 17 May 2015].
[3] “Master Data Management for Big Data,” IBM, [Online]. Available:
http://www-01.ibm.com/software/data/infosphere/mdm-big-data/. [Accessed
17 May 2015].
[4] W. Pedrycz and S. M. Chen, Eds., Information Granularity, Big Data, and
Computational Intelligence, vol. 8, Switzerland: Springer, 2015.
[5] P. Levine, “Machine Learning + Big Data,” WorldPress, [Online]. Available:
http://a16z.com/2015/01/22/machine-learning-big-data/. [Accessed 15 May
2015].
[6] T. Condie, P. Mineiro, N. Polyzotis and M. Weimer, “Machine learning for
big data,” in ACM SIGMOD/PODS Conference, San Francisco, USA, 2013.
[7] L. Wang and C. A. Alexander, “Machine Learning in Big Data,” International
Journal of Mathematical, Engineering and Management Sciences, vol. 1, no.
2, p. 52–61, 2016.
[8] X. Wu, X. Zhu, G. Q. Wu and W. Ding, “Data Mining with Big Data,” IEEE
Transactions on Knowledge and Data Engineering, vol. 26, no. 1, pp. 97-107,
2014.
[9] S. Suthaharan, “Big data classification: problems and challenges in network
intrusion prediction with machine learning,” ACM SIGMETRICS
Performance Evaluation Review, vol. 41, no. 4, pp. 70-73, 2014 .
[10] O. Y. Al-Jarrah, P. D. Yoo, S. Muhaidat and G. K. Karagiannidis, “Efficient
Machine Learning for Big Data: A Review,” Big Data Research, vol. 2, no.
3, pp. 87-93, 2015.
[11] D. E. O'Leary, “Artificial Intelligence and Big Data,” IEEE Intelligent
Systems, vol. 28, no. 2, pp. 96-99, 2013.
[12] C. Ma, H. H. Zhang and X. Wang, “Machine learning for Big Data Analytics
in Plants,” Trends in Plant Science, vol. 19, no. 12, p. 798–808, 2014.
[13] H. Adeli and N. Siddique, Computational Intelligence: Synergies of Fuzzy
Logic, Neural Networks and Evolutionary Computing, New Jersey: John
Wiley & Sons, 2013.
[14] L. Rutkowski, Computational Intelligence: Methods and Techniques,
Heidelberg: Springer, 2008.
[15] J. Worrell, “Computational Learning Theory: 2014-2015,” University of
Oxford, 2014. [Online]. Available:
https://www.cs.ox.ac.uk/teaching/courses/2014-2015/clt/. [Accessed 20 September
2016].
[16] H. Liu, M. Cocea and A. Gegov, “Interpretability of Computational Models
for Sentiment Analysis,” in Sentiment Analysis and Ontology Engineering:
An Environment of Computational Intelligence, vol. 639, W. Pedrycz and S.
M. Chen, Eds., Switzerland, Springer, 2016, pp. 199-220.
[17] P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, New
Jersey: Pearson Education, 2006.
[18] T. Mitchell, Machine Learning, New York: McGraw Hill, 1997.
[19] D. Barber, Bayesian Reasoning and Machine Learning, Cambridge:
Cambridge University Press, 2012.
[20] H. Liu, A. Gegov and F. Stahl, “Categorization and Construction of Rule
Based Systems,” in 15th International Conference on Engineering
Applications of Neural Networks, Sofia, Bulgaria, 2014.
[21] H. Liu, A. Gegov and M. Cocea, “Network Based Rule Representation for
Knowledge Discovery and Predictive Modelling,” in IEEE International
Conference on Fuzzy Systems, Istanbul, 2015.
[22] R. Quinlan, “Induction of Decision Trees,” Machine Learning, vol. 1, pp.
81-106, 1986.
[23] J. Furnkranz, “Separate-and-Conquer rule learning,” Artificial Intelligence
Review, vol. 13, pp. 3-54, 1999.
[24] J. Zhang, “Selecting typical instances in instance-based learning,” in The 9th
International Conference on Machine Learning, Aberdeen, Scotland, 1992.
[25] H. e. Michiel, “Bayes formula,” in Encyclopedia of Mathematics, Springer,
2001.
[26] I. Rish, “An Empirical Study of the Naïve Bayes Classifier,” IJCAI 2001
workshop on empirical methods in artificial intelligence, vol. 3, no. 22, pp.
41-46, 2001.
[27] L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, no. 2, p. 123–
140, 1996.
[28] Y. Freund and R. E. Schapire, “Experiments with a New Boosting
Algorithm,” in Machine Learning: Proceedings of the Thirteenth
International Conference (ICML '96), 1996.
[29] “Machine Learning on Big Data,” EBTIC, 19 August 2014. [Online].
Available:
http://www.ebtic.org/pages/ebtic-view/ebtic-viewdetails/machine-learning-on-big-data-d/687. [Accessed 15 May 2015].
[30] M. Banko and E. Brill, “Scaling to very very large corpora for natural
language disambiguation,” in Proceedings of the 39th Annual Meeting on
Association for Computational Linguistics, 2001.
[31] K. M. Tarwani, S. Saudagar and H. D. Misalkar, “Machine Learning in Big
Data Analytics: An Overview,” International Journal of Advanced Research
in Computer Science and Software Engineering, vol. 5, no. 4, pp. 270-274,
2015.
[32] I. Kononenko and M. Kukar, Machine Learning and Data Mining:
Introduction to Principles and Algorithms, Chichester, West Sussex:
Horwood Publishing Limited, 2007.
[33] H. Liu, A. Gegov and M. Cocea, “Collaborative Rule Generation: An
Ensemble Learning Approach,” Journal of Intelligent and Fuzzy Systems, vol.
30, no. 4, pp. 2277-2287, 2016.
[34] H. Liu, A. Gegov and M. Cocea, “Hybrid Ensemble Learning Approach for
Generation of Classification Rules,” in International Conference on Machine
Learning and Cybernetics, Guangzhou, 2015.
[35] J. Li and H. Liu, “Kent Ridge Bio-medical Dataset,” I2R Data Mining
Department, 2003. [Online]. Available:
http://datam.i2r.astar.edu.sg/datasets/krbd/. [Accessed 18 May 2015].
[36] S. Geisser, Predictive Inference, New York: Chapman and Hall, 1993.
[37] M. Lichman, “UCI Machine Learning Repository,” University of California,
School of Information and Computer Science, 2013. [Online]. Available:
http://archive.ics.uci.edu/ml. [Accessed 12 May 2015].
[38] H. Liu, A. Gegov and M. Cocea, “Nature and Biology Inspired Approach of
Classification towards Reduction of Bias in Machine Learning,” in
International Conference on Machine Learning and Cybernetics, Jeju Island,
South Korea, 2016.