Unified Framework for Control of Machine Learning Tasks towards Effective and Efficient Processing of Big Data

Han Liu1, Alexander Gegov2 and Mihaela Cocea3

Abstract  Big data can be generally characterised by 5 Vs – Volume, Velocity, Variety, Veracity and Variability. Many studies have focused on using machine learning as a powerful tool for big data processing. In the machine learning context, learning algorithms are typically evaluated in terms of accuracy, efficiency, interpretability and stability. These four dimensions are strongly related to veracity, volume, variety and variability, and are impacted by both the nature of learning algorithms and the characteristics of data. This chapter analyses in depth how the quality of computational models can be impacted by data characteristics as well as by the strategies involved in learning algorithms. This chapter also introduces a unified framework for control of machine learning tasks towards appropriate employment of algorithms and efficient processing of big data. In particular, this framework is designed to achieve effective selection of data pre-processing techniques towards selection of relevant attributes, sampling of representative training and test data, and appropriate handling of missing values and noise. More importantly, this framework allows the employment of suitable machine learning algorithms on the basis of the training data provided by the data pre-processing stage towards building accurate, efficient and interpretable computational models.
Keywords: Big Data, Computational Intelligence, Data Mining, Machine Learning, Data Processing, Predictive Modelling

1 Han Liu, University of Portsmouth, School of Computing, Buckingham Building, Lion Terrace, PO1 3HE Portsmouth, United Kingdom. Email: Han.Liu@port.ac.uk
2 Alexander Gegov, University of Portsmouth, School of Computing, Buckingham Building, Lion Terrace, PO1 3HE Portsmouth, United Kingdom. Email: Alexander.Gegov@port.ac.uk
3 Mihaela Cocea, University of Portsmouth, School of Computing, Buckingham Building, Lion Terrace, PO1 3HE Portsmouth, United Kingdom. Email: Mihaela.Cocea@port.ac.uk

1 Introduction

Big data can generally be characterized by 5 Vs – Volume, Velocity, Variety, Veracity and Variability. In particular, volume generally reflects the space required to store data. Velocity reflects the speed of data transmission and processing, i.e. how effectively and efficiently real-time data is collected and processed on a cloud computing platform. Variety reflects the type of data, i.e. data can be structured or unstructured and can also be in different forms such as text, image, audio and video. Veracity reflects the degree to which data can be trusted. Variability reflects the dissimilarity between different instances in a data set. More details on big data can be found in [1, 2, 3, 4]. In many studies, machine learning has been considered a powerful tool for big data processing. As introduced in [5], the relationship between big data and machine learning is very similar to that between resources and human learning: people learn from resources to deal with new matters, and machines, similarly, can learn from big data to resolve new problems. More details on big data processing by machine learning can be found in [6, 7, 8, 9, 10, 11, 12]. Machine learning is regarded as one of the main approaches of computational intelligence [13].
In general, computational intelligence encompasses a set of nature- or biology-inspired computational approaches such as artificial neural networks, fuzzy systems and evolutionary computation. In particular, artificial neural networks are biologically inspired to simulate the human brain in terms of learning through experience. Also, fuzzy systems involve using fuzzy logic, which enables computers to understand natural languages [14]. Moreover, evolutionary computation works based on the process of natural selection, learning theory and probabilistic methods, which helps with uncertainty handling [13]. As stated in [15], learning theories, which help understand how cognitive, emotional and environmental effects and experiences are processed in the context of psychology, can help make predictions on the basis of previous experience in the context of machine learning. From this point of view, machine learning is naturally inspired by human learning and can thus be considered a nature-inspired approach. In addition, most machine learning methods employ heuristics of computational intelligence, such as probabilistic measures, fuzziness and fitness, towards optimal learning. In particular, C4.5, Naïve Bayes and K nearest neighbors, which are selected for the experimental studies in Section 4, employ such heuristics. In the machine learning context, learning algorithms are typically evaluated in four dimensions, namely accuracy, efficiency, interpretability and stability, following the concepts of computational intelligence. These four dimensions are strongly related to veracity, volume, variety and variability, respectively. Veracity reflects the degree to which data can be trusted, as mentioned above. In practice, data needs to be transformed into information or knowledge for people to use.
From this point of view, the accuracy of information or knowledge discovered from data can be highly impacted by the quality of the data and is thus an effective way of evaluating the degree of trust. Volume reflects the size of data. In the areas of machine learning and statistics, the data size can be estimated through the product of data dimensionality and sample size [16]. An increase of data dimensionality or sample size usually increases the computational costs of machine learning tasks. Therefore, evaluation of the volume of particular data is closely related to estimation of memory usage for data processing by machine learning methods. Variety reflects the format of data, i.e. data types and representation. Typical data types include integer, real, Boolean, string, nominal and ordinal [17]. In machine learning and statistics, data types can be simply divided into two categories: discrete and continuous. On the other hand, data can be represented in different forms, e.g. text, graphs and tables. All the differences mentioned above in terms of data format can impact the interpretability of models learned from data. Variability reflects the dissimilarity between different instances in a data set. In machine learning, the performance of learning algorithms can appear highly unstable due to changes of data samples, especially when the data instances are highly dissimilar to each other. Therefore, the stability of a learning algorithm can be highly impacted by data variability. The above four aspects (accuracy, efficiency, interpretability and stability) are also greatly impacted by the selection of different machine learning algorithms. For example, data usually needs to be pre-processed by particular algorithms prior to the training stage, which has a corresponding impact on data modelling. Also, inappropriate sampling of training and test data can lead to building a poor model and to biased estimation of accuracy, respectively.
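The sampling concern just mentioned can be sketched in a few lines of code. The data set, split ratio and seed below are invented for illustration and are not taken from the chapter; the point is only that a random, seeded split keeps the test set representative and the accuracy estimate reproducible.

```python
import random

def split_train_test(instances, test_ratio=0.3, seed=42):
    """Randomly partition instances into a training set and a test set.

    A biased (non-random) split can make the test set unrepresentative,
    which in turn biases the estimate of model accuracy.
    """
    rng = random.Random(seed)       # seeded, so the split is reproducible
    shuffled = instances[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]

# Illustrative data: 10 labelled instances.
data = [(i, 'pos' if i % 2 == 0 else 'neg') for i in range(10)]
train, test = split_train_test(data)
print(len(train), len(test))  # 7 3
```

Every instance ends up in exactly one of the two sets, so accuracy measured on the test set is never computed on data the model was trained on.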
Further, different learning algorithms can lead to models of different quality learned from the same training data. In addition, in the context of online learning, velocity, which is related to the learning speed of an algorithm, is an important impact factor for data streams to be processed effectively and efficiently. However, this chapter focuses on offline learning and analyses in depth how the nature of learning algorithms is related to the nature of static data. This chapter is organized as follows: Section 2 introduces fundamental concepts of machine learning and how computational intelligence contributes to the design of learning algorithms. Section 3 presents a framework proposed in a nature-inspired way for control of machine learning tasks towards appropriate employment of learning algorithms and efficient processing of big data. Section 4 reports experimental studies on the employment of learning algorithms and efficient processing of big data and discusses the obtained results in both quantitative and qualitative terms. Section 5 highlights the contributions of this chapter and suggests further directions towards advancing this research area by using computational intelligence approaches.

2 Fundamentals of Machine Learning

Machine learning is a branch of artificial intelligence and involves two stages: training and testing [18]. The first stage aims to learn something from known properties by using learning algorithms and the second stage aims to make predictions on unknown properties by using the knowledge learned in the first stage. From this point of view, training and testing are also referred to as learning and prediction, respectively. In practice, a machine learning task aims to build a model, which is further used to make predictions, through the use of learning algorithms. Therefore, this task is usually referred to as predictive modelling.
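The two stages above can be illustrated with a deliberately trivial learner. The class name, labels and data below are invented for illustration only; the learner simply memorises the most frequent training label, but it shows the training/testing (fit/predict) separation that every predictive-modelling task follows.

```python
from collections import Counter

class MajorityClassModel:
    """A toy learner illustrating the two stages of machine learning:
    training (learn the most frequent label) and testing (predict it)."""

    def fit(self, labels):
        # Training stage: learn from known properties (here, label counts).
        self.majority = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, instances):
        # Testing stage: predict unknown properties using what was learned.
        return [self.majority for _ in instances]

model = MajorityClassModel().fit(['spam', 'ham', 'spam', 'spam'])
print(model.predict(['msg1', 'msg2']))  # ['spam', 'spam']
```

Any real learning algorithm replaces the body of fit and predict, but keeps this same two-stage contract.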
Machine learning can be divided into two main types, in terms of the form of learning: supervised learning and unsupervised learning [19]. Supervised learning means learning with a teacher, because all instances from a training set are labelled, which makes the learning outcomes very explicit. In other words, supervised learning is naturally inspired by student learning under the supervision of teachers. In practice, the aim of this type of learning is to build a model by learning from labelled data and then to make predictions on other unlabelled instances with regard to the value of a predicted attribute. The predicted value of an attribute can be either discrete or continuous. Therefore, supervised learning covers both classification and regression tasks, for categorical prediction and numerical prediction, respectively. In contrast, unsupervised learning means learning without a teacher. This is because all instances from a training set are unlabelled and thus the learning outcomes are not explicit. In other words, unsupervised learning is naturally inspired by student learning without supervision. In practice, the aim of this type of learning is to discover previously unknown patterns from data sets. It includes association and clustering. The former aims to identify correlations between attributes, whereas the latter aims to group objects on the basis of their similarity to each other. According to [18], machine learning algorithms can be put into several categories: decision tree learning, rule learning, instance-based learning, Bayesian learning, perceptron learning and ensemble learning. All of these learning algorithms show the characteristic of nature inspiration. Both decision tree learning and rule learning aim to learn a set of rules on an inductive basis.
However, the difference between the two types of learning is that the former generates rules in the form of a decision tree and the latter generates if-then rules directly from training instances [1, 20, 21]. This difference is mainly due to the fact that the former follows the divide and conquer approach [22] and the latter follows the separate and conquer approach [23]. In particular, the divide and conquer approach is naturally similar to the top-down approach of student learning, such as dividing a textbook into several levels: parts, chapters, sections and subsections. The separate and conquer approach is naturally similar to the iterative approach of student learning, which means reading through an entire material in the first iteration and then focusing on the more important parts of the material for deeper understanding in subsequent iterations. Instance-based learning generally involves predicting test instances on the basis of their similarity to the training instances, such as K nearest neighbor [24]. This type of learning is also referred to as lazy learning, due to the fact that it does not aim to learn in depth to gain some pattern from data but just to make as many correct predictions as possible [16]. In other words, this type of learning is naturally similar to the exam-centered approach of student learning, which means that students mainly aim to answer the exam questions correctly without a deep understanding of the knowledge. Bayesian learning essentially employs the Bayes theorem [25]. In particular, this type of learning is based on the assumption that all the input attributes are totally independent of each other. In this context, each attribute-value pair would be independently correlated to each of the possible classes, which means that a posterior probability is provided between the attribute-value pair and the class. A popular method of Bayesian learning is Naive Bayes [26].
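The independence assumption just described can be sketched from scratch: each class is scored by its prior probability multiplied by the per-attribute conditional probabilities, treated as independent. The toy weather data, function name and smoothing choice below are invented for illustration and are not from the chapter.

```python
from collections import Counter

def naive_bayes(train, instance):
    """Score each class by P(class) * product of P(attribute=value | class),
    relying on the attribute-independence assumption of Naive Bayes."""
    classes = Counter(label for _, label in train)
    scores = {}
    for c, class_count in classes.items():
        score = class_count / len(train)  # prior P(c)
        rows = [attrs for attrs, label in train if label == c]
        for i, value in enumerate(instance):
            matches = sum(1 for attrs in rows if attrs[i] == value)
            # Add-one (Laplace) smoothing avoids zero probabilities for
            # unseen values; denominator assumes two values per attribute.
            score *= (matches + 1) / (class_count + 2)
        scores[c] = score
    return max(scores, key=scores.get)

# Invented toy data: (outlook, windy) -> play
train = [(('sunny', 'no'), 'yes'), (('sunny', 'yes'), 'no'),
         (('rainy', 'yes'), 'no'), (('sunny', 'no'), 'yes')]
print(naive_bayes(train, ('sunny', 'no')))  # yes
```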
This type of learning is naturally similar to the prior-knowledge based approach of human reasoning, which means that people make decisions, reasoning and judgments based on the knowledge they obtained before, towards having the most confident choice. Perceptron learning aims to build a neural network topology that consists of a number of layers and has a number of nodes, each of which represents a perceptron. Some popular algorithms include backpropagation [17] and probabilistic neural networks [18]. This type of learning is biology inspired, as stated in Section 1. Ensemble learning generally aims to combine different learning algorithms in the training stage, or computational models in the testing stage, towards improvement of the overall accuracy of predictions. Some popular approaches of ensemble learning include bagging [27] and boosting [28]. This type of learning is naturally similar to the approach of group learning for students collaborating on a group assignment. In terms of evaluation of a machine learning task, there are generally two main approaches: cross-validation and splitting the data into a training set and a test set. Cross-validation generally means splitting a data set into n disjoint subsets. In this context, there are n iterations in total for the evaluation: at each iteration, one subset is used for testing and the other n-1 subsets are used for training. In other words, each of the n subsets is in turn used as the test set at one of the n iterations, while the rest of the subsets are used together as the training set. In laboratory research, 10-fold cross-validation is most commonly used, i.e. the original data set is split into 10 subsets. Cross-validation is generally more expensive in terms of computational cost. Therefore, researchers sometimes instead choose to split a data set into a training set and a test set in a specific ratio, e.g.
70% of the data is used as the training set and the rest of the data is used as the test set. This data split can be done randomly or in a fixed way. However, due to the presence of uncertainty in data, the random split of data is more popular among researchers in machine learning and similar areas. In this chapter, new perspectives on the two approaches of evaluating machine learning tasks are used in Section 4. In particular, cross-validation is used towards effectively measuring the learnability of an algorithm, i.e. the extent to which the algorithm is suitable for building a confident model on the provided training data. This is in order to help employ appropriately the suitable learning algorithms for building predictive models on the basis of existing data. The other approach, splitting a data set into a training set and a test set, is adopted towards learning a model that covers highly complete patterns from the training data and evaluating the model accuracy using highly similar but different instances from the test data. This is in order to ensure that the model accuracy evaluated using the test data is trustworthy. Details on the use of the new perspectives are presented in Section 4.

3 Framework for Control of Machine Learning Tasks

This section presents a framework for control of machine learning tasks towards appropriate employment of learning algorithms and effective processing of big data. In particular, the key features of the proposed framework are described in detail. Also, the motivation for developing this framework is justified by analyzing the impact of big data on machine learning, i.e. arguing the relevance of effective control of machine learning tasks in a big data environment.

3.1 Key Features

A unified framework for control of machine learning tasks is proposed in a nature-inspired way in [1] as a further direction.
The purpose is to effectively control the pre-processing of data and to naturally employ learning algorithms and the generated predictive models. As mentioned in [1], it is relevant to deal with issues on both the algorithms side and the data side for improvement of classification performance. In fact, a database is updated daily in real applications, which could result in a gradual increase of data size and in changes to the patterns that exist in the database. In order to avoid lowering computational efficiency, the size of a sample needs to be determined in an optimal way. In addition, it is also required to avoid loss of accuracy. From this point of view, sampling is critical not only in terms of the size of a sample but also in terms of its representativeness. Feature selection/extraction is another critical task in the pre-processing of data. As mentioned in [1], high-dimensional data usually results in high computational costs. In addition, it is also very likely to contain irrelevant attributes which result in noise and coincidental patterns. In some cases, it is also necessary to effectively detect noise, whether the noise is introduced naturally or artificially. In other words, noise may be introduced into a data set due to mistakes in typing or illegal modifications by hackers. A potential way of noise handling is using association rules to detect that the value of an attribute is incorrect on the basis of the other attribute-value pairs in the same data instance. Also, appropriate employment of learning algorithms and predictive models is highly required, due to the fact that there are many existing machine learning algorithms, but no effective ways to determine which of them are suitable for a particular data set. Traditionally, the decision is made by experts based on their knowledge and experience. However, it is fairly difficult to judge the correctness of the decision prior to empirical validation.
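One common form of such empirical validation is cross-validation on the provided training data, as this chapter uses in Section 4. The fold construction can be sketched as follows; the function name and data are illustrative, not from the chapter.

```python
def cross_validation_folds(instances, n=10):
    """Split instances into n disjoint subsets; each subset serves once as
    the test set while the remaining n-1 subsets form the training set."""
    folds = [instances[i::n] for i in range(n)]  # n disjoint subsets
    for i in range(n):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# Illustrative use: 5-fold cross-validation over 20 instances.
data = list(range(20))
for train, test in cross_validation_folds(data, n=5):
    assert len(test) == 4 and len(train) == 16
    assert sorted(train + test) == data  # every instance used exactly once
```

An algorithm's average accuracy over the n test folds is then an empirical estimate of how suitable it is for the data set, before committing to it.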
In real applications, it is not realistic to frequently change decisions after it has been confirmed that the chosen algorithms are not suitable. The arguments above outline the necessity to develop the framework for control of machine learning tasks in a nature-inspired way. In other words, this framework aims to adopt computational intelligence techniques to control machine learning tasks. In this framework, the actual employment of both learning algorithms and predictive models follows computational intelligence approaches. The suitability of a learning algorithm and the reliability of a model are measured by statistical analysis on the basis of historical records. In particular, each algorithm in the algorithms base, as illustrated in Fig. 1, is assigned a weight based on its performance in previous machine learning tasks. The weight of an algorithm is naturally similar to the impact factor of a journal, which is based on its overall citation rate. Following the employment of suitable learning algorithms, each generated model is then also assigned a weight based on its performance on the latest version of the validation data in a database. Following the employment of high quality models, a knowledge base is finalized and deployed for real applications, as illustrated in Fig. 1.

Fig. 1. Unified Framework for Control of Machine Learning Tasks [1]

3.2 Justification

As mentioned in Section 1, machine learning algorithms are usually evaluated against accuracy, efficiency, interpretability and stability. The presence of big data has deeply affected machine learning tasks in the four aspects mentioned above. In terms of accuracy, overfitting of training data can generally be significantly reduced as the size of data is greatly increased. There is evidence reported in [29] that learning from a large training set can significantly improve the performance in predictive modelling. The evidence is illustrated in Fig.
2, provided by Banko and Brill in 2001 [30], which shows that performance on the complex problem of automated word disambiguation keeps improving as the size of training data grows towards billions of words. In particular, each of the four learning algorithms shows an increase of at least 10% in test accuracy as the number of words is increased from 0.3 million to 1 billion. For example, the memory-based algorithm increases its test accuracy from 83% to 93%, and the winnow algorithm increases its test accuracy from 75% to 97%. The improvement in learning performance is due to the fact that an increase in data size can usually improve the completeness of the pattern covered. In other words, small data may cover only a small part of a pattern in a hypothesis space. Therefore, overfitting of training data is likely to result in a learning algorithm building a model that performs well on training data but poorly on test data. This occurs especially when the training data covers a pattern that is highly different from the one in the test data. When the size of data is increased, the training data is likely to cover a pattern that is more similar to the one present in the test data.

Fig. 2. Improvement of word disambiguation by learning from big data [30]

On the other hand, the increase in the size of data may also increase the chance of noise and coincidental patterns being present in the data. This is due to the fact that a biased improvement in the quantity of data may result in a loss of quality. Also, large training data is likely to cover some patterns which occur with fairly low frequency. This could mean that the patterns covered by the training data are purely coincidental rather than scientifically confident. The above issues regarding accuracy can be solved through scaling up algorithms or scaling down data.
As specified in [1], the former way is to reduce the bias on the algorithms side. In particular, the algorithms can be designed to be more robust against noise and thus avoid being confused by coincidental patterns. In the context of rule learning, the reduction of bias can be achieved through direct advancement of rule generation methods or employment of rule simplification algorithms; similar approaches also apply to other types of learning algorithms. The latter way is to reduce the variance on the data side. In particular, data can be pre-processed through removal of irrelevant attributes by feature selection techniques or through the merger of redundant attributes by feature extraction techniques. In addition, data can also be resampled by selecting only those instances that are more representative. In terms of efficiency, the increase in the size of data is likely to increase the computational costs in both the training and testing stages. In the training stage, it may slow down the process of building a predictive model by learning from big data. In the testing stage, the predictive model is likely to have a high level of complexity, which significantly increases the computational complexity of predicting on unseen instances. For rule learning algorithms in particular, the presence of big data may result in the generation of a large number of complex rules. As stressed in [29, 7, 31], processing of big data needs decomposition, parallelism, modularity and recurrence. In this case, machine learning algorithms that are inflexible and work in a black-box manner would fail in dealing with big data. This would immediately happen to algorithms of quadratic complexity (O(n^2)) when encountering data with millions of points (instances). The above issues regarding efficiency can also be resolved through scaling up algorithms or scaling down data.
In the former way, the algorithms can be designed to have a low level of computational complexity in the training stage and thus be less affected by the increase in the size of training data. In the context of rule learning, the improvement of efficiency can be achieved through the employment of pruning algorithms, as some of these algorithms can stop the process of rule learning earlier. In the latter way, the size of data can be reduced through dimensionality reduction and data sampling. This not only reduces the computational costs in the training stage, but also results in the generation of simpler models and thus speeds up the process of predicting on unseen instances in the testing stage. In terms of interpretability, the increase in the size of data usually decreases interpretability. As analyzed in [1, 16], interpretability can be affected by the size of training data in terms of model complexity. In the context of rule learning, big data may result in the generation of a large number of complex rules, which would make it difficult for people to read and understand. The above issues regarding interpretability can also be solved through scaling up algorithms or scaling down data. In the former way, the algorithms can be designed to be robust against noise and irrelevant or redundant attributes. In other words, the presence of noise and irrelevant/redundant attributes would not result in irrelevant patterns being learned by the algorithms. In the context of rule learning, algorithms for rule generation may decide to skip some attributes or attribute-value pairs in the generation of decision trees or if-then rules due to the irrelevance of these attributes or attribute-value pairs. In addition, the employment of rule simplification methods also helps improve interpretability, since such employment usually results in the generation of a smaller number of simpler rules.
In the latter way, the data size is reduced through dimensionality reduction and data sampling as mentioned above. In particular, as discussed in [1, 16], the reduction of data dimensionality decreases the maximum length (the maximum number of rule terms) of each single rule. Data sampling also reduces the maximum number of rules. In this approach, interpretability can be improved if the dimensionality reduction and data sampling are effectively undertaken. In terms of stability, the increase in the size of data usually leads to an increase in dissimilarity between different instances, and thus results in high variance in the performance of learning algorithms when an experiment is repeated independently on the same data. In other words, big data generally has high variability, which results in low stability of performance across repeated experiments on the same data, especially when unstable algorithms are used. Unstable algorithms include neural network learning and rule learning [17, 32]. The above stability issues can also be resolved through scaling up algorithms or scaling down data. As introduced in [33, 34], prediction accuracy in machine learning tasks can be affected by bias and variance. Bias generally means errors originating from the use of statistical heuristics and can be reduced through scaling up algorithms. Variance generally means errors originating from random processing of data and can be reduced by scaling down data. From this point of view, heuristic based algorithms usually show high bias and low variance on fixed training and test data. In contrast, random algorithms usually show low bias and high variance on fixed training and test data. However, if both the training data and the test data are randomly sampled from the original data, heuristic based algorithms usually appear to be more sensitive to the change of sample and thus perform less stably.
In the same situation, random algorithms, in contrast, usually appear to be less sensitive and perform more stably when an experiment is repeated independently on the basis of random sampling of training and test data. On the basis of the above descriptions, it is highly relevant to develop the unified framework illustrated in Fig. 1 towards effective control of machine learning tasks in a big data environment. In particular, learning algorithms need to be employed appropriately by measuring their accuracy, efficiency, interpretability and stability on the basis of particularly selected data. It is also important to have effective data pre-processing so that employment of algorithms can be done on the basis of the high quality data provided by the data pre-processing.

4 Experimental Studies

This section presents two case studies on big data. The first case study addresses the veracity aspect, and is designed to confirm that cross-validation can be used to measure the learnability of algorithms on particular training data towards effective employment of learning algorithms for predictive modelling. The second case study addresses variability, and is designed to show how to measure data variability by checking the variance of the performance of a particular algorithm when independently repeated experiments are undertaken on the same data. The other two aspects of big data, volume and variety, have been studied in [1, 16] through theoretical analysis and empirical investigations in terms of efficiency and interpretability of computational models.

4.1 Measure of Learnability

This case study is done using 10 data sets retrieved from the biomedical repository [35]. The characteristics of these data sets are described in Table 1.

Table 1.
Medical data sets

Name           Attribute Types  #Attributes  #Instances  #Classes
ALL-AML        continuous       7130         72          2
DLBCL-NIH      continuous       7400         160         2
lungCancer     continuous       12534        32          2
MLL_Leukemia   continuous       12583        72          3
BCR-ABL        continuous       12559        327         2
E2A-PBX1       continuous       12559        327         2
Hyperdip50     continuous       12559        327         2
MLL            continuous       12559        327         2
T-ALL          continuous       12559        327         2
TEL-AML1       continuous       12559        327         2

In particular, all these selected data sets are high-dimensional and have additional test sets supplied. This selection supports the experimental setup, which employs cross-validation [36] to measure the learnability of particular algorithms on the training data and then employs suitable algorithms to build models that are evaluated using the test instances. In other words, for each of the selected data sets, the whole training set is provided in order to measure the extent to which a particular algorithm is suitable for building a model on the training set, and the test set is used to evaluate the performance of the model learned by the algorithm. In this setup, the results show the extent to which the learnability of an algorithm, measured by using cross-validation on training data, can provide a good basis for judging whether the algorithm can build a confident model that performs well on additional test data. In this case study, C4.5, Naïve Bayes and K nearest neighbor are chosen as learning algorithms for testing due to their popularity in real applications. In addition, these three algorithms can also be seen as nature-inspired approaches, as justified in Section 2. The results of this experimental study can be seen in Table 2. Table 2 shows that in almost all cases the learnability of an algorithm measured by cross-validation is effective for judging the suitability of an algorithm to a particular training set, which leads to the expected performance on the corresponding test set.
In other words, the results show that if an algorithm is judged to be suitable for a particular training set through measuring its learnability, then the model learned by the algorithm from the training set usually performs well on the additionally supplied test set.

Table 2. Learnability on training data and prediction accuracy on test data

Dataset         C4.5 I   C4.5 II   NB I   NB II   KNN I   KNN II
ALL-AML         93%      100%      70%    71%     88%     97%
DLBCL-NIH       44%      58%       55%    63%     56%     63%
lungCancer      94%      89%       25%    90%     88%     97%
MLL_Leukemia    79%      100%      22%    53%     89%     100%
BCR-ABL         91%      95%       96%    95%     97%     96%
E2A-PBX1        96%      87%       92%    95%     98%     88%
Hyperdip50      91%      88%       81%    80%     94%     98%
MLL             94%      97%       94%    95%     97%     100%
T-ALL           91%      100%      87%    87%     55%     99%
TEL-AML1        95%      95%       76%    76%     98%     98%

NB: C4.5 I means testing the learnability of the algorithm by cross-validation on the basis of the training data, and C4.5 II means testing the performance of the predictive model using the additionally supplied test data. The same also applies to NB and KNN.

On the other hand, when an algorithm is judged to be unsuitable for a particular training set through cross-validation, the results generally indicate that the model learned by the algorithm from the training set achieves only a low level of classification accuracy on the additionally supplied test set. In particular, it can be seen on the DLBCL-NIH data that all three algorithms are judged to be less suitable for the training set, and the models learned by these algorithms from the training set fail to perform well on the corresponding test set. A similar case can be seen on the MLL_Leukemia data, where Naïve Bayes is judged to be unsuitable for the training set and the model learned by the algorithm fails to perform well on the corresponding test set. In addition, there are two exceptional cases, on the lungCancer and T-ALL data.
In the first case, Naïve Bayes is judged to be very unsuitable for the training set, but the performance on the test set of the model learned by the algorithm from the training set is very good. In the second case, K nearest neighbor is judged to be less suitable for the training set, but the actual performance on the test set of the model learned by the algorithm from the training set is extremely good. In both cases, this could be because the training set essentially covers the complete information, and the split of the training set for the purpose of cross-validation could result in incompleteness, to which both Naïve Bayes and K nearest neighbor are quite sensitive. However, when the algorithm learns from the whole training set, the resulting model covers the complete information from the training set and thus performs well on the test set.

4.2 Measure of Data Variability

This case study is conducted using 20 data sets retrieved from the UCI [37] and biomedical repositories. The characteristics of these chosen data sets are described in Table 3.

Table 3.
Data sets from UCI and biomedical repositories

Name             Attribute Types        #Attributes   #Instances   #Classes
anneal           discrete, continuous   38            798          6
balance-scale    discrete               4             625          3
car              discrete               6             1728         4
credit-a         discrete, continuous   15            690          2
credit-g         discrete, continuous   20            1000         2
diabetes         discrete, continuous   20            768          2
heart-statlog    continuous             13            270          2
hepatitis        discrete, continuous   20            155          2
ionosphere       continuous             34            351          2
iris             continuous             4             150          3
lymph            discrete, continuous   19            148          4
wine             continuous             13            178          3
zoo              discrete, continuous   18            101          7
sonar            continuous             61            208          2
segment          continuous             19            2310         7
colonTumor       continuous             2001          62           2
DLBCLOutcome     continuous             7130          58           2
DLBCLTumor       continuous             7130          77           2
DLBCL-Stanford   continuous             4027          47           2
lung-Michigan    continuous             7130          96           2

The data sets selected from the UCI repository are all considered small data, as they are of lower dimensionality and sample size, except for the segment data, which is considered big data due to its larger sample size. On the other hand, the last five data sets, selected from the biomedical repository, are all considered big data due to their high dimensionality. This selection puts the case study in the context of data science by means of processing data of different scalability. In addition, none of these chosen data sets is supplied with an additional test set. The data sets were also selected so that both discrete and continuous attributes are present, in order to investigate how the different types of attributes could impact on the data variability.

On the basis of the chosen data, the experiment on each data set is undertaken by independently repeating the training-testing process 100 times and checking the variance of the performance over the 100 repetitions, on the basis of random sampling of training and test data in the ratio of 70:30.
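The repetition procedure just described can be sketched as follows. This is a hypothetical illustration assuming scikit-learn, with GaussianNB standing in for Naïve Bayes; the arrays X and y are placeholders for one of the selected data sets.

```python
# Hedged sketch of the variability experiment: repeat an independent random
# 70:30 train/test split, record the classification accuracy each time, and
# report the standard deviation over all repetitions (assumes scikit-learn;
# X and y are placeholders for one of the selected data sets).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

def accuracy_std(estimator, X, y, repetitions=100, test_size=0.3):
    """Standard deviation of test accuracy over repeated random splits."""
    accuracies = []
    for seed in range(repetitions):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed)
        accuracies.append(estimator.fit(X_tr, y_tr).score(X_te, y_te))
    return float(np.std(accuracies))
```

A higher value returned by `accuracy_std` would be read, in the sense of this case study, as an indication of higher variability of the underlying data.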
This experimental setup is in order to measure the extent to which the data is variable, leading to variance in terms of performance in machine learning tasks. In this context, C4.5, Naïve Bayes and K nearest neighbor are chosen as learning algorithms for testing the variance due to the fact that these algorithms are not stable, i.e. they are sensitive to changes in the data sample. The results are presented in Table 4.

Table 4. Data variability measured by standard deviation of classification accuracy

Dataset          C4.5    NB      KNN
anneal           0.007   0.017   0.023
balance-scale    0.028   0.022   0.020
car              0.011   0.019   0.028
credit-a         0.026   0.021   0.030
credit-g         0.027   0.023   0.022
diabetes         0.027   0.028   0.027
heart-statlog    0.044   0.039   0.045
hepatitis        0.046   0.042   0.073
ionosphere       0.031   0.043   0.035
iris             0.030   0.033   0.027
lymph            0.057   0.057   0.055
wine             0.048   0.027   0.054
zoo              0.045   0.068   0.063
sonar            0.057   0.059   0.052
segment          0.010   0.015   0.010
colonTumor       0.094   0.105   0.089
DLBCLOutcome     0.122   0.104   0.109
DLBCLTumor       0.074   0.067   0.072
DLBCL-Stanford   0.133   0.060   0.096
lung-Michigan    0.040   0.041   0.028

It can be seen from Table 4 that on each data set, while different algorithms are used, the standard deviation of the classification accuracy over the 100 independently repeated experiments appears to be at a very similar level, except for the DLBCL-Stanford data set, on which Naïve Bayes displays a much lower level of standard deviation. On the other hand, looking across different data sets, the standard deviation appears to be very different, no matter which of the three algorithms is adopted. In particular, for the 15 UCI data sets, the standard deviation is lower than 5% in most cases or a bit higher than 5% in several cases (e.g. on the lymph and sonar data sets). In contrast, for the last five data sets, selected from the biomedical repository, the standard deviation is usually higher than 5% and is even close to or higher than 10% in some cases (e.g. on the colonTumor and DLBCLOutcome data sets).
An exceptional case is the lung-Michigan data set, which appears to have a standard deviation lower than 5%, no matter which of the three algorithms is used. In addition, it can also be seen from Table 4 that the data sets that contain only continuous attributes appear to have a higher standard deviation than the data sets that contain discrete attributes. Some data sets that contain both discrete and continuous attributes also appear to have a higher standard deviation than the data sets that contain only discrete attributes. In fact, the presence of continuous attributes generally increases the attribute complexity and thus makes the data more complex, which leads to a potential increase of the data variability. The results shown in Table 4 generally indicate that attribute complexity, data dimensionality and sample size impact on the size of data, and that data of a larger size is likely to be of higher variability, leading to higher variance in terms of performance in machine learning tasks, especially when the training and test data are sampled on a purely random basis.

5 Conclusion

This chapter has proposed a unified framework in a nature inspired way for control of machine learning tasks in Section 3.1, and the necessity of the proposal has been justified in Section 3.2 through analyzing the impact of big data on machine learning. Two case studies have been conducted experimentally following computational intelligence methodologies in Section 4. The results from the case studies also indicate the necessity of the unified framework through using computational intelligence concepts. The results from the first case study indicate that cross-validation is an effective way to measure the extent to which an algorithm is suitable for building a predictive model on the basis of the existing data.
In fact, a test set is not actually available in reality; instead, a set of unseen instances is given for the model to predict the values of unknown attributes of each particular instance. From this point of view, the framework proposed in Section 3 is highly relevant in order to achieve appropriate employment of learning algorithms on the basis of the existing data. However, it is difficult to guarantee in reality that the existing data covers the full population. Therefore, the framework proposed in Section 3 can be modified further to work towards achieving natural selection of learning algorithms. In other words, the learnability of an algorithm measured through cross-validation can be used as its chance of being employed for predictive modelling, towards predicting unseen instances through natural selection of the predefined classes. Similar ideas have been applied to voting based classification in [38].

On the other hand, the results from the second case study indicate that data can be of high variability, which could lead to high variance in terms of performance in machine learning tasks when training and test data are sampled on a purely random basis. In fact, as described in Section 3.2, when training and test data are sampled randomly, the algorithms based on statistical heuristics generally display higher variance than the algorithms with high randomness. However, these heuristics-based algorithms, such as C4.5, Naïve Bayes and K nearest neighbor, are highly popular in practical applications. This indicates the necessity of effective and efficient pre-processing of data prior to the training stage, in order to avoid high variance due to random sampling. In particular, effective sampling of training and test data can be achieved through data clustering, in order to ensure that the training instances are of high similarity to the test instances.
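One possible realisation of such clustering-based sampling is to cluster the instances first and then split each cluster separately, so that every region of the data contributes to both the training and the test set. The following is a hypothetical sketch assuming scikit-learn's KMeans; the number of clusters and the 70:30 ratio are illustrative choices, not prescribed here.

```python
# Hedged sketch of clustering-based sampling: cluster the instances, then
# take a 70:30 split inside each cluster, so that test instances come from
# the same regions of the data as the training instances (assumes
# scikit-learn; n_clusters is a free, illustrative parameter).
import numpy as np
from sklearn.cluster import KMeans

def cluster_based_split(X, n_clusters=5, train_ratio=0.7, seed=0):
    """Return train/test index arrays sampled per cluster."""
    rng = np.random.RandomState(seed)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X)
    train_idx, test_idx = [], []
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        rng.shuffle(members)
        cut = int(round(train_ratio * len(members)))
        train_idx.extend(members[:cut])
        test_idx.extend(members[cut:])
    return np.array(train_idx), np.array(test_idx)
```

Compared with a purely random 70:30 split, such a per-cluster split is intended to keep the training instances similar to the test instances and thereby reduce the variance observed in the second case study.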
This is naturally inspired by the principle of student examination, i.e. the exam questions should all cover what the students have actually learned from the learning materials rather than anything outside these sources. In other words, representative sampling of training and test data would make the model learned from the training data cover more complete patterns and would make the model accuracy evaluated by using the test data more trustworthy. On the basis of the above descriptions, clustering-based sampling of training and test data is strongly recommended as a further direction.

References

[1] H. Liu, A. Gegov and M. Cocea, Rule Based Systems for Big Data: A Machine Learning Approach, 1 ed., vol. 13, Switzerland: Springer, 2016.
[2] “What is Big Data,” SAS Institute Inc, [Online]. Available: http://www.sas.com/big-data/. [Accessed 17 May 2015].
[3] “Master Data Management for Big Data,” IBM, [Online]. Available: http://www-01.ibm.com/software/data/infosphere/mdm-big-data/. [Accessed 17 May 2015].
[4] W. Pedrycz and S. M. Chen, Eds., Information Granularity, Big Data, and Computational Intelligence, vol. 8, Switzerland: Springer, 2015.
[5] P. Levine, “Machine Learning + Big Data,” WorldPress, [Online]. Available: http://a16z.com/2015/01/22/machine-learning-big-data/. [Accessed 15 May 2015].
[6] T. Condie, P. Mineiro, N. Polyzotis and M. Weimer, “Machine learning for big data,” in ACM SIGMOD/PODS Conference, San Francisco, USA, 2013.
[7] L. Wang and C. A. Alexander, “Machine Learning in Big Data,” International Journal of Mathematical, Engineering and Management Sciences, vol. 1, no. 2, pp. 52-61, 2016.
[8] X. Wu, X. Zhu, G. Q. Wu and W. Ding, “Data Mining with Big Data,” IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 1, pp. 97-107, 2014.
[9] S. Suthaharan, “Big data classification: problems and challenges in network intrusion prediction with machine learning,” ACM SIGMETRICS Performance Evaluation Review, vol. 41, no. 4, pp. 70-73, 2014.
[10] O. Y.
Al-Jarrah, P. D. Yoo, S. Muhaidat and G. K. Karagiannidis, “Efficient Machine Learning for Big Data: A Review,” Big Data Research, vol. 2, no. 3, pp. 87-93, 2015.
[11] D. E. O'Leary, “Artificial Intelligence and Big Data,” IEEE Intelligent Systems, vol. 28, no. 2, pp. 96-99, 2013.
[12] C. Ma, H. H. Zhang and X. Wang, “Machine learning for Big Data Analytics in Plants,” Trends in Plant Science, vol. 19, no. 12, pp. 798-808, 2014.
[13] H. Adeli and N. Siddique, Computational Intelligence: Synergies of Fuzzy Logic, Neural Networks and Evolutionary Computing, New Jersey: John Wiley & Sons, 2013.
[14] L. Rutkowski, Computational Intelligence: Methods and Techniques, Heidelberg: Springer, 2008.
[15] J. Worrell, “Computational Learning Theory: 2014-2015,” University of Oxford, 2014. [Online]. Available: https://www.cs.ox.ac.uk/teaching/courses/2014-2015/clt/. [Accessed 20 September 2016].
[16] H. Liu, M. Cocea and A. Gegov, “Interpretability of Computational Models for Sentiment Analysis,” in Sentiment Analysis and Ontology Engineering: An Environment of Computational Intelligence, vol. 639, W. Pedrycz and S. M. Chen, Eds., Switzerland: Springer, 2016, pp. 199-220.
[17] P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, New Jersey: Pearson Education, 2006.
[18] T. Mitchell, Machine Learning, New York: McGraw Hill, 1997.
[19] D. Barber, Bayesian Reasoning and Machine Learning, Cambridge: Cambridge University Press, 2012.
[20] H. Liu, A. Gegov and F. Stahl, “Categorization and Construction of Rule Based Systems,” in 15th International Conference on Engineering Applications of Neural Networks, Sofia, Bulgaria, 2014.
[21] H. Liu, A. Gegov and M. Cocea, “Network Based Rule Representation for Knowledge Discovery and Predictive Modelling,” in IEEE International Conference on Fuzzy Systems, Istanbul, 2015.
[22] R. Quinlan, “Induction of Decision Trees,” Machine Learning, vol. 1, pp. 81-106, 1986.
[23] J.
Fürnkranz, “Separate-and-Conquer rule learning,” Artificial Intelligence Review, vol. 13, pp. 3-54, 1999.
[24] J. Zhang, “Selecting typical instances in instance-based learning,” in The 9th International Conference on Machine Learning, Aberdeen, Scotland, 1992.
[25] M. Hazewinkel, Ed., “Bayes formula,” in Encyclopedia of Mathematics, Springer, 2001.
[26] I. Rish, “An Empirical Study of the Naïve Bayes Classifier,” IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, vol. 3, no. 22, pp. 41-46, 2001.
[27] L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, no. 2, pp. 123-140, 1996.
[28] Y. Freund and R. E. Schapire, “Experiments with a New Boosting Algorithm,” in Machine Learning: Proceedings of the Thirteenth International Conference (ICML '96), 1996.
[29] “Machine Learning on Big Data,” EBTIC, 19 August 2014. [Online]. Available: http://www.ebtic.org/pages/ebtic-view/ebtic-viewdetails/machine-learning-on-big-data-d/687. [Accessed 15 May 2015].
[30] M. Banko and E. Brill, “Scaling to very very large corpora for natural language disambiguation,” in Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, 2001.
[31] K. M. Tarwani, S. Saudagar and H. D. Misalkar, “Machine Learning in Big Data Analytics: An Overview,” International Journal of Advanced Research in Computer Science and Software Engineering, vol. 5, no. 4, pp. 270-274, 2015.
[32] I. Kononenko and M. Kukar, Machine Learning and Data Mining: Introduction to Principles and Algorithms, Chichester, West Sussex: Horwood Publishing Limited, 2007.
[33] H. Liu, A. Gegov and M. Cocea, “Collaborative Rule Generation: An Ensemble Learning Approach,” Journal of Intelligent and Fuzzy Systems, vol. 30, no. 4, pp. 2277-2287, 2016.
[34] H. Liu, A. Gegov and M. Cocea, “Hybrid Ensemble Learning Approach for Generation of Classification Rules,” in International Conference on Machine Learning and Cybernetics, Guangzhou, 2015.
[35] J. Li and H.
Liu, “Kent Ridge Bio-medical Dataset,” I2R Data Mining Department, 2003. [Online]. Available: http://datam.i2r.astar.edu.sg/datasets/krbd/. [Accessed 18 May 2015].
[36] S. Geisser, Predictive Inference, New York: Chapman and Hall, 1993.
[37] M. Lichman, “UCI Machine Learning Repository,” University of California, School of Information and Computer Science, 2013. [Online]. Available: http://archive.ics.uci.edu/ml. [Accessed 12 May 2015].
[38] H. Liu, A. Gegov and M. Cocea, “Nature and Biology Inspired Approach of Classification towards Reduction of Bias in Machine Learning,” in International Conference on Machine Learning and Cybernetics, Jeju Island, South Korea, 2016.