© 2019 IJSRSET | Volume 6 | Issue 3 | Print ISSN: 2395-1990 | Online ISSN: 2394-4099
Themed Section: Engineering and Technology
DOI: 10.32628/IJSRSET196373

A Survey of Machine Learning Algorithms

Prof. Deepak Agrawal, Abhiruchi Dubey
Takshshila Institute of Engineering and Technology, Jabalpur, Madhya Pradesh, India

IJSRSET196373 | Received: 20 May 2019 | Accepted: 22 June 2019 | May-June 2019 [6 (3): 364-369]
Prof. Deepak Agrawal et al. Int J Sci Res Sci Eng Technol. May-June 2019; 6 (3): 364-369

ABSTRACT

Machine learning is a fast-growing research area in computer science and in many industries worldwide. It has been applied successfully in sectors as varied as social media, economics, finance, healthcare and agriculture. Several intelligent machine learning techniques have been designed and used to provide predictive analytics solutions for big data. This paper presents a literature survey of different machine learning techniques, together with a study of the machine learning algorithms commonly used for big data analytics.

Keywords: ANN, Data Analytics, Machine Learning Algorithms, Technique, Prediction, Model

I. INTRODUCTION

In this data-rich era, it is essential to apply sophisticated analytics techniques to huge, diverse big data sets in order to produce useful knowledge and information. Big data analytics is an emerging research area that deals with the collection, storage and analysis of immense data sets to uncover unknown patterns and other key information. It helps identify the data that are integral to future business decisions. Big data analytics is applied in domains such as banking and insurance, healthcare, education, social media and entertainment, bioinformatics, geospatial applications and agriculture.

Handling big data with conventional data processing applications is a herculean task. Intelligent machine learning methods can therefore be adopted to discover hidden patterns, trends and associations in the data. The objective of this paper is to discuss the machine learning algorithms that data scientists use for analyzing and modeling big data.

II. BIG DATA ANALYTICS

The term big data, which describes extremely large data sets, is widely used by researchers all over the world. Traditional relational databases are not capable of handling big data. Enormous quantities of data arrive from sources such as sensors, transactional applications, the web and social media.

The big data phenomenon can be understood through the five V's associated with it: Volume, Velocity, Variety, Veracity and Value.

• Volume: The huge amount of data produced every second, ranging from terabytes to zettabytes. Data sets of this size can be maintained using distributed systems.
• Velocity: The rate at which data is produced and processed to meet demand.
• Variety: The diverse range of data that can be used.
• Veracity: The quality of the data, that is, the biases, noise, abnormality, etc. in the data.
• Value: The valuable knowledge revealed from the data.

Data scientists use many well-established analytics techniques. Text analytics, predictive analytics, natural language processing and machine learning are a few of the approaches used to make better and faster decisions on big data sets and to uncover hidden insights.

Most traditional machine learning algorithms were implemented for data sets that fit completely into memory [15]. As data keeps growing day by day, many intelligent learning methods are being implemented to solve big data predictive analytics problems. A study of several commonly used machine learning techniques for big data analytics is provided in the literature survey (Section IV).
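To make the remark about data sets that must fit completely into memory more concrete, the following is a minimal sketch, not taken from the paper, of a statistic computed in a single pass over chunks of data. It assumes Welford's online update for the mean and variance; the function names `read_in_chunks` and `streaming_mean_variance` and the toy data are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Iterator, List, Tuple


def read_in_chunks(values: Iterable[float], chunk_size: int) -> Iterator[List[float]]:
    """Yield lists of at most `chunk_size` values.

    Hypothetical stand-in for reading successive blocks from disk or a stream.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunk: List[float] = []
    for value in values:
        chunk.append(value)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # final, possibly shorter, chunk
        yield chunk


def streaming_mean_variance(chunks: Iterable[List[float]]) -> Tuple[int, float, float]:
    """Return (count, mean, population variance) in a single pass.

    Uses Welford's online update, so only the current chunk and three
    running numbers are held in memory at any time.
    """
    count, mean, m2 = 0, 0.0, 0.0
    for chunk in chunks:
        for x in chunk:
            count += 1
            delta = x - mean
            mean += delta / count
            m2 += delta * (x - mean)
    variance = m2 / count if count else 0.0
    return count, mean, variance


if __name__ == "__main__":
    readings = (float(i) for i in range(1, 11))  # toy data: 1.0 .. 10.0
    print(streaming_mean_variance(read_in_chunks(readings, chunk_size=3)))
```

The same pattern of keeping small running summaries instead of the raw data is one simple way an algorithm can be adapted to data that is larger than memory; it is shown here only as an illustration of the idea, not as a method proposed by the surveyed works.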
III. MACHINE LEARNING

Machine learning is an interdisciplinary research area that combines ideas from several branches of science, namely artificial intelligence, statistics, information theory and mathematics. The prime focus of machine learning research is the development of fast and efficient learning algorithms that can make predictions on data. In data analytics, machine learning is an approach used to create models for prediction.

Machine learning tasks are mainly grouped into three categories: supervised, unsupervised and reinforcement learning. Supervised machine learning requires training with labeled data. Each labeled training example consists of an input value and a desired target output value. The supervised learning algorithm analyzes the training data and infers a function that can be used to map new inputs to outputs. In unsupervised machine learning, hidden insights are drawn from unlabeled data sets, for example through cluster analysis. The third category, reinforcement learning, allows a machine to learn its behavior from the feedback received through interactions with an external environment [3]. From a data processing point of view, supervised and unsupervised learning techniques are preferred for data analysis, while reinforcement learning techniques are preferred for decision-making problems [7].

IV. LITERATURE SURVEY

J. Qiu et al. presented different machine learning algorithms for big data processing [7]. The first is representation learning, or feature learning, which deals with learning data representations that make the data analysis process easier. The performance of machine learning algorithms is strongly influenced by the choice of data representation (or features) [16]. This learning scheme plays a crucial role in dimensionality reduction tasks. The important steps in representation learning are feature selection, feature extraction and distance metric learning [14]. Feature selection (variable selection) techniques find the features of the data that are most relevant for model construction. Feature extraction techniques transform high-dimensional data into a low-dimensional space. In distance metric learning, a distance function is constructed to calculate the distance between points of a data set.

The authors also discussed deep learning. Most earlier machine learning approaches follow a shallow-structured learning architecture that contains a single layer of nonlinear feature transformations. Examples of such techniques are Gaussian mixture models (GMMs), hidden Markov models (HMMs), support vector machines (SVMs), logistic regression and kernel regression [9]. In contrast to the shallow-structured architecture, deep learning techniques make use of supervised and unsupervised strategies in a deep architecture. Learning systems with a deep architecture are composed of several levels of nonlinear processing stages, in which the output of each lower layer is given as the input of the next higher layer. Examples are deep neural networks, convolutional neural networks, deep belief networks and recurrent neural networks. Because of their high performance, deep learning algorithms are well suited for big data analytics applications.

Scalability is a challenging issue for traditional machine learning algorithms. Traditional schemes cannot process huge data chunks within a stipulated time because they require all the data to be in the same database. A field of machine learning called distributed learning has evolved to solve this problem. In this scheme, learning is carried out on data sets distributed among several workstations in order to scale up the learning process [4]. Examples of distributed machine learning algorithms are decision rules, stacked generalization, meta-learning and distributed boosting. Parallel machine learning is another popular scheme, in which the learning process is executed in multiprocessor environments or on multithreaded machines [1].

Transfer learning is another machine learning approach mentioned in their paper. In the conventional machine learning process, the training data and test data are commonly taken from the same domain; that is, the input feature space and data distribution are identical [8]. In certain scenarios, however, obtaining training and test data from the same domain is difficult and expensive. Transfer learning addresses this issue: a high-performance learner is created for a target domain by training it on data from a related source domain. Transfer learning techniques are widely used in many real-world data processing applications.

The authors also discussed a learning scheme called active learning. In some cases the data is available without labels, which is a challenge. Manually labeling a large data collection is an expensive and strenuous task, and learning from unlabeled data is very difficult. Active learning addresses this issue by selecting a subset of the most important instances for labeling [17]. Another scheme, kernel-based learning, has been widely used in many engineering applications to design efficient, powerful and high-performance nonlinear algorithms [5]. Algorithms capable of operating with kernels include support vector machines (SVM), principal component analysis (PCA) and the kernel perceptron.

J. L. Berral-Garcia presented a paper describing the machine learning algorithms frequently used for big data analytics [6]. Several algorithms are used for modeling, prediction and clustering tasks. Decision tree algorithms (such as CART, Recursive Partition Trees or M5), K-nearest neighbors algorithms, Bayesian algorithms (using Bayes' theorem), support vector machines (SVM), artificial neural networks, K-means and DBSCAN are presented in that paper. Several execution frameworks were also mentioned: MapReduce frameworks (Apache Hadoop and Spark), Google's TensorFlow and Microsoft's Azure-ML. Implementations of these algorithms are publicly available through tools, platforms and libraries such as R (CRAN), Python scikit-learn, Weka, MOA, Elasticsearch and Kibana.

M. U. Bokhari et al. presented a three-layered architecture model for storing and analyzing big data [11]. The three layers are the data gathering layer, the data storing layer, and the data analysis and report generation layer. To gather and handle the huge volume of big data coming from high-speed sources such as sensors or social media, a cluster of high-speed nodes or servers is kept in the data gathering layer. The data storage layer is responsible for storing the big data; the Hadoop Distributed File System (HDFS) can be used for this purpose. In the data analysis layer, machine learning techniques such as ANN, Naive Bayes, SVM and Principal Component Analysis are used to extract knowledge from the huge, complex data chunks.

P. Y. Wu et al. provided case studies showing how big data analytics is useful in precision medicine for providing the most appropriate treatment to each patient [12]. Principal Component Analysis, Singular Value Decomposition and tensor-based approaches are useful for feature extraction, while filter-based and wrapper-based methods are helpful for feature selection. All of these are dimensionality reduction techniques. The authors compared different techniques for performing data mining tasks. Logistic regression, Cox regression and local regression are simple to interpret but are sensitive to outliers. Logistic regression with LASSO regularization reduces the feature space, but overfitting is a problem. Other models, such as hidden Markov models, conditional random fields, relational subgroup discovery and episode rule mining, are also useful for data mining tasks. The authors also discussed useful platforms for big data analytics: Apache Hadoop, IBM InfoSphere Platform, Apache Spark Streaming, Tableau, QlikView, TIBCO Spotfire and other visual analytics tools are highly impactful platforms for providing big data analytics solutions. Two real-world case studies, one on integrative -omic data for improved understanding of cancer mechanisms and one on the incorporation of genomic knowledge into the EHR system for improved patient diagnosis and care, were used to discuss the usefulness of biomedical big data analytics for precision medicine. Multi-omic TCGA data [13] and EHR data [2] were used to conduct this study.

M. R. Bendre et al. [10] conducted research on the use of big data in precision agriculture. The authors noted that big data provides a broad range of functions for uncovering new insights that address several farming problems. The designed model uses the MapReduce technique for big data processing and the linear regression method for data prediction. Data collected from the Krishi Vidyapeeth Rahuri (KVR) station, Ahmednagar, India, are used to test the model. The results forecast by this model are very useful for effective decision making in the agriculture domain.

The following table summarizes the literature survey presented in this paper.

TABLE I. LITERATURE SURVEY SUMMARY

Sl. No. | Author(s) name (year) | Algorithms / Techniques | Summary
--------|-----------------------|-------------------------|--------
1 | J. L. Berral-Garcia (2016) | Decision tree algorithms, K-nearest neighbor algorithms, Bayesian algorithms, SVM, ANN, K-means, DBSCAN | A survey of various machine learning algorithms for classification, prediction and modeling.
2 | J. Qiu, Q. Wu, G. Ding, Y. Xu and S. Feng (2016) | Gaussian mixture models, hidden Markov models, SVM, logistic regression, kernel regression, deep neural networks, deep belief networks, PCA, kernel perceptron | A survey of various traditional as well as advanced machine learning algorithms used for big data processing.
3 | M. U. Bokhari, M. Zeyauddin and M. A. Siddiqui (2016) | ANN, SVM, PCA, Naive Bayes | Presented a three-layered architecture model for storing and analyzing big data. Data storage can be done using the Hadoop Distributed File System (HDFS), and data analysis can be done using techniques such as ANN, SVM, Naive Bayes and PCA.
4 | P. Y. Wu, C. W. Cheng, C. D. Kaddi, J. Venugopalan, R. Hoffman and M. D. Wang (2016) | Logistic regression, PCA, HMM, local regression, Cox regression | Discussed several machine learning algorithms and platforms such as Hadoop, IBM InfoSphere, Tableau, QlikView and Spark for providing big data solutions. Case studies using -omic data from TCGA and EHR data show the usefulness of biomedical big data analytics for precision medicine.
5 | M. R. Bendre, R. C. Thool and V. R. Thool | MapReduce, linear regression | A model was built using MapReduce and linear regression. A case study was carried out to predict the rainfall and temperature values for the year 2013 using historical weather data collected from KVR, Ahmednagar. The main objective was to improve the accuracy of rainfall forecasting.

V. CONCLUSION

With the advances in big data technology, it has become difficult to handle complex big data using traditional learning algorithms. Several advanced, efficient and intelligent learning algorithms are therefore required to handle huge chunks of heterogeneous data. The results obtained through these analytics techniques provide more effective solutions to many real-world problems in domains such as healthcare, agriculture, social media and banking. Various research papers were surveyed to gather information about advanced learning techniques. This paper gives an overall idea of the advanced machine learning algorithms and techniques used to provide solutions to big data analytics problems.

VI. REFERENCES

[1]. "Parallel machine learning toolbox", retrieved from http://www.research.ibm.com/haifa/projects/verification/ml_toolbox/.
[2]. C. A. Caligtan and P. C. Dykes, "Electronic health records and personal health records", Semin Oncol Nurs, vol. 27, pp. 218-228, 2011.
[3]. C. M. Bishop, Pattern Recognition and Machine Learning, Springer, New York, 2006.
[4]. D. Peteiro-Barral and B. Guijarro-Berdinas, "A survey of methods for distributed machine learning", Progress in Artificial Intelligence, Springer, vol. 2, issue 1, pp. 1-11, 2013. DOI: 10.1007/s13748-012-0035-5.
[5]. G. Ding, Q. Wu, Y. D. Yao, J. Wang and Y. Chen, "Kernel-Based Learning for Statistical Signal Processing in Cognitive Radio Networks: Theoretical Foundations, Example Applications, and Future Directions", IEEE Signal Processing Magazine, vol. 30, issue 4, pp. 126-136, 2013. DOI: 10.1109/MSP.2013.2251071.
[6]. J. L. Berral-Garcia, "A quick view on current techniques and machine learning algorithms for big data analytics", 18th International Conf. on Transparent Optical Networks, pp. 1-4, 2016. DOI: 10.1109/ICTON.2016.7550517.
[7]. J. Qiu, Q. Wu, G. Ding, Y. Xu and S. Feng, "A survey of machine learning for big data processing", EURASIP Journal on Advances in Signal Processing, Springer, vol. 2016:67, pp. 1-16, 2016. DOI: 10.1186/s13634-016-0355-x.
[8]. K. Weiss, T. Khoshgoftaar and D. Wang, "A survey of transfer learning", Journal of Big Data, Springer, vol. 3, issue 9, pp. 1-40, 2016. DOI: 10.1186/s40537-016-0043-6.
[9]. Li Deng, "A tutorial survey of architectures, algorithms and applications for deep learning", APSIPA Transactions on Signal and Information Processing, vol. 3, pp. 1-29, 2014. DOI: 10.1017/atsip.2013.9.
[10]. M. R. Bendre, R. C. Thool and V. R. Thool, "Big data in precision agriculture: Weather forecasting for future farming", 1st International Conf. on Next Generation Computing Technologies, pp. 744-750, 2015. DOI: 10.1109/NGCT.2015.7375220.
[11]. M. U. Bokhari, M. Zeyauddin and M. A. Siddiqui, "An effective model for big data analytics", 3rd International Conference on Computing for Sustainable Global Development, pp. 3980-3982, 2016.
[12]. P. Y. Wu, C. W. Cheng, C. D. Kaddi, J. Venugopalan, R. Hoffman and M. D. Wang, "-Omic and Electronic Health Record Big Data Analytics for Precision Medicine", IEEE Transactions on Biomedical Engineering, vol. 64, issue 2, pp. 263-273, 2017. DOI: 10.1109/TBME.2016.2573285.
[13]. T. C. G. Atlas. Available: http://cancergenome.nih.gov/.
[14]. W. Tu and S. Sun, "Cross-domain representation-learning framework with combination of class-separate and domain-merge objectives", Proceedings of the 1st International Workshop on Cross Domain Knowledge Discovery in Web and Social Network Mining, ACM, pp. 18-25, 2012. DOI: 10.1145/2351333.2351336.
[15]. X. W. Chen and X. Lin, "Big Data Deep Learning: Challenges and Perspectives", IEEE Access, vol. 2, pp. 514-525, 2014. DOI: 10.1109/ACCESS.2014.2325029.
[16]. Y. Bengio, A. Courville and P. Vincent, "Representation Learning: A Review and New Perspectives", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, issue 8, pp. 1798-1828, 2013. DOI: 10.1109/TPAMI.2013.50.
[17]. Y. Fu, B. Li, X. Zhu and C. Zhang, "Active Learning without Knowing Individual Instance Labels: A Pairwise Label Homogeneity Query Approach", IEEE Transactions on Knowledge and Data Engineering, vol. 26, issue 4, pp. 808-822, 2014. DOI: 10.1109/TKDE.2013.165.

Cite this article as:
Prof. Deepak Agrawal, Abhiruchi Dubey, "A Survey of Machine Learning Algorithms", International Journal of Scientific Research in Science, Engineering and Technology (IJSRSET), Online ISSN: 2394-4099, Print ISSN: 2395-1990, Volume 6 Issue 3, pp. 364-369, May-June 2019.
Journal URL: http://ijsrset.com/IJSRSET196373

International Journal of Scientific Research in Science, Engineering and Technology (www.ijsrset.com)
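As a supplementary illustration of the model attributed to Bendre et al. [10] in Section IV, which pairs MapReduce processing with linear regression, the sketch below shows one way the two ideas can fit together. It is not the authors' implementation: it assumes simple one-variable least squares, it runs the map and reduce steps in a single process rather than on a Hadoop cluster, and the names `map_partition`, `reduce_stats` and `fit_line` as well as the toy data are hypothetical.

```python
from functools import reduce
from typing import Iterable, Sequence, Tuple

# Sufficient statistics for one-variable least squares:
# (n, sum_x, sum_y, sum_xy, sum_xx)
Stats = Tuple[int, float, float, float, float]
Point = Tuple[float, float]


def map_partition(points: Sequence[Point]) -> Stats:
    """Map step: summarize one data partition by its partial sums."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_xx = sum(x * x for x, _ in points)
    return (n, sum_x, sum_y, sum_xy, sum_xx)


def reduce_stats(a: Stats, b: Stats) -> Stats:
    """Reduce step: partial sums from two partitions simply add up."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3], a[4] + b[4])


def fit_line(partitions: Iterable[Sequence[Point]]) -> Tuple[float, float]:
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    empty: Stats = (0, 0.0, 0.0, 0.0, 0.0)
    n, sx, sy, sxy, sxx = reduce(reduce_stats, map(map_partition, partitions), empty)
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("need at least two points with distinct x values")
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return slope, intercept


if __name__ == "__main__":
    # Toy data lying on y = 2x + 1, split across three hypothetical partitions.
    partitions = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0)], [(3.0, 7.0), (4.0, 9.0)]]
    print(fit_line(partitions))
```

Because each partition is reduced to five numbers before anything is combined, the raw records never need to be gathered in one place, which is the property that makes this kind of regression a natural fit for a MapReduce-style framework.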