Abstract
Big Data is characterized not only by the growing volume of data but also by the complexity and variety of the tools needed to process and analyze it. This research compares selected machine learning algorithms on datasets of different types and sizes using Apache Spark, in order to judge fairly which algorithm fits each dataset best. The algorithms were compared on a few parameters, chiefly accuracy and training time. They were applied to three datasets from different fields: marketing, packing and statistics, and security. The findings show that the decision tree algorithm is the most suitable for the marketing and security datasets, while logistic regression achieved the highest accuracy on the packing and statistics dataset.
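The comparison procedure described in the abstract (fit each algorithm, record its training time, then score accuracy on held-out data) can be illustrated with a minimal plain-Python sketch. This is not the authors' code and it does not use Apache Spark; the classes `MajorityClass` and `OneFeatureStump`, the function `compare`, and the toy data are all hypothetical stand-ins chosen so the example runs with the standard library only.

```python
import time
from collections import Counter


class MajorityClass:
    """Hypothetical baseline: always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class OneFeatureStump:
    """Hypothetical one-level decision tree: one threshold on one feature."""

    def fit(self, X, y):
        best = None
        for j in range(len(X[0])):
            for t in sorted({row[j] for row in X}):
                left = [lab for row, lab in zip(X, y) if row[j] <= t]
                right = [lab for row, lab in zip(X, y) if row[j] > t]
                if not left or not right:
                    continue  # a split must leave samples on both sides
                l_lab = Counter(left).most_common(1)[0][0]
                r_lab = Counter(right).most_common(1)[0][0]
                correct = left.count(l_lab) + right.count(r_lab)
                if best is None or correct > best[0]:
                    best = (correct, j, t, l_lab, r_lab)
        if best is None:
            # No usable split: fall back to the majority label on both sides.
            label = Counter(y).most_common(1)[0][0]
            best = (0, 0, float("inf"), label, label)
        _, self.feature_, self.threshold_, self.left_, self.right_ = best
        return self

    def predict(self, X):
        return [
            self.left_ if row[self.feature_] <= self.threshold_ else self.right_
            for row in X
        ]


def compare(models, X_train, y_train, X_test, y_test):
    """Return accuracy and training time (seconds) for each named model."""
    results = {}
    for name, model in models.items():
        start = time.perf_counter()
        model.fit(X_train, y_train)
        train_seconds = time.perf_counter() - start
        predictions = model.predict(X_test)
        correct = sum(p == t for p, t in zip(predictions, y_test))
        results[name] = {
            "accuracy": correct / len(y_test),
            "train_seconds": train_seconds,
        }
    return results


# Made-up toy data for illustration only; not taken from the paper's datasets.
X_train = [[1.0], [2.0], [3.0], [6.0], [7.0], [8.0], [9.0]]
y_train = [0, 0, 0, 1, 1, 1, 1]
X_test = [[1.5], [2.5], [7.5], [8.5]]
y_test = [0, 0, 1, 1]

results = compare(
    {"majority": MajorityClass(), "stump": OneFeatureStump()},
    X_train, y_train, X_test, y_test,
)
for name, metrics in results.items():
    print(name, metrics)
```

Keeping the timer around `fit` only, and scoring on data the model has not seen, keeps the two reported parameters separate. In a Spark setting the same loop shape would presumably wrap the library's own estimators rather than these stand-ins.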
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Hafez, M.M., Shehab, M.E., El Fakharany, E., Abdel Ghfar Hegazy, A.E.F. (2017). Effective Selection of Machine Learning Algorithms for Big Data Analytics Using Apache Spark. In: Hassanien, A., Shaalan, K., Gaber, T., Azar, A., Tolba, M. (eds) Proceedings of the International Conference on Advanced Intelligent Systems and Informatics 2016. AISI 2016. Advances in Intelligent Systems and Computing, vol 533. Springer, Cham. https://doi.org/10.1007/978-3-319-48308-5_66
DOI: https://doi.org/10.1007/978-3-319-48308-5_66
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-48307-8
Online ISBN: 978-3-319-48308-5
eBook Packages: Engineering, Engineering (R0)