Abstract
Massive data sets are generated in many modern applications, ranging from economics to bioinformatics and from social networks to scientific databases. Such data typically need to be processed by machine learning algorithms, which entails a high processing cost and usually requires the execution of iterative algorithms. Spark has recently been proposed as a framework that efficiently supports iterative algorithms over massive data. In this paper, we design a framework for clustering and classification of big data that is suitable for Spark. Our framework supports different restrictions on the data exchange model, which are applicable in different settings. We integrate the k-means and ID3 algorithms into our framework, which yields variants of these algorithms that apply to the different restrictions on the data exchange model. We implemented our algorithms on the open-source computing framework Spark and evaluated our approach on a 37-node cluster, thus demonstrating the scalability of our techniques. Our experimental results show that our approach outperforms the k-means algorithm provided by Spark by up to 31%, while centralized k-means is at least one order of magnitude worse.
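To make the iterative, partitioned style of computation mentioned in the abstract concrete, the following is a minimal sketch of one common way to distribute k-means: each data partition computes per-cluster coordinate sums and counts, and only these small summaries are merged to update the centers. This is an illustration under stated assumptions, not the authors' framework or Spark's implementation; it uses plain Python lists in place of Spark partitions, and the function names (`partial_stats`, `kmeans_step`, `kmeans`) are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def closest(point: Point, centers: Sequence[Point]) -> int:
    """Return the index of the center nearest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])),
    )


def partial_stats(partition: Sequence[Point], centers: Sequence[Point]):
    """Per-partition step: accumulate coordinate sums and point counts per cluster."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for point in partition:
        i = closest(point, centers)
        counts[i] += 1
        for d in range(dim):
            sums[i][d] += point[d]
    return sums, counts


def kmeans_step(partitions: Sequence[Sequence[Point]], centers: Sequence[Point]) -> List[Point]:
    """One iteration: merge the partial statistics and recompute the centers."""
    dim = len(centers[0])
    total_sums = [[0.0] * dim for _ in centers]
    total_counts = [0] * len(centers)
    for partition in partitions:  # assumption: in a cluster, this loop would run in parallel
        sums, counts = partial_stats(partition, centers)
        for i in range(len(centers)):
            total_counts[i] += counts[i]
            for d in range(dim):
                total_sums[i][d] += sums[i][d]
    # A cluster that received no points keeps its previous center.
    return [
        tuple(s / total_counts[i] for s in total_sums[i]) if total_counts[i] else tuple(centers[i])
        for i in range(len(centers))
    ]


def kmeans(partitions, centers, max_iter: int = 20, tol: float = 1e-9) -> List[Point]:
    """Repeat `kmeans_step` until the centers move by at most `tol` or `max_iter` is reached."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        new_centers = kmeans_step(partitions, centers)
        shift = max(
            sum((a - b) ** 2 for a, b in zip(new, old))
            for new, old in zip(new_centers, centers)
        )
        centers = new_centers
        if shift <= tol:
            break
    return centers
```

In this sketch only the per-cluster sums and counts leave a partition, never the raw points, which is one simple example of a restricted data exchange pattern; the specific restrictions studied in the paper are not reproduced here.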
Acknowledgment
The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 604102 (Human Brain Project). This work was co-funded by the European Union and the General Secretariat of Research and Technology, Ministry of Education, Research and Religious Affairs under the project (AMOR) of the Bilateral R&T Cooperation Program Greece-Israel 2013–2015. This support is gratefully acknowledged.
Copyright information
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Mallios, X., Vassalos, V., Venetis, T., Vlachou, A. (2016). A Framework for Clustering and Classification of Big Data Using Spark. In: Debruyne, C., et al. On the Move to Meaningful Internet Systems: OTM 2016 Conferences. OTM 2016. Lecture Notes in Computer Science, vol 10033. Springer, Cham. https://doi.org/10.1007/978-3-319-48472-3_20
DOI: https://doi.org/10.1007/978-3-319-48472-3_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-48471-6
Online ISBN: 978-3-319-48472-3
eBook Packages: Computer Science, Computer Science (R0)