A Framework for Clustering and Classification of Big Data Using Spark

Xristos Mallios, Vasilis Vassalos, Tassos Venetis & Akrivi Vlachou

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 10033)

Included in the following conference series:

OTM Confederated International Conferences "On the Move to Meaningful Internet Systems"

Abstract

Massive data sets are generated in many modern applications, ranging from economics to bioinformatics and from social networks to scientific databases. Typically, such data need to be processed by machine learning algorithms, which entails a high processing cost and usually requires the execution of iterative algorithms. Spark has recently been proposed as a framework that efficiently supports iterative algorithms over massive data. In this paper, we design a framework for clustering and classification of big data suitable for Spark. Our framework supports different restrictions on the data exchange model that are applicable in different settings. We integrate the k-means and ID3 algorithms into our framework, leading to interesting variants of these algorithms that apply to the different restrictions on the data exchange model. We implemented our algorithms on the open-source computing framework Spark and evaluated our approach on a cluster of 37 nodes, thus demonstrating the scalability of our techniques. Our experimental results show that we outperform the k-means algorithm provided by Spark by up to 31%, while the centralized k-means is at least one order of magnitude worse.
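
The abstract does not spell out the data exchange model, so the following is only a minimal sketch of one common pattern for distributed k-means, not the authors' framework or Spark's implementation. It assumes the data is already split into partitions and that each partition sends back only per-cluster sums and counts rather than raw points. All names (`local_stats`, `merge_stats`, `kmeans`) are hypothetical, and plain Python lists stand in for Spark partitions.

```python
from functools import reduce
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]
Stats = Tuple[List[List[float]], List[int]]  # (per-cluster sums, per-cluster counts)


def nearest(point: Point, centroids: Sequence[Point]) -> int:
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def local_stats(partition: Sequence[Point], centroids: Sequence[Point]) -> Stats:
    """Map step: summarize one partition as per-cluster sums and counts."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for point in partition:
        k = nearest(point, centroids)
        counts[k] += 1
        for d, value in enumerate(point):
            sums[k][d] += value
    return sums, counts


def merge_stats(a: Stats, b: Stats) -> Stats:
    """Reduce step: add two partial summaries element-wise."""
    sums = [[x + y for x, y in zip(sa, sb)] for sa, sb in zip(a[0], b[0])]
    counts = [x + y for x, y in zip(a[1], b[1])]
    return sums, counts


def kmeans(partitions: Sequence[Sequence[Point]],
           centroids: Sequence[Point],
           iterations: int = 10) -> List[Point]:
    """Iterate: summarize each partition, merge summaries, recompute centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        sums, counts = reduce(
            merge_stats, (local_stats(p, centroids) for p in partitions)
        )
        # A cluster that received no points keeps its previous centroid.
        centroids = [
            tuple(s / n for s in total) if n else old
            for total, n, old in zip(sums, counts, centroids)
        ]
    return centroids


# Two toy partitions (assumed data, for illustration only).
partitions = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 10.0), (10.0, 11.0)]]
result = kmeans(partitions, [(0.0, 0.0), (10.0, 10.0)])
```

In this sketch only the small `(sums, counts)` summaries cross partition boundaries, which is one way a restriction on data exchange could be respected. The variants described in the paper may differ.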

Acknowledgment

The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 604102 (Human Brain Project). This work was co-funded by the European Union and the General Secretariat of Research and Technology, Ministry of Education, Research and Religious Affairs under the project AMOR of the Bilateral R&T Cooperation Program Greece-Israel 2013–2015. This support is gratefully acknowledged.

Author information

Authors and Affiliations

Athens University of Economics and Business, Athens, Greece
Xristos Mallios, Vasilis Vassalos, Tassos Venetis & Akrivi Vlachou

Corresponding author

Correspondence to Akrivi Vlachou.

Editor information

Editors and Affiliations

ADAPT Centre, Trinity College Dublin, Dublin 2, Ireland
Christophe Debruyne
University of Lorraine, Vandoeuvre-les-Nancy, France
Hervé Panetto
TU Graz, Graz, Austria
Robert Meersman
La Trobe University, Melbourne, Australia
Tharam Dillon
Institute of Computer Languages, TU Wien, Vienna, Austria
eva Kühn
ADAPT Centre, Trinity College Dublin, Dublin 2, Ireland
Declan O'Sullivan
Università degli Studi di Milano Crema, Crema, Italy
Claudio Agostino Ardagna

About this paper

Cite this paper

Mallios, X., Vassalos, V., Venetis, T., Vlachou, A. (2016). A Framework for Clustering and Classification of Big Data Using Spark. In: Debruyne, C., et al. On the Move to Meaningful Internet Systems: OTM 2016 Conferences. OTM 2016. Lecture Notes in Computer Science, vol 10033. Springer, Cham. https://doi.org/10.1007/978-3-319-48472-3_20

DOI: https://doi.org/10.1007/978-3-319-48472-3_20
Published: 18 October 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-48471-6
Online ISBN: 978-3-319-48472-3
eBook Packages: Computer Science, Computer Science (R0)
