A Framework for Clustering and Classification of Big Data Using Spark

  • Conference paper
On the Move to Meaningful Internet Systems: OTM 2016 Conferences (OTM 2016)

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 10033)

Abstract

Nowadays, massive data sets are generated in many modern applications, ranging from economics to bioinformatics and from social networks to scientific databases. Typically, such data need to be processed by machine learning algorithms, which entails a high processing cost and usually requires the execution of iterative algorithms. Spark has recently been proposed as a framework that efficiently supports iterative algorithms over massive data. In this paper, we design a framework for clustering and classification of big data suitable for Spark. Our framework supports different restrictions on the data exchange model that are applicable in different settings. We integrate the k-means and ID3 algorithms in our framework, leading to interesting variants of our algorithms that apply to the different restrictions on the data exchange model. We implemented our algorithms over the open-source computing framework Spark and evaluated our approach on a 37-node cluster, thus demonstrating the scalability of our techniques. Our experimental results show that we outperform the k-means algorithm provided by Spark by up to 31%, while the centralized k-means is at least one order of magnitude worse.
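
As an illustration of what a restricted data exchange model could look like for k-means, the following is a minimal sketch in plain Python, not the authors' implementation and not Spark's API. It assumes that each partition shares only per-cluster coordinate sums and counts with a coordinator, never the raw points. The names `local_stats`, `merge_and_update`, and `kmeans` are hypothetical and chosen for this example only.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]
Stats = Tuple[List[List[float]], List[int]]  # (per-cluster sums, per-cluster counts)


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point: Point, centroids: List[Point]) -> int:
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def local_stats(partition: List[Point], centroids: List[Point]) -> Stats:
    """Runs on one partition: returns only aggregates, never raw points (assumed model)."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for p in partition:
        i = nearest(p, centroids)
        counts[i] += 1
        for d in range(dim):
            sums[i][d] += p[d]
    return sums, counts


def merge_and_update(stats: List[Stats], centroids: List[Point]) -> List[Point]:
    """Runs on the coordinator: merges partial aggregates into new centroids."""
    dim = len(centroids[0])
    updated = []
    for i, old in enumerate(centroids):
        total = sum(counts[i] for _, counts in stats)
        if total == 0:
            updated.append(tuple(old))  # keep an empty cluster's centroid unchanged
            continue
        updated.append(tuple(
            sum(sums[i][d] for sums, _ in stats) / total for d in range(dim)
        ))
    return updated


def kmeans(partitions: List[List[Point]], centroids: List[Point],
           iterations: int = 10) -> List[Point]:
    """Fixed number of rounds; each round exchanges one Stats tuple per partition."""
    for _ in range(iterations):
        stats = [local_stats(part, centroids) for part in partitions]
        centroids = merge_and_update(stats, centroids)
    return centroids
```

In a Spark setting, `local_stats` would roughly correspond to a per-partition step and `merge_and_update` to a reduction on the driver. That mapping is an assumption made for illustration; the abstract does not specify the exchange models at this level of detail.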

Acknowledgment

The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 604102 (Human Brain Project). This work was co-funded by the European Union and the General Secretariat of Research and Technology, Ministry of Education, Research and Religious Affairs under the project (AMOR) of the Bilateral R&T Cooperation Program Greece-Israel 2013–2015. This support is gratefully acknowledged.

Author information

Corresponding author

Correspondence to Akrivi Vlachou.

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Mallios, X., Vassalos, V., Venetis, T., Vlachou, A. (2016). A Framework for Clustering and Classification of Big Data Using Spark. In: Debruyne, C., et al. On the Move to Meaningful Internet Systems: OTM 2016 Conferences. OTM 2016. Lecture Notes in Computer Science, vol 10033. Springer, Cham. https://doi.org/10.1007/978-3-319-48472-3_20

  • DOI: https://doi.org/10.1007/978-3-319-48472-3_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-48471-6

  • Online ISBN: 978-3-319-48472-3

  • eBook Packages: Computer Science, Computer Science (R0)
