Nothing Special   »   [go: up one dir, main page]

skip to main content
article

DBCURE-MR: An efficient density-based clustering algorithm for large data using MapReduce

Published: 01 June 2014 Publication History

Abstract

Clustering is a useful data mining technique which groups data points such that the points within a single group have similar characteristics, while the points in different groups are dissimilar. Density-based clustering algorithms such as DBSCAN and OPTICS are one kind of widely used clustering algorithms. As there is an increasing trend of applications to deal with vast amounts of data, clustering such big data is a challenging problem. Recently, parallelizing clustering algorithms on a large cluster of commodity machines using the MapReduce framework have received a lot of attention. In this paper, we first propose the new density-based clustering algorithm, called DBCURE, which is robust to find clusters with varying densities and suitable for parallelizing the algorithm with MapReduce. We next develop DBCURE-MR, which is a parallelized DBCURE using MapReduce. While traditional density-based algorithms find each cluster one by one, our DBCURE-MR finds several clusters together in parallel. We prove that both DBCURE and DBCURE-MR find the clusters correctly based on the definition of density-based clusters. Our experimental results with various data sets confirm that DBCURE-MR finds clusters efficiently without being sensitive to the clusters with varying densities and scales up well with the MapReduce framework.

References

[1]
D. Aioanei, Uzaygezen 0.1, {https://code.google.com/p/uzaygezen/}, 2008.
[2]
M. Ankerst, M.M. Breunig, H.-P. Kriegel, J. Sander, OPTICS: ordering points to identify the clustering structure, in: SIGMOD Conference, 1999, pp. 49-60.
[3]
Apache. Apache hadoop. {http://hadoop.apache.org}, 2010.
[4]
D. Arlia, M. Coppola, Experiments in parallel clustering with dbscan, in: Euro-Par, 2001, pp. 326-331.
[5]
N. Beckmann, H.-P. Kriegel, R. Schneider, B. Seeger, The R¿-Tree: an efficient and robust access method for points and rectangles, in: SIGMOD Conference, 1990, pp. 322-331.
[6]
Bellman, R., Dynamic Programming. 2003. Dover Publications.
[7]
Cormen, T.H., Leiserson, C.E., Rivest, R.L. and Stein, C., Introduction to Algorithms. 2001. second edition. MIT Press, McGraw-Hill.
[8]
J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters, in: OSDI, 2004, pp. 137-150.
[9]
ELKI, Environment for Developing kdd-Applications Supported by Index-Structures, {http://elki.dbs.ifi.lmu.de}, 2012.
[10]
M. Ester, H.-P. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, in: KDD, 1996, pp. 226-231.
[11]
Faloutsos, C., Barber, R., Flickner, M., Hafner, J., Niblack, W., Petkovic, D. and Equitz, W., Efficient and effective querying by image content. J. Intell. Inf. Syst. v3 i3/4. 231-262.
[12]
M. Galassi, J. Davies, J. Theiler, B. Gough, G. Jungman, GNU Scientific Library-Reference Manual, Third Edition, for GSL Version 1.12, third edition, Network Theory Ltd, 2009.
[13]
Guha, S., Rastogi, R. and Shim, K., CURE: an efficient clustering algorithm for large databases. . Inf. Syst. v26 i1. 35-58.
[14]
Han, J. and Kamber, M., Data Mining Concepts and Techniques. 2006. second edition. Morgan Kaufmann Publishers.
[15]
Y. He, H. Tan, W. Luo, H. Mao, D. Ma, S. Feng, J. Fan, MR-DBSCAN: an efficient parallel density-based clustering algorithm using mapreduce, in: ICPADS, 2011, pp. 473-480.
[16]
A. Hinneburg, D.A. Keim, An efficient approach to clustering in large multimedia databases with noise, in: KDD, 1998, pp. 58-65.
[17]
H.V. Jagadish, A retrieval technique for similar shapes, in: SIGMOD Conference, 1991, pp. 208-217.
[18]
B. Kaluza, V. Mirchevska, E. Dovgan, M. Lustrek, M. Gams, An agent-based approach to care in independent living, in: AmI, 2010, pp. 177-186.
[19]
I. Kamel, C. Faloutsos, On packing r-trees, in: CIKM, 1993, pp. 490-499.
[20]
S. Kisilevich, F. Mansmann, D.A. Keim, P-DBSCAN: a density based clustering algorithm for exploration and analysis of attractive areas using collections of geo-tagged photos, in: COM.Geo, 2010, pp. 1-10.
[21]
F. Korn, N. Sidiropoulos, C. Faloutsos, E.L. Siegel, Z. Protopapas, Fast nearest neighbor search in medical image databases, in: VLDB, 1996, pp. 215-226.
[22]
L. Li, Y. Xi, Research on clustering algorithm and its parallelization strategy, in: ICCIS, 2011, pp. 325-328.
[23]
J. MacQueen, Some methods for classification and analysis of multivariate observations, in: Berkeley Symposium on Mathematical Statistics and Probability, 1967, pp. 281-297.
[24]
Marioh, Spatial Indexes, {http://www2.research.att.com/~marioh/spatialindex/}, 2010.
[25]
J.M. Patel, D.J. DeWitt, Partition based spatial-merge join, in: SIGMOD Conference, 1996, pp. 259-270.
[26]
K. Shim, MapReduce algorithms for big data analysis, in: PVLDB, 2012.
[27]
K. Shim, R. Srikant, R. Agrawal, High-dimensional similarity joins, in: ICDE, 1997, pp. 301-311.
[28]
Shim, K., Srikant, R. and Agrawal, R., High-dimensional similarity joins. IEEE Trans. Knowl. Data Eng. v14 i1. 156-171.
[29]
Viswanath, P. and Babu, V.S., Rough-DBSCAN: a fast hybrid density based clustering method for large data sets. . Pattern Recognition Lett. v30 i16. 1477-1488.
[30]
P. Viswanath, R. Pinkesh, l-DBSCAN: a fast hybrid density based clustering method. in: ICPR, vol. 1, 2006, pp. 912-915.
[31]
Wikipedia, Multivariate Normal Distribution, {http://en.wikipedia.org/wiki/Multivariate_normal_distribution}, 2012.
[32]
A fast parallel clustering algorithm for large spatial databases. Data Min. Knowl. Discov. v3 i3. 263-290.
[33]
Zhang, T., Ramakrishnan, R. and Livny, M., Birch: a new data clustering algorithm and its applications. . Data Min. Knowl. Discov. v1 i2.
[34]
Zhang, Z., Determining the epipolar geometry and its uncertainty: a review. . Int. J. Comput. Vis. v27 i2. 161-195.

Cited By

View all
  • (2022)Scalable clustering for EO data using efficient raster representationMultimedia Tools and Applications10.1007/s11042-022-13726-x82:8(12303-12319)Online publication date: 23-Sep-2022
  • (2021)MR-BIRCHJournal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology10.3233/JIFS-20207940:3(5295-5305)Online publication date: 1-Jan-2021
  • (2021)DBWGIE-MRJournal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology10.3233/JIFS-20179240:6(10781-10796)Online publication date: 1-Jan-2021
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Information Systems
Information Systems  Volume 42, Issue
June, 2014
216 pages

Publisher

Elsevier Science Ltd.

United Kingdom

Publication History

Published: 01 June 2014

Author Tags

  1. Clustering algorithm
  2. Density-based clustering
  3. MapReduce
  4. Parallel algorithm

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 24 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2022)Scalable clustering for EO data using efficient raster representationMultimedia Tools and Applications10.1007/s11042-022-13726-x82:8(12303-12319)Online publication date: 23-Sep-2022
  • (2021)MR-BIRCHJournal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology10.3233/JIFS-20207940:3(5295-5305)Online publication date: 1-Jan-2021
  • (2021)DBWGIE-MRJournal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology10.3233/JIFS-20179240:6(10781-10796)Online publication date: 1-Jan-2021
  • (2020)Density-based Algorithms for Big Data Clustering Using MapReduce FrameworkACM Computing Surveys10.1145/340395153:5(1-38)Online publication date: 28-Sep-2020
  • (2020)Theoretically-Efficient and Practical Parallel DBSCANProceedings of the 2020 ACM SIGMOD International Conference on Management of Data10.1145/3318464.3380582(2555-2571)Online publication date: 11-Jun-2020
  • (2020)Research on large data set clustering method based on MapReduceNeural Computing and Applications10.1007/s00521-018-3780-y32:1(93-99)Online publication date: 1-Jan-2020
  • (2019)A Distributed PCM Clustering Algorithm Based on SparkProceedings of the 2019 11th International Conference on Machine Learning and Computing10.1145/3318299.3318315(70-74)Online publication date: 22-Feb-2019
  • (2019)One-pass MapReduce-based clustering method for mixed large scale dataJournal of Intelligent Information Systems10.1007/s10844-017-0472-552:3(619-636)Online publication date: 1-Jun-2019
  • (2017)An on-line algorithm for cluster detection of mobile nodes through complex event processingInformation Systems10.1016/j.is.2015.12.00364:C(303-320)Online publication date: 1-Mar-2017
  • (2017)DASCKnowledge and Information Systems10.1007/s10115-016-0958-450:3(851-881)Online publication date: 1-Mar-2017
  • Show More Cited By

View Options

View options

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media