article

DBCURE-MR: An efficient density-based clustering algorithm for large data using MapReduce

Authors:

June Sup LeeAuthors Info & Claims

Information Systems, Volume 42

Pages 15 - 35

https://doi.org/10.1016/j.is.2013.11.002

Published: 01 June 2014 Publication History

Abstract

Clustering is a useful data mining technique which groups data points such that the points within a single group have similar characteristics, while the points in different groups are dissimilar. Density-based clustering algorithms such as DBSCAN and OPTICS are one kind of widely used clustering algorithms. As there is an increasing trend of applications to deal with vast amounts of data, clustering such big data is a challenging problem. Recently, parallelizing clustering algorithms on a large cluster of commodity machines using the MapReduce framework have received a lot of attention. In this paper, we first propose the new density-based clustering algorithm, called DBCURE, which is robust to find clusters with varying densities and suitable for parallelizing the algorithm with MapReduce. We next develop DBCURE-MR, which is a parallelized DBCURE using MapReduce. While traditional density-based algorithms find each cluster one by one, our DBCURE-MR finds several clusters together in parallel. We prove that both DBCURE and DBCURE-MR find the clusters correctly based on the definition of density-based clusters. Our experimental results with various data sets confirm that DBCURE-MR finds clusters efficiently without being sensitive to the clusters with varying densities and scales up well with the MapReduce framework.

References

[1]

D. Aioanei, Uzaygezen 0.1, {https://code.google.com/p/uzaygezen/}, 2008.

[2]

M. Ankerst, M.M. Breunig, H.-P. Kriegel, J. Sander, OPTICS: ordering points to identify the clustering structure, in: SIGMOD Conference, 1999, pp. 49-60.

Digital Library

[3]

Apache. Apache hadoop. {http://hadoop.apache.org}, 2010.

[4]

D. Arlia, M. Coppola, Experiments in parallel clustering with dbscan, in: Euro-Par, 2001, pp. 326-331.

[5]

N. Beckmann, H.-P. Kriegel, R. Schneider, B. Seeger, The R¿-Tree: an efficient and robust access method for points and rectangles, in: SIGMOD Conference, 1990, pp. 322-331.

Digital Library

[6]

Bellman, R., Dynamic Programming. 2003. Dover Publications.

[7]

Cormen, T.H., Leiserson, C.E., Rivest, R.L. and Stein, C., Introduction to Algorithms. 2001. second edition. MIT Press, McGraw-Hill.

[8]

J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters, in: OSDI, 2004, pp. 137-150.

Digital Library

[9]

ELKI, Environment for Developing kdd-Applications Supported by Index-Structures, {http://elki.dbs.ifi.lmu.de}, 2012.

[10]

M. Ester, H.-P. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, in: KDD, 1996, pp. 226-231.

[11]

Faloutsos, C., Barber, R., Flickner, M., Hafner, J., Niblack, W., Petkovic, D. and Equitz, W., Efficient and effective querying by image content. J. Intell. Inf. Syst. v3 i3/4. 231-262.

[12]

M. Galassi, J. Davies, J. Theiler, B. Gough, G. Jungman, GNU Scientific Library-Reference Manual, Third Edition, for GSL Version 1.12, third edition, Network Theory Ltd, 2009.

[13]

Guha, S., Rastogi, R. and Shim, K., CURE: an efficient clustering algorithm for large databases. . Inf. Syst. v26 i1. 35-58.

[14]

Han, J. and Kamber, M., Data Mining Concepts and Techniques. 2006. second edition. Morgan Kaufmann Publishers.

[15]

Y. He, H. Tan, W. Luo, H. Mao, D. Ma, S. Feng, J. Fan, MR-DBSCAN: an efficient parallel density-based clustering algorithm using mapreduce, in: ICPADS, 2011, pp. 473-480.

[16]

A. Hinneburg, D.A. Keim, An efficient approach to clustering in large multimedia databases with noise, in: KDD, 1998, pp. 58-65.

[17]

H.V. Jagadish, A retrieval technique for similar shapes, in: SIGMOD Conference, 1991, pp. 208-217.

[18]

B. Kaluza, V. Mirchevska, E. Dovgan, M. Lustrek, M. Gams, An agent-based approach to care in independent living, in: AmI, 2010, pp. 177-186.

[19]

I. Kamel, C. Faloutsos, On packing r-trees, in: CIKM, 1993, pp. 490-499.

Digital Library

[20]

S. Kisilevich, F. Mansmann, D.A. Keim, P-DBSCAN: a density based clustering algorithm for exploration and analysis of attractive areas using collections of geo-tagged photos, in: COM.Geo, 2010, pp. 1-10.

[21]

F. Korn, N. Sidiropoulos, C. Faloutsos, E.L. Siegel, Z. Protopapas, Fast nearest neighbor search in medical image databases, in: VLDB, 1996, pp. 215-226.

[22]

L. Li, Y. Xi, Research on clustering algorithm and its parallelization strategy, in: ICCIS, 2011, pp. 325-328.

[23]

J. MacQueen, Some methods for classification and analysis of multivariate observations, in: Berkeley Symposium on Mathematical Statistics and Probability, 1967, pp. 281-297.

[24]

Marioh, Spatial Indexes, {http://www2.research.att.com/~marioh/spatialindex/}, 2010.

[25]

J.M. Patel, D.J. DeWitt, Partition based spatial-merge join, in: SIGMOD Conference, 1996, pp. 259-270.

[26]

K. Shim, MapReduce algorithms for big data analysis, in: PVLDB, 2012.

Digital Library

[27]

K. Shim, R. Srikant, R. Agrawal, High-dimensional similarity joins, in: ICDE, 1997, pp. 301-311.

Digital Library

[28]

Shim, K., Srikant, R. and Agrawal, R., High-dimensional similarity joins. IEEE Trans. Knowl. Data Eng. v14 i1. 156-171.

[29]

Viswanath, P. and Babu, V.S., Rough-DBSCAN: a fast hybrid density based clustering method for large data sets. . Pattern Recognition Lett. v30 i16. 1477-1488.

[30]

P. Viswanath, R. Pinkesh, l-DBSCAN: a fast hybrid density based clustering method. in: ICPR, vol. 1, 2006, pp. 912-915.

[31]

Wikipedia, Multivariate Normal Distribution, {http://en.wikipedia.org/wiki/Multivariate_normal_distribution}, 2012.

[32]

A fast parallel clustering algorithm for large spatial databases. Data Min. Knowl. Discov. v3 i3. 263-290.

[33]

Zhang, T., Ramakrishnan, R. and Livny, M., Birch: a new data clustering algorithm and its applications. . Data Min. Knowl. Discov. v1 i2.

[34]

Zhang, Z., Determining the epipolar geometry and its uncertainty: a review. . Int. J. Comput. Vis. v27 i2. 161-195.

Cited By

Sisodiya NGarg SDube NThakkar PParmar ASharma S(2022)Scalable clustering for EO data using efficient raster representationMultimedia Tools and Applications10.1007/s11042-022-13726-x82:8(12303-12319)Online publication date: 23-Sep-2022
https://dl.acm.org/doi/10.1007/s11042-022-13726-x
Li YJiang HLu JLi XSun ZLi M(2021)MR-BIRCHJournal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology10.3233/JIFS-20207940:3(5295-5305)Online publication date: 1-Jan-2021
https://dl.acm.org/doi/10.3233/JIFS-202079
Yu XZeng FMwakapesa DNanehkaran YMao YXu KChen Z(2021)DBWGIE-MRJournal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology10.3233/JIFS-20179240:6(10781-10796)Online publication date: 1-Jan-2021
https://dl.acm.org/doi/10.3233/JIFS-201792
Show More Cited By

DBCURE-MR: An efficient density-based clustering algorithm for large data using MapReduce
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
2. Information systems
  1. Information systems applications
    1. Data mining

Recommendations

Mux-Kmeans: multiplex kmeans for clustering large-scale data set
ScienceCloud '14: Proceedings of the 5th ACM workshop on Scientific cloud computing

Kmeans clustering algorithm is widely used in a number of scientific applications due to its simple iterative nature and ease of implementation. The quality of clustering result highly depends on the selection of initial centroids. Different selections ...
Research on Clustering Algorithm and Its Parallelization Strategy
ICCIS '11: Proceedings of the 2011 International Conference on Computational and Information Sciences

As a hot topic of recent study, clouding computing can help us to analyze and process massive data effectively. Clustering is one of the important tasks of data mining. This paper focuses on how to improve the performance of clustering algorithm on ...
Research and implementation of user clustering based on MapReduce in multimedia big data

Poor understanding and low clustering efficiency of massive data is a problem under the context of big data. To solve this problem, Canopy + K-means clustering algorithm is proposed, and the MapReduce programming model is used to make full use of the ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Information Systems

Information Systems Volume 42, Issue

June, 2014

216 pages

ISSN:0306-4379

Issue’s Table of Contents

Copyright © © 2013.

Publisher

Elsevier Science Ltd.

United Kingdom

Publication History

Published: 01 June 2014

Author Tags

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

16
Total Citations
View Citations
0
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 24 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Sisodiya NGarg SDube NThakkar PParmar ASharma S(2022)Scalable clustering for EO data using efficient raster representationMultimedia Tools and Applications10.1007/s11042-022-13726-x82:8(12303-12319)Online publication date: 23-Sep-2022
https://dl.acm.org/doi/10.1007/s11042-022-13726-x
Li YJiang HLu JLi XSun ZLi M(2021)MR-BIRCHJournal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology10.3233/JIFS-20207940:3(5295-5305)Online publication date: 1-Jan-2021
https://dl.acm.org/doi/10.3233/JIFS-202079
Yu XZeng FMwakapesa DNanehkaran YMao YXu KChen Z(2021)DBWGIE-MRJournal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology10.3233/JIFS-20179240:6(10781-10796)Online publication date: 1-Jan-2021
https://dl.acm.org/doi/10.3233/JIFS-201792
Khader MAl-Naymat G(2020)Density-based Algorithms for Big Data Clustering Using MapReduce FrameworkACM Computing Surveys10.1145/340395153:5(1-38)Online publication date: 28-Sep-2020
https://dl.acm.org/doi/10.1145/3403951
Wang YGu YShun JMaier DPottinger RDoan ATan WAlawini ANgo H(2020)Theoretically-Efficient and Practical Parallel DBSCANProceedings of the 2020 ACM SIGMOD International Conference on Management of Data10.1145/3318464.3380582(2555-2571)Online publication date: 11-Jun-2020
https://dl.acm.org/doi/10.1145/3318464.3380582
Wei PHe FLi LShang CLi J(2020)Research on large data set clustering method based on MapReduceNeural Computing and Applications10.1007/s00521-018-3780-y32:1(93-99)Online publication date: 1-Jan-2020
https://dl.acm.org/doi/10.1007/s00521-018-3780-y
Zhang YLiu HChen TTang D(2019)A Distributed PCM Clustering Algorithm Based on SparkProceedings of the 2019 11th International Conference on Machine Learning and Computing10.1145/3318299.3318315(70-74)Online publication date: 22-Feb-2019
https://dl.acm.org/doi/10.1145/3318299.3318315
Ben Hajkacem MN'cir CEssoussi N(2019)One-pass MapReduce-based clustering method for mixed large scale dataJournal of Intelligent Information Systems10.1007/s10844-017-0472-552:3(619-636)Online publication date: 1-Jun-2019
https://dl.acm.org/doi/10.1007/s10844-017-0472-5
Roriz Junior MEndler MSilva F(2017)An on-line algorithm for cluster detection of mobile nodes through complex event processingInformation Systems10.1016/j.is.2015.12.00364:C(303-320)Online publication date: 1-Mar-2017
https://dl.acm.org/doi/10.1016/j.is.2015.12.003
Bhatnagar VKaur SSaxena RKhanna D(2017)DASCKnowledge and Information Systems10.1007/s10115-016-0958-450:3(851-881)Online publication date: 1-Mar-2017
https://dl.acm.org/doi/10.1007/s10115-016-0958-4
Show More Cited By

View Options

View options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Media

Figures

Other

Tables

View Issue’s Table of Contents