Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3183713.3196887acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

RP-DBSCAN: A Superfast Parallel DBSCAN Algorithm Based on Random Partitioning

Published: 27 May 2018 Publication History

Abstract

In most parallel DBSCAN algorithms, neighboring points are assigned to the same data partition for parallel processing to facilitate calculation of the density of the neighbors. This data partitioning scheme causes a few critical problems including load imbalance between data partitions, especially in a skewed data set. To remedy these problems, we propose a cell-based data partitioning scheme, pseudo random partitioning, that randomly distributes small cells rather than the points themselves. It achieves high load balance regardless of data skewness while retaining the data contiguity required for DBSCAN. In addition, we build and broadcast a highly compact summary of the entire data set, which we call a two-level cell dictionary, to supplement random partitions. Then, we develop a novel parallel DBSCAN algorithm, Random Partitioning-DBSCAN (shortly, RP-DBSCAN), that uses pseudo random partitioning together with a two-level cell dictionary. The algorithm simultaneously finds the local clusters to each data partition and then merges these local clusters to obtain global clustering. To validate the merit of our approach, we implement RP-DBSCAN on Spark and conduct extensive experiments using various real-world data sets on 12 Microsoft Azure machines (48 cores). In RP-DBSCAN, data partitioning and cluster merging are very light, and clustering on each split is not dragged out by a specific worker. Therefore, the performance results show that RP-DBSCAN significantly outperforms the state-of-the-art algorithms by up to 180 times.

References

[1]
Guilherme Andrade, Gabriel Ramos, Daniel Madeira, Rafael Sachetto, Renato Ferreira, and Leonardo Rocha. 2013. G-DBSCAN: A GPU Accelerated Algorithm for Density-based Clustering. Procedia Computer Science Vol. 18 (2013), 369--378.
[2]
Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, and Jörg Sander. 1999. OPTICS: Ordering Points to Identify the Clustering Structure Proc. 1999 ACM SIGMOD Int'l Conf. on Management of Data. 49--60.
[3]
Apache. 2017. Spark API Documentation. https://spark.apache.org/docs/2.1.0/api.html. (2017). Accessed: 2017-10-31.
[4]
Domenica Arlia and Massimo Coppola. 2001. Experiments in Parallel Clustering with DBSCAN. Proc. 7th European Conf. on Parallel Processing. 326--331.
[5]
Marsha J. Berger and Shahid H. Bokhari. 1987. A Partitioning Strategy for Nonuniform Problems on Multiprocessors. IEEE Trans. on Computers Vol. C-36, 5 (1987), 570--580.
[6]
Chun-Chieh Chen and Ming-Syan Chen. 2015. HiClus: Highly Scalable Density-based Clustering with Heterogeneous Cloud. Procedia Computer Science Vol. 53 (2015), 149--157.
[7]
Irving Cordova and Teng-Sheng Moh. 2015. DBSCAN on Resilient Distributed Datasets. In Proc. 2015 Int'l Conf. on High Performance Computing & Simulation. 531--540.
[8]
Bi-Ru Dai and I-Chang Lin. 2012. Efficient Map/Reduce-Based DBSCAN Algorithm with Optimized Data Partition Proc. 2012 IEEE Int'l Conf. on Cloud Computing. 59--66.
[9]
Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified Data Processing on Large Clusters Proc. 6th Sympo. on Operating System Design and Implementation. 137--150.
[10]
Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise Proc. 2nd Int'l Conf. on Knowledge Discovery and Data Mining. 226--231.
[11]
Junhao Gan and Yufei Tao. 2015. DBSCAN Revisited: Mis-Claim, Un-Fixability, and Approximation Proc. 2015 ACM SIGMOD Int'l Conf. on Management of Data. 519--530.
[12]
Junhao Gan and Yufei Tao. 2017. Dynamic Density Based Clustering. In Proc. 2017 ACM SIGMOD Int'l Conf. on Management of Data. 1493--1507.
[13]
Markus Götz, Christian Bodenstein, and Morris Riedel. 2015. HPDBSCAN: Highly Parallel DBSCAN. In Proc. Workshop on Machine Learning in High-Performance Computing Environments. 2:1--2:10.
[14]
Poonam Goyal, Sonal Kumari, Dhruv Kumar, Sundar Balasubramaniam, Navneet Goyal, Saiyedul Islam, and Jagat Sesh Challa. 2015. Parallelizing OPTICS for Commodity Clusters. In Proc. 2015 Int'l Conf. on Distributed Computing and Networking. 33.
[15]
Ade Gunawan. 2013. A Faster Algorithm for DBSCAN. Master's thesis. Eindhoven University of Technology, the Netherlands.
[16]
Mordechai Haklay and Patrick Weber. 2008. OpenStreetMap: User-Generated Street Maps.
[17]
Dianwei Han, Ankit Agrawal, Wei-Keng Liao, and Alok Choudhary. 2016. A Novel Scalable DBSCAN Algorithm with Spark. Proc. 2016 IEEE Int'l Sympo. on Parallel and Distributed Processing. 1393--1402.
[18]
Yaobin He, Haoyu Tan, Wuman Luo, Shengzhong Feng, and Jianping Fan. 2014. MR-DBSCAN: A Scalable MapReduce-based DBSCAN Algorithm for Heavily Skewed Data. Frontiers of Computer Science Vol. 8, 1 (2014), 83--99.
[19]
Eshref Januzaj, Hans-Peter Kriegel, and Martin Pfeifle. 2004. Scalable Density-Based Distributed Clustering. Proc. 8th European Conf. on Principles of Data Mining and Knowledge Discovery (2004), 231--244.
[20]
George Karypis, Eui-Hong Han, and Vipin Kumar. 1999. Chameleon: Hierarchical Clustering Using Dynamic Modeling. Computer, Vol. 32, 8 (1999), 68--75.
[21]
Dexter C. Kozen. 2012. The Design and Analysis of Algorithms. Springer Science & Business Media.
[22]
YongChul Kwon, Dylan Nunley, Jeffrey P. Gardner, Magdalena Balazinska, Bill Howe, and Sarah Loebman. 2010. Scalable Clustering Algorithm for N-body Simulations in a Shared-Nothing Cluster Proc. 22nd Int'l Conf. on Scientific and Statistical Database Management. 132--150.
[23]
Alessandro Lulli, Matteo Dell'Amico, Pietro Michiardi, and Laura Ricci. 2016. NG-DBSCAN: Scalable Density-Based Clustering for Arbitrary Data. Proceedings of the VLDB Endowment Vol. 10, 3 (2016), 157--168.
[24]
Guangchun Luo, Xiaoyu Luo, Thomas Fairley Gooch, Ling Tian, and Ke Qin. 2016. A Parallel DBSCAN Algorithm Based on Spark. In Proc. 2016 IEEE Int'l Conf. on Big Data and Cloud Computing. 548--553.
[25]
Son T. Mai, Ira Assent, and Martin Storgaard. 2016. AnyDBC: An Efficient Anytime Density-based Clustering Algorithm for Very Large Complex Datasets. In Proc. 22nd ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining. 1025--1034.
[26]
Robert Ryan McCune, Tim Weninger, and Greg Madey. 2015. Thinking like a Vertex: A Survey of Vertex-Centric Frameworks for Large-Scale Distributed Graph Processing. Comput. Surveys Vol. 48, 2 (2015), 25.
[27]
Mostofa Ali Patwary, Diana Palsetia, Ankit Agrawal, Wei-Keng Liao, Fredrik Manne, and Alok Choudhary. 2012. A New Scalable Parallel DBSCAN Algorithm Using the Disjoint-Set Data Structure Proc. 2012 Int'l Conf. on High Performance Computing, Networking, Storage and Analysis. 62:1--62:11.
[28]
Mostofa Ali Patwary, Nadathur Satish, Narayanan Sundaram, Fredrik Manne, Salman Habib, and Pradeep Dubey. 2014. PARDICLE: Parallel Approximate Density-based Clustering Proc. 2014 Int'l Conf. on High Performance Computing, Networking, Storage and Analysis. 560--571.
[29]
William M. Rand. 1971. Objective Criteria for the Evaluation of Clustering Methods. J. Amer. Statist. Assoc. Vol. 66, 336 (1971), 846--850.
[30]
Tatsuhiro Sakai, Keiichi Tamura, and Hajime Kitakami. 2017. Cell-Based DBSCAN Algorithm Using Minimum Bounding Rectangle Criteria Proc. 2017 Int'l Conf. on Database Systems for Advanced Applications. 133--144.
[31]
Hwanjun Song, Jae-Gil Lee, and Wook-Shin Han. 2017. PAMAE: Parallel k-Medoids Clustering with High Accuracy and Efficiency Proc. 23rd ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining. 1087--1096.
[32]
Jeffrey Scott Vitter. 1985. Random Sampling with a Reservoir. ACM Trans. on Mathematical Software Vol. 11, 1 (1985), 37--57.
[33]
Larry Wasserman. 2013. All of Statistics: A Concise Course in Statistical Inference. Springer Science &Business Media.
[34]
Benjamin Welton, Evan Samanas, and Barton P. Miller. 2013. Mr. Scan: Extreme Scale Density-Based Clustering Using a Tree-Based Network of GPGPU Nodes Proc. 2013 Int'l Conf. on High Performance Computing, Networking, Storage and Analysis. 84:1--84:11.
[35]
Xiaowei Xu, Jochen J"ager, and Hans-Peter Kriegel. 1999. A Fast Parallel Clustering Algorithm for Large Spatial Databases. Data Mining and Knowledge Discovery Vol. 3, 3 (1999), 263--290.
[36]
Yanwei Yu, Jindong Zhao, Xiaodong Wang, Qin Wang, and Yonggang Zhang. 2015. Cludoop: An Efficient Distributed Density-Based Clustering for Big Data Using Hadoop. International Journal of Distributed Sensor Networks Vol. 2015 (2015), 1--13.
[37]
Weizhong Zhao, Huifang Ma, and Qing He. 2009. Parallel K-Means Clustering Based on MapReduce. Proc. 1st Int'l Conf. on Cloud Computing. 674--679.
[38]
Yu Zheng, Like Liu, Longhao Wang, and Xing Xie. 2008. Learning Transportation Mode from Raw GPS Data for Geographic Applications on the Web Proc. 17th Int'l Conf. on World Wide Web. 247--256.

Cited By

View all
  • (2024)Towards Metric DBSCAN: Exact, Approximate, and Streaming AlgorithmsProceedings of the ACM on Management of Data10.1145/36549812:3(1-25)Online publication date: 30-May-2024
  • (2024)A scalable multi-density clustering approach to detect city hotspots in a smart cityFuture Generation Computer Systems10.1016/j.future.2024.03.042157(226-236)Online publication date: Aug-2024
  • (2024)A Survey and Experimental Review on Data Distribution Strategies for Parallel Spatial Clustering AlgorithmsJournal of Computer Science and Technology10.1007/s11390-024-2700-039:3(610-636)Online publication date: 1-May-2024
  • Show More Cited By

Index Terms

  1. RP-DBSCAN: A Superfast Parallel DBSCAN Algorithm Based on Random Partitioning

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data
    May 2018
    1874 pages
    ISBN:9781450347037
    DOI:10.1145/3183713
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 27 May 2018

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. clustering
    2. dbscan
    3. parallelization
    4. spark

    Qualifiers

    • Research-article

    Funding Sources

    • National Research Foundation of Korea
    • Ministry of Land, Infrastructure and Transport, Korea

    Conference

    SIGMOD/PODS '18
    Sponsor:

    Acceptance Rates

    SIGMOD '18 Paper Acceptance Rate 90 of 461 submissions, 20%;
    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)70
    • Downloads (Last 6 weeks)12
    Reflects downloads up to 13 Nov 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)Towards Metric DBSCAN: Exact, Approximate, and Streaming AlgorithmsProceedings of the ACM on Management of Data10.1145/36549812:3(1-25)Online publication date: 30-May-2024
    • (2024)A scalable multi-density clustering approach to detect city hotspots in a smart cityFuture Generation Computer Systems10.1016/j.future.2024.03.042157(226-236)Online publication date: Aug-2024
    • (2024)A Survey and Experimental Review on Data Distribution Strategies for Parallel Spatial Clustering AlgorithmsJournal of Computer Science and Technology10.1007/s11390-024-2700-039:3(610-636)Online publication date: 1-May-2024
    • (2023)STRP-DBSCAN: A Parallel DBSCAN Algorithm Based on Spatial-Temporal Random Partitioning for Clustering Trajectory DataApplied Sciences10.3390/app13201112213:20(11122)Online publication date: 10-Oct-2023
    • (2023)A fast parallelized DBSCAN algorithm based on OpenMp for detection of criminals on streaming servicesFrontiers in Big Data10.3389/fdata.2023.12929236Online publication date: 31-Oct-2023
    • (2023)Efficient Density-Peaks Clustering Algorithms on Static and Dynamic Data in Euclidean SpaceACM Transactions on Knowledge Discovery from Data10.1145/3607873Online publication date: 7-Jul-2023
    • (2023)Fast tree-based algorithms for DBSCAN for low-dimensional data on GPUsProceedings of the 52nd International Conference on Parallel Processing10.1145/3605573.3605594(503-512)Online publication date: 7-Aug-2023
    • (2023)Fast Density-Based Clustering: Geometric ApproachProceedings of the ACM on Management of Data10.1145/35889121:1(1-24)Online publication date: 30-May-2023
    • (2023)GTraclus: a novel algorithm for local trajectory clustering on GPUsDistributed and Parallel Databases10.1007/s10619-023-07429-x41:3(467-488)Online publication date: 13-May-2023
    • (2023)Design of an Image Segmentation System Based on Hierarchical Density-Based Spatial Clustering of Applications with Noise for Off-Road Unmanned Ground VehiclesProceedings of 2022 International Conference on Autonomous Unmanned Systems (ICAUS 2022)10.1007/978-981-99-0479-2_291(3163-3175)Online publication date: 10-Mar-2023
    • Show More Cited By

    View Options

    Get Access

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media