Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article
Open access

FINEX: A Fast Index for Exact & Flexible Density-Based Clustering

Published: 30 May 2023 Publication History

Abstract

Density-based clustering aims to find groups of similar objects (i.e., clusters) in a given dataset. Applications include, e.g., process mining and anomaly detection. It comes with two user parameters (ε, MinPts) that determine the clustering result, but are typically unknown in advance. Thus, users need to interactively test various settings until satisfying clusterings are found. However, existing solutions suffer from the following limitations: (a) Ineffective pruning of expensive neighborhood computations. (b) Approximate clustering, where objects are falsely labeled noise. (c) Restricted parameter tuning that is limited to ε whereas MinPts is constant, which reduces the explorable clusterings. (d) Inflexibility in terms of applicable data types and distance functions. We propose FINEX, a linear-space index that overcomes these limitations. Our index provides exact clusterings and can be queried with either of the two parameters. FINEX avoids neighborhood computations where possible and reduces the complexities of the remaining computations by leveraging fundamental properties of density-based clusters. Hence, our solution is efficient and flexible regarding data types and distance functions. Moreover, FINEX respects the original and straightforward notion of density-based clustering. In our experiments on 12 large real-world datasets from various domains, FINEX frequently outperforms state-of-the-art techniques for exact clustering by orders of magnitude.

Supplemental Material

MP4 File
Video on FINEX, presented by Konstantin Emil Thiel

References

[1]
Elke Achtert, Christian Böhm, Hans-Peter Kriegel, and Peer Kröger. 2005. Online Hierarchical Clustering in a Data Warehouse Environment. In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM '05). IEEE Computer Society, USA.
[2]
Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, and Jörg Sander. 1999. OPTICS: Ordering Points To Identify the Clustering Structure. In Proceedings of ACM SIGMOD Int. Conf. on Management of Data (SIGMOD '99, Vol. 28). ACM New York, NY, USA, 49--60.
[3]
Nikolaus Augsten and Michael Bohlen. 2013. Similarity Joins in Relational Database Systems 3rd ed.). San Rafael: Morgan & Claypool Publishers.
[4]
Jon Louis Bentley. 1975. Multidimensional binary search trees used for associative searching. Commun. ACM, Vol. 18, 9 (1975), 509--517.
[5]
Stefan Brecheisen, Hans-Peter Kriegel, and Martin Pfeifle. 2006. Parallel Density-Based Clustering of Complex Objects. In Advances in Knowledge Discovery and Data Mining (PAKDD '06). Springer Berlin Heidelberg, 179--188.
[6]
Ricardo J.G.B. Campello, Davoud Moulavi, and Jörg Sander. 2013. Density-Based Clustering Based on Hierarchical Density Estimate. In Advances in Knowledge Discovery and Data Mining (PAKDD '13). Springer Berlin Heidelberg, 160--172.
[7]
Paolo Ciaccia, Marco Patella, and Pavel Zezula. 1997. M-tree: An Efficient Access Method for Similarity Search in Metric Spaces. In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB '97). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 426--435.
[8]
Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD '96). AAAI Press, 226--231.
[9]
Junhao Gan and Yufei Tao. 2018. Fast Euclidean OPTICS with Bounded Precision in Low Dimensional Space. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD '18). Association for Computing Machinery, New York, NY, USA, 1067--1082.
[10]
Antonin Guttmann. 1984. R-Trees: A Dynamic Index Structure For Spatial Searching. In Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data (SIGMOD '84, Vol. 14). Association for Computing Machinery, New York, NY, USA, 47--57.
[11]
Jiawei Han, Micheline Kamber, and Jian Pei. 2012. Data Mining: Concepts and Techniques 3rd edition ed.). Morgan Kaufmann.
[12]
B. F. A. Hompes, J. C. A. M. Buijs, W. M. P. van der Aalst, P. M. Dixit, and J Buurman. 2015. Discovering Deviating Cases and Process Variants Using Trace Clustering. In 27th Benelux Conference on Artificial Intelligence (BNAIC 2015). Hasselt, Belgium.
[13]
Jennifer Jang and Heinrich Jiang. 2019. DBSCAN: Towards fast and scalable density clustering. In Proceedings of the 36th International Conference on Machine Learning, Vol. 97. PMLR, 3019--3029.
[14]
Matthew B Kennel. 2004. KDTREE 2: Fortran 95 and C software to efficiently search for near neighbors in a multi-dimensional Euclidean space. arXiv preprint physics/0408067 (2004).
[15]
Daniel Kocher, Nikolaus Augsten, and Willi Mann. 2021. Scaling Density-Based Clustering to Large Collections of Sets. In Proceedings of the 24th International Conference on Extending Database Technology. EDBT, 109--120.
[16]
Son T Mai, Ira Assent, and Martin Storgaard. 2016. AnyDBC: An efficient anytime density-based clustering algorithm for very large complex datasets. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. 1025--1034.
[17]
Son T. Mai, Jon Jacobsen, Sihem Amer-Yahia, Ivor Spence, Nhat-Phuong Tran, Ira Assent, and Quoc Viet Hung Nguyen. 2022. Incremental Density-Based Clustering on Multicore Processors. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 44, 3 (2022), 1338--1356.
[18]
Willi Mann, Nikolaus Augsten, and Panagiotis Bouros. 2016. An Empirical Evaluation of Set Similarity Join Techniques. In Proceedings of the VLDB Endowment, Vol. 9. San Francisco, CA, USA, 636--647.
[19]
Natalie Price-Jones and Jo Bovy. 2019. Blind chemical tagging with DBSCAN: prospects for spectroscopic surveys. Monthly Notices of the Royal Astronomical Society, Vol. 487, 1 (2019), 871--886.
[20]
UCI Machine Learning Repository. 2022. HT-SENSOR. http://archive.ics.uci.edu/ml/datasets/gassensorsforhomeactivitymonitoring. Accessed: October 2022.
[21]
Leonardo Andrade Ribeiro and Theo H"arder. 2009. Efficient Set Similarity Joins Using Min-prefixes. In Proceedings of the 13th East-European Conference on Advances in Databases and Information Systems. ADBIS, 88--102.
[22]
scikit learn. 2022. Standardization, or mean removal and variance scaling. https://scikit-learn.org/stable/modules/preprocessing.html. Accessed: October 2022.
[23]
Kevin Sheridan, Tejas G Puranik, Eugene Mangortey, Olivia J Pinon-Fischer, Michelle Kirby, and Dimitri N Mavris. 2020. An application of dbscan clustering for flight anomaly detection during the approach phase. In AIAA Scitech 2020 Forum.
[24]
Konstantin Emil Thiel, Daniel Kocher, Nikolaus Augsten, Thomas Hütter, Willi Mann, and Daniel Ulrich Schmitt. 2023. FINEX: A Fast Index for Exact & Flexible Density-Based Clustering (Extended Version with Proofs)*. arxiv: 2304.04817 [cs.DB]
[25]
Xubo Wang, Lu Qin, Xuemin Lin, Ying Zhang, and Lijun Chang. 2019. Leveraging set relations in exact and dynamic set similarity join. In The VLDB Journal. 267--292.
[26]
Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu, and Guoren Wang. 2011. Efficient Similarity Joins for Near-Duplicate Detection. ACM Transactions on Database Systems, Vol. 36, 3 (2011).

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the ACM on Management of Data
Proceedings of the ACM on Management of Data  Volume 1, Issue 1
PACMMOD
May 2023
2807 pages
EISSN:2836-6573
DOI:10.1145/3603164
Issue’s Table of Contents
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 30 May 2023
Published in PACMMOD Volume 1, Issue 1

Author Tags

  1. DBscan
  2. density-based clustering
  3. index
  4. optics

Qualifiers

  • Research-article

Funding Sources

  • Austrian Science Fund (FWF)

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • 0
    Total Citations
  • 427
    Total Downloads
  • Downloads (Last 12 months)185
  • Downloads (Last 6 weeks)13
Reflects downloads up to 13 Feb 2025

Other Metrics

Citations

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Full Access

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media