Abstract
Many clustering algorithms suffer from scalability problems on massive datasets and do not support any user interaction during runtime. To tackle these problems, anytime clustering algorithms are proposed. They produce a fast approximate result which is continuously refined during the further run. Also, they can be stopped or suspended anytime to provide an intermediate answer. In this paper, we propose a novel anytime clustering algorithm modeled on the density-based clustering paradigm. Our algorithm called A-DBSCAN is applicable to many complex data such as trajectory and medical data. The general idea of our algorithm is to use a sequence of lower bounding functions (LBs) of the true distance function to produce multiple approximate results of the true density-based clusters. A-DBSCAN operates in multiple levels w.r.t. the LBs and is mainly based on two algorithmic schemes: (1) an efficient distance upgrade scheme which restricts distance calculations to core objects at each level of the LBs and (2) a local reclustering scheme which restricts update operations to the relevant objects only. To further improve the performance, we propose a significant extension version of A-DBSCAN called A-DBSCAN-XS which is built upon the anytime scheme of A-DBSCAN and the \(\mu \)-range query scheme of a data structure called extended Xseedlist. A-DBSCAN-XS requires less distance calculations at each level than A-DBSCAN and thus is more efficient. Extensive experiments demonstrate that A-DBSCAN and A-DBSCAN-XS acquire very good clustering results at very early stages of execution and thus save a large amount of computational time. Even if they run to the end, A-DBSCAN and A-DBSCAN-XS are still orders of magnitude faster than the original algorithm DBSCAN and its variants. We also introduce a novel application for our algorithms for the segmentation of the white matter fiber tracts in human brain which is an important tool for studying the brain structure and various diseases such as Alzheimer.
Similar content being viewed by others
References
Ueno K, Xi X, Keogh E, Lee D (2006) Anytime classification using the nearest neighbor algorithm with applications to stream mining. In: ICDM, pp 623–632
Zhu Q, Batista G, Rakthanmanon T, Keogh E (2012) A novel approximation to dynamic time warping allows anytime clustering of massive time series datasets. In: SDM, pp 999–1010
Zilberstein S, Russell SJ (1995) Approximate reasoning using anytime algorithms. In: Natarajan S (ed) Imprecise and approximate computation. Kluwer Academic Publishers. http://rbr.cs.umass.edu/shlomo/papers/ZRchapter95.html
Seidl T, Assent I, Kranen P, Krieger R, Herrmann J (2009) Indexing density models for incremental learning and anytime classification on data streams. In: EDBT, pp 311–322
Seidl T, Assent I, Kranen P, Krieger R, Herrmann J (2009) Indexing density models for incremental learning and anytime classification on data streams. In: EDBT, pp 311–322
Assent I, Kranen P, Baldauf C, Seidl T (2012) AnyOut: anytime outlier detection on streaming data. In: DASFAA (1), pp 228–242
Assent I, Kranen P, Baldauf C, Seidl T (2010) Detecting outliers on arbitrary data streams using anytime approaches. In: StreamKDD. ACM, New York, NY, USA, pp 10–16
Lin J, Vlachos M, Keogh E, Gunopulos D (2004) Iterative incremental clustering of time series. In: EDBT, pp 106–122
Lin J, Vlachos M, Keogh E, Gunopulos D, Liu J, Yu S, Le J (2005) A MPAA-based iterative clustering algorithm augmented by nearest neighbors search for time-series data streams. In: PAKDD, pp 333–342
Ester M, Kriegel H, Sander J, Wimmer M, Xu X (1998) Incremental clustering for mining in a data warehousing environment. In: VLDB, pp 323–333
Ester M, Kriegel H, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD, pp 226–231
Ankerst M, Breunig MM, Kriegel H, Sander J (1999) OPTICS: ordering points to identify the clustering structure. In: SIGMOD, pp 49–60
Chan K, Fu AW (1999) Efficient time series matching by wavelets. In: ICDE, pp 126–133
Keogh EJ (2002) Exact indexing of dynamic time warping. In: VLDB, pp 406–417
Sakurai Y, Yoshikawa M, Faloutsos C (2005) FTW: fast similarity search under the time warping distance. In: PODS. ACM, New York, NY, USA, pp 326–337
Ding H, Trajcevski G, Scheuermann P, Wang X, Keogh E (2008) Querying and mining of time series data: experimental comparison of representations and distance measures. Proc VLDB Endow 1:1542–1552
Brecheisen S, Kriegel H, Pfeif1e M (2004) Efficient density-based clustering of complex objects. In: ICDM, pp 43–50
Brecheisen S, Kriegel H, Pfeifle M (2006) Parallel density-based clustering of complex objects. In: PAKDD, pp 179–188
Kröger P, Kriegel H, Kailing K (2004) Density-connected subspace clustering for high-dimensional data. In: SDM, pp 246–256
Vinh NX, Epps J, Bailey J (2009) Information theoretic measures for clusterings comparison: is a correction for chance necessary? In: ICML, pp 1073–1080
Mai ST, He X, Feng J, Böhm C (2013) Efficient anytime density-based clustering. In: SDM, pp 112–120
Morris BT, Trivedi MM (2009) Learning trajectory patterns by clustering: experimental studies and comparative evaluation. In: CVPR, pp 312–319
Dom BE (2001) An information-theoretic external cluster-validity measure. Technical Report RJ 10219, IBM
Lee J, Han J (2007) Trajectory clustering: a partition-and-group framework. In: SIGMOD, pp 593–604
Zhang K, Kwok JT (2009) Density-weighted Nyström method for computing large kernel eigensystems. Neural Comput 21(1):121–146
Shang F, Jiao LC, Shi J, Gong M, Shang R (2011) Fast density-weighted low-rank approximation spectral clustering. Data Min Knowl Discov 23(2):345–378
Dhillon IS, Guan Y, Kulis B (2007) Weighted graph cuts without eigenvectors a multilevel approach. IEEE Trans Pattern Anal Mach Intell 29(11):1944–1957
Mai ST (2014) Density-based algorithms for active and anytime clustering. PhD thesis, University of Munich
Kranen P, Assent I, Baldauf C, Seidl T (2009) Self-adaptive anytime stream clustering. In: ICDM, pp 249–258
Mai ST, He X, Hubig N, Plant C, Böhm C (2013) Active density-based clustering. In: ICDM, pp 508–517
Mai ST, Goebl S, Plant C (2012) A similarity model and segmentation algorithm for white matter fiber tracts. In: ICDM, pp 1014–1019
Mai ST (2013) Density-based clustering: a comprehensive survey. University of Munich, Technical report
Kriegel H, Kröger P, Sander J, Zimek A (2011) Density-based clustering. Data Mining Knowl Discov 1(3):231–240
Sander J, Ester M, Kriegel H, Xu X (1998) Density-based clustering in spatial databases: the algorithm GDBSCAN and its applications. Data Min Knowl Discov 2(2):169–194
Mori S (2007) Introduction to diffusion tensor imaging. Elsevier, Amsterdam
Basser PJ, Pajevic S, Pierpaoli C, Duda J, Aldroubi A (2000) In vivo fiber tractography using DT-MRI data. Magn Reson Med 44:625–632
Catani M, Howard RJ, Pajevic S, Jones DK (2002) Virtual in vivo interactive dissection of white matter fasciculi in the human brain. Neuroimage 17(1):77–94
Brun A, Park HJ, Knutsson H, Westin CF (2003) Coloring of DT-MRI fiber traces using Laplacian eigenmaps. In: EUROCAST, pp 564–572
Ding Z, Gore JC, Anderson AW (2003) Classification and quantification of neuronal fiber pathways using diffusion tensor MRI. Mag Res Med 49:716–721
Corouge I, Gerig G, Gouttard S (2004) Towards a shape model of white matter fiber bundles using diffusion tensor MRI. In: ISBI, pp 344–347
Tsai A, Westin CF, Hero AO, Willsky AS (2007) Fiber tract clustering on manifolds with dual rooted-graphs. In: CVPR
Chen W, Ding Z, Zhang S, MacKay-Brandt A, Correia S, Qu H, Crow JA, Tate DF, Yan Z, Peng Q (2009) A novel interface for interactive exploration of DTI fibers. IEEE Trans Vis. Comput Graph 15(6):1433–1440
Maddah M, Eric W, Grimson L, Warfield SK (2006) Statistical modeling and EM clustering of white matter fiber tracts. In: ISBI, pp 53–56
Böhm C, Feng J, He X, Mai ST, Plant C, Shao J (2011) A novel similarity measure for fiber clustering using longest common subsequence. In: KDD-DMMH, pp 1–9
Sherbondy A, Akers D, Mackenzie R, Dougherty R, Wandell B (2005) Exploring connectivity of the brain’s white matter with dynamic queries. IEEE Trans Vis Comput Graph 11(4):419–430
Wang Q, Yap PT, Jia H, Wu G, Shen D (2010) Hierarchical fiber clustering based on multi-scale neuroanatomical features. In: MIAR, pp 448–456
Zhu Y, Shasha D (2003) Warping indexes with envelope transforms for query by humming. In: SIGMOD, pp 181–192
Yi B, Faloutsos C (2000) Fast time sequence indexing for arbitrary Lp norms. In: VLDB, pp 385–394
Kim S, Park S, Chu WW (2001) An index-based approach for similarity search supporting time warping in large sequence databases. In: ICDE, pp 607–614
Acknowledgments
We thank Diep M. T. Phan, Ha H. T. Mai, Hanh M. T. Vo, Nhan M. T. Luong, Quan A. Tran, Ninh A. Nguyen, Anh X. Nghiem, Sebastian Goebl, Nina Hubig, and Franz Krojer for their helps and supports. Our special thanks to Prof. Kai Zhang and Prof. Brian Kulis for kindly providing us the source codes of their papers. We special thank anonymous reviewers for their invaluable comments which help to significantly improve the quality of this paper.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Mai, S.T., He, X., Feng, J. et al. Anytime density-based clustering of complex data. Knowl Inf Syst 45, 319–355 (2015). https://doi.org/10.1007/s10115-014-0797-0
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10115-014-0797-0