DOI: 10.1145/2020408.2020555
poster

An effective evaluation measure for clustering on evolving data streams

Published: 21 August 2011

Abstract

Due to the ever-growing presence of data streams, there has been a considerable amount of research on stream mining algorithms. While many algorithms have been introduced that tackle the problem of clustering on evolving data streams, hardly any attention has been paid to appropriate evaluation measures. Measures developed for static scenarios, namely structural measures and ground-truth-based measures, cannot correctly reflect errors attributable to emerging, splitting, or moving clusters. These situations are inherent to the streaming context due to the dynamic changes in the data distribution. In this paper we develop a novel evaluation measure for stream clustering called Cluster Mapping Measure (CMM). CMM effectively indicates different types of errors by taking the important properties of evolving data streams into account. We show in extensive experiments on real and synthetic data that CMM is a robust measure for stream clustering evaluation.

Published In

KDD '11: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2011
1446 pages
ISBN: 9781450308137
DOI: 10.1145/2020408

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. evaluation measure
  2. stream clustering

Qualifiers

  • Poster

Conference

KDD '11
Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

