Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/2695664.2695860acmconferencesArticle/Chapter ViewAbstractPublication PagessacConference Proceedingsconference-collections
research-article

EP-MEANS: an efficient nonparametric clustering of empirical probability distributions

Published: 13 April 2015 Publication History

Abstract

Given a collection of m continuous-valued, one-dimensional empirical probability distributions {P1, ..., Pm}, how can we cluster these distributions efficiently with a nonparametric approach? Such problems arise in many real-world settings where keeping the moments of the distribution is not appropriate, because either some of the moments are not defined or the distributions are heavy-tailed or bi-modal. Examples include mining distributions of inter-arrival times and phone-call lengths. We present an efficient algorithm with a non-parametric model for clustering empirical, one-dimensional, continuous probability distributions. Our algorithm, called ep-means, is based on the Earth Mover's Distance and k-means clustering. We illustrate the utility of ep-means on various data sets and applications. In particular, we demonstrate that ep-means effectively and efficiently clusters probability distributions of mixed and arbitrary shapes, recovering ground-truth clusters exactly in cases where existing methods perform at baseline accuracy. We also demonstrate that ep-means outperforms moment-based classification techniques and discovers useful patterns in a variety of real-world applications.

References

[1]
D. Applegate, T. Dasu, S. Krishnan, and S. Urbanek. Unsupervised clustering of multidimensional distributions using earth mover distance. In KDD, pages 636--644, 2011.
[2]
D. Arthur and S. Vassilvitskii. K-means++: The advantages of careful seeding. In SODA, pages 1027--1035, 2007.
[3]
R. Giancarlo and F. Utro. Algorithmic paradigms for stability-based cluster validity and model selection statistical methods, with applications to microarray data analysis. Theoretical Comp. Sci., 428(0):58--79, 2012.
[4]
P. Indyk and E. Price. K-median clustering, model-based compressive sensing, and sparse recovery for earth mover distance. In STOC, pages 627--636, 2011.
[5]
S. Jegelka, A. Gretton, B. Schölkopf, B. K. Sriperumbudur, and U. von Luxburg. Generalized clustering via kernel embeddings. In KI 2009: Advances in AI, pages 144--152, 2009.
[6]
S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Math. Stat., 22(1):79--86, 1951.
[7]
M. Meila. Comparing clusterings by the variation of information. In B. Schölkopf and M. K. Warmuth, editors, Learning Theory and Kernel Machines, pages 173--187. 2003.
[8]
M. E. Mugavin. Multidimensional scaling: A brief overview. Nurs. Res., 57:64--8, 2008.
[9]
P. Raman, J. M. Phillips, and S. Venkatasubramanian. Spatially-aware comparison and consensus for clusterings. CoRR, abs/1102.0026, 2011.
[10]
Y. Rubner, C. Tomasi, and L. J. Guibas. A metric for distributions with applications to image databases. In ICCV, pages 59--66, 1998.
[11]
O. Schwander and F. Nielsen. Learning mixtures by simplifying kernel density estimators. In F. Nielsen and R. Bhatia, editors, Matrix Information Geometry, pages 403--426. 2013.
[12]
S. Shirdhonkar and D. Jacobs. Approximate earth mover's distance in linear time. In CVPR, pages 1--8, 2008.
[13]
N. X. Vinh, J. Epps, and J. Bailey. Information theoretic measures for clusterings comparison: Is a correction for chance necessary? In ICML, pages 1073--1080, 2009.
[14]
V. M. Zolotarev. Lévy metric. In M. Hazewinkel, editor, Encyclopedia of Mathematics. Springer, 2001.
[15]
V. M. Zolotarev. Lévy-Prokhorov metric. In M. Hazewinkel, editor, Encyclopedia of Mathematics. Springer, 2001.

Cited By

View all
  • (2024)Online Detection and Fuzzy Clustering of Anomalies in Non-Stationary Time SeriesSignals10.3390/signals50100035:1(40-59)Online publication date: 24-Jan-2024
  • (2024)Regression Tree and Clustering for Distributions, and Homogeneous Structure of Population CharacteristicsJournal of Agricultural, Biological and Environmental Statistics10.1007/s13253-024-00631-zOnline publication date: 13-Jun-2024
  • (2022)Nearest Neighbor Density Functional Estimation From Inverse Laplace TransformIEEE Transactions on Information Theory10.1109/TIT.2022.315123168:6(3511-3551)Online publication date: Jun-2022
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
SAC '15: Proceedings of the 30th Annual ACM Symposium on Applied Computing
April 2015
2418 pages
ISBN:9781450331968
DOI:10.1145/2695664
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 April 2015

Permissions

Request permissions for this article.

Check for updates

Qualifiers

  • Research-article

Conference

SAC 2015
Sponsor:
SAC 2015: Symposium on Applied Computing
April 13 - 17, 2015
Salamanca, Spain

Acceptance Rates

SAC '15 Paper Acceptance Rate 291 of 1,211 submissions, 24%;
Overall Acceptance Rate 1,650 of 6,669 submissions, 25%

Upcoming Conference

SAC '25
The 40th ACM/SIGAPP Symposium on Applied Computing
March 31 - April 4, 2025
Catania , Italy

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)14
  • Downloads (Last 6 weeks)1
Reflects downloads up to 03 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2024)Online Detection and Fuzzy Clustering of Anomalies in Non-Stationary Time SeriesSignals10.3390/signals50100035:1(40-59)Online publication date: 24-Jan-2024
  • (2024)Regression Tree and Clustering for Distributions, and Homogeneous Structure of Population CharacteristicsJournal of Agricultural, Biological and Environmental Statistics10.1007/s13253-024-00631-zOnline publication date: 13-Jun-2024
  • (2022)Nearest Neighbor Density Functional Estimation From Inverse Laplace TransformIEEE Transactions on Information Theory10.1109/TIT.2022.315123168:6(3511-3551)Online publication date: Jun-2022
  • (2021)An FDA-Based Approach for Clustering Elicited Expert KnowledgeStats10.3390/stats40100144:1(184-204)Online publication date: 4-Mar-2021
  • (2019)Histogram-based clustering of multiple data streamsKnowledge and Information Systems10.1007/s10115-019-01350-5Online publication date: 19-Mar-2019
  • (2019)Mining Frequent Distributions in Time SeriesIntelligent Data Engineering and Automated Learning – IDEAL 201910.1007/978-3-030-33617-2_28(271-279)Online publication date: 14-Nov-2019
  • (2016)Optimal copula transport for clustering multivariate time series2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)10.1109/ICASSP.2016.7472103(2379-2383)Online publication date: Mar-2016

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media