research-article

EP-MEANS: an efficient nonparametric clustering of empirical probability distributions

Authors: Keith Henderson, Brian Gallagher, Tina Eliassi-Rad

SAC '15: Proceedings of the 30th Annual ACM Symposium on Applied Computing

Pages 893 - 900

https://doi.org/10.1145/2695664.2695860

Published: 13 April 2015

Abstract

Given a collection of m continuous-valued, one-dimensional empirical probability distributions {P_1, ..., P_m}, how can we cluster these distributions efficiently with a nonparametric approach? Such problems arise in many real-world settings where summarizing a distribution by its moments is not appropriate, because some of the moments are undefined or because the distributions are heavy-tailed or bimodal. Examples include mining distributions of inter-arrival times and phone-call lengths. We present an efficient algorithm with a nonparametric model for clustering empirical, one-dimensional, continuous probability distributions. Our algorithm, called ep-means, is based on the Earth Mover's Distance and k-means clustering. We illustrate the utility of ep-means on various data sets and applications. In particular, we demonstrate that ep-means effectively and efficiently clusters probability distributions of mixed and arbitrary shapes, recovering ground-truth clusters exactly in cases where existing methods perform at baseline accuracy. We also demonstrate that ep-means outperforms moment-based classification techniques and discovers useful patterns in a variety of real-world applications.
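
The abstract gives only the two ingredients of ep-means (the Earth Mover's Distance and k-means) and not the algorithm itself, so the following is a minimal illustrative sketch rather than the authors' method. It assumes NumPy and relies on the standard identity that, for one-dimensional distributions, the Earth Mover's Distance equals the area between the two quantile functions. The function names (quantile_embed, emd_1d, cluster_distributions), the fixed quantile grid, the random initialization, and the choice to average quantile functions as the centroid update are all assumptions made for illustration.

```python
import numpy as np


def quantile_embed(samples, n_quantiles=100):
    """Represent each sample set by its empirical quantile function on a fixed grid."""
    # Midpoint grid in (0, 1); the grid size is an illustrative assumption.
    qs = (np.arange(n_quantiles) + 0.5) / n_quantiles
    return np.array([np.quantile(np.asarray(s, dtype=float), qs) for s in samples])


def emd_1d(qa, qb):
    """Approximate 1-D Earth Mover's Distance from two quantile vectors.

    In one dimension the EMD is the integral of |F^-1(u) - G^-1(u)| over u in (0, 1);
    the mean absolute difference on the grid is a Riemann-sum approximation.
    """
    return float(np.mean(np.abs(qa - qb)))


def cluster_distributions(samples, k, n_quantiles=100, n_iter=50, seed=0):
    """k-means-style loop over distributions using the approximate 1-D EMD.

    Hypothetical sketch: centroids are averaged quantile functions, which is a
    convenient choice here and not necessarily what the paper does.
    """
    rng = np.random.default_rng(seed)
    X = quantile_embed(samples, n_quantiles)
    # Initialize centroids with k distinct input distributions.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Distance from every distribution to every centroid, shape (m, k).
        dists = np.abs(X[:, None, :] - centroids[None, :, :]).mean(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = X[labels == j]
            if len(members):  # leave an empty cluster's centroid unchanged
                centroids[j] = members.mean(axis=0)
    return labels, centroids
```

Each input to cluster_distributions is a raw array of observations (for example, the call lengths of one subscriber), and the arrays may have different lengths because every one is reduced to the same quantile grid. No moments are computed at any point, which is the property the abstract emphasizes for heavy-tailed or bimodal data.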

Published In

SAC '15: Proceedings of the 30th Annual ACM Symposium on Applied Computing

April 2015

2418 pages

ISBN:9781450331968

DOI:10.1145/2695664

Conference Chairs: Roger L. Wainwright (University of Tulsa), Juan Manuel Corchado (University of Salamanca, Spain)
Program Chairs: Alessio Bechini (University of Pisa, Italy), Jiman Hong (Soongsil University, South Korea)

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGAPP: ACM Special Interest Group on Applied Computing

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SAC 2015

Sponsor:

SIGAPP

SAC 2015: Symposium on Applied Computing

April 13 - 17, 2015

Salamanca, Spain

Acceptance Rates

SAC '15 Paper Acceptance Rate 291 of 1,211 submissions, 24%;

Overall Acceptance Rate 1,650 of 6,669 submissions, 25%

Bibliometrics & Citations

Article Metrics

Total Citations: 7
Total Downloads: 176
Downloads (Last 12 months): 14
Downloads (Last 6 weeks): 1

Reflects downloads up to 03 Jan 2025

Cited By

He C, Leslie D, Grant J (2024). Online Detection and Fuzzy Clustering of Anomalies in Non-Stationary Time Series. Signals, 5:1 (40-59). Online publication date: 24-Jan-2024.
https://doi.org/10.3390/signals5010003
Minami M, Lennert-Cody C (2024). Regression Tree and Clustering for Distributions, and Homogeneous Structure of Population Characteristics. Journal of Agricultural, Biological and Environmental Statistics. Online publication date: 13-Jun-2024.
https://doi.org/10.1007/s13253-024-00631-z
Ryu J, Ganguly S, Kim Y, Noh Y, Lee D (2022). Nearest Neighbor Density Functional Estimation From Inverse Laplace Transform. IEEE Transactions on Information Theory, 68:6 (3511-3551). Online publication date: Jun-2022.
https://doi.org/10.1109/TIT.2022.3151231
Barrera-Causil C, Correa J, Zamecnik A, Torres-Avilés F, Marmolejo-Ramos F (2021). An FDA-Based Approach for Clustering Elicited Expert Knowledge. Stats, 4:1 (184-204). Online publication date: 4-Mar-2021.
https://doi.org/10.3390/stats4010014
Balzanella A, Verde R (2019). Histogram-based clustering of multiple data streams. Knowledge and Information Systems. Online publication date: 19-Mar-2019.
https://doi.org/10.1007/s10115-019-01350-5
Coutinho J, Moreira J, de Sá C (2019). Mining Frequent Distributions in Time Series. Intelligent Data Engineering and Automated Learning – IDEAL 2019 (271-279). Online publication date: 14-Nov-2019.
https://dl.acm.org/doi/10.1007/978-3-030-33617-2_28
Marti G, Nielsen F, Donnat P (2016). Optimal copula transport for clustering multivariate time series. 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2379-2383). Online publication date: Mar-2016.
https://doi.org/10.1109/ICASSP.2016.7472103
