Article

Collective, Hierarchical Clustering from Distributed, Heterogeneous Data

Authors:

Erik L. Johnson,

Hillol Kargupta

Revised Papers from Large-Scale Parallel Data Mining, Workshop on Large-Scale Parallel KDD Systems, SIGKDD

Pages 221 - 244

Published: 15 August 1999

Abstract

This paper presents the Collective Hierarchical Clustering (CHC) algorithm for analyzing distributed, heterogeneous data. The algorithm first generates local cluster models and then combines them into a global cluster model of the data. It runs in O(|S|n²) time, with an O(|S|n) space requirement and an O(n) communication requirement, where n is the number of elements in the data set and |S| is the number of data sites. This is a significant improvement over naive methods, which incur O(n²) communication costs when the entire distance matrix is transmitted and O(nm) communication costs when the data are centralized, where m is the total number of features. A specific implementation based on single-link clustering is presented, along with results comparing its performance with that of a centralized clustering algorithm. An analysis of the algorithm's complexity, in terms of overall computation time and communication requirements, is also presented.
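The local step and the communication argument can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `local_single_link` and `communication_costs` are hypothetical, the data are assumed to be split by feature across sites with Euclidean distance at each site, and encoding each merge as three numbers is an assumption made only for counting. The step that combines local models into the global model is deliberately omitted, since the abstract does not specify it.

```python
import math


def local_single_link(points):
    """Return the n-1 single-link merges for one site's feature columns.

    Builds a minimum spanning tree with Prim's algorithm in O(n^2) time and
    O(n) working space. The MST edges, sorted by weight, give the single-link
    merge heights, so the local model has only n-1 records.
    """
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known link from the tree to each point
    parent = [-1] * n       # tree endpoint of that cheapest link
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the point outside the tree that is closest to the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        # Relax links from the newly added point to the remaining points.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return sorted(edges)


def communication_costs(n, m):
    """Count the numbers sent under each strategy (illustrative constants)."""
    return {
        # Assumed encoding: (height, element, element) per merge, per site.
        "model_per_site": 3 * (n - 1),
        # Upper triangle of one site's n-by-n distance matrix.
        "distance_matrix_per_site": n * (n - 1) // 2,
        # Every feature value of every element, summed over all sites.
        "centralized_total": n * m,
    }


if __name__ == "__main__":
    # Hypothetical data: the same four elements, with different features per site.
    site_a = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
    site_b = [(1.0,), (1.5,), (9.0,), (9.5,)]
    for name, features in (("A", site_a), ("B", site_b)):
        print(name, local_single_link(features))
    print(communication_costs(n=1000, m=50))
```

Under these assumptions each site transmits a number of values that grows linearly in n, while sending a distance matrix grows quadratically and centralizing the raw data grows with the product of n and m.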



Index Terms

Computing methodologies > Machine learning > Learning paradigms > Unsupervised learning



Published In

Revised Papers from Large-Scale Parallel Data Mining, Workshop on Large-Scale Parallel KDD Systems, SIGKDD

August 1999, 261 pages

ISBN: 3540671943

Publisher: Springer-Verlag, Berlin, Heidelberg



Cited By

Cerroni, W., Moro, G., Pasolini, R., Ramilli, M.: Decentralized detection of network attacks through P2P data clustering of SNMP data. Computers and Security 52:C (2015) 1-16.
https://dl.acm.org/doi/10.1016/j.cose.2015.03.006

Cesario, E., Talia, D.: Distributed data mining patterns and services: an architecture and experiments. Concurrency and Computation: Practice & Experience 24:15 (2012) 1751-1774.
https://dl.acm.org/doi/10.1002/cpe.1877

Lodi, S., Moro, G., Sartori, C.: Distributed data clustering in multi-dimensional peer-to-peer networks. In: Proceedings of the Twenty-First Australasian Conference on Database Technologies - Volume 104 (2010) 171-178.
https://dl.acm.org/doi/10.5555/1862242.1862264

Pedrycz, W., Rai, P.: A multifaceted perspective at data analysis. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 39:4 (2009) 834-844.
https://dl.acm.org/doi/10.1109/TSMCB.2009.2020213

Pedrycz, W.: Collaborative architectures of fuzzy modeling. In: Proceedings of the 2008 IEEE world conference on Computational intelligence: research frontiers (2008) 117-139.
https://dl.acm.org/doi/10.5555/1788915.1788923

Ko, A., Sabourin, R., De Souza Britto, A.: A new HMM-based ensemble generation method for numeral recognition. In: Proceedings of the 7th international conference on Multiple classifier systems (2007) 52-61.
https://dl.acm.org/doi/10.5555/1761171.1761179

Chandrasekar, R., Srinivasan, T.: An improved probabilistic ant based clustering for distributed databases. In: Proceedings of the 20th international joint conference on Artificial intelligence (2007) 2701-2706.
https://dl.acm.org/doi/10.5555/1625275.1625710

Bechchi, M., Raschia, G., Mouaddib, N., Laender, A., Falcão, A., Olsen, Ø., Silva, M., Baeza-Yates, R., McGuinness, D., Olstad, B.: Merging distributed database summaries. In: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management (2007) 419-428.
https://dl.acm.org/doi/10.1145/1321440.1321500

Li, M., Lee, G., Lee, W., Sivasubramaniam, A., Jia, X.: PENS. In: Proceedings of the 1st international conference on Scalable information systems (2006) 39-es.
https://dl.acm.org/doi/10.1145/1146847.1146886

Habich, D., Wächter, T., Lehner, W., Pilarsky, C., Haddad, H.: Two-phase clustering strategy for gene expression data sets. In: Proceedings of the 2006 ACM symposium on Applied computing (2006) 145-150.
https://dl.acm.org/doi/10.1145/1141277.1141311
