DOI: 10.5555/648035.761140
Article

Collective, Hierarchical Clustering from Distributed, Heterogeneous Data

Published: 15 August 1999

Abstract

This paper presents the Collective Hierarchical Clustering (CHC) algorithm for analyzing distributed, heterogeneous data. The algorithm first generates local cluster models and then combines them to generate the global cluster model of the data. It runs in O(|S|n^2) time, with an O(|S|n) space requirement and an O(n) communication requirement, where n is the number of elements in the data set and |S| is the number of data sites. This is a significant improvement over naive methods, which incur O(n^2) communication cost when the entire distance matrix is transmitted, or O(nm) communication cost when the data are centralized, where m is the total number of features. A specific implementation based on single-link clustering is presented, along with results comparing its performance with that of a centralized clustering algorithm. An analysis of the algorithm's complexity, in terms of overall computation time and communication requirements, is also presented.
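The following is a minimal sketch of the local step and of the cost comparison quoted above, not the CHC algorithm itself, whose combining step this page does not describe. It assumes each site holds a different subset of the features for the same n elements, builds a single-link model at each site with Prim's minimum-spanning-tree construction (an illustrative choice), and counts communication in transmitted records. The names local_single_link and communication_costs, and the assumption that each site sends its n-1 merge records, are hypothetical.

```python
import math


def local_single_link(points):
    """Single-link model for one site's features, as n-1 merge records.

    Prim's minimum-spanning-tree construction: O(n^2) time and O(n) working
    space. Sorting the tree edges by length gives the single-link merge order.
    """
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # shortest known distance from the tree to each point
    parent = [-1] * n       # tree point that realises that distance
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the closest point that is not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        # Update distances to the remaining points through u.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return sorted(edges)


def communication_costs(n, m, num_sites):
    """Records transmitted under each strategy (constants ignored)."""
    return {
        "full_distance_matrix": n * n,        # O(n^2)
        "centralize_raw_data": n * m,         # O(nm)
        # Assumption: every site sends its n-1 merge records.
        "local_models": num_sites * (n - 1),  # O(n) per site
    }


# Two hypothetical sites observing different features of the same 4 elements.
site_a = [(0.0,), (1.0,), (5.0,), (6.0,)]
site_b = [(0.0, 0.0), (0.0, 3.0), (4.0, 0.0), (4.0, 3.0)]

model_a = local_single_link(site_a)
model_b = local_single_link(site_b)
costs = communication_costs(n=10_000, m=100, num_sites=5)
```

With n = 10,000 elements, m = 100 features, and 5 sites, the sketch counts 100,000,000 records for the full distance matrix, 1,000,000 for centralizing the raw data, and 49,995 for the local models, which illustrates why sending compact local models is attractive when n is large.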



Published In

Revised Papers from Large-Scale Parallel Data Mining, Workshop on Large-Scale Parallel KDD Systems, SIGKDD
August 1999, 261 pages
ISBN: 3540671943

Publisher

Springer-Verlag, Berlin, Heidelberg
