A hybrid unsupervised approach for document clustering

Published: 21 August 2005

Abstract

We propose a hybrid, unsupervised document clustering approach that combines a hierarchical clustering algorithm with Expectation Maximization (EM). We developed several heuristics that automatically select a subset of the clusters generated by the hierarchical algorithm to serve as the initial points for EM. Furthermore, our initialization algorithm generates not only an initial model for the iterative refinement algorithm but also an estimate of the model dimension, thus eliminating another important element of human supervision. We have evaluated the proposed system on five real-world document collections. The results show that our approach generates clustering solutions of higher quality than either of its individual components used alone.
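
The general idea of seeding EM from a hierarchical clustering can be illustrated with a minimal sketch. This is not the paper's method: the abstract does not describe its heuristics, so the sketch makes several assumptions of its own. It uses centroid-linkage agglomerative clustering, picks the number of clusters by the Calinski-Harabasz index as a stand-in for the selection heuristics, and refines with EM over spherical Gaussians of fixed variance on toy 2-D points rather than a document model. All function names (`agglomerate`, `calinski_harabasz`, `em_refine`, `hybrid_cluster`) are hypothetical.

```python
import math

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]

def agglomerate(points, k):
    """Centroid-linkage agglomerative clustering, merged down to k clusters."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Find the pair of clusters whose centroids are closest.
        _, i, j = min(
            (sq_dist(centroid(clusters[i]), centroid(clusters[j])), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merged = clusters.pop(j)  # j > i, so index i is unaffected
        clusters[i] = clusters[i] + merged
    return clusters

def calinski_harabasz(clusters):
    """Ratio of between-cluster to within-cluster dispersion (higher is better)."""
    points = [p for c in clusters for p in c]
    n, k = len(points), len(clusters)
    overall = centroid(points)
    between = sum(len(c) * sq_dist(centroid(c), overall) for c in clusters)
    within = sum(sq_dist(p, centroid(c)) for c in clusters for p in c)
    if within == 0:
        return math.inf
    return (between / (k - 1)) / (within / (n - k))

def em_refine(points, means, iterations=20, variance=1.0):
    """EM for a mixture of spherical Gaussians with a fixed, shared variance
    (an assumption made to keep the sketch short)."""
    k, n = len(means), len(points)
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: responsibilities, computed in log space for stability.
        resp = []
        for p in points:
            logs = [math.log(w) - sq_dist(p, m) / (2 * variance)
                    for w, m in zip(weights, means)]
            top = max(logs)
            exps = [math.exp(v - top) for v in logs]
            total = sum(exps)
            resp.append([e / total for e in exps])
        # M-step: re-estimate mixing weights and means.
        for c in range(k):
            mass = sum(r[c] for r in resp)
            weights[c] = max(mass / n, 1e-300)  # floor avoids log(0)
            if mass > 0:
                means[c] = [sum(r[c] * p[d] for r, p in zip(resp, points)) / mass
                            for d in range(len(points[0]))]
    return means, weights

def hybrid_cluster(points, max_k):
    """Pick k from the hierarchy, then use its cluster centroids to seed EM."""
    candidates = range(2, min(max_k, len(points) - 1) + 1)
    best = max((agglomerate(points, k) for k in candidates),
               key=calinski_harabasz)
    return em_refine(points, [centroid(c) for c in best])

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
means, weights = hybrid_cluster(points, max_k=4)
```

On this toy input, made of two well-separated groups, the sketch should settle on two components. Re-running the agglomeration for every candidate k is wasteful but keeps the code short; a single pass that records the partition at each level would give the same result.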


    Published In

    KDD '05: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
    August 2005
    844 pages
    ISBN:159593135X
    DOI:10.1145/1081870

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. EM initialization
    2. unsupervised clustering

    Qualifiers

    • Article

    Conference

    KDD05

    Acceptance Rates

    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
