Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/345508.345578acmconferencesArticle/Chapter ViewAbstractPublication PagesirConference Proceedingsconference-collections
Article
Free access

Document clustering using word clusters via the information bottleneck method

Published: 01 July 2000 Publication History

Abstract

We present a novel implementation of the recently introduced information bottleneck method for unsupervised document clustering. Given a joint empirical distribution of words and documents, p(x, y), we first cluster the words, Y, so that the obtained word clusters, Ytilde;, maximally preserve the information on the documents. The resulting joint distribution. p(X, Ytilde;), contains most of the original information about the documents, I(X; Ytilde;) ≈ I(X; Y), but it is much less sparse and noisy. Using the same procedure we then cluster the documents, X, so that the information about the word-clusters is preserved. Thus, we first find word-clusters that capture most of the mutual information about to set of documents, and then find document clusters, that preserve the information about the word clusters. We tested this procedure over several document collections based on subsets taken from the standard 20Newsgroups corpus. The results were assessed by calculating the correlation between the document clusters and the correct labels for these documents. Finding from our experiments show that this double clustering procedure, which uses the information bottleneck method, yields significantly superior performance compared to other common document distributional clustering algorithms. Moreover, the double clustering procedure improves all the distributional clustering methods examined here.

References

[1]
M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
[2]
L. D. Baker and A. K. McCallum. Distributional Clustering of Words for Text Classification In ACM SIGIR 98, 1998.
[3]
P. E Brown, V. J. Della Pietra, P. V. deSouza, J. C. Lai and R. L. Mercer. Class-based n-gram models of natual language. Computational Linguistics, 18(4), pages 467-477, 1992
[4]
T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, New York, 1991.
[5]
D. R. Cutting, D. R. Karger, J. O. Pedersen and J. W. Tukey. Scatter/Gother: A Cluster Based Approach to Browsing Large Document Collections. In ACM SIGIR 92, pages 318- 329, 1992.
[6]
D. R. Cutting, D. R. Karger and J. O. Pedersen. Constant Interaction-Time Scatter/Gother Browsing of Very Large Document Collections. In ACM SIGIR 93, pages 126-134, 1993.
[7]
R. E1-Yaniv, S. Fine, and N. Tishby. Agnostic classification of Markovian sequences. In Advances in Neural Information Processing (NIPS-97), pages 465.-471, 1997.
[8]
K. Eguchi. Adaptive Cluster-based Browsing Using Incrementally Expanded Queries and Its Effects. In ACM SIGIR 99, pages 265-266, 1999.
[9]
M. A. Hearst and J. O. Pedersen. Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results. In ACM SIGIR 96, pages 76-84, 1996.
[10]
T. Hofmann. Probabilistic Latent Semantic Indexing. In ACM SIGIR 99, pages 50-57, 1999.
[11]
M. Iwayama and T. Tokunaga. Cluster-Based Text Categorization: A Comparison of Category Search Strategies. In ACM SIGIR 95, pages 273-280, 1995.
[12]
K. Lang. Learning to filter netnews. In Proc. of the 12th Int. Conf. on Machine Learning, pages 331-339, 1995.
[13]
J. Lin. Divergence Measures Based on the Sharmon Entropy. IEEE Transactions on Information theory, 37(1):145-151, 1991.
[14]
McCallum, Andrew Kachites. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/mccallum/bow, 1996.
[15]
M. Mechkour, D. J. Harper and G. Muresan. The WebCluster Project: Using Clustering for Mediating Access to the WWW In ACM SIGIR 98, pages 357-358, 1998.
[16]
G. Muresan, D. J. Harper and M. Mechkour. WebCluster, a Tool for Mediated Information Access. In ACM SIGIR 99, page 337, 1999.
[17]
E C. Pereira, N. Tishby, and L. Lee. Distributional clustering of English words. In 30th Annual Meeting of the Association for Computational Linguistics, Columbus, Ohio, pages 183-190, 1993.
[18]
D. Roussinov, K. Tolle, M. Ramsey and H. Chert. INteractive Interact Search through Automatic Clustering: an Empirical Study. In ACM SIGIR 99, pages 289-290, 1999.
[19]
G. Salton. The SMART retrieval system. Englewood Cliffs, NJ:Prentice-Hall; 1971.
[20]
G. Salton. Developments in Automatic Text Retrieval. Science, Vol. 253, pages 974-980, 1990.
[21]
R. E. Schapire and Y. E. Singer. BoosTexter: A System for Multiclass Multi-label Text Categorization, 1998.
[22]
H. Schutze and C. Silverstein. Projections for Efficient Doeuments Clustering In ACM SIGIR 97, pages 74-81, 1997.
[23]
C. Silverstein and J. O. Pedersen. Almost-Constant-Time Clustering for Arbitrary Corpus Subsets. In ACM SIGIR 97, pages 60--66, 1997.
[24]
N. Slonim and N. Tishby. Agglomerative Information Bottleneck. In Proc. of Neural Information Processing Systems (NIPS-99), pages 617-623, 1999.
[25]
N. Slonim and N. Tishby. The Hard Clustering Limit of the Information Bottleneck Method. In preperation.
[26]
N. Tishby, EC. Pereira and W. Bialek. The Information Botflencek Method In Proc. of the 37-th Allerton Conference on Communication and Computation, 1999.
[27]
C. J. van Rijsbergen. Information Retrieval. London: Butterworths; 1979.
[28]
P. Willett. Recent Trends in Hierarchic Document Clustering: A Crtical Review. Information Processing & Management, Vol. 24(5), pp. 577-597, 1988.
[29]
J. Xu and W. B. Croft. Cluster-based Language Models for Distributed Retrieval. In ACM SIGIR 99, pages 254-261, 1999.
[30]
O. Zamir and O. Etzioni. Web Document Clustering: A Feasibility Demonstration. In ACM SIGIR 98, pages 46-54, 1998.

Cited By

View all

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
SIGIR '00: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
July 2000
396 pages
ISBN:1581132263
DOI:10.1145/345508
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 July 2000

Permissions

Request permissions for this article.

Check for updates

Qualifiers

  • Article

Conference

SIGIR00
Sponsor:
  • Greek Com Soc
  • SIGIR
  • Athens U of Econ & Business

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)239
  • Downloads (Last 6 weeks)37
Reflects downloads up to 20 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Clustering accuracyApplied Computing and Intelligence10.3934/aci.20240034:1(24-44)Online publication date: 2024
  • (2024)A Survey on Information BottleneckIEEE Transactions on Pattern Analysis and Machine Intelligence10.1109/TPAMI.2024.336634946:8(5325-5344)Online publication date: Aug-2024
  • (2024)Adversarial Information BottleneckIEEE Transactions on Neural Networks and Learning Systems10.1109/TNNLS.2022.317298635:1(221-230)Online publication date: Jan-2024
  • (2024)A multi-head attention-like feature selection approach for tabular dataKnowledge-Based Systems10.1016/j.knosys.2024.112250301(112250)Online publication date: Oct-2024
  • (2023)How does information bottleneck help deep learning?Proceedings of the 40th International Conference on Machine Learning10.5555/3618408.3619066(16049-16096)Online publication date: 23-Jul-2023
  • (2023)Sublinear information bottleneck based two-stage deep learning approach to genealogy layout recognitionFrontiers in Neuroscience10.3389/fnins.2023.123078617Online publication date: 30-Jun-2023
  • (2023)Efficient algorithms for quantum information bottleneckQuantum10.22331/q-2023-03-02-9367(936)Online publication date: 2-Mar-2023
  • (2023)An image compression and encryption scheme for similarity retrievalImage Communication10.1016/j.image.2023.117044119:COnline publication date: 1-Nov-2023
  • (2022)Similarity among eucalyptus planted areas based on leaf-cutting ant nest sizesPesquisa Florestal Brasileira10.4336/2022.pfb.42e20190207142Online publication date: 14-Feb-2022
  • (2022)Sufficient Dimension Reduction: An Information-Theoretic ViewpointEntropy10.3390/e2402016724:2(167)Online publication date: 22-Jan-2022
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media