Document clustering using word clusters via the information bottleneck method

Authors:

Naftali Tishby

SIGIR '00: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval

Pages 208 - 215

https://doi.org/10.1145/345508.345578

Published: 01 July 2000

Abstract

We present a novel implementation of the recently introduced information bottleneck method for unsupervised document clustering. Given a joint empirical distribution of words and documents, p(x, y), we first cluster the words, Y, so that the obtained word clusters, Ỹ, maximally preserve the information on the documents. The resulting joint distribution, p(X, Ỹ), contains most of the original information about the documents, I(X; Ỹ) ≈ I(X; Y), but it is much less sparse and noisy. Using the same procedure we then cluster the documents, X, so that the information about the word clusters is preserved. Thus, we first find word clusters that capture most of the mutual information about the set of documents, and then find document clusters that preserve the information about the word clusters. We tested this procedure over several document collections based on subsets taken from the standard 20Newsgroups corpus. The results were assessed by calculating the correlation between the document clusters and the correct labels for these documents. Findings from our experiments show that this double clustering procedure, which uses the information bottleneck method, yields significantly superior performance compared to other common document distributional clustering algorithms. Moreover, the double clustering procedure improves all the distributional clustering methods examined here.
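The two-stage procedure described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is a brute-force greedy agglomeration, assuming hard clusters and a toy count matrix, that repeatedly merges the pair of clusters whose merge loses the least mutual information. The function names (`mutual_information`, `greedy_cluster_columns`) and the toy data are hypothetical, introduced only for illustration.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (in bits) of a joint distribution given as a 2-D array."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()                  # normalise counts to probabilities
    p_rows = joint.sum(axis=1, keepdims=True)    # marginal of the row variable
    p_cols = joint.sum(axis=0, keepdims=True)    # marginal of the column variable
    mask = joint > 0                             # skip zero cells (0 * log 0 = 0)
    ratio = joint[mask] / (p_rows @ p_cols)[mask]
    return float(np.sum(joint[mask] * np.log2(ratio)))

def greedy_cluster_columns(joint, n_clusters):
    """Greedily merge columns until n_clusters remain.

    At each step the pair of column clusters whose merge loses the least
    mutual information with the row variable is merged (exhaustive search,
    so this is only practical for tiny examples).
    Returns (labels, merged_joint), where labels[c] is the cluster of column c.
    """
    current = np.asarray(joint, dtype=float)
    current = current / current.sum()
    labels = list(range(current.shape[1]))
    while current.shape[1] > n_clusters:
        base = mutual_information(current)
        best = None
        for i in range(current.shape[1]):
            for j in range(i + 1, current.shape[1]):
                trial = np.delete(current, j, axis=1)
                trial[:, i] = current[:, i] + current[:, j]
                loss = base - mutual_information(trial)
                if best is None or loss < best[0]:
                    best = (loss, i, j)
        _, i, j = best
        current[:, i] += current[:, j]
        current = np.delete(current, j, axis=1)
        # Column j is gone: relabel its members to i and shift later indices down.
        labels = [i if l == j else (l - 1 if l > j else l) for l in labels]
    return labels, current

# Toy data (assumed): rows are documents X, columns are words Y.
counts = np.array([
    [4, 3, 2, 0, 0, 0],
    [3, 4, 2, 0, 0, 0],
    [0, 0, 0, 5, 2, 3],
    [0, 0, 0, 4, 3, 3],
])

# Stage 1: cluster the words so that information about the documents is kept.
word_labels, p_x_ytilde = greedy_cluster_columns(counts, n_clusters=2)

# Stage 2: cluster the documents using the word-cluster representation.
doc_labels, _ = greedy_cluster_columns(p_x_ytilde.T, n_clusters=2)

print("I(X;Y)      =", mutual_information(counts))
print("I(X;Ytilde) =", mutual_information(p_x_ytilde))
print("word clusters:", word_labels)
print("document clusters:", doc_labels)
```

Because merging columns is a deterministic function of Y, the mutual information of the merged table can never exceed that of the original, which is why the sketch reports both values side by side. On realistic vocabularies the exhaustive pair search would be far too slow, so a real system would need an incremental merge-cost computation.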

Index Terms

Document clustering using word clusters via the information bottleneck method
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis
2. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Published In

SIGIR '00: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval

July 2000

396 pages

ISBN:1581132263

DOI:10.1145/345508

Chairmen:
Emmanuel Yannakoudakis, Athens Univ. of Economics and Business, Greece
Nicholas J. Belkin, Rutgers Univ.
Mun-Kew Leong, Kent Ridge Digital Labs
Peter Ingwersen, Royal School of Library and Information Science

Copyright © 2000 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Greek Com Soc: Greek Computer Society
SIGIR: ACM Special Interest Group on Information Retrieval
Athens U of Econ & Business: Athens University of Economics and Business

Publisher

Association for Computing Machinery

New York, NY, United States

SIGIR00: 23rd ACM International SIGIR Conference on Research and Development in Information Retrieval

July 24 - 28, 2000

Athens, Greece

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%
