Concept discovery from text

Authors:

Patrick Pantel

COLING '02: Proceedings of the 19th international conference on Computational linguistics - Volume 1

Pages 1 - 7

https://doi.org/10.3115/1072228.1072372

Published: 24 August 2002

Abstract

Broad-coverage lexical resources such as WordNet are extremely useful. However, they often include many rare senses while missing domain-specific senses. We present a clustering algorithm called CBC (Clustering By Committee) that automatically discovers concepts from text. It initially discovers a set of tight clusters called committees that are well scattered in the similarity space. The centroid of the members of a committee is used as the feature vector of the cluster. We proceed by assigning elements to their most similar cluster. Evaluating cluster quality has always been a difficult task. We present a new evaluation methodology that is based on the editing distance between output clusters and classes extracted from WordNet (the answer key). Our experiments show that CBC outperforms several well-known clustering algorithms in cluster quality.
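The assignment step described in the abstract (each committee's centroid serves as the cluster's feature vector, and every element goes to its most similar cluster) can be illustrated with a minimal Python sketch. It is not the authors' implementation: committee discovery is not shown, cosine similarity is an assumed similarity measure that the abstract does not specify, and the function names (`centroid`, `cosine`, `assign`) and the toy feature vectors are hypothetical.

```python
import math

def centroid(vectors):
    """Average a list of sparse feature vectors (dicts of feature -> weight)."""
    total = {}
    for vec in vectors:
        for feat, weight in vec.items():
            total[feat] = total.get(feat, 0.0) + weight
    return {feat: weight / len(vectors) for feat, weight in total.items()}

def cosine(a, b):
    """Cosine similarity of two sparse vectors (assumed measure, not from the paper)."""
    dot = sum(weight * b.get(feat, 0.0) for feat, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def assign(elements, committees):
    """Assign each element to the cluster whose committee centroid is most similar."""
    centroids = {name: centroid(members) for name, members in committees.items()}
    return {
        word: max(centroids, key=lambda name: cosine(vec, centroids[name]))
        for word, vec in elements.items()
    }

# Toy data (hypothetical): each committee is a small, tight group of member vectors.
committees = {
    "fruit": [{"eat": 1.0, "ripe": 1.0}, {"eat": 1.0, "peel": 1.0}],
    "vehicle": [{"drive": 1.0, "park": 1.0}, {"drive": 1.0, "repair": 1.0}],
}
elements = {
    "mango": {"eat": 1.0, "ripe": 0.5},
    "truck": {"drive": 1.0, "repair": 0.5},
}

assignments = assign(elements, committees)
print(assignments)
```

Because only committee members contribute to a centroid, elements assigned later do not shift the cluster's feature vector in this sketch.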

Concept discovery from text
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
2. Information systems
  1. Information systems applications
    1. Data mining

Published In

COLING '02: Proceedings of the 19th international conference on Computational linguistics - Volume 1

August 2002

1184 pages

Publisher

Association for Computational Linguistics

United States

Qualifiers

Article

Acceptance Rates

Overall Acceptance Rate 1,537 of 1,537 submissions, 100%

Cited By

Sathiya B, Geetha T (2018). Automatic Ontology Learning from Multiple Knowledge Sources of Text. International Journal of Intelligent Information Technologies 14:2 (1-21). Online publication date: 1-Apr-2018.
https://dl.acm.org/doi/10.4018/IJIIT.2018040101
Li K, He Y, Ganjam K, Matwin S, Yu S, Farooq F (2017). Discovering Enterprise Concepts Using Spreadsheet Tables. Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (1873-1882). Online publication date: 13-Aug-2017.
https://dl.acm.org/doi/10.1145/3097983.3098102
Paşca M, de Rijke M, Shokouhi M, Tomkins A, Zhang M (2017). German Typographers vs. German Grammar. Proceedings of the Tenth ACM International Conference on Web Search and Data Mining (315-324). Online publication date: 2-Feb-2017.
https://dl.acm.org/doi/10.1145/3018661.3018662
Sun M, Priebe C, Tang M (2013). Generalized canonical correlation analysis for disparate data fusion. Pattern Recognition Letters 34:2 (194-200). Online publication date: 1-Jan-2013.
https://dl.acm.org/doi/10.1016/j.patrec.2012.09.018
Cheung J, Li X, Adar E, Teevan J, Agichtein E, Maarek Y (2012). Sequence clustering and labeling for unsupervised query intent discovery. Proceedings of the fifth ACM international conference on Web search and data mining (383-392). Online publication date: 8-Feb-2012.
https://dl.acm.org/doi/10.1145/2124295.2124342
Dalvi B, Cohen W, Callan J, Adar E, Teevan J, Agichtein E, Maarek Y (2012). WebSets. Proceedings of the fifth ACM international conference on Web search and data mining (243-252). Online publication date: 8-Feb-2012.
https://dl.acm.org/doi/10.1145/2124295.2124327
Liu D, Zhao Z, Hu Y, Qian L (2012). Incorporating lexical semantic similarity to tree kernel-based chinese relation extraction. Proceedings of the 13th Chinese conference on Chinese Lexical Semantics (11-21). Online publication date: 6-Jul-2012.
https://dl.acm.org/doi/10.1007/978-3-642-36337-5_2
Ma Z, Marchette D, Priebe C (2012). Fusion and inference from multiple data sources in a commensurate space. Statistical Analysis and Data Mining 5:3 (187-193). Online publication date: 1-Jun-2012.
https://dl.acm.org/doi/10.1002/sam.11142
Kozareva Z, Voevodski K, Teng S, Merlo P, Barzilay R, Johnson M (2011). Class label enhancement via related instances. Proceedings of the Conference on Empirical Methods in Natural Language Processing (118-128). Online publication date: 27-Jul-2011.
https://dl.acm.org/doi/10.5555/2145432.2145446
Oliveira H, Gomes P (2011). Automatically enriching a thesaurus with information from dictionaries. Proceedings of the 15th Portugese conference on Progress in artificial intelligence (462-475). Online publication date: 10-Oct-2011.
https://dl.acm.org/doi/10.5555/2051115.2051158
