DOI: 10.3115/1072228.1072372

Concept discovery from text

Published: 24 August 2002

Abstract

Broad-coverage lexical resources such as WordNet are extremely useful. However, they often include many rare senses while missing domain-specific senses. We present a clustering algorithm called CBC (Clustering By Committee) that automatically discovers concepts from text. It initially discovers a set of tight clusters called committees that are well scattered in the similarity space. The centroid of the members of a committee is used as the feature vector of the cluster. We proceed by assigning elements to their most similar cluster. Evaluating cluster quality has always been a difficult task. We present a new evaluation methodology that is based on the editing distance between output clusters and classes extracted from WordNet (the answer key). Our experiments show that CBC outperforms several well-known clustering algorithms in cluster quality.
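
The centroid and assignment steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: committee discovery is skipped (the committees are supplied by hand), the feature vectors are toy data, and cosine similarity is an assumed stand-in because the abstract does not name the similarity measure. The function names `centroid`, `cosine`, and `assign_to_clusters` are hypothetical.

```python
import math


def centroid(vectors):
    """Average a list of sparse feature vectors (dicts of feature -> weight)."""
    total = {}
    for vec in vectors:
        for feature, weight in vec.items():
            total[feature] = total.get(feature, 0.0) + weight
    return {feature: weight / len(vectors) for feature, weight in total.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors (assumed measure)."""
    dot = sum(weight * v.get(feature, 0.0) for feature, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def assign_to_clusters(elements, committees):
    """Assign every element to the cluster whose committee centroid is most similar."""
    centroids = {
        name: centroid([elements[member] for member in members])
        for name, members in committees.items()
    }
    return {
        word: max(centroids, key=lambda name: cosine(vec, centroids[name]))
        for word, vec in elements.items()
    }


# Toy data: words described by hypothetical context-feature weights.
elements = {
    "wine": {"drink": 3.0, "pour": 2.0},
    "beer": {"drink": 4.0, "pour": 1.0},
    "car": {"drive": 3.0, "park": 2.0},
    "truck": {"drive": 2.0, "park": 3.0},
    "juice": {"drink": 2.0, "pour": 2.0},
    "bus": {"drive": 4.0, "park": 1.0},
}

# Committees are given here; in CBC they are discovered automatically.
committees = {"beverage": ["wine", "beer"], "vehicle": ["car", "truck"]}

assignment = assign_to_clusters(elements, committees)
print(assignment)
```

Only committee members contribute to a cluster's centroid in this sketch, so elements assigned later (here "juice" and "bus") do not shift the feature vector of the cluster they join.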

Published In

COLING '02: Proceedings of the 19th international conference on Computational linguistics - Volume 1
August 2002
1184 pages

Publisher

Association for Computational Linguistics

United States

Qualifiers

  • Article

Acceptance Rates

Overall Acceptance Rate 1,537 of 1,537 submissions, 100%

Cited By

  • (2018) Automatic Ontology Learning from Multiple Knowledge Sources of Text. International Journal of Intelligent Information Technologies 14:2 (1-21). DOI: 10.4018/IJIIT.2018040101. Online publication date: 1-Apr-2018
  • (2017) Discovering Enterprise Concepts Using Spreadsheet Tables. Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (1873-1882). DOI: 10.1145/3097983.3098102. Online publication date: 13-Aug-2017
  • (2017) German Typographers vs. German Grammar. Proceedings of the Tenth ACM International Conference on Web Search and Data Mining (315-324). DOI: 10.1145/3018661.3018662. Online publication date: 2-Feb-2017
  • (2013) Generalized canonical correlation analysis for disparate data fusion. Pattern Recognition Letters 34:2 (194-200). DOI: 10.1016/j.patrec.2012.09.018. Online publication date: 1-Jan-2013
  • (2012) Sequence clustering and labeling for unsupervised query intent discovery. Proceedings of the fifth ACM international conference on Web search and data mining (383-392). DOI: 10.1145/2124295.2124342. Online publication date: 8-Feb-2012
  • (2012) WebSets. Proceedings of the fifth ACM international conference on Web search and data mining (243-252). DOI: 10.1145/2124295.2124327. Online publication date: 8-Feb-2012
  • (2012) Incorporating lexical semantic similarity to tree kernel-based chinese relation extraction. Proceedings of the 13th Chinese conference on Chinese Lexical Semantics (11-21). DOI: 10.1007/978-3-642-36337-5_2. Online publication date: 6-Jul-2012
  • (2012) Fusion and inference from multiple data sources in a commensurate space. Statistical Analysis and Data Mining 5:3 (187-193). DOI: 10.1002/sam.11142. Online publication date: 1-Jun-2012
  • (2011) Class label enhancement via related instances. Proceedings of the Conference on Empirical Methods in Natural Language Processing (118-128). DOI: 10.5555/2145432.2145446. Online publication date: 27-Jul-2011
  • (2011) Automatically enriching a thesaurus with information from dictionaries. Proceedings of the 15th Portugese conference on Progress in artificial intelligence (462-475). DOI: 10.5555/2051115.2051158. Online publication date: 10-Oct-2011
