DOI: 10.1145/3582768.3582807
research-article

Building term hierarchies using graph-based clustering

Published: 27 June 2023

Abstract

Classical tasks of a librarian, such as screening and categorizing new documents based on their content, are increasingly being taken over by search engines or cataloging software. A first overview of a corpus's topical orientation can be obtained by combining graph-based search engines with clustering methods. Classical clustering methods, however, often require the desired number of clusters to be specified a priori and do not consider term relationships in graphs, which is a drawback from a practical point of view. Fully unsupervised graph-based clustering approaches at the term level therefore offer new possibilities that mitigate these shortcomings. In this work, a set of novel graph-based clustering algorithms has been developed. The hierarchical clustering algorithm (HCA) forms term hierarchies by iteratively isolating nodes of a given co-occurrence graph based on the evaluation of the edge weights between the nodes. Based on the term relationships inherent in the co-occurrence graph, a new graph is built agglomeratively, forming individual clusters of related terms. The feasibility of the outlined methods for text analysis is shown.
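
To make the idea of deriving a term hierarchy from edge weights more concrete, here is a minimal sketch. It is not the HCA described in the paper, whose details the abstract does not give. It assumes sentence-level co-occurrence counts as edge weights and a simple rule that removes the weakest edges level by level, so no number of clusters has to be fixed in advance. All function names (`cooccurrence_graph`, `components`, `split_hierarchy`) are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_graph(sentences):
    """Count how often two distinct terms appear in the same sentence (assumed weighting)."""
    weights = Counter()
    for sentence in sentences:
        terms = sorted(set(sentence.lower().split()))
        for a, b in combinations(terms, 2):
            weights[(a, b)] += 1
    return weights


def components(nodes, edges):
    """Connected components of an undirected graph given as (a, b) edge pairs."""
    neighbours = {n: set() for n in nodes}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, result = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(neighbours[node] - comp)
        seen |= comp
        result.append(comp)
    return result


def split_hierarchy(weights):
    """Drop edges weakest weight first and record every partition that gets finer."""
    nodes = {n for edge in weights for n in edge}
    levels = [components(nodes, weights)]
    for threshold in sorted(set(weights.values())):
        kept = [edge for edge, w in weights.items() if w > threshold]
        parts = components(nodes, kept)
        if len(parts) > len(levels[-1]):
            levels.append(parts)
    return levels


# Toy corpus (invented for illustration only).
sentences = [
    "graph clustering",
    "graph clustering terms",
    "search engine",
    "search engine terms",
]
for depth, parts in enumerate(split_hierarchy(cooccurrence_graph(sentences))):
    print(depth, [sorted(p) for p in parts])
```

On this toy corpus the weakly connected term "terms" is isolated first, leaving the two strongly co-occurring pairs as clusters before everything falls apart into single terms. The number of levels follows from the edge weights alone.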


Published In

NLPIR '22: Proceedings of the 2022 6th International Conference on Natural Language Processing and Information Retrieval
December 2022
241 pages
ISBN: 9781450397629
DOI: 10.1145/3582768

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. clustering
  2. co-occurrence graph
  3. text mining
  4. word embeddings

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

NLPIR 2022
