Building term hierarchies using graph-based clustering

Authors: Markus Van Meegen, Herwig Unger

NLPIR '22: Proceedings of the 2022 6th International Conference on Natural Language Processing and Information Retrieval

Pages 49 - 56

https://doi.org/10.1145/3582768.3582807

Published: 27 June 2023

Abstract

Classical tasks of a librarian, such as screening and categorizing new documents based on their content, are increasingly being replaced by search engines or cataloging software. A first overview of a corpus's topical orientation can be obtained by combining graph-based search engines with clustering methods. Classical clustering methods, however, often require the desired number of clusters to be specified a priori and do not consider term relationships in graphs, which is a drawback from a practical point of view. Fully unsupervised graph-based clustering approaches at the term level therefore offer new possibilities that mitigate these shortcomings. In this work, a set of novel graph-based clustering algorithms has been developed. The hierarchical clustering algorithm (HCA) forms term hierarchies by iteratively isolating nodes of a given co-occurrence graph based on an evaluation of the edge weights between the nodes. Based on the term relationships inherent in the co-occurrence graph, a new graph is built agglomeratively, forming individual clusters of related terms. The feasibility of the outlined methods for text analysis is shown.
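
To make the idea of deriving a term hierarchy from edge weights more concrete, the following is a minimal sketch and not the authors' HCA, whose details the abstract does not give. It assumes that edge weights are plain sentence-level co-occurrence counts and that the graph is split by removing edges from weakest to strongest, without fixing the number of clusters in advance. All function names (cooccurrence_graph, components, divisive_levels) are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_graph(sentences):
    """Assumed weighting: count how often two distinct terms share a sentence."""
    weights = defaultdict(int)
    for sentence in sentences:
        terms = sorted(set(sentence.lower().split()))
        for a, b in combinations(terms, 2):
            weights[(a, b)] += 1
    return dict(weights)


def components(nodes, edges):
    """Return the connected components of an undirected graph."""
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, result = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        result.append(frozenset(component))
    return result


def divisive_levels(weights):
    """Drop edges from weakest to strongest; keep each partition that changes."""
    nodes = {n for edge in weights for n in edge}
    edges = dict(weights)
    levels = [components(nodes, edges)]
    for threshold in sorted(set(weights.values())):
        # Keep only edges strictly stronger than the current threshold.
        edges = {e: w for e, w in edges.items() if w > threshold}
        parts = components(nodes, edges)
        if len(parts) != len(levels[-1]):
            levels.append(parts)
    return levels


if __name__ == "__main__":
    corpus = [
        "graph cluster", "graph cluster",
        "library book", "library book",
        "graph book",
    ]
    for depth, parts in enumerate(divisive_levels(cooccurrence_graph(corpus))):
        print(depth, [sorted(p) for p in parts])
```

Each entry of the returned list is one level of the hierarchy, from the full graph down to isolated terms. The number of clusters at each level follows from the edge weights alone, which is the property the abstract contrasts with classical methods that need the cluster count up front.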

Index Terms

Building term hierarchies using graph-based clustering
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis
2. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Index terms have been assigned to the content through auto-classification.

Information & Contributors

Information

Published In

NLPIR '22: Proceedings of the 2022 6th International Conference on Natural Language Processing and Information Retrieval

December 2022

241 pages

ISBN:9781450397629

DOI:10.1145/3582768

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

NLPIR 2022

NLPIR 2022: 2022 6th International Conference on Natural Language Processing and Information Retrieval

December 16 - 18, 2022

Bangkok, Thailand
