A Novel Hierarchical Document Clustering Algorithm Based on a kNN Connection Graph

Qiaoming Zhu²²,
Junhui Li²²,
Guodong Zhou²²,
Peifeng Li²² &
…
Peide Qian²²

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 4285))

Included in the following conference series:

International Conference on Computer Processing of Oriental Languages

1062 Accesses
2 Citations

Abstract

Bottom-up hierarchical document clustering normally merges two most similar clusters in each step iteratively. This paper proposes a novel bottom-up hierarchical document clustering algorithm to merge several pairs of most similar clusters in each step. This is done via a concept of “kNN-connectedness”, which measures the mutual connectedness of clusters in kNNs, and a kNN connection graph, which organizes given clusters into several sets of kNN-connected clusters. In such a graph, a connection between any two clusters only exists in the kNN-connected clusters of the same set. Moreover, a new kNN-based attraction function is proposed to measure the similarity between two clusters and indicates the potential probability of the two clusters being merged. The attraction function only considers the relative distribution of their nearest neighbors between two clusters in a vector space while other criteria, such as the well-known cluster-based cosine similarity function, measures the absolute distance between two clusters. This makes the attraction function effectively apply to the cases where different clusters may have very different distance variation. In each step, a kNN connection graph, consisting of several sets of kNN-connected clusters, is first constructed from the given clusters using a kNN algorithm and the concept of “kNN-connectedness”. For each set of kNN-connected clusters, the attraction degree between any two clusters is calculated and several top connected cluster pairs will be merged. In this way, the iteration number can be largely reduced and the clustering process can be much speeded. Evaluation on a news document corpus shows that the kNN connection graph-based hierarchical document clustering algorithm can achieve better performance than the famous k-means clustering algorithm while reducing the iteration number sharply in comparison with normal hierarchical document clustering.

This research was supported by the High Technology Plan of Jiangsu Province, China under Grant No.2005020.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 99.00; Price excludes VAT (USA)

Softcover Book: USD 129.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Multiviewpoint-Based Agglomerative Hierarchical Clustering

Extended Strategies for Document Clustering with Word Co-occurrences

Connected graph decomposition for spectral clustering

Article 10 September 2018

References

Breiman, L.: Bagging Predictors. Machine Learning 24, 123–140 (1996)
MATH MathSciNet Google Scholar
Croft, W.B.: Clustering large files of codument using the single-link method. Journal of the American Society for Information Science (1977)
Google Scholar
Hammouda, K.M., Kamel, M.S.: Efficient Phrase-Based Document Indexing for Web Document Clustering. IEEE Transactions on Knowledge and Data Engineering 16(10), 1279–1296 (2004)
Article Google Scholar
Hartigan, J.A., Wong, M.A.: Algorithm AS 136: A K-means clustering algorithm. Applied Statistics (1979)
Google Scholar
Kanungo, T., Mount, D.M., Netanyahu, N.S., Piatko, C.D., et al.: An Efficient k-Means Clustering Algorithm: Analysis and Implementation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2002)
Google Scholar
Han, J.W., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco (2001)
Google Scholar
Wei, J.H., He, P.L., Sun, Y.H.: Research on Text Hierarchical Clustering Algorithm based on K-Means. Computer Applications 25(10), 2323–2324 (2005)
Google Scholar
Zhen, T.: Research of Clustering Algorithm Based on Hierarchical and Partitioning Method. Computer engineering and Applications 8, 182–184 (2006)
Google Scholar
Wu, F., Li, S.J.: An Efficient Hierarchical Clustering Algorithm. Computer Engineering 30(9), 70–71 (2004)
Google Scholar
Schutze, H.: Single-link, complete link and average-link, http://www-csli.stanford.edu/~schuetze/completelink.html (accessed, 20/05/2006)
Steinbach, M., Karipis, G.: A comparison of document clustering techniques. In: KDD workshop on text mining (2000)
Google Scholar
Willett, P.: Recent trends in hierarchical document clustering: a critical review. Information Processing and Management (1988)
Google Scholar
Zamir, O., Etzioni, O.: Web document clustering: a feasibility demonstration. In: SIGIR 1998 (1998)
Google Scholar
Zhang, H.P., Yu, H.K., Xiong, D.Y., Liu, Q.: HHMM-based Chinese Lexical Analyzer ICTCLAS. In: Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, pp. 184–187 (2003)
Google Scholar
Zhao, Y., Karipis, G.: Evaluation of hierarchical document clustering for document datasets. In: CIKM 2002 (2002)
Google Scholar

Download references

Author information

Authors and Affiliations

School of Computer Science & Technology, Soochow University, 215006, Suzhou, China
Qiaoming Zhu, Junhui Li, Guodong Zhou, Peifeng Li & Peide Qian

Authors

Qiaoming Zhu
View author publications
You can also search for this author in PubMed Google Scholar
Junhui Li
View author publications
You can also search for this author in PubMed Google Scholar
Guodong Zhou
View author publications
You can also search for this author in PubMed Google Scholar
Peifeng Li
View author publications
You can also search for this author in PubMed Google Scholar
Peide Qian
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

Graduate School of Information Science, Nara Institute of Science and Technology, 630-0192, Takayama, Ikoma, Nara, Japan
Yuji Matsumoto
Dept of ECE, University of Illinois at Urbana Champaign, IL 61801, Urbana, USA
Richard W. Sproat
Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong
Kam-Fai Wong
State Key Lab of Intelligent Tech. & Sys., Tsinghua University,
Min Zhang

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Zhu, Q., Li, J., Zhou, G., Li, P., Qian, P. (2006). A Novel Hierarchical Document Clustering Algorithm Based on a kNN Connection Graph. In: Matsumoto, Y., Sproat, R.W., Wong, KF., Zhang, M. (eds) Computer Processing of Oriental Languages. Beyond the Orient: The Research Challenges Ahead. ICCPOL 2006. Lecture Notes in Computer Science(), vol 4285. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11940098_12

Download citation

DOI: https://doi.org/10.1007/11940098_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-49667-0
Online ISBN: 978-3-540-49668-7
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

A Novel Hierarchical Document Clustering Algorithm Based on a kNN Connection Graph

Abstract

Access this chapter

Subscribe and save

Buy Now

Preview

Similar content being viewed by others

Multiviewpoint-Based Agglomerative Hierarchical Clustering

Extended Strategies for Document Clustering with Word Co-occurrences

Connected graph decomposition for spectral clustering

References

Author information

Authors and Affiliations

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Publish with us

Subscribe and save

Buy Now

Navigation

A Novel Hierarchical Document Clustering Algorithm Based on a kNN Connection Graph

Abstract

Access this chapter

Subscribe and save

Buy Now

Preview

Similar content being viewed by others

Multiviewpoint-Based Agglomerative Hierarchical Clustering

Extended Strategies for Document Clustering with Word Co-occurrences

Connected graph decomposition for spectral clustering

References

Author information

Authors and Affiliations

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Search

Navigation