Nothing Special   »   [go: up one dir, main page]

Skip to main content

A Novel Hierarchical Document Clustering Algorithm Based on a kNN Connection Graph

  • Conference paper
Computer Processing of Oriental Languages. Beyond the Orient: The Research Challenges Ahead (ICCPOL 2006)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 4285))

Included in the following conference series:

Abstract

Bottom-up hierarchical document clustering normally merges two most similar clusters in each step iteratively. This paper proposes a novel bottom-up hierarchical document clustering algorithm to merge several pairs of most similar clusters in each step. This is done via a concept of “kNN-connectedness”, which measures the mutual connectedness of clusters in kNNs, and a kNN connection graph, which organizes given clusters into several sets of kNN-connected clusters. In such a graph, a connection between any two clusters only exists in the kNN-connected clusters of the same set. Moreover, a new kNN-based attraction function is proposed to measure the similarity between two clusters and indicates the potential probability of the two clusters being merged. The attraction function only considers the relative distribution of their nearest neighbors between two clusters in a vector space while other criteria, such as the well-known cluster-based cosine similarity function, measures the absolute distance between two clusters. This makes the attraction function effectively apply to the cases where different clusters may have very different distance variation. In each step, a kNN connection graph, consisting of several sets of kNN-connected clusters, is first constructed from the given clusters using a kNN algorithm and the concept of “kNN-connectedness”. For each set of kNN-connected clusters, the attraction degree between any two clusters is calculated and several top connected cluster pairs will be merged. In this way, the iteration number can be largely reduced and the clustering process can be much speeded. Evaluation on a news document corpus shows that the kNN connection graph-based hierarchical document clustering algorithm can achieve better performance than the famous k-means clustering algorithm while reducing the iteration number sharply in comparison with normal hierarchical document clustering.

This research was supported by the High Technology Plan of Jiangsu Province, China under Grant No.2005020.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 99.00
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 129.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

Similar content being viewed by others

References

  1. Breiman, L.: Bagging Predictors. Machine Learning 24, 123–140 (1996)

    MATH  MathSciNet  Google Scholar 

  2. Croft, W.B.: Clustering large files of codument using the single-link method. Journal of the American Society for Information Science (1977)

    Google Scholar 

  3. Hammouda, K.M., Kamel, M.S.: Efficient Phrase-Based Document Indexing for Web Document Clustering. IEEE Transactions on Knowledge and Data Engineering 16(10), 1279–1296 (2004)

    Article  Google Scholar 

  4. Hartigan, J.A., Wong, M.A.: Algorithm AS 136: A K-means clustering algorithm. Applied Statistics (1979)

    Google Scholar 

  5. Kanungo, T., Mount, D.M., Netanyahu, N.S., Piatko, C.D., et al.: An Efficient k-Means Clustering Algorithm: Analysis and Implementation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2002)

    Google Scholar 

  6. Han, J.W., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco (2001)

    Google Scholar 

  7. Wei, J.H., He, P.L., Sun, Y.H.: Research on Text Hierarchical Clustering Algorithm based on K-Means. Computer Applications 25(10), 2323–2324 (2005)

    Google Scholar 

  8. Zhen, T.: Research of Clustering Algorithm Based on Hierarchical and Partitioning Method. Computer engineering and Applications 8, 182–184 (2006)

    Google Scholar 

  9. Wu, F., Li, S.J.: An Efficient Hierarchical Clustering Algorithm. Computer Engineering 30(9), 70–71 (2004)

    Google Scholar 

  10. Schutze, H.: Single-link, complete link and average-link, http://www-csli.stanford.edu/~schuetze/completelink.html (accessed, 20/05/2006)

  11. Steinbach, M., Karipis, G.: A comparison of document clustering techniques. In: KDD workshop on text mining (2000)

    Google Scholar 

  12. Willett, P.: Recent trends in hierarchical document clustering: a critical review. Information Processing and Management (1988)

    Google Scholar 

  13. Zamir, O., Etzioni, O.: Web document clustering: a feasibility demonstration. In: SIGIR 1998 (1998)

    Google Scholar 

  14. Zhang, H.P., Yu, H.K., Xiong, D.Y., Liu, Q.: HHMM-based Chinese Lexical Analyzer ICTCLAS. In: Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, pp. 184–187 (2003)

    Google Scholar 

  15. Zhao, Y., Karipis, G.: Evaluation of hierarchical document clustering for document datasets. In: CIKM 2002 (2002)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Zhu, Q., Li, J., Zhou, G., Li, P., Qian, P. (2006). A Novel Hierarchical Document Clustering Algorithm Based on a kNN Connection Graph. In: Matsumoto, Y., Sproat, R.W., Wong, KF., Zhang, M. (eds) Computer Processing of Oriental Languages. Beyond the Orient: The Research Challenges Ahead. ICCPOL 2006. Lecture Notes in Computer Science(), vol 4285. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11940098_12

Download citation

  • DOI: https://doi.org/10.1007/11940098_12

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-49667-0

  • Online ISBN: 978-3-540-49668-7

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics