Abstract
Text documents occupy sparse data spaces, so when existing proximity measures are used to describe the correlation between documents, nearest neighbors may belong to different classes. In this paper, we propose an asymmetric similarity measure that strengthens the discriminative features of document objects. We construct a semantic correlation network from the asymmetric similarity between documents and conjecture that the distribution of its connections follows a power law. The hub points of the semantic correlation network are grouped by an agglomerative hierarchical clustering approach named SCN. The proximity between hub points is defined in terms of both the similarity of the objects themselves and the similarity of their neighbors. Finally, we assign the remaining text objects to their nearest hub points. Experimental evaluation on textual data sets demonstrates the validity and efficiency of SCN, and a comparison with other clustering algorithms shows that our approach outperforms them.
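The pipeline outlined in the abstract can be illustrated with a rough Python sketch. The abstract does not give the definition of the asymmetric similarity or of hub points, so everything concrete here is an assumption made for illustration: `asym_sim` (the share of one document's term mass that also appears in the other), the edge `threshold`, and the choice of the `k` highest in-degree nodes as hubs are all hypothetical. The agglomerative merging of hub points that SCN performs is omitted; each hub simply seeds its own cluster.

```python
from collections import Counter

def term_weights(text):
    """Bag-of-words term counts (a stand-in for any term weighting scheme)."""
    return Counter(text.lower().split())

def asym_sim(a, b):
    """Assumed asymmetric measure: share of a's term mass also present in b."""
    total = sum(a.values())
    if total == 0:
        return 0.0
    return sum(min(w, b[t]) for t, w in a.items()) / total

def build_network(docs, threshold):
    """Directed edge i -> j whenever asym_sim(doc_i, doc_j) >= threshold."""
    vecs = [term_weights(d) for d in docs]
    edges = {i: set() for i in range(len(docs))}
    for i, a in enumerate(vecs):
        for j, b in enumerate(vecs):
            if i != j and asym_sim(a, b) >= threshold:
                edges[i].add(j)
    return vecs, edges

def pick_hubs(edges, k):
    """Assumption: hubs are the k nodes with the highest in-degree."""
    indeg = Counter()
    for targets in edges.values():
        indeg.update(targets)
    return sorted(edges, key=lambda n: (-indeg[n], n))[:k]

def assign_to_hubs(vecs, hubs):
    """Attach every non-hub document to the hub it is most similar to."""
    clusters = {}
    for i, v in enumerate(vecs):
        if i in hubs:
            clusters[i] = i  # a hub seeds its own cluster
        else:
            clusters[i] = max(hubs, key=lambda h: asym_sim(v, vecs[h]))
    return clusters

# Toy corpus, invented for this example only.
docs = [
    "apple banana fruit",
    "apple banana fruit juice smoothie",
    "apple fruit",
    "engine car wheel",
    "car engine wheel road tire",
    "car wheel",
]
vecs, edges = build_network(docs, threshold=0.7)
hubs = pick_hubs(edges, k=2)
clusters = assign_to_hubs(vecs, hubs)
print(hubs, clusters)
```

In this toy corpus the short documents point at the longer documents that contain them, but not the reverse, which is the kind of directionality a symmetric measure such as cosine similarity cannot express.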
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Song, S., Li, C. (2005). Semantic Correlation Network Based Text Clustering. In: Zhang, S., Jarvis, R. (eds) AI 2005: Advances in Artificial Intelligence. AI 2005. Lecture Notes in Computer Science, vol 3809. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11589990_63
DOI: https://doi.org/10.1007/11589990_63
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-30462-3
Online ISBN: 978-3-540-31652-7
eBook Packages: Computer Science, Computer Science (R0)