Article

Improved ROCK for text clustering using asymmetric proximity

Authors:

Shaoxu Song,

Chunping Li

SOFSEM'06: Proceedings of the 32nd conference on Current Trends in Theory and Practice of Computer Science

Pages 501 - 510

https://doi.org/10.1007/11611257_48

Published: 21 January 2006

Abstract

The ROCK algorithm can be applied to text clustering in large databases. The effectiveness of ROCK, however, is limited by the high dimensionality of textual data and by the traditional proximity measure between documents. In this paper, we propose an improved approach that uses asymmetric proximity to strengthen the discriminative features of text documents. Instead of the link count used in ROCK, we propose a novel concept of link weight overlaps to measure the proximity between two clusters. The IROCK (Improved ROCK) algorithm performs clustering analysis based on the overlap information of asymmetric proximities between text objects. We carry out the clustering process in an agglomerative hierarchical way. To demonstrate the effectiveness of IROCK, we perform an experimental evaluation on real textual data. A comparison with ROCK and classical algorithms indicates the superiority of our approach.
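
For context, the link count that IROCK replaces can be sketched as follows. This is a minimal illustration of the baseline ROCK notion of links (the number of neighbors two objects share under a similarity threshold), not of the IROCK link weight overlap, which the abstract does not define. The Jaccard similarity over term sets, the threshold `theta`, the choice to treat each document as its own neighbor, the toy documents, and the function names are all assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two term sets (assumed proximity measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rock_links(docs, theta):
    """Count common neighbors for every pair of documents.

    Two documents are neighbors when their similarity is at least theta.
    Assumption: a document counts as its own neighbor.
    """
    n = len(docs)
    neighbors = [
        {j for j in range(n) if jaccard(docs[i], docs[j]) >= theta}
        for i in range(n)
    ]
    # link(i, j) = size of the intersection of the two neighbor sets
    return {
        (i, j): len(neighbors[i] & neighbors[j])
        for i, j in combinations(range(n), 2)
    }


# Toy corpus: three documents about clustering, two about soccer.
docs = [
    {"rock", "cluster", "link"},
    {"rock", "cluster", "text"},
    {"cluster", "text", "link"},
    {"soccer", "goal", "match"},
    {"soccer", "goal", "league"},
]
links = rock_links(docs, theta=0.5)
print(links[(0, 1)], links[(3, 4)], links[(0, 3)])
```

In this toy corpus, documents on the same topic share neighbors and so accumulate links, while documents on different topics have none. An agglomerative procedure would then merge the pair of clusters with the strongest link-based score at each step.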


Index Terms
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
2. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
  2. Information systems applications
    1. Data mining


Published In

SOFSEM'06: Proceedings of the 32nd conference on Current Trends in Theory and Practice of Computer Science

January 2006

574 pages

ISBN: 354031198X

Editors:
Jiří Wiedermann
Institute of Computer Science, Academy of Sciences of the Czech Republic, Pod Vodárenskou věží 2, Prague 8, Czech Republic
Gerard Tel
Department of Information and Computer Sciences, University of Utrecht, Utrecht, The Netherlands
Jaroslav Pokorný
Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic
Mária Bieliková
Institute of Informatics and Software Engineering, Faculty of Informatics and Information Technologies, Slovak University of Technology, Ilkovičova 3, Bratislava, Slovakia
Július Štuller
Institute of Computer Science, Academy of Sciences of the Czech Republic, Pod Vodárenskou věží 2, Prague 8, Czech Republic

Publisher

Springer-Verlag

Berlin, Heidelberg
