An Incremental Document Clustering for the Large Document Database

Kil Hong Joo & Won Suk Lee

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3689)

Included in the following conference series:

Asia Information Retrieval Symposium

Abstract

With the development of the internet and computers, the amount of information available through the internet is increasing rapidly, and it is managed in document form. For this reason, research into methods for managing large numbers of documents effectively is necessary. Document clustering groups documents by subject by classifying a set of documents according to the similarity among them. Accordingly, document clustering can be used to explore and search documents, and it can increase search accuracy. This paper proposes an efficient incremental clustering algorithm for a set of documents that grows gradually. The incremental document clustering algorithm assigns a set of new documents to legacy clusters that have been identified in advance. In addition, to improve the correctness of the clustering, stop-word removal is proposed, and the weight of each word is calculated by the proposed TF × NIDF function. The performance of the proposed method is analyzed through a series of experiments to identify its various characteristics.
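The abstract does not define the NIDF function, the stop-word list, or the rule for assigning a document to a legacy cluster, so the following is only a minimal sketch of the general idea under stated assumptions. It substitutes a standard smoothed TF × IDF weight for the paper's TF × NIDF, uses cosine similarity against cluster centroids, and opens a new cluster when no similarity reaches a threshold. The names `STOP_WORDS`, `tokenize`, `tf_idf`, `cosine`, `IncrementalClusterer`, and the `threshold` parameter are hypothetical and do not come from the paper.

```python
import math
from collections import Counter

# Hypothetical stop-word list; the paper's actual list is not given in the abstract.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def tf_idf(tokens, doc_freq, n_docs):
    """Smoothed TF x IDF weights. ASSUMPTION: a stand-in for TF x NIDF."""
    counts = Counter(tokens)
    return {
        term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
        for term, tf in counts.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


class IncrementalClusterer:
    """Assigns each arriving document to an existing cluster or opens a new one."""

    def __init__(self, threshold=0.2):
        self.threshold = threshold   # assumed similarity cut-off
        self.doc_freq = Counter()    # term -> number of documents containing it
        self.n_docs = 0
        self.centroids = []          # per cluster: summed term weights
        self.members = []            # per cluster: list of document ids

    def add(self, doc_id, text):
        tokens = tokenize(text)
        self.n_docs += 1
        self.doc_freq.update(set(tokens))
        vec = tf_idf(tokens, self.doc_freq, self.n_docs)

        # Find the most similar existing cluster, if any.
        best, best_sim = None, 0.0
        for i, centroid in enumerate(self.centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = i, sim

        # No sufficiently similar cluster: start a new one.
        if best is None or best_sim < self.threshold:
            self.centroids.append(dict(vec))
            self.members.append([doc_id])
            return len(self.centroids) - 1

        # Otherwise fold the document into the chosen cluster's centroid.
        for term, w in vec.items():
            self.centroids[best][term] = self.centroids[best].get(term, 0.0) + w
        self.members[best].append(doc_id)
        return best


clusterer = IncrementalClusterer(threshold=0.2)
first = clusterer.add("d1", "clustering of documents by similarity")
second = clusterer.add("d2", "document clustering by similarity measures")
third = clusterer.add("d3", "football match results tonight")
```

In this sketch, existing centroids are never re-weighted when document frequencies change, which keeps each insertion cheap at the cost of slightly stale weights. Whether the paper makes the same trade-off cannot be determined from the abstract.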

Author information

Authors and Affiliations

Dept. of Computer Education, Gyeongin National University of Education, Gyodae Street 45, Gyeyang-gu, Incheon, 407-753, Korea
Kil Hong Joo
Dept. of Computer Science, Yonsei University, 134 Shinchondong Seodaemoongu, Seoul, 120-749, Korea
Won Suk Lee

Editor information

Editors and Affiliations

Department of Computer Science and Engineering, Pohang University of Science and Technology, San 31, Hyoja-dong, Nam-gu, 790-784, Pohang, Korea
Gary Geunbae Lee
Computer and Communication Media Research, NEC Corp., Miyazaki 4-1-1, Miyamae-ku, 216-8555, Kawasaki, Japan
Akio Yamada
Human-Computer Communications Laboratory, Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Hong Kong
Helen Meng
School of Engineering, Information and Communications University, 119, Munjiro, Yuseong-gu, 305-732, Daejeon, Korea
Sung Hyon Myaeng

About this paper

Cite this paper

Joo, K.H., Lee, W.S. (2005). An Incremental Document Clustering for the Large Document Database. In: Lee, G.G., Yamada, A., Meng, H., Myaeng, S.H. (eds) Information Retrieval Technology. AIRS 2005. Lecture Notes in Computer Science, vol 3689. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11562382_29

DOI: https://doi.org/10.1007/11562382_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29186-2
Online ISBN: 978-3-540-32001-2
eBook Packages: Computer Science, Computer Science (R0)
