Abstract
Research on text mining has recently attracted considerable attention from both industry and academia. Text mining is concerned with discovering unknown patterns or knowledge in a large text repository. The problem is difficult because the texts under consideration are semi-structured or even unstructured. Many approaches have been devised for mining various kinds of knowledge from texts. One important aspect of text mining is automatic text categorization, which assigns a text document to a predefined category if the document falls within the theme of that category. Traditionally, the categories are arranged hierarchically to support effective searching and indexing and to make them easy for humans to comprehend. The category themes and their hierarchical structure have mostly been determined by human experts. In this work, we developed an approach that automatically generates category themes and reveals the hierarchical structure among them. We also used the generated structure to categorize text documents. A self-organizing map was trained on the document collection to form two feature maps. These maps were then analyzed to obtain the category themes and their structure. Although the test corpus contains documents written in Chinese, the proposed approach can be applied to documents written in any language, provided that such documents can be transformed into a list of separated terms.
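The abstract does not spell out the training or labeling procedure, so the following is only a minimal sketch of how a self-organizing map could be trained on term vectors and then read out as two maps, one labeled with documents and one labeled with terms. The binary document encoding, the 2×2 map size, the linear decay schedules, the two labeling rules, the toy vocabulary, and the function names `train_som`, `label_documents`, and `label_terms` are all assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def train_som(data, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a small self-organizing map; returns one weight vector per neuron."""
    rng = np.random.default_rng(seed)
    weights = rng.random((rows * cols, data.shape[1]))
    # Grid coordinates of each neuron, used to measure neighborhood distance.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    sigma0 = max(rows, cols) / 2.0
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for idx in rng.permutation(len(data)):
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                   # assumed linear decay
            sigma = max(sigma0 * (1.0 - frac), 0.5)   # assumed linear decay with a floor
            x = data[idx]
            # Best-matching unit: the neuron whose weights are closest to x.
            bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
            dist2 = np.sum((grid - grid[bmu]) ** 2, axis=1)
            h = np.exp(-dist2 / (2.0 * sigma ** 2))   # Gaussian neighborhood
            weights += lr * h[:, None] * (x - weights)
            step += 1
    return weights

def label_documents(weights, data):
    """Assumed rule: each document labels the neuron closest to its vector."""
    return [int(np.argmin(np.linalg.norm(weights - x, axis=1))) for x in data]

def label_terms(weights, vocab):
    """Assumed rule: each term labels the neuron with the largest weight for it."""
    return {term: int(np.argmax(weights[:, n])) for n, term in enumerate(vocab)}

# Toy corpus (hypothetical): rows are documents, columns are terms.
vocab = ["stock", "market", "soccer", "goal"]
docs = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

weights = train_som(docs, rows=2, cols=2)
doc_map = label_documents(weights, docs)    # document index -> neuron index
term_map = label_terms(weights, vocab)      # term -> neuron index
print(doc_map, term_map)
```

Under these assumptions, neurons that collect many documents or terms would be natural candidates for category themes, and neighboring neurons for related categories; how the paper actually derives the hierarchy from the two maps is not stated in the abstract.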
Cite this article
Yang, HC., Lee, CH. Automatic Category Theme Identification and Hierarchy Generation for Chinese Text Categorization. J Intell Inf Syst 25, 47–67 (2005). https://doi.org/10.1007/s10844-005-0859-6
DOI: https://doi.org/10.1007/s10844-005-0859-6