An Evaluation of Statistical Approaches to Text Categorization

Yiming Yang

Abstract

This paper presents a comparative evaluation of a wide range of text categorization methods, drawing on previously published results on the Reuters corpus and on new results from additional experiments. A controlled study using three classifiers, kNN, LLSF and WORD, was conducted to examine the impact of configuration variations in five versions of Reuters on the observed performance of classifiers. Analysis and empirical evidence suggest that the evaluation results on some versions of Reuters were significantly affected by the inclusion of a large portion of unlabelled documents, making those results difficult to interpret and leading to considerable confusion in the literature. Using the results evaluated on the other versions of Reuters, which exclude the unlabelled documents, the performance of twelve methods is compared directly or indirectly. For indirect comparisons, kNN, LLSF and WORD were used as baselines, since they were evaluated on all versions of Reuters that exclude the unlabelled documents. As a global observation, kNN, LLSF and a neural network method had the best performance; except for a Naive Bayes approach, the other learning algorithms also performed relatively well.
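To make the concern about unlabelled documents concrete, the following is a minimal sketch of one way such documents could distort a measured score. It is not taken from the paper: the toy documents, the category names, the choice of micro-averaged precision and recall, and the function name `micro_precision_recall` are all assumptions made for illustration. The sketch scores the same predictions twice, once treating unlabelled test documents as having no correct category, and once with those documents excluded.

```python
from typing import List, Set, Tuple


def micro_precision_recall(
    gold: List[Set[str]], predicted: List[Set[str]]
) -> Tuple[float, float]:
    """Micro-averaged precision and recall over (document, category) decisions."""
    # A true positive is a category that is both assigned and correct.
    true_pos = sum(len(g & p) for g, p in zip(gold, predicted))
    n_predicted = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    precision = true_pos / n_predicted if n_predicted else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    return precision, recall


# Hypothetical test set: the last two documents carry no labels at all.
gold = [{"grain"}, {"trade"}, {"grain", "wheat"}, set(), set()]
# Hypothetical classifier output for the same five documents.
predicted = [{"grain"}, {"trade"}, {"grain"}, {"trade"}, {"grain"}]

# Setting 1: unlabelled documents are kept, so every category assigned
# to them is counted as a false positive.
p_all, r_all = micro_precision_recall(gold, predicted)

# Setting 2: unlabelled documents are excluded before scoring.
labelled = [(g, p) for g, p in zip(gold, predicted) if g]
gold_lab = [g for g, _ in labelled]
pred_lab = [p for _, p in labelled]
p_lab, r_lab = micro_precision_recall(gold_lab, pred_lab)

print(f"with unlabelled docs:    precision={p_all:.2f} recall={r_all:.2f}")
print(f"without unlabelled docs: precision={p_lab:.2f} recall={r_lab:.2f}")
```

On this toy data, precision moves from 0.60 to 1.00 while recall stays at 0.75, even though the classifier output is identical in both settings. If some unlabelled documents in a real collection do belong to a category, scores computed in the first setting would be hard to interpret in the same way.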

About this article

Cite this article

Yang, Y. An Evaluation of Statistical Approaches to Text Categorization. Information Retrieval 1, 69–90 (1999). https://doi.org/10.1023/A:1009982220290

Issue Date: April 1999
DOI: https://doi.org/10.1023/A:1009982220290
