
Automated learning of decision rules for text categorization

Published: 01 July 1994

Abstract

We describe the results of extensive experiments using optimized rule-based induction methods on large document collections. The goal of these methods is to automatically discover classification patterns that can be used for general document categorization or personalized filtering of free text. Previous reports indicate that human-engineered rule-based systems, requiring many man-years of developmental effort, have been successfully built to “read” documents and assign topics to them. We show that machine-generated decision rules appear comparable to human performance, while using the identical rule-based representation. In comparison with other machine-learning techniques, results on a key benchmark from the Reuters collection show a large gain in performance, from a previously reported 67% recall/precision breakeven point to 80.5%. In the context of a very high-dimensional feature space, several methodological alternatives are examined, including universal versus local dictionaries, and binary versus frequency-related features.
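The headline result is stated as a recall/precision breakeven point, i.e. the value at which recall and precision are (approximately) equal as the decision threshold is varied. As a rough illustration of how such a figure can be computed from ranked classifier output (this is not the authors' evaluation code; the function name and inputs are assumptions for the sketch):

```python
def breakeven_point(scores, labels):
    """Approximate the recall/precision breakeven point for one category.

    scores: per-document confidence that the document belongs to the category
    labels: 1 if the document truly belongs to the category, else 0
    Returns the value where precision and recall are closest as the
    threshold sweeps down the ranked list (a common approximation).
    """
    total_pos = sum(labels)
    if total_pos == 0:
        return 0.0
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = 0
    best_gap, best_value = float("inf"), 0.0
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        tp += is_positive
        precision = tp / rank
        recall = tp / total_pos
        if abs(precision - recall) < best_gap:
            best_gap = abs(precision - recall)
            best_value = (precision + recall) / 2
    return best_value

# Toy usage: two relevant documents ranked 1st and 3rd out of 4.
print(breakeven_point([0.9, 0.8, 0.4, 0.2], [1, 0, 1, 0]))  # 0.5
```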

References

[1]
APTÉ, C., DAMERAU, F., AND WEISS, S. 1993. Knowledge discovery for document classification. In Working Notes of the AAAI 1993 Workshop on Knowledge Discovery in Databases (KDD-93). AAAI, Menlo Park, Calif., 326-336.
[2]
BIEBRICHER, P., FUHR, N., AND LUSTIG, G. 1988. The automatic indexing system (AIR/PHYS)--From research to application. In ACM SIGIR '88. ACM, New York, 333-342.
[3]
BREIMAN, L., FRIEDMAN, J., OLSHEN, R., AND STONE, C. 1984. Classification and Regression Trees. Wadsworth, Monterey, Calif.
[4]
CHURCH, K. W. AND HANKS, P. 1989. Word association norms, mutual information, and lexicography. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics. ACL, 76-83.
[5]
CLARK, P. AND NIBLETT, T. 1989. The CN2 induction algorithm. Mach. Learn. 3, 261-283.
[6]
FLOWER, M. AND JENNINGS, A. 1992. Domain classification of language using neural networks. In 3rd Australian Conference on Neural Networks.
[7]
FUHR, N. AND PFEIFER, U. 1991. Combining model-oriented and description-oriented approaches for probabilistic reasoning. In ACM SIGIR '91. ACM, New York, 46-56.
[8]
FUNG, R., CRAWFORD, S., AND APPELBAUM, L. 1990. An architecture for probabilistic concept-based information retrieval. In ACM SIGIR '90. ACM, New York, 455-467.
[9]
HAYES, P. AND WEINSTEIN, S. 1991. Adding value to financial news by computer. In Proceedings of the 1st International Conference on Artificial Intelligence Applications on Wall Street. 2-8.
[10]
HAYES, P. J., ANDERSEN, P. M., NIRENBURG, I. B., AND SCHMANDT, L. M. 1990. TCS: A shell for content-based text categorization. In Proceedings of the 6th IEEE CAIA. IEEE, Piscataway, N.J., 320-326.
[11]
HIGHLEYMAN, W. 1962. The design and analysis of pattern recognition experiments. Bell Syst. Tech. J. 41, 723-744.
[12]
LEWIS, D. 1992a. An evaluation of phrasal and clustered representations on a text categorization task. In Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, 37-50.
[13]
LEWIS, D. 1992b. Feature selection and feature extraction for text categorization. In Proceedings of the Speech and Natural Language Workshop. Defense Advanced Research Projects Agency, Washington, D.C., 212-217.
[14]
LEWIS, D. AND RINGUETTE, M. 1994. A comparison of two learning algorithms for text categorization. In Symposium on Document Analysis and Information Retrieval. ISRI, Univ. of Nevada, Las Vegas. To be published.
[15]
LIN, S. AND KERNIGHAN, B. 1973. An efficient heuristic for the traveling salesman problem. Oper. Res. 21, 2, 498-516.
[16]
MASAND, B., LINOFF, G., AND WALTZ, D. 1992. Classifying news stories using memory based reasoning. In Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, 59-65.
[17]
MICHALSKI, R., MOZETIC, I., HONG, J., AND LAVRAC, N. 1986. The multi-purpose incremental learning system AQ15 and its testing application to three medical domains. In Proceedings of the AAAI-86. AAAI, Menlo Park, Calif., 1041-1045.
[18]
QUINLAN, J. R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, Calif.
[19]
QUINLAN, J. 1987. Simplifying decision trees. Int. J. Man-Mach. Stud. 27, 221-234.
[20]
SARACEVIC, T. 1991. Individual differences in organizing, searching and retrieving information. In Proceedings of the 54th Annual Meeting of the Society for Information Science, Jose-Marie Griffiths, Ed. Soc. for Information Science, 82-86.
[21]
SHETH, B. AND MAES, P. 1993. Evolving agents for personalized information filtering. In Proceedings of the IEEE CAIA-93. IEEE, New York, 345-352.
[22]
WEISS, S. AND INDURKHYA, N. 1993. Optimized rule induction. IEEE Exp. 8, 6, 61-69.
[23]
WEISS, S. M. AND KULIKOWSKI, C.A. 1991. Computer Systems That Learn. Morgan Kaufmann, San Mateo, Calif.
[24]
WEISS, S., GALEN, R., AND TADEPALLI, P. 1990. Maximizing the predictive value of production rules. Artif. Intell. 45, 47-71.


Reviews

Ian Hugh Witten

Can rules for document classification be induced from a training set of manually classified documents, enabling new documents to be classified automatically? This question is important because of the time and skill required for manual classification and the huge volumes of textual information to be processed. The problem splits into two parts: creating a feature set to represent each document, and inferring classification rules based on these features. For the first part, the authors advocate the use of topic-specific dictionaries, one for each classification topic, prepared manually in advance. A document is represented by the most frequently occurring dictionary words it contains, for each dictionary. For the second part, a new rule-learning scheme called Swap-1 is described that uses a dynamic optimization technique to overcome the possible shortcomings of the usual greedy rule-selection procedure. The scheme is tested on a collection of 15,000 Reuters news stories and 90 topics. Three-quarters of the stories are used for training, and the resulting rules are evaluated on the remaining stories in terms of recall and precision. Significant improvement is claimed over previously reported results on the same data, although the experimental conditions are slightly different.

I found this paper difficult to read and understand, principally because it introduces several new ideas in a sketchy manner and does not evaluate them properly. For example, it would have been interesting to compare results using Swap-1 with those of standard rule-learning schemes such as C4.5 [1] and to evaluate what improvement the local-dictionary feature selection scheme gives over traditional methods. Focusing the experiments on a comparison with a single study, details of which are not included, seems to be of lesser value, particularly in a high-profile journal of fairly general coverage.
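The feature-construction step summarized above (per-topic "local" dictionaries, documents represented by their most frequent dictionary words, binary or frequency-valued features) can be illustrated with a minimal sketch. The tokenizer, the cutoff k, and the example dictionary below are illustrative assumptions, not the authors' exact procedure:

```python
import re
from collections import Counter

def local_dictionary_features(text, topic_dictionary, k=10, binary=True):
    """Represent a document using the words of one topic's dictionary.

    topic_dictionary: set of words selected in advance for a single topic
    k: keep only the k dictionary words occurring most often in the document
    binary: if True emit 0/1 indicators, otherwise raw counts
    (Illustrative parameters; the paper's exact settings may differ.)
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t in topic_dictionary)
    top_k = dict(counts.most_common(k))
    return {w: 1 for w in top_k} if binary else top_k

# One such feature vector is built per (document, topic) pair; a separate
# rule set would then be induced for each topic from these vectors.
earn_dict = {"profit", "dividend", "earnings", "quarter", "net"}
doc = "Net profit rose in the quarter; the dividend was unchanged."
print(local_dictionary_features(doc, earn_dict))
```

A rule learned over such features is then a conjunction of word tests (e.g. "profit present AND dividend present implies topic earn"), which, per the review, is the kind of representation Swap-1 searches by swapping conditions in and out of a rule rather than growing it purely greedily.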


Published In

ACM Transactions on Information Systems, Volume 12, Issue 3
July 1994
101 pages
ISSN: 1046-8188
EISSN: 1558-2868
DOI: 10.1145/183422

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 July 1994
Published in TOIS Volume 12, Issue 3


Qualifiers

  • Article


Bibliometrics & Citations


Article Metrics

  • Downloads (Last 12 months)290
  • Downloads (Last 6 weeks)26
Reflects downloads up to 28 Nov 2024


Cited By

  • (2024) Weighted Asymmetric Loss for Multi-Label Text Classification on Imbalanced Data. Journal of Natural Language Processing 31:3, 1166-1192. DOI: 10.5715/jnlp.31.1166. Online publication date: 2024.
  • (2024) Approximation of the Meaning for Thematic Subject Headings by Simple Interpretable Representations. Lobachevskii Journal of Mathematics 45:3, 1261-1274. DOI: 10.1134/S1995080224600778. Online publication date: 19-Jul-2024.
  • (2024) Enhanced text classification through an improved discrete laying chicken algorithm. Expert Systems. DOI: 10.1111/exsy.13553. Online publication date: 25-Jan-2024.
  • (2024) Anchor Graph-Based Feature Selection for One-Step Multi-View Clustering. IEEE Transactions on Multimedia 26, 7413-7425. DOI: 10.1109/TMM.2024.3367605. Online publication date: 26-Feb-2024.
  • (2024) Efficient Multi-View K-Means for Image Clustering. IEEE Transactions on Image Processing 33, 273-284. DOI: 10.1109/TIP.2023.3340609. Online publication date: 1-Jan-2024.
  • (2024) Chinese Fraudulent Text Message Detection Based on Graph Neural Networks. 2024 6th International Conference on Communications, Information System and Computer Engineering (CISCE), 1078-1081. DOI: 10.1109/CISCE62493.2024.10653182. Online publication date: 10-May-2024.
  • (2024) Fast multi-view clustering via correntropy-based orthogonal concept factorization. Neural Networks, article 106170. DOI: 10.1016/j.neunet.2024.106170. Online publication date: Feb-2024.
  • (2024) Anchor graph-based multiview spectral clustering. Neurocomputing 583, article 127579. DOI: 10.1016/j.neucom.2024.127579. Online publication date: May-2024.
  • (2024) Euler State Networks: Non-dissipative Reservoir Computing. Neurocomputing 579, article 127411. DOI: 10.1016/j.neucom.2024.127411. Online publication date: Apr-2024.
  • (2024) Analyzing critical core components' technology opportunities based on multilayer networks from a lifecycle perspective: A case study of offshore wind turbine foundation. Journal of Cleaner Production 477, article 143850. DOI: 10.1016/j.jclepro.2024.143850. Online publication date: Oct-2024.