Abstract
This paper addresses personal E-mail filtering by casting it in the framework of text classification. Modeled as semi-structured documents, E-mail messages consist of a set of fields with predefined semantics and a number of variable-length free-text fields. While most work on classification concentrates on either structured data or free text, the work in this paper deals with both. To perform classification, a naive Bayesian classifier was designed and implemented, and a decision-tree-based classifier was also implemented. The design considerations and implementation issues are discussed. Using a relatively large amount of real personal E-mail data, a comprehensive comparative study of the two classifiers was conducted. The importance of different features is reported, and results on other issues related to building an effective personal E-mail classifier are presented and discussed. It is shown that both classifiers can perform filtering with reasonable accuracy. While the decision-tree-based classifier outperforms the Bayesian classifier when features and training size are selected optimally for both, a carefully designed naive Bayesian classifier is more robust.
The second author’s work is partially supported by a grant from the National 973 project of China (No. G1998030414).
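As a rough illustration of the semi-structured view described in the abstract, the sketch below trains a small multinomial naive Bayes classifier on messages that combine a structured field (the sender) with free text (subject and body). It is not the classifier designed in the paper, whose details the abstract does not give. The field names, the token prefixes, the Laplace smoothing, the folder labels and the toy messages are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

def features(msg):
    """Map a message dict to tokens (hypothetical scheme, not the paper's).

    The structured 'from' field becomes a single categorical token;
    the free-text fields are split into words, prefixed by field name.
    """
    toks = [f"from={msg['from'].lower()}"]
    toks += [f"subj:{w}" for w in msg["subject"].lower().split()]
    toks += [f"body:{w}" for w in msg["body"].lower().split()]
    return toks

class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, msgs, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        self.vocab = set()
        for msg, label in zip(msgs, labels):
            for tok in features(msg):
                self.token_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, msg):
        n_msgs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            total = sum(self.token_counts[label].values())
            score = math.log(count / n_msgs)  # log prior
            for tok in features(msg):
                if tok in self.vocab:  # tokens never seen in training are skipped
                    score += math.log(
                        (self.token_counts[label][tok] + 1) / (total + vocab_size)
                    )
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy training data (invented for illustration only).
train = [
    ({"from": "boss@corp.example", "subject": "project deadline",
      "body": "please send the report by friday"}, "work"),
    ({"from": "boss@corp.example", "subject": "meeting agenda",
      "body": "agenda for the project meeting"}, "work"),
    ({"from": "deals@shop.example", "subject": "big sale",
      "body": "discount offer buy now"}, "promo"),
    ({"from": "deals@shop.example", "subject": "weekend offer",
      "body": "free shipping on every order"}, "promo"),
]
clf = NaiveBayes().fit([m for m, _ in train], [y for _, y in train])

new_msg = {"from": "boss@corp.example", "subject": "report",
           "body": "project report attached"}
print(clf.predict(new_msg))
```

Prefixing each token with its field name keeps a word in the subject distinct from the same word in the body, which is one simple way to let structured and free-text fields contribute to the same model.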
Copyright information
© 2000 Springer-Verlag Berlin Heidelberg
About this paper
Diao, Y., Lu, H., Wu, D. (2000). A Comparative Study of Classification Based Personal E-mail Filtering. In: Terano, T., Liu, H., Chen, A.L.P. (eds) Knowledge Discovery and Data Mining. Current Issues and New Applications. PAKDD 2000. Lecture Notes in Computer Science, vol 1805. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45571-X_48
DOI: https://doi.org/10.1007/3-540-45571-X_48
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-67382-8
Online ISBN: 978-3-540-45571-4
eBook Packages: Springer Book Archive