DOI:10.1145/345508.345569

An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages

Published: 01 July 2000

Abstract

The growing problem of unsolicited bulk e-mail, also known as “spam”, has generated a need for reliable anti-spam e-mail filters. Filters of this type have so far been based mostly on manually constructed keyword patterns. An alternative approach has recently been proposed, whereby a Naive Bayesian classifier is trained automatically to detect spam messages. We test this approach on a large collection of personal e-mail messages, which we make publicly available in “encrypted” form, contributing towards standard benchmarks. We introduce appropriate cost-sensitive measures and, at the same time, investigate the effect of attribute-set size, training-corpus size, lemmatization, and stop lists, issues that have not been explored in previous experiments. Finally, we compare the performance of the Naive Bayesian filter with that of a keyword-pattern filter that is part of a widely used e-mail reader.
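The abstract gives no formulas, so the following is only a minimal sketch of the general idea of a Naive Bayesian spam filter with a cost-sensitive decision rule, not the authors' implementation. It assumes binary word-presence attributes, Laplace smoothing, and a threshold of λ/(1+λ), where λ expresses how many times costlier it is to block a legitimate message than to let a spam message through. The function names, the toy training data, and the parameter `lam` are hypothetical and chosen for illustration.

```python
import math
from collections import Counter

CLASSES = ("spam", "legit")

def train(messages):
    """Count, per class, how many messages contain each word.

    messages: list of (tokens, label) pairs, label in CLASSES.
    Assumes at least one training message per class.
    """
    doc_counts = Counter()
    word_counts = {c: Counter() for c in CLASSES}
    vocab = set()
    for tokens, label in messages:
        doc_counts[label] += 1
        for word in set(tokens):  # presence only, not frequency
            word_counts[label][word] += 1
            vocab.add(word)
    return doc_counts, word_counts, vocab

def spam_probability(tokens, model):
    """Return P(spam | message) under the naive independence assumption."""
    doc_counts, word_counts, vocab = model
    total = sum(doc_counts.values())
    present = set(tokens)
    log_post = {}
    for c in CLASSES:
        lp = math.log(doc_counts[c] / total)  # class prior
        for word in vocab:
            # Laplace-smoothed estimate of P(word present | class)
            p = (word_counts[c][word] + 1) / (doc_counts[c] + 2)
            lp += math.log(p if word in present else 1.0 - p)
        log_post[c] = lp
    # Normalise in log space to avoid underflow
    top = max(log_post.values())
    weights = {c: math.exp(v - top) for c, v in log_post.items()}
    return weights["spam"] / (weights["spam"] + weights["legit"])

def is_spam(tokens, model, lam=1.0):
    """Cost-sensitive rule (assumed): flag only if P(spam) > lam / (1 + lam)."""
    return spam_probability(tokens, model) > lam / (1.0 + lam)

# Toy training set, purely illustrative
TRAIN = [
    ("win free money now".split(), "spam"),
    ("free prize claim now".split(), "spam"),
    ("meeting agenda for monday".split(), "legit"),
    ("draft paper for review".split(), "legit"),
]
model = train(TRAIN)
```

With `lam=1` the rule reduces to picking the more probable class. Raising `lam` demands more certainty before a message is blocked, which is the kind of trade-off a cost-sensitive evaluation is meant to capture.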



Published In

SIGIR '00: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
July 2000
396 pages
ISBN:1581132263
DOI:10.1145/345508

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. evaluation (general)
  2. filtering/routing
  3. machine learning and IR
  4. test collections
  5. text categorization

Qualifiers

  • Article

Conference

SIGIR00
Sponsor:
  • Greek Com Soc
  • SIGIR
  • Athens U of Econ & Business

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%
