Review: A review of machine learning approaches to Spam filtering

Published: 01 September 2009

Abstract

In this paper, we present a comprehensive review of recent developments in the application of machine learning algorithms to Spam filtering, focusing on both textual- and image-based approaches. Instead of considering Spam filtering as a standard classification problem, we highlight the importance of considering specific characteristics of the problem, especially concept drift, in designing new filters. We discuss two particularly important aspects that are not widely recognized in the literature: the difficulties in updating a classifier based on the bag-of-words representation, and a major difference between two early naive Bayes models. Overall, we conclude that while important advances have been made in recent years, several aspects remain to be explored, especially under more realistic evaluation settings.

Information

Published In

Expert Systems with Applications: An International Journal, Volume 36, Issue 7
September 2009
597 pages

Publisher

Pergamon Press, Inc.

United States

Author Tags

  1. Bag-of-words (BoW)
  2. Image Spam
  3. Naive Bayes
  4. Online learning
  5. Spam filtering

Qualifiers

  • Article
