Cost-sensitive three-way email spam filtering

Authors:

Jigang Luo

Journal of Intelligent Information Systems, Volume 42, Issue 1

Pages 19 - 45

https://doi.org/10.1007/s10844-013-0254-7

Published: 01 February 2014

Abstract

Email spam filtering is typically treated as a binary classification problem that can be solved by machine learning algorithms. We argue that a three-way decision approach gives users a more meaningful way to handle their incoming emails with caution. A three-way spam filtering system produces three email folders instead of two: a suspected folder is added so that users can further examine suspicious emails, thereby reducing the chance of misclassification. Unlike existing ternary email spam filtering systems, we focus on two less-studied issues: the computation of the thresholds required to define the three email categories, and the interpretation of the cost-sensitive characteristics of spam filtering. Instead of supplying the thresholds based on an intuitive understanding of the levels of tolerance for errors, we systematically calculate them based on the decision-theoretic rough set model. A loss function is interpreted as the costs of making classification decisions, and the decision with the minimum overall cost is made. Experimental results show that the new approach reduces the error rate of misclassifying a legitimate email as spam and performs better with respect to cost sensitivity.
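The threshold computation described in the abstract can be illustrated with a minimal sketch. It assumes the standard decision-theoretic rough set formulation, in which a pair of thresholds (alpha, beta) is derived from six loss values, and it assumes that the target class is spam and that a classifier already supplies P(spam | email). The `Loss` class, the function names, and the numeric costs are hypothetical choices for illustration, not values or code from the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Loss:
    """Hypothetical loss function: cost of each action given the true class."""
    pp: float   # file as spam, email is spam
    bp: float   # file as suspected, email is spam
    np_: float  # file as legitimate, email is spam
    pn: float   # file as spam, email is legitimate
    bn: float   # file as suspected, email is legitimate
    nn: float   # file as legitimate, email is legitimate


def thresholds(loss: Loss) -> tuple[float, float]:
    """Derive (alpha, beta) from the losses, standard DTRS form (assumed)."""
    # Ordering assumptions that keep both denominators positive.
    if not (loss.pp <= loss.bp < loss.np_ and loss.nn <= loss.bn < loss.pn):
        raise ValueError("losses must satisfy pp <= bp < np and nn <= bn < pn")
    alpha = (loss.pn - loss.bn) / ((loss.pn - loss.bn) + (loss.bp - loss.pp))
    beta = (loss.bn - loss.nn) / ((loss.bn - loss.nn) + (loss.np_ - loss.bp))
    if not beta < alpha:
        raise ValueError("losses do not yield a non-empty suspected region")
    return alpha, beta


def classify(p_spam: float, alpha: float, beta: float) -> str:
    """Three-way decision on the estimated probability that an email is spam."""
    if p_spam >= alpha:
        return "spam"
    if p_spam <= beta:
        return "legitimate"
    return "suspected"


def min_cost_action(p_spam: float, loss: Loss) -> str:
    """Pick the action with the smallest expected cost directly."""
    p, q = p_spam, 1.0 - p_spam
    costs = {
        "spam": loss.pp * p + loss.pn * q,
        "suspected": loss.bp * p + loss.bn * q,
        "legitimate": loss.np_ * p + loss.nn * q,
    }
    return min(costs, key=costs.get)


# Illustrative costs only: losing a legitimate email (pn) is made the
# most expensive error, which pushes alpha up.
example_loss = Loss(pp=0, bp=1, np_=4, pn=10, bn=1, nn=0)
alpha, beta = thresholds(example_loss)
for p in (0.10, 0.50, 0.95):
    print(p, classify(p, alpha, beta), min_cost_action(p, example_loss))
```

With these illustrative costs, raising `pn` (the cost of filing a legitimate email as spam) raises alpha, so fewer emails are sent straight to the spam folder and more land in the suspected folder. Away from the threshold values themselves, `classify` and `min_cost_action` select the same folder, which is the sense in which the thresholds encode the minimum-cost decision.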

Index Terms
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
    2. Machine learning approaches
2. Information systems
  1. World Wide Web
    1. Web applications
      1. Internet communications tools

Published In

Journal of Intelligent Information Systems Volume 42, Issue 1

February 2014

172 pages

ISSN:0925-9902

Copyright © 2014 Springer Science+Business Media New York.

Publisher

Kluwer Academic Publishers

United States
