DOI: 10.1145/1141277.1141360

SF-HME system: a hierarchical mixtures-of-experts classification system for spam filtering

Published: 23 April 2006

Abstract

Many linear statistical models have recently been proposed in the text classification literature and evaluated on the Unsolicited Bulk Email filtering problem. Despite their popularity, owed both to their simplicity and to their relative ease of interpretation, the linearity assumption they impose on the data is inappropriate in practice, because it cannot capture the apparent non-linear relationships that characterize these samples. In this paper we propose SF-HME, a hierarchical mixtures-of-experts system that attempts to overcome limitations common to other machine-learning approaches to spam mail classification. After reducing the dimensionality of the data with the Simba algorithm for margin-based feature selection, we evaluated SF-HME on a publicly available email corpus in which legitimate and bulk messages are highly similar, and therefore hard to discriminate, and where traditional rule-based filtering approaches achieve considerably lower precision. The results confirm that SF-HME outperforms the other machine learning approaches, which exhibited a lower degree of recall.
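To make the classification model concrete, here is a minimal sketch of the building block that a hierarchical mixtures-of-experts system nests recursively: a single-level mixture of experts with a softmax gating network and logistic-regression experts, trained by gradient ascent on the mixture log-likelihood. This is an illustration only, written under assumed settings (number of experts, learning rate, synthetic toy data); it is not the authors' SF-HME implementation and it omits the Simba feature-selection stage.

import numpy as np

# Illustrative single-level mixture of experts for binary spam classification.
# Softmax gating network + logistic-regression experts, trained by gradient
# ascent on the mixture log-likelihood. All hyper-parameters are assumptions;
# this is not the SF-HME system described in the paper.
class MixtureOfExperts:
    def __init__(self, n_features, n_experts=4, lr=0.1, n_iter=300, seed=0):
        rng = np.random.default_rng(seed)
        self.V = rng.normal(scale=0.01, size=(n_experts, n_features))  # gating weights
        self.W = rng.normal(scale=0.01, size=(n_experts, n_features))  # expert weights
        self.lr, self.n_iter = lr, n_iter

    @staticmethod
    def _sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def _gates(self, X):
        # Softmax over experts: g[n, i] = P(expert i | x_n).
        scores = X @ self.V.T
        scores -= scores.max(axis=1, keepdims=True)   # numerical stability
        e = np.exp(scores)
        return e / e.sum(axis=1, keepdims=True)

    def predict_proba(self, X):
        g = self._gates(X)                  # (N, E) gating probabilities
        p = self._sigmoid(X @ self.W.T)     # (N, E) per-expert P(spam | x)
        return (g * p).sum(axis=1)          # mixture prediction

    def fit(self, X, y):
        for _ in range(self.n_iter):
            g = self._gates(X)
            p = self._sigmoid(X @ self.W.T)
            like = np.where(y[:, None] == 1, p, 1.0 - p)   # per-expert likelihood
            h = g * like
            h /= h.sum(axis=1, keepdims=True)              # responsibilities (E-step)
            # Gradient steps on the expected complete-data log-likelihood (M-step).
            self.W += self.lr * ((h * (y[:, None] - p)).T @ X) / len(X)
            self.V += self.lr * ((h - g).T @ X) / len(X)

# Toy usage on synthetic features with a deliberately non-linear labelling rule.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 20))
    y = (X[:, 0] * X[:, 1] > 0).astype(int)
    moe = MixtureOfExperts(n_features=20)
    moe.fit(X, y)
    acc = ((moe.predict_proba(X) > 0.5) == y).mean()
    print(f"training accuracy: {acc:.2f}")

In the full hierarchical variant (Jordan and Jacobs [7]), each expert above is itself replaced by another gated mixture, and the responsibilities computed in the E-step are propagated down the tree.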

References

[1] Cohen, W. W. Learning Rules that Classify E-mail. In Proceedings of the 1996 AAAI Spring Symposium on Machine Learning in Information Access, California, 1996.
[2] Sahami, M., Dumais, S., Heckerman, D., and Horvitz, E. A Bayesian Approach to Filtering Junk E-Mail. In Learning for Text Categorization: Papers from the AAAI Workshop, pages 55--62, Madison, Wisconsin, 1998. AAAI Technical Report WS-98-05.
[3] Androutsopoulos, I., Koutsias, J., Chandrinos, K. V., Paliouras, G., and Spyropoulos, C. D. An Evaluation of Naïve Bayesian Anti-Spam Filtering. In Proceedings of the Workshop on Machine Learning in the New Information Age, 2000.
[4] Kira, K., and Rendell, L. A practical approach to feature selection. In Proceedings of the 9th International Workshop on Machine Learning, pages 249--256, 1992.
[5] Gilad-Bachrach, R., Navot, A., and Tishby, N. Margin Based Feature Selection: Theory and Algorithms. In Proceedings of ICML 2004.
[6] Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. Adaptive mixtures of local experts. Neural Computation, 3(1):79--87, 1991.
[7] Jordan, M. I., and Jacobs, R. A. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181--214, 1994.
[8] Waterhouse, S. R., and Robinson, A. J. Classification using hierarchical mixtures of experts. In Proceedings of the 1994 IEEE Workshop on Neural Networks for Signal Processing, pages 177--186, Long Beach, CA, 1994. IEEE Press.
[9] Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. Adaptive mixtures of local experts. Neural Computation, 3(1):79--87, 1991.
[10] Drucker, H., Vapnik, V., and Wu, D. Support Vector Machines for Spam Categorization. IEEE Transactions on Neural Networks, 10(5), 1999.
[11] Weigend, A. S., Mangeas, M., and Srivastava, A. N. Nonlinear gated experts for time series: Discovering regimes and avoiding overfitting. International Journal of Neural Systems, 6:373--399, 1995.
[12] Bridle, J. S. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In F. Fogelman Soulié and J. Hérault, editors, Neurocomputing: Algorithms, Architectures, and Applications, pages 227--236. Springer-Verlag, New York, 1990.
[13] Fritsch, J., Finke, M., and Waibel, A. Context-dependent hybrid HME/HMM speech recognition using polyphone clustering decision trees. In Proceedings of ICASSP-97, 1997.
[14] Androutsopoulos, I., Koutsias, J., Chandrinos, K., and Spyropoulos, C. An experimental comparison of naïve Bayesian and keyword-based anti-spam filtering with personal e-mail messages. In Proceedings of SIGIR, 2000.
[15] Brutlag, J. D., and Meek, C. Challenges of the Email Domain for Text Classification. In Proceedings of the 17th International Conference on Machine Learning, pages 103--110, Stanford University, USA, 2000.
[16] Cranor, L., and LaMacchia, B. Spam! Communications of the ACM, 41(8):74--83, 1998.
[17] Gburzynski, P., and Maitan, J. Fighting the Spam Wars: A Remailer Approach with Restrictive Aliasing. ACM Transactions on Internet Technology, 4(1):1--30, February 2004.
[18] Hinde, S. Spam, scams, chains, hoaxes and other junk mail. Computers & Security, 21(7):592--606, 2002.
[19] Hidalgo, J. Evaluating Cost Sensitive Bulk Email Categorization. In Proceedings of SAC 2002, pages 615--620, Madrid, Spain, 2002.
[20] Hoffman, P., and Crocker, D. Unsolicited bulk email: Mechanisms for control. Technical Report UBE-SOL, IMCR-008, Internet Mail Consortium, 1998.
[21] Katirai, H. Filtering junk e-mail: A performance comparison between genetic programming and naïve Bayes. 1999. Available: http://members.rogers.com/hoomank/papers/katirai99filtering.pdf
[22] Nicholas, T. Using AdaBoost and Decision Stumps to Identify Spam E-mail. 2003. Available: http://nlp.stanford.edu/courses/cs224n/2003/fp/tyronen/report.pdf
[23] Lewis, D. D. Feature selection and feature extraction for text categorization. Morgan Kaufmann, San Francisco, pages 212--217, 1992.
[24] Koller, D., and Sahami, M. Hierarchically classifying documents using very few words. In Proceedings of the International Conference on Machine Learning (ICML), pages 170--178, 1997.
[25] Mladenic, D. Feature subset selection in text-learning. In Proceedings of the 10th European Conference on Machine Learning, 1998.
[26] Kiritchenko, S., and Matwin, S. Email Classification with Co-Training. In Proceedings of the Annual IBM Centers for Advanced Studies Conference (CASCON 2001), 2001.
[27] O'Brien, C., and Vogel, C. Spam Filters: Bayes vs. Chi-squared; Letters vs. Words. Presented at the International Symposium on Information and Communication Technologies, September 24--26, 2003.
[28] Carreras, X., and Màrquez, L. Boosting trees for anti-spam email filtering. In Proceedings of RANLP-01, International Conference on Recent Advances in Natural Language Processing, Tzigov Chark, Bulgaria, 2001.
[29] Cunningham, P., Nowlan, N., Delany, S. J., and Haahr, M. A Case-Based Approach to Spam Filtering that Can Track Concept Drift. In The ICCBR'03 Workshop on Long-Lived CBR Systems, Trondheim, Norway, June 2003.
[30] Gee, K. Using Latent Semantic Indexing to Filter Spam. In Proceedings of SAC 2003, Florida, USA, 2003.
[31] Schapire, R. E., and Singer, Y. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297--336, 1999.
[32] Drewes, R. An artificial neural network spam classifier. Project homepage: www.interstice.com/drewes/cs676/spam-nn
[33] Woitaszek, M., and Shaaban, M. Identifying Junk Electronic Mail in Microsoft Outlook with a Support Vector Machine. In Proceedings of the 2003 Symposium on Applications and the Internet.
[34] Kolcz, A., Chowdhury, A., and Alspector, J. The impact of feature selection on signature-driven spam detection. In Proceedings of the Conference on Email and Anti-Spam (CEAS 2004), CA, USA, 2004.
[35] SpamAssassin public corpus. http://spamassassin.org/publiccorpus
[36] Fawcett, T. "In vivo" spam filtering: A challenge for KDD. SIGKDD Explorations, 5(2):140--149, 2003.

Published In

SAC '06: Proceedings of the 2006 ACM symposium on Applied computing
April 2006
1967 pages
ISBN:1595931082
DOI:10.1145/1141277
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. hierarchical systems of experts
  2. machine learning
  3. spam mail

Qualifiers

  • Article

Conference

SAC06

Acceptance Rates

Overall Acceptance Rate 1,650 of 6,669 submissions, 25%
