DOI: 10.1145/1141277.1141360

SF-HME system: a hierarchical mixtures-of-experts classification system for spam filtering

Published: 23 April 2006

Abstract

Many linear statistical models have recently been proposed in the text classification literature and evaluated on the Unsolicited Bulk Email filtering problem. Despite their popularity, owed both to their simplicity and to their relative ease of interpretation, the linearity assumption they impose on the data is inappropriate in practice, because it cannot capture the apparent non-linear relationships that characterize these samples. In this paper we propose SF-HME, a hierarchical mixtures-of-experts system that attempts to overcome limitations common to other machine-learning approaches to spam mail classification. After reducing the dimensionality of the data with the Simba algorithm for margin-based feature selection, we evaluated SF-HME on a publicly available email corpus in which legitimate and bulk messages are highly similar, and therefore hard to discriminate, and where traditional rule-based filtering approaches achieve considerably lower precision. The results confirm that SF-HME outperforms the other machine learning approaches, which exhibited a lower degree of recall.
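To make the classification model concrete, here is a minimal sketch of the building block that a hierarchical mixtures-of-experts system nests recursively: a single-level mixture of experts with a softmax gating network and logistic-regression experts, trained by gradient ascent on the mixture log-likelihood. This is an illustration only, written under assumed settings (number of experts, learning rate, synthetic toy data); it is not the authors' SF-HME implementation and it omits the Simba feature-selection stage.

import numpy as np

# Illustrative single-level mixture of experts for binary spam classification.
# Softmax gating network + logistic-regression experts, trained by gradient
# ascent on the mixture log-likelihood. All hyper-parameters are assumptions;
# this is not the SF-HME system described in the paper.
class MixtureOfExperts:
    def __init__(self, n_features, n_experts=4, lr=0.1, n_iter=300, seed=0):
        rng = np.random.default_rng(seed)
        self.V = rng.normal(scale=0.01, size=(n_experts, n_features))  # gating weights
        self.W = rng.normal(scale=0.01, size=(n_experts, n_features))  # expert weights
        self.lr, self.n_iter = lr, n_iter

    @staticmethod
    def _sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def _gates(self, X):
        # Softmax over experts: g[n, i] = P(expert i | x_n).
        scores = X @ self.V.T
        scores -= scores.max(axis=1, keepdims=True)   # numerical stability
        e = np.exp(scores)
        return e / e.sum(axis=1, keepdims=True)

    def predict_proba(self, X):
        g = self._gates(X)                  # (N, E) gating probabilities
        p = self._sigmoid(X @ self.W.T)     # (N, E) per-expert P(spam | x)
        return (g * p).sum(axis=1)          # mixture prediction

    def fit(self, X, y):
        for _ in range(self.n_iter):
            g = self._gates(X)
            p = self._sigmoid(X @ self.W.T)
            like = np.where(y[:, None] == 1, p, 1.0 - p)   # per-expert likelihood
            h = g * like
            h /= h.sum(axis=1, keepdims=True)              # responsibilities (E-step)
            # Gradient steps on the expected complete-data log-likelihood (M-step).
            self.W += self.lr * ((h * (y[:, None] - p)).T @ X) / len(X)
            self.V += self.lr * ((h - g).T @ X) / len(X)

# Toy usage on synthetic features with a deliberately non-linear labelling rule.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 20))
    y = (X[:, 0] * X[:, 1] > 0).astype(int)
    moe = MixtureOfExperts(n_features=20)
    moe.fit(X, y)
    acc = ((moe.predict_proba(X) > 0.5) == y).mean()
    print(f"training accuracy: {acc:.2f}")

In the full hierarchical variant (Jordan and Jacobs [7]), each expert above is itself replaced by another gated mixture, and the responsibilities computed in the E-step are propagated down the tree.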

References

[1] Cohen, W. W. Learning Rules that Classify E-mail. In Proceedings of the 1996 AAAI Spring Symposium on Machine Learning in Information Access, California, 1996.
[2] Sahami, M., Dumais, S., Heckerman, D., and Horvitz, E. A Bayesian Approach to Filtering Junk E-Mail. In Learning for Text Categorization: Papers from the AAAI Workshop, pages 55--62, Madison, Wisconsin, 1998. AAAI Technical Report WS-98-05.
[3] Androutsopoulos, I., Koutsias, J., Chandrinos, K. V., Paliouras, G., and Spyropoulos, C. D. An Evaluation of Naïve Bayesian Anti-Spam Filtering. In Proceedings of the Workshop on Machine Learning in the New Information Age, 2000.
[4] Kira, K., and Rendell, L. A practical approach to feature selection. In Proceedings of the 9th International Workshop on Machine Learning, pages 249--256, 1992.
[5] Gilad-Bachrach, R., Navot, A., and Tishby, N. Margin Based Feature Selection: Theory and Algorithms. In Proceedings of ICML 2004.
[6] Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. Adaptive mixtures of local experts. Neural Computation, 3(1):79--87, 1991.
[7] Jordan, M. I., and Jacobs, R. A. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181--214, 1994.
[8] Waterhouse, S. R., and Robinson, A. J. Classification using hierarchical mixtures of experts. In Proceedings of the 1994 IEEE Workshop on Neural Networks for Signal Processing, pages 177--186, Long Beach, CA, 1994. IEEE Press.
[9] Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. Adaptive mixtures of local experts. Neural Computation, 3(1):79--87, 1991.
[10] Drucker, H., Vapnik, V., and Wu, D. Support Vector Machines for Spam Categorization. IEEE Transactions on Neural Networks, 10(5), 1999.
[11] Weigend, A. S., Mangeas, M., and Srivastava, A. N. Nonlinear gated experts for time series: Discovering regimes and avoiding overfitting. International Journal of Neural Systems, 6:373--399, 1995.
[12] Bridle, J. S. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In F. Fogelman Soulié and J. Hérault, editors, Neurocomputing: Algorithms, Architectures, and Applications, pages 227--236. Springer-Verlag, New York, 1990.
[13] Fritsch, J., Finke, M., and Waibel, A. Context-dependent hybrid HME/HMM speech recognition using polyphone clustering decision trees. In Proceedings of ICASSP-97, 1997.
[14] Androutsopoulos, I., Koutsias, J., Chandrinos, K., and Spyropoulos, C. An experimental comparison of naïve Bayesian and keyword-based anti-spam filtering with personal e-mail messages. In Proceedings of SIGIR, 2000.
[15] Brutlag, J. D., and Meek, C. Challenges of the Email Domain for Text Classification. In Proceedings of the 17th International Conference on Machine Learning, pages 103--110, Stanford University, USA, 2000.
[16] Cranor, L., and LaMacchia, B. Spam! Communications of the ACM, 41(8):74--83, 1998.
[17] Gburzynski, P., and Maitan, J. Fighting the Spam Wars: A Remailer Approach with Restrictive Aliasing. ACM Transactions on Internet Technology, 4(1):1--30, February 2004.
[18] Hinde, S. Spam, scams, chains, hoaxes and other junk mail. Computers & Security, 21(7):592--606, 2002.
[19] Hidalgo, J. Evaluating Cost Sensitive Bulk Email Categorization. In Proceedings of SAC 2002, pages 615--620, Madrid, Spain, 2002.
[20] Hoffman, P., and Crocker, D. Unsolicited bulk email: Mechanisms for control. Technical Report UBE-SOL, IMCR-008, Internet Mail Consortium, 1998.
[21] Katirai, H. Filtering junk e-mail: A performance comparison between genetic programming and naïve Bayes. 1999. Available: http://members.rogers.com/hoomank/papers/katirai99filtering.pdf
[22] Nicholas, T. Using AdaBoost and Decision Stumps to Identify Spam E-mail. 2003. Available: http://nlp.stanford.edu/courses/cs224n/2003/fp/tyronen/report.pdf
[23] Lewis, D. D. Feature selection and feature extraction for text categorization. Morgan Kaufmann, San Francisco, pages 212--217, 1992.
[24] Koller, D., and Sahami, M. Hierarchically classifying documents using very few words. In Proceedings of the International Conference on Machine Learning (ICML), pages 170--178, 1997.
[25] Mladenic, D. Feature subset selection in text-learning. In Proceedings of the 10th European Conference on Machine Learning, 1998.
[26] Kiritchenko, S., and Matwin, S. Email Classification with Co-Training. In Proceedings of the Annual IBM Centers for Advanced Studies Conference (CASCON 2001), 2001.
[27] O'Brien, C., and Vogel, C. Spam Filters: Bayes vs. Chi-squared; Letters vs. Words. Presented at the International Symposium on Information and Communication Technologies, September 24--26, 2003.
[28] Carreras, X., and Màrquez, L. Boosting trees for anti-spam email filtering. In Proceedings of RANLP-01, International Conference on Recent Advances in Natural Language Processing, Tzigov Chark, Bulgaria, 2001.
[29] Cunningham, P., Nowlan, N., Delany, S. J., and Haahr, M. A Case-Based Approach to Spam Filtering that Can Track Concept Drift. In The ICCBR'03 Workshop on Long-Lived CBR Systems, Trondheim, Norway, June 2003.
[30] Gee, K. Using Latent Semantic Indexing to Filter Spam. In Proceedings of SAC 2003, Florida, USA, 2003.
[31] Schapire, R. E., and Singer, Y. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297--336, 1999.
[32] Drewes, R. An artificial neural network spam classifier. Project homepage: www.interstice.com/drewes/cs676/spam-nn
[33] Woitaszek, M., and Shaaban, M. Identifying Junk Electronic Mail in Microsoft Outlook with a Support Vector Machine. In Proceedings of the 2003 Symposium on Applications and the Internet.
[34] Kolcz, A., Chowdhury, A., and Alspector, J. The impact of feature selection on signature-driven spam detection. In Proceedings of the Conference on Email and Anti-Spam (CEAS 2004), CA, USA, 2004.
[35] SpamAssassin public corpus. http://spamassassin.org/publiccorpus
[36] Fawcett, T. "In vivo" spam filtering: A challenge for KDD. SIGKDD Explorations, 5(2):140--149, 2003.

Published In

SAC '06: Proceedings of the 2006 ACM symposium on Applied computing
April 2006
1967 pages
ISBN:1595931082
DOI:10.1145/1141277
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. hierarchical systems of experts
  2. machine learning
  3. spam mail

Qualifiers

  • Article

Conference

SAC06

Acceptance Rates

Overall Acceptance Rate 1,650 of 6,669 submissions, 25%
