Spam Filtering: an Active Learning Approach using Incremental Clustering

Authors:

Kleanthi Georgala,

Aris Kosmopoulos,

George Paliouras

WIMS '14: Proceedings of the 4th International Conference on Web Intelligence, Mining and Semantics (WIMS14)

Article No.: 23, Pages 1 - 12

https://doi.org/10.1145/2611040.2611059

Published: 02 June 2014

Abstract

This paper introduces a method that deals with unwanted mail messages by combining active learning with incremental clustering. The proposed approach is motivated by the fact that the user cannot provide the correct category for all received messages. The email messages are divided into chronological batches (e.g. one per day). The user is asked to give the correct categories (labels) for the messages of the first batch and from then on the proposed algorithm decides when to ask for a new label, based on a clustering of the messages that is incrementally updated. We test different variants of the algorithm on a number of different datasets and show that it achieves very good results with only 2% of all email messages labelled by the user.
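The batch-wise loop described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the clustering algorithm, the message representation, or the rule for requesting a label, so everything below is an assumption made for illustration: a leader-style incremental clustering over bag-of-words vectors, cosine similarity, a fixed `threshold`, and a query that is triggered whenever a message fits no existing cluster. The names `vectorize`, `Cluster`, `filter_batches` and `oracle` are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (an assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class Cluster:
    """A cluster summarised by summed term counts and the labels seen so far."""

    def __init__(self, vec, label):
        self.centroid = Counter(vec)
        self.labels = Counter([label])

    def add(self, vec, label=None):
        # Incremental update: fold the new message into the running sums.
        self.centroid.update(vec)
        if label is not None:
            self.labels[label] += 1

    def majority_label(self):
        return self.labels.most_common(1)[0][0]


def filter_batches(batches, oracle, threshold=0.5):
    """Label every message of the first batch via the oracle (the user);
    afterwards query only when a message is not close to any cluster."""
    clusters, predictions, queries = [], [], 0
    for batch_index, batch in enumerate(batches):
        for message in batch:
            vec = vectorize(message)
            best = max(clusters, key=lambda c: cosine(vec, c.centroid), default=None)
            fits = best is not None and cosine(vec, best.centroid) >= threshold
            if batch_index == 0 or not fits:
                label = oracle(message)  # ask the user for a label
                queries += 1
                if fits:
                    best.add(vec, label)
                else:
                    clusters.append(Cluster(vec, label))
            else:
                label = best.majority_label()  # reuse the cluster's label
                best.add(vec)
            predictions.append(label)
    return predictions, queries
```

In this sketch the number of queries grows with the number of new clusters rather than with the number of messages, which is the intuition behind asking the user for only a small fraction of labels; the paper's own variants may use a different criterion.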

Index Terms

Spam Filtering: an Active Learning Approach using Incremental Clustering
1. Information systems
  1. Information retrieval
  2. Information storage systems

Published In

WIMS '14: Proceedings of the 4th International Conference on Web Intelligence, Mining and Semantics (WIMS14)

June 2014

506 pages

ISBN:9781450325387

DOI:10.1145/2611040

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

In-Cooperation

Aristotle University of Thessaloniki

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

WIMS '14

WIMS '14: 4th International Conference on Web Intelligence, Mining and Semantics

June 2 - 4, 2014

Thessaloniki, Greece

Acceptance Rates

WIMS '14 Paper Acceptance Rate 41 of 90 submissions, 46%;

Overall Acceptance Rate 140 of 278 submissions, 50%

Article Metrics

Total Citations: 9
Total Downloads: 287
Downloads (Last 12 months): 3
Downloads (Last 6 weeks): 0

Reflects downloads up to 03 Mar 2025

Cited By

Chen Y, Huang L, Wang C, Fu M, Huang S, Huang J, Tan Y, Yan C (2023). Adversarial Spam Detector With Character Similarity Network. IEEE Transactions on Industrial Informatics, 19:3 (2541-2551). Online publication date: Mar-2023.
https://doi.org/10.1109/TII.2022.3177726
Althobaiti K, Wolters M, Alsufyani N, Vaniea K (2023). Using Clustering Algorithms to Automatically Identify Phishing Campaigns. IEEE Access, 11 (96502-96513). Online publication date: 2023.
https://doi.org/10.1109/ACCESS.2023.3310810
Hrehova S, Knapcikova L (2022). The Study of Machine Learning Assisted the Design of Selected Composites Properties. Applied Sciences, 12:21 (10863). Online publication date: 26-Oct-2022.
https://doi.org/10.3390/app122110863
Salloum S, Gaber T, Vadera S, Shaalan K (2022). A Systematic Literature Review on Phishing Email Detection Using Natural Language Processing Techniques. IEEE Access, 10 (65703-65727). Online publication date: 2022.
https://doi.org/10.1109/ACCESS.2022.3183083
Gupta B, Tewari A, Cvitić I, Peraković D, Chang X (2021). Artificial intelligence empowered emails classifier for Internet of Things based systems in industry 4.0. Wireless Networks. Online publication date: 12-Apr-2021.
https://doi.org/10.1007/s11276-021-02619-w
Nagao K, Nagao K (2019). Discussion Data Analytics. Artificial Intelligence Accelerates Human Learning (19-56). Online publication date: 3-Feb-2019.
https://doi.org/10.1007/978-981-13-6175-3_2
Chan P, He Z, Li H, Hsu C (2017). Data sanitization against adversarial label contamination based on data complexity. International Journal of Machine Learning and Cybernetics, 9:6 (1039-1052). Online publication date: 24-Jan-2017.
https://doi.org/10.1007/s13042-016-0629-5
Gupta B, Tewari A, Jain A, Agrawal D (2017). Fighting against phishing attacks. Neural Computing and Applications, 28:12 (3629-3654). Online publication date: 1-Dec-2017.
https://dl.acm.org/doi/10.1007/s00521-016-2275-y
Feng L, Wang Y, Zuo W (2015). Quick online spam classification method based on active and incremental learning. Journal of Intelligent & Fuzzy Systems, 30:1 (17-27). Online publication date: 21-Aug-2015.
https://doi.org/10.3233/IFS-151707
Sipahi D, Dalkılıç G, Özcanhan M (2015). Detecting spam through their Sender Policy Framework records. Security and Communication Networks, 8:18 (3555-3563). Online publication date: 1-Dec-2015.
https://dl.acm.org/doi/10.1002/sec.1280
