Spam Filtering: an Active Learning Approach using Incremental Clustering

Published: 02 June 2014

Abstract

This paper introduces a method that deals with unwanted mail messages by combining active learning with incremental clustering. The proposed approach is motivated by the fact that the user cannot provide the correct category for all received messages. The email messages are divided into chronological batches (e.g. one per day). The user is asked to give the correct categories (labels) for the messages of the first batch and from then on the proposed algorithm decides when to ask for a new label, based on a clustering of the messages that is incrementally updated. We test different variants of the algorithm on a number of different datasets and show that it achieves very good results with only 2% of all email messages labelled by the user.
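
The abstract does not say which clustering algorithm or which querying rule the paper uses, so the following is only a minimal sketch of the general idea under assumed choices. It labels the whole first batch through the user, then asks for a new label only when a message fits no existing cluster. The class `IncrementalClusterFilter`, the `threshold` parameter, the bag-of-words cosine similarity, and the `oracle` callback (standing in for the user) are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (an assumed, deliberately simple representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class IncrementalClusterFilter:
    """Hypothetical sketch: threshold-based incremental clustering with label queries."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold  # assumed similarity cut-off for joining a cluster
        self.clusters = []          # each cluster: {"centroid": Counter, "labels": Counter}
        self.queries = 0            # number of labels requested from the user

    def _nearest(self, vec):
        """Return the most similar cluster and its similarity (None, 0.0 if no overlap)."""
        best, best_sim = None, 0.0
        for cluster in self.clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        return best, best_sim

    def _ask(self, text, vec, cluster, oracle):
        """Query the user, then store the message in `cluster` or in a new cluster."""
        label = oracle(text)
        self.queries += 1
        if cluster is None:
            cluster = {"centroid": Counter(), "labels": Counter()}
            self.clusters.append(cluster)
        cluster["centroid"].update(vec)  # summed counts; cosine ignores the scale
        cluster["labels"][label] += 1
        return label

    def train_first_batch(self, messages, oracle):
        """Every message of the first batch is labelled by the user."""
        for text in messages:
            vec = vectorize(text)
            cluster, sim = self._nearest(vec)
            self._ask(text, vec, cluster if sim >= self.threshold else None, oracle)

    def process(self, text, oracle):
        """Classify a later message, querying the user only if no cluster is close enough."""
        vec = vectorize(text)
        cluster, sim = self._nearest(vec)
        if cluster is None or sim < self.threshold:
            return self._ask(text, vec, None, oracle)
        # Close enough: reuse the cluster's majority user-given label and
        # update the centroid incrementally without asking the user.
        cluster["centroid"].update(vec)
        return cluster["labels"].most_common(1)[0][0]


if __name__ == "__main__":
    # Toy oracle standing in for the user; the keyword rule is purely illustrative.
    def user(text):
        return "spam" if "pills" in text or "lottery" in text else "ham"

    spam_filter = IncrementalClusterFilter(threshold=0.3)
    spam_filter.train_first_batch(
        ["cheap pills buy now", "meeting agenda for monday"], user
    )
    for message in ["buy cheap pills online", "win a free lottery prize"]:
        print(message, "->", spam_filter.process(message, user))
    print("labels requested:", spam_filter.queries)
```

In this sketch the labelling cost is controlled by `threshold`: a higher value opens new clusters more often and therefore asks the user for more labels, while a lower value relies more heavily on existing clusters.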


      Information & Contributors

      Information

      Published In

      WIMS '14: Proceedings of the 4th International Conference on Web Intelligence, Mining and Semantics (WIMS14)
      June 2014
      506 pages
      ISBN:9781450325387
      DOI:10.1145/2611040
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      In-Cooperation

      • Aristotle University of Thessaloniki

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. Active learning
      2. incremental clustering
      3. machine learning
      4. semi-supervised learning
      5. spam filtering

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      WIMS '14

      Acceptance Rates

      WIMS '14 Paper Acceptance Rate 41 of 90 submissions, 46%;
      Overall Acceptance Rate 140 of 278 submissions, 50%

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months): 5
      • Downloads (Last 6 weeks): 1
      Reflects downloads up to 18 Nov 2024

      Citations

      Cited By

      • (2023) Adversarial Spam Detector With Character Similarity Network. IEEE Transactions on Industrial Informatics 19:3 (2541-2551). DOI: 10.1109/TII.2022.3177726. Online publication date: Mar-2023
      • (2023) Using Clustering Algorithms to Automatically Identify Phishing Campaigns. IEEE Access 11 (96502-96513). DOI: 10.1109/ACCESS.2023.3310810. Online publication date: 2023
      • (2022) The Study of Machine Learning Assisted the Design of Selected Composites Properties. Applied Sciences 12:21 (10863). DOI: 10.3390/app122110863. Online publication date: 26-Oct-2022
      • (2022) A Systematic Literature Review on Phishing Email Detection Using Natural Language Processing Techniques. IEEE Access 10 (65703-65727). DOI: 10.1109/ACCESS.2022.3183083. Online publication date: 2022
      • (2021) Artificial intelligence empowered emails classifier for Internet of Things based systems in industry 4.0. Wireless Networks. DOI: 10.1007/s11276-021-02619-w. Online publication date: 12-Apr-2021
      • (2019) Discussion Data Analytics. Artificial Intelligence Accelerates Human Learning (19-56). DOI: 10.1007/978-981-13-6175-3_2. Online publication date: 3-Feb-2019
      • (2017) Data sanitization against adversarial label contamination based on data complexity. International Journal of Machine Learning and Cybernetics 9:6 (1039-1052). DOI: 10.1007/s13042-016-0629-5. Online publication date: 24-Jan-2017
      • (2017) Fighting against phishing attacks. Neural Computing and Applications 28:12 (3629-3654). DOI: 10.1007/s00521-016-2275-y. Online publication date: 1-Dec-2017
      • (2015) Quick online spam classification method based on active and incremental learning. Journal of Intelligent & Fuzzy Systems 30:1 (17-27). DOI: 10.3233/IFS-151707. Online publication date: 21-Aug-2015
      • (2015) Detecting spam through their Sender Policy Framework records. Security and Communication Networks 8:18 (3555-3563). DOI: 10.1002/sec.1280. Online publication date: 1-Dec-2015
