Abstract
Automated analysis and classification of spam email is a challenging task, and it is vital to identifying botnet structures and fighting cybercrime. In this work, we propose an automated methodology, and the resulting framework, based on an innovative categorical divisive clustering algorithm that is used both for grouping and for classifying spam messages. In particular, grouping is used to identify campaigns of similar spam emails, while classification is used to label specific emails according to the spammer's goal (e.g., phishing, malware distribution, advertisement). This work introduces the CCTree algorithm, as both a clustering and a classification algorithm, in two operating modes, batch and dynamic, to handle both large data sets and data streams. CCTree is then applied to large sets of spam emails for campaign identification and labeling. The performance of the algorithm is reported for both clustering and classification, and the batch and dynamic approaches are compared and discussed.
Funding
This study was funded by H2020 C3ISP Project (GA 700294).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
About this article
Cite this article
Sheikhalishahi, M., Saracino, A., Martinelli, F. et al. Digital Waste Disposal: an automated framework for analysis of spam emails. Int. J. Inf. Secur. 19, 499–522 (2020). https://doi.org/10.1007/s10207-019-00470-x
DOI: https://doi.org/10.1007/s10207-019-00470-x