
Classifying spam emails using agglomerative hierarchical clustering and a topic-based approach

2023

Spam emails are unsolicited, annoying and sometimes harmful messages which may contain malware, phishing or hoaxes. Unlike most studies that address the design of efficient anti-spam filters, we approach the spam email problem from a different and novel perspective. Focusing on the needs of cybersecurity units, we follow a topic-based approach for addressing the classification of spam email into multiple categories. We propose SPEMC-15K-E and SPEMC-15K-S, two novel datasets with approximately 15K emails each in English and Spanish, respectively, and we label them using agglomerative hierarchical clustering into 11 classes. We evaluate 16 pipelines, combining four text representation techniques - Term Frequency-Inverse Document Frequency (TF-IDF), Bag of Words, Word2Vec and BERT - and four classifiers: Support Vector Machine, Naïve Bayes, Random Forest and Logistic Regression. Experimental results show that the highest performance is achieved with TF-IDF and LR for the English dataset, with an F1 score of 0.953 and an accuracy of 94.6%, while for the Spanish dataset, TF-IDF with NB yields an F1 score of 0.945 and 98.5% accuracy. Regarding the processing time, TF-IDF with LR leads to the fastest classification, processing an English and a Spanish spam email in 2 ms and 2.2 ms on average, respectively.

Classifying spam emails using agglomerative hierarchical clustering and a topic-based approach

Francisco Jáñez-Martino a,b, Rocío Alaiz-Rodríguez a,b, Víctor González-Castro a,b, Eduardo Fidalgo a,b and Enrique Alegre a,b

a Department of Electrical, Systems and Automation, University of León, Spain
b Researcher at INCIBE (Spanish National Cybersecurity Institute), León, Spain

Keywords: Spam Detection, Multi-classification, Image-based Spam, Hidden Text, Text Classification, Word Embedding, Term Frequency

1. Introduction

Billions of spam emails are sent and received every day1. All email clients have a spam folder which automatically collects undesired content that often goes unnoticed. Thanks to these spam folders, unsolicited emails are less annoying and they do not swamp users' mailboxes. However, spam emails can be adapted for specific targeted attacks and reach our main inboxes. They may just contain advertisements and company promotions, which is annoying but harmless [8, 26]. Unfortunately, there is also a significant proportion of spam messages that have a malicious nature and whose aim could be to steal personal data, introduce malware or hijack user systems [26]. Spam generation is low-cost and identifying its creators is not a straightforward task, which makes it a very common method among cybercriminals. In addition, the volume of spam represents a huge proportion of the total emails sent daily. According to the reports of Cisco Talos2 and Kaspersky Lab3, spam emails represent between 55% and 85% of the daily volume of worldwide emails. Spam may cause productivity loss, distrust in the email service, annoyance or service bottlenecks which limit the memory space and speed of computers, resulting in an economic expense for organisations that is steadily increasing. As a consequence of all the above, a decade ago spam was estimated to cost companies twenty billion dollars annually. This cost is likely to surpass 250 billion dollars in a couple of years [43]. Moreover, if the aim of the spam is fraudulent, the integrity, security and privacy of the user may also be exposed to cybercriminals.
Spam email is a problem widely studied in the literature, since spammers constantly develop new techniques in order to bypass the email clients' spam filters. Existing research on using Natural Language Processing (NLP) for spam detection has focused particularly on binary classification approaches, categorising emails into two classes, legitimate or undesired email, i.e. ham or spam [30, 5, 6, 25, 21].

ORCID(s): 0000-0001-7665-6418 (F. Jáñez-Martino); 0000-0003-4164-5887 (R. Alaiz-Rodríguez); 0000-0001-8742-3775 (V. González-Castro); 0000-0003-1202-5232 (E. Fidalgo); 0000-0003-2081-774X (E. Alegre)
1 https://techjury.net/blog/how-many-emails-are-sent-per-day/ - Retrieved December 2021
2 https://talosintelligence.com/reputation_center/email_rep - Retrieved December 2021
3 https://www.statista.com/statistics/420391/spam-email-traffic-share/ - Retrieved December 2021

It is a well-known fact that spam can be classified into different categories4. In addition, some spam categories may be more harmful than others, and some of them may be more prone to go through spam filters undetected. Therefore, it would be a valuable improvement to detect not only whether an email is spam, but also its type. The multi-classification of spam emails could improve the handling of cybersecurity incidents, the protection of companies and citizens, and early warning, by identifying the behavioural patterns of spammers as a vital aspect of spam detection [19]. Due to the malicious nature of some spam emails, it is important to analyse their content to prevent cyber-attacks or campaigns against specific targets [24, 46]. At the time of writing this paper, there is only one work, carried out by Murugavel and Santhi [44], that deals with multiple threads of spam from a text analytics perspective, but without applying artificial intelligence.

In this paper, we approach the spam email problem from a different and novel perspective. We analyse the text content to perform cybersecurity topic-based class detection. These classes reflect the most common hoax topics that citizens and companies face in the spam they receive daily. By gaining insight into the spam email data, cybersecurity organisations may identify campaigns related to each scam topic more easily and enhance the warnings against them. The main contributions of this work can be summarised as follows:

1. We carried out an analysis and investigation of the textual part of spam emails using a hierarchical clustering algorithm in order to divide them into classes based on a cybersecurity topic-based approach.
2. We presented an email preprocessing method to extract the textual content from spam emails considering spammer tricks such as (i) introducing part (or all) of the spam message into images and (ii) hiding random text in the body of the email (known as "salting").
3. We created a novel dataset called Spam Email Multiclassification (SPEMC) that is divided into two sub-datasets: one with emails in English and another with emails in Spanish. Each one contains almost 15K spam emails labelled into a predefined set of eleven categories.
4. We introduced a framework to classify spam emails into cybersecurity categories using machine learning and natural language processing techniques.
The proposed approach can be integrated into tools and services whose objective is to serve citizens and organisations, helping them to identify harmful spam, such as emails containing extortion hacking, fake rewards, identity fraud or false job offers. Our collaboration with the Spanish National Cybersecurity Institute (INCIBE)5 aims at developing solutions based on machine learning that could be useful for Public Administrations, Industry or Law Enforcement Agencies. This work is an extension of a preliminary study [32] and it has been influenced by research carried out about the dark web ([10, 3, 9, 27]), where domains in the onion router (Tor) darknet are classified into multiple categories depending on their contents, instead of just dividing them into legal or suspicious of being illegal. The ultimate goal of this work is to enable the extraction of meaningful information from large amounts of undesired - and possibly harmful - spam emails, which can help Law Enforcement Agencies (LEAs) or companies to fight against them.

To the best of our knowledge, there are no works that tackle the spam email problem from a cybersecurity topic-based perspective using NLP and machine learning. For the first time, in this paper we propose SPEMC-15K-E and SPEMC-15K-S, two novel datasets containing approximately 15K spam emails each. The SPEMC datasets have been semi-automatically labelled into 11 categories by means of an agglomerative hierarchical clustering that deals with the hidden text problem efficiently. SPEMC-15K-E and SPEMC-15K-S comprise English and Spanish spam respectively, which are the second and third most spoken languages in the world6. Besides, we propose a spam multi-classification pipeline, assessing sixteen different combinations of encoding techniques with machine learning classifiers for English and Spanish spam emails, setting baseline results on the SPEMC datasets for future research on this cybersecurity topic. In addition, in order to extract all valuable text from the spam email, we detect two well-known spammer tricks in our datasets and propose a solution to minimise their impact on the classification. Before encoding the content of the undesired email, we also introduce a different way of working with the text included in images and with salting. For the spam that includes images, we extract the text using OCR techniques, instead of ignoring it. For the hidden disturbing text, instead of looking into the HTML tags, we extract the text which is visible to the user using OCR technologies. An overview of the entire process, including the creation of the datasets, is shown in Fig. 1. In addition, a flow chart explaining the process conducted is depicted in Fig. 2.

The process of knowledge discovery in this field can be simplified with the automatic classification of spam emails into several categories. This multi-classification model can help cybersecurity agencies that currently analyse spam emails manually using hard-coded rules, trying to avoid loss of work productivity, malware distribution [15] and phishing [16], as well as to detect cybercrime campaigns [24].

4 https://encyclopedia.kaspersky.com/knowledge/types-of-spam/ - Retrieved December 2021
5 https://www.incibe.es/en - Retrieved December 2021
6 https://www.ethnologue.com/ - Retrieved December 2021
Fig. 1: Spam email multi-classification process: a) extraction of 15K random spam emails per language from the sources, b) pre-processing of the emails, c) extraction of all visible text of every email, d) text preprocessing on each email, then encoding with Bag of Words and, finally, hierarchical clustering, e) manual review of the clusters, f) category labelling, g) training and evaluation of 16 text classification pipelines.

Fig. 2: Flow chart of the entire process to develop our proposed model, capable of detecting cybersecurity topic-based classes automatically.

The rest of the paper is organised as follows: related works are reviewed in Section 2. Section 3 explains the methodology we have followed to create the SPEMC datasets. The set of designed classification pipelines is explained in Section 4. After that, in Section 5, we detail the experimental setup and discuss the results. Finally, Section 6 presents our conclusions and future work.

2. Literature Review

2.1. Spam email detection

Organisations and researchers have been developing filters to classify emails as spam or not spam for the last few decades. Models based on machine learning and NLP have become the state-of-the-art filters. Barushka and Hajek [6] used a deep learning model and, later, Faris et al. [25] developed a genetic algorithm as a feature selector along with a Random Weight Network. Recently, Saidini et al. [49] combined semantic features extracted by different text encodings, e.g. doc2vec or Bag of Words (BOW), with six machine learning algorithms, such as Support Vector Machine (SVM), k-nearest neighbour, AdaBoost, Naïve Bayes (NB), decision trees and Random Forest (RF), to identify spam emails. Dedeturk et al. [22] created a filter using an artificial bee colony as feature selector and Logistic Regression (LR) as classifier. Despite the impact of deep learning models in many tasks, Mekouar [39] recommended traditional algorithms like RF and NB due to their performance in spam detection.

Although the binary classifiers developed recently show high performance, it is worth highlighting that the emails used to calibrate the machine learning models come from publicly available datasets that date from the early 2000s. For example, Barushka and Hajek [6] obtained remarkable results on publicly available datasets like SpamAssassin7 or the Enron-Spam dataset [40]. Faris et al. [25] also assessed their binary spam filter on the SpamAssassin dataset. However, it is important to highlight that spam email has a changing nature, due to time (evolution of subjects) and to the techniques used by spammers wishing to elude spam filters, which inevitably leads to shifts in the dataset [48]. Due to this fact, the most recent works in spam email are training their models without considering current spammer tricks. Bhowmick and Hazarika [8] enumerated a list of the most popular spammer tricks, among them the use of image-based spam and the insertion of random text in the email body. The former consists of inserting the spam message inside an image attached to the email to bypass the filters based on textual analysis.
There are some works [12, 51, 37] that have dealt with detecting spam emails by classifying the attached images. They used machine learning models and image properties, e.g. the metadata or the colour, as features. Other works, like [45], handled the image-based trick from an Optical Character Recognition (OCR) perspective to recognise and extract the letters and words from a spam image [19]. Spammers also try to confuse textual spam filters by inserting pieces of random text inside the email body and hiding them conveniently, e.g. by reducing the font size or by making them invisible to readers. This trick, which normally uses HTML tags, is known as hidden text or salting [35]. There is a small number of works oriented to detecting the salting trick in spam emails and using this content to enhance binary spam classifiers [7, 35]. They try to identify whether a character is hidden text by analysing its visibility, i.e. checking for anomalies in terms of colour or size, and they present an initial OCR-based solution. Despite being a common problem nowadays, it is often overlooked, and we have found no more works that deal with it.

2.2. Topic-based detection in the spam field

In their study about spam opinion detection, Ligthart et al. [34] concluded that, in some scenarios where binary classification is inadequate, multi-class classification is required, and defined this task as a demanding challenge requiring high research effort. To the best of our knowledge, only a few works have addressed the multi-classification of spam email following a topic-based approach [44, 49]. Saidini et al. [49] divided both spam and non-spam emails into six pre-defined domains according to the topics of the most common advertisements, e.g. computer, adult, education, finance, health and others. They developed a model based on machine learning and natural language processing to detect these domains and perform a binary classification of an email. Murugavel and Santhi [44] identified seven spam categories - or threads - through the count of the most frequent words in a dataset of emails. They also provided some statistics, concluding that the most frequent thread in spam emails is promotional advertisements. However, they did not use either NLP or machine learning techniques to assign an email to a category. Besides this, their work on the multi-classification problem of spam is based on a small dataset, which comprises 1040 emails, 842 of which are spam, and which is not large enough to provide consistent statistics, nor to build a robust pipeline for automatic spam classification. Indeed, they also pointed out the image-based spam trick, but they did not mention how to make use of this information or how to classify these emails automatically. However, these previous works did not tackle the spam email problem from a cybersecurity topic-based perspective using NLP and machine learning.

7 https://spamassassin.apache.org/old/publiccorpus/ - Retrieved December 2021

2.3. Hierarchical text clustering

Hierarchical clustering methods are divided into agglomerative and divisive, i.e. bottom-up and top-down approaches [1]. Agglomerative clustering starts with single-point clusters and, according to their similarity, recursively merges two or more clusters until a stop criterion is reached.
These properties allow the algorithm to find high-level relationships among categories and to join related clusters that contain a small number of examples. Hierarchical clustering has been used in textual tasks due to its versatility and the ease of filtering the data visually. Al-Mahmoud et al. [4] evaluated a hierarchical algorithm in the task of text clustering and concluded that its high performance did not degrade with the number of documents or the number of clusters. In their work, De Campos et al. [20] assessed text clustering techniques and remarked that hierarchical algorithms work quite well for filtering problems. Mahdavi et al. [36] used hierarchical clustering to discover relationships among dataset entities.

3. Datasets: SPEMC-15K-E and SPEMC-15K-S

At the time of writing this paper, there is only one work, by Murugavel and Santhi [44], which tackles the spam email problem as a multi-classification problem. They identified seven categories or threads using the count of the most frequent words on a dataset which contains 1040 emails, 842 of which are spam. The categories include chain letters, email spoofing, promotional advertisements, hoaxes, malware warnings and porn spam. They provided some statistics, such as the number of emails per category and the most and least representative classes. This dataset does not contain enough emails to provide consistent statistics or to train a robust automatic classifier.

We perform our experimental study on two novel and present-day datasets: SPEMC-15K-E (Spam Email Classification dataset - English) and SPEMC-15K-S (Spam Email Classification dataset - Spanish), containing approximately 15K spam emails each. The purpose of this work is to provide a real solution for the INCIBE environment to enhance the security and privacy of companies and citizens. We address the study of spam emails in English and Spanish because they are the languages most frequently reported to INCIBE. The analysis of the spam emails led us to a division into eleven spam categories. We defined these categories using machine learning and NLP techniques and with the supervision and support of an expert group of technicians from INCIBE. We focused on the spam email topics from a user point of view, covering both companies and citizens. We wanted to identify the bait used in hoax, advertisement or chain-letter emails in order to detect, with further information, spam email campaigns, such as miracle product scams.

3.1. Datasets Creation

Since both datasets were created following the same process, contain the same number of categories and almost the same number of emails, and the only difference between them is the language - English or Spanish - we use the abbreviation SPEMC-15K to name them. To build the SPEMC-15K datasets, INCIBE provided us with the spam emails, which had previously been collected by honeypots of the Spanish national research and education network, RedIRIS8. We had a total of 70K emails from November and 15K from April of 2019. Once we received the data, we first extracted 15K random emails per language, i.e. English and Spanish, from the initial spam collection provided by INCIBE. In general terms, we used agglomerative hierarchical clustering to divide each dataset into clusters in order to identify a hierarchy of topic-based groups inside.

8 https://www.rediris.es/index.php.en - Retrieved December 2021
We extracted 15K emails per language due to the limitations of hierarchical clustering for large datasets, such as its high computational time and space complexity [36], as well as the human resources needed to analyse the cluster outputs. The use of 15K emails allowed us to find a trade-off between time and space complexity and an appropriate experimentation. Later, an expert carried out the annotation of each group and checked that the division was suitable, while cybersecurity experts from INCIBE supervised the annotation and definition of the classes. All authors of the paper validated the previous annotations during the entire process.

In more detail, we applied the email processing (Section 4.1) and the text preprocessing (Section 4.2) steps and we encoded the text using a BOW model [33]. To obtain a first division of the unlabelled email corpus, we followed the work of Biswas et al. [10] for building a Tor (The Onion Router) image dataset and the work of Zhang et al. [53] for dividing spam images into clusters using agglomerative clustering. We clustered the BOW feature vectors through agglomerative hierarchical clustering, evaluating different linkage metrics. Ward's minimum variance [31] appeared to be the most suitable linkage approach, since it yielded a larger and faster separation between clusters. Finally, we manually selected a cut-off distance by observing the resulting dendrograms. We established an approximate range of possible categories based on the suggestions and preferences of the INCIBE experts. We obtained 16 clusters for English and Spanish. We visually inspected all the emails from every cluster to assign an initial tag that helped to define the category later on. After merging some similar clusters into the same class and looking for the same categories in both languages, we obtained the final categorisation with 11 labelled classes. INCIBE experts checked our labelling in order to advise us and define a suitable list that reflects the interests of companies and citizens. We sought to automate a categorisation task that is carried out manually by experts, in order to overcome time and resource limitations. For this reason, the experts' help allowed us to determine classes according to their needs, using the clusters as a baseline. Although we can relate every cluster with a class after some reassignments of emails with the help of the experts, a visual inspection is recommended to be sure about the type of emails found in every cluster. Unfortunately, we cannot make the datasets publicly available, since they do not belong to us (i.e., they were provided by INCIBE) and they contain personal information, which is difficult to anonymise entirely in a reliable way.

Fig. 3 shows the dendrograms for both languages. We can observe a close distance between similar topics and writing styles, such as Academic Media, Health and Pharmacy, or Service and Work Offer, and likewise between Extortion Hacking and Identity Fraud. Although the majority class - Sexual Content Dating - groups several clusters, the content of each one follows the same topic and purpose, and a subdivision could have added noise to the model.

Fig. 3: Dendrograms provided by the hierarchical clustering for both languages, English and Spanish. The X axis represents the class associated to every cluster, and the Y axis represents the distance among the clusters.
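As a rough, assumption-laden illustration of this clustering step (not the exact script used in the paper), the sketch below encodes already pre-processed email texts with Bag of Words and applies Ward-linkage agglomerative clustering with scipy; the cut-off distance is a hypothetical value chosen by inspecting the dendrogram, as described above.

    # Sketch: BOW encoding + agglomerative clustering with Ward linkage (scipy),
    # following the procedure described above. `emails` is assumed to be the list of
    # already pre-processed email texts; the cut-off value is hypothetical.
    from sklearn.feature_extraction.text import CountVectorizer
    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(emails).toarray()   # dense BOW vectors

    Z = linkage(X, method="ward")                    # Ward's minimum variance linkage
    dendrogram(Z, truncate_mode="lastp", p=30)       # inspect to choose a cut-off distance

    cut_off = 150.0                                  # hypothetical distance, selected visually
    cluster_ids = fcluster(Z, t=cut_off, criterion="distance")

Each resulting cluster can then be inspected manually and mapped to one of the topic-based classes, as done with the help of the INCIBE experts.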
3.2. Datasets Characteristics

The SPEMC-15K datasets contain the following classes: Academic Media, Extortion Hacking, Fake Reward, Health, Identity Fraud, Money Making, Pharmacy, Service, Sexual Content Dating, Work Offer and Other. Next, we briefly describe each of the spam classes. Table 1 shows fragments of email messages representative of each spam category, Figures 4 and 5 depict a word cloud per category in SPEMC-15K-E and SPEMC-15K-S, respectively, and Table 2 shows the number of emails and the proportion of each class for both datasets.

Academic Media includes spam emails related to scientific conferences or journals and education services such as masters, seminars or courses. The English emails which belong to this class are mainly focused on the scientific community, and their language is rigorous and formal, emulating real conferences and calls for papers. Apart from including these emails, the Spanish Academic Media emails mostly involve courses for personal skill development.

Extortion Hacking contains emails whose sender requests a payment from the user in exchange for not revealing private content about them. The language used is formal, without orthographic or grammatical errors, in order to scare the victim by appearing as real and severe as possible. Generally, these emails follow a similar structure and use similar words, varying only the threat.

Fake Reward covers emails where the sender offers an unexpected reward to the receiver; a famous example is the Nigerian Prince scam. In general, these emails depend on the reward: the more valuable the reward, the longer and more explanatory the message.

Health is related to miracle advice, products and news which supposedly improve the user's well-being. These emails attempt to convince the user through close communication and by emphasising the importance of what the message promotes. They address a large variety of topics related to health problems, whether sexual, physical or psychological.

Identity Fraud includes emails whose sender attempts to pose as a well-known company by using its name and brand, or as a person who sends an email very similar to a ham email. They sometimes trick users by looking like an email sent to a wrong recipient, in order to obtain a naive response from the victim. They also use social engineering techniques, building phishing emails to obtain private information from the victim. Due to these characteristics, cybersecurity experts point out identity fraud as one of the most harmful classes for companies and citizens.

Money Making is composed of emails which offer online services to earn money quickly, such as casinos, fast tricks to earn easy money or betting shops. Money Making emails often use a careless and repetitive structure, seeking to highlight the ease of gaining money.

Pharmacy includes the sale of many well-known drugs via the Internet. These emails emulate a real pharmacy by listing their products and trying to transmit confidence.

Service covers emails with advertisements of Small and Medium-sized Enterprises (SMEs) or personal services. They promote a profession to solve specific user problems.

Sexual Content Dating contains sex web pages and sex propositions.
A dating email often uses the same structure, just changing some words, and shows explicit sexual content. The propositions, on the other hand, generally use a more careful and clever language in order to convince the user with something other than purely sexual arguments.

Work Offer groups spam emails which offer a fake job with significant benefits to the user. These emails usually follow a pattern and only differ in the work conditions.

Other mainly contains disclosures, discoveries and products about politics, economy and technology. These emails are similar to the Health class emails, and the major difference between them is the topic.

4. Methodology

We divided the automatic classification of spam emails into two stages: email processing, where we extract the textual information from raw emails, and text classification, where we pre-process, encode and classify the textual information.

4.1. Email processing

One of the main purposes of our research is to classify spam emails based on their topic. Generally, emails are divided into two parts: (i) the header and (ii) the body, which comprises text, multimedia objects and attachments. Since we are working in the NLP field, we focused on those elements of each email from which we can extract textual information. From the header, we extract only the Subject field, which usually summarises the content of the body in a few words, and we do not use the rest of the header fields, such as CC/BCC or the email address. Although those fields are suitable for detecting spam emails, our objective is to classify the spam email into a predefined set of topic-related categories, so we only use the subject field, since it contains the largest amount of textual information compared to the other header fields.

Traditionally, the email body contained plain text without formatting options. Nowadays, emails are usually coded in HTML format, which allows enhancing the email design through the use of templates, images and extra functionalities [14]. However, some emails contain both formats - plain and HTML - to ensure that the client can read the email without depending on the service.

Table 1: Piece of an email for every spam class defined.

Academic Media: "Better Packaging Better Living. Join FSQ Europe 2020 taking place on 29th 30th January 2020 in London, UK and hear senior representatives of British Plastic Federation, OFI Technologie Innovation GmbH and Client Earth give presentations in Session 2 in the morning of Conference Day 1 entitled "Better Packaging,Better Living" focusing on"

Extortion Hacking: "I've been watching you for a few months now. The fact is that you were infected with malware through an adult site that you visited...I made a video showing how you satisfy yourself in the left half of the screen, and in the right half you see the video that you watched. With one click of the mouse, I can send this video to all your emails and contacts on social networks. I can also post access to all your e-mail correspondence and messengers that you use. If you want to prevent this, transfer the amount of $732 to my bitcoin address. (if you do not know how to do this, write to Google: "Buy Bitcoin")."
Fake Reward: "I am interested to transfer and invest in your country through your assistance. I am in Ghana presently and I have the sum of Ten Million Eight hundred thousand US Dollars which I would like to transfer into your account and invest in your country if possible."

Health: "18 months ago, I discovered a weird method that can safely and naturall improve your hearing , no matter how complicated your hearing problems are. So far, it's already helped over 96, 623 people who found this brilliant, ear-saving method to save them from going DEAF..."

Identity Fraud: "Dear, Please find attached copies of documents that we were sent the original ones. Thanks & Regards, PEF PVT LTD"

Money Making: "SlotoCash Casino Trusted Online Since 2007 ExclusiveOffer Get $31FREE NoDepositRequired Code:31FREE 200% MatchBonus+100FreeSpins EnterPromoCode:SLOTO1MATCH"

Pharmacy: "Online Pharmacy, Guaranteed Quality! Save your money, time, efforts. You'll never find better offer! Best medications available are sold at our trusted online pharmacy! This month at half price! - Fast World shipping - Secure ordering Lowest price - NO PRESCRIPTION REQUIERED"

Service: "Dear Sir/Madam, Nice day, Glad to hear you are in inflatable outdoor products market! We are Supplier for air track, inflatable SUP board, inflatable sport game, inflatable water park, we also have very popular style, it is Inflatable gymnastic tumbling mat."

Sexual Content Dating: "Want sex tonight, and new pussy every day? Here you can find any girl for sex! They all want to fuck"

Work Offer: "Hello! We are looking for employees working remotely. My name is Anderson, I am the personnel manager of a large International company. Most of the work you can do from home, that is, at a distance. Salary is $3500-$7000. If you are interested in this offer, please visit Our Site Best regards!"

Other: "AutoCharge² - Magnetic Phone Holder and Charger Special 50% Black Friday Sale - Order Now at 50% Off!"

Fig. 4: The top 18 most frequently used words of every class in SPEMC-15K-E, depicted as one word cloud per class (Academic Media, Extortion Hacking, Fake Reward, Health, Identity Fraud, Money Making, Pharmacy, Service, Sexual Content Dating, Work Offer and Other).

Fig. 5: The top 18 most frequently used words of every class in SPEMC-15K-S, depicted as one word cloud per class (same categories as in Fig. 4).

Table 2: Number of emails and the percentage per class and dataset.

                              SPEMC-15K-E            SPEMC-15K-S
Class                         Count      %           Count      %
Academic Media                   64    0.44             690    4.60
Extortion Hacking               197    1.36             259    1.73
Fake Reward                     240    1.66              37    0.25
Health                         2499   17.25             490    3.27
Identity Fraud                  334    2.31             144    0.96
Money Making                    434    3.00             513    3.42
Pharmacy                         17    0.12            3833   25.57
Service                         183    1.26             271    1.81
Sexual Content Dating          7924   54.73            7062   47.10
Work Offer                       55    0.38            1165    7.77
Other                          2532   17.49             528    3.52
Total                         14479                   14992

To process the text from the email body, we consider three scenarios: (i) emails with only plain text, (ii) emails with only HTML format and (iii) emails with both simultaneously.
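To make these three scenarios more concrete, the sketch below shows one hedged way to split a raw email into its Subject, plain-text and HTML parts with Python's standard email module; it is a simplified illustration under our own assumptions, not the exact code used in this work, and it ignores encoding corner cases and attachment metadata.

    # Sketch: split a raw email into Subject, plain-text body and HTML body.
    # Which of the three scenarios applies follows from which parts are non-empty.
    from email import policy
    from email.parser import BytesParser

    def extract_parts(raw_bytes: bytes):
        msg = BytesParser(policy=policy.default).parsebytes(raw_bytes)
        subject = msg.get("Subject", "")
        plain, html = "", ""
        for part in msg.walk():
            if part.get_content_type() == "text/plain":
                plain += part.get_content()
            elif part.get_content_type() == "text/html":
                html += part.get_content()
        return subject, plain, html   # (i) only plain, (ii) only html, (iii) both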
When an email includes both formats, we prioritise the analysis of the HTML one over the plain text. We do so because the characteristics of the HTML format also give spammers a more sophisticated tool to enhance their tricks [8, 48, 26]. In particular, we found emails with hidden text within them, which would affect the performance of a text classifier due to the introduction of random text that is invisible to the users but used by spam filters to detect a spam email. Although this trick might impact spam detection, to the best of our knowledge it has not been taken into account during the last few years. For this reason, we recover the treatment of hidden text from the latest research available ([35], [7]). First, we convert the HTML email body into an image containing the entire email. Then, we transform the image to greyscale and, finally, we extract the visible text from this image by using an OCR. It is worth emphasising that we assume every spam email with an HTML part is suspected of containing hidden text. Following this methodology, we ensure the extraction of all the visible text seen by email clients, allowing the information to be classified in the same way a human being would read it.

According to Bhowmick et al. [8], spammers avoid filters based on textual content by attaching images containing text, instead of writing in the email body. Researchers have detected image-based spam by analysing the attached image [51, 12]. To consider the textual information extracted from an image, we also applied an OCR to extract the text embedded in those images. After checking the language of the text obtained from each part, i.e. the subject, the body and the attached images, we joined all the text to be processed as a whole in the Text Classification stage. If the text is not written in English or Spanish, it is discarded; since INCIBE is a Spanish organisation, they are interested in these two languages, which are the most harmful for Spanish companies and citizens. We only use emails where the language of all these parts is the same, to avoid a negative impact on the performance of the future classifier. This step is vital, as it discards emails whose content is in several languages, e.g. English text in the images with a body written in Russian.
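The render-then-OCR step described in this subsection can be outlined as follows. The paper extracts the visible text with an OCR (pytesseract, the tesseract-ocr wrapper mentioned in Section 5.1); the HTML-to-image rendering shown here through imgkit/wkhtmltoimage is our own assumption for illustration, not necessarily the renderer used by the authors.

    # Sketch: recover the text actually visible to the reader by rendering the HTML
    # body to an image, converting it to greyscale and running OCR on it.
    import imgkit                     # assumption: wkhtmltoimage backend installed
    import pytesseract
    from PIL import Image

    def visible_text_from_html(html_body: str, tmp_png: str = "email.png") -> str:
        imgkit.from_string(html_body, tmp_png)     # render the whole email as an image
        grey = Image.open(tmp_png).convert("L")    # greyscale, as described above
        return pytesseract.image_to_string(grey)   # salted/hidden text is not rendered, so it is ignored

The same pytesseract call can be applied directly to images attached to the email to recover image-based spam text.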
4.2. Text Classification

This stage is divided into three phases, following other works like [47]: text pre-processing, representation and classification. In the pre-processing, we first removed single characters and numbers; if there were characters or numbers inside a word, we eliminated them too. Then, we changed the text to lowercase and, finally, we removed the stop words and the duplicated words and tokenised the resulting text. We have not applied a stemming method due to its ambiguity, which could cause wrong classifications in an environment as misleading as spam email.

To represent the text, we selected two popular techniques based on word frequency - i.e., Bag of Words (BOW) [29] and Term Frequency - Inverse Document Frequency (TF-IDF) [2] - together with two recent word embedding techniques: word2vec [41, 42] and Bidirectional Encoder Representations from Transformers (BERT) [23]. It is worth highlighting that BOW and TF-IDF allow straightforward implementations with low computational requirements; however, they do not consider the order of the words. BOW [29] represents a text corpus by means of a feature vector whose components are the frequency of each word. TF-IDF [2] builds a sparse vector assigning a numerical value to each word of the text corpus, emphasising a word when it appears many times in a text and fewer times in the rest of the corpus. On the other hand, word2vec and BERT represent a text as a vector which encodes the relationships between the words, i.e. their context. This enables similar words to be represented closer in the embedding space, enhancing the semantic analysis and context of the words. These techniques build a vector with lower dimensionality than the traditional methods, and they are able to manage large datasets without spending many computational resources. However, the models have a larger size than the word frequency models, which might be a drawback for a real-time application. Word2vec [41, 42] tries to maximise the likelihood that words are predicted from their context, with a Continuous Bag of Words (CBOW) model, or vice versa with a skip-gram model. BERT [23] is based on the context, taking as a baseline a masked language model pre-trained using bidirectional transformers [50], and it encodes words using bidirectional instead of unidirectional representations. We selected word2vec and BERT because they are the most significant word embedding techniques with different approaches: word2vec is based on learning context-independent word representations, whereas BERT relies on learning context-dependent word representations. Finally, we combined every text representation with each of four well-known machine learning algorithms, Support Vector Machine (SVM) [17], Naïve Bayes (NB) [38], Random Forest (RF) [11] and Logistic Regression (LR) [18], resulting in 16 different pipelines or trained models for the task of Text Classification.

5. Empirical Evaluation

5.1. Experimental Setting

We carried out our experiments on a personal computer with an Intel(R) Core(TM) i7 (7th Gen) CPU and 16 GB of RAM, under Ubuntu 18.04 and Python 3. We assessed several multi-classification models on the two datasets presented in this paper, SPEMC-15K-E and SPEMC-15K-S (see Section 3), which contain spam emails in English and Spanish, respectively. Regarding the process of building the datasets, we detected the email language using the Python module langdetect9 and implemented the agglomerative hierarchical clustering algorithm with the Python library scipy10. We extracted the text from the attached images and from the rendered HTML email image through a Python wrapper of tesseract-ocr11, called pytesseract12. Both datasets are highly imbalanced: the majority class in SPEMC-15K-E, Sexual Content Dating, contains 7924 emails, whereas the minority class, Pharmacy, only has 17 emails. Similarly, the majority and minority classes in the SPEMC-15K-S dataset contain 7062 and 37 emails, corresponding to Sexual Content Dating and Fake Reward, respectively. To address this class imbalance, we assigned a proportional weight to each class depending on its number of emails by using the class-weight parameter of the scikit-learn Python library13. We used scikit-learn to implement the pipelines and nltk14 to remove the English and Spanish stopwords. For the text representation step with BOW and TF-IDF, we selected a vocabulary size of 7000 and 10000 words, respectively. Regarding the minimum number of appearances per word for the English and Spanish datasets, they were set to 5 and 3, respectively. Spanish verb conjugations were a challenge, and we had to design a preprocessing step without stemming and lemmatisation techniques. Consequently, we considered a lower number of word appearances to create a robust and wide vocabulary for the Spanish vectorizers.
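A minimal sketch of the pre-processing and frequency-based encoding described above is given below; the tokenisation details are our own simplifications, while the vectorizer settings follow the reported values (a 10000-word vocabulary for TF-IDF, 7000 for BOW, and a minimum of 5 or 3 appearances per word for English or Spanish). The variable emails stands for the corpus of extracted email texts and is assumed to exist.

    # Sketch: lowercase, strip digits/symbols/single characters, remove stopwords and
    # duplicated words, then encode with TF-IDF using the reported English settings.
    # Requires the nltk stopwords corpus to be downloaded beforehand.
    import re
    from nltk.corpus import stopwords
    from sklearn.feature_extraction.text import TfidfVectorizer

    STOP = set(stopwords.words("english"))           # "spanish" for SPEMC-15K-S

    def preprocess(text: str) -> str:
        text = re.sub(r"[^a-záéíóúñü\s]", " ", text.lower())   # drop digits and symbols
        tokens = [t for t in text.split() if len(t) > 1 and t not in STOP]
        seen, kept = set(), []
        for t in tokens:                             # remove duplicated words, keep order
            if t not in seen:
                seen.add(t)
                kept.append(t)
        return " ".join(kept)

    # English TF-IDF settings; BOW used a 7000-word vocabulary and Spanish used min_df=3.
    tfidf = TfidfVectorizer(max_features=10000, min_df=5)
    X = tfidf.fit_transform(preprocess(e) for e in emails)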
We built a doc2vec encoder based on the word2vec model provided by gensim15. The doc2vec model is the sum of all the word vectors that compose the email text. We trained the doc2vec model for 10 epochs with an alpha value of 0.025, and the size of the doc2vec vector that represents each email is 100 elements. We selected the 'distributed memory' (DM) option to preserve the order of the words. We trained the word2vec model with a vocabulary of 15K words per language, i.e. English and Spanish, extracted from the emails, in order to capture the relations between words in a spam context, given the difficulty of handling words never seen before. The rest of the word2vec and doc2vec parameters were set to their default values. We implemented BERT by means of a client-service16. After an empirical evaluation, we chose the best configuration of pre-trained BERT models. Thus, for the English pipelines we chose a BERT model with 24 layers, a hidden size of 1024, 16 heads and 340M parameters, trained only on English vocabulary, and for the Spanish pipelines we selected a multilingual BERT model, pre-trained on 104 languages, with 12 layers, a hidden size of 768, 12 heads and 110M parameters.

For the classification step, we describe below the parameter tuning per model; the rest of the model parameters were left at their default values. We took a "One Vs Rest" (OVR) approach for all the classifiers. We selected a linear kernel for the SVM model, tuning the C value. The C parameter acts as a regularisation term for both the SVM and LR classifiers: a high value looks for a smaller margin of hyperplane separation. Regarding NB, we used a Multinomial distribution for the frequency-based encoders, i.e., TF-IDF and BOW. Due to the incompatibility between the negative values of the word embeddings, i.e., word2vec and BERT, and the Multinomial distribution, we set a Gaussian distribution for these cases. For the RF model, we set the number of trees to 250. Lastly, we chose a C value of 1000 and 120 as the maximum number of iterations for the LR model.

We evaluated the performance with 10-fold cross-validation, reporting accuracy, precision, recall and F1 score. Despite working with imbalanced datasets, we assumed that every class has equal actual value and, thus, we evaluated every model by means of the macro-average. We seek to classify spam emails without depending on their overall proportion in the dataset. This metric globally aggregates the contributions of each class, considering all classes with the same weight to calculate the average metric. The macro-average is considered more suitable when there are small-size classes [28, 52]. Additionally, we also obtained the micro-average, which evaluates every class individually, and the weighted-average, which considers the support of each class. We also report the average processing time per email for the entire pipeline. Runtime is an important parameter for converting this solution into a real-world application, due to the massive number of spam emails that are processed on a daily basis. Finally, we selected the most adequate pipeline per language by analysing jointly the F1 score, accuracy and execution time.

9 https://pypi.python.org/pypi/langdetect - Retrieved December 2021
10 https://www.scipy.org/ - Retrieved December 2021
11 https://github.com/tesseract-ocr/tesseract - Retrieved December 2021
12 https://pypi.python.org/pypi/pytesseract - Retrieved December 2021
13 https://scikit-learn.org - Retrieved December 2021
14 https://www.nltk.org/ - Retrieved December 2021
15 https://radimrehurek.com/gensim/ - Retrieved December 2021
16 https://github.com/hanxiao/bert-as-service - Retrieved December 2021
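As a hedged outline of this evaluation protocol, the sketch below builds one of the sixteen pipelines (TF-IDF with Logistic Regression, one-vs-rest, weighted classes) and scores it with 10-fold cross-validation using accuracy and macro-averaged metrics. The reported parameter values (C = 1000, 120 iterations) are used; everything else, including the balanced class-weight setting and the variables texts and labels, is an illustrative assumption.

    # Sketch: one of the 16 pipelines (TF-IDF + Logistic Regression, OvR, class weights)
    # evaluated with 10-fold cross-validation and macro-averaged metrics.
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.model_selection import cross_validate

    pipe = Pipeline([
        ("tfidf", TfidfVectorizer(max_features=10000, min_df=5)),
        ("clf", OneVsRestClassifier(
            LogisticRegression(C=1000, max_iter=120, class_weight="balanced"))),
    ])

    scores = cross_validate(pipe, texts, labels, cv=10,
                            scoring=["accuracy", "precision_macro",
                                     "recall_macro", "f1_macro"])
    print({k: v.mean() for k, v in scores.items() if k.startswith("test_")})

Swapping the vectorizer or the classifier in this pipeline yields the remaining combinations evaluated in Table 3.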
5.2. Experimental Results

Spam emails were labelled according to their topic into eleven categories. Our purpose is to provide a solution based on machine learning and NLP in order to automatically analyse spam emails and give support to cybersecurity institutes. For that reason, we combined four text representations with four classifiers, resulting in 16 pipelines to automatically categorise, for the first time in the literature, spam emails into several categories. Table 3 shows the performance of every pipeline in terms of macro precision, macro recall, macro F1 score, accuracy and runtime per email.

5.3. Discussion

In Table 3, it can be seen that the combination of TF-IDF and LR obtains the highest performance for English spam multi-classification, with a macro F1 score of 0.953 and an accuracy of 94.6%. For Spanish multi-classification, the combination of TF-IDF with NB shows the best performance, with a macro F1 score of 0.945 and an accuracy of 98.5%. Regarding the runtime, the combination of TF-IDF with LR achieved the shortest execution time in both languages, classifying an English or Spanish email in an average of 2 ms and 2.2 ms, respectively. SVM combinations are the slowest among the evaluated pipelines, with times from 8.7 ms to 98.6 ms. In a multiclass setting, micro-averaged precision and recall take the same value, which turns out to be identical to the accuracy. Regarding the weighted-average, we observed that the values are close to the micro-average, due to the fact that there is a clear majority class in both languages. Although these are the best performing pipelines, the accuracy of most of the pipelines is above 89.0%.

The pipelines based on word2vec obtained the lowest results, which means that the spam email dataset used to train the model is not suitable for this purpose. We trained our word2vec models, as a doc2vec model, from scratch using a small dataset for both English and Spanish, which only included tens of thousands of words belonging to email documents. A short vocabulary and similar contexts may be the main obstacles to establishing robust relations among words, which may produce word vectors with close values and similar predictions assigned to the wrong class. Moreover, the word2vec-NB combination obtained the lowest overall results in the English dataset and shows a strong contrast between the macro values and the accuracy in the Spanish dataset. The BERT-NB combination also suffers from this contrast. In order to use word embedding vectors, we changed the actual data distribution to the Gaussian distribution, which can disturb the distances between classes, causing more overlapping. This fact, together with the imbalanced dataset, entails a high accuracy for the classes with more emails, like Sexual Content Dating, and, in consequence, a high overall accuracy. However, the macro metrics reveal the poor results, since they consider all classes with the same weight.
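To make the word-embedding pipelines discussed here more concrete, the following sketch shows one possible way, under our own assumptions, of building a document vector as the sum of word2vec word vectors (vector size 100, alpha 0.025, 10 epochs, as reported in Section 5.1) and feeding it to a Gaussian Naïve Bayes classifier, which is required because the embeddings contain negative values.

    # Sketch: document vectors as the sum of word2vec word vectors (gensim),
    # classified with Gaussian Naive Bayes; `texts` and `labels` are assumed to exist.
    import numpy as np
    from gensim.models import Word2Vec
    from sklearn.naive_bayes import GaussianNB

    tokenised = [t.split() for t in texts]                    # pre-processed email texts
    w2v = Word2Vec(sentences=tokenised, vector_size=100, alpha=0.025, epochs=10)

    def doc_vector(tokens):
        vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
        return np.sum(vecs, axis=0) if vecs else np.zeros(100)

    X = np.vstack([doc_vector(t) for t in tokenised])
    clf = GaussianNB().fit(X, labels)                          # Multinomial NB cannot take negative features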
Table 3: Performance of the sixteen pipelines in terms of Precision (P), Recall (R) and F1 score using macro, micro and weighted averages, ACCuracy (%) and RunTime (RT, ms/email). Micro-averaged precision, recall and F1 take the same value, so they are shown in a single column.

SPEMC-15K-E
Pipeline        Macro P  Macro R  Macro F1  Micro P/R/F1  Wtd P  Wtd R  Wtd F1  ACC (%)    RT
TF-IDF-SVM        0.965    0.923     0.941         0.924  0.931  0.924   0.926     92.4  98.6
TF-IDF-NB         0.957    0.858     0.883         0.934  0.936  0.934   0.934     93.4   3.1
TF-IDF-RF         0.960    0.906     0.929         0.925  0.931  0.925   0.927     92.1  40.5
TF-IDF-LR         0.971    0.939     0.953         0.946  0.949  0.946   0.947     94.6   2.0
BOW-SVM           0.964    0.923     0.941         0.934  0.936  0.934   0.934     93.4  55.8
BOW-NB            0.675    0.827     0.682         0.890  0.918  0.890   0.900     89.0   4.5
BOW-RF            0.957    0.908     0.929         0.925  0.930  0.925   0.926     92.5  35.6
BOW-LR            0.966    0.937     0.950         0.943  0.945  0.943   0.944     94.3   2.7
word2vec-SVM      0.269    0.402     0.264         0.332  0.565  0.315   0.378     33.2  77.5
word2vec-NB       0.056    0.125     0.012         0.001  0.050  0.012   0.007      1.4   3.7
word2vec-RF       0.750    0.346     0.427         0.624  0.621  0.623   0.561     62.4  17.0
word2vec-LR       0.258    0.397     0.271         0.405  0.506  0.405   0.439     40.5  12.2
BERT-SVM          0.932    0.951     0.941         0.937  0.939  0.937   0.938     93.7   8.7
BERT-NB           0.413    0.412     0.358         0.614  0.717  0.614   0.621     61.4   3.5
BERT-RF           0.966    0.898     0.925         0.930  0.933  0.931   0.930     93.0  39.4
BERT-LR           0.932    0.949     0.939         0.942  0.943  0.942   0.943     94.2  26.1

SPEMC-15K-S
Pipeline        Macro P  Macro R  Macro F1  Micro P/R/F1  Wtd P  Wtd R  Wtd F1  ACC (%)    RT
TF-IDF-SVM        0.952    0.940     0.945         0.983  0.984  0.983   0.983     98.3  59.8
TF-IDF-NB         0.962    0.933     0.945         0.985  0.985  0.985   0.985     98.5   3.4
TF-IDF-RF         0.945    0.941     0.943         0.982  0.945  0.941   0.943     98.2   8.7
TF-IDF-LR         0.953    0.944     0.947         0.983  0.984  0.983   0.983     98.3   2.2
BOW-SVM           0.942    0.937     0.939         0.982  0.982  0.982   0.982     98.2  48.0
BOW-NB            0.809    0.768     0.734         0.932  0.962  0.932   0.930     93.2   3.8
BOW-RF            0.950    0.946     0.948         0.983  0.984  0.983   0.983     98.3   9.0
BOW-LR            0.947    0.948     0.947         0.983  0.984  0.983   0.983     98.3   3.1
word2vec-SVM      0.422    0.557     0.442         0.345  0.479  0.345   0.372     34.5  49.2
word2vec-NB       0.051    0.097     0.067         0.471  0.242  0.471   0.320     47.1   5.0
word2vec-RF       0.775    0.654     0.688         0.792  0.765  0.792   0.758     79.2  25.2
word2vec-LR       0.371    0.462     0.337         0.495  0.487  0.495   0.424     49.5   8.9
BERT-SVM          0.926    0.924     0.925         0.979  0.980  0.979   0.979     97.9   4.0
BERT-NB           0.352    0.367     0.297         0.823  0.833  0.823   0.811     82.3   2.6
BERT-RF           0.931    0.895     0.908         0.974  0.977  0.974   0.974     97.4   7.8
BERT-LR           0.897    0.924     0.908         0.979  0.979  0.979   0.979     97.9  16.1

Despite working with lower dimensionality vectors, the word embedding techniques do not improve on the processing time of the term-frequency models for spam email classification. For instance, the BOW and TF-IDF combinations with LR are remarkably faster than the word2vec and BERT pipelines.
The combinations of SVM and LR with BERT, which is an encoder based on deep learning, are very close to the term frequency encoders. However, they do not yield the best results and, due to their higher runtime (4 ms for BERT-SVM and 16.1 ms for BERT-LR against 2.2 ms for TF-IDF-LR), we do not select them for this application.

We present the confusion matrices of the highest-accuracy models for both languages in Figure 6. The classes with the worst accuracy are Service in the English dataset, confused with Health and Other, and Identity Fraud in the Spanish dataset, mislabelled as Academic Media, Other and Work Offer. In addition, due to its thematic variety, the category Other presents major confusion with the other categories in both languages.

The per-class accuracy for the pipelines on the English dataset is shown in Figure 7. These graphics show that the classes Health, Other and Service are the ones with the lowest performance across the 16 pipelines. These three classes contain emails with a similar writing style, which typically describes a problem, its relevance and a proposed solution. This email structure causes the emails to share similar words in their content, varying only the thematic words. The category Other comprises a wide range of products and tricks, which results in not having a set of specific words to define it. This feature leads this category to intersect with the rest of the classes. The category Service has a small number of examples, compared to Health and Other, so the pipelines based on term frequency might not distinguish among them. There are classes with high accuracy in most of the pipelines: Extortion Hacking, Pharmacy, Sexual Content Dating and Work Offer. One of the reasons explaining this high performance in Sexual Content Dating is that it is the most representative class, with 7924 emails. However, the remaining categories have a small number of emails; consequently, the reason might be a robust set of representative words or a repetitive email structure, for frequency-based and context-based text representation models, respectively. Finally, Identity Fraud is classified with high accuracy in most pipelines, with BOW being the text representation technique with the best performance. This might emphasise the importance of word counts to detect identity fraud in English emails.

Fig. 6: Confusion matrices of the highest performance pipelines per language in terms of accuracy (%): (a) English pipeline TF-IDF with LR, (b) Spanish pipeline TF-IDF with NB. We use the following acronyms: Academic Media (AM), Extortion Hacking (EH), Fake Reward (FR), Health (H), Identity Fraud (IF), Money Making (MM), Other (O), Pharmacy (P), Service (S), Sexual Content Dating (SCD) and Work Offer (WO).

Regarding the text representation models, the term frequency algorithms (TF-IDF and BOW) achieve similar performance in every class. It is worth highlighting the low accuracy of the BOW-NB combination in Academic Media. The long extension of these emails implies more words which, together with the NB assumption of independence between features, might produce confusion with other classes containing long emails. Also, the TF-IDF-NB combination obtained a low performance in the Pharmacy class.
The word2vec pipelines improve their performance in classes whose content is repetitive across most emails, such as Extortion Hacking, Money Making or Sexual Content Dating. BERT pipelines, except for the NB model, outperform the other ones in Academic Media, which contains long scientific emails with formal expressions and phrase constructions.

Figure 8 depicts the accuracy per class for the Spanish pipelines. Academic Media, Identity Fraud, Other and Service are the classes with the worst overall results. This confusion might be explained by reasons similar to those for English. In particular, the class Other also contains a wide range of topics that might not define it robustly. In the Spanish dataset, Identity Fraud contains emails focused on company impersonation, which might interfere with the class Service. Something similar happens between Academic Media and Service: Spanish Academic Media emails involve many training courses from universities or academies that have similarities with Service emails. On the other hand, the classes Extortion Hacking, Money Making, Pharmacy, Sexual Content Dating and Work Offer generally achieve a high performance. As with the English pipelines, the number of examples, a well-defined representative set of words and a similar email structure might be the reasons. As in English, the word2vec-based combinations obtain higher results in classes which contain emails with a repetitive structure, such as Money Making or Extortion Hacking. The BOW pipelines also stand out in detecting the Identity Fraud class, which might indicate that word counting is also important for Spanish multi-classification. Nevertheless, the performance in Spanish is lower than in English. This fact might be related to the smaller number of examples and to a kind of fraud different from the English hoaxes. Moreover, the combination of BOW with NB decreases its performance in three classes: Academic Media, Fake Reward and Other. Although the emails are short in these cases, the reasons might be the same as in English.

Since the class imbalance may negatively affect the performance of our models, we have carried out an analysis of alternatives to try to overcome this issue. Our datasets contain approximately 15K spam emails quite unevenly distributed over eleven classes. Besides the weighted class approach, we have evaluated two well-known combinations of over- and under-sampling: (i) random over- and under-sampling, and (ii) the Synthetic Minority Oversampling Technique (SMOTE) [13] along with NearMiss [54] as undersampler. We applied the aforementioned over-/under-sampling strategies to balance the dataset, resulting in a dataset with approximately the same total size of 15K spam emails. We show the comparison of our previous results using the class weight approach with both over-/under-sampling strategies in Table 4. We can observe that applying a class weighted method or a sampling strategy guided by SMOTE and NearMiss outperforms the solution achieved with a random sampling strategy to balance the dataset.
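A hedged sketch of the SMOTE plus NearMiss alternative evaluated here is shown below, using the imbalanced-learn library; the default sampling ratios are an illustrative assumption, since the paper only states that the resampled set keeps approximately the original 15K size.

    # Sketch: combined over-sampling (SMOTE) and under-sampling (NearMiss) with
    # imbalanced-learn, applied before training a classifier; X_tfidf and labels
    # are assumed to be the TF-IDF features and class labels from Section 5.1.
    from imblearn.over_sampling import SMOTE
    from imblearn.under_sampling import NearMiss
    from imblearn.pipeline import Pipeline as ImbPipeline
    from sklearn.linear_model import LogisticRegression

    resample_and_train = ImbPipeline([
        ("over", SMOTE()),                      # synthesise minority-class examples
        ("under", NearMiss()),                  # remove majority-class examples
        ("clf", LogisticRegression(C=1000, max_iter=120)),
    ])
    resample_and_train.fit(X_tfidf, labels)

With an imblearn pipeline, the resampling is applied only to the training folds during cross-validation, which avoids leaking synthetic samples into the test folds.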
Table 4: Performance of the three imbalance-handling alternatives applied to the highest-performance models, TF-IDF-LR (English) and TF-IDF-NB (Spanish), reporting Precision (P), Recall (R) and F1 score with macro, micro and weighted averages, together with accuracy (ACC).

TF-IDF-LR trained on SPEMC-15K-E
                               Macro   Micro   Weighted   ACC (%)
  Class weight             P   0.971   0.946   0.949
                           R   0.939   0.946   0.946      94.6
                           F1  0.953   0.946   0.947
  Random over-/under-      P   0.922   0.914   0.922
  sampling                 R   0.914   0.914   0.914      91.4
                           F1  0.914   0.914   0.914
  SMOTE + NearMiss         P   0.971   0.967   0.971
                           R   0.967   0.967   0.967      96.7
                           F1  0.967   0.967   0.967

TF-IDF-NB trained on SPEMC-15K-S
                               Macro   Micro   Weighted   ACC (%)
  Class weight             P   0.962   0.985   0.985
                           R   0.933   0.985   0.985      98.5
                           F1  0.945   0.985   0.985
  Random over-/under-      P   0.978   0.978   0.978
  sampling                 R   0.978   0.978   0.978      97.8
                           F1  0.978   0.978   0.978
  SMOTE + NearMiss         P   0.968   0.967   0.968
                           R   0.967   0.967   0.967      96.7
                           F1  0.967   0.967   0.967

Additionally, we compared our proposed method with the class detection technique used in [44], which matches the most frequent words of each category. The spam keywords used are the 18 most frequent words presented in Section 3. This technique obtained an accuracy of 66.7% on SPEMC-15K-E and 95.9% on SPEMC-15K-S. Although it performs remarkably well on the Spanish dataset, the poor results on the English dataset show that it depends strongly on the similarity among emails within the same dataset. Our proposal outperforms it in both languages, with an accuracy of 94.6% on SPEMC-15K-E and 98.5% on SPEMC-15K-S.

6. Conclusions

In this work, we addressed the problem of spam email analysis following a multi-class classification approach, focusing on the needs of a cybersecurity institute. With the aim of extracting relevant information from massive amounts of spam, we categorised spam emails using NLP and machine learning techniques. To the best of our knowledge, our work is among the first to address spam topics with the purpose of carrying out an advanced analysis of their content.

We created SPEMC-15K-E and SPEMC-15K-S, two novel datasets which contain 14479 English and 14992 Spanish spam emails, respectively. We semi-automatically labelled them into eleven spam categories according to their topic, using hierarchical clustering first and, later, manual inspection supported by cybersecurity experts. Additionally, we detected the spammer trick known as hidden text or “salting” inside some emails. We handled it by rendering the HTML email as an image and then extracting with an OCR the textual content that is actually visible to the user. We also addressed image-based spam attached to emails by extracting the text embedded in the images with an OCR. This approach mitigates the confusion suffered by both spam filters and spam multi-classifiers.

In order to categorise spam into eleven classes, we assessed the combination of four text representation techniques (frequency-based and word-embedding-based) with four traditional machine learning algorithms, resulting in 16 pipelines, and we recommend the best combination for each language. We evaluated each pipeline in terms of macro precision, recall and F1 score, as well as accuracy and run time. Most pipelines achieved high overall performance, but the frequency-based text representation models, TF-IDF and BOW, generally outperformed the word embedding models in the spam multi-classification task, while also being lighter and faster.
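For reference, the evaluation protocol described above can be reproduced with the sketch below, which computes the macro, micro and weighted averages of precision, recall and F1 score, together with accuracy and the average prediction time per email; `pipeline`, `X_test` and `y_test` are the illustrative names introduced in the first sketch, not artefacts released with the paper.

```python
import time
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Time the end-to-end prediction (vectorisation + classification) per email
start = time.perf_counter()
y_pred = pipeline.predict(X_test)
per_email_ms = 1000 * (time.perf_counter() - start) / len(X_test)

for avg in ("macro", "micro", "weighted"):
    p, r, f1, _ = precision_recall_fscore_support(y_test, y_pred, average=avg)
    print(f"{avg:>8}:  P={p:.3f}  R={r:.3f}  F1={f1:.3f}")
print(f"ACC={100 * accuracy_score(y_test, y_pred):.1f}%  "
      f"avg. prediction time per email={per_email_ms:.2f} ms")
```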
Considering the F1 score and accuracy metrics, the combination of TF-IDF with LR obtained the highest performance on SPEMC-15K-E, with an F1 score of 0.953 and an accuracy of 94.6%, while TF-IDF with NB obtained the highest performance on SPEMC-15K-S, with an F1 score of 0.945 and an accuracy of 98.5%. Regarding the run time per email, we also recommend the combination of TF-IDF with LR for real-time English and Spanish spam multi-classification: it classifies an English email into one of the eleven classes in 2 ms and a Spanish email in 2.2 ms, on average.

For the next stage of our research, we are interested in looking for more relevant features, in alternatives that allow NB to handle negative feature values, and in testing pre-trained word2vec models to improve their performance. Moreover, testing lighter models such as ALBERT or ELECTRA, which may enhance both accuracy and run time, is part of our immediate future work. The experimental results encourage us to study the characteristics of each class in depth, in order to detect patterns that help identify campaigns against companies and against citizens' privacy and security. We will also seek associations between classes and spammer tricks, which may help identify the organisations behind such campaigns.

7. Acknowledgments

This work was supported by the framework agreement between the Universidad de León and INCIBE (Spanish National Cybersecurity Institute) under Addendum 01.

References

[1] Abasi, A.K., Khader, A.T., Al-Betar, M.A., Naim, S., Makhadmeh, S.N., Alyasseri, Z.A.A., 2020. Link-based multi-verse optimizer for text documents clustering. Applied Soft Computing 87, 106002. doi:https://doi.org/10.1016/j.asoc.2019.106002.
[2] Aizawa, A., 2003. An information-theoretic perspective of tf–idf measures. Information Processing & Management 39, 45–65. doi:https://doi.org/10.1016/S0306-4573(02)00021-3.
[3] Al-Nabki, M.W., Fidalgo, E., Alegre, E., Fernández-Robles, L., 2019. Torank: Identifying the most influential suspicious domains in the tor network. Expert Systems with Applications 123, 212–226. doi:https://doi.org/10.1016/j.eswa.2019.01.029.
[4] AlMahmoud, R.H., Hammo, B., Faris, H., 2020. A modified bond energy algorithm with fuzzy merging and its application to arabic text document clustering. Expert Systems with Applications 159, 113598. doi:https://doi.org/10.1016/j.eswa.2020.113598.
[5] Bahgat, E.M., Rady, S., Gad, W., Moawad, I.F., 2018. Efficient email classification approach based on semantic methods. Ain Shams Engineering Journal 9, 3259–3269. doi:https://doi.org/10.1016/j.asej.2018.06.001.
[6] Barushka, A., Hajek, P., 2018. Spam filtering using integrated distribution-based balancing approach and regularized deep neural networks. Applied Intelligence 48, 3538–3556. doi:https://doi.org/10.1007/s10489-018-1161-y.
[Figure 7 shows eleven bar charts, one per category: (a) Academic Media, (b) Extortion Hacking, (c) Fake Reward, (d) Health, (e) Identity Fraud, (f) Money Making, (g) Pharmacy, (h) Service, (i) Sexual Content Dating, (j) Work Offer, (k) Other. Each chart compares the accuracy (%) of the 16 pipelines on the English dataset.]
Fig. 7: Performance of every English pipeline per category in terms of accuracy (%).
[Figure 8 shows eleven bar charts, one per category: (a) Academic Media, (b) Extortion Hacking, (c) Fake Reward, (d) Health, (e) Identity Fraud, (f) Money Making, (g) Pharmacy, (h) Service, (i) Sexual Content Dating, (j) Work Offer, (k) Other. Each chart compares the accuracy (%) of the 16 pipelines on the Spanish dataset.]
Fig. 8: Performance of every Spanish pipeline per category in terms of accuracy (%).

[7] Bergholz, A., Paass, G., Reichartz, F., Strobel, S., Iais, F., Birlinghoven, S., Moens, M.F., Witten, B., 2008. Detecting known and new salting tricks in unwanted emails. CEAS, 9.
[8] Bhowmick, A., Hazarika, S.M., 2018. E-mail spam filtering: A review of techniques and trends. Advances in Electronics, Communication and Computing 443, 583–590. doi:https://doi.org/10.1007/978-981-10-4765-7_61.
[9] Biswas, R., Fidalgo, E., Alegre, E., 2017. Recognition of service domains on tor dark net using perceptual hashing and image classification techniques, in: 8th International Conference on Imaging for Crime Detection and Prevention (ICDP 2017), pp. 7–12.
[10] Biswas, R., González-Castro, V., Fidalgo, E., Alegre, E., 2020.
Perceptual image hashing based on frequency dominant neighborhood structure applied to tor domains recognition. Neurocomputing 383, 24–38. doi:https://doi.org/10.1016/j.neucom.2019.11.065.
[11] Breiman, L., 2001. Random forests. Machine Learning 45, 5–32. doi:https://doi.org/10.1023/A:1010933404324.
[12] Chavda, A., Potika, K., Troia, F.D., Stamp, M., 2018. Support Vector Machines for Image Spam Analysis. ICETE, 597–607. doi:https://doi.org/10.5220/0006921404310441.
[13] Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P., 2002. SMOTE: Synthetic minority over-sampling technique. J. Artif. Int. Res. 16, 321–357.
[14] Cohen, A., Nissim, N., Elovici, Y., 2018a. Novel set of general descriptive features for enhanced detection of malicious emails using machine learning methods. Expert Systems with Applications 110, 143–169. doi:https://doi.org/10.1016/j.eswa.2018.05.031.
[15] Cohen, Y., Hendler, D., Rubin, A., 2018b. Detection of malicious webmail attachments based on propagation patterns. Knowledge-Based Systems 141, 67–79. doi:https://doi.org/10.1016/j.knosys.2017.11.011.
[16] Colladon, A.F., Gloor, P.A., 2019. Measuring the impact of spammers on e-mail and twitter networks. International Journal of Information Management 48, 254–262. doi:https://doi.org/10.1016/j.ijinfomgt.2018.09.009.
[17] Cortes, C., Vapnik, V., 1995. Support-vector networks. Machine Learning 20, 273–297. doi:https://doi.org/10.1007/BF00994018.
[18] Cox, D.R., 1958. The regression analysis of binary sequences. Journal of the Royal Statistical Society: Series B (Methodological) 20, 215–232. doi:https://doi.org/10.1111/j.2517-6161.1958.tb00292.x.
[19] Dada, E.G., Bassi, J.S., Chiroma, H., Abdulhamid, S.M., Adetunmbi, A.O., Ajibuwa, O.E., 2019. Machine learning for email spam filtering: review, approaches and open research problems. Heliyon 5, e01802. doi:https://doi.org/10.1016/j.heliyon.2019.e01802.
[20] de Campos, L.M., Fernández-Luna, J.M., Huete, J.F., Redondo-Expósito, L., 2020. Automatic construction of multi-faceted user profiles using text clustering and its application to expert recommendation and filtering problems. Knowledge-Based Systems 190, 105337. doi:https://doi.org/10.1016/j.knosys.2019.105337.
[21] Dedeturk, B.K., Akay, B., 2020a. Spam filtering using a logistic regression model trained by an artificial bee colony algorithm. Applied Soft Computing 91, 18. doi:https://doi.org/10.1016/j.asoc.2020.106229.
[22] Dedeturk, B.K., Akay, B., 2020b. Spam filtering using a logistic regression model trained by an artificial bee colony algorithm. Applied Soft Computing 91, 106229. doi:https://doi.org/10.1016/j.asoc.2020.106229.
[23] Devlin, J., Chang, M., Lee, K., Toutanova, K., 2018. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805.
[24] Dinh, S., Azeb, T., Fortin, F., Mouheb, D., Debbabi, M., 2015. Spam campaign detection, analysis, and investigation. Digital Investigation 12, S12–S21. doi:https://doi.org/10.1016/j.diin.2015.01.006.
[25] Faris, H., Al-Zoubi, A.M., Heidari, A.A., Aljarah, I., Mafarja, M., Hassonah, M.A., Fujita, H., 2019. An intelligent system for spam detection and identification of the most relevant features based on evolutionary random weight networks. Information Fusion 48, 67–83. doi:https://doi.org/10.1016/j.inffus.2018.08.002.
[26] Ferrara, E., 2019. The history of digital spam. Communications of the ACM 62, 82–91. doi:https://www.doi.org/10.1145/3299768.
[27] Fidalgo, E., Alegre, E., Fernández-Robles, L., González-Castro, V., 2019. Classifying suspicious content in tor darknet through semantic attention keypoint filtering. Digital Investigation 30, 12–22. doi:https://doi.org/10.1016/j.diin.2019.05.004.
[28] Gargiulo, F., Silvestri, S., Ciampi, M., De Pietro, G., 2019. Deep neural network for hierarchical extreme multi-label text classification. Applied Soft Computing 79, 125–138. doi:https://doi.org/10.1016/j.asoc.2019.03.041.
[29] Harris, Z.S., 1954. Distributional structure. Word 10, 146–162. doi:https://doi.org/10.1080/00437956.1954.11659520.
[30] Idris, I., Selamat, A., 2014. Improved email spam detection model with negative selection algorithm and particle swarm optimization. Applied Soft Computing 22, 11–27. doi:https://doi.org/10.1016/j.asoc.2014.05.002.
[31] Jain, A.K., Dubes, R.C., 1988. Algorithms for clustering data. Prentice-Hall, Inc.
[32] Jáñez-Martino, F., Fidalgo, E., González-Martínez, S., Velasco-Mata, J., 2020. Classification of spam emails through hierarchical clustering and supervised learning. arXiv:2005.08773.
[33] Li, Q., Guindani, M., Reich, B., Bondell, H., Vannucci, M., 2017. A bayesian mixture model for clustering and selection of feature occurrence rates under mean constraints. Statistical Analysis and Data Mining: The ASA Data Science Journal 10. doi:https://doi.org/10.1002/sam.11350.
[34] Ligthart, A., Catal, C., Tekinerdogan, B., 2021. Analyzing the effectiveness of semi-supervised learning approaches for opinion spam classification. Applied Soft Computing 101, 107023. doi:https://doi.org/10.1016/j.asoc.2020.107023.
[35] Lioma, C., Moens, M.F., Gomez, J.C., Beer, J., Bergholz, A., Paass, G., Horkan, P., 2008. Anticipating hidden text salting in emails, in: Recent Advances in Intrusion Detection, 11th International Symposium, pp. 396–397. doi:https://doi.org/10.1007/978-3-540-87403-4_24.
[36] Mahdavi, S., Rahnamayan, S., Deb, K., Rahnamayan, M., 2019. A knowledge discovery of relationships among dataset entities using optimum hierarchical clustering by de algorithm, in: 2019 IEEE Congress on Evolutionary Computation (CEC), pp. 2761–2770. doi:10.1109/CEC.2019.8789960.
[37] Makkar, A., Kumar, N., Zomaya, A.Y., Dhiman, S., 2020. Spami: A cognitive spam protector for advertisement malicious images. Information Sciences. doi:https://doi.org/10.1016/j.ins.2020.05.113.
[38] Mccallum, A., Nigam, K., 2001. A comparison of event models for naive bayes text classification. Work Learn Text Categ 752. doi:https://doi.org/10.3115/1067807.1067848.
[39] Mekouar, S., 2021. Classifiers selection based on analytic hierarchy process and similarity score for spam identification. Applied Soft Computing 113, 108022. doi:https://doi.org/10.1016/j.asoc.2021.108022.
[40] Metsis, V., Androutsopoulos, I., Paliouras, G., 2006. Spam filtering with naive bayes - which naive bayes?, in: CEAS, Mountain View, CA. pp. 28–69.
[41] Mikolov, T., Chen, K., Corrado, G., Dean, J., 2013a. Efficient estimation of word representations in vector space, in: Bengio, Y., LeCun, Y. (Eds.), 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings, p. 12.
[42] Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J., 2013b.
Distributed representations of words and phrases and their compositionality, in: Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (Eds.), Advances in Neural Information Processing Systems 26. Curran Associates, Inc., pp. 3111–3119.
[43] Mohammad, R.M.A., 2020. A lifelong spam emails classification model. Applied Computing and Informatics, 11. doi:https://doi.org/10.1016/j.aci.2020.01.002.
[44] Murugavel, U., Santhi, R., 2020. Detection of spam and threads identification in e-mail spam corpus using content based text analytics method. Materials Today: Proceedings. doi:https://doi.org/10.1016/j.matpr.2020.04.742.
[45] Naiemi, F., Ghods, V., Khalesi, H., 2019. An efficient character recognition method using enhanced hog for spam image detection. Soft Computing 23, 11759–11774. doi:https://doi.org/10.1007/s00500-018-03728-z.
[46] Oliveira, D.S., Lin, T., Rocha, H., Ellis, D., Dommaraju, S., Yang, H., Weir, D., Marin, S., Ebner, N.C., 2019. Empirical analysis of weapons of influence, life domains, and demographic-targeting in modern spam: an age-comparative perspective. Crime Science 8, 3. doi:https://doi.org/10.1186/s40163-019-0098-8.
[47] Riesco, A., Fidalgo, E., Al-Nabki, M.W., Jáñez-Martino, F., Alegre, E., 2019. Classifying pastebin content through the generation of pastecc labeled dataset, in: 14th International Conference on Hybrid Artificial Intelligent Systems (HAIS), pp. 1–12. doi:https://doi.org/10.1007/978-3-030-29859-3_39.
[48] Ruano-Ordás, D., Fdez-Riverola, F., Méndez, J.R., 2018. Concept drift in e-mail datasets: An empirical study with practical implications. Information Sciences 428, 120–135. doi:https://doi.org/10.1109/ISTEL.2010.5734082.
[49] Saidani, N., Adi, K., Allili, M.S., 2020. A semantic-based classification approach for an enhanced spam detection. Computers & Security 94, 101716. doi:https://doi.org/10.1016/j.cose.2020.101716.
[50] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I., 2017. Attention is all you need. CoRR abs/1706.03762.
[51] Zamil, Y.K., Ali, S.A., Naser, M.A., 2019. Spam image email filtering using K-NN and SVM. International Journal of Electrical and Computer Engineering (IJECE) 9, 245. doi:https://doi.org/10.11591/ijece.v9i1.pp245-254.
[52] Zha, D., Li, C., 2019. Multi-label dataless text classification with topic modeling. Knowledge and Information Systems 61, 137–160. doi:https://doi.org/10.1007/s10115-018-1280-0.
[53] Zhang, C., Chen, X., bang Chen, W., Yang, L., Warner, G., 2009. Spam image clustering for identifying common sources of unsolicited emails. IJDCF 1, 1–20. doi:https://doi.org/10.4018/jdcf.2009070101.
[54] Zhang, J., Mani, I., 2003. KNN Approach to Unbalanced Data Distributions: A Case Study Involving Information Extraction, in: Proceedings of the ICML'2003 Workshop on Learning from Imbalanced Datasets.