A Discrete Hidden Markov Model for SMS Spam Detection
Figure 1. Directed graph of hidden Markov model.
Figure 2. The discrete hidden Markov model for spam detection.
Figure 3. (a) Observation state distribution in ham messages set; (b) observation state distribution in spam messages set.
Figure 4. Baum–Welch algorithm.
Figure 5. Training workflow.
Figure 6. Short messaging service (SMS) classification workflow.
Abstract
1. Introduction
- We propose, for the first time, a hidden Markov model for spam SMS detection based on word order. The method exploits word-order information, which is of key importance in human language but is ignored by many traditional methods based on the bag-of-words (BoW) model.
- This research addresses the problem that TF-IDF word weighting performs poorly in SMS spam detection because term frequencies in short messages are extremely low.
- The proposed method can be applied to alphabetic text (e.g., English) and logographic text (e.g., Chinese). It is not language-sensitive.
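The low-term-frequency issue can be illustrated with a minimal sketch. It assumes the plain weighting tf × log(N / df), which is one common TF-IDF variant and not necessarily the one examined in this work; the toy messages and the function name `tf_idf` are hypothetical. When every word occurs at most once per message, the tf factor is always 1, so the weight collapses to the idf term alone.

```python
import math
from collections import Counter

def tf_idf(messages):
    """Return one {word: weight} dict per message using raw tf * log(N / df)."""
    n = len(messages)
    tokenized = [m.lower().split() for m in messages]
    # Document frequency: number of messages that contain each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return weights

# Hypothetical toy messages; every word occurs at most once per message.
toy = ["win a free prize now", "call me when you are free", "see you at lunch"]
weights = tf_idf(toy)
for message_weights in weights:
    print(message_weights)
```

In this toy set the weight of each word equals its idf value, so the term-frequency part of the weighting carries no information about the individual message.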
2. Related Work
2.1. Rule-Based Filtering Technologies
2.2. Content-Filtering Technologies
2.3. Hidden Markov Model for Spam Detection
3. The Discrete HMM for SMS Spam Detection
3.1. Problem Formulation and Notations
3.2. Observation States and Observation Sequence
3.3. Label Each Word in Observation Sequence for HMM Learning
3.4. Observation Probability Distribution
- Some words have a nonzero probability in one dataset and a probability of zero in the other, which indicates that they appear only in the spam messages set or only in the ham messages set;
- The probabilities of many words differ considerably between the two datasets, from which it can be inferred that these words appear in both sets but with very different word frequencies;
- Only a very small portion of the words appear in both sets with similar frequency.
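The three groups above can be reproduced on any labelled corpus with a short sketch. The function names, the toy messages, and the `ratio` threshold that separates "quite different" from "even" frequencies are illustrative assumptions, not values taken from this work.

```python
from collections import Counter

def word_distribution(messages):
    """Relative frequency of each word over all tokens in a message set."""
    counts = Counter(w for m in messages for w in m.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def categorize(ham_messages, spam_messages, ratio=2.0):
    """Group words by how their probability differs between the two sets.

    `ratio` is a hypothetical threshold: shared words whose probabilities
    differ by at least this factor are counted as skewed.
    """
    p_ham = word_distribution(ham_messages)
    p_spam = word_distribution(spam_messages)
    groups = {"ham_only": [], "spam_only": [], "skewed": [], "even": []}
    for w in sorted(set(p_ham) | set(p_spam)):
        ph, ps = p_ham.get(w, 0.0), p_spam.get(w, 0.0)
        if ps == 0.0:
            groups["ham_only"].append(w)
        elif ph == 0.0:
            groups["spam_only"].append(w)
        elif max(ph, ps) / min(ph, ps) >= ratio:
            groups["skewed"].append(w)
        else:
            groups["even"].append(w)
    return groups

# Hypothetical toy corpus.
ham = ["see you at lunch", "call me at lunch", "are you free"]
spam = ["call now to win free prize", "free prize win now call free"]
groups = categorize(ham, spam)
print(groups)
```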
3.5. HMM Learning
3.6. SMS Property Prediction
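Prediction with a discrete HMM usually involves decoding the most likely hidden-state sequence for the observed words. The following is a minimal Viterbi sketch under stated assumptions: two hidden states (ham and spam), hypothetical toy parameters rather than parameters learned in this work, and a small probability floor for unseen words. It is a generic illustration of the decoding step, not the exact procedure of the proposed method.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p, floor=1e-12):
    """Most likely hidden-state path for a discrete HMM, computed in log space."""
    def log(p):
        return math.log(max(p, floor))  # floor avoids log(0) for unseen symbols

    # best[t][s]: log-probability of the best path that ends in state s at step t.
    best = [{s: log(start_p[s]) + log(emit_p[s].get(obs[0], 0.0)) for s in states}]
    back = []
    for o in obs[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: best[-1][r] + log(trans_p[r][s]))
            pointers[s] = prev
            scores[s] = best[-1][prev] + log(trans_p[prev][s]) + log(emit_p[s].get(o, 0.0))
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Hypothetical toy parameters (not learned from any dataset).
states = ("ham", "spam")
start_p = {"ham": 0.5, "spam": 0.5}
trans_p = {"ham": {"ham": 0.9, "spam": 0.1}, "spam": {"ham": 0.1, "spam": 0.9}}
emit_p = {
    "ham": {"lunch": 0.4, "call": 0.3, "you": 0.2, "free": 0.1},
    "spam": {"win": 0.4, "free": 0.3, "call": 0.2, "prize": 0.1},
}

spam_path = viterbi(["win", "free", "prize"], states, start_p, trans_p, emit_p)
ham_path = viterbi(["call", "you", "lunch"], states, start_p, trans_p, emit_p)
print(spam_path, ham_path)
```

How the decoded states are combined into a single spam or ham decision for the whole message is a separate design choice that this sketch leaves open.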
3.7. The Workflow of the Discrete HMM for SMS Spam Detection
3.7.1. Data Preparation and HMM Learning
3.7.2. SMS Classification
4. Experiment Results and Discussion
4.1. Dataset and Analysis
- 9272 words appear only once in a single SMS, accounting for 93.13% of all words.
- Words that appear three or more times account for only 1.03% of the total.
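These two shares can be checked directly against the counts in the term-frequency table. The sketch below only redoes that arithmetic; the variable names are illustrative. The first share comes out at about 93.14% when rounded to two decimals (93.13% when truncated), so it agrees with the reported figure up to rounding.

```python
# Word counts per term frequency, copied from the term-frequency table.
words_by_tf = {1: 9272, 2: 580, 3: 77, 4: 15, 5: 5, 6: 3, 7: 1, 10: 1, 14: 1}

total_words = sum(words_by_tf.values())
share_once = 100 * words_by_tf[1] / total_words
share_three_plus = 100 * sum(n for tf, n in words_by_tf.items() if tf >= 3) / total_words

print(total_words)
print(f"{share_once:.2f}%  {share_three_plus:.2f}%")
```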
4.2. Evaluation Metrics
4.3. Result of the Discrete HMM on the UCI Repository Dataset
4.4. Result of the Discrete HMM on a Chinese SMS Dataset
4.5. Result Discussions
4.5.1. UCI Repository Dataset Results
4.5.2. Chinese SMS Dataset Results and Non-Language-Sensitivity of the Model
5. Conclusions and Future Work
Author Contributions
Funding
Conflicts of Interest
References
- Portio Research. Worldwide A2P SMS Markets 2014–2017: Understanding and Analysis of Application-to-Person Text Messaging Markets Worldwide; Portio Research Limited: Chippenham, UK, 2014.
- Ezpeleta, E. Short Messages Spam Filtering Combining Personality Recognition and Sentiment Analysis. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 2017, 175–189.
- Statista. A2P and P2P SMS Market Revenue Worldwide from 2017 to 2022 (in Billion U.S. Dollars). Available online: https://www.statista.com/statistics/485153/a2p-sms-market-size-worldwide/ (accessed on 9 July 2020).
- Abdulhamid, S.M.; Abd Latiff, M.S.; Chiroma, H.; Osho, O.; Abdul-Salaam, G.; Abubakar, A.I.; Herawan, T. A Review on Mobile SMS Spam Filtering Techniques. IEEE Access 2017, 5, 15650–15666.
- Arutyunov, V.V. Spam: Its past, present, and future. Sci. Tech. Inf. Process. 2013, 40, 205–211.
- Jiang, L.; Li, C.; Wang, S.; Zhang, L. Deep feature weighting for naive Bayes and its application to text classification. Eng. Appl. Artif. Intell. 2016, 52, 26–39.
- Sable, A.S.; Kalavadekar, P.N. SMS Classification Based on Naive Bayes Classifier and Semi-Supervised Learning. Int. J. Mod. Trends Eng. Res. 2016, 3, 16–25.
- Waheeb, W.; Ghazali, R. Content-based SMS Classification: Statistical Analysis for the Relationship between Number of Features and Classification Performance. Comput. Y Sist. 2017, 21, 771–785.
- Tekerek, A. Support vector machine based spam SMS detection. J. Polytech. 2018, 0900, 779–784.
- Poomka, P.; Pongsena, W.; Kerdprasop, N.; Kerdprasop, K. SMS Spam Detection Based on Long Short-Term Memory and Gated Recurrent Unit. Int. J. Futur. Comput. Commun. 2019, 8, 12–15.
- Roy, P.K.; Singh, J.P.; Banerjee, S. Deep learning to filter SMS Spam. Future Gener. Comput. Syst. 2020, 102, 524–533.
- Serkan, B.; Onur, K. Development of content based SMS classification application by using Word2Vec based feature extraction. IET Softw. 2018, 13, 295–304.
- Barushka, A.; Hajek, P. Spam filtering using integrated distribution-based balancing approach and regularized deep neural networks. Appl. Intell. 2018, 48, 3538–3556.
- Xia, T.; Chai, Y. An improvement to TF: Term distribution based term weight algorithm. J. Softw. 2011, 6, 413–420.
- Rabiner, L.R.; Juang, B.H. An Introduction to Hidden Markov Models. IEEE ASSP Mag. 1986, 3, 4–16.
- Eddy, S.R. What is a hidden Markov model? Nat. Biotechnol. 2004, 22, 1315–1316.
- The Apache SpamAssassin Group. The First Enterprise Open-Source Spam Filter. Available online: http://spamassassin.apache.org/ (accessed on 2 June 2020).
- Ruano-Ordás, D.; Fdez-Glez, J.; Fdez-Riverola, F.; Méndez, J.R. Effective scheduling strategies for boosting performance on rule-based spam filtering frameworks. J. Syst. Softw. 2013, 86, 3151–3161.
- Wang, Y.H.; Wu, I.C. Wirebrush4SPAM: A novel framework for improving efficiency on spam filtering services. Softw. Pract. Exp. 2009, 39, 701–736.
- Xia, T. A Constant Time Complexity Spam Detection Algorithm for Boosting Throughput on Rule-Based Filtering Systems. IEEE Access 2020, 8, 82653–82661.
- Aragão, M.V.C.; Frigieri, E.P.; Ynoguti, C.A.; Paiva, A.P. Factorial design analysis applied to the performance of SMS anti-spam filtering systems. Expert Syst. Appl. 2016, 64, 589–604.
- Ebadati, O.M.E.; Ahmadzadeh, F. Classification Spam Email with Elimination of Unsuitable Features with Hybrid of GA-Naive Bayes. J. Inf. Knowl. Manag. 2019, 18, 1–19.
- Arifin, D.D.; Shaufiah; Bijaksana, M.A. Enhancing spam detection on mobile phone Short Message Service (SMS) performance using FP-growth and Naive Bayes Classifier. In Proceedings of the 2016 IEEE Asia Pacific Conference on Wireless and Mobile (APWiMob), Bandung, Indonesia, 13–15 September 2016; pp. 80–84.
- Santos, I.; Laorden, C.; Sanz, B.; Bringas, P.G. Enhanced Topic-based Vector Space Model for semantics-aware spam filtering. Expert Syst. Appl. 2012, 39, 437–444.
- Chan, P.P.K.; Yang, C.; Yeung, D.S.; Ng, W.W.Y. Spam filtering for short messages in adversarial environment. Neurocomputing 2015, 155, 167–176.
- Zhang, W.; Bu, C.; Yoshida, T.; Zhang, S. CoSpa: A co-training approach for spam review identification with support vector machine. Information 2016, 7, 12.
- Zhang, W.; Bu, C.; Yoshida, T.; Zhang, S. CoFea: A novel approach to spam review identification based on entropy and co-training. Entropy 2016, 18, 429.
- Gashti, M.Z. Detection of Spam Email by Combining Harmony Search Algorithm and Decision Tree. Eng. Technol. Appl. Sci. Res. 2017, 7, 1713–1718.
- Uysal, A.K.; Gunal, S.; Ergin, S.; Gunal, E.S. The Impact of Feature Extraction and Selection on SMS Spam Filtering. Elektronika ir Elektrotechnika 2013, 19, 67–72.
- Karthika, R.D.; Visalakshi, P. Latent Semantic Indexing Based SVM Model for Email Spam Classification. J. Sci. Ind. Res. 2014, 73, 437–442.
- Chandra, A. Spam SMS Filtering using Recurrent Neural Network and Long Short Term Memory. In Proceedings of the 2019 4th International Conference on Information Systems and Computer Networks (ISCON), Mathura, India, 21–22 November 2019; pp. 118–122.
- Yang, H.; Liu, Q.; Zhou, S.; Luo, Y. A spam filtering method based on multi-modal fusion. Appl. Sci. 2019, 9, 1152.
- Zhao, C.; Xin, Y.; Li, X.; Yang, Y.; Chen, Y. A Heterogeneous Ensemble Learning Framework for Spam Detection in Social Networks with Imbalanced Data. Appl. Sci. 2020, 10, 936.
- Sheikhi, S.; Kheirabadi, M.T.; Bazzazi, A. An Effective Model for SMS Spam Detection Using Content-based Features and Averaged Neural Network. Int. J. Eng. 2020, 33, 221–228.
- Liu, J.; Yuan, X. Spam Short Messages Detection via Mining Social Networks. J. Comput. Sci. Technol. 2012, 27, 506–514.
- Saleh, A.J.; Karim, A.; Shanmugam, B.; Azam, S.; Kannoorpatti, K.; Jonkman, M.; De Boer, F. An intelligent spam detection model based on artificial immune system. Information 2019, 10, 209.
- Shang, Y. Consensus of Hybrid Multi-Agent Systems with Malicious Nodes. IEEE Trans. Circuits Syst. II Express Briefs 2020, 67, 685–689.
- Mousas, C.; Anagnostopoulos, C.N. Real-time performance-driven finger motion synthesis. Comput. Graph. 2017, 65, 1–11.
- Mousas, C. Full-body locomotion reconstruction of virtual characters using a single inertial measurement unit. Sensors 2017, 17, 2589.
- Nakagawa, S.; Zhang, W. Text-independent speaker recognition by speaker-specific GMM and speaker adapted syllable-based HMM. In Proceedings of the EUROSPEECH 8th European Conference on Speech Communication and Technology, Geneva, Switzerland, 1–4 September 2003; pp. 3017–3020.
- Niina, G.; Dozono, H. The Spherical Hidden Markov Self Organizing Map for Learning Time Series Data. In Proceedings of the Artificial Neural Networks and Machine Learning – ICANN 2012, Lausanne, Switzerland, 11–14 September 2012; pp. 563–570.
- Okhovvat, M.; Minaei, B. A Hidden Markov Model for Persian Part-of-Speech Tagging. Procedia Comput. Sci. 2011, 3, 977–981.
- Ptaszynski, M.; Momouchi, Y. Part-of-speech tagger for Ainu language based on higher order Hidden Markov Model. Expert Syst. Appl. 2012, 39, 11576–11582.
- Zhang, J.; Shen, D.; Zhou, G.; Su, J.; Tan, C. Enhancing HMM-based biomedical named entity recognition by studying special phenomena. J. Biomed. Inform. 2004, 37, 411–422.
- Hussain, N.; Mirza, H.T.; Rasool, G.; Hussain, I.; Kaleem, M. Spam review detection techniques: A systematic literature review. Appl. Sci. 2019, 9, 987.
- Abayomi-Alli, O.; Misra, S.; Abayomi-Alli, A.; Odusami, M. A review of soft techniques for SMS spam classification: Methods, approaches and applications. Eng. Appl. Artif. Intell. 2019, 86, 197–212.
- Rafique, M.; Farooq, M. SMS Spam Detection by Operating on Byte-Level Distributions Using Hidden Markov Models (HMMs). In Proceedings of the 20th Virus Bulletin International Conference, Vancouver, BC, Canada, 29 September–1 October 2010.
- Gordillo, J.; Conde, E. An HMM for detecting spam mail. Expert Syst. Appl. 2007, 33, 667–682.
- Ebrahimi, N.; Trabelsi, A.; Islam, M.S.; Hamou-Lhadj, A.; Khanmohammadi, K. An HMM-based approach for automatic detection and classification of duplicate bug reports. Inf. Softw. Technol. 2019, 113, 98–109.
- Washha, M.; Qaroush, A.; Mezghani, M.; Sedes, F. A Topic-Based Hidden Markov Model for Real-Time Spam Tweets Filtering. Procedia Comput. Sci. 2017, 112, 833–843.
- Ganesan, V.; Manikandan, M.S.K.; Suresh, M.N. Detection and prevention of spam over Internet telephony in Voice over Internet Protocol networks using Markov chain with incremental SVM. Int. J. Commun. Syst. 2016, 30, e3255.
- Almeida, T.A.; Gomez Hidalgo, J.M.; Silva, T.P. Towards SMS Spam Filtering: Results under a New Dataset. Int. J. Inf. Secur. Sci. 2012, 2, 1–18.
- Adewole, K.S.; Anuar, N.B.; Kamsin, A.; Sangaiah, A.K. SMSAD: A framework for spam message and spam account detection. Multimed. Tools Appl. 2019, 78, 3925–3960.
- Rahmani, H.; Sahli, N.; Kamoun, F. Simple SMS spam filtering on independent mobile phone. Int. J. Secur. Commun. Netw. 2012, 5, 1209–1220.
- Jain, G.; Sharma, M.; Agarwal, B. Spam detection in social media using convolutional and long short term memory neural network. Ann. Math. Artif. Intell. 2019, 85, 21–44.
- Nagwani, N.K.; Sharaff, A. SMS spam filtering and thread identification using bi-level text classification and clustering techniques. J. Inf. Sci. 2017, 43, 75–87.
- Almeida, T.A.; Hidalgo, J.M.G.; Yamakami, A. Contributions to the study of SMS spam filtering: New Collection and Results. In Proceedings of the 11th ACM Symposium on Document Engineering, Mountain View, CA, USA, 19–22 September 2011; pp. 259–262.
- Tagg, C. A Corpus Linguistic Study of SMS Texting. Ph.D. Thesis, University of Birmingham, Birmingham, UK, 2009.
- Forman, G. An Extensive Empirical Study of Feature Selection Metrics for Text Classification. J. Mach. Learn. Res. 2003, 1, 1289–1305.
Class | Number of SMS | Percentage of SMS |
---|---|---|
Spam | 747 | 13.4% |
Ham | 4827 | 86.6% |
Total | 5574 | 100% |
Term Frequency | Number of Words with the Same Term Frequency |
---|---|
1 | 9272 |
2 | 580 |
3 | 77 |
4 | 15 |
5 | 5 |
6 | 3 |
7 | 1 |
8–9 | 0 |
10 | 1 |
11–13 | 0 |
14 | 1 |
Actual | Predicted Negative | Predicted Positive |
---|---|---|
Negative | true negative (TN) | false positive (FP) |
Positive | false negative (FN) | true positive (TP) |
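Assuming the standard definitions, which are not spelled out alongside the table, the evaluation metrics reported in the result tables follow from the four confusion-matrix counts as:

```latex
\begin{align}
A &= \frac{TP + TN}{TP + TN + FP + FN}, &
P &= \frac{TP}{TP + FP}, \\
R &= \frac{TP}{TP + FN}, &
F_1 &= \frac{2 \, P \, R}{P + R}.
\end{align}
```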
Class | Training Dataset | Testing Dataset | Total Number of SMS |
---|---|---|---|
Spam | 493 | 254 | 747 |
Ham | 3186 | 1641 | 4827 |
Percentage | 66% | 34% | 100% |
Model | Actual | Predicted Spam | Predicted Ham | Prediction % (Spam) | Prediction % (Ham) | AUC |
---|---|---|---|---|---|---|
The proposed HMM | Spam | 222 | 50 | 0.892 | 0.031 | 0.900 |
The proposed HMM | Ham | 27 | 1559 | 0.108 | 0.969 | 0.900 |
Model | Class | Accuracy (A) | Precision (P) | Recall (R) | F-Measure (F1) |
---|---|---|---|---|---|
The proposed HMM | Spam | 0.959 | 0.892 | 0.816 | 0.852 |
The proposed HMM | Ham | 0.959 | 0.969 | 0.983 | 0.976 |
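The figures in this table can be recomputed from the counts in the UCI confusion table. The sketch below assumes the standard metric definitions and reads the confusion table with rows as actual classes and columns as predicted classes; the function name `metrics` is illustrative.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for one class treated as positive."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Counts from the UCI confusion table (rows read as actual, columns as predicted).
spam = metrics(tp=222, fp=27, fn=50, tn=1559)   # spam as the positive class
ham = metrics(tp=1559, fp=50, fn=27, tn=222)    # ham as the positive class

spam_rounded = [round(v, 3) for v in spam]
ham_rounded = [round(v, 3) for v in ham]
print(spam_rounded)
print(ham_rounded)
```

Rounded to three decimals, both rows agree with the accuracy, precision, recall, and F-measure values listed above.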
Model | Class | Accuracy (A) | Precision (P) | Recall (R) | F-Measure (F1) |
---|---|---|---|---|---|
NB [56] | overall | 0.842 | 0.95 | 0.972 | 0.87 |
SVM [56] | overall | 0.936 | 0.97 | 0.977 | 0.94 |
NMF [56] | overall | 0.917 | 0.96 | 0.976 | 0.92 |
LDA [56] | overall | 0.904 | 0.96 | 0.976 | 0.92 |
LSTM [11] | Spam | 0.953 | 0.849 | 0.777 | 0.811 |
LSTM [11] | Ham | 0.953 | 0.972 | 0.976 | 0.973 |
CNN [11] | Spam | 0.979 | 0.988 | 0.858 | 0.922 |
CNN [11] | Ham | 0.979 | 0.982 | 0.996 | 0.988 |
The proposed HMM | Spam | 0.959 | 0.892 | 0.816 | 0.852 |
The proposed HMM | Ham | 0.959 | 0.969 | 0.983 | 0.976 |
Class | Training Dataset | Testing Dataset | Total Number of SMS |
---|---|---|---|
Spam | 700 | 300 | 1000 |
Ham | 700 | 300 | 1000 |
Percentage | 70% | 30% | 100% |
Model | Actual | Predicted Spam | Predicted Ham | Prediction % (Spam) | Prediction % (Ham) | AUC |
---|---|---|---|---|---|---|
The proposed HMM | Spam | 293 | 7 | 0.977 | 0.023 | 0.985 |
The proposed HMM | Ham | 2 | 298 | 0.007 | 0.993 | 0.985 |
Model | Class | Accuracy (A) | Precision (P) | Recall (R) | F-Measure (F1) |
---|---|---|---|---|---|
The proposed HMM | Spam | 0.985 | 0.977 | 0.993 | 0.985 |
The proposed HMM | Ham | 0.985 | 0.993 | 0.977 | 0.985 |
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Xia, T.; Chen, X. A Discrete Hidden Markov Model for SMS Spam Detection. Appl. Sci. 2020, 10, 5011. https://doi.org/10.3390/app10145011