Context-Sensitive Malicious Spelling Error Correction

Authors:

Pramod Viswanath

WWW '19: The World Wide Web Conference

Pages 2771 - 2777

https://doi.org/10.1145/3308558.3313431

Published: 13 May 2019

Abstract

Malicious misspellings alter specific keywords to thwart automated applications for cyber-environment control, such as harassing-content detection on the Internet and email spam detection. In this paper, we focus on malicious spelling correction, which requires an approach that relies on both the context and the surface forms of the targeted keywords. For two applications, profanity detection and email spam detection, we show that malicious misspellings seriously degrade performance. We then propose a context-sensitive approach to malicious spelling correction that uses word embeddings, and we demonstrate that it outperforms state-of-the-art spell checkers.
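The abstract gives no algorithmic detail, so the following is only a minimal sketch of how context and surface form might be combined, not the authors' method. The toy embedding table `TOY_EMBEDDINGS`, the helper `correct`, the edit-distance cutoff, and the scoring rule (cosine similarity to the mean context vector minus a small per-edit penalty) are all assumptions made for illustration.

```python
import math


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv) if nu and nv else 0.0


# Hypothetical 2-d embeddings, invented for this example only.
TOY_EMBEDDINGS = {
    "free": [0.9, 0.1], "money": [1.0, 0.0], "win": [0.8, 0.2],
    "tree": [0.0, 1.0], "forest": [0.1, 0.9], "green": [0.2, 0.8],
}


def correct(token, context, embeddings=TOY_EMBEDDINGS,
            max_dist=2, edit_penalty=0.1):
    """Pick the in-vocabulary word that is both close in surface form
    (edit distance <= max_dist) and similar to the context words."""
    ctx_vecs = [embeddings[w] for w in context if w in embeddings]
    if token in embeddings or not ctx_vecs:
        return token  # known word, or no usable context: leave unchanged
    dim = len(ctx_vecs[0])
    ctx = [sum(v[k] for v in ctx_vecs) / len(ctx_vecs) for k in range(dim)]
    best, best_score = token, float("-inf")
    for cand, vec in embeddings.items():
        dist = edit_distance(token, cand)
        if dist > max_dist:
            continue  # surface form too different to be a plausible target
        score = cosine(vec, ctx) - edit_penalty * dist  # assumed scoring rule
        if score > best_score:
            best, best_score = cand, score
    return best


print(correct("fr3e", ["win", "money"]))     # spam-like context
print(correct("fr3e", ["green", "forest"]))  # nature-like context
```

With these toy vectors the same obfuscated token resolves to different words depending on its neighbours, which is the behaviour a purely edit-distance-based spell checker cannot reproduce.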

Published In

WWW '19: The World Wide Web Conference, May 2019, 3620 pages

ISBN: 9781450366748

DOI: 10.1145/3308558

Editors: Ling Liu (Georgia Tech, USA), Ryen White (Microsoft Research, USA)

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

In-Cooperation

IW3C2: International World Wide Web Conference Committee

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

WWW '19: The Web Conference

May 13 - 17, 2019

San Francisco, CA, USA

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
