DOI: 10.1145/3308558.3313431
research-article

Context-Sensitive Malicious Spelling Error Correction

Published: 13 May 2019

Abstract

Malicious misspellings alter specific keywords in order to thwart automated applications for cyber-environment control, such as harassing-content detection on the Internet and email spam detection. In this paper, we focus on malicious spelling correction, which requires an approach that relies on both the context and the surface forms of the targeted keywords. In the context of two applications, profanity detection and email spam detection, we show that malicious misspellings seriously degrade performance. We then propose a context-sensitive approach to malicious spelling correction that uses word embeddings, and demonstrate its superior performance compared to state-of-the-art spell checkers.
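The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea it names: combining the surface form of a token with its context through word embeddings. Everything in it is an assumption made for illustration, not the authors' method: the toy two-dimensional vectors in `EMBEDDINGS`, the helper names `edit_distance`, `cosine`, and `correct`, the edit-distance cutoff `max_dist`, and the mixing weight `alpha`.

```python
import math

# Toy 2-d embeddings, invented for this sketch (not taken from the paper).
EMBEDDINGS = {
    "free": [0.9, 0.1],
    "money": [0.8, 0.2],
    "win": [0.85, 0.15],
    "tree": [0.1, 0.9],
    "forest": [0.2, 0.8],
    "leaf": [0.15, 0.85],
}


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv) if nu and nv else 0.0


def correct(token, context, vocab=EMBEDDINGS, max_dist=2, alpha=0.5):
    """Return the vocabulary word that best matches `token`, scoring each
    candidate by surface similarity and by similarity to the context."""
    if token in vocab:
        return token  # already a known word, nothing to correct
    ctx = [vocab[w] for w in context if w in vocab]
    mean = [sum(dim) / len(ctx) for dim in zip(*ctx)] if ctx else None
    best, best_score = token, -1.0
    for cand, vec in vocab.items():
        dist = edit_distance(token, cand)
        if dist > max_dist:
            continue  # surface form too different to be a candidate
        surface = 1 - dist / max(len(token), len(cand))
        ctx_sim = cosine(vec, mean) if mean else 0.0
        score = alpha * surface + (1 - alpha) * ctx_sim
        if score > best_score:
            best, best_score = cand, score
    return best


print(correct("7ree", ["win", "money"]))    # free
print(correct("7ree", ["forest", "leaf"]))  # tree
```

The obfuscated token "7ree" is one edit away from both "free" and "tree", so surface form alone cannot separate them; in this toy setup the context vectors break the tie, which is the kind of case a purely edit-distance-based spell checker would struggle with.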

Information

Published In

WWW '19: The World Wide Web Conference
May 2019
3620 pages
ISBN: 9781450366748
DOI: 10.1145/3308558
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

In-Cooperation

  • IW3C2: International World Wide Web Conference Committee

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Malicious spelling correction
  2. cyberbullying
  3. machine learning

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

WWW '19: The Web Conference
May 13 - 17, 2019
San Francisco, CA, USA

Acceptance Rates

Overall Acceptance Rate: 1,899 of 8,196 submissions (23%)
