Abstract
Lack of moderation in online conversations may result in personal aggression, harassment or cyberbullying. Such hostility is usually expressed through profanity or abusive language. Building on this assumption, Google recently developed a machine-learning model to detect hostility within a comment. The model assesses the extent to which abusive language is poisoning a conversation, assigning a "toxicity" score to the comment. Unfortunately, it has been suggested that this toxicity model can be deceived by adversarial attacks that manipulate the text sequence of the abusive language. In this paper we aim to counter this vulnerability. First, we characterise two types of adversarial attacks, one using obfuscation and the other using polarity transformations. Then, we propose a two-stage approach to disarm such attacks by coupling a text deobfuscation method with the toxicity scoring model. The approach was validated on a dataset of approximately 24,000 distorted comments, showing that it is feasible to restore the toxicity score of the adversarial variants. We anticipate that combining machine learning and text pattern recognition methods operating on different layers of linguistic features will help foster aggression-safe online conversations despite the adversarial challenges inherent in the versatile nature of written language.
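As a rough, self-contained illustration of the two-stage idea, the sketch below pairs a hypothetical character-level deobfuscator with a toy lexicon scorer standing in for the toxicity model; neither component reproduces the methods actually used in the paper, which rely on a full deobfuscation system and Google's learned scoring model.

```python
import re

# Stage 1: undo simple character-level obfuscations (a hypothetical,
# minimal normaliser; the paper's deobfuscation method is not reproduced).
LEET_MAP = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a",
                          "5": "s", "7": "t", "@": "a", "$": "s"})

def deobfuscate(text: str) -> str:
    """Map common homoglyphs back and drop separators injected inside words."""
    text = text.translate(LEET_MAP)
    # e.g. "i.d.i.o.t" -> "idiot", "s-t-u-p-i-d" -> "stupid"
    return re.sub(r"(?<=\w)[.\-_*](?=\w)", "", text)

# Stage 2: score the restored text. A toy lexicon stands in here for the
# machine-learning toxicity model (e.g. Google's Perspective API).
ABUSIVE = {"idiot", "stupid", "moron"}

def toxicity_score(text: str) -> float:
    words = text.lower().split()
    return sum(w in ABUSIVE for w in words) / max(len(words), 1)

comment = "you are an i.d.i.o.t"
print(toxicity_score(comment))               # 0.0: obfuscation evades scoring
print(toxicity_score(deobfuscate(comment)))  # 0.25: toxicity restored
```

The key design point is the ordering: the deobfuscation stage operates on surface character patterns before the scoring stage sees the text, so the learned model is never asked to generalise over obfuscated variants it was not trained on.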
References
Dale, R.: NLP in a post-truth world. Nat. Lang. Eng. 23(2), 319–324 (2017)
Hosseinmardi, H.: Survey of computational methods in cyberbullying research. In: Proceedings of the First International Workshop on Computational Methods for CyberSafety. ACM, New York (2016)
Burnap, P., Williams, M.L.: Us and them: identifying cyber hate on Twitter across multiple protected characteristics. EPJ Data Sci. 5(1), 11 (2016)
Nobata, C., Tetreault, J., Thomas, A., Mehdad, Y., Chang, Y.: Abusive language detection in online user content. In: Proceedings of the 25th International Conference on World Wide Web (2016)
Wulczyn, E., Thain, N., Dixon, L.: Ex machina: personal attacks seen at scale. arXiv preprint arXiv:1610.08914, February 2017
Hosseini, H., Kannan, S., Zhang, B., Poovendran, R.: Deceiving Google's Perspective API built for detecting toxic comments. arXiv preprint arXiv:1702.08138, February 2017
Rojas-Galeano, S.: On obstructing obscenity obfuscation. ACM Trans. Web 11(2), 12:1–12:24 (2017). https://doi.org/10.1145/3032963
Laskov, P., Lippmann, R.: Machine learning in adversarial environments. Mach. Learn. 81(2), 115–119 (2010)
Samanta, S., Mehta, S.: Towards crafting text adversarial samples. arXiv preprint arXiv:1707.02812 (2017)
PerspectiveAPI: Jigsaw (2017). https://www.perspectiveapi.com. Accessed 26 May 2018
TextPatrolAPI: TPLabs (2017). https://api.textpatrol.tk. Accessed 26 May 2018
Stone, T.E., McMillan, M., Hazelton, M.: Back to swear one: a review of English language literature on swearing and cursing in western health settings. Aggress. Violent Behav. 25, 65–74 (2015)
Hosseinmardi, H., Mattson, S.A., Ibn Rafiq, R., Han, R., Lv, Q., Mishra, S.: Analyzing labeled cyberbullying incidents on the Instagram social network. In: Liu, T.Y., Scollon, C., Zhu, W. (eds.) Social Informatics. LNCS, vol. 9471, pp. 49–66. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-27433-1_4
Appendix. Original Comments
Table 3 shows the original aggressive comments extracted from the GP Website [10] together with the toxicity scores obtained at the beginning of this study (note that, since GP continuously refines its model by learning from new examples, these scores may have varied over time). The terms triggering toxicity are indicated in bold type and were identified as explained in Sect. 2.4.
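For reference, a toxicity score for a single comment can be requested from the Perspective API roughly as sketched below; the endpoint and payload follow the public API documentation, API_KEY is a placeholder, and scores returned today may differ from those in Table 3 because the model is continuously retrained.

```python
import json
import urllib.request

# Placeholder credentials; substitute a real Perspective API key.
API_KEY = "YOUR_API_KEY"
URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
       "comments:analyze?key=" + API_KEY)

def perspective_toxicity(comment: str) -> float:
    """Request the TOXICITY summary score for a single comment."""
    body = json.dumps({
        "comment": {"text": comment},
        "languages": ["en"],
        "requestedAttributes": {"TOXICITY": {}},
    }).encode("utf-8")
    req = urllib.request.Request(
        URL, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)
    return result["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
```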
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Rodriguez, N., Rojas-Galeano, S. (2018). Fighting Adversarial Attacks on Online Abusive Language Moderation. In: Figueroa-García, J., López-Santana, E., Rodriguez-Molano, J. (eds) Applied Computer Sciences in Engineering. WEA 2018. Communications in Computer and Information Science, vol 915. Springer, Cham. https://doi.org/10.1007/978-3-030-00350-0_40
DOI: https://doi.org/10.1007/978-3-030-00350-0_40
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00349-4
Online ISBN: 978-3-030-00350-0
eBook Packages: Computer Science (R0)