DOI: 10.1145/3319535.3363271

Poster: Adversarial Examples for Hate Speech Classifiers

Published: 06 November 2019

Abstract

With the advent of the Internet, social media platforms have become an increasingly popular medium of communication. Platforms like Twitter and Quora allow people to express their opinions on a large scale. These platforms are, however, plagued by hate speech and toxic content, which is generally sexist, homophobic, or racist. Automatic text classification can filter out toxic content to some extent. In this paper, we discuss adversarial attacks on hate speech classifiers. We demonstrate that by changing the text slightly, a classifier can be fooled into misclassifying a toxic comment as acceptable. We attack hate speech classifiers with known attacks and also introduce four new attacks. We find that our method can degrade the performance of a Random Forest classifier by 20%. We hope that our work sheds light on the vulnerabilities of text classifiers and opens doors for further research on this topic.
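The idea that a slight change to the text can flip a classifier's decision can be illustrated with a toy example. The sketch below is not one of the poster's attacks or classifiers: the blocklist, the `is_toxic` rule, and the `perturb` function are all hypothetical stand-ins chosen only to show the general mechanism of a character-level perturbation that stays readable to a human but evades a token-matching filter.

```python
# Toy illustration only: the blocklist, the scoring rule, and the perturbation
# are assumptions made for this sketch, not the poster's classifiers or attacks.

BLOCKLIST = {"idiot", "stupid"}  # hypothetical stand-ins for toxic vocabulary
PUNCT = ".,!?"


def is_toxic(text: str) -> bool:
    """Flag a comment if any whitespace-separated token is on the blocklist."""
    tokens = (tok.strip(PUNCT).lower() for tok in text.split())
    return any(tok in BLOCKLIST for tok in tokens)


def perturb(text: str) -> str:
    """Insert a period after the first character of each blocklisted token."""
    out = []
    for tok in text.split():
        core = tok.strip(PUNCT).lower()
        if core in BLOCKLIST and len(tok) > 1:
            tok = tok[0] + "." + tok[1:]
        out.append(tok)
    return " ".join(out)


original = "You are a stupid idiot!"
adversarial = perturb(original)

print(is_toxic(original))     # classifier decision on the original comment
print(adversarial)            # perturbed comment, still readable to a human
print(is_toxic(adversarial))  # classifier decision after the perturbation
```

A learned classifier such as a Random Forest over word features would not be a literal blocklist, but a perturbation of this kind can have a similar effect there, because a modified token no longer maps to the feature the model learned.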

Published In

CCS '19: Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security
November 2019
2755 pages
ISBN: 9781450367479
DOI: 10.1145/3319535
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. adversarial machine learning
  2. hate speech

Qualifiers

  • Poster

Conference

CCS '19
Acceptance Rates

CCS '19 Paper Acceptance Rate: 149 of 934 submissions, 16%
Overall Acceptance Rate: 1,261 of 6,999 submissions, 18%

Article Metrics

  • Downloads (Last 12 months): 45
  • Downloads (Last 6 weeks): 5
Reflects downloads up to 25 Nov 2024

Cited By

  • (2024) A systematic literature review of hate speech identification on Arabic Twitter data: research challenges and future directions. PeerJ Computer Science 10 (e1966). DOI: 10.7717/peerj-cs.1966. Online publication date: 2-Apr-2024
  • (2024) Hate speech detection in social media: Techniques, recent trends, and future challenges. WIREs Computational Statistics 16:2. DOI: 10.1002/wics.1648. Online publication date: 11-Mar-2024
  • (2023) Adversarial NLP for Social Network Applications: Attacks, Defenses, and Research Directions. IEEE Transactions on Computational Social Systems 10:6 (3089-3108). DOI: 10.1109/TCSS.2022.3218743. Online publication date: Dec-2023
  • (2023) Twitter Hate Speech Detection: A Systematic Review of Methods, Taxonomy Analysis, Challenges, and Opportunities. IEEE Access 11 (16226-16249). DOI: 10.1109/ACCESS.2023.3239375. Online publication date: 2023
  • (2022) Chinese Spam Detection Using a Hybrid BiGRU-CNN Network with Joint Textual and Phonetic Embedding. Electronics 11:15 (2418). DOI: 10.3390/electronics11152418. Online publication date: 3-Aug-2022
  • (2021) Hidden Backdoors in Human-Centric Language Models. Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security (3123-3140). DOI: 10.1145/3460120.3484576. Online publication date: 12-Nov-2021
  • (2020) Graph-based methods to detect hate speech diffusion on Twitter. Proceedings of the 12th IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (502-506). DOI: 10.1109/ASONAM49781.2020.9381473. Online publication date: 7-Dec-2020
