Research article · Open access
DOI: 10.1145/3630106.3658913

Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation

Published: 05 June 2024

Abstract

With text-to-image (T2I) generative AI models reaching wide audiences, it is critical to evaluate model robustness against non-obvious attacks to mitigate the generation of offensive images. By focusing on “implicitly adversarial” prompts (those that trigger T2I models to generate unsafe images for non-obvious reasons), we isolate a set of difficult safety issues that human creativity is well-suited to uncover. To this end, we built the Adversarial Nibbler Challenge, a red-teaming methodology for crowdsourcing a diverse set of implicitly adversarial prompts. We have assembled a suite of state-of-the-art T2I models, employed a simple user interface to identify and annotate harms, and engaged diverse populations to capture long-tail safety issues that may be overlooked in standard testing. We present an in-depth account of our methodology, a systematic study of novel attack strategies and safety failures, and a visualization tool for easy exploration of the dataset. The first challenge round resulted in over 10k prompt-image pairs with machine annotations for safety. A subset of 1.5k samples contains rich human annotations of harm types and attack styles. Our findings emphasize the necessity of continual auditing and adaptation as new vulnerabilities emerge. This work will enable proactive, iterative safety assessments and promote responsible development of T2I models.
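The abstract describes a two-tier pipeline: crowdsourced prompt-image pairs first receive machine annotations for safety, and a subset then receives human labels for harm types and attack styles. Below is a minimal, hypothetical sketch of what such a record and its triage step might look like. The schema, field names, label values, and the 0.5 threshold are illustrative assumptions, not the authors' actual dataset format.

```python
# A minimal, hypothetical sketch of how a record in a crowdsourced
# red-teaming dataset like Adversarial Nibbler might be stored and triaged.
# All field names, label values, and the 0.5 threshold are illustrative
# assumptions; they are not taken from the paper or its released dataset.
from dataclasses import dataclass, field
from typing import List


@dataclass
class PromptImageRecord:
    prompt: str                   # implicitly adversarial text prompt
    image_id: str                 # identifier of the generated image
    model_name: str               # which T2I model produced the image
    machine_unsafe_score: float   # automated safety score in [0, 1]
    harm_types: List[str] = field(default_factory=list)     # human labels, e.g. "violent imagery"
    attack_styles: List[str] = field(default_factory=list)  # human labels, e.g. "visual synonym"


def needs_human_annotation(record: PromptImageRecord, threshold: float = 0.5) -> bool:
    """Flag records the machine annotator considers unsafe but that have
    not yet received human harm-type labels."""
    return record.machine_unsafe_score >= threshold and not record.harm_types


if __name__ == "__main__":
    record = PromptImageRecord(
        prompt="a sleeping horse lying in a pool of red paint",  # benign wording, unsafe image
        image_id="img-0001",
        model_name="example-t2i-model",
        machine_unsafe_score=0.82,
    )
    print(needs_human_annotation(record))  # True: route to the human annotation queue
```

The released dataset defines its own fields and harm taxonomy; the sketch only illustrates the machine-then-human annotation structure the abstract describes.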


Cited By

  • Adversarial attacks and defenses on text-to-image diffusion models: A survey. Information Fusion 114 (Feb 2025), 102701. https://doi.org/10.1016/j.inffus.2024.102701
  • (Un)designing AI for Mental and Spiritual Wellbeing. In Companion Publication of the 2024 Conference on Computer-Supported Cooperative Work and Social Computing (Nov 2024), 117–120. https://doi.org/10.1145/3678884.3689132


Published In

FAccT '24: Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency
June 2024
2580 pages
ISBN: 9798400704505
DOI: 10.1145/3630106
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Adversarial Testing
  2. Crowdsourcing
  3. Data-centric AI
  4. Red teaming
  5. Text-to-image

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

FAccT '24

