Computer Science > Cryptography and Security

arXiv:2405.19360 (cs)

[Submitted on 24 May 2024 (v1), last revised 11 Oct 2024 (this version, v3)]

Title: ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users

Authors: Guanlin Li, Kangjie Chen, Shudong Zhang, Jie Zhang, Tianwei Zhang

Abstract: Large-scale pre-trained generative models are taking the world by storm, owing to their ability to generate creative content. Meanwhile, safeguards for these generative models have been developed to protect users' rights and safety, most of which are designed for large language models. Existing methods primarily focus on jailbreak and adversarial attacks, which mainly evaluate the model's safety under malicious prompts. Recent work found that manually crafted safe prompts can unintentionally trigger unsafe generations. To systematically evaluate the safety risks of text-to-image models, we propose a novel Automatic Red-Teaming framework, ART. Our method leverages both a vision language model and a large language model to establish a connection between unsafe generations and their prompts, thereby identifying the model's vulnerabilities more efficiently. Through comprehensive experiments, we reveal the toxicity of popular open-source text-to-image models. The experiments also validate the effectiveness, adaptability, and great diversity of ART. Additionally, we introduce three large-scale red-teaming datasets for studying the safety risks associated with text-to-image models. Datasets and models can be found at this https URL.

Comments:	Accepted by NeurIPS 2024
Subjects:	Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2405.19360 [cs.CR]
	(or arXiv:2405.19360v3 [cs.CR] for this version)
	https://doi.org/10.48550/arXiv.2405.19360

Submission history

From: GuanLin Li
[v1] Fri, 24 May 2024 07:44:27 UTC (2,674 KB)
[v2] Mon, 17 Jun 2024 10:00:33 UTC (2,674 KB)
[v3] Fri, 11 Oct 2024 06:52:18 UTC (4,246 KB)
