Computer Science > Artificial Intelligence

arXiv:2311.09473 (cs)

[Submitted on 16 Nov 2023]

Title: JAB: Joint Adversarial Prompting and Belief Augmentation

Authors: Ninareh Mehrabi, Palash Goyal, Anil Ramakrishna, Jwala Dhamala, Shalini Ghosh, Richard Zemel, Kai-Wei Chang, Aram Galstyan, Rahul Gupta

Abstract: With the recent surge of language models in different applications, attention to safety and robustness of these models has gained significant importance. Here we introduce a joint framework in which we simultaneously probe and improve the robustness of a black-box target model via adversarial prompting and belief augmentation using iterative feedback loops. This framework utilizes an automated red teaming approach to probe the target model, along with a belief augmenter to generate instructions for the target model to improve its robustness to those adversarial probes. Importantly, the adversarial model and the belief generator leverage the feedback from past interactions to improve the effectiveness of the adversarial prompts and beliefs, respectively. In our experiments, we demonstrate that such a framework can reduce toxic content generation both in dynamic cases where an adversary directly interacts with a target model and static cases where we use a static benchmark dataset to evaluate our model.
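
The joint feedback loop described in the abstract can be pictured roughly as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the abstract does not say how responses are scored, how feedback is selected, or how beliefs are passed to the target, so every name here (`jab_loop`, `toy_target`, `toy_scorer`, `toy_adversary`, `toy_augmenter`) is hypothetical, and the toy components are trivial stand-ins for real models.

```python
from typing import Callable, List, Tuple

# Hypothetical type aliases; none of these come from the paper.
History = List[Tuple[str, float]]  # (text, toxicity score) pairs used as feedback


def jab_loop(
    target: Callable[[str, str], str],          # (belief, prompt) -> response
    scorer: Callable[[str], float],             # response -> toxicity in [0, 1]
    propose_prompt: Callable[[History], str],   # adversary, conditioned on past feedback
    propose_belief: Callable[[History], str],   # belief augmenter, conditioned on past feedback
    rounds: int,
) -> List[float]:
    """Run an assumed probe-and-defend loop and return the toxicity score per round."""
    prompt_history: History = []
    belief_history: History = []
    belief = ""  # assumption: the target starts with no belief/instruction
    scores: List[float] = []
    for _ in range(rounds):
        prompt = propose_prompt(prompt_history)   # adversary probes the target
        response = target(belief, prompt)         # black-box call, belief supplied alongside
        toxicity = scorer(response)
        scores.append(toxicity)
        # Both sides record the same outcome; the adversary would prefer high
        # scores and the augmenter low ones.
        prompt_history.append((prompt, toxicity))
        belief_history.append((belief, toxicity))
        belief = propose_belief(belief_history)   # augmenter updates the instruction
    return scores


# Toy stand-ins so the sketch runs end to end (not real models).
def toy_target(belief: str, prompt: str) -> str:
    return "polite reply" if "respectful" in belief else "rude reply to " + prompt


def toy_scorer(response: str) -> float:
    return 1.0 if "rude" in response else 0.0


def toy_adversary(history: History) -> str:
    return f"attack #{len(history)}"


def toy_augmenter(history: History) -> str:
    last_belief, last_score = history[-1]
    # Strengthen the instruction only if the last round was scored as toxic.
    return "Be respectful." if last_score > 0.5 else last_belief


if __name__ == "__main__":
    print(jab_loop(toy_target, toy_scorer, toy_adversary, toy_augmenter, rounds=3))
```

With these toy components the first round is scored as toxic and later rounds are not, which mirrors the intended effect only in the loosest sense; a real setup would replace each stand-in with a language model or classifier.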

Subjects:	Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2311.09473 [cs.AI]
	(or arXiv:2311.09473v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2311.09473

Submission history

From: Ninareh Mehrabi
[v1] Thu, 16 Nov 2023 00:35:54 UTC (126 KB)
