arXiv:2401.09002 (cs)

[Submitted on 17 Jan 2024 (v1), last revised 3 Aug 2024 (this version, v5)]

Title: AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models

Authors: Dong Shu, Mingyu Jin, Chong Zhang, Liangyao Li, Zihao Zhou, Yongfeng Zhang

Abstract: Ensuring the security of large language models (LLMs) against attacks has become increasingly urgent, with jailbreak attacks representing one of the most sophisticated threats. To deal with such risks, we introduce an innovative framework that can help evaluate the effectiveness of jailbreak attacks on LLMs. Unlike traditional binary evaluations focusing solely on the robustness of LLMs, our method assesses the effectiveness of the attacking prompts themselves. We present two distinct evaluation frameworks: a coarse-grained evaluation and a fine-grained evaluation. Each framework uses a scoring range from 0 to 1, offering unique perspectives and allowing for the assessment of attack effectiveness in different scenarios. Additionally, we develop a comprehensive ground truth dataset specifically tailored for jailbreak prompts. This dataset serves as a crucial benchmark for our current study and provides a foundational resource for future research. By comparing with traditional evaluation methods, our study shows that the current results align with baseline metrics while offering a more nuanced and fine-grained assessment. It also helps identify potentially harmful attack prompts that might appear harmless in traditional evaluations. Overall, our work establishes a solid foundation for assessing a broader range of attack prompts in the area of prompt injection.

Comments:	34 pages, 6 figures
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2401.09002 [cs.CL]
	(or arXiv:2401.09002v5 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2401.09002

Submission history

From: Mingyu Jin
[v1] Wed, 17 Jan 2024 06:42:44 UTC (7,692 KB)
[v2] Tue, 13 Feb 2024 02:20:31 UTC (7,823 KB)
[v3] Wed, 20 Mar 2024 14:08:39 UTC (7,823 KB)
[v4] Wed, 31 Jul 2024 06:46:44 UTC (1,465 KB)
[v5] Sat, 3 Aug 2024 06:39:25 UTC (1,465 KB)
