JailbreakBench (24.04)
• Rule-based. The rule-based judge from Zou et al. (2023) based on string matching,
• GPT-4. The GPT-4-0613 model used as a judge (OpenAI, 2023),
• HarmBench. The Llama-2-13B judge introduced in HarmBench (Mazeika et al., 2024),
• Llama Guard. An LLM safeguard model fine-tuned from Llama-2-7B (Inan et al., 2023),
• Llama Guard 2. An LLM safeguard model fine-tuned from Llama-3-8B (Llama Team, 2024),
• Llama-3-70B. The recent Llama-3-70B (AI@Meta, 2024) used as a judge with a custom prompt.
比较了上述 6 种评判方式,最终采用 Llama-3-70B 作为评判模型
PastTense (24.07)
汇报了三种评判方式的结果,GPT4、JailbreakBench的Llama-3-70B、GCG的 Rule-based
Jailbreak_GPT4o (24.06)
汇报了四种评价方式的结果