X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents
Abstract
Multi-turn interactions with language models (LMs) pose critical safety risks, as harmful intent can be strategically spread across exchanges. Yet, the vast majority of prior work has focused on single-turn safety, while adaptability and diversity remain among the key challenges of multi-turn red-teaming. To address these challenges, we present X-Teaming, a scalable framework that systematically explores how seemingly harmless interactions escalate into harmful outcomes and generates corresponding attack scenarios. X-Teaming employs collaborative agents for planning, attack optimization, and verification, achieving state-of-the-art multi-turn jailbreak effectiveness and diversity with success rates up to 98.1% across representative leading open-weight and closed-source models. In particular, X-Teaming achieves a 96.2% attack success rate against the latest Claude 3.7 Sonnet model, which has been considered nearly immune to single-turn attacks. Building on X-Teaming, we introduce XGuard-Train, an open-source multi-turn safety training dataset that is 20x larger than the previous best resource, comprising 30K interactive jailbreaks, designed to enable robust multi-turn safety alignment for LMs. Our work offers essential tools and insights for mitigating sophisticated conversational attacks, advancing the multi-turn safety of LMs.
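The abstract describes a pipeline of collaborative agents for planning, attack optimization, and verification. The following is a minimal, hypothetical sketch of such a multi-agent red-teaming loop; all function names, the scoring scale, and the stub target model are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class AttackState:
    turns: list = field(default_factory=list)
    score: int = 0  # verifier score, e.g. 1 (refusal) .. 5 (full jailbreak)

def planner(goal: str, num_phases: int = 3) -> list:
    # Plan a benign-looking multi-turn escalation toward the goal
    # (placeholder phases; a real planner would be an LM agent).
    return [f"phase {i + 1} toward: {goal}" for i in range(num_phases)]

def attacker(phase: str, state: AttackState) -> str:
    # Turn the current plan phase into the next conversational prompt.
    return f"turn {len(state.turns) + 1}: {phase}"

def verifier(reply: str) -> int:
    # Placeholder judge: a real verifier would be an LM scoring the reply.
    return 5 if "comply" in reply else 1

def target_model(prompt: str) -> str:
    return "comply"  # stub target; always cooperative in this sketch

def run_attack(goal: str, max_score: int = 5) -> AttackState:
    state = AttackState()
    for phase in planner(goal):
        prompt = attacker(phase, state)
        reply = target_model(prompt)
        state.turns.append((prompt, reply))
        state.score = max(state.score, verifier(reply))
        if state.score >= max_score:
            break  # jailbreak achieved; stop escalating
    return state
```

In the full framework, an unsuccessful turn would additionally trigger attack optimization (revising the prompt or plan) before the next exchange; that feedback step is omitted here for brevity.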
Community
Multi-turn jailbreaks and defenses with adaptive multi-agents.
Project: https://x-teaming.github.io
Paper: https://arxiv.org/abs/2504.13203
Dataset: https://huggingface.co/datasets/marslabucla/XGuard-Train
Code: https://github.com/salman-lui/x-teaming
The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- AutoRedTeamer: Autonomous Red Teaming with Lifelong Attack Integration (2025)
- Siege: Autonomous Multi-Turn Jailbreaking of Large Language Models with Tree Search (2025)
- Strategize Globally, Adapt Locally: A Multi-Turn Red Teaming Agent with Dual-Level Learning (2025)
- Foot-In-The-Door: A Multi-turn Jailbreak for LLMs (2025)
- One-Shot is Enough: Consolidating Multi-Turn Attacks into Efficient Single-Turn Prompts for LLMs (2025)
- Multi-lingual Multi-turn Automated Red Teaming for LLMs (2025)
- Temporal Context Awareness: A Defense Framework Against Multi-turn Manipulation Attacks on Large Language Models (2025)