Jailbreaking
Techniques that override a model's safety training to make it produce content it would normally refuse.
Last updated: April 26, 2026
Definition
Jailbreaking covers any technique that bypasses an LLM's alignment training. Common patterns: role-play ("pretend you are an unfiltered version of yourself"), payload smuggling (encoding the disallowed request in base64 or another encoding), gradient attacks (carefully-crafted token sequences that defeat safety classifiers), and many-shot attacks (including dozens of in-context examples of the disallowed behavior). Frontier models are increasingly resistant but no model is fully jailbreak-proof. The threat matters most for consumer-facing agents and regulated industries where the cost of a single jailbroken output is high.
When To Use
For any consumer-facing agent, run a red-team pass that includes known jailbreak patterns before launch. Track novel jailbreaks in production logs and update guardrails monthly.
Related Terms
Worried about Jailbreaking in production?
I've debugged and defended against this in real production AI systems. If you want a second pair of eyes on your architecture or your guardrails, that's what I do.