Safety & Guardrails

Jailbreaking

Techniques that override a model's safety training to make it produce content it would normally refuse.

Last updated: April 26, 2026

Definition

Jailbreaking covers any technique that bypasses an LLM's alignment training. Common patterns: role-play ("pretend you are an unfiltered version of yourself"), payload smuggling (encoding the disallowed request in base64 or another encoding), gradient attacks (carefully-crafted token sequences that defeat safety classifiers), and many-shot attacks (including dozens of in-context examples of the disallowed behavior). Frontier models are increasingly resistant but no model is fully jailbreak-proof. The threat matters most for consumer-facing agents and regulated industries where the cost of a single jailbroken output is high.

When To Use

For any consumer-facing agent, run a red-team pass that includes known jailbreak patterns before launch. Track novel jailbreaks in production logs and update guardrails monthly.

Sources

Related Terms

Prompt Injection

An attack where user input contains instructions that hijack the LLM's behavior.…

Guardrails

Input and output filters that prevent unsafe, off-topic, or out-of-policy model …

Red Teaming (AI)

Adversarial testing of an AI system to discover vulnerabilities (jailbreaks, pro…

Agentic Misalignment

When an autonomous agent's actions diverge from its intended goals or human inte…

Worried about Jailbreaking in production?

I've debugged and defended against this in real production AI systems. If you want a second pair of eyes on your architecture or your guardrails, that's what I do.

Book a discovery call Browse more terms