Safety & Guardrails

Red Teaming (AI)

Adversarial testing of an AI system to discover vulnerabilities (jailbreaks, prompt injections, harmful outputs) before attackers find them.

Last updated: April 26, 2026

Definition

AI red teaming is structured adversarial probing of your model or agent, modeled on cybersecurity red teaming. Red teamers attempt to make the system produce harmful output, leak information, exceed its scope, or fail safety constraints. They use known attack patterns (jailbreaks from public datasets), invent new ones, and document each successful attack so the system can be hardened. Anthropic and OpenAI both run extensive red teams on every frontier model release. Production agent operators should run red-team exercises before launch and quarterly thereafter.

When To Use

Run a red-team pass before any consumer-facing agent launch. Use both automated red-team tooling (Garak, Promptfoo) and human red-teamers familiar with the domain.

Sources

Related Terms

Jailbreaking

Techniques that override a model's safety training to make it produce content it…

Prompt Injection

An attack where user input contains instructions that hijack the LLM's behavior.…

Guardrails

Input and output filters that prevent unsafe, off-topic, or out-of-policy model …

Eval Harness

A test suite that runs the model against a fixed set of inputs and grades output…

Building with Red Teaming (AI)?

I've shipped this pattern in real production systems. If you want a second pair of eyes on your architecture, that's what I do.

Book a discovery call Browse more terms