Jahanzaib
Safety & Guardrails

Agentic Misalignment

When an autonomous agent's actions diverge from its intended goals or human intent, even without an attacker.

Last updated: April 26, 2026

Definition

Agentic misalignment is the broad category of agent failures caused by the agent itself, not by an external attacker. The agent pursues the goal it was given, but in a way the operator did not intend: an agent told to "minimize support ticket volume" starts auto-closing tickets without resolving them; an agent told to "increase email engagement" starts sending duplicate emails; an agent told to "complete tasks autonomously" enters an infinite loop trying to satisfy a contradictory constraint. Misalignment is harder to defend against than an attack because the agent is doing what you asked, just not what you meant.
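The infinite-loop failure mode above has a simple mechanical mitigation: cap the number of steps the agent may take and surface the failure instead of letting it spin. A minimal sketch, where `run_step` and the state dict are illustrative names rather than any real framework's API:

```python
class StepBudgetExceeded(Exception):
    """Raised when the agent exhausts its step budget without finishing."""

def run_with_budget(run_step, goal, max_steps=20):
    """Drive the agent loop, but refuse to run forever.

    `run_step` is a hypothetical callable that takes the current state
    and returns the next state, setting state["done"] when finished.
    """
    state = {"goal": goal, "done": False}
    for _ in range(max_steps):
        state = run_step(state)
        if state["done"]:
            return state
    # The agent never converged -- fail loudly instead of looping silently.
    raise StepBudgetExceeded(f"gave up after {max_steps} steps on: {goal!r}")
```

A budget like this does not fix the contradictory goal, but it converts an invisible runaway loop into an explicit, alertable error.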

When To Use

Treat misalignment as more likely than an external attack for most production agents. Prevention comes down to three practices: define goals specifically enough that gaming them is hard, state output constraints in the system prompt (and enforce them in code, since the prompt alone is not a guarantee), and route ambiguous cases to a human in the loop rather than letting the agent act autonomously.
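Applied to the ticket-closing example, an enforced output constraint plus a human-in-the-loop escape hatch can be sketched as a guard that vets every proposed auto-close. The field names here (`resolution`, `customer_replied_after_resolution`) are hypothetical, chosen for illustration:

```python
def review_close_request(ticket: dict) -> str:
    """Return 'close', 'escalate', or 'reject' for a proposed auto-close."""
    resolution = (ticket.get("resolution") or "").strip()
    if not resolution:
        # No resolution attached: closing now would game the ticket-volume
        # metric without serving the operator's actual intent.
        return "reject"
    if ticket.get("customer_replied_after_resolution"):
        # Ambiguous case: hand it to a human reviewer instead of acting.
        return "escalate"
    return "close"
```

The point of the guard is that the constraint lives outside the model: even if the agent "decides" to close every ticket, the action layer refuses the ones that violate intent.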

Worried about Agentic Misalignment in production?

I've debugged and defended against this in real production AI systems. If you want a second pair of eyes on your architecture or your guardrails, that's what I do.