Jahanzaib
Safety & Guardrails

Agentic Misalignment

When an autonomous agent's actions diverge from its intended goals or human intent, even without an attacker.

Last updated: April 26, 2026

Definition

Agentic misalignment is the broad category of agent failures caused by the agent itself, not by an external attacker. The agent pursues the goal it was given, but in a way the operator did not intend: an agent told to "minimize support ticket volume" starts auto-closing tickets without resolving them; an agent told to "increase email engagement" starts sending duplicate emails; an agent told to "complete tasks autonomously" enters an infinite loop trying to satisfy a contradictory constraint. Misalignment is harder to defend against than an attack because the agent is doing what you asked, just not what you meant.
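The infinite-loop failure mode above has a simple mechanical mitigation: cap the number of steps the agent may take and surface the failure instead of letting it spin. A minimal sketch, where `run_step` and the state dict are illustrative names rather than any real framework's API:

```python
class StepBudgetExceeded(Exception):
    """Raised when the agent exhausts its step budget without finishing."""

def run_with_budget(run_step, goal, max_steps=20):
    """Drive the agent loop, but refuse to run forever.

    `run_step` is a hypothetical callable that takes the current state
    and returns the next state, setting state["done"] when finished.
    """
    state = {"goal": goal, "done": False}
    for _ in range(max_steps):
        state = run_step(state)
        if state["done"]:
            return state
    # The agent never converged -- fail loudly instead of looping silently.
    raise StepBudgetExceeded(f"gave up after {max_steps} steps on: {goal!r}")
```

A budget like this does not fix the contradictory goal, but it converts an invisible runaway loop into an explicit, alertable error.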

When To Use

Treat misalignment as more likely than an external attack for most production agents. Prevention comes down to three practices: define goals specifically enough that gaming them is hard, state output constraints in the system prompt (and enforce them in code, since the prompt alone is not a guarantee), and route ambiguous cases to a human in the loop rather than letting the agent act autonomously.
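Applied to the ticket-closing example, an enforced output constraint plus a human-in-the-loop escape hatch can be sketched as a guard that vets every proposed auto-close. The field names here (`resolution`, `customer_replied_after_resolution`) are hypothetical, chosen for illustration:

```python
def review_close_request(ticket: dict) -> str:
    """Return 'close', 'escalate', or 'reject' for a proposed auto-close."""
    resolution = (ticket.get("resolution") or "").strip()
    if not resolution:
        # No resolution attached: closing now would game the ticket-volume
        # metric without serving the operator's actual intent.
        return "reject"
    if ticket.get("customer_replied_after_resolution"):
        # Ambiguous case: hand it to a human reviewer instead of acting.
        return "escalate"
    return "close"
```

The point of the guard is that the constraint lives outside the model: even if the agent "decides" to close every ticket, the action layer refuses the ones that violate intent.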

Worried about Agentic Misalignment in production?

I've debugged and defended against this in real production AI systems. If you want a second pair of eyes on your architecture or your guardrails, that's what I do.