Safety & Guardrails

Agent Hijacking

Attack where an adversary manipulates an agent's context, memory, or tool usage to redirect its behavior toward attacker goals.

Last updated: April 26, 2026

Definition

Agent hijacking is the umbrella term for attacks that take over an agent's decision-making. It covers indirect prompt injection (poisoning content the agent reads), memory poisoning (planting false facts in the agent's long-term memory), tool poisoning (compromising one of the tools the agent calls so it returns malicious results), and social engineering of the agent (convincing it via dialogue to act outside its scope). The defining property of hijacking attacks: the user does not need to type anything malicious. The hijack happens through the agent's normal information-gathering behavior.

When To Use

Threat-model agent hijacking before launching any agent that reads external content or has long-term memory across sessions. The attack surface grows with autonomy.

Sources

Related Terms

Indirect Prompt Injection

Attack where malicious instructions are hidden in external content (web pages, e…

Prompt Injection

An attack where user input contains instructions that hijack the LLM's behavior.…

Least Privilege (Agent)

Security principle that an agent should have only the minimum permissions, tools…

Guardrails

Input and output filters that prevent unsafe, off-topic, or out-of-policy model …

Worried about Agent Hijacking in production?

I've debugged and defended against this in real production AI systems. If you want a second pair of eyes on your architecture or your guardrails, that's what I do.

Book a discovery call Browse more terms