Jahanzaib
Safety & Guardrails

Indirect Prompt Injection

Attack where malicious instructions are hidden in external content (web pages, emails, documents) the agent reads, not in the user's message.

Last updated: April 26, 2026

Definition

Indirect prompt injection is the hardest class of LLM attack to defend against. Instead of the attacker putting "ignore previous instructions" directly into the user prompt, they hide the same instructions in content the agent will retrieve and read: a poisoned web page the agent browses, a document in a shared drive, an email forwarded to the agent for processing, a comment in a code repository the agent reads. When the agent reads that content, it treats the embedded instructions as authoritative. Production agents that read untrusted external content are at constant risk of this. OWASP lists it as LLM01 in the LLM Top 10.

Defenses are layered. First, structural: wrap retrieved content in clearly delimited XML tags and instruct the model to treat it as data, not instructions. Second, model-side filtering: AWS Bedrock Guardrails and similar services include prompt-attack detectors trained specifically for indirect injection patterns. Third, action-space restriction: agents that read untrusted content should have minimal tool access (read-only, not write/delete). Fourth, human approval for high-stakes actions: an agent that reads emails and processes refunds must require human review before issuing money. No single defense is sufficient; layer all four.

When To Use

Assume any agent that reads external content (web, email, documents) faces indirect prompt injection attempts. Layer defenses; never trust a single guardrail.

Sources

Related Terms

Building an agent that reads external content?

Indirect prompt injection is the hardest class of attack to defend against. If your agent processes emails, web pages, or shared docs, let me audit your guardrails before it ships.