Safety & Guardrails

Trust Boundary

The line in your agent system between trusted inputs (system prompt, internal config) and untrusted inputs (user messages, retrieved documents, tool outputs).

Last updated: April 26, 2026

Definition

A trust boundary is where data crosses from a trusted source to a less-trusted one. For LLM agents, the system prompt and your own configured tool definitions are trusted. Everything else, user messages, retrieved RAG content, web pages the agent fetches, results from tool calls, output from sub-agents, is untrusted. The model has no built-in way to distinguish trusted from untrusted text inside the same context window: they are all just tokens. The defense is structural: wrap untrusted content in clearly-delimited tags, instruct the model to treat tagged content as data not instructions, and never let the model take a high-stakes action based purely on untrusted content.

When To Use

Map every trust boundary in your agent's data flow. Wherever untrusted content enters the prompt, mark it explicitly. Audits should walk these boundaries.

Sources

Related Terms

Indirect Prompt Injection

Attack where malicious instructions are hidden in external content (web pages, e…

Prompt Injection

An attack where user input contains instructions that hijack the LLM's behavior.…

Guardrails

Input and output filters that prevent unsafe, off-topic, or out-of-policy model …

Least Privilege (Agent)

Security principle that an agent should have only the minimum permissions, tools…

Building with Trust Boundary?

I've shipped this pattern in real production systems. If you want a second pair of eyes on your architecture, that's what I do.

Book a discovery call Browse more terms