Tracing
Recording every step an agent takes (LLM calls, tool calls, memory reads, routing decisions) into a structured trace for debugging and audit.
Last updated: April 26, 2026
Definition
A trace is the full execution graph of one agent run: every LLM call with its prompt and response, every tool invocation with its input and output, every memory retrieval, every routing decision. Each trace lives in an observability platform (Langfuse, LangSmith, Phoenix, Datadog) and is queryable by user, session, error class, or latency percentile. Without traces, debugging an agent that "sometimes does the wrong thing" is detective work without evidence. With traces, you reproduce the failing run in seconds.
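To make the structure concrete, here is a minimal sketch of what a trace object might hold. The `Span` and `Trace` classes, field names, and `record` method are all hypothetical, not the schema of any of the platforms named above:

```python
import time
import uuid
from dataclasses import dataclass, field

@dataclass
class Span:
    """One recorded step: an LLM call, tool call, memory read, or routing decision."""
    kind: str          # "llm" | "tool" | "memory" | "routing"
    name: str          # e.g. the model name or tool name
    input: str
    output: str = ""
    started_at: float = field(default_factory=time.time)
    ended_at: float = 0.0

@dataclass
class Trace:
    """The full execution graph of one agent run, keyed by a trace ID."""
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    user_id: str = ""
    session_id: str = ""
    spans: list = field(default_factory=list)

    def record(self, kind: str, name: str, input: str, output: str) -> Span:
        """Append one completed step to the trace."""
        span = Span(kind=kind, name=name, input=input, output=output)
        span.ended_at = time.time()
        self.spans.append(span)
        return span

# Record each step as the agent runs, then query the trace afterwards.
trace = Trace(user_id="u123", session_id="s456")
trace.record("llm", "plan", "user asks for weather", "call weather tool")
trace.record("tool", "get_weather", '{"city": "Oslo"}', '{"temp_c": 4}')

llm_spans = [s for s in trace.spans if s.kind == "llm"]
```

Real platforms add nesting (spans with parent spans) and token/cost accounting on top of this flat shape, but the core idea is the same: every step becomes a queryable record tied to one run.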
The hardest part of tracing is keeping it cheap. Full traces of every agent run can produce gigabytes per day. Two patterns help: tail-based sampling (trace 100 percent of failed runs, 5 to 10 percent of successful ones), and selective field capture (log inputs and outputs but truncate the system prompt that's already cached). For multi-agent systems, propagate a trace ID through every sub-agent so you can stitch the full conversation together at debug time.
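The three cost-control patterns above can be sketched as small helper functions. Everything here is hypothetical: the span dict fields, the `X-Trace-Id` header name, and the truncation length are illustrative choices, not any platform's API:

```python
import random

def should_keep_trace(failed: bool, success_rate: float = 0.05) -> bool:
    """Tail-based sampling: keep every failed run, a small slice of successes."""
    if failed:
        return True                        # trace 100 percent of failures
    return random.random() < success_rate  # trace ~5-10 percent of successes

def capture_fields(span: dict, prompt_prefix: int = 64) -> dict:
    """Selective field capture: keep inputs and outputs whole, but truncate
    the system prompt, which is identical (and already cached) across runs."""
    captured = dict(span)
    prompt = captured.get("system_prompt", "")
    if len(prompt) > prompt_prefix:
        captured["system_prompt"] = prompt[:prompt_prefix] + "...[truncated]"
    return captured

def sub_agent_headers(trace_id: str) -> dict:
    """Trace ID propagation: pass the parent's trace ID to every sub-agent
    so all spans can be stitched into one conversation at debug time."""
    return {"X-Trace-Id": trace_id}  # hypothetical header name
```

The sampling decision should be made once per run at ingest time, not per span, so a run is either fully traced or fully dropped; a half-sampled run is worse than none because it looks like evidence but isn't.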
When To Use
Wire tracing in from day one of any production agent. Retrofitting it after an incident is too late: the failing runs you most need to inspect were never recorded.
Related Terms
Building with Tracing?
I've shipped this pattern in real production systems. If you want a second pair of eyes on your architecture, that's what I do.