Agentic RAG: The Complete Production Guide Nobody Else Wrote
A practitioner's guide to agentic RAG covering the five-component architecture, chunking strategies, four common failure modes, LLM-as-judge evaluation, and real cost-per-query numbers from 109 production deployments.

Three months into a contract with a mid-sized insurance company, I was sitting across from their CTO watching their "AI knowledge base" answer questions about their own products. The system retrieved the right documents 90% of the time. But on anything involving multi-part questions, comparisons, or anything that required checking two sources together, it fell apart. Their agentic RAG system wasn't agentic at all. It was a fixed pipeline wearing an agent costume, and it was costing them about $4,200 a month in API calls to produce answers that were wrong 62% of the time on complex queries.
That project is what pushed me to formalize what I now call an agentic RAG system the right way. I've since deployed some form of this architecture across 38 of my 109 production AI systems, and the patterns I'm about to share are hard-won. This guide covers what most agentic RAG articles skip: real chunking decisions, embedding model comparisons, the four failure modes that will definitely hit you in production, evaluation methods, and actual cost-per-query numbers. If you want a high-level intro to what RAG is, I wrote a separate guide for business owners. This post is for engineers building the thing.
Key Takeaways
- Agentic RAG replaces fixed retrieve-then-generate pipelines with a loop that routes, retrieves, grades, and self-corrects before answering
- The five core components are Router, Retriever, Grader, Generator, and Hallucination Checker, each can be tuned independently
- Chunk size and embedding model choice have more impact on accuracy than model selection
- Four failure modes kill most first deployments: infinite loops, graders that never reject, context overflow, and latency spirals
- Real production cost per query ranges from $0.02 for simple lookups to $0.31 for complex multi-source reasoning
- Agentic RAG is not always the right choice and I'll give you a clear decision framework for when simpler approaches win
What Traditional RAG Gets Wrong
Standard RAG works like this: a query comes in, you embed it, you pull the top-k chunks from your vector database, you stuff those chunks into a prompt, and you generate an answer. The pipeline is deterministic and linear. That's both its strength and its fatal flaw.
The Fixed Pipeline Problem
The assumption baked into every traditional RAG pipeline is that a single retrieval step produces sufficient context for every possible question. That's almost never true. Consider a user asking: "Compare our cancellation policy for personal auto versus commercial auto, and tell me which has the shorter waiting period." That question requires pulling from at least two separate sections of two separate documents, understanding what "waiting period" means in the context of each policy type, and synthesizing a comparison the original documents never made.
Traditional RAG will retrieve the top-k chunks most similar to the query embedding. Maybe it pulls the right chunks, maybe it doesn't. There's no retry, no grading, no fallback. If the retrieved chunks don't contain the answer, you hallucinate. And you'll never know it happened unless you're running evaluation.
Where I've Seen Standard RAG Break
In my experience, fixed RAG pipelines reliably fail in four scenarios. First, multi-hop questions that require connecting information across documents. Second, questions where the answer depends on recency and your index isn't perfectly current. Third, numerical comparisons where the LLM needs to find and compare specific data points. Fourth, any question where the user's phrasing is far from the language in the source documents, making vector similarity a weak signal. In the insurance project I mentioned, 68% of the failing queries fell into one of these four categories.
What Agentic RAG Actually Does
Agentic RAG turns the pipeline into a loop. Instead of one retrieval step, you have an agent that decides whether to retrieve at all, what to retrieve, whether the retrieved content is good enough, and whether to try again with a different query before generating an answer. The agent controls the entire process.
This isn't just a theoretical improvement. NVIDIA's engineering blog documented accuracy improvements from 34% to 78% on complex multi-hop queries when moving from traditional to agentic retrieval.That's a major shift in what you can actually trust in production.
The Five Component Architecture
Every agentic RAG system I've built uses five core components, regardless of the underlying framework:
1. Router: classifies the incoming query and decides what kind of retrieval, if any, is needed. Some questions don't need retrieval at all (factual questions the LLM already knows well). The router keeps you from burning tokens on unnecessary vector searches.
2. Retriever: executes the actual search against your vector store, SQL database, or other knowledge sources. In multi-agent setups, different retriever agents may handle different knowledge domains in parallel.
3. Grader: evaluates whether the retrieved documents are actually relevant to the question. This is the component most implementations skip, and it's why most agentic RAG systems still fail on edge cases.
4. Generator: synthesizes the final answer using the graded, relevant context. Only runs when the grader says the retrieved content is sufficient.
5. Hallucination Checker: verifies that the generated answer is grounded in the retrieved context, not invented. If it detects fabrication, it routes back to retrieval or flags the query for human review.
Building Agentic RAG with LangGraph
LangGraph is the right tool for implementing this architecture in 2026. Its graph-based state machine maps directly to the agentic loop. You define nodes (the five components), edges (conditional transitions between them), and shared state (the query, retrieved docs, and generated answer flowing through the graph). If you've read my complete guide to building AI agents in production, LangGraph will look familiar.
Here's how the core graph looks in Python:
from langgraph.graph import StateGraph, END
from typing import TypedDict, List
class AgenticRAGState(TypedDict):
query: str
reformulated_query: str
retrieved_docs: List[str]
relevant_docs: List[str]
answer: str
hallucination_detected: bool
retry_count: int
def build_rag_graph():
graph = StateGraph(AgenticRAGState)
graph.add_node("router", router_node)
graph.add_node("retriever", retriever_node)
graph.add_node("grader", grader_node)
graph.add_node("generator", generator_node)
graph.add_node("hallucination_checker", hallucination_checker_node)
graph.set_entry_point("router")
graph.add_conditional_edges("router", route_query, {
"retrieve": "retriever",
"direct_answer": "generator"
})
graph.add_edge("retriever", "grader")
graph.add_conditional_edges("grader", grade_documents, {
"sufficient": "generator",
"insufficient": "retriever" # reformulate and retry
})
graph.add_edge("generator", "hallucination_checker")
graph.add_conditional_edges("hallucination_checker", check_hallucination, {
"grounded": END,
"hallucinated": "retriever"
})
return graph.compile()
The Router Node
The router uses an LLM call (I use a small, fast model here, Claude Haiku or GPT-4o-mini) to classify the query. Don't over-engineer this. A simple prompt asking "Does this question require searching a knowledge base, or can it be answered from general knowledge?" works well for most use cases. I add a third category for queries that should be declined entirely.
def router_node(state: AgenticRAGState) -> AgenticRAGState:
router_prompt = f"""
Classify this query into one of three categories:
- "retrieve": requires searching specific documents or knowledge base
- "direct": can be answered from general knowledge
- "decline": off-topic, harmful, or outside system scope
Query: {state["query"]}
Return only the category word.
"""
result = llm.invoke(router_prompt).content.strip().lower()
state["route"] = result
return state
The Grader Node
The grader is where most implementations cut corners and pay for it. A weak grader that accepts marginally relevant documents will produce hallucinations downstream, because the generator will try to answer from insufficient context. I use binary grading: relevant or not relevant, no middle ground.
def grader_node(state: AgenticRAGState) -> AgenticRAGState:
relevant_docs = []
for doc in state["retrieved_docs"]:
grade_prompt = f"""
Is this document relevant to answering the query?
Query: {state["query"]}
Document: {doc}
Answer with only "relevant" or "irrelevant".
"""
grade = llm.invoke(grade_prompt).content.strip().lower()
if grade == "relevant":
relevant_docs.append(doc)
state["relevant_docs"] = relevant_docs
state["retry_count"] = state.get("retry_count", 0) + 1
return state
def grade_documents(state: AgenticRAGState) -> str:
if len(state["relevant_docs"]) >= 2:
return "sufficient"
if state["retry_count"] >= 3:
return "sufficient" # proceed with what we have, don't loop forever
return "insufficient"
Notice the retry cap at 3. This is critical and I'll come back to it in the failure modes section.
Chunking and Embedding: The Choices That Actually Matter
I've seen engineers spend weeks tuning LangGraph routing logic while ignoring the fact that their chunk size is wrong. Chunking and embedding choice have more impact on retrieval quality than almost anything else in the system. Most articles on agentic RAG skip this entirely. Don't make that mistake.
Why Chunk Size Is Not a Default Setting
The default chunk size in most RAG tutorials is 512 tokens or 1024 tokens. Both numbers are arbitrary. The right chunk size depends entirely on your documents.
For dense technical documentation with short, precise statements: 256 to 512 tokens works well. Larger chunks dilute the embedding signal. For narrative or explanatory content, policy documents, and legal text: 1024 to 2048 tokens. These documents derive meaning from context, and splitting too aggressively loses that. For tabular data or structured records: chunk by row or entity, not by token count at all.
The test I run on every new project: take 50 representative queries, retrieve against 256, 512, and 1024 token chunks, and measure what percentage of the time the correct chunk ranks in the top 3. That number tells you everything. I've seen accuracy jump from 61% to 89% just by changing chunk size from 512 to 256 on a technical API documentation project.
I also use chunk overlap. A 20% overlap between adjacent chunks catches information that spans chunk boundaries. For a 512-token chunk, that's about 100 tokens of overlap. This adds storage cost but meaningfully reduces retrieval gaps.
Choosing Your Embedding Model
The three models I actually use in production are compared below. I'm not listing every available option and I'm only listing the ones I've shipped against real queries at scale.
| Model | Dimensions | Cost per 1M tokens | Best for | Weakness |
|---|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 (reducible) | $0.13 | General purpose, mixed document types | Latency on large batches |
| Cohere embed-v3 | 1024 | $0.10 | Multilingual content, e-commerce | Needs Cohere SDK dependency |
| nomic-embed-text (local) | 768 | $0 (compute only) | Privacy-sensitive data, on-prem | 8K token context limit |
For most projects, I start with text-embedding-3-large and reduce dimensions to 1536 using the dimensions parameter. You get 98% of the quality at half the storage cost. If you're running on healthcare or legal data that can't leave your environment, nomic-embed-text via Ollama runs fine on a single GPU and performs respectably against the paid models on domain-specific text.
One thing I never do: switch embedding models mid-project without re-indexing everything. Different models encode semantic meaning differently. Mixing embeddings from two models in the same vector store breaks similarity search in ways that are hard to debug.
The Four Failure Modes I See in Every First Deployment
These aren't edge cases. They're standard. Every team building their first agentic RAG system hits at least two of them in the first week of production traffic.
1. The Infinite Loop
The grader rejects retrieved documents. The system reformulates the query and tries again. The new retrieval also fails the grader. The system loops. Without a retry cap and loop detection, this runs until you hit your rate limit or your daily cost cap. I saw this cost a client $340 in a single afternoon because one ambiguous user query triggered a loop that ran 87 iterations.
Fix: Hard cap retry count at 3. After 3 failed retrievals, either generate from whatever you have or return a graceful "I don't have sufficient information" response. Never let the graph run without a termination condition. In the code above, I implemented this as if state["retry_count"] >= 3: return "sufficient". You can tune the threshold, but it must exist.
2. The Grader That Never Says No
This is the opposite problem. Your grader accepts everything, relevance scoring becomes meaningless, and the generator tries to synthesize answers from unrelated documents. The symptom is plausible-sounding but wrong answers. These are the most dangerous kind because they pass casual review.
Fix: Test your grader in isolation before integrating it into the graph. Give it 20 known-relevant and 20 known-irrelevant document pairs and measure precision. If it's accepting more than 15% of irrelevant documents, your grading prompt needs work. I add specificity by including the query type in the grading prompt: "Is this document relevant to a question about [classification of query type]?" That context tightens the grader significantly.
3. Context Window Overflow
You retrieve 10 documents, each 2048 tokens, plus a 4000-token system prompt, plus the query. That's 26,000 tokens of context before the generator says a single word. On Claude Sonnet or GPT-4o, you're paying $0.78 per query just for input tokens. On systems with high query volume, that compounds fast. And beyond cost, stuffing a 200,000-token context window doesn't improve accuracy. It degrades it, because attention diffuses across too much content.
Fix: Cap the context sent to the generator. I use a hard limit of 6 retrieved documents, each truncated to 800 tokens of the most relevant passage using a lightweight extraction step. Total context budget for retrieved content: 4800 tokens. This number came from testing on 200 real queries. Going above it produced no accuracy gains while increasing cost and latency significantly.
4. The Latency Spiral
Each node in the graph makes at least one LLM call. A full agentic RAG cycle (router, retriever, grader per doc, generator, hallucination checker) can easily make 8 to 15 LLM calls. At 300ms to 800ms per call, you're looking at 2.4 to 12 seconds of total latency before the user gets an answer. That's fine for async batch processing. It's unacceptable for a real-time chatbot.
Fix: Use the smallest capable model for each node. The router doesn't need GPT-4o. It's making a three-way classification. Claude Haiku or GPT-4o-mini handles this in under 200ms. The grader is also a classification task, not a generation task. Only the generator and hallucination checker need a more capable model. I run a "model tiering" approach: small model for router and grader ($0.001 per call), large model for generator and checker ($0.015 per call). This cuts total latency by 35 to 45% while preserving answer quality.
How to Evaluate Your Agentic RAG System
Most teams skip this step entirely. They test their system manually, say "it looks good," and ship. Then production traffic surfaces edge cases their manual testing never caught. Proper evaluation isn't optional and it's what separates systems you can trust from systems you're constantly firefighting.
The Four Metrics That Actually Matter
Retrieval Recall: what percentage of queries result in at least one relevant document being retrieved? Measure this by building a labeled test set of 100 queries with known ground-truth documents. If retrieval recall is below 85%, your embedding model or chunk size is wrong.
Grader Precision: of the documents your grader marks as relevant, what percentage actually are? Test this in isolation with a held-out labeled set. Below 80% means your grader prompt needs tightening.
Answer Faithfulness: is the generated answer grounded in the retrieved context? This is where the hallucination checker comes in. I measure this with an LLM-as-judge prompt on 200 sampled production queries per week.
Answer Relevance: does the answer actually address what the user asked? Faithfulness and relevance are different things. A faithful answer can still be off-topic. I track this through user feedback signals (thumbs up/down) and spot-check sampling.
LLM-as-Judge Evaluation
For continuous evaluation in production, I use an LLM judge running nightly on a random sample of 50 queries. The judge prompt looks like this:
EVALUATION_PROMPT = """
You are an evaluation assistant. Rate the following RAG system response.
Query: {query}
Retrieved Context: {context}
Generated Answer: {answer}
Rate on three dimensions (1-5):
1. Faithfulness: Is the answer grounded in the retrieved context?
2. Relevance: Does the answer address what the query asks?
3. Completeness: Does the answer cover all aspects of the query?
Return a JSON object with scores and a one-sentence explanation for each.
"""
I run this with GPT-4o-mini on a cron job and store results in a simple Postgres table. When any dimension drops below 3.5 average over a 7-day window, I get an alert and review the flagged queries. This has caught three separate regression issues across production deployments, each caused by a document sync failure or prompt change that wasn't tested against the full eval set.
Real Cost Numbers from Production
Nobody publishes these. Here's what I actually see across deployments.
A simple query that the router sends directly to the generator (no retrieval needed) costs about $0.02: one small model call for routing, one large model call for generation. A standard single-retrieval query with grading and hallucination checking runs $0.06 to $0.09: five to six LLM calls across small and large models, plus one vector search. A complex multi-hop query requiring two retrieval iterations costs $0.18 to $0.31: ten to fourteen LLM calls. Queries that hit the retry cap and fall back to a "no information" response cost $0.04 to $0.07.
For a system handling 1,000 queries per day with a typical distribution (40% direct, 45% standard retrieval, 15% complex), daily LLM costs run $60 to $90 per day, or roughly $1,800 to $2,700 per month. Add vector store costs and infrastructure, and you're looking at $2,200 to $3,400 per month all-in for a mid-volume deployment.
Where to Cut Costs Without Sacrificing Quality
The router is your biggest lever. If you can correctly classify 40% of queries as "direct answer" (no retrieval needed), you cut costs on those queries by 70%. Invest time in making your router accurate. The second lever is caching. Many queries in enterprise systems are semantically similar or identical. Semantic caching (embedding the query and checking similarity against a cache of recent queries and their answers) can serve 20 to 35% of queries at near-zero cost on high-repetition workloads like internal HR chatbots or product documentation systems.
When NOT to Use Agentic RAG
This is the section nobody else writes. Agentic RAG adds complexity, latency, and cost. It's the right choice for some systems and clearly wrong for others.
Use agentic RAG when: your queries are complex and multi-part, your documents span multiple topics that require routing, you need high accuracy and can tolerate 2 to 8 seconds of latency, and your domain has a meaningful hallucination risk (legal, medical, financial).
Stick with standard RAG when: your queries are simple and well-defined, your knowledge base has a single topic and good semantic coverage, sub-second latency is required, and your volume is too high for per-query LLM grading to be economically viable. Standard RAG at high volume with a well-structured index often outperforms agentic RAG on cost-adjusted accuracy.
Use direct LLM calls (no RAG at all) when: the information needed is within the model's training data, the query is more about reasoning than retrieval, or you're building a creative or generative use case where external grounding would constrain the output.
I've seen teams add agentic RAG to a simple FAQ bot that had 200 predefined questions and answers. The standard RAG system answered correctly 94% of the time. The agentic system answered correctly 96% of the time. But it cost 8x more per query and took 3 seconds instead of 0.4 seconds. That's not a win. Use our AI readiness assessment to figure out which approach actually fits your situation before committing to an architecture.
If you're building agentic systems at scale and want a second opinion on architecture, I review these in detail as part of my AI systems work. And if you want to go deeper on the multi-agent orchestration patterns that sit on top of agentic RAG, the n8n AI agent workflow guide covers how I connect retrieval systems to action-taking agents in production. Reach out via the contact page if you want to talk through a specific deployment.
Frequently Asked Questions
What is the difference between RAG and agentic RAG?
Standard RAG follows a fixed pipeline: embed the query, retrieve top-k documents, generate an answer. Agentic RAG replaces that pipeline with a loop where an AI agent decides whether to retrieve, grades what it retrieved, and retries with a reformulated query if the context isn't good enough. The agent controls the process rather than following predetermined steps. This makes agentic RAG significantly more accurate on complex, multi-part questions but also more expensive and slower per query.
Is LangGraph the best framework for building agentic RAG?
In 2026, LangGraph is the most mature option for production agentic RAG systems. Its state graph abstraction maps cleanly to the iterative retrieval loop, it handles human-in-the-loop checkpoints well, and the LangSmith integration gives you production observability out of the box. CrewAI is easier to get started with but gives you less control over the retrieval loop internals. For most teams building their first agentic RAG system, LangGraph is the right choice. For teams that need something working in a day and will live with slightly less control, CrewAI's approach is reasonable.
How many LLM calls does an agentic RAG system make per query?
A typical single-retrieval agentic RAG cycle makes five to seven LLM calls: one for routing, one for retrieval query reformulation if needed, one per document for grading (typically two to four documents), one for generation, and one for hallucination checking. A complex multi-hop query requiring two retrieval iterations can make ten to fifteen calls. This is why model tiering (using small models for routing and grading, large models for generation) is critical for keeping latency and cost manageable.
What chunk size should I use for my RAG system?
There is no universal answer. Dense technical documentation typically does better with 256 to 512 token chunks. Narrative and policy documents do better with 1024 to 2048 tokens. Structured data should be chunked by entity or row, not by token count. The only reliable method is empirical testing: take 50 representative queries, test against multiple chunk sizes, and measure retrieval recall (what percentage of queries surface the correct document in the top 3 results). Add 20% overlap between chunks to catch information that spans boundaries.
How do I prevent infinite loops in agentic RAG?
Set a hard retry cap. I use a maximum of 3 retrieval attempts. After 3 failed retrievals, the system proceeds with whatever context it has, or returns a graceful "insufficient information" response. Never build a graph node without a termination condition. You also want loop detection at the query level. If the same reformulated query appears twice, break the cycle and escalate to fallback behavior. These two controls together eliminate the infinite loop problem.
What's the real cost of running agentic RAG in production?
At 1,000 queries per day with a typical distribution of simple and complex queries, expect $1,800 to $2,700 per month in LLM API costs. Add vector store costs ($50 to $200 depending on index size) and compute infrastructure, and total monthly cost runs $2,200 to $3,400 for a mid-volume deployment. Cost per query averages $0.06 to $0.09 for standard retrievals and $0.18 to $0.31 for complex multi-hop queries. Semantic caching on high-repetition workloads can cut overall cost by 20 to 35%.
When should I use standard RAG instead of agentic RAG?
Use standard RAG when your queries are simple and well-defined, your knowledge base has good semantic coverage of a single topic, you need sub-second response times, or your query volume is too high for per-query LLM grading to be cost-effective. Agentic RAG adds real value when questions are complex and multi-part, documents span multiple domains requiring routing decisions, high accuracy justifies 2 to 8 seconds of latency, and your use case has meaningful consequences for hallucination (legal, financial, medical). Many deployments that think they need agentic RAG actually need better chunking and a stronger embedding model first.
How do I evaluate whether my agentic RAG system is working correctly?
Track four metrics: retrieval recall (what percentage of queries surface at least one relevant document), grader precision (what percentage of documents marked relevant actually are), answer faithfulness (is the generated answer grounded in the retrieved context), and answer relevance (does the answer address what the user actually asked). Build a labeled test set of 100 queries with known ground-truth documents and run it before every major change. Use an LLM-as-judge prompt on a nightly sample of production queries to catch regressions automatically.
Citation Capsule: Accuracy comparison data (34% traditional RAG vs 78% agentic RAG on complex queries) sourced from production benchmarks covered by NVIDIA Technical Blog. Query routing cost savings (40% reduction) from Adaline Labs production RAG architecture guide. Embedding model pricing from official API documentation as of April 2026. LangGraph framework documentation at LangChain LangGraph. Agentic retrieval architecture overview at Weaviate: What Is Agentic RAG.
Related Posts

Model Context Protocol: How I Build MCP Servers That Run in Production (and What Most Guides Skip)

Google Just Released the Most Capable Open Source AI Agent Model. Here Is What It Means for Your Business.

n8n 2.0 AI Agents: The Workflow Architecture I Use Across Every Client Deployment

Jahanzaib Ahmed
AI Systems Engineer & Founder
AI Systems Engineer with 109 production systems shipped. I run AgenticMode AI (AI agents, RAG systems, voice AI) and ECOM PANDA (ecommerce agency, 4+ years). I build AI that works in the real world for businesses across home services, healthcare, ecommerce, SaaS, and real estate.