Production RAG with Claude: Architecture and Trade-offs
A deep dive on retrieval-augmented generation patterns we use to ship reliable LLM features for clients.
RAG looks simple in a notebook and hard in production. The notebook version is: embed your docs, embed the query, find nearest neighbors, stuff the top-K into a prompt, return the LLM output. The production version has to handle chunking trade-offs, retrieval quality, evaluation harnesses, token budgets, hallucination defenses, and observability - all without breaking when your knowledge base grows from 1,000 documents to 1 million.
Chunking is the under-considered decision
Chunking is the most under-considered decision in RAG. Naive fixed-size chunking (500 tokens, no overlap) cuts through semantic units and retrieves poorly. We default to recursive structure-aware chunking: split at document, then heading, then paragraph, then sentence boundaries, with 10-15% overlap between adjacent chunks. For technical content with code blocks, use a code-aware splitter that keeps function bodies intact. The right chunk size depends on your content and query mix, so benchmark 256, 512, and 1024 tokens against your eval set rather than picking one up front.
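A minimal sketch of the recursive idea, under stated assumptions: sizes are measured in characters as a stand-in for tokens, heading-level and code-aware splitting are left out, and the names `split_recursive` and `chunk_text` are hypothetical helpers, not any library's API.

```python
import re

# Boundaries from coarse to fine: blank line (paragraph), newline, sentence end.
BOUNDARIES = (r"\n\s*\n", r"\n", r"(?<=[.!?])\s+")


def split_recursive(text, limit, boundaries=BOUNDARIES):
    """Split at the coarsest boundary present until every piece fits `limit`."""
    text = text.strip()
    if not text:
        return []
    if len(text) <= limit:
        return [text]
    for i, pattern in enumerate(boundaries):
        if re.search(pattern, text):
            pieces = []
            for part in re.split(pattern, text):
                # Oversized parts fall through to the next, finer boundary.
                pieces.extend(split_recursive(part, limit, boundaries[i + 1:]))
            return pieces
    # No boundary left: fall back to a hard cut.
    return [text[i:i + limit] for i in range(0, len(text), limit)]


def chunk_text(text, max_chars=2000, overlap_ratio=0.12):
    """Pack pieces into chunks of at most max_chars, carrying a tail as overlap.

    Assumes overlap_ratio is well below 1 so the piece limit stays positive.
    """
    overlap = int(max_chars * overlap_ratio)
    # Leave room for the overlap tail plus one joining space.
    pieces = split_recursive(text, max_chars - overlap - 1)
    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current} {piece}" if current else piece
        if len(candidate) <= max_chars:
            current = candidate
            continue
        chunks.append(current)
        tail = current[-overlap:] if overlap else ""
        current = f"{tail} {piece}".strip()
    if current:
        chunks.append(current)
    return chunks
```

In a real pipeline you would swap the character count for your embedding model's tokenizer and add a heading boundary in front of the paragraph one. The overlap here is a raw character tail, so it can start mid-word; snapping it to a sentence boundary is a reasonable refinement.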
Hybrid retrieval beats pure vector search
Hybrid retrieval beats pure vector search at almost every scale we have measured. Run semantic search (vector embeddings) and lexical search (BM25 over an inverted index) side by side, then merge the two ranked lists with reciprocal rank fusion (RRF). The rationale:
- Vectors find conceptual matches but miss exact terminology
- BM25 finds exact terms but misses paraphrased content
- Hybrid gets both at minimal latency cost (10-30ms extra)
For Postgres-based stacks, pgvector plus built-in full-text search (tsvector) is a clean implementation, with the caveat that Postgres's native ts_rank is not BM25; for managed stacks, a Pinecone hybrid index or Elasticsearch with kNN handles it.
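The fusion step is small enough to sketch in full. This assumes each retriever returns document IDs ordered best first; the function name and the sample IDs are illustrative, and k=60 is the constant commonly used for RRF, not a value we have tuned.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into a single ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by doc ID for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from the two retrievers.
vector_hits = ["d3", "d1", "d7"]
bm25_hits = ["d1", "d9", "d3"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

Because RRF uses only ranks, it sidesteps the problem that cosine similarities and BM25 scores live on different scales. Documents found by both retrievers (d1 and d3 here) rise above documents found by only one.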
Reranking is the next quality lever
Reranking the retrieved chunks before sending to the LLM is the next quality lever. Top-K vector retrieval gives you 20-50 candidates; a cross-encoder reranker (Cohere Rerank, Voyage Rerank, or open-weights models like bge-reranker) scores each candidate against the query and reorders. We typically retrieve 50, rerank to top 5-10. The latency cost is 50-200ms; the quality gain is substantial - especially when initial vector retrieval is noisy.
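The retrieve-then-rerank shape can be sketched with the scorer left pluggable. Here `term_overlap_score` is a deliberately crude stand-in so the example runs without a model; in production `score_fn` would wrap a cross-encoder or a hosted rerank API. All names are hypothetical.

```python
def rerank(query, candidates, score_fn, top_n=5):
    """Score every candidate against the query and keep the best top_n.

    score_fn(query, passage) -> float, higher meaning more relevant.
    """
    scored = [(score_fn(query, passage), passage) for passage in candidates]
    # Sort on the score only; equal scores keep their retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:top_n]]


def term_overlap_score(query, passage):
    """Toy scorer: fraction of query terms that appear in the passage."""
    query_terms = set(query.lower().split())
    passage_terms = set(passage.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & passage_terms) / len(query_terms)


candidates = [
    "how to reset a password",
    "annual plan pricing",
    "refund policy for annual plans",
]
top = rerank("annual plan refund policy", candidates, term_overlap_score, top_n=2)
```

Keeping the scorer behind a plain function signature makes it cheap to A/B different rerankers against the same eval set.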
Evaluation is non-negotiable
Without an eval harness, you cannot ship safely - every prompt change becomes a guessing game. Build a golden dataset of 50-200 query/expected-answer pairs covering happy path, edge cases, and known failure modes. Run nightly evaluation jobs measuring:
- Retrieval accuracy - was the right document found in top-K?
- Generation accuracy - does the answer match expected output?
- Hallucination rate - is every claim in the answer supported by the sources it cites?
- Latency P95 and cost per query - production economics
Tools like Braintrust, LangSmith, and Ragas make this practical.
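Two of the metrics above are simple enough to compute by hand before adopting a tool. The sketch below assumes each golden example is a dict with hypothetical `query` and `expected_doc_id` keys, and that `retrieve` is your own function returning ranked document IDs.

```python
def hit_rate_at_k(examples, retrieve, k=5):
    """Fraction of examples whose expected document appears in the top k."""
    if not examples:
        return 0.0
    hits = sum(
        1 for ex in examples
        if ex["expected_doc_id"] in retrieve(ex["query"])[:k]
    )
    return hits / len(examples)


def percentile(values, pct=95):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100 avoids floating-point surprises.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


# Hypothetical golden set and a canned retriever, for illustration only.
golden = [
    {"query": "refund window", "expected_doc_id": "d1"},
    {"query": "reset password", "expected_doc_id": "d2"},
]
canned = {"refund window": ["d1", "d4"], "reset password": ["d9", "d8"]}
score = hit_rate_at_k(golden, lambda q: canned[q], k=2)
p95_ms = percentile(list(range(1, 101)), pct=95)
```

Generation accuracy and hallucination rate usually need a grader (a rubric, a judge model, or a human), which is where the hosted tools earn their keep.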
Token budgeting separates prototype from production
Claude has a 200K context window, but using it naively burns tokens and degrades attention. Use the smallest context that delivers acceptable quality: 5-10 reranked chunks at 500 tokens each is typically enough. Budget the window explicitly across the system prompt, retrieved chunks, conversation history, and the response, and set an explicit max output token cap. Cache static system prompts via prompt caching - Anthropic supports this directly - which can cut the input cost of the cached prefix by up to 90% on repeated requests with stable prefixes.
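A budget like this can be enforced in a few lines. The sketch assumes chunks arrive ordered best first (for example, straight from the reranker) and uses a rough four-characters-per-token estimate, which is a heuristic, not Claude's tokenizer; the function names are hypothetical.

```python
def estimate_tokens(text):
    """Rough heuristic (about 4 characters per token); not a real tokenizer."""
    return max(1, len(text) // 4)


def pack_chunks(chunks, context_limit, system_tokens, history_tokens,
                max_output_tokens):
    """Select chunks, best first, until the remaining token budget is spent.

    The chunk budget is whatever the window has left after the system
    prompt, the conversation history, and the reserved output.
    """
    budget = context_limit - system_tokens - history_tokens - max_output_tokens
    selected, used = [], 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used + cost > budget:
            break  # stop at the first chunk that does not fit
        selected.append(chunk)
        used += cost
    return selected, used


# Three chunks of roughly 100 tokens each against a deliberately tiny window.
chunks = ["a" * 400, "b" * 400, "c" * 400]
selected, used = pack_chunks(
    chunks,
    context_limit=1000,
    system_tokens=300,
    history_tokens=200,
    max_output_tokens=300,
)
```

Stopping at the first chunk that does not fit keeps the ranking order intact; skipping it and trying smaller chunks further down is a valid alternative if your chunk sizes vary widely.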
Hallucination defense is layered, not single-shot
Hallucination defenses go beyond grounding. Three layers we deploy:
1. Instruct the model explicitly to refuse questions outside the retrieved context with phrases like "If the answer is not in the provided context, say so"
2. Require citation of source IDs in the output and validate at parse time that cited IDs exist
3. In high-stakes use cases, run a separate verification pass with a different model or a programmatic check
Layered defense beats any single mitigation. Each layer catches different failure modes.
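The second layer, citation validation, is a purely programmatic check. The sketch below assumes the prompt asks the model to cite sources as bracketed IDs such as `[doc-12]`; that format, the regex, and the function name are our illustration, not a fixed convention.

```python
import re

# Assumed citation format: an ID in square brackets, e.g. [doc-12].
CITATION_PATTERN = re.compile(r"\[([A-Za-z0-9_-]+)\]")


def validate_citations(answer, retrieved_ids):
    """Check that every cited ID refers to a chunk that was actually retrieved.

    An answer passes only if it cites at least one source and none of its
    citations point outside the retrieved set.
    """
    cited = set(CITATION_PATTERN.findall(answer))
    unknown = cited - set(retrieved_ids)
    return {
        "cited": sorted(cited),
        "unknown": sorted(unknown),
        "ok": bool(cited) and not unknown,
    }


retrieved_ids = {"doc-12", "doc-7"}
good = validate_citations("Refunds take 5 days [doc-12].", retrieved_ids)
bad = validate_citations(
    "Refunds take 5 days [doc-12]. Fees apply [doc-99].", retrieved_ids
)
```

This only proves the cited IDs exist, not that the cited chunk supports the claim; that stronger check belongs to the third layer.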
Observability separates RAG that improves from RAG that drifts
Log every query, retrieved chunks (with scores), prompt sent, model response, latency, and token cost to a structured store. Build dashboards for retrieval recall, generation quality scores, P95 latency, and cost per query. Without this, you cannot debug user complaints or measure the impact of changes; with it, you have a feedback loop that compounds over months.
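One way to shape such a record is a single JSON line per request, which most log pipelines and warehouses ingest directly. The field names below are assumptions for illustration; token counts are logged so that cost can be derived later from whatever pricing applies.

```python
import json
import time
import uuid


def build_trace(query, retrieved, prompt, response, latency_ms,
                input_tokens, output_tokens):
    """Assemble one structured record for a single RAG request.

    `retrieved` is a list of (chunk_id, score) pairs in ranked order.
    """
    return {
        "trace_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "query": query,
        "retrieved": [
            {"chunk_id": chunk_id, "score": score}
            for chunk_id, score in retrieved
        ],
        "prompt": prompt,
        "response": response,
        "latency_ms": latency_ms,
        "tokens": {"input": input_tokens, "output": output_tokens},
    }


def to_json_line(record):
    """Serialize a record as one newline-terminated JSON line."""
    return json.dumps(record) + "\n"


record = build_trace(
    query="What is the refund window?",
    retrieved=[("doc-12", 0.91), ("doc-7", 0.64)],
    prompt="(full prompt text)",
    response="Refunds take 5 days [doc-12].",
    latency_ms=842.0,
    input_tokens=2950,
    output_tokens=48,
)
line = to_json_line(record)
```

The trace ID is what lets you join a user complaint back to the exact chunks and prompt that produced the answer.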
Model choice matters
Claude excels at long-context reasoning and tool use - it is our default for agentic and document-intelligence workloads. GPT-4o is broader and faster for simple Q&A. Open-weights models (Llama, Mixtral) win on data residency or fine-tuning needs. Model choice is not one-size-fits-all; test 2-3 models on your actual task with your actual eval set, including cost and latency, before committing.
We deploy these patterns across production engagements in fintech, healthcare, and eCommerce. Talk to a senior engineer about how they apply to your platform.
