ArchBits

Building Production RAG Systems: A Practical Guide for Developers

Learn how RAG systems retrieve relevant information to extend AI capabilities beyond training data. Covers core concepts, architecture patterns, evaluation strategies, and production considerations with real benchmarks.

The Problem: Models Can't Know Everything

Foundation models like GPT-4, Claude, and Gemini are remarkably capable. They can write code, analyze data, and reason through complex problems. But they have three fundamental limitations that matter in production.

First, context windows have limits. Even Claude 3.5 Sonnet's 200,000 token context can't fit your entire codebase, database, or real-time data streams. You need to be selective about what information goes into the prompt.

Second, models hallucinate without grounding. When a model lacks specific information, it doesn't say "I don't know." Instead, it generates plausible-sounding fiction. The probabilistic nature of language models means they'll confidently fill gaps with made-up details. For building information systems or engineering documentation, this is unacceptable.

Third, cost scales with context length. More tokens mean higher API costs and longer latency. Loading 100,000 tokens into every query burns money fast. If you're building a system that handles thousands of queries daily, this becomes a real constraint.

RAG (Retrieval-Augmented Generation) addresses all three limitations. Instead of cramming everything into context, RAG retrieves only the most relevant information for each specific query. It's targeted, cost-efficient, and reduces hallucinations by grounding responses in retrieved evidence.

What is RAG? Retrieval-Augmented Generation is a pattern where you retrieve relevant information from a knowledge base before generating a response. Think of it like a developer searching Stack Overflow before answering a question, rather than relying purely on memory.

How RAG Works: Retrieve, Then Generate

The core RAG workflow is straightforward: when a query comes in, you search your knowledge base for relevant information, then use that information as context for the language model to generate a response.

Here's what happens at the system level:

[Diagram: the two RAG pipelines — offline indexing and online querying]

There are two phases: indexing (offline) and querying (online).

During indexing, you process your documents—whether that's BIM documentation, building codes, project specifications, or technical manuals. You split them into manageable chunks, convert each chunk into a numerical vector representation (an embedding), and store these in a specialized database optimized for similarity search. This happens once, or whenever you need to update your knowledge base.

What are embeddings? Embeddings are numerical representations of text that capture semantic meaning. Similar concepts have similar embeddings, even if they use different words. "Steel beam" and "structural steel member" would have similar embeddings because they mean similar things.

During querying, when a user asks a question, you convert that question into the same vector format, search for the most similar chunks in your database, and feed those chunks as context to the language model. The model then generates a response grounded in that specific, retrieved information.

The key insight: you're not asking the model to know everything. You're asking it to process and synthesize specific information you've retrieved. This is similar to how you might work—you don't memorize every building code, but you know how to look up the relevant sections when needed.

The Two Types of Search: Keywords vs Meaning

Before neural networks dominated information retrieval, systems relied on keyword matching. These term-based methods like BM25 and TF-IDF are surprisingly effective and still relevant today.

BM25 works by scoring documents based on how often query terms appear, how rare those terms are across all documents, and normalizing for document length. If you search for "concrete compressive strength," documents containing all three words—especially if those words are rare in your corpus—rank highest. It's fast, simple, and works well when users search with specific terminology.
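To make the scoring concrete, here's a minimal pure-Python sketch of Okapi BM25 over a toy corpus. The corpus, whitespace tokenization, and default parameters (k1=1.5, b=0.75) are illustrative assumptions, not a production implementation.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each doc against the query with Okapi BM25 (toy, whitespace-tokenized)."""
    tokenized = [d.lower().split() for d in docs]
    N = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / N
    # document frequency: in how many docs does each term appear?
    df = Counter()
    for d in tokenized:
        for term in set(d):
            df[term] += 1
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # rarer terms get higher weight; term frequency saturates via k1,
            # and long documents are penalized via the b length normalization
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            )
        scores.append(score)
    return scores

docs = [
    "concrete compressive strength testing per astm c39",
    "steel beam deflection limits for floor framing",
    "compressive strength of high-performance concrete mixes",
]
scores = bm25_scores("concrete compressive strength", docs)
```

The middle document shares no terms with the query and scores zero; the other two contain all three query terms and rank highest, as described above.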

Embedding-based retrieval takes a different approach. Instead of matching words, it matches meaning. Neural network models convert text into high-dimensional vectors where semantic similarity translates to geometric proximity. "Load-bearing wall" and "structural partition" might share zero words but have similar embeddings because they're semantically related.
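Geometric proximity is typically measured with cosine similarity. Below is a sketch with hypothetical 4-dimensional embeddings (real models produce hundreds or thousands of dimensions); the vectors are made up purely to illustrate that semantically related phrases land close together.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented for illustration only
emb = {
    "load-bearing wall":    [0.82, 0.11, 0.54, 0.03],
    "structural partition": [0.79, 0.15, 0.49, 0.08],
    "paint color schedule": [0.05, 0.91, 0.02, 0.40],
}

sim_related = cosine_similarity(emb["load-bearing wall"], emb["structural partition"])
sim_unrelated = cosine_similarity(emb["load-bearing wall"], emb["paint color schedule"])
```

The two structurally related phrases share zero words yet sit nearly parallel in vector space, while the unrelated phrase points elsewhere.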

The practical difference: term-based search excels with specific jargon and exact matches. If someone searches for "IFC 4.3" or a specific building code reference, keyword matching is hard to beat. Embedding-based search excels when queries are phrased differently than documents, when synonyms matter, or when understanding intent is crucial.

Best Practice: Use both approaches together. Hybrid search that combines term-based and embedding-based retrieval consistently outperforms either method alone by 15-25% in precision metrics. Major vector databases like Weaviate, Pinecone, and Qdrant now support hybrid search natively.

Here's why hybrid search wins: some queries benefit from exact keyword matching (product codes, technical specifications), while others need semantic understanding (natural language questions). By fusing both approaches, you capture the strengths of each. The most common fusion method is Reciprocal Rank Fusion (RRF), which combines rankings from both search methods to produce a final score.
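RRF itself is only a few lines: each document's fused score is the sum of 1/(k + rank) over every ranked list it appears in. Here's a minimal sketch with made-up document IDs; k=60 is the constant from the original RRF paper.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked lists of doc IDs into one list, best first.

    A doc's fused score is sum(1 / (k + rank)) across all rankings,
    using 1-based ranks; appearing high in several lists compounds.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical result lists from the two retrievers
bm25_ranking = ["doc_ifc43", "doc_loads", "doc_hvac"]
embedding_ranking = ["doc_ifc43", "doc_fire", "doc_loads"]

fused = reciprocal_rank_fusion([bm25_ranking, embedding_ranking])
```

Documents that both retrievers rank highly (here `doc_ifc43`) rise to the top of the fused list, which is exactly the behavior hybrid search wants.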

A Production RAG Architecture

Let's look at what a production RAG system actually needs. This isn't every possible component, but rather a practical architecture that handles real traffic and scales as needed.

[Diagram: production RAG architecture — API layer, cache, retrieval, generation, document processing, monitoring]

The API layer is your entry point—FastAPI is popular for Python-based systems because it's fast, supports async operations, and generates automatic documentation. This handles authentication, rate limiting, and request validation.

Caching makes the biggest immediate impact on cost and latency. Redis stores frequently-requested results. When someone asks "What's the fire resistance rating for Type X gypsum board?" for the hundredth time, you return the cached response in milliseconds instead of running the entire retrieval and generation pipeline again. Semantic caching takes this further: even if the question is phrased slightly differently, if it's sufficiently similar (cosine similarity > 0.95), return the cached result.
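The semantic-cache idea can be sketched in a few lines. This toy version keeps entries in memory and takes an injected embedding function; a production system would use Redis plus a real embedding API, and the fake embedding table below is invented solely for the demonstration.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class SemanticCache:
    """Toy in-memory semantic cache; linear scan, no eviction."""

    def __init__(self, embed_fn, threshold=0.95):
        self.embed_fn = embed_fn      # maps text -> vector (caller supplies one)
        self.threshold = threshold    # cosine similarity needed for a hit
        self.entries = []             # list of (embedding, response) pairs

    def get(self, query):
        q = self.embed_fn(query)
        for emb, response in self.entries:
            if cosine(q, emb) >= self.threshold:
                return response       # phrased differently, but close enough
        return None

    def put(self, query, response):
        self.entries.append((self.embed_fn(query), response))

# Stand-in embeddings, invented for illustration
fake_embeddings = {
    "fire rating for type x gypsum?": [0.90, 0.10, 0.20],
    "what is the fire rating of type x gypsum board?": [0.88, 0.12, 0.21],
    "hvac duct sizing rules?": [0.10, 0.90, 0.30],
}
cache = SemanticCache(lambda text: fake_embeddings[text])
cache.put("fire rating for type x gypsum?", "cached answer about Type X gypsum")

hit = cache.get("what is the fire rating of type x gypsum board?")   # rephrased query
miss = cache.get("hvac duct sizing rules?")                          # unrelated query
```

The rephrased question clears the 0.95 threshold and returns the cached response; the unrelated one falls through to the full pipeline.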

Cost reduction potential: ~80%

The retrieval layer has two components. First, the vector database (Pinecone, Weaviate, or Qdrant are the main production options) stores your document embeddings and handles similarity search. Second, the embedding API (typically OpenAI's text-embedding-3-large) converts queries and documents into vector representations. You can self-host embedding models, but managed APIs are simpler for most teams.

The generation layer is where the language model lives. OpenAI's GPT-4o is the current sweet spot for most applications—cheaper than GPT-4 Turbo while maintaining quality. This is where most of your cost accumulates (typically 90%+ of total RAG system cost).

Document processing happens asynchronously. When new documents arrive, background workers (Celery is common) handle parsing, chunking, embedding generation, and database insertion. This keeps your API responsive—uploading a 100-page PDF doesn't block the request/response cycle.

Monitoring is not optional for production systems. LangSmith (from LangChain) or Arize AI track retrieval quality, generation quality, latency, costs, and user behavior. You need visibility into what's working and what's breaking.

Orchestration frameworks like LangChain and LlamaIndex provide pre-built components for this architecture. LangChain offers broader functionality across the stack, while LlamaIndex specializes in retrieval and indexing patterns. Both integrate with major vector databases and LLM providers.

The data flow for a query: Request hits your API → Check cache → If miss, generate query embedding → Search vector database → Retrieve top-k chunks → Construct prompt with context → Call LLM → Cache result → Return response. Each step has its own latency and cost characteristics, which is why monitoring matters.
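That data flow can be sketched as a single function. The `embed`, `search`, and `llm` callables below are stand-ins for the real embedding API, vector database, and chat model, and the exact-match dict cache is a simplification of the semantic cache described earlier.

```python
def answer_query(query, cache, embed, search, llm, top_k=5):
    """Sketch of the online query path; all external services are injected."""
    if query in cache:                        # 1. cache check (exact-match here)
        return cache[query]
    query_vec = embed(query)                  # 2. generate the query embedding
    chunks = search(query_vec, top_k)         # 3. top-k similarity search
    prompt = (                                # 4. construct prompt with context
        "Answer ONLY using the context below.\n\n"
        "Context:\n" + "\n---\n".join(chunks)
        + f"\n\nQuestion: {query}"
    )
    response = llm(prompt)                    # 5. grounded generation
    cache[query] = response                   # 6. cache the result
    return response

# Demonstration with stub services; the second call should hit the cache
calls = {"llm": 0}
def fake_llm(prompt):
    calls["llm"] += 1
    return "stub answer grounded in retrieved context"

cache = {}
for _ in range(2):
    answer = answer_query(
        "What is the design wind speed?",
        cache,
        embed=lambda q: [0.1, 0.2, 0.3],
        search=lambda vec, k: ["Section 1609 covers wind loads.", "Exposure categories B, C, D."],
        llm=fake_llm,
    )
```

The LLM is invoked once across two identical queries, which is the per-step cost behavior the monitoring layer should be tracking.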

Chunking: The Most Overlooked Quality Lever

How you split documents into chunks has an outsized impact on retrieval quality. Too small, and chunks lack context. Too large, and they contain too much irrelevant information that obscures what's actually relevant to the query.

Consider technical documentation for building systems. A 500-word section explaining HVAC load calculations should probably stay together as one chunk. Breaking it mid-calculation loses critical context. But a 5,000-word chapter covering multiple topics should be split so each chunk focuses on a coherent concept.

There's no universal optimal chunk size—it depends on your content structure. Code documentation might work best at 200-300 tokens (roughly one function or class). Technical specifications might need 500-800 tokens to capture complete procedures. Legal documents often require even larger chunks because clauses depend on each other for meaning.

The common approach is semantic chunking: respect document structure by splitting on natural boundaries like paragraphs, sections, or headings. This preserves coherence better than arbitrary token-count splits. Many frameworks offer built-in semantic chunking strategies.

Add 10-20% overlap between consecutive chunks. If one chunk ends at token 500, the next might start at token 450. This prevents information loss at boundaries—concepts that span chunk edges appear in both chunks, improving retrieval coverage.
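A fixed-size sliding-window chunker with overlap is a few lines. This sketch operates on an already-tokenized list (the integer "tokens" stand in for real tokenizer output) and uses a 50-token overlap on 400-token chunks, i.e. 12.5%, inside the 10-20% guideline above.

```python
def chunk_with_overlap(tokens, chunk_size=400, overlap=50):
    """Split a token list into chunks where consecutive chunks share
    `overlap` tokens, so concepts at boundaries appear in both chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start : start + chunk_size])
        if start + chunk_size >= len(tokens):
            break                # last chunk reached the end of the document
    return chunks

tokens = list(range(1000))       # stand-in for real tokenizer output
chunks = chunk_with_overlap(tokens)
```

For 1,000 tokens this yields three chunks, and the tail of each chunk repeats as the head of the next, so nothing is lost at the seams.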

Test empirically. Start with 400-token chunks as a baseline, then measure retrieval precision at 200, 400, and 800 tokens. The optimal size depends on your specific content and query patterns. What works for API documentation won't necessarily work for building codes.

Recent research from Anthropic showed that adding contextual information to chunks improves retrieval precision by 40%. The idea: before indexing each chunk, use an LLM to generate 50-100 tokens explaining what document it came from, what section it covers, and how it relates to the broader context. Prepend this context to the chunk. When someone searches, they're not just matching against the raw chunk text but also against this explanatory context, which dramatically improves relevance.
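The enrichment step above is essentially: build a contextualization prompt, call an LLM, and prepend the result to the chunk before embedding it. Here's a minimal sketch; the prompt wording and the stubbed `llm` callable are illustrative assumptions, not Anthropic's exact implementation.

```python
def contextualize_chunk(document_title, section, chunk_text, llm):
    """Prepend a short LLM-written context blurb to a chunk before indexing,
    in the spirit of contextual retrieval. `llm` is an injected stand-in."""
    prompt = (
        f"Document: {document_title}\n"
        f"Section: {section}\n\n"
        f"Chunk:\n{chunk_text}\n\n"
        "In 50-100 tokens, explain what this chunk covers and how it fits "
        "into the document, for use as a retrieval prefix."
    )
    context = llm(prompt)
    # Index (embed) this combined string instead of the raw chunk
    return context + "\n\n" + chunk_text

enriched = contextualize_chunk(
    "Steel Design Manual",
    "Chapter 4: Beam Deflection",
    "Limit live-load deflection to L/360 for floors.",
    llm=lambda p: "From the Steel Design Manual, beam deflection chapter: serviceability limits.",
)
```

Queries then match against both the explanatory prefix and the raw chunk text, which is where the relevance gain comes from.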

Measuring What Matters: Evaluation Strategy

You can't improve what you don't measure. RAG systems need two levels of evaluation: component-level (how good is the retrieval?) and end-to-end (how good is the final response?).

Component-level evaluation isolates the retriever. Context Precision measures what percentage of retrieved chunks are actually relevant to the query. If you retrieve 5 chunks and 4 are relevant, precision is 80%. Context Recall measures what percentage of all relevant chunks in your database you successfully retrieved. If 10 chunks are relevant and you retrieved 4, recall is 40%.
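Both metrics reduce to set overlap between retrieved chunk IDs and a labeled relevant set. This sketch reproduces the numbers from the paragraph above with hypothetical chunk IDs:

```python
def context_precision(retrieved, relevant):
    """Fraction of retrieved chunks that are actually relevant."""
    if not retrieved:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(retrieved)

def context_recall(retrieved, relevant):
    """Fraction of all relevant chunks that were successfully retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(relevant)

# 5 retrieved, 4 of them relevant; 10 relevant chunks exist in total
retrieved = ["c1", "c2", "c3", "c4", "c5"]
relevant = ["c1", "c2", "c3", "c4", "c9", "c10", "c11", "c12", "c13", "c14"]

precision = context_precision(retrieved, relevant)   # 4/5
recall = context_recall(retrieved, relevant)         # 4/10
```

Computing both per query across an evaluation set, rather than eyeballing final answers, is what isolates retriever problems from generator problems.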

There's a fundamental tradeoff: retrieving more chunks (higher k parameter) increases recall but decreases precision. You get more of the relevant information but also more noise. Balance this based on your use case—if missing information is catastrophic, bias toward recall. If irrelevant information confuses the model, bias toward precision.

End-to-end evaluation looks at the complete system output. Answer Relevance: Does the response actually address what was asked? Faithfulness: Is the response grounded in the retrieved context, or did the model hallucinate? Quality: Is it coherent, well-structured, and helpful?

Tools like RAGAS (RAG Assessment) automate these measurements, but you'll still need human evaluation for a ground truth dataset. The key question: how many examples do you need?

  • Samples to detect a 10% improvement: ~100
  • Samples to detect a 3% improvement: ~1,000

For production systems, aim for 500-1,000 evaluation examples covering common queries (80% of your traffic), edge cases (challenging or ambiguous queries), and known failure modes (queries that previously broke). Build this dataset incrementally—start with 100 examples, identify gaps, add more.

Advanced Patterns: Reranking and Multi-Hop Retrieval

Basic RAG—retrieve once, generate once—gets you 70% of the way to production quality. Advanced patterns unlock the remaining 30%.

Reranking is a two-stage retrieval process. First, use fast algorithms (BM25, HNSW) to retrieve 20-50 candidate chunks. Then, use a more sophisticated model (a cross-encoder) to rerank these candidates and select the top 5. Cross-encoders see both the query and each candidate together, unlike embedding-based retrieval where query and document embeddings are generated independently. This joint analysis produces better rankings.

The tradeoff: reranking adds 50-100ms latency but improves ranking quality (NDCG) by 10-15%. For quality-critical applications like engineering documentation or compliance systems, this is usually worth it. Cohere offers a reranking API, or you can use open-source cross-encoders trained on the MS MARCO dataset.
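Structurally, two-stage retrieval is just "retrieve many cheaply, rescore a few carefully." In the sketch below, a toy lexical-overlap scorer stands in for a real cross-encoder, and `fast_retrieve` stands in for BM25 or HNSW; only the two-stage shape is the point.

```python
def retrieve_and_rerank(query, fast_retrieve, cross_score, first_k=50, final_k=5):
    """Stage 1: cheap retrieval of many candidates.
    Stage 2: rescore each (query, candidate) pair jointly, keep the best."""
    candidates = fast_retrieve(query, first_k)
    reranked = sorted(candidates, key=lambda doc: cross_score(query, doc), reverse=True)
    return reranked[:final_k]

# Toy scorer standing in for a cross-encoder: fraction of query terms in the doc
def overlap_score(query, doc):
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q)

docs = [
    "fire resistance of steel columns",
    "plumbing fixture counts",
    "fire resistance ratings for concrete walls",
]
top = retrieve_and_rerank(
    "fire resistance concrete",
    fast_retrieve=lambda q, k: docs,   # stand-in: stage 1 returns everything
    cross_score=overlap_score,
    final_k=2,
)
```

A real cross-encoder would replace `overlap_score` with a model that reads the query and candidate together, which is what makes its rankings better than independently computed embeddings.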

Multi-hop retrieval handles queries requiring information from multiple sources. Consider the query: "Compare the fire resistance requirements for steel versus concrete structural systems in commercial buildings." No single document chunk contains this comparison—you need to retrieve steel requirements, concrete requirements, and then synthesize.

The approach: retrieve initial candidates based on the query, have the model analyze what additional information it needs, retrieve again with refined queries, then generate the final response. This is where RAG intersects with agent patterns—the model decides what to retrieve next based on what it's already seen.
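The retrieve-analyze-retrieve loop can be sketched as follows. Everything here is a stand-in: the two-entry knowledge base, the keyword retriever, and the `plan_followups` and `generate` callables that would really be LLM calls.

```python
def multi_hop_answer(query, retrieve, plan_followups, generate, max_hops=3):
    """Sketch of multi-hop RAG: retrieve, let the model name what's missing,
    retrieve again with refined sub-queries, then synthesize."""
    context = list(retrieve(query))
    for _ in range(max_hops - 1):
        followups = plan_followups(query, context)   # model lists missing sub-queries
        if not followups:
            break                                    # model judges context sufficient
        for sub in followups:
            context.extend(retrieve(sub))
    return generate(query, context)

# Toy knowledge base keyed by sub-topic
kb = {
    "steel": ["Steel frames typically need applied fireproofing for high ratings."],
    "concrete": ["Concrete walls can achieve ratings from cover thickness alone."],
}
def keyword_retrieve(q):
    return [fact for key, facts in kb.items() if key in q.lower() for fact in facts]

answer = multi_hop_answer(
    "Compare fire resistance requirements for steel versus concrete systems",
    retrieve=keyword_retrieve,
    plan_followups=lambda q, ctx: [] if len(ctx) >= 2 else ["concrete fire rating"],
    generate=lambda q, ctx: f"Synthesized from {len(ctx)} retrieved facts",
)
```

The loop terminates either when the planner reports nothing missing or when `max_hops` is exhausted, which bounds the latency cost of the extra retrieval rounds.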

Multi-hop adds complexity and latency (multiple retrieval rounds), but for complex analytical queries, it's the difference between superficial and comprehensive responses.

RAG vs Long Context: Choosing the Right Approach

Context windows keep expanding. Claude 3.5 Sonnet handles 200,000 tokens, GPT-4 Turbo handles 128,000 tokens. Does this make RAG obsolete?

Not quite. Here's the practical decision framework:

If your entire knowledge base fits comfortably in context (under 50,000 tokens), just include everything. The simplicity is worth it—no retrieval infrastructure, no chunking strategy, no vector database. This works for many smaller applications.

Beyond 100,000 tokens, RAG becomes compelling for three reasons. Cost: processing 100,000 input tokens with GPT-4 Turbo costs $1 per query. RAG that retrieves 5,000 relevant tokens costs $0.05 per query—a 20× difference. Latency: more tokens mean longer processing time. 100,000 tokens can take 3-5 seconds before the first token arrives. 5,000 tokens is under 1 second. Attention: research shows models lose focus in very long contexts, missing information buried in the middle. RAG retrieves targeted sections, keeping them in focus.

Lost in the Middle is a well-documented phenomenon where language models perform worse on information buried in long contexts compared to information at the beginning or end. RAG retrieves relevant sections and places them prominently in the prompt.

There are exceptions. Use long context when you need access to complete documents (legal contract analysis, code review), when data rarely changes (static knowledge base), or when cost and latency are acceptable for your use case. Use RAG when data exceeds 100,000 tokens, queries only need small subsets, cost or latency matter, or data updates frequently.

Importantly, you can combine both approaches. Retrieve the most relevant documents with RAG, then include entire documents in context if they're not too large. This gives you retrieval's targeting with long context's completeness.

What Actually Costs Money: The RAG Economics Breakdown

Understanding where money goes helps prioritize optimizations.

Embedding generation happens at two points. During indexing, you convert all document chunks into embeddings. With OpenAI's text-embedding-3-large ($0.13 per 1M tokens), processing 10M tokens of documents costs $1.30. If you reindex monthly, that's $1.30/month. During queries, you embed user input (~50 tokens average), which costs $0.0000065 per query—essentially negligible.

Vector storage varies by provider. Pinecone's serverless tier charges for storage, writes, and reads. For a typical application with 1M vectors, expect $20-50/month. Self-hosting (Qdrant, Weaviate on your own infrastructure) costs $100-200/month for a capable VM but has fixed costs regardless of query volume.

LLM generation dominates the cost structure. This is 90-95% of total spending for most RAG systems. With GPT-4o ($2.50 per 1M input tokens, $10 per 1M output tokens), a typical query with 2,000 tokens of retrieved context and 500 tokens of output costs $0.010. At 100,000 queries per month, that's $1,000 in generation costs alone.

The math for 100,000 queries/month:

  • Query embeddings: $0.65
  • Vector database operations: $50
  • LLM generation (GPT-4o): $1,000
  • Total: ~$1,051
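The breakdown above is simple enough to encode as a parameterized cost model, which makes it easy to re-run as prices or traffic change. The prices and the flat vector-database figure are the ones quoted in this section.

```python
def monthly_rag_cost(
    queries_per_month=100_000,
    query_tokens=50,            # average tokens per user query
    context_tokens=2_000,       # retrieved context fed to the LLM
    output_tokens=500,          # generated response length
    embed_price_per_m=0.13,     # text-embedding-3-large, $ per 1M tokens
    input_price_per_m=2.50,     # GPT-4o input, $ per 1M tokens
    output_price_per_m=10.00,   # GPT-4o output, $ per 1M tokens
    vector_db_monthly=50.0,     # flat estimate for DB operations
):
    """Rough monthly cost model for the query-time side of a RAG system."""
    embeddings = queries_per_month * query_tokens / 1e6 * embed_price_per_m
    generation = queries_per_month * (
        context_tokens / 1e6 * input_price_per_m
        + output_tokens / 1e6 * output_price_per_m
    )
    return {
        "embeddings": round(embeddings, 2),
        "vector_db": vector_db_monthly,
        "generation": round(generation, 2),
        "total": round(embeddings + vector_db_monthly + generation, 2),
    }

costs = monthly_rag_cost()
```

Running it with the defaults reproduces the list above and makes the dominance of generation obvious: it's roughly 95% of the total, so that's where caching and model choice pay off.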

Optimize generation costs first. Use caching aggressively (80% cost reduction for repeated queries is achievable), choose appropriate models (GPT-4o vs GPT-4 Turbo saves 60%), limit output length when appropriate, and batch requests when possible.

Common Production Mistakes

Not evaluating retrieval separately from generation. When quality is poor, developers often tune prompts or switch models without first checking if retrieval is broken. If you're retrieving irrelevant chunks, no prompt engineering will fix the output. Measure context precision and recall independently.

Ignoring chunk size impact. Many developers use default chunking parameters (often 500 or 1000 tokens) without testing. Chunk size has an outsized impact on quality. Test empirically—measure retrieval precision at 200, 400, and 800 tokens for your specific content and query patterns.

No reranking step. Initial retrieval optimizes for speed. The top 20 candidates from fast algorithms aren't necessarily the most relevant. Adding a reranking stage improves ranking quality by 10-15% with minimal latency cost (~100ms). For quality-critical applications, this is one of the highest-ROI optimizations.

Not monitoring retrieval quality over time. Data distributions shift. Documents get added or updated. Query patterns evolve. Retrieval quality that was excellent in January might be mediocre by June if you're not tracking it. Log precision and recall metrics daily, alert when they drop below thresholds, and investigate.

Assuming caching won't help. Many developers underestimate how much query overlap exists. Even with "unique" user queries, semantic clustering means many questions are similar enough for cache hits. A 50-60% cache hit rate is typical for customer-facing applications, translating to 80%+ cost reduction on cached queries.

The biggest mistake: Treating RAG as a solved problem. Production RAG requires continuous measurement, evaluation, and optimization. Set up monitoring infrastructure on day one, not after problems emerge.

When RAG Breaks: Limitations and Failure Modes

RAG isn't appropriate for every problem. Know when it fails.

The model hallucinates despite retrieved context. Sometimes models ignore retrieved information and generate responses based on their training data instead. Stronger prompting helps ("Answer ONLY using the context below"), as does finetuning models specifically for your domain to teach grounded behavior.

Retrieved context contradicts itself. Documents might contain conflicting information—outdated versions, different sources, or legitimately disputed facts. When this happens, the model can get confused or pick the wrong source. Solutions include document versioning, source quality scoring, or having the model explicitly acknowledge conflicts: "Document A states X, while Document B states Y."

The query requires reasoning across too many documents. Complex analytical queries might need information from dozens of sources. Retrieving all relevant information exceeds context limits or introduces too much noise. Multi-hop retrieval helps, as does query decomposition (break complex questions into simpler sub-queries).

Domain-specific terminology isn't captured by general embeddings. OpenAI's embeddings are trained on broad internet data. If your domain uses specialized terminology (BIM-specific terms, proprietary codes, industry jargon), general embeddings might not capture the right semantic relationships. This is when finetuning embedding models or using domain-specific models becomes worth the complexity.