
What Actually Makes a RAG System Work in Production

April 3, 2026

Retrieval-Augmented Generation (RAG) has become the default architecture for enterprise AI applications that need to answer questions grounded in internal documents. The concept is straightforward: when a user asks a question, retrieve the relevant chunks of text from a knowledge base, then pass them to a language model as context.

In practice, every component of that pipeline has failure modes that are not obvious until you are in production. Here is what I have learned from building and troubleshooting these systems.

The Retrieval Problem Is Harder Than It Looks

The most common failure mode I encounter is poor retrieval quality. The language model gets blamed when it gives a bad answer, but usually the problem is upstream — the wrong context was retrieved, or relevant context was missed entirely.

Chunking strategy matters enormously. Most tutorials demonstrate fixed-size chunking (e.g., 512 tokens with 50-token overlap). This works fine for homogeneous prose but falls apart on structured documents — tables, code blocks, numbered lists — where a chunk boundary cuts through a logical unit of information. A better default is to chunk by semantic structure (paragraphs, sections, list items) with length limits as guardrails rather than as the primary splitting mechanism.
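A minimal sketch of that idea: split on paragraph boundaries first, and use the length limit only as a guardrail for oversized units. The ~4-characters-per-token estimate is a stand-in assumption, not a real tokenizer.

```python
import re

def chunk_by_structure(text, max_tokens=512):
    """Split on blank lines (paragraph boundaries) first; hard-split only
    when a single unit exceeds the length guardrail."""
    max_chars = max_tokens * 4  # crude token estimate: ~4 chars/token (assumption)
    units = [u.strip() for u in re.split(r"\n\s*\n", text) if u.strip()]
    chunks, current = [], ""
    for unit in units:
        if len(unit) > max_chars:
            # Oversized unit: flush what we have, then hard-split as a last resort.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(unit[i:i + max_chars] for i in range(0, len(unit), max_chars))
        elif not current or len(current) + len(unit) + 2 <= max_chars:
            current = f"{current}\n\n{unit}" if current else unit
        else:
            chunks.append(current)
            current = unit
    if current:
        chunks.append(current)
    return chunks
```

In a real system you would split on the document's own structure (headings, list items, table rows) rather than blank lines alone, and use the actual tokenizer of your embedding model for the limit check.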

Embedding model choice is not neutral. General-purpose models like OpenAI’s text-embedding-3-small or text-embedding-3-large are solid starting points. For specialized domains — technical documentation, legal contracts, medical literature — a domain-specific fine-tuned embedding model often retrieves substantially more relevant results. The quality difference shows up in the long tail of edge-case queries, which is precisely where production systems tend to break down.

Hybrid retrieval outperforms dense-only search in most enterprise settings. Dense vector search (semantic similarity) and sparse keyword search (BM25) capture different signals. Dense search finds semantically related content even when keywords differ. Sparse search excels when users query with specific terms, product names, or error codes. Combining both with reciprocal rank fusion typically improves retrieval recall by 15–30% over either method alone.
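Reciprocal rank fusion itself is only a few lines. The sketch below fuses two ranked lists of document IDs; k=60 is the smoothing constant from the original RRF paper, and the document IDs are illustrative.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked lists (best-first) of document IDs.

    Each document scores sum(1 / (k + rank)) across the lists it appears in,
    so agreement between rankers is rewarded without comparing raw scores.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_results = ["doc_a", "doc_b", "doc_c"]   # from vector search
sparse_results = ["doc_b", "doc_d", "doc_a"]  # from BM25
fused = reciprocal_rank_fusion([dense_results, sparse_results])
# doc_b ranks first: it placed highly in both lists
```

Because RRF works on ranks rather than scores, it sidesteps the problem that dense similarity scores and BM25 scores live on incomparable scales.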

Context Quality Beats Context Quantity

There is a temptation to retrieve more chunks to reduce the chance of missing something. This approach backfires. Language models are surprisingly poor at attending to information in the middle of a long context window — a phenomenon documented in research as the “lost in the middle” problem. Retrieving more chunks increases noise, increases cost, and often decreases answer quality.

The right approach:

  1. Retrieve more candidates than you need (e.g., top 20)
  2. Rerank using a cross-encoder model or a fast LLM call to select the best 4–6 chunks
  3. Pass only those to the final generation model

The reranking step is consistently the single highest-leverage improvement in RAG pipelines that were built without it.
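The retrieve-then-rerank shape is easy to express independently of any particular vector store or model. In this sketch, `retriever` and `score_fn` are assumed callables: the retriever wraps whatever search backend you use, and the scorer could be a cross-encoder (e.g. a sentence-transformers CrossEncoder) or a fast LLM call.

```python
def retrieve_then_rerank(query, retriever, score_fn, n_candidates=20, n_final=5):
    """Over-retrieve, rerank, and keep only the best few chunks.

    retriever(query, k) -> list of candidate chunk strings (assumption)
    score_fn(query, chunk) -> float relevance score, higher is better (assumption)
    """
    candidates = retriever(query, n_candidates)
    ranked = sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)
    return ranked[:n_final]
```

The point of the two-stage design is that the cheap first-stage retriever only needs decent recall at k=20, while the expensive scorer only runs on 20 candidates instead of the whole corpus.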

The Generation Step Has Its Own Failure Modes

Once you have good context, generation failures typically fall into two categories.

Hallucination despite good context. This is rarer when retrieval is working well, but it happens — especially when the model is asked to synthesize across multiple chunks, infer unstated information, or answer a question that the retrieved context does not actually address. The fix is prompt engineering: explicitly instruct the model to respond with “I do not have enough information to answer this” when the context is insufficient, and then test that behavior specifically with adversarial queries.
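One way to make that instruction concrete is to bake the refusal behavior into the prompt template itself. The template below is an illustrative sketch, not a recommended canonical wording — the refusal string matters mainly because you will match on it in monitoring and tests.

```python
GROUNDED_ANSWER_PROMPT = """Answer the question using ONLY the context below.
If the context does not contain the information needed, reply exactly:
"I do not have enough information to answer this."

Context:
{context}

Question: {question}
"""

def build_prompt(chunks, question):
    """Join reranked chunks with visible separators and fill the template."""
    context = "\n\n---\n\n".join(chunks)
    return GROUNDED_ANSWER_PROMPT.format(context=context, question=question)
```

Testing this behavior means sending adversarial queries whose answers are deliberately absent from the corpus and asserting that the model produces the exact refusal string rather than a plausible guess.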

Faithfulness versus fluency tension. Models are trained to produce fluent, confident text. They will sometimes reformulate retrieved content in ways that introduce subtle inaccuracies. For high-stakes applications (legal, medical, financial), you need citation-level grounding — the system should cite the specific chunk a claim comes from, and ideally a secondary verification step confirms the claim is actually supported by that source.
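The secondary verification step can be framed as an entailment check per cited claim. Here `entails` is an assumed callable — in practice an NLI model or an LLM judge — and the claim/chunk pairing is whatever your generation step emits as citations.

```python
def verify_citations(claims, chunks, entails):
    """Check that each cited chunk actually supports its claim.

    claims: list of (claim_text, chunk_id) pairs produced by the generator
    chunks: dict mapping chunk_id -> chunk text
    entails(premise, hypothesis) -> bool  (assumption: NLI model or LLM judge)
    """
    return [
        {"claim": claim, "chunk_id": cid, "supported": entails(chunks[cid], claim)}
        for claim, cid in claims
    ]
```

Unsupported claims can then be dropped, flagged for review, or routed back through generation, depending on how high the stakes are.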

Observability Is Not Optional

RAG systems fail silently. A retrieval failure looks like a bad answer, not a system error. Without logging the retrieved chunks alongside queries and responses, you have no visibility into whether the pipeline is functioning correctly.

Minimum viable observability for a production RAG system:

  • Log the query, retrieved chunks with their relevance scores, and the final response
  • Track retrieval latency separately from generation latency
  • Sample-evaluate 50–100 responses per week against a held-out test question set
  • Monitor the rate of “I do not have information on this” responses — a sudden spike usually indicates a retrieval problem, not a model problem
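The first two bullets can be covered by a single structured log record per request. A minimal sketch, assuming a `logger` callable that accepts one JSON line (stdout, a file, or your log shipper), with field names chosen here for illustration:

```python
import json
import time
import uuid

def log_rag_trace(logger, query, chunks, scores, response,
                  retrieval_ms, generation_ms):
    """Emit one structured record per request, so a bad answer can be
    traced back to exactly what was retrieved for it."""
    record = {
        "trace_id": str(uuid.uuid4()),
        "ts": time.time(),
        "query": query,
        "retrieved": [
            {"chunk": c[:200], "score": s}  # truncate chunk text for log size
            for c, s in zip(chunks, scores)
        ],
        "response": response,
        "retrieval_ms": retrieval_ms,    # tracked separately from generation
        "generation_ms": generation_ms,
        "refused": "do not have" in response.lower(),  # crude refusal detector
    }
    logger(json.dumps(record))
    return record
```

The `refused` flag is what makes the fourth bullet cheap to monitor: a dashboard counter over that field surfaces retrieval regressions within hours instead of waiting for user complaints.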

When RAG Is Not the Right Tool

RAG is the right architecture when your knowledge base changes frequently (new documents, updated policies, live data feeds) and you need grounded, factual answers with traceable sources. It is the wrong tool when:

  • The knowledge is stable and compact enough to fine-tune directly into the model weights
  • You need to perform complex multi-step reasoning across many documents rather than targeted lookup (agentic architectures handle this better)
  • End-to-end retrieval and generation latency is prohibitive for your use case

The best RAG systems I have built started from a clear analysis of the actual query distribution — what users genuinely ask, not what designers imagine they will ask. That analysis shapes every architectural decision that follows: chunking strategy, embedding model selection, retrieval depth, and reranking approach. If you are considering a RAG implementation, mapping your real query patterns is where to start.


Building a knowledge retrieval system for your organization? Let’s talk.