Retrieval-augmented generation is the dominant pattern for grounding an LLM in your own data. The shape is straightforward in the docs (chunk your corpus, embed, store in a vector DB, retrieve at query time, feed to an LLM, return the answer); the production reality has more failure modes than the docs let on. I have shipped three RAG systems across different domains and worked on a fourth that did not ship. The patterns of what goes wrong are remarkably consistent.
This is the long-form writeup of the pipeline as I have ended up actually building it, the seven failure modes I have hit and what each one looked like, and the evaluation discipline that has separated the RAG features that worked in production from the ones that demoed well and fell over on real users.
The end-to-end pipeline, drawn out
That is the whole picture. Each box hides a few production decisions; the rest of this article walks through where I have personally hit walls.
Ingest and chunking: where 60% of the quality is decided
The shape of the ingestion is the most consequential decision in the whole system. If you ingest the wrong unit (a whole 80-page PDF as one document), no amount of better embeddings or smarter prompting will save you. The chunks are what the model can actually retrieve; everything downstream is at the mercy of the chunking.
The heuristics that have worked for me, applied in order:
- Respect natural boundaries first. Split on chapter, section, paragraph. The boundary that matters is the unit of meaning a human would describe as "one topic". A chunk that ends mid-sentence is a chunk you do not want.
- Chunk to 200 to 500 tokens. Long enough to carry context, short enough to be retrievable. Below 100 tokens the chunks are too narrow; above 800 retrieval returns big blobs that bury the relevant fact.
- Overlap by 50 to 100 tokens. A fact that straddles a chunk boundary is captured in at least one chunk. Without overlap, the boundary becomes a blind spot.
- Prepend the path or title. A chunk that starts with "Doc: API Reference > Pagination > Cursors" carries enough provenance that a near-duplicate chunk from a different doc does not confuse retrieval.
- Treat tables and code differently. A table chunked into rows loses its structure. The pattern that has worked: keep the table as one chunk (with the schema row included), and include a one-sentence description of what the table is. Code blocks similarly: do not chunk inside a function.
The specific number is less important than the discipline of trying a few and measuring. I have shipped RAG systems where a chunk-size change from 800 to 400 tokens improved retrieval recall by 15%, and I have shipped systems where the same change made things worse because the docs were so dense that 400 tokens was too narrow. The eval loop (covered later) is what tells you which.
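The size, overlap, and path-prefix heuristics above can be sketched in a few lines. This is a minimal illustration, not a production chunker: it assumes the text has already been split on natural boundaries into one section, it uses whitespace-separated words as a stand-in for model tokens, and the function names are hypothetical.

```python
def chunk_tokens(tokens, size=400, overlap=50):
    """Split a token list into windows of `size` tokens that overlap by `overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


def chunk_section(path, text, size=400, overlap=50):
    """Chunk one section and prepend its path so every chunk carries provenance."""
    tokens = text.split()  # assumption: whitespace words approximate model tokens
    return [f"Doc: {path}\n" + " ".join(window)
            for window in chunk_tokens(tokens, size, overlap)]
```

Because size and overlap are plain parameters, trying 400 versus 800 tokens against the eval set is a one-line change.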
Embedding: pick once, plan the migration
The embedding model is the function that maps text to vectors. The quality of your vectors caps the quality of your retrieval; pick a recent, well-evaluated, general-purpose embedding model and re-evaluate annually. The technical details (dimension count, normalization, model architecture) are mostly secondary.
The operational concern that matters more than the choice itself: when the model is updated or replaced, every vector in your index is in a different geometric space. New query vectors cannot meaningfully be compared to old indexed vectors. You must re-embed the entire corpus.
For a small corpus (under a million chunks) this is a few hours and a few hundred dollars. For a large corpus (tens of millions of chunks) this is a multi-day batch job and can be tens of thousands of dollars in API fees alone. Plan the migration before you ship: log which embedding model produced each vector, version the index, and have a budget for re-embedding when the day comes.
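One way to make the migration plannable is to store the embedding model's name next to every vector, so that stale vectors can be listed rather than guessed at. The sketch below assumes an in-memory record type; `StoredVector`, `needs_reembedding`, and the model labels are hypothetical names, and a real system would keep the same field as metadata in the vector DB.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredVector:
    chunk_id: str
    values: tuple
    embedding_model: str  # hypothetical label, e.g. "embed-v2"


def needs_reembedding(records, current_model):
    """Return the chunk IDs whose vectors came from a different embedding model."""
    return [r.chunk_id for r in records if r.embedding_model != current_model]
```

The length of that list, multiplied by the per-chunk embedding price, is the re-embedding budget.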
Retrieval: hybrid, almost always
Pure vector search misses exact matches. A user querying for a specific order ID, a product SKU, a function name, an error code does not want a fuzzy semantic match; they want the chunk containing exactly that string. Pure vector search is a fuzzy semantic matcher and will rank the wrong things first.
Hybrid retrieval (vector + BM25 / keyword search, fused with reciprocal rank fusion) is the production default I now reach for without thinking. The two retrievers have complementary strengths: vectors find semantically similar chunks even when no words overlap; keyword search finds exact-match chunks even when the surrounding language is unusual. Fusing them gets you most of both.
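A minimal sketch of reciprocal rank fusion, assuming each retriever returns a best-first list of chunk IDs. Each chunk's fused score is the sum of 1 / (k + rank) over the lists it appears in, with rank starting at 1; the function name and input shape are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first lists of chunk IDs into one best-first list."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)
```

A chunk that ranks well in both lists beats a chunk that tops only one, which is the behavior hybrid retrieval is after.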
That function is parameter-free in any meaningful sense (the k=60 is a stabilizer, not a tuning knob). I use it everywhere.
Re-ranking: the second-stage filter that buys you precision
Hybrid retrieval (vector + keyword) is good at recall: it surfaces the right chunks somewhere in the top 50. It is less good at putting the very best chunk at rank 1. For features where the user reads only the top 3 to 5 results, or where the LLM only fits 5 to 10 chunks in its prompt, the order within the top-K matters as much as which chunks are in it.
A cross-encoder re-ranker is a smaller model that scores each (query, chunk) pair jointly. Unlike the embedding model (which encodes query and chunks independently and then compares vectors), a cross-encoder reads them together and produces a single relevance score per pair. The trade-off: cross-encoders are slower per pair (you cannot precompute them at index time), but they are more accurate.
The pattern in production:
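A minimal sketch of the two-stage shape, under stated assumptions: `retrieve(query, n)` returns (chunk_id, text) pairs from the first-stage retriever, and `score_pair(query, text)` returns a relevance score from whatever re-ranker is in use. Both are hypothetical callables supplied by the caller, not a specific library's API.

```python
def retrieve_and_rerank(query, retrieve, score_pair, candidates=50, keep=5):
    """Stage 1: fetch a wide candidate pool. Stage 2: re-score it and keep the best few."""
    pool = retrieve(query, candidates)
    scored = [(score_pair(query, text), chunk_id, text) for chunk_id, text in pool]
    scored.sort(key=lambda item: item[0], reverse=True)  # highest relevance first
    return [(chunk_id, text) for _, chunk_id, text in scored[:keep]]
```
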
Fifty cross-encoder calls take, in my experience, 50 to 200 ms total (parallelized) on a hosted reranker API. That is acceptable latency on top of a multi-second LLM call; it would not be acceptable on a sub-second search box. Pick whether to add the second stage based on the latency budget of the feature, not on the assumption that more is always better.
The LLM-as-judge variant ("score this chunk's relevance to this query on a 1-5 scale") works too, but it costs more per query, adds more latency, and varies more across runs at temperature > 0. I default to a dedicated cross-encoder for re-ranking and reserve LLM-as-judge for offline evaluation, where latency and per-query cost matter less.
Prompt construction: the 200 lines of code that shape the answer
The code that takes the retrieved chunks and assembles them into the LLM prompt is small but it does an outsized amount of the work. Five patterns I have settled on:
Cite by ID, not by position. Every chunk in the prompt gets a stable identifier, and the model is instructed to cite by that ID. Do not number chunks [1] [2] [3] in prompt order: if the order changes (because retrieval ranked them differently), positional numbers point at different chunks, whereas citations tied to a chunk's stable ID stay valid.
Quote, do not paraphrase, the chunk in the prompt. The chunk text goes in verbatim, fenced or quoted so the model sees a clear boundary between context and instruction. Paraphrasing or summarizing chunks before passing them in is a tempting compression but introduces a new failure mode (the summary loses the fact the user asked about).
Order matters within the context window. Models tend to attend more to the start and end of the context than the middle. I put the most relevant chunks first, the second-most-relevant chunks last, and the merely-related-but-not-critical chunks in the middle. This is not exact science; it is a hedge based on the lost-in-the-middle effect.
Strict instruction phrasing. "Answer ONLY using the provided context" beats "Try to use the provided context" by a meaningful margin in the eval results I have run. The phrasing of the instruction is a tunable, testable parameter.
Refusal example in the prompt. Including a one-shot example of "if the context does not contain the answer, respond with: 'I do not have enough information to answer that.'" reduces the rate at which the model invents an answer when retrieval came up empty.
A stripped-down version of the prompt template I ship:
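As an illustration of the five patterns above (a sketch, not a verbatim template), a minimal prompt builder might look like this. The chunk ID format, the `<chunk>` fencing, and the helper names are assumptions made for the example, and the ordering helper simplifies "most relevant first, second-most-relevant last" to a single chunk at each end.

```python
REFUSAL = "I do not have enough information to answer that."


def order_for_context(chunks):
    """Given best-first chunks, put the best first, the second-best last, the rest between."""
    if len(chunks) < 3:
        return list(chunks)
    return [chunks[0]] + list(chunks[2:]) + [chunks[1]]


def build_prompt(question, chunks):
    """Assemble a grounded prompt from best-first (chunk_id, text) pairs."""
    blocks = [f'<chunk id="{chunk_id}">\n{text}\n</chunk>'  # verbatim text, clearly fenced
              for chunk_id, text in order_for_context(chunks)]
    return "\n\n".join([
        "Answer ONLY using the provided context.",
        "Cite the id of every chunk you rely on, for example [doc-12#3].",
        f"If the context does not contain the answer, respond with: '{REFUSAL}'",
        "Context:",
        *blocks,
        f"Question: {question}",
    ])
```
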
The shape is dull on purpose. Every flourish I have tried adding (cute role-play framing, encouraging the model to "think step by step", multi-paragraph instructions) has either no measurable effect or a small negative one in the evals. Boring prompts win.
Observability: the metrics I keep on a dashboard
A RAG system in production deserves a dashboard, not just an eval harness. The metrics I have on the dashboard for every shipped RAG feature:
Panel 4 is the early-warning system for retrieval drift. If the top-1 cosine score distribution shifts left over a week (chunks are matching less well), something has changed (the corpus, the embedding model, the query distribution) and you want to know before users do. I have caught two real regressions on this panel before the eval set caught them, and on both occasions the cause was an embedding-model deprecation that started returning slightly different vectors.
Panel 5 is the symmetric early-warning. If the refusal rate spikes, retrieval is failing to find good chunks; if it crashes to zero, the model has stopped honoring the refusal instruction (a prompt regression). Both are bad and the dashboard catches them in hours, not weeks.
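Both early-warning checks reduce to small comparisons over logged values. The sketch below assumes top-1 cosine scores and refusal counts are already being logged per query; the thresholds (`max_drop`, `low`, `high`) are placeholder values to be tuned per system, and the function names are hypothetical.

```python
from statistics import median


def top1_score_drift(baseline_scores, recent_scores, max_drop=0.05):
    """Return (drop, alert): how far the median top-1 score fell, and whether to alert."""
    drop = median(baseline_scores) - median(recent_scores)
    return drop, drop > max_drop


def refusal_rate_alert(refusals, total, low=0.01, high=0.20):
    """Return "too_low", "too_high", or None for the refusal rate over a window."""
    if total == 0:
        return None  # no traffic in the window, nothing to judge
    rate = refusals / total
    if rate < low:
        return "too_low"   # the model may have stopped honoring the refusal instruction
    if rate > high:
        return "too_high"  # retrieval may be failing to find good chunks
    return None
```
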
The seven failure modes I have actually hit
In the order I have hit them, with what each one looked like in the wild:
Failure 1: the answer is correct, but cites the wrong chunk
The model produced a fluent, accurate answer; the cited chunks were unrelated. This was the embedding-model-update case for me. New query embeddings were being compared against old indexed vectors, so retrieval was effectively random, but the LLM was confident enough in its parametric knowledge to answer anyway. The user got a plausible-looking answer with garbage citations.
Fix: log the cosine score of every retrieved chunk; alert when the top-1 score collapses to noise levels. Make it impossible for the system to answer if no chunk is sufficiently similar (return "I do not have enough information" instead).
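A minimal sketch of that gate, assuming retrieval returns (chunk_id, cosine_score) pairs and `generate` is a hypothetical callable wrapping the LLM call. The `min_score` value is a placeholder; the right floor depends on the embedding model and should come from logged score distributions.

```python
REFUSAL = "I do not have enough information to answer that."


def answer_or_refuse(question, hits, generate, min_score=0.30):
    """Generate only from chunks at or above `min_score`; refuse when none qualify."""
    usable = [chunk_id for chunk_id, score in hits if score >= min_score]
    if not usable:
        return REFUSAL  # never fall through to the model's parametric memory
    return generate(question, usable)
```
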
Failure 2: the model ignores the context
The context was retrieved correctly. The answer was wrong. The reason: the model preferred its parametric memory over the provided context, especially when the parametric answer was the more common pattern.
Fix: explicit instruction in the system prompt ("Answer ONLY using the provided context. If the context does not contain the answer, say 'I do not have enough information.'"). Set a low temperature. Test specifically with adversarial cases where the parametric answer differs from the provided one. The fix is mostly prompt discipline but it has to be tested.
Failure 3: ranked chunks all say the same thing
The top 10 retrieved chunks were near-duplicates from different documents (or different versions of the same document). The model saw "the answer is X" ten times and confidently said X. The actual correct answer was buried at rank 12.
Fix: deduplication after retrieval. The pattern that works: pairwise cosine between the top retrieved chunks; if any two are above a threshold (say, 0.95), keep only the one ranked higher. This drops near-duplicates and surfaces variety in the context the LLM sees.
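A minimal sketch of that deduplication, assuming the retrieved chunks arrive best-first as (chunk_id, vector) pairs. It is a greedy variant: each chunk is compared only against the chunks already kept, which preserves the higher-ranked member of any near-duplicate pair. The function names are illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def drop_near_duplicates(ranked, threshold=0.95):
    """Keep a chunk only if it is not too similar to any higher-ranked kept chunk."""
    kept = []
    for chunk_id, vector in ranked:
        if all(cosine(vector, kept_vec) <= threshold for _, kept_vec in kept):
            kept.append((chunk_id, vector))
    return kept
```
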
Failure 4: the model hallucinates citations
The model invented a chunk reference that did not exist in the retrieved set. "According to chunk #4..." but chunk #4 was never given.
Fix: structured output for citations. The model emits citations in a JSON-shaped form keyed to the actual chunk IDs you provided. A post-processing step rejects any citation that does not appear in the retrieval set. If the check fails, rerun the generation, or drop the invalid citation when the answer still has valid ones.
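A minimal sketch of the post-processing check, assuming the model was asked to return JSON shaped like {"answer": "...", "citations": ["id", ...]}. That output shape and the function name are assumptions for the example.

```python
import json


def check_citations(raw_output, retrieved_ids):
    """Split the model's citations into those that were retrieved and those it invented."""
    data = json.loads(raw_output)
    cited = data.get("citations", [])
    valid = [c for c in cited if c in retrieved_ids]
    invented = [c for c in cited if c not in retrieved_ids]
    return data.get("answer", ""), valid, invented
```

A caller can then retry when `valid` is empty and silently drop `invented` otherwise.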
Failure 5: long-context queries return the wrong thing
The user pasted in a long document and asked a question about it. Retrieval returned chunks from the user's document, but also chunks from elsewhere in the corpus that semantically matched. The model conflated the two sources.
Fix: scope retrieval. If the user provided a document, retrieval is restricted to that document only; the corpus is queried only when the user explicitly asks. This is a product choice as much as a technical one.
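In code, the scoping rule is a filter applied before ranking. The sketch below assumes each candidate chunk is a dict with a "doc_id" key and uses a hypothetical `include_corpus` flag for the explicit opt-in; in practice the same rule would usually be expressed as a metadata filter on the vector DB query.

```python
def scope_chunks(chunks, uploaded_doc_id=None, include_corpus=False):
    """Restrict candidates to the user's document unless the corpus was explicitly requested."""
    if uploaded_doc_id is None or include_corpus:
        return list(chunks)
    return [c for c in chunks if c["doc_id"] == uploaded_doc_id]
```
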
Failure 6: retrieval is slow during peak
The vector DB was sized for off-peak load. Peak traffic was 4x off-peak. The DB hit query timeouts. RAG returned no context. The LLM, given an empty context, answered from memory and was wrong.
Fix: load test the vector DB at 2-3x peak before launch. Set a strict timeout on retrieval; if it times out, return a graceful failure ("The system is busy, please try again.") rather than answering without context. Empty-context generation is a failure mode in itself.
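A minimal sketch of the timeout guard using the standard library's thread pool. `retrieve` and `generate` are hypothetical callables, and the half-second deadline is a placeholder; the point is that a timeout and an empty result both short-circuit before the LLM is called.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

BUSY = "The system is busy, please try again."
REFUSAL = "I do not have enough information to answer that."
_pool = ThreadPoolExecutor(max_workers=4)


def retrieve_with_timeout(retrieve, query, timeout_s=0.5):
    """Run `retrieve(query)` under a deadline; return None if it does not finish in time."""
    future = _pool.submit(retrieve, query)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        future.cancel()  # best effort: a call that is already running keeps running
        return None


def answer(query, retrieve, generate, timeout_s=0.5):
    chunks = retrieve_with_timeout(retrieve, query, timeout_s)
    if chunks is None:
        return BUSY     # retrieval timed out: fail gracefully
    if not chunks:
        return REFUSAL  # retrieval worked but found nothing: do not answer from memory
    return generate(query, chunks)
```
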
Failure 7: the corpus is out of date and nobody noticed
The documents were ingested six months ago. The product changed; the docs got rewritten. The RAG index never re-ingested. Users got correct-looking answers about a product that did not exist anymore.
Fix: an ingestion freshness metric. Last-update timestamp per document, alert when any high-traffic source is older than its expected refresh cadence. The index is a living artifact; its freshness needs the same monitoring as any other production data store.
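A minimal sketch of the freshness check, assuming two dicts keyed by source name: the last ingestion timestamp and the expected refresh cadence. The source names and the function name are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def stale_sources(last_ingested, cadence, now=None):
    """Return the sources whose last ingestion is older than their expected cadence."""
    now = now or datetime.now(timezone.utc)
    return sorted(source for source, ingested_at in last_ingested.items()
                  if now - ingested_at > cadence[source])
```

Running this on a schedule and alerting on a non-empty result is enough to turn "nobody noticed" into a page.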
Evaluation: the discipline that keeps RAG honest
All of the failure modes above are caught by a half-decent evaluation harness. RAG without an eval loop is a bet that the demo cases are representative. They are not.
The minimum viable eval setup that I have shipped on every RAG system:
A modest set is 50 to 200 cases. It does not have to be huge to be useful; it has to be representative. Build it from real user queries (sampled, anonymized) plus a handful of adversarial cases per failure mode above.
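As one concrete piece of such a harness, retrieval recall can be computed directly from labeled cases. The sketch assumes each case is a dict with a "query" and a list of known-relevant chunk IDs under "relevant", and that `retrieve(query)` returns a best-first list of chunk IDs; those shapes and names are illustrative.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=10):
    """Fraction of the known-relevant chunks that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    top = set(retrieved_ids[:k])
    return len(top & set(relevant_ids)) / len(relevant_ids)


def mean_recall(cases, retrieve, k=10):
    """Average recall@k over the eval set."""
    scores = [recall_at_k(retrieve(case["query"]), case["relevant"], k) for case in cases]
    return sum(scores) / len(scores) if scores else 0.0
```
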
The metrics I report from the eval harness:
Faithfulness is the single most useful metric in the production loop. An answer that is correct but not grounded in the retrieved context is a failure waiting to happen on a question where the model's parametric knowledge is wrong. Measuring it explicitly (an LLM judge that scores each claim against the retrieved chunks) catches the slow drift between the system being grounded and the system being confidently wrong.
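A minimal sketch of the scoring step, assuming the answer has already been split into individual claims and that `judge(claim, chunks)` is a hypothetical callable (in practice an LLM call) returning True when the claim is supported by the retrieved chunks.

```python
def faithfulness(claims, chunks, judge):
    """Fraction of answer claims the judge marks as supported by the retrieved chunks."""
    if not claims:
        return None  # nothing to score; report separately rather than as 0 or 1
    supported = sum(1 for claim in claims if judge(claim, chunks))
    return supported / len(claims)
```
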
Running the eval set weekly, with a Slack alert when any metric regresses by more than a small threshold, is the difference between a RAG system that improves over time and one that decays.
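The regression check itself is a comparison of two metric snapshots. The sketch below assumes every metric is "higher is better" and uses a placeholder tolerance of 0.02; the metric names are illustrative, and the alerting hook is left to the caller.

```python
def regressions(previous, current, tolerance=0.02):
    """Return {metric: (old, new)} for every metric that dropped by more than `tolerance`."""
    return {name: (previous[name], current[name])
            for name in previous
            if name in current and previous[name] - current[name] > tolerance}
```
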
When RAG is not the right shape
RAG is great. It is not always great. A short list of cases where I have walked away from RAG and used something else:
- The corpus is small and stable. A few thousand short docs that change rarely. Just put them all in the system prompt of a long-context model; you save the entire retrieval pipeline.
- The query needs computation, not retrieval. "How many users signed up last month?" is not a retrieval question; it is a SQL question. RAG over docs about user signups will not answer it correctly.
- The accuracy bar is regulatory. If wrong answers create legal exposure (medical, financial, legal advice), RAG provides grounding but does not guarantee accuracy. The product probably needs human review in the loop, not better retrieval.
- The user really wants browsing, not Q&A. Sometimes the right interface is search results, not a generated answer. RAG is one shape of UI; it is not the right shape for every information need.
The shape of every RAG project I have led
The pattern, in calendar time, that every RAG project I have shipped has followed:
Nothing about that shape is unique to a specific tool or provider; the work is the same with any embedding model, any vector DB, any LLM. The systems that ship are the ones whose teams accept that the second half (weeks 5 through 12) is where the actual quality comes from. The systems that get cancelled are usually the ones whose teams thought the demo was the product.
The single most useful piece of advice
If I could distill the entire experience of building RAG into one sentence, it would be: build the eval loop before you tune anything. Every quality decision (chunk size, retrieval algorithm, reranker model, prompt wording) is a hypothesis, and without an eval loop you cannot tell which hypothesis is right. The teams I have worked with who skipped this step burnt months arguing about prompt phrasings and chunking strategies on the basis of demo-day vibes; the teams who built the eval loop first had answers in days. The eval loop is not the polish step; it is the foundation, and laying it second is laying it backwards.
