Building RAG: The Pipeline and Its Failure Modes

The full RAG pipeline (ingest, chunk, embed, retrieve, generate, evaluate), the seven failure modes I have actually hit, and the eval discipline that has kept my retrieval-augmented features honest in production.

By @weimorales

May 4, 2026

Updated May 18, 2026

Retrieval-augmented generation is the dominant pattern for grounding an LLM in your own data. The shape is straightforward in the docs (chunk your corpus, embed, store in a vector DB, retrieve at query time, feed to an LLM, return the answer); the production reality has more failure modes than the docs let on. I have shipped three RAG systems across different domains and worked on a fourth that did not ship. The patterns of what goes wrong are remarkably consistent.

This is the long-form writeup of the pipeline as I have ended up actually building it, the seven failure modes I have hit and what each one looked like, and the evaluation discipline that has separated the RAG features that worked in production from the ones that demoed well and fell over on real users.

The end-to-end pipeline, drawn out

The RAG pipeline, from a customer's question to a grounded answer
  Ingest          documents arrive (PDFs, docs, wiki, db rows, tickets)
     |
  Chunk           split into 200-500 token pieces with overlap
     |
  Enrich          attach metadata (title, source, doc_id, date, author, tags)
     |
  Embed           call the embedding model on each chunk -> vector
     |
  Store           write {vector, chunk_text, metadata} to a vector DB
     |
  ----            (above is offline; below is per-query)
     |
  Query           user asks something
     |
  Embed query     same embedding model -> query vector
     |
  Retrieve        hybrid (vector + keyword) -> top 50 candidates
     |
  Filter          drop chunks the user is not allowed to see
     |
  Rerank          cross-encoder or LLM judge -> top 10
     |
  Construct       build a prompt with chunks as grounding context
     |
  Generate        call the LLM, stream the answer
     |
  Cite            map answer claims back to chunks
     |
  Log + eval      offline scoring loop, weekly

That is the whole picture. Each box hides a few production decisions; the rest of this article walks through where I have personally hit walls.

Ingest and chunking: where 60% of the quality is decided

The shape of the ingestion is the most consequential decision in the whole system. If you ingest the wrong unit (a whole 80-page PDF as one document), no amount of better embeddings or smarter prompting will save you. The chunks are what the model can actually retrieve; everything downstream is at the mercy of the chunking.

The heuristics that have worked for me, applied in order:

  1. Respect natural boundaries first. Split on chapter, section, paragraph. The boundary that matters is the unit of meaning a human would describe as "one topic". A chunk that ends mid-sentence is a chunk you do not want.
  2. Chunk to 200 to 500 tokens. Long enough to carry context, short enough to be retrievable. Below 100 the chunks are too narrow; above 800 the model retrieves big blobs that bury the relevant fact.
  3. Overlap by 50 to 100 tokens, so that a fact that straddles a chunk boundary is captured whole in at least one chunk. Without overlap, the boundary becomes a blind spot.
  4. Prepend the path or title. A chunk that starts Doc: API Reference > Pagination > Cursors carries enough provenance that a near-duplicate chunk from a different doc does not confuse retrieval.
  5. Treat tables and code differently. A table chunked into rows loses its structure. The pattern that has worked: keep the table as one chunk (with the schema row included), and include a one-sentence description of what the table is. Code blocks similarly: do not chunk inside a function.

The specific number is less important than the discipline of trying a few and measuring. I have shipped RAG systems where a chunk-size change from 800 to 400 tokens improved retrieval recall by 15%, and I have shipped systems where the same change made things worse because the docs were so dense that 400 tokens was too narrow. The eval loop (covered later) is what tells you which.
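As a rough illustration of heuristics 1 to 4, here is a minimal sketch of a paragraph-respecting chunker with overlap and a title prefix. It is not the chunker from any of the systems described here: the function name `chunk_paragraphs` is hypothetical, whitespace-separated words stand in for tokens, and a single paragraph longer than `max_tokens` is kept whole rather than split.

```python
import re

def chunk_paragraphs(text, max_tokens=400, overlap_tokens=50, title=""):
    """Pack whole paragraphs into chunks of roughly max_tokens,
    carrying the tail of each chunk into the next one as overlap."""
    # Assumption: words approximate tokens. A real pipeline would count
    # with the embedding model's own tokenizer.
    paragraphs = [p.split() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current = [], []
    for words in paragraphs:
        # Close the current chunk before it would exceed the budget,
        # so chunk boundaries always fall on paragraph boundaries.
        if current and len(current) + len(words) > max_tokens:
            chunks.append(current)
            current = current[-overlap_tokens:] if overlap_tokens else []
        current = current + words
    if current:
        chunks.append(current)
    # Prepend the document path or title to every chunk for provenance.
    prefix = f"Doc: {title}\n" if title else ""
    return [prefix + " ".join(words) for words in chunks]
```

Note that the overlap tail is cut by word count, so it can begin mid-sentence; whether that matters is exactly the kind of question the eval loop should answer.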

Embedding: pick once, plan the migration

The embedding model is the function that maps text to vectors. The quality of your vectors caps the quality of your retrieval; pick a recent, well-evaluated, general-purpose embedding model and re-evaluate annually. The technical details (dimension count, normalization, model architecture) are mostly secondary.

The operational concern that matters more than the choice itself: when the model is updated or replaced, the new model's vectors live in a different geometric space from the vectors already in your index. New query vectors cannot meaningfully be compared to old indexed vectors. You must re-embed the entire corpus.

For a small corpus (under a million chunks) this is a few hours and a few hundred dollars. For a large corpus (tens of millions of chunks) this is a multi-day batch job and can be tens of thousands of dollars in API fees alone. Plan the migration before you ship: log which embedding model produced each vector, version the index, and have a budget for re-embedding when the day comes.
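One way to make "log which embedding model produced each vector" enforceable is to store a small manifest next to the index and check it on every query. The sketch below assumes a hypothetical `IndexManifest` record and `check_query_compat` helper; neither comes from a specific vector DB.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IndexManifest:
    index_name: str        # versioned name, e.g. "docs-v3" (illustrative)
    embedding_model: str   # identifier of the model that built the index
    embedding_dim: int     # vector dimension stored in the index

class EmbeddingMismatch(RuntimeError):
    """Raised when a query vector cannot be compared to the index."""

def check_query_compat(manifest, query_model, query_vec):
    # Vectors from different models are not comparable, even if the
    # dimensions happen to match, so the model name is checked first.
    if query_model != manifest.embedding_model:
        raise EmbeddingMismatch(
            f"{manifest.index_name} was built with {manifest.embedding_model}, "
            f"query was embedded with {query_model}"
        )
    if len(query_vec) != manifest.embedding_dim:
        raise EmbeddingMismatch(
            f"expected dimension {manifest.embedding_dim}, got {len(query_vec)}"
        )
```

Failing loudly here is a deliberate choice: a mismatch that goes unnoticed produces effectively random retrieval rather than an error.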

Retrieval: hybrid, almost always

Pure vector search misses exact matches. A user querying for a specific order ID, a product SKU, a function name, an error code does not want a fuzzy semantic match; they want the chunk containing exactly that string. Pure vector search is a fuzzy semantic matcher and will rank the wrong things first.

Hybrid retrieval (vector + BM25 / keyword search, fused with reciprocal rank fusion) is the production default I now reach for without thinking. The two retrievers have complementary strengths: vectors find semantically similar chunks even when no words overlap; keyword search finds exact-match chunks even when the surrounding language is unusual. Fusing them gets you most of both.

# the simplest reciprocal rank fusion, applied to two ranked lists
def rrf(rank_lists, k=60):
    """Fuse ranked lists of doc_ids; returns (doc_id, score) pairs, best first."""
    scores = {}
    for ranks in rank_lists:
        # ranks are 1-based in the standard RRF formula: 1 / (k + rank)
        for rank, doc_id in enumerate(ranks, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank)
    return sorted(scores.items(), key=lambda x: -x[1])

vector_top  = vector_search(query_vec, k=50)        # list of doc_ids
keyword_top = bm25_search(query_text, k=50)         # list of doc_ids
fused = rrf([vector_top, keyword_top])[:50]         # top 50 (doc_id, score) pairs

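That function is parameter-free in any meaningful sense: k=60 is the conventional default for reciprocal rank fusion and acts as a stabilizer that damps the influence of the very top ranks, not as a tuning knob. I use it everywhere.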

Re-ranking: the second-stage filter that buys you precision

Hybrid retrieval (vector + keyword) is good at recall: it surfaces the right chunks somewhere in the top 50. It is less good at putting the very best chunk at rank 1. For features where the user reads only the top 3 to 5 results, or where the LLM only fits 5 to 10 chunks in its prompt, the order within the top-K matters as much as which chunks are in it.

A cross-encoder re-ranker is a smaller model that scores each (query, chunk) pair jointly. Unlike the embedding model (which encodes query and chunks independently and then compares vectors), a cross-encoder reads them together and produces a single relevance score per pair. The trade-off: cross-encoders are slower per pair (you cannot precompute them at index time), but they are more accurate.

The pattern in production:

# two-stage retrieval: cheap recall, expensive precision
candidates = hybrid_retrieve(query, k=50)
scored = [(c, cross_encoder_score(query, c.text)) for c in candidates]
scored.sort(key=lambda x: -x[1])
top_k = [c for c, _ in scored[:8]]

Fifty cross-encoder calls take, in my experience, 50 to 200 ms total (parallelized) on a hosted reranker API. That is acceptable latency on top of a multi-second LLM call; it would not be acceptable on a sub-second search box. Pick whether to add the second stage based on the latency budget of the feature, not on the assumption that more is always better.

The LLM-as-judge variant ("score this chunk's relevance to this query on a 1-5 scale") works too, but is more expensive in dollars per query, slower in latency, and varies more across runs at temperature > 0. I default to a dedicated cross-encoder for re-ranking and reserve LLM-as-judge for offline evaluation, where latency and run-to-run variance matter less.

Prompt construction: the 200 lines of code that shape the answer

The code that takes the retrieved chunks and assembles them into the LLM prompt is small but it does an outsized amount of the work. Five patterns I have settled on:

Cite by ID, not by position. Every chunk in the prompt gets a stable identifier, and the model is instructed to cite by that ID. Do not number chunks [1] [2] [3] in prompt order: positional numbers change meaning whenever retrieval ranks the chunks differently, whereas a citation tied to a chunk's stable ID stays valid regardless of order.

Quote, do not paraphrase, the chunk in the prompt. The chunk text goes in verbatim, fenced or quoted so the model sees a clear boundary between context and instruction. Paraphrasing or summarizing chunks before passing them in is a tempting compression but introduces a new failure mode (the summary loses the fact the user asked about).

Order matters within the context window. Models tend to attend more to the start and end of the context than the middle. I put the most relevant chunks first, the second-most-relevant chunks last, and the merely-related-but-not-critical chunks in the middle. This is not exact science; it is a hedge based on the lost-in-the-middle effect.
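The ordering described above can be sketched in a few lines. This is one possible reading of it (best chunk first, second-best last, then alternating inward), and the function name `order_for_context` is hypothetical.

```python
def order_for_context(chunks_by_relevance):
    """Place the best chunk first, the second-best last, and so on inward,
    so the weakest chunks end up in the middle of the context.

    chunks_by_relevance: list sorted from most to least relevant.
    """
    front = chunks_by_relevance[0::2]   # ranks 1, 3, 5, ... fill from the start
    back = chunks_by_relevance[1::2]    # ranks 2, 4, 6, ... fill from the end
    return front + back[::-1]

# Example: with five chunks ranked a (best) to e (worst),
# the context order becomes a, c, e, d, b.
print(order_for_context(["a", "b", "c", "d", "e"]))
```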

Strict instruction phrasing. "Answer ONLY using the provided context" beats "Try to use the provided context" by a meaningful margin in the eval results I have run. The phrasing of the instruction is a tunable, testable parameter.

Refusal example in the prompt. Including a one-shot example of "if the context does not contain the answer, respond with: 'I do not have enough information to answer that.'" reduces the rate at which the model invents an answer when retrieval came up empty.

A stripped-down version of the prompt template I ship:

System prompt template I have settled on for RAG
  You are a helpful assistant. Answer the user's question using ONLY the
  context provided below. Cite sources using the format [chunk_id]. If the
  context does not contain the answer, respond exactly with:
  "I do not have enough information to answer that."

  --- CONTEXT START ---
  [chunk_id: doc_42_chunk_3] (Doc: API Reference > Pagination)
  GET /items?limit=N supports up to N=100. The default is 25.

  [chunk_id: doc_19_chunk_7] (Doc: Performance Tuning)
  Large limits trigger rate-limit checks at 1000 req/min.
  --- CONTEXT END ---

  USER: What is the maximum limit for the items endpoint?

The shape is dull on purpose. Every flourish I have tried adding (cute role-play framing, encouraging the model to "think step by step", multi-paragraph instructions) has had either no measurable effect or a small negative one in the evals. Boring prompts win.
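A minimal sketch of the assembly code for a template of this shape follows. The chunk dictionary keys (`chunk_id`, `path`, `text`) and the function name `build_prompt` are assumptions made for illustration, not the shipped code.

```python
INSTRUCTIONS = (
    "You are a helpful assistant. Answer the user's question using ONLY the\n"
    "context provided below. Cite sources using the format [chunk_id]. If the\n"
    "context does not contain the answer, respond exactly with:\n"
    '"I do not have enough information to answer that."'
)

def build_prompt(question, chunks):
    """chunks: list of dicts with 'chunk_id', 'path', and 'text' keys,
    already in the order they should appear in the context."""
    # Chunk text goes in verbatim, each under a header carrying its stable ID.
    blocks = [
        f"[chunk_id: {c['chunk_id']}] (Doc: {c['path']})\n{c['text']}"
        for c in chunks
    ]
    context = "\n\n".join(blocks)
    return (
        f"{INSTRUCTIONS}\n\n"
        f"--- CONTEXT START ---\n{context}\n--- CONTEXT END ---\n\n"
        f"USER: {question}"
    )
```

Keeping this a pure function of the question and the chunks makes it easy to snapshot-test, which helps when prompt wording is treated as a testable parameter.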

Observability: the metrics I keep on a dashboard

A RAG system in production deserves a dashboard, not just an eval harness. The metrics I have on the dashboard for every shipped RAG feature:

RAG production dashboard, eight panels
  1. Queries per second                       (load)
  2. Retrieval p50 / p95 latency              (vector + BM25 + rerank, separately)
  3. Generation p50 / p95 latency             (LLM call)
  4. Top-1 chunk cosine score, distribution   (retrieval health signal)
  5. Refusal rate                             ("I do not have enough information")
  6. Citation rate                            (responses with at least one valid citation)
  7. Tokens per request, p50 / p95            (cost driver)
  8. Eval scores, weekly                      (faithfulness, correctness, refusal)

Panel 4 is the early-warning system for retrieval drift. If the top-1 cosine score distribution shifts left over a week (chunks are matching less well), something has changed (the corpus, the embedding model, the query distribution) and you want to know before users do. I have caught two real regressions on this panel before the eval set caught them, and on both occasions the cause was an embedding-model deprecation that started returning slightly different vectors.

Panel 5 is the symmetric early-warning. If the refusal rate spikes, retrieval is failing to find good chunks; if it crashes to zero, the model has stopped honoring the refusal instruction (a prompt regression). Both are bad and the dashboard catches them in hours, not weeks.
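The checks behind panels 4 and 5 can be approximated with very little code. The sketch below compares median top-1 scores between a baseline window and a recent window, and flags a refusal rate outside a band. The thresholds (`max_drop`, `low`, `high`) are placeholder values to be calibrated per system, and both function names are hypothetical.

```python
from statistics import median

def top1_score_drift(baseline_scores, recent_scores, max_drop=0.05):
    """Return (drifted, drop), where drop is how far the median top-1
    similarity fell from the baseline window to the recent window."""
    drop = median(baseline_scores) - median(recent_scores)
    return drop > max_drop, drop

def refusal_rate_alert(refusals, total, low=0.01, high=0.20):
    """Return 'too_high', 'too_low', or None for a window of requests."""
    if total == 0:
        return None
    rate = refusals / total
    if rate > high:
        return "too_high"   # retrieval is probably failing to find good chunks
    if rate < low:
        return "too_low"    # the refusal instruction may no longer be honored
    return None
```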

The seven failure modes I have actually hit

In the order I have hit them, with what each one looked like in the wild:

Failure 1: the answer is correct, but cites the wrong chunk

The model produced a fluent, accurate answer; the cited chunks were unrelated. This was the embedding-model-update case for me. New embeddings, old indexed vectors, retrieval was effectively random, but the LLM was confident enough in its parametric knowledge to answer anyway. The user got a plausible-looking answer with garbage citations.

Fix: log the cosine score of every retrieved chunk; alert when the top-1 score collapses to noise levels. Make it impossible for the system to answer if no chunk is sufficiently similar (return "I do not have enough information" instead).
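A minimal sketch of that gate, assuming retrieval returns (chunk, similarity) pairs sorted best first. The `min_top1` value is a placeholder; the right floor depends on the embedding model and has to be read off the logged score distribution. `answer_or_refuse` and the `generate` callable are hypothetical names.

```python
REFUSAL = "I do not have enough information to answer that."

def answer_or_refuse(scored_chunks, generate, min_top1=0.3):
    """scored_chunks: list of (chunk, similarity) pairs, best first.
    generate: callable that takes the list of chunks and returns an answer."""
    # Refuse outright when retrieval found nothing, or nothing similar enough,
    # instead of letting the LLM answer from parametric memory.
    if not scored_chunks or scored_chunks[0][1] < min_top1:
        return REFUSAL
    return generate([chunk for chunk, _ in scored_chunks])
```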

Failure 2: the model ignores the context

The context was retrieved correctly. The answer was wrong. The reason: the model preferred its parametric memory over the provided context, especially when the parametric answer is a more common pattern.

Fix: explicit instruction in the system prompt ("Answer ONLY using the provided context. If the context does not contain the answer, say 'I do not have enough information.'"). Set a low temperature. Test specifically with adversarial cases where the parametric answer differs from the provided one. The fix is mostly prompt discipline but it has to be tested.

Failure 3: ranked chunks all say the same thing

The top 10 retrieved chunks were near-duplicates from different documents (or different versions of the same document). The model saw "the answer is X" ten times and confidently said X. The actual correct answer was buried at rank 12.

Fix: deduplication after retrieval. The pattern that works: pairwise cosine between the top retrieved chunks; if any two are above a threshold (say, 0.95), keep only the one ranked higher. This drops near-duplicates and surfaces variety in the context the LLM sees.
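Here is a small, dependency-free sketch of that deduplication step. It walks the ranked list greedily and compares each chunk only against chunks already kept, which is a slight simplification of the all-pairs rule. The names `cosine` and `dedupe` are illustrative, and 0.95 is the example threshold from the text, not a tuned value.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def dedupe(ranked, threshold=0.95):
    """ranked: list of (chunk_id, vector) pairs, best first.
    Keeps the higher-ranked member of each near-duplicate group."""
    kept = []
    for chunk_id, vec in ranked:
        # Keep a chunk only if it is not too similar to anything kept so far.
        if all(cosine(vec, kept_vec) < threshold for _, kept_vec in kept):
            kept.append((chunk_id, vec))
    return kept
```

Because this runs on a few dozen candidates at most, the quadratic comparison cost is negligible next to the LLM call.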

Failure 4: the model hallucinates citations

The model invented a chunk reference that did not exist in the retrieved set. "According to chunk #4..." but chunk #4 was never given.

Fix: structured output for citations. The model emits citations in a JSON-shaped form keyed to the actual chunk IDs you provided. A post-processing step rejects any citation that does not appear in the retrieved set. If the check fails, rerun the generation (or, if the answer also carries valid citations, drop only the invalid ones).
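The post-processing check might look like the sketch below. The JSON shape (`{"answer": ..., "citations": [...]}`) is an assumed output contract for illustration, and the helper names are hypothetical.

```python
import json

def validate_citations(model_output, retrieved_ids):
    """model_output: JSON string such as
    {"answer": "...", "citations": ["doc_42_chunk_3"]}.
    Returns (answer, valid_citations, invalid_citations)."""
    data = json.loads(model_output)
    cited = data.get("citations", [])
    allowed = set(retrieved_ids)
    valid = [c for c in cited if c in allowed]
    invalid = [c for c in cited if c not in allowed]
    return data.get("answer", ""), valid, invalid

def needs_regeneration(valid, invalid):
    # Rerun only when invented IDs are all the answer has; if valid
    # citations survive, the invalid ones can simply be dropped.
    return bool(invalid) and not valid
```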

Failure 5: long-context queries return the wrong thing

The user pasted in a long document and asked a question about it. Retrieval returned chunks from the user's document, but also chunks from elsewhere in the corpus that semantically matched. The model conflated the two sources.

Fix: scope retrieval. If the user provided a document, retrieval is restricted to that document only; the corpus is queried only when the user explicitly asks. This is a product choice as much as a technical one.

Failure 6: retrieval is slow during peak

The vector DB was sized for off-peak load. Peak traffic was 4x off-peak. The DB hit query timeouts. RAG returned no context. The LLM, given an empty context, answered from memory and was wrong.

Fix: load test the vector DB at 2-3x peak before launch. Set a strict timeout on retrieval; if it times out, return a graceful failure ("The system is busy, please try again.") rather than answering without context. Empty-context generation is a failure mode in itself.
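A sketch of the strict-timeout wrapper, using only the standard library. The `retrieve` and `generate` callables, the pool size, and the 0.5 s default are assumptions; note also that a timed-out thread keeps running in the background, so a real client should enforce its own network timeout as well.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

BUSY = "The system is busy, please try again."
REFUSAL = "I do not have enough information to answer that."

_pool = ThreadPoolExecutor(max_workers=8)

def answer_with_timeout(query, retrieve, generate, timeout_s=0.5):
    """Never call the LLM without context: fail gracefully instead."""
    future = _pool.submit(retrieve, query)
    try:
        chunks = future.result(timeout=timeout_s)
    except FutureTimeout:
        future.cancel()   # only effective if the call has not started yet
        return BUSY
    if not chunks:
        # Empty context is treated as a refusal, not as a prompt to guess.
        return REFUSAL
    return generate(query, chunks)
```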

Failure 7: the corpus is out of date and nobody noticed

The documents were ingested six months ago. The product changed; the docs got rewritten. The RAG index never re-ingested. Users got correct-looking answers about a product that did not exist anymore.

Fix: an ingestion freshness metric. Last-update timestamp per document, alert when any high-traffic source is older than its expected refresh cadence. The index is a living artifact; its freshness needs the same monitoring as any other production data store.
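The freshness check itself is a few lines once the timestamps are recorded. In this sketch the source names, the per-source cadence mapping, and the function name `stale_sources` are all hypothetical.

```python
from datetime import datetime, timedelta, timezone

def stale_sources(last_ingested, expected_cadence, now=None):
    """last_ingested: {source: timezone-aware datetime of last ingest}
    expected_cadence: {source: timedelta the source should be refreshed within}
    Returns the overdue sources, most overdue first."""
    now = now or datetime.now(timezone.utc)
    overdue = {
        source: (now - ts) - expected_cadence[source]
        for source, ts in last_ingested.items()
        if now - ts > expected_cadence[source]
    }
    return sorted(overdue, key=overdue.get, reverse=True)
```

Running this on a schedule and alerting on a non-empty result gives the index the same staleness monitoring as any other data store.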

Evaluation: the discipline that keeps RAG honest

All of the failure modes above are caught by a half-decent evaluation harness. RAG without an eval loop is a bet that the demo cases are representative. They are not.

The minimum viable eval setup, which I now ship with every RAG system:

The minimum viable RAG eval, six fields per case
  query             the user input as it would arrive
  expected_chunks   IDs (or excerpts) that should be retrieved
  expected_answer   a short ground-truth or rubric
  forbidden_facts   (optional) facts the answer must NOT contain
  metadata          source domain, query type, expected difficulty
  notes             why this case is in the set

A modest set is 50 to 200 cases. It does not have to be huge to be useful; it has to be representative. Build it from real user queries (sampled, anonymized) plus a handful of adversarial cases per failure mode above.

The metrics I report from the eval harness:

RAG eval metrics, in order of how often I check them
  Retrieval hit rate    percent of cases where the expected chunk was in top-K
  Faithfulness          does the answer use only the retrieved context?
  Answer correctness    does the answer match the expected one (rubric or LLM-judge)?
  Refusal correctness   when no answer is possible, does the system refuse cleanly?
  P95 latency           per case, over a few runs
  Cost per query        input tokens + output tokens, in dollars
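To make the case format and the first metric concrete, here is a minimal sketch. The `EvalCase` dataclass mirrors the six fields listed earlier; `retrieval_hit_rate` counts a hit when at least one expected chunk ID appears in the top-K, which is one reasonable reading of the metric. The `retrieve` callable (query in, ranked chunk IDs out) is an assumption.

```python
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    query: str
    expected_chunks: list            # chunk IDs that should be retrieved
    expected_answer: str             # short ground truth or rubric
    forbidden_facts: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)
    notes: str = ""

def retrieval_hit_rate(cases, retrieve, k=10):
    """Fraction of cases where any expected chunk ID is in the top-k."""
    hits = 0
    for case in cases:
        top = set(retrieve(case.query)[:k])
        if top & set(case.expected_chunks):
            hits += 1
    return hits / len(cases) if cases else 0.0
```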

Faithfulness is the single most useful metric in the production loop. An answer that is correct but not grounded in the retrieved context is a failure waiting to happen on a question where the model's parametric knowledge is wrong. Measuring it explicitly (an LLM judge that scores each claim against the retrieved chunks) catches the slow drift between the system being grounded and the system being confidently wrong.

Running the eval set weekly, with a Slack alert when any metric regresses by more than a small threshold, is the difference between a RAG system that improves over time and one that decays.
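The regression check that feeds such an alert can be sketched as a comparison of two metric dictionaries. The 2% relative tolerance, the metric names, and the function name `find_regressions` are placeholders; posting the result to a chat channel is left out.

```python
def find_regressions(previous, current, tolerance=0.02,
                     lower_is_better=("p95_latency", "cost_per_query")):
    """Return {metric: (previous, current)} for every metric that got
    worse by more than `tolerance` (relative) since the last run."""
    regressed = {}
    for name, old in previous.items():
        new = current.get(name)
        if new is None:
            continue   # metric not reported this run; handle separately
        if name in lower_is_better:
            worse = new > old * (1 + tolerance)
        else:
            worse = new < old * (1 - tolerance)
        if worse:
            regressed[name] = (old, new)
    return regressed
```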

When RAG is not the right shape

RAG is great. It is not always great. A short list of cases where I have walked away from RAG and used something else:

  • The corpus is small and stable. A few thousand short docs that change rarely. Just put them all in the system prompt of a long-context model; you save the entire retrieval pipeline.
  • The query needs computation, not retrieval. "How many users signed up last month?" is not a retrieval question; it is a SQL question. RAG over docs about user signups will not answer it correctly.
  • The accuracy bar is regulatory. If wrong answers create legal exposure (medical, financial, legal advice), RAG provides grounding but does not guarantee accuracy. The product probably needs human review in the loop, not better retrieval.
  • The user really wants browsing, not Q&A. Sometimes the right interface is search results, not a generated answer. RAG is one shape of UI; it is not the right shape for every information need.

The shape of every RAG project I have led

The pattern, in calendar time, that every RAG project I have shipped has followed:

RAG project shape, weeks 1-12
  Week 1       Ingest a small corpus, naive chunking, vector-only retrieval. Demo.
  Week 2       The demo wows the team. Real users reveal failure mode 1 or 2.
  Weeks 3-4    Hybrid retrieval, citation discipline, refusal handling.
               The system goes from impressive-but-shaky to actually useful.
  Weeks 5-6    Build the eval set from observed failures.
               First weekly run reveals 3 more failure modes.
  Weeks 7-9    Tune chunking, add re-ranking, refine prompt. Eval scores climb.
  Weeks 10-12  Production hardening: rate limits, scope filters, freshness monitoring.
               Ship to a beta cohort.

Nothing about that shape is unique to a specific tool or provider; the work is the same with any embedding model, any vector DB, any LLM. The systems that ship are the ones whose teams accept that the second half (weeks 5 through 12) is where the actual quality comes from. The systems that get cancelled are usually the ones whose teams thought the demo was the product.

The single most useful piece of advice

If I could distill the entire experience of building RAG into one sentence, it would be: build the eval loop before you tune anything. Every quality decision (chunk size, retrieval algorithm, reranker model, prompt wording) is a hypothesis, and without an eval loop you cannot tell which hypothesis is right. The teams I have worked with who skipped this step burnt months arguing about prompt phrasings and chunking strategies on the basis of demo-day vibes; the teams who built the eval loop first had answers in days. The eval loop is not the polish step; it is the foundation, and laying it second is laying it backwards.
