
How to Build a RAG Pipeline from Scratch in 2026

Brixnex Editorial

April 14, 2026 · 18 min read

RAG Python LangChain

Why Most RAG Implementations Underperform

Retrieval-Augmented Generation is one of the most over-hyped and under-engineered patterns in the current AI stack. The basic idea — retrieve relevant documents, inject them into the prompt, generate an answer — is genuinely useful. The gap between a working prototype and a production system that users trust is larger than most teams expect when they start building. [original RAG paper]

The failure mode I see repeatedly: teams implement the simplest possible version, see that it mostly works, and ship it. Then users find the edge cases. The questions where retrieval returns the wrong documents. The cases where the right documents exist but the answer is not in any single chunk. The long documents where the relevant sentence is surrounded by irrelevant noise that confuses the model.

The three failure modes that account for the majority of underperforming RAG deployments are poor chunking, mismatched embedding models, and missing re-ranking. Poor chunking — typically fixed-size splitting at 512 tokens regardless of document structure — breaks semantic coherence across chunk boundaries, causing the retriever to return chunks that contain the right keywords but lack the surrounding context needed to answer the question. Mismatched embedding models occur when teams use a general-purpose embedding model for domain-specific content where a domain-fine-tuned model would significantly outperform it. Missing re-ranking leaves the final context assembly to the retriever's raw similarity score, which correlates imperfectly with actual answer relevance.

The pipeline architecture matters as much as the components. A retrieval pipeline that fetches 50 candidates and re-ranks them to 5 generally outperforms one that fetches 5 directly. A cross-encoder re-ranker reads the query and the document together and applies full bidirectional attention across both, which gives a more accurate relevance signal than comparing two independently computed embeddings by dot product, as the first-stage ANN retriever does. This two-stage retrieve-then-rerank pattern is now the standard recommendation for production RAG systems, and frameworks like LlamaIndex and LangChain support it as a configurable option.

Chunking Strategy Determines Retrieval Quality

How you split documents into chunks is the most consequential decision in a RAG pipeline, and it gets the least attention. Fixed-size chunking — split every 512 tokens, done — is the wrong default for most document types. It breaks sentences mid-thought, separates tables from their headers, and splits code examples in ways that make them meaningless.

The approach that works better: chunk by semantic unit. Paragraphs for prose. Sections with headers for technical documentation. Function-level chunks for code. Chunk overlap matters too — zero overlap means information at the boundary of a chunk will never appear in full context. Ten to fifteen percent of chunk size is a reasonable starting point, tuned against your evaluation set.

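# RecursiveCharacterTextSplitter lives in the langchain-text-splitters package;
# older LangChain releases exposed it as langchain.text_splitter.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,        # measured in characters by default, not tokens
    chunk_overlap=100,     # about 12% of chunk_size
    # Try paragraph breaks first, then line breaks, sentences, and words.
    separators=["\n\n", "\n", ". ", " "],
    keep_separator=True,
)
# `documents` is a list of LangChain Document objects loaded earlier.
chunks = splitter.split_documents(documents)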
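To make the idea of chunking by semantic unit concrete without any framework, here is a minimal sketch that packs whole paragraphs into chunks and carries the last paragraph forward as overlap. The function name `chunk_paragraphs` and its parameters are my own illustration, not part of LangChain, and it assumes paragraphs are separated by blank lines.

```python
def chunk_paragraphs(text: str, max_chars: int = 800,
                     overlap_paragraphs: int = 1) -> list[str]:
    """Pack whole paragraphs into chunks of roughly max_chars characters.

    Consecutive chunks share their boundary paragraphs, so a sentence near a
    boundary still appears with context on both sides. A single paragraph
    longer than max_chars is kept whole rather than cut mid-thought.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        candidate = "\n\n".join(current + [para])
        if current and len(candidate) > max_chars:
            # Close the current chunk and carry its tail forward as overlap.
            chunks.append("\n\n".join(current))
            current = current[-overlap_paragraphs:] if overlap_paragraphs > 0 else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


if __name__ == "__main__":
    sample = "\n\n".join(letter * 300 for letter in "ABCD")
    for chunk in chunk_paragraphs(sample, max_chars=700):
        print(len(chunk), chunk[:1], chunk[-1:])
```

Overlap here is counted in paragraphs rather than characters, which keeps boundaries clean at the cost of less precise sizing. Treat both numbers as starting points to tune against your evaluation set.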

Embedding Models Are Not All Equal for Your Domain

The default embedding model works for general-purpose retrieval. For specialised domains — medical, legal, code, scientific literature — domain-specific embedding models outperform general models meaningfully. The improvement is often 10 to 20 percentage points on retrieval accuracy for technical queries. If you're in a specialised domain and have not evaluated domain-specific embedding models, it's worth spending a day on that before optimising anything else. See our LoRA fine-tuning guide.
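A day of evaluation mostly means running the same labelled queries through each candidate model and comparing how often the right document comes back. The sketch below shows one way to structure that harness. Everything in it is illustrative: `hit_rate_at_k` is a hypothetical helper, and `keyword_embedder` is a toy stand-in for a real embedding model so the example runs without downloads. In practice you would pass in a function that wraps your actual model.

```python
import math
from typing import Callable, Sequence

Embedder = Callable[[str], Sequence[float]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hit_rate_at_k(embed: Embedder, corpus: dict[str, str],
                  labelled: list[tuple[str, str]], k: int = 3) -> float:
    """Fraction of queries whose labelled document id appears in the top k."""
    doc_vectors = {doc_id: embed(text) for doc_id, text in corpus.items()}
    hits = 0
    for query, relevant_id in labelled:
        query_vector = embed(query)
        ranked = sorted(doc_vectors,
                        key=lambda d: cosine(query_vector, doc_vectors[d]),
                        reverse=True)
        hits += relevant_id in ranked[:k]
    return hits / len(labelled)


def keyword_embedder(vocab: list[str]) -> Embedder:
    """Toy embedder (assumption for the demo): counts of vocabulary terms."""
    def embed(text: str) -> list[float]:
        words = text.lower().split()
        return [float(words.count(term)) for term in vocab]
    return embed


if __name__ == "__main__":
    corpus = {
        "d1": "statute of limitations for contract claims",
        "d2": "dosage guidance for pediatric patients",
    }
    labelled = [("contract limitations period", "d1"), ("pediatric dosage", "d2")]
    general = keyword_embedder(["guidance", "claims"])
    domain = keyword_embedder(["contract", "limitations", "pediatric", "dosage"])
    print("general:", hit_rate_at_k(general, corpus, labelled, k=1))
    print("domain: ", hit_rate_at_k(domain, corpus, labelled, k=1))
```

The toy numbers mean nothing on their own. The point is the shape of the comparison: same corpus, same labelled queries, one metric, and only the embedder changes.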

Hybrid Search: Dense Plus Sparse

Pure semantic search misses exact keyword matches. A user asking about a specific regulation or product name wants documents containing that exact string, not just semantically similar ones. Pure BM25 keyword search has the opposite weakness: it misses queries where the user's wording differs from the document's wording. Hybrid search combines dense vector retrieval with sparse keyword retrieval like BM25, fuses the results, and consistently outperforms either approach alone across diverse query types. [hybrid search explained]

The standard way to merge the two ranked lists is Reciprocal Rank Fusion (RRF), which scores each document by summing the reciprocal of its rank in each list. Because RRF works on ranks rather than raw scores, it behaves well across different result distributions and needs no tuned weight between the two signals. Most managed search and vector databases (Weaviate, Elasticsearch with vector support, and Azure AI Search) now support hybrid queries natively.

Where your system fuses by weighted score instead and exposes an alpha parameter for the balance between dense and sparse retrieval, tune it against your query distribution. Knowledge-base Q&A with precise terminology benefits from a higher sparse weight; broad semantic search over unstructured documents benefits from a higher dense weight. Running offline evaluation on a labelled query set before deploying to production pays dividends quickly.

The practical implementation depends on your vector database. Weaviate supports hybrid search natively with a tunable alpha parameter balancing dense and sparse contributions. Elasticsearch and OpenSearch support kNN vector search alongside BM25, with RRF available as a built-in rank fusion operator. Qdrant's sparse vector support allows BM25 scores to be represented as sparse vectors and combined with dense embeddings through a unified query interface. For production deployments where retrieval quality directly impacts user satisfaction, running hybrid search evaluation against your specific query distribution before committing to a single retrieval strategy is worth the engineering investment.
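If your database does not offer rank fusion, RRF is short enough to write yourself. The sketch below is a plain-Python version, assuming each retriever returns document ids ordered best first. The function name is my own, and the constant `k = 60` is the value commonly used for RRF rather than anything this pipeline requires.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document ids, best first.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Documents missing from a list add nothing for it.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


if __name__ == "__main__":
    dense_hits = ["a", "b", "c"]    # hypothetical vector search results
    sparse_hits = ["c", "a", "d"]   # hypothetical BM25 results
    print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
```

Documents that both retrievers rank highly rise to the top, while a document found by only one retriever still survives into the merged list. That is usually the behaviour you want before handing candidates to a re-ranker.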

Re-ranking: The Step That Pays Back Its Cost

The initial retrieval step is fast but imprecise. A cross-encoder re-ranker — a model that scores the relevance of a query-document pair — is slower but much more accurate. The pattern that works: retrieve twenty to fifty candidates with fast vector search, re-rank with a cross-encoder, keep the top three to five for the generation context. This two-stage approach consistently outperforms using only initial retrieval. [HuggingFace cross-encoders]

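from sentence_transformers import CrossEncoder

# Load once at startup; the model scores each (query, passage) pair jointly.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
# `query` is the user question; `candidates` are the first-stage retrieved chunks.
pairs = [(query, chunk.page_content) for chunk in candidates]
scores = reranker.predict(pairs)
# Sort on the score only, so tied scores never fall back to comparing chunks.
reranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
top_k = [doc for _, doc in reranked[:5]]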

Evaluating Your RAG Pipeline

The metrics that matter: retrieval precision (are the retrieved documents relevant), retrieval recall (did you find all the relevant documents), answer faithfulness (is the generated answer supported by the retrieved documents), and answer relevance (does the answer actually address the question). RAGAS is a widely used open-source evaluation framework for these metrics and is worth adopting.

Build an evaluation set of 100 to 200 question-answer pairs with known correct answers before you start optimising. Run your evaluation after every significant change. Without this, you will make changes that improve one thing and break another without knowing it.
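The two retrieval metrics are simple enough to compute by hand, which is a useful sanity check alongside a framework. This is a minimal sketch, assuming your evaluation set records the ids of the relevant documents for each question. The function names are illustrative. Faithfulness and answer relevance need a judge model and are not covered here.

```python
def precision_recall_at_k(retrieved: list[str], relevant: set[str],
                          k: int) -> tuple[float, float]:
    """Precision and recall over the top-k retrieved document ids."""
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def mean_precision_recall(runs: list[tuple[list[str], set[str]]],
                          k: int) -> tuple[float, float]:
    """Average precision@k and recall@k across an evaluation set."""
    if not runs:
        return 0.0, 0.0
    pairs = [precision_recall_at_k(retrieved, relevant, k)
             for retrieved, relevant in runs]
    mean_precision = sum(p for p, _ in pairs) / len(pairs)
    mean_recall = sum(r for _, r in pairs) / len(pairs)
    return mean_precision, mean_recall


if __name__ == "__main__":
    retrieved = ["d3", "d1", "d7", "d2"]   # hypothetical retriever output
    relevant = {"d1", "d2"}                # labelled ground truth
    print(precision_recall_at_k(retrieved, relevant, k=3))
```

Logging these two numbers after every change is what catches a chunking tweak that raises precision while quietly dropping recall.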

Common RAG Failure Modes in Production

Context stuffing: injecting too many chunks that dilute the relevant signal. The model's attention spreads across irrelevant content and answer quality drops. Keep your context window focused — three to five highly relevant chunks is almost always better than fifteen loosely relevant ones.

Hallucination on gaps: when no retrieved document contains the answer, some models will hallucinate one anyway. Add an explicit fallback response for this case.

Stale retrieval: documents get updated but the index doesn't. Schedule index refreshes and monitor for retrieval-generation mismatches in production.

These three failure modes account for most user complaints in deployed RAG systems.
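Each of these failure modes has a cheap guard. The sketch below shows one possible shape for all three: a cap on context size, a fallback when nothing clears a relevance threshold, and a content-hash check that flags documents whose indexed version is out of date. The names, the fallback wording, and the `min_score` threshold are assumptions for illustration. Re-ranker scores are not calibrated probabilities, so the threshold has to be chosen from your own evaluation data.

```python
import hashlib
from typing import Callable

FALLBACK = "I couldn't find this in the available documents."


def select_context(scored_chunks: list[tuple[float, str]], min_score: float,
                   max_chunks: int = 5) -> list[str]:
    """Keep at most max_chunks chunks whose score clears min_score, best first."""
    ordered = sorted(scored_chunks, key=lambda pair: pair[0], reverse=True)
    return [chunk for score, chunk in ordered if score >= min_score][:max_chunks]


def answer_or_fallback(scored_chunks: list[tuple[float, str]], min_score: float,
                       generate: Callable[[list[str]], str]) -> str:
    """Call the generator only when some chunk is relevant enough."""
    context = select_context(scored_chunks, min_score)
    return generate(context) if context else FALLBACK


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def stale_ids(current_docs: dict[str, str], indexed_hashes: dict[str, str]) -> list[str]:
    """Ids of documents that are new or changed since they were indexed."""
    return [doc_id for doc_id, text in current_docs.items()
            if indexed_hashes.get(doc_id) != content_hash(text)]


if __name__ == "__main__":
    def fake_llm(chunks: list[str]) -> str:
        return " | ".join(chunks)   # stand-in for a real generation call

    print(answer_or_fallback([(0.1, "unrelated text")], 0.5, fake_llm))
    print(stale_ids({"a": "new text"}, {"a": content_hash("old text")}))
```

Storing the hash alongside each indexed document makes the staleness check a cheap scheduled job rather than a full re-index.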

Query Transformation Techniques

The query the user types is often not the best query for retrieval. Users phrase things conversationally, use ambiguous pronouns, leave context implicit, or ask follow-up questions that only make sense in the context of earlier turns. Query transformation — using an LLM to rewrite, expand, or decompose the user's query before retrieval — consistently improves retrieval recall in conversational applications.

The simplest version: have the model rewrite the query as a standalone question that includes context from earlier in the conversation. More sophisticated versions include hypothetical document generation (generate what a good answer would look like and use that as the retrieval query), query decomposition (break a complex question into sub-questions and retrieve for each), and query expansion (generate related queries and merge the results). Each adds latency, but for applications where retrieval quality matters, the tradeoff is usually worth it.
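Two of these techniques can be sketched without committing to any particular model API. The first function builds the prompt for a standalone rewrite; the second runs retrieval for the original query plus its expansions and merges the results. The `expand` and `retrieve` callables are hypothetical placeholders for your own LLM call and retriever, and the prompt wording is only a starting point.

```python
from typing import Callable


def build_rewrite_prompt(history: list[tuple[str, str]], question: str) -> str:
    """Prompt asking a model to rewrite a follow-up as a standalone question."""
    turns = "\n".join(f"{role}: {text}" for role, text in history)
    return (
        "Rewrite the final user question as a standalone question, resolving "
        "pronouns and implicit references using the conversation.\n\n"
        f"Conversation:\n{turns}\n\n"
        f"Final question: {question}\n\n"
        "Standalone question:"
    )


def retrieve_with_expansion(question: str,
                            expand: Callable[[str], list[str]],
                            retrieve: Callable[[str], list[str]],
                            per_query_k: int = 5) -> list[str]:
    """Retrieve for the question and each expansion; merge without duplicates.

    Results keep first-seen order, so hits for the original question come first.
    """
    seen: set[str] = set()
    merged: list[str] = []
    for query in [question, *expand(question)]:
        for doc_id in retrieve(query)[:per_query_k]:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


if __name__ == "__main__":
    history = [("user", "Tell me about the Acme refund policy."),
               ("assistant", "Refunds are available within 30 days.")]
    print(build_rewrite_prompt(history, "Does it apply to gift cards?"))
```

The merged list is a natural input to a fusion or re-ranking step, since first-seen order is only a rough proxy for relevance.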

When Not to Use RAG

RAG is not the right architecture for every knowledge retrieval problem. If your knowledge base is small enough to fit in the context window, retrieval is not adding value — just include the documents directly. If your users are asking questions where the answer requires synthesising across many documents simultaneously, retrieval will struggle to assemble the right context. And if the latency budget is very tight, the retrieval step adds meaningful overhead that simpler architectures avoid.

RAG earns its complexity when: your knowledge base is large enough that you can't include all of it, the relevant information for any given query is a small fraction of the total, and your users' questions are grounded enough that good retrieval can identify the right context reliably. If those three conditions are met, RAG is the right tool. If they're not all met, the simpler architecture will serve you better.
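The first condition is easy to check before writing any retrieval code. Here is a rough sketch, assuming about four characters per token, which is a crude heuristic that varies by tokenizer and language. The function names and the `reserved_tokens` budget for the prompt and the answer are illustrative.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; use the model's tokenizer for real counts."""
    return int(len(text) / chars_per_token) + 1


def fits_in_context(documents: list[str], context_window_tokens: int,
                    reserved_tokens: int = 2000) -> bool:
    """True if every document fits in the window with room left for the answer."""
    total = sum(estimate_tokens(doc) for doc in documents)
    return total <= context_window_tokens - reserved_tokens


if __name__ == "__main__":
    small_kb = ["policy text " * 100, "faq text " * 100]
    print(fits_in_context(small_kb, context_window_tokens=8000))
```

If this returns True with comfortable headroom, including the documents directly is the simpler architecture to try first.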

References & Further Reading

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al., 2020). Foundational RAG paper from Facebook AI.
RAGAS: Automated Evaluation of Retrieval Augmented Generation (Es et al., 2023). Metrics framework for evaluating RAG pipeline quality.
Dense Passage Retrieval for Open-Domain Question Answering (Karpukhin et al., 2020). Key paper on dense retrieval methods for RAG.
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection (Asai et al., 2023). Adaptive RAG system that evaluates its own retrieval quality.

Frequently Asked Questions

What is RAG (Retrieval-Augmented Generation)?

RAG is an AI architecture that enhances LLM outputs by retrieving relevant documents from an external knowledge base and including them in the model's context at inference time. Rather than relying solely on the model's parametric knowledge (baked into weights during training), RAG grounds responses in retrieved evidence — enabling accurate answers on proprietary, recent, or domain-specific content the model was not trained on.

When should I use RAG vs fine-tuning?

Use RAG when: your knowledge base is large or frequently updated, you need source citations for answers, your content is proprietary and cannot be in training data, or you need to reduce hallucination on specific factual domains. Use fine-tuning when: you need a consistent output style or format, you want to teach the model new capabilities or reasoning patterns, or you need lower inference latency. For most enterprise knowledge management use cases, RAG is the right first choice because it's easier to update and debug than fine-tuning.

What are the main components of a RAG pipeline?

A production RAG pipeline has five main components: (1) document ingestion and chunking (splitting documents into retrievable segments), (2) embedding model (converting chunks to vector representations), (3) vector store (storing and indexing embeddings for similarity search), (4) retrieval logic (query understanding, hybrid search, re-ranking), and (5) generation (passing retrieved context + query to an LLM). Each component has significant quality impact and requires careful engineering.

How do I evaluate RAG pipeline quality?

Evaluate RAG pipelines on two dimensions: retrieval quality (does the system retrieve the right chunks for a given query?) and generation quality (given correct chunks, does the model answer accurately?). Frameworks like RAGAS provide automated metrics including faithfulness (is the answer grounded in retrieved context?), answer relevance, and context precision/recall. Building a diverse evaluation dataset covering edge cases, out-of-scope questions, and adversarial queries is essential before production deployment.
