Retrieval-augmented generation is one of the most useful patterns for building AI products on top of real documents. Instead of asking a model to answer from memory, you retrieve relevant source material and ask the model to answer from that context.
That sounds simple. The hard part is making the retrieved context good enough that the final answer is trustworthy.
This guide walks through a practical RAG pipeline you can build, test, and improve without pretending there is one magic chunk size or one perfect vector database.
What A RAG Pipeline Does
A RAG system has two paths: an indexing path and an answering path. A minimal code skeleton of both follows the two lists below.
The indexing path prepares your knowledge base:
- Load documents.
- Clean and normalize text.
- Split documents into chunks.
- Attach metadata.
- Create embeddings.
- Store chunks in a searchable index.
The answering path runs for each user query:
- Rewrite or normalize the query if needed.
- Retrieve candidate chunks.
- Filter or rerank those chunks.
- Build a prompt with the best context.
- Generate an answer.
- Return citations and confidence signals.
- Log the result for evaluation.
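Here is that skeleton. Every helper in it (`load_documents`, `clean`, `split`, `embed`, `store`, `search`, `permissions_for`, `rerank`, `build_prompt`, `generate`, `log`) is a hypothetical placeholder, not a specific library's API:

```python
# Skeleton of the two RAG paths. All helpers are hypothetical placeholders
# for whatever your document loaders, index, and model stack provide.

def index_documents(sources):
    for doc in load_documents(sources):
        text = clean(doc.text)                    # strip boilerplate, fix OCR order
        for chunk in split(text, doc.metadata):   # chunk at semantic boundaries
            store(chunk, embed(chunk.text), doc.metadata)

def answer(query, user):
    candidates = search(embed(query), filters=permissions_for(user), limit=30)
    best = rerank(query, candidates)[:5]          # keep only the strongest chunks
    result = generate(build_prompt(query, best))  # strict source-grounded prompt
    log(query, best, result)                      # raw material for evaluation
    return result, [c.source_id for c in best]    # answer plus citations
```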
RAG fails when teams focus only on the last step. Answer quality is usually determined earlier, by document quality, chunking, metadata, and retrieval.
Step 1: Define The Questions First
Before indexing everything, write 30 to 100 realistic questions users will ask. Include easy questions, vague questions, multi-hop questions, and questions the system should refuse because the answer is not in the source material.
For each question, record:
- The ideal answer.
- The source document that supports it.
- The source section or page.
- Whether the answer changes over time.
- Whether the question is high-risk.
This becomes your evaluation set. Without it, every pipeline change becomes a vibes-based debate.
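Concretely, a record can be as simple as a dictionary. The shape below is one possible layout with illustrative values, not a standard schema:

```python
# One evaluation record; field names and values are illustrative.
eval_set = [
    {
        "question": "How many days do customers have to request a refund?",
        "ideal_answer": "30 days from the delivery date.",
        "source_doc": "refund-policy.md",
        "source_section": "Refund Window",
        "time_sensitive": True,    # the answer changes if the policy changes
        "high_risk": False,
        "should_refuse": False,    # True for questions with no supporting source
    },
]
```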
Step 2: Prepare The Documents
Start with the documents users actually need: product docs, policies, internal SOPs, support articles, research PDFs, contracts, meeting notes, or code documentation.
Clean the text before embedding it. Remove navigation, duplicated headers, cookie banners, unrelated footers, boilerplate legal text, and broken OCR output. Keep tables only if you can preserve their structure. For PDFs, check whether the extracted text is in the right reading order.
Useful metadata includes:
- Source URL or file path.
- Title.
- Author or owner.
- Last updated date.
- Document type.
- Section heading.
- Page number.
- Access-control group.
- Version.
Metadata is not decoration. It powers filtering, citations, freshness checks, and access control.
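In practice, that means storing a plain mapping alongside each chunk. The field values below are illustrative:

```python
# Metadata stored with a single chunk; values are illustrative.
chunk_metadata = {
    "source": "docs/refund-policy.md",
    "title": "Refund Policy",
    "owner": "support-team",
    "last_updated": "2025-11-03",
    "doc_type": "policy",
    "section": "Refund Window",
    "page": None,                    # not a paginated document
    "access_group": "customer-support",
    "version": "v4",
}
```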
Step 3: Chunk By Meaning, Not Just Size
Chunking is where many RAG systems quietly break. If chunks are too small, the model receives fragments without enough context. If chunks are too large, retrieval returns noisy blocks that bury the answer.
Start with semantic boundaries: headings, sections, paragraphs, functions, policy clauses, FAQ entries, or table rows. Then apply size limits inside those boundaries.
A reasonable starting point is 500 to 1,000 tokens per chunk with overlap only where context genuinely spans boundaries. For code, chunk by function, class, or module. For legal and policy documents, preserve section titles and clause numbers. For product docs, include the nearest heading with each chunk.
Do not treat one chunk size as universal. Test it against your evaluation questions.
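As a rough sketch of heading-first chunking, assuming markdown-style headings and using a whitespace word count as a crude stand-in for real tokenization:

```python
import re

def chunk_markdown(text, max_words=700):
    """Split on markdown headings first, then enforce a size cap.
    Word count is a crude proxy for tokens; use your embedding
    model's tokenizer for production limits."""
    parts = re.split(r"(?m)^(#{1,6} .+)$", text)
    chunks, heading = [], ""
    for part in parts:
        if re.fullmatch(r"#{1,6} .+", part):
            heading = part.strip()               # remember the nearest heading
            continue
        words = part.split()
        for i in range(0, len(words), max_words):
            piece = " ".join(words[i:i + max_words])
            if piece:
                # prepend the heading so each chunk keeps its context
                chunks.append(f"{heading}\n{piece}" if heading else piece)
    return chunks
```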
Step 4: Choose Embeddings And Storage
Embeddings convert text into vectors that can be searched by semantic similarity. The embedding model should match your language, domain, privacy constraints, and budget.
Managed embedding APIs are often easiest to start with. Open-source embedding models are useful when you need local processing, lower marginal cost, or more control. Whatever you choose, keep the same embedding model for indexing and querying unless you rebuild the index.
Common storage choices include:
| Option | Good fit |
|---|---|
| Postgres with pgvector | Teams already using Postgres |
| Qdrant | Fast vector search with solid filtering |
| Weaviate | Rich schema and hybrid search options |
| Pinecone | Managed vector infrastructure |
| Chroma | Local prototypes and small internal tools |
| Elasticsearch/OpenSearch | Hybrid keyword plus vector search in search-heavy stacks |
The best database is the one your team can operate reliably. Retrieval quality comes more from the pipeline than the logo.
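As one concrete example, here is a minimal pgvector round trip. It assumes a running Postgres with the pgvector extension, the psycopg 3 driver, and a hypothetical `embed()` function returning a fixed-length list of floats; the pgvector-python package offers proper type adapters if you want to avoid the string formatting:

```python
import psycopg  # psycopg 3; assumes Postgres with the pgvector extension

def to_vec(v):
    # pgvector's text input format: '[0.1,0.2,...]'
    return "[" + ",".join(str(x) for x in v) + "]"

# embed() is a hypothetical helper wrapping your embedding model;
# it must return a list of floats of a fixed dimension.
DIM = 1536

with psycopg.connect("dbname=rag") as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS chunks ("
        f"id bigserial PRIMARY KEY, text text NOT NULL, "
        f"metadata jsonb, embedding vector({DIM}))"
    )
    text = "Refunds are available within 30 days of delivery."
    conn.execute(
        "INSERT INTO chunks (text, embedding) VALUES (%s, %s::vector)",
        (text, to_vec(embed(text))),
    )
    # <=> is pgvector's cosine-distance operator; smaller is closer.
    rows = conn.execute(
        "SELECT text FROM chunks ORDER BY embedding <=> %s::vector LIMIT 5",
        (to_vec(embed("refund window")),),
    ).fetchall()
```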
Step 5: Retrieve More Than One Way
Basic semantic search is a good starting point:
query -> embed query -> find nearest chunks -> send top chunks to model
But semantic search can miss exact terms, product names, error codes, legal clauses, and newly coined acronyms. For production systems, compare at least three retrieval strategies:
- Dense semantic retrieval.
- Keyword or BM25 retrieval.
- Hybrid retrieval that combines dense and keyword search.
Reranking is often worth testing. The first retrieval step can fetch 20 to 50 candidates quickly, then a reranker can reorder them and pass only the best few chunks into the model.
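Reciprocal rank fusion (RRF) is one common way to implement the hybrid option above: each chunk scores the sum of 1/(k + rank) across every ranked list it appears in. A minimal sketch, assuming the inputs are ranked lists of chunk IDs:

```python
def rrf_merge(result_lists, k=60):
    """Merge ranked lists of chunk IDs with reciprocal rank fusion.
    k=60 is the conventional default; higher k flattens rank differences."""
    scores = {}
    for results in result_lists:
        for rank, chunk_id in enumerate(results, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["c7", "c2", "c9", "c4"]      # from vector search
keyword = ["c2", "c5", "c7", "c1"]    # from BM25
merged = rrf_merge([dense, keyword])  # c2 and c7 rise to the top
print(merged[:3])
```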
Step 6: Build A Source-Grounded Prompt
The prompt should be strict about source use. A good RAG prompt tells the model:
- Answer only from the provided context.
- Say when the context is insufficient.
- Cite the source IDs used.
- Do not invent policy, pricing, medical, legal, or technical details.
- Keep the answer in the format the product needs.
Example prompt structure:
```
You are answering from retrieved source material.

Rules:
- Use only the context below.
- If the answer is not in the context, say what is missing.
- Cite sources using the source IDs.
- Do not guess.

Context:
[S1] ...
[S2] ...
[S3] ...

Question:
...

Answer:
```
This does not magically eliminate hallucinations, but it makes the expected behavior testable.
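Assembling that structure is simple string work. A minimal sketch, assuming each retrieved chunk arrives as a (source_id, text) pair:

```python
def build_prompt(question, chunks):
    """Build a source-grounded prompt; chunks is a list of (source_id, text)."""
    context = "\n".join(f"[{sid}] {text}" for sid, text in chunks)
    return (
        "You are answering from retrieved source material.\n"
        "Rules:\n"
        "- Use only the context below.\n"
        "- If the answer is not in the context, say what is missing.\n"
        "- Cite sources using the source IDs.\n"
        "- Do not guess.\n\n"
        f"Context:\n{context}\n\n"
        f"Question:\n{question}\n\nAnswer:"
    )
```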
Step 7: Return Citations Users Can Inspect
Do not just attach document titles. The citation should take users as close as possible to the supporting evidence: page, section, URL anchor, row ID, timestamp, or paragraph.
Also check that the cited source actually supports the sentence. Citation accuracy is its own evaluation category. A model can cite a relevant-looking source while making a claim the source does not say.
Step 8: Evaluate The Pipeline
Evaluate retrieval before generation; a hit-rate sketch follows this checklist. For each question in your test set, ask:
- Did the correct source appear in the top 3 results?
- Did the correct source appear in the top 10 results?
- Were irrelevant chunks crowding out better sources?
- Did metadata filtering remove anything important?
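With the evaluation set from Step 1, hit rate at k takes only a few lines. Here `retrieve()` is a hypothetical function returning ranked source identifiers:

```python
def hit_rate_at_k(eval_set, retrieve, k=10):
    """Fraction of questions whose known-good source appears in the top k.
    retrieve(question) is a hypothetical function returning ranked source IDs."""
    hits = 0
    for item in eval_set:
        results = retrieve(item["question"])[:k]
        if item["source_doc"] in results:
            hits += 1
    return hits / len(eval_set)

# Measure both cutoffs you care about:
# print(hit_rate_at_k(eval_set, retrieve, k=3), hit_rate_at_k(eval_set, retrieve, k=10))
```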
Then evaluate the final answer:
- Is the answer faithful to the retrieved context?
- Is it complete?
- Does it refuse when the source is missing?
- Are citations correct?
- Is the output format valid?
- Is latency acceptable?
Automated metrics can help, but human review is still important for high-stakes domains.
Step 9: Handle Freshness And Access Control
Production RAG needs more than a vector index.
Track document freshness. Re-index changed documents. Delete removed documents. Keep versions when users may ask about past policy.
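A content hash per document makes those first three tasks mechanical; a minimal sketch:

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a document's cleaned text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def docs_to_reindex(current_docs, stored_hashes):
    """current_docs: {doc_id: text}; stored_hashes: {doc_id: hash} from the last run."""
    changed = [d for d, t in current_docs.items()
               if stored_hashes.get(d) != content_hash(t)]
    removed = [d for d in stored_hashes if d not in current_docs]
    return changed, removed  # re-embed changed docs, delete removed ones
```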
Access control is especially important. If a user should not see a document, that document should not be retrieved for that user. Filtering after the model has already seen sensitive text is too late.
Step 10: Monitor Real Usage
Log enough information to debug failures; a minimal log-record sketch follows this list:
- User question.
- Retrieved chunk IDs.
- Retrieval scores.
- Final answer.
- Cited sources.
- Latency.
- Model and embedding versions.
- User feedback.
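A single JSON line per request covers all of these fields. The shape below is illustrative, not a standard schema:

```python
import json, time

def log_request(question, chunk_ids, scores, answer, sources, latency_ms, feedback=None):
    """Append one structured record per query; field names are illustrative."""
    record = {
        "ts": time.time(),
        "question": question,
        "retrieved_chunk_ids": chunk_ids,
        "retrieval_scores": scores,
        "answer": answer,
        "cited_sources": sources,
        "latency_ms": latency_ms,
        "model_version": "model-x",        # placeholder: record your real versions
        "embedding_version": "embed-y",    # placeholder
        "user_feedback": feedback,
    }
    with open("rag_requests.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
```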
Monitor for stale documents, repeated unanswered questions, bad citations, slow queries, and retrieval drift after content changes.
A Simple Architecture
For a first production-minded version, keep it boring:
- Store source documents in a clear content repository.
- Run an indexing job when documents change.
- Split documents using headings and token limits.
- Store chunks with metadata.
- Use hybrid retrieval.
- Rerank candidates if quality needs it.
- Generate answers with strict source rules.
- Show citations.
- Keep a small human-reviewed evaluation set.
That architecture will outperform a flashier system with poor documents and no evaluation.
Common Mistakes
The biggest mistake is indexing messy text. Bad extraction creates bad retrieval.
The second mistake is optimizing for a demo query. Real users ask vague, incomplete, misspelled, and multi-part questions.
The third mistake is skipping refusals. A good RAG system should say “I do not have enough source material” instead of guessing.
The fourth mistake is ignoring product permissions. Retrieval must respect the same access rules as the rest of your application.
Bottom Line
RAG is not just “vector database plus chatbot.” It is a source-grounded answering system. The quality comes from document preparation, chunking, retrieval, citations, evaluation, and monitoring.
Start simple. Measure retrieval. Fix the documents. Add hybrid search or reranking only when your evaluation set shows a real gap.
Frequently Asked Questions
What chunk size should I use?
Start around 500 to 1,000 tokens, then test. Use semantic boundaries first and token limits second. Code, legal documents, tables, and FAQs often need different chunking rules.
Is a larger context window a replacement for RAG?
Not usually. Large context helps, but RAG still gives you retrieval, permissions, freshness, citations, and lower context cost. For small document sets, long-context prompting may be enough. For living knowledge bases, RAG is usually cleaner.
Do I need a vector database?
Not always. A small prototype can use local search or Postgres. A production system with many documents, filters, and concurrent users usually benefits from dedicated vector or hybrid search infrastructure.
Should I use hybrid search?
Test it. Hybrid search often helps when users search for exact names, IDs, error messages, or domain-specific terms that pure semantic retrieval can miss.
How do I know my RAG system is good?
Use a fixed evaluation set. Measure retrieval hit rate, answer faithfulness, citation accuracy, refusal quality, latency, and user feedback.