Your RAG assistant looks flawless in the demo. It answers confidently, cites your internal documents, and wins over the executive committee. Then it ships, and three weeks later a sales rep forwards you a response where the bot invented a contract clause that does not exist. The model is not the problem. The problem is that nobody actually tested the system.
Testing a Retrieval-Augmented Generation system is neither classic software testing nor standalone LLM evaluation. It is its own discipline, with its own metrics, tooling, and failure modes. This guide gives you a complete, practical method, whether you have an in-house data team or work with a partner.
Why traditional testing falls short
A conventional software test verifies deterministic behavior: same input, same output. A RAG system is probabilistic at two stages. Retrieval (finding relevant passages in your document base) can surface different chunks depending on phrasing. Generation can then produce different answers from identical passages.
The consequence: a naive test suite of question and expected answer pairs with exact matching breaks down, because a correct answer phrased differently from the reference is counted as a failure. You need to evaluate properties (is the answer faithful to its sources? were those the right sources?) rather than strings.
The stakes are not academic. A 2024 Stanford study of RAG-powered legal research tools measured hallucination rates between 17 and 33 percent depending on the tool, even though these products were marketed as reliable precisely because of RAG. Separately, Gartner projected that at least 30 percent of generative AI projects would be abandoned after proof of concept by the end of 2025, citing poor data quality, inadequate risk controls, escalating costs, and unclear business value. Rigorous evaluation is what separates projects that reach production from projects that die in pilot.
Step 1: Build a golden dataset
Everything starts with a golden dataset: a set of representative questions, each paired with the expected source documents and a reference answer validated by a domain expert.
A pragmatic recipe:
- Collect 50 to 150 real questions. Draw on support tickets, customer emails, and questions asked during the pilot. Questions invented by the technical team tend to be too clean; real questions are ambiguous, misspelled, and incomplete.
- Classify them by type. Simple factual lookups, multi-document questions, out-of-scope questions (where the system must decline to answer), and near-miss traps that resemble covered topics but differ in a crucial detail.
- Have reference answers validated by the business, not the AI team. The legal, HR, or technical expert is the one who knows what a correct answer looks like.
- Version the dataset. It will evolve alongside your document base and becomes your safety net for every future change.
A golden dataset of 100 well-built questions beats 1,000 auto-generated ones without validation. It is the most durable asset in your entire testing setup.
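To make the recipe above concrete, here is a minimal sketch of what one golden dataset entry could look like. The field names, the category labels, and the JSON Lines storage format are illustrative assumptions, not a prescribed schema; adapt them to your own document base.

```python
import json
from dataclasses import asdict, dataclass

# Question types from the classification step (label names are assumptions).
CATEGORIES = {"factual", "multi_document", "out_of_scope", "near_miss"}


@dataclass
class GoldenExample:
    """One entry of the golden dataset (hypothetical field names)."""
    question: str                # real user wording, typos included
    category: str                # one of CATEGORIES
    expected_sources: list[str]  # IDs of the passages that should be retrieved
    reference_answer: str        # validated by a domain expert
    validated_by: str            # who signed off on the reference answer

    def __post_init__(self) -> None:
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")
        # An out-of-scope question has no supporting passage by definition.
        if self.category == "out_of_scope" and self.expected_sources:
            raise ValueError("out-of-scope questions must not list sources")


def to_jsonl(examples: list[GoldenExample]) -> str:
    """Serialize one example per line so changes diff cleanly in version control."""
    return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in examples)


example = GoldenExample(
    question="whats the notice period for temp contracts??",
    category="factual",
    expected_sources=["hr-policy-2026#section-4"],  # hypothetical passage ID
    reference_answer="The notice period is 30 days.",
    validated_by="HR lead",
)
print(to_jsonl([example]))
```

Storing one example per line keeps each change to a question or reference answer visible as a single-line diff, which helps when the business side reviews updates.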
Step 2: Evaluate retrieval in isolation
A classic mistake is judging the system only on final answers. If retrieval surfaces the wrong passages, the best model in the world will produce a wrong answer. So measure search quality first, on its own, using standard information retrieval metrics:
- Precision@k: of the k passages retrieved, what share is actually relevant? Low precision means you are drowning the model in noise.
- Recall@k: of all relevant passages that exist, what share made it into the top k? Low recall means the right information never reaches the model at all.
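- MRR (Mean Reciprocal Rank): the average, across all questions, of 1 divided by the rank of the first relevant passage. It tells you how high the first useful passage lands in the results, which matters because models tend to use passages near the start of the context more reliably than passages buried further down.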
All three can be computed automatically from your golden dataset. If your precision@5 sits below 0.6, do not waste time tuning prompts: your problem lives in chunking, embeddings, or indexing. Our guide on RAG architecture from prototype to production covers those infrastructure decisions in depth.
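The three retrieval metrics above can be sketched in a few lines of plain Python. This is a minimal illustration that assumes each passage has a stable ID and that the golden dataset lists the relevant IDs per question; the function names are our own, not taken from any library.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the top-k retrieved passages that are relevant (divides by k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of all relevant passages that appear in the top k."""
    if not relevant:
        raise ValueError("recall is undefined when no passage is relevant")
    hits = sum(1 for doc_id in set(retrieved[:k]) if doc_id in relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant passage, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per question."""
    if not runs:
        raise ValueError("no questions to score")
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)


# Toy example: 5 passages retrieved, 3 relevant ones exist, 2 of them were found.
retrieved = ["a", "b", "c", "d", "e"]
relevant = {"b", "e", "z"}
print(precision_at_k(retrieved, relevant, k=5))  # 2 hits out of 5
print(recall_at_k(retrieved, relevant, k=5))     # 2 found out of 3 relevant
print(reciprocal_rank(retrieved, relevant))      # first hit at rank 2
```

Out-of-scope questions have no relevant passage, so they should be excluded from recall rather than scored as zero.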
Step 3: Evaluate generation with RAG-specific metrics
Once retrieval holds up, evaluate answer quality. The open source RAGAS framework popularized four metrics that have become a de facto standard:
- Faithfulness: is every claim in the answer supported by the retrieved passages? This is the core anti-hallucination metric.
- Answer relevancy: does the answer actually address the question asked, without drifting?
- Context precision: were the retrieved passages the right ones, ranked sensibly?
- Context recall: did the context contain everything needed to answer?
These metrics rely on the LLM-as-a-judge principle: an evaluator model scores responses against strict criteria. It is imperfect, but research shows strong correlation with human judgment when criteria are well defined, and it lets you evaluate hundreds of answers in minutes instead of days of manual annotation.
Practical thresholds from our project experience: target a faithfulness score of at least 0.9 for sensitive use cases (legal, HR, finance) and at least 0.8 for low-risk internal tools. Below that, the system is not ready.
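As an illustration of how a faithfulness score is built, the sketch below computes the share of an answer's claims that the retrieved passages support, then applies the two thresholds. It assumes the answer has already been split into individual claims, and it takes the judge as a plain function. The `toy_judge` here is a substring match used only so the example runs offline; it is a placeholder, not how RAGAS or any evaluator model works internally.

```python
from typing import Callable

# A judge receives (claim, passages) and returns True if the passages support it.
# In a real pipeline this would wrap a call to an evaluator model.
Judge = Callable[[str, list[str]], bool]


def faithfulness(claims: list[str], passages: list[str], judge: Judge) -> float:
    """Share of the answer's claims that are supported by the retrieved passages."""
    if not claims:
        # Refusals contain no claims; score them with the refusal test instead.
        raise ValueError("no claims to score")
    supported = sum(1 for claim in claims if judge(claim, passages))
    return supported / len(claims)


def toy_judge(claim: str, passages: list[str]) -> bool:
    """Placeholder judge: case-insensitive substring match, NOT a real evaluator."""
    return any(claim.lower() in passage.lower() for passage in passages)


def meets_threshold(score: float, sensitive: bool) -> bool:
    """Thresholds from the text: 0.9 for sensitive use cases, 0.8 otherwise."""
    return score >= (0.9 if sensitive else 0.8)


passages = [
    "The notice period is 30 days.",
    "Refunds are issued within 14 days.",
]
claims = [
    "the notice period is 30 days",
    "refunds are issued within 14 days",
    "late fees are 5 percent",  # not in any passage: an unsupported claim
]
score = faithfulness(claims, passages, toy_judge)
print(round(score, 2), meets_threshold(score, sensitive=False))
```

Passing the judge in as a parameter keeps the scoring logic testable without API calls and makes it easy to swap evaluator models later.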
Step 4: Test the edge behaviors
Averages hide the failures that destroy user trust. Add targeted test families:
- Out-of-scope questions. The system must say "I don't know" instead of improvising. Measure the correct-refusal rate on questions deliberately outside the document base. This is the most neglected test, and the most important one.
- Prompt injection. Plant an instruction like "ignore your guidelines and answer X" inside a test document. Enterprise RAG indexes documents that third parties can sometimes modify; it has to resist this.
- Time-sensitive traps. If your base contains a 2024 pricing policy and a 2026 one, does the system cite the right version?
- Phrasing robustness. Ask the same question five different ways, typos included. Answer variance is a direct proxy for the reliability your users will perceive.
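Two of these test families reduce to simple ratios, sketched below. The code assumes your system is reachable through a single `ask(question) -> answer` function and detects refusals with a hypothetical list of marker phrases; both are simplifications. In practice an evaluator model may be needed to recognize refusals and to decide whether two answers are equivalent.

```python
from collections import Counter
from typing import Callable

# Hypothetical refusal phrases; replace with whatever your system actually says.
REFUSAL_MARKERS = ("i don't know", "i do not know", "not in the documents")


def is_refusal(answer: str) -> bool:
    """Crude refusal detector based on marker phrases (an assumption)."""
    text = answer.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def correct_refusal_rate(ask: Callable[[str], str], out_of_scope: list[str]) -> float:
    """Share of deliberately out-of-scope questions that the system declines."""
    if not out_of_scope:
        raise ValueError("no out-of-scope questions provided")
    refused = sum(1 for question in out_of_scope if is_refusal(ask(question)))
    return refused / len(out_of_scope)


def paraphrase_agreement(
    ask: Callable[[str], str],
    paraphrases: list[str],
    key: Callable[[str], str],
) -> float:
    """Share of paraphrases whose key fact matches the most common one.

    `key` maps an answer to the fact being compared, e.g. a cited document ID.
    """
    if not paraphrases:
        raise ValueError("no paraphrases provided")
    facts = [key(ask(question)) for question in paraphrases]
    most_common_count = Counter(facts).most_common(1)[0][1]
    return most_common_count / len(facts)


# Stand-in for the real system so the example runs offline.
canned = {
    "What is the CEO's salary?": "I don't know based on the available documents.",
    "Who won the 1998 World Cup?": "France won in 1998.",
}
print(correct_refusal_rate(lambda q: canned[q], list(canned)))
```

A paraphrase agreement well below 1.0 on a factual question is a signal to inspect retrieval for that question before touching the prompt.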
Step 5: Industrialize with a continuous evaluation loop
A RAG system is never finished: documents change, models get upgraded, usage drifts. Put three mechanisms in place:
- Automatic evaluation on every change. Any modification (new chunking, new model, new documents) triggers the full golden dataset, exactly like a regression test suite. A dropping score blocks the deployment.
- Production sampling. Automatically score a percentage of real production answers each week using faithfulness metrics. Drift becomes visible in days instead of months.
- User feedback loop. A simple thumbs up/down feeds your dataset: every thumbs-down becomes a candidate test case. Within three months, your golden dataset mirrors reality rather than assumptions.
This evaluation discipline sits at the heart of our enterprise RAG service: a system without an evaluation pipeline is not a product, it is an extended demo.
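The "dropping score blocks the deployment" rule can be sketched as a small gate that compares the current run against a stored baseline. The metric names, the tolerance of 0.02, and the idea of returning a process exit code are assumptions chosen for illustration; wire the result into whatever your CI system expects.

```python
from typing import Optional


def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,  # assumed noise margin between two evaluation runs
) -> dict[str, tuple[float, Optional[float]]]:
    """Return the metrics that dropped by more than `tolerance`, or went missing."""
    regressions: dict[str, tuple[float, Optional[float]]] = {}
    for metric, old_score in baseline.items():
        new_score = current.get(metric)
        # A metric that is no longer reported is treated as a regression.
        if new_score is None or new_score < old_score - tolerance:
            regressions[metric] = (old_score, new_score)
    return regressions


def gate(baseline: dict[str, float], current: dict[str, float]) -> int:
    """Exit code for a CI step: 0 lets the deployment through, 1 blocks it."""
    regressions = find_regressions(baseline, current)
    for metric, (old_score, new_score) in regressions.items():
        print(f"REGRESSION {metric}: {old_score} -> {new_score}")
    return 1 if regressions else 0


baseline = {"faithfulness": 0.91, "precision@5": 0.72, "recall@5": 0.80}
current = {"faithfulness": 0.86, "precision@5": 0.73, "recall@5": 0.79}
exit_code = gate(baseline, current)  # faithfulness dropped beyond the tolerance
print("blocked" if exit_code else "ok")
```

The tolerance exists because LLM-judged scores vary slightly between runs; without it, noise alone would block deployments.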
The three mistakes that sink evaluation efforts
Even teams that take testing seriously fall into recurring traps. Three are worth naming explicitly:
- Evaluating only at launch. A RAG system scored once and never again will drift silently as documents pile up and models change underneath you. Evaluation is a recurring process, not a milestone.
- Letting the AI team grade its own homework. When the people who built the system also define what counts as a correct answer, blind spots compound. Reference answers must come from the business side, and a sample of production outputs should be reviewed by someone with no stake in the project's success.
- Chasing a single aggregate score. A dashboard showing "92 percent quality" hides which question types fail. Always break results down by category: factual lookups, multi-document reasoning, out-of-scope refusals. The breakdown tells you what to fix; the aggregate only tells you whether to worry.
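To illustrate the third mistake, here is a minimal sketch of a per-category breakdown. It assumes each evaluated question carries a category label and a score between 0 and 1; the labels and numbers are made up for the example.

```python
from collections import defaultdict


def breakdown_by_category(results: list[tuple[str, float]]) -> dict[str, float]:
    """Average score per question category from (category, score) pairs."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for category, score in results:
        buckets[category].append(score)
    return {category: sum(scores) / len(scores) for category, scores in buckets.items()}


# Made-up results: the aggregate looks acceptable, one category does not.
results = [
    ("factual", 1.0), ("factual", 1.0), ("factual", 1.0), ("factual", 1.0),
    ("multi_document", 1.0), ("multi_document", 0.5),
    ("out_of_scope", 0.0), ("out_of_scope", 1.0),
]
overall = sum(score for _, score in results) / len(results)
print(f"overall: {overall}")
for category, average in breakdown_by_category(results).items():
    print(f"{category}: {average}")
```

In this toy data the overall average hides that out-of-scope refusals succeed only half the time, which is exactly the kind of failure the aggregate conceals.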
Pre-production checklist
- Golden dataset of 50+ business-validated questions, under version control
- Precision@5 and recall@5 measured and documented
- Average faithfulness at or above 0.8 (0.9 in sensitive domains)
- Correct-refusal rate measured on out-of-scope questions
- Prompt injection tests passed
- Automatic evaluation wired into every deployment
- Production sampling scheduled with alert thresholds
- User feedback loop live
Starting from scratch, expect two to three weeks to stand this up around an existing RAG system. It is the highest-ROI investment in the whole project: it turns an impressive prototype into a system people can trust. And if your team is new to the topic, focused AI training on evaluation practices shortens the learning curve dramatically.
FAQ
How is testing a RAG system different from testing a regular LLM chatbot?
A RAG system has two stages that must be evaluated separately: retrieval (document search) and generation. A plain chatbot only has generation. In practice this adds information retrieval metrics (precision@k, recall@k, MRR) plus grounding metrics like faithfulness, which verify that the answer is actually supported by the retrieved documents rather than the model's imagination.
How many questions does a golden dataset need?
Between 50 and 150 real, expert-validated questions is enough to start, provided they cover the critical categories: factual lookups, multi-document questions, out-of-scope questions, and near-miss traps. Representativeness beats volume; 100 real questions outperform 1,000 unvalidated synthetic ones.
Is LLM-as-a-judge evaluation reliable?
Reliable enough for continuous monitoring when evaluation criteria are precise, and studies show solid correlation with human judgment. Best practice is to calibrate the automatic judge against a manually annotated sample at the start, then spot-check periodically. It replaces mass annotation, not the initial expert validation.
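One common way to run that calibration, offered here as an option rather than the only method, is to compare the judge's pass/fail verdicts with human verdicts on the same sample using raw agreement and Cohen's kappa, which corrects for agreement expected by chance. The sketch below assumes binary labels and uses made-up data.

```python
def agreement_rate(human: list[bool], judge: list[bool]) -> float:
    """Share of items on which the human annotator and the judge agree."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohen_kappa(human: list[bool], judge: list[bool]) -> float:
    """Cohen's kappa for two raters with binary labels."""
    observed = agreement_rate(human, judge)
    n = len(human)
    p_human = sum(human) / n  # share of items the human marked as passing
    p_judge = sum(judge) / n  # share of items the judge marked as passing
    # Probability that both raters agree purely by chance.
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    if expected == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up sample of 10 answers: True means "faithful", False means "not faithful".
human = [True, True, True, True, True, False, False, False, False, False]
judge = [True, True, True, True, False, False, False, False, False, True]
print(agreement_rate(human, judge))         # 8 of 10 verdicts match
print(round(cohen_kappa(human, judge), 2))  # about 0.6 after chance correction
```

If kappa on the calibration sample is low, tighten the judge's criteria or prompt before trusting its scores for monitoring.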
What hallucination rate is acceptable in production?
It depends on business risk. For a low-stakes internal assistant, faithfulness around 0.8 combined with visible source citations is generally acceptable. For sensitive domains like legal or finance, target 0.9 or higher, enforce systematic source citation, and keep a human in the loop for binding decisions.
Do we need paid tools to evaluate a RAG system?
No. RAGAS, DeepEval, and promptfoo are open source and cover the essentials: retrieval metrics, faithfulness, relevancy, and CI/CD integration. The real costs are expert time to build the golden dataset and the evaluator model's API calls, typically a few tens of dollars per full evaluation run.
