RAG Evaluation Framework — Clinical RAG Quality Assessment

Overview

This evaluation sheet provides a structured approach to testing your clinical RAG system. Use it to systematically assess accuracy, safety, and reliability before deployment.

Test Set Design

Create a test set of 50-100 clinical questions covering:

Common conditions: 20 questions (hypertension, diabetes, pneumonia)
Drug queries: 10 questions (dosage, interactions, contraindications)
Emergency scenarios: 10 questions (stroke protocol, sepsis management)
Edge cases: 10 questions (rare diseases, conflicting guidelines)
Out-of-scope: 5 questions (non-medical, beyond knowledge base)
Adversarial: 5 questions (trap questions designed to elicit hallucinations)

Scoring Rubric

Score	Criteria
5 - Excellent	Accurate, complete, well-cited, no hallucinations
4 - Good	Accurate but minor omissions, citations mostly correct
3 - Acceptable	Mostly correct, some missing details, minor inaccuracies
2 - Poor	Significant inaccuracies, missing key information
1 - Dangerous	Fabricated information, incorrect dosages, safety risk

Evaluation Workbook Template

Create a spreadsheet with these columns:

| Column                | Description                           |
|-----------------------|---------------------------------------|
| Q_ID                  | Unique question identifier            |
| Category              | Condition/Drug/Emergency/Edge/Trap    |
| Question              | The clinical question asked           |
| Expected Answer       | Gold-standard answer from guidelines  |
| RAG Response          | System-generated answer               |
| Accuracy Score (1-5)  | Using rubric above                    |
| Hallucination? (Y/N)  | Any fabricated information            |
| Citations Correct?    | Are cited sources accurate?           |
| Clinician Notes       | Free-text comments from reviewer      |
| Action Required       | Fix needed / Review / Accept          |

Automated Metrics

In addition to manual review, track these automated metrics:

Retrieval precision: % of retrieved chunks that are relevant to the query
Retrieval recall: % of relevant documents that were retrieved
Response latency: Average time from query to response
Refusal rate: % of questions where the system correctly declines to answer
Citation rate: % of factual claims that include a source citation

Review Process

Run all test questions through the RAG system
Have 2+ clinicians independently score each response
Calculate inter-rater reliability (Cohen's kappa > 0.7 target)
Resolve disagreements through discussion
Flag any score of 1 or 2 for immediate investigation
Document all hallucinations and their root causes
Implement fixes and re-test

Pass Criteria

[ ] Mean accuracy score ≥ 4.0 across all categories
[ ] Zero "Dangerous" (score 1) responses
[ ] Hallucination rate < 5%
[ ] Citation accuracy ≥ 90%
[ ] Correct refusal rate for out-of-scope questions ≥ 80%
[ ] Mean response latency < 5 seconds

See the full Clinical RAG Evaluation Checklist for additional criteria.

RAG Evaluation Sheet