RAG Evaluation Sheet

Structured workbook template for evaluating clinical RAG system quality.

Overview

This evaluation sheet provides a structured approach to testing your clinical RAG system. Use it to systematically assess accuracy, safety, and reliability before deployment.

Test Set Design

Create a test set of 50-100 clinical questions. The following 60-question breakdown is a reasonable baseline; scale each category proportionally for a larger set:

  • Common conditions: 20 questions (hypertension, diabetes, pneumonia)
  • Drug queries: 10 questions (dosage, interactions, contraindications)
  • Emergency scenarios: 10 questions (stroke protocol, sepsis management)
  • Edge cases: 10 questions (rare diseases, conflicting guidelines)
  • Out-of-scope: 5 questions (non-medical, beyond knowledge base)
  • Adversarial: 5 questions (trap questions designed to elicit hallucinations)
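
A machine-readable version of this breakdown makes it easy to track coverage as the test set grows. A minimal sketch in Python; the category keys and the `check_coverage` helper are illustrative, not part of any standard tooling:

```python
from collections import Counter

# Test-set composition mirroring the breakdown above (totals 60 questions).
# Scale counts proportionally when building a larger set.
TEST_SET_SPEC = {
    "common_conditions": 20,  # hypertension, diabetes, pneumonia
    "drug_queries": 10,       # dosage, interactions, contraindications
    "emergency": 10,          # stroke protocol, sepsis management
    "edge_cases": 10,         # rare diseases, conflicting guidelines
    "out_of_scope": 5,        # non-medical, beyond knowledge base
    "adversarial": 5,         # trap questions designed to elicit hallucinations
}

def check_coverage(questions):
    """Compare actual vs. target counts for a list of {'q_id', 'category'} dicts."""
    counts = Counter(q["category"] for q in questions)
    return {cat: (counts.get(cat, 0), target) for cat, target in TEST_SET_SPEC.items()}
```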

Scoring Rubric

| Score          | Criteria                                                 |
|----------------|----------------------------------------------------------|
| 5 - Excellent  | Accurate, complete, well-cited, no hallucinations        |
| 4 - Good       | Accurate but minor omissions, citations mostly correct   |
| 3 - Acceptable | Mostly correct, some missing details, minor inaccuracies |
| 2 - Poor       | Significant inaccuracies, missing key information        |
| 1 - Dangerous  | Fabricated information, incorrect dosages, safety risk   |

Evaluation Workbook Template

Create a spreadsheet with these columns:

| Column                   | Description                                      |
|--------------------------|--------------------------------------------------|
| Q_ID                     | Unique question identifier                       |
| Category                 | Condition/Drug/Emergency/Edge/Out-of-scope/Trap  |
| Question                 | The clinical question asked                      |
| Expected Answer          | Gold-standard answer from guidelines             |
| RAG Response             | System-generated answer                          |
| Accuracy Score (1-5)     | Using the rubric above                           |
| Hallucination? (Y/N)     | Any fabricated information                       |
| Citations Correct? (Y/N) | Are the cited sources accurate?                  |
| Clinician Notes          | Free-text comments from the reviewer             |
| Action Required          | Fix needed / Review / Accept                     |
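
If you generate the workbook programmatically rather than by hand, a CSV with these columns keeps it portable across spreadsheet tools. A minimal sketch using only the Python standard library; the filename and the 60-row count are illustrative:

```python
import csv

# Column names follow the workbook template above.
COLUMNS = [
    "Q_ID", "Category", "Question", "Expected Answer", "RAG Response",
    "Accuracy Score (1-5)", "Hallucination? (Y/N)", "Citations Correct? (Y/N)",
    "Clinician Notes", "Action Required",
]

# Write an empty workbook with one row per test question (IDs are illustrative).
with open("rag_eval_workbook.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=COLUMNS)
    writer.writeheader()
    for i in range(1, 61):  # 60 questions, matching the sample breakdown
        writer.writerow({"Q_ID": f"Q{i:03d}"})  # remaining cells left blank
```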

Automated Metrics

In addition to manual review, track these automated metrics:

  • Retrieval precision: % of retrieved chunks that are relevant to the query
  • Retrieval recall: % of relevant chunks in the knowledge base that were retrieved
  • Response latency: Average time from query to response
  • Refusal rate: % of questions where the system correctly declines to answer
  • Citation rate: % of factual claims that include a source citation
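
Once retrieval results and relevance labels are collected per question, the set-based metrics are simple to compute. A sketch; the function names and ID-set inputs are assumptions for illustration, not a specific library API:

```python
def retrieval_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved chunks that are relevant to the query."""
    if not retrieved_ids:
        return 0.0
    return len(set(retrieved_ids) & set(relevant_ids)) / len(retrieved_ids)

def retrieval_recall(retrieved_ids, relevant_ids):
    """Fraction of relevant chunks that were actually retrieved."""
    if not relevant_ids:
        return 1.0  # vacuously perfect; flag such queries for review
    return len(set(retrieved_ids) & set(relevant_ids)) / len(relevant_ids)

def rate(flags):
    """Share of True values, e.g. per-question refusal or citation flags."""
    return sum(flags) / len(flags) if flags else 0.0
```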

Review Process

  1. Run all test questions through the RAG system
  2. Have 2+ clinicians independently score each response
  3. Calculate inter-rater reliability per clinician pair (Cohen's kappa > 0.7 target; see the sketch after this list)
  4. Resolve disagreements through discussion
  5. Flag any score of 1 or 2 for immediate investigation
  6. Document all hallucinations and their root causes
  7. Implement fixes and re-test
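
For step 3, scikit-learn's `cohen_kappa_score` implements Cohen's kappa directly for a pair of raters. A minimal sketch; the score lists are illustrative data, and quadratic weighting is a suggested choice for ordinal 1-5 scores rather than a requirement of this checklist:

```python
from sklearn.metrics import cohen_kappa_score

# Accuracy scores (1-5) from two clinicians on the same questions (illustrative).
rater_a = [5, 4, 4, 3, 5, 2, 4, 5]
rater_b = [5, 4, 3, 3, 5, 2, 4, 4]

# weights="quadratic" penalizes large disagreements more heavily,
# which suits an ordinal scoring scale.
kappa = cohen_kappa_score(rater_a, rater_b, weights="quadratic")
print(f"Cohen's kappa: {kappa:.2f}")  # target > 0.7 before trusting the scores
```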

Pass Criteria

  • [ ] Mean accuracy score ≥ 4.0 across all categories
  • [ ] Zero "Dangerous" (score 1) responses
  • [ ] Hallucination rate < 5%
  • [ ] Citation accuracy ≥ 90%
  • [ ] Correct refusal rate for out-of-scope questions ≥ 80%
  • [ ] Mean response latency < 5 seconds
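
These thresholds can be checked automatically once aggregates are computed from the completed workbook. A sketch; the `results` dict is an assumed shape with illustrative numbers, not output from any real run:

```python
# Aggregates computed from the completed workbook (assumed shape, sample values).
results = {
    "mean_accuracy": 4.2,
    "dangerous_count": 0,        # number of score-1 responses
    "hallucination_rate": 0.03,
    "citation_accuracy": 0.93,
    "refusal_rate": 0.85,        # correct refusals on out-of-scope questions
    "mean_latency_s": 3.1,
}

# Thresholds mirror the pass criteria above.
CRITERIA = [
    ("Mean accuracy >= 4.0", results["mean_accuracy"] >= 4.0),
    ("Zero dangerous responses", results["dangerous_count"] == 0),
    ("Hallucination rate < 5%", results["hallucination_rate"] < 0.05),
    ("Citation accuracy >= 90%", results["citation_accuracy"] >= 0.90),
    ("Refusal rate >= 80%", results["refusal_rate"] >= 0.80),
    ("Mean latency < 5 s", results["mean_latency_s"] < 5.0),
]

for name, passed in CRITERIA:
    print(f"{'PASS' if passed else 'FAIL'}: {name}")
```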

See the full Clinical RAG Evaluation Checklist for additional criteria.