What I Learned Building a Medical PDF RAG Workflow

Author: ClinRAG Editorial Team
Last updated: May 15, 2026
Series: Implementation Notes

Real-world lessons from building a clinical RAG system that ingests medical PDFs — what went wrong, what worked, and what I would do differently. This is not a tutorial. It's a field report.

The Setup

We set out to build a RAG system that could answer clinical questions using a knowledge base of medical guidelines — AHA, NICE, IDSA, and institutional protocols. The source material was almost entirely PDF: hundreds of documents, ranging from well-formatted prescribing information to scanned copies of older hospital policies.

The first version took about three weeks. It worked in the sense that it returned answers to most queries. It failed in the sense that many of those answers were wrong in subtle, dangerous ways. Here is what we learned.

Lesson 1: The PDF Parser Determines Everything

This was the single biggest lesson. No matter how good your embedding model is, no matter how sophisticated your retrieval strategy, if the PDF parser loses critical information during ingestion, your RAG system is built on sand.

Our first attempt used a standard PyPDF2-based loader. It extracted text fine for simple documents. But for clinical guidelines — which often have multi-column layouts, tables of drug dosages, sidebars with safety warnings, and footnotes with important qualifiers — the output was frequently garbled. Table rows were concatenated into single paragraphs. Column boundaries disappeared. Footnotes were either duplicated or lost entirely.
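
For context, a minimal sketch of the kind of naive loader we started with (the file name is illustrative). Text comes back as one flat string per page, with no notion of columns, table cells, or footnotes:

```python
# Naive text extraction with PyPDF2: one flat string per page, so
# multi-column layouts and table cells get interleaved in reading order.
from PyPDF2 import PdfReader

reader = PdfReader("aha_hypertension_guideline.pdf")  # illustrative path
pages = [page.extract_text() or "" for page in reader.pages]
full_text = "\n".join(pages)

# For a two-column dosing table, full_text typically contains the cells
# concatenated into a single paragraph with no row or column boundaries.
print(full_text[:500])
```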

What we did: Switched to RAGFlow for document parsing. Its layout analysis engine preserved the structure of complex medical PDFs significantly better. Tables stayed as tables. Multi-column layouts were handled correctly. Footnotes remained attached to their reference points.

What I would do differently: Start with the parser. Don't build the rest of the pipeline until you can verify that your ingestion process correctly handles the full range of your source documents. This is the foundation — everything else is secondary.
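
One way to make that verification concrete is a small spot-check harness: for a handful of representative documents, assert that strings you know must survive extraction (a table cell, a footnote qualifier, a dosing abbreviation) actually appear in the parsed output. A minimal sketch, with hypothetical file names and expected snippets:

```python
# Spot-check that ingestion preserves known-critical strings.
# File names and expected snippets below are illustrative.
EXPECTED = {
    "amoxicillin_monograph.pdf": ["500 mg q8h", "CrCl < 30 mL/min"],
    "sepsis_bundle_2024.pdf": ["within 1 hour", "30 mL/kg"],
}

def check_ingestion(parse_fn) -> list[str]:
    """parse_fn(path) -> extracted text; returns a list of failures."""
    failures = []
    for path, snippets in EXPECTED.items():
        text = parse_fn(path)
        for snippet in snippets:
            if snippet not in text:
                failures.append(f"{path}: missing {snippet!r}")
    return failures
```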

Lesson 2: Chunking Strategy Should Match Document Structure

Our initial approach used fixed-size chunks of 500 tokens with 10% overlap. This worked okay for narrative text but destroyed the coherence of guideline recommendations, which are often organized as hierarchical lists with conditions, subconditions, and evidence grades.

What we did: Switched to section-based chunking where possible. We configured the chunker to split at heading boundaries and keep related recommendation blocks together. This meant uneven chunk sizes — some chunks were 200 tokens, others 1,200 — but retrieval quality improved dramatically because the LLM received coherent recommendation blocks rather than fragmented snippets.
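
As a rough sketch of what "split at heading boundaries" meant in practice (the heading pattern below is an assumption and needs tuning per guideline family):

```python
import re

# Treat numbered heading lines (e.g. "3.2 Blood Pressure Targets") as
# chunk boundaries; this pattern is illustrative, not universal.
HEADING = re.compile(r"^\s*\d+(\.\d+)*\s+[A-Z].*$", re.MULTILINE)

def chunk_by_section(text: str) -> list[str]:
    starts = sorted({0, len(text), *(m.start() for m in HEADING.finditer(text))})
    sections = [text[a:b].strip() for a, b in zip(starts, starts[1:])]
    return [s for s in sections if s]
```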

Lesson 3: Drug Dosage Tables Are a Special Case

Drug interaction tables and dosing schedules present a unique challenge. They are often dense, multi-column, and contain abbreviations that are critical for correct interpretation (e.g., "q6h" vs "q12h"). Standard chunking splits these tables across chunks, losing the row-column relationships.

What we did: Extracted tables as separate chunks with preserved structure. We used RAGFlow's table extraction to convert each table to a structured format, then stored it as a separate document with metadata linking it to the parent guideline. This allowed the retrieval system to return complete tables rather than fragmented rows.
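
We won't reproduce the RAGFlow configuration here, but the shape of what we stored is simple to sketch. Each extracted table became its own retrievable unit, serialized with its structure intact and linked back to the parent guideline. The content and field values below are illustrative:

```python
# Sketch of a table stored as its own chunk, with metadata linking it
# to the parent guideline. All values are illustrative.
table_chunk = {
    "id": "nice-htn-2024-table-3",
    "parent_doc": "nice_hypertension_2024.pdf",
    "chunk_type": "table",
    "content": (
        "| Drug | Standard dose | Renal adjustment |\n"
        "|------|---------------|------------------|\n"
        "| Amoxicillin | 500 mg q8h | q12h if CrCl < 30 mL/min |\n"
    ),
    "metadata": {"section": "Dosing", "publication_date": "2024-03-01"},
}
```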

Lesson 4: Metadata Is as Important as the Text

Our first knowledge base had no metadata beyond the document name. This meant that when the system retrieved a guideline, it couldn't distinguish between the 2019 version and the 2024 update. In clinical practice, this distinction can be the difference between an obsolete and a current recommendation.

What we did: Added structured metadata to every chunk: document source, publication date, version number, medical specialty, evidence level, and guideline type. This enabled specialty-specific filtering and version-aware retrieval, which significantly improved answer accuracy.
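
A minimal sketch of the per-chunk metadata schema (the field names are ours, not a requirement of any particular tool), along with the kind of version-aware filter it enables:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ChunkMetadata:
    source: str             # e.g. "NICE NG136"
    publication_date: date
    version: str            # e.g. "2024 update"
    specialty: str          # e.g. "cardiology"
    evidence_level: str     # e.g. "Grade A"
    guideline_type: str     # e.g. "treatment recommendation"

def latest_only(chunks: list[tuple[str, ChunkMetadata]]) -> list[tuple[str, ChunkMetadata]]:
    """Keep only chunks from the most recent version of each source."""
    newest: dict[str, date] = {}
    for _, meta in chunks:
        newest[meta.source] = max(newest.get(meta.source, date.min), meta.publication_date)
    return [(text, meta) for text, meta in chunks if meta.publication_date == newest[meta.source]]
```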

Lesson 5: Test with Questions That Have Known Answers

It's easy to convince yourself that your RAG system works because it returns fluent, confident answers. The reality is that fluent wrong answers are more dangerous than hesitant correct ones.

What we did: Built a test set of 50 clinical questions with documented correct answers. Questions ranged from straightforward ("What is the first-line treatment for hypertension in adults?") to edge cases ("What are the dosing adjustments for amoxicillin in patients with severe renal impairment?"). We evaluated each response for factual accuracy, citation quality, and safety.
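
The harness itself was unremarkable; the value was in the curated questions. A stripped-down sketch, where answer_fn stands in for the RAG pipeline and score_fn for whatever rubric your reviewers apply:

```python
import json

# Minimal evaluation loop over a question set with known answers.
# answer_fn(question) -> generated answer; score_fn returns a 1-5 grade.
def run_eval(questions_path: str, answer_fn, score_fn) -> float:
    with open(questions_path) as f:
        cases = json.load(f)  # [{"question": ..., "expected": ...}, ...]
    scores = []
    for case in cases:
        answer = answer_fn(case["question"])
        scores.append(score_fn(answer, case["expected"]))
    return sum(scores) / len(scores)
```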

The results were humbling. Our initial system scored a 3.2 out of 5 on average — with several "Dangerous" scores where fabricated drug names or incorrect dosages appeared. After improving the parser and chunking strategy, the score improved to 4.1.

Lesson 6: Citation Grounding Is Non-Negotiable

The most important design decision in a clinical RAG system is whether every generated claim is linked to a specific source document. Without citations, there is no way for a clinician to verify the answer. Without verification, there is no safety net.

What we did: Designed the prompt template to require source citations for every factual claim. We also built a post-processing step that extracts cited sources and verifies they are present in the retrieved context. If a citation references a document that was not retrieved, it's flagged as a potential hallucination.
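
A sketch of that citation check. The "[Source: ...]" format is our prompt convention, not a standard; the point is simply to compare cited identifiers against the documents that were actually retrieved:

```python
import re

# Citations are emitted by the model as "[Source: <doc-id>]" per our
# prompt template; flag any cited doc-id absent from retrieved context.
CITATION = re.compile(r"\[Source:\s*([^\]]+)\]")

def unverified_citations(answer: str, retrieved_doc_ids: set[str]) -> list[str]:
    cited = {m.strip() for m in CITATION.findall(answer)}
    return sorted(cited - retrieved_doc_ids)
```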

Lesson 7: Plan for Maintenance From Day One

Clinical guidelines are updated regularly. Drug monographs change. New research supersedes old recommendations. A RAG system that works today will produce outdated answers tomorrow if the knowledge base is not maintained.

What we did: Built a document versioning system that tracks when each guideline was last updated. We set up alerts for new guideline releases from major organizations (AHA, NICE, IDSA) and established a quarterly review cycle for the knowledge base.
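
The versioning layer is also simple to sketch: track when each guideline was last reviewed and surface anything that has gone a quarter without a check. Registry entries and the threshold are illustrative:

```python
from datetime import date, timedelta

# Registry mapping each guideline to its version and last review date.
# Entries are illustrative.
REGISTRY = {
    "nice_hypertension": {"version": "2024-03", "last_reviewed": date(2025, 11, 1)},
    "idsa_cap": {"version": "2019", "last_reviewed": date(2025, 6, 15)},
}

def overdue_for_review(today: date, max_age: timedelta = timedelta(days=90)) -> list[str]:
    """Return guideline keys whose last review is older than one quarter."""
    return [k for k, v in REGISTRY.items() if today - v["last_reviewed"] > max_age]
```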

What I would do differently: Build this from the start, not as an afterthought. Versioning and update workflows are not optional — they are core to the system's safety.

Bottom Line

Building a medical PDF RAG workflow is 80% document preparation and 20% everything else. The quality of your parser, chunking strategy, and metadata tagging determines whether your system will produce reliable, verifiable answers or fluent fabrications. Invest in the foundation.

Disclaimer: This is a technical field report about building RAG systems. It does not constitute medical or legal advice. All clinical RAG deployments should be reviewed by qualified healthcare professionals.
