Medical PDF Preparation Checklist

Step-by-step checklist for preparing medical documents for RAG ingestion.

Document Collection

[ ] Identify all relevant medical document sources
[ ] Collect clinical guidelines (NICE, AHA, ACC, IDSA, etc.)
[ ] Gather drug databases and prescribing information
[ ] Download relevant medical literature (PubMed Central open-access)
[ ] Include hospital-specific protocols and pathways
[ ] Verify all documents are from authoritative sources
[ ] Note the publication date of each document

Document Validation

[ ] Check that PDFs are text-based (not scanned images)
[ ] For scanned PDFs, run OCR with a medical vocabulary
[ ] Verify table and figure extraction quality
[ ] Check for encoding issues in medical notation (subscripts, Greek letters)
[ ] Ensure references and citations are parseable
[ ] Flag documents with complex layouts for manual review
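
The first and fourth checks above can be partly automated. The sketch below is one possible approach, not a definitive one: it assumes you have already extracted one text string per page (for example with PyMuPDF's `page.get_text()`), and the function names and thresholds (`looks_scanned`, `find_encoding_issues`, 50 characters per page) are illustrative choices rather than established standards.

```python
import unicodedata


def looks_scanned(page_texts, min_chars_per_page=50, max_sparse_ratio=0.5):
    """Return True if most pages carry too little extractable text.

    page_texts: list of strings, one per page, from any PDF parser.
    The thresholds are assumptions; tune them on your own corpus.
    """
    if not page_texts:
        return True
    sparse = sum(1 for t in page_texts if len(t.strip()) < min_chars_per_page)
    return sparse / len(page_texts) > max_sparse_ratio


def find_encoding_issues(text):
    """Return (index, char, reason) for characters that often signal damage."""
    issues = []
    for i, ch in enumerate(text):
        category = unicodedata.category(ch)
        if ch == "\ufffd":
            issues.append((i, ch, "replacement character"))
        elif category == "Co":
            # Private-use code points often come from symbol fonts in PDFs.
            issues.append((i, ch, "private-use code point"))
        elif category == "Cc" and ch not in "\n\r\t":
            issues.append((i, ch, "control character"))
    return issues
```

Documents for which `looks_scanned` returns True would go to OCR; any hit from `find_encoding_issues` is a candidate for manual review, since Greek letters and subscripts that extracted cleanly do not trigger it.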

Chunking Strategy

[ ] Choose chunk size: 500-1000 tokens for medical content
[ ] Set overlap: 10-20% to preserve context across boundaries
[ ] Split by section headings where possible
[ ] Keep tables intact as single chunks when possible
[ ] Always keep drug dosage tables as single chunks, even when they exceed the target chunk size
[ ] Preserve hierarchical structure (document → section → subsection)
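
A minimal sketch of the strategy above, under stated assumptions: tokens are approximated by whitespace-separated words (a real tokenizer will count differently), and the input is assumed to be already split by heading into hypothetical `(heading_path, body, is_table)` tuples. The helper names are illustrative.

```python
def chunk_words(words, max_tokens=800, overlap_ratio=0.15):
    """Slide a window of max_tokens words; consecutive windows share an overlap."""
    if max_tokens <= 0 or not 0 <= overlap_ratio < 1:
        raise ValueError("max_tokens must be > 0 and 0 <= overlap_ratio < 1")
    overlap = round(max_tokens * overlap_ratio)
    step = max(1, max_tokens - overlap)
    chunks = []
    start = 0
    while start < len(words):
        chunks.append(words[start:start + max_tokens])
        if start + max_tokens >= len(words):
            break  # the last window already reached the end of the section
        start += step
    return chunks


def chunk_sections(sections, max_tokens=800, overlap_ratio=0.15):
    """Chunk each section separately so no chunk crosses a heading.

    sections: list of (heading_path, body, is_table) tuples, where
    heading_path is e.g. ("Document title", "Section", "Subsection").
    """
    out = []
    for heading_path, body, is_table in sections:
        if is_table:
            texts = [body]  # tables stay whole, whatever their size
        else:
            windows = chunk_words(body.split(), max_tokens, overlap_ratio)
            texts = [" ".join(w) for w in windows]
        for text in texts:
            out.append({"heading_path": heading_path, "text": text})
    return out
```

Storing `heading_path` with every chunk is one way to preserve the document → section → subsection hierarchy at retrieval time.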

Metadata Tagging

[ ] Document source (journal, organization, database)
[ ] Publication date and version
[ ] Medical specialty (cardiology, oncology, etc.)
[ ] Document type (guideline, research paper, drug label)
[ ] Evidence level (if applicable)
[ ] Author and institution
[ ] DOI or unique identifier
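
The fields above could be captured in a small record type, sketched below. The class name, field names, and the allowed document types (taken from the examples in this checklist) are assumptions to adapt, not a fixed schema.

```python
from dataclasses import dataclass, asdict
from datetime import date
from typing import Optional

# Assumption: only the document types named in the checklist are allowed.
ALLOWED_TYPES = {"guideline", "research paper", "drug label"}


@dataclass(frozen=True)
class DocumentMetadata:
    source: str                 # journal, organization, or database
    publication_date: date
    version: str
    specialty: str              # e.g. "cardiology"
    document_type: str          # one of ALLOWED_TYPES
    identifier: str             # DOI or other unique identifier
    authors: tuple = ()
    institution: Optional[str] = None
    evidence_level: Optional[str] = None  # left empty when not applicable

    def __post_init__(self):
        if self.document_type not in ALLOWED_TYPES:
            raise ValueError(f"unknown document type: {self.document_type!r}")

    def to_record(self):
        """Return a JSON-friendly dict to attach to every chunk."""
        record = asdict(self)
        record["publication_date"] = self.publication_date.isoformat()
        record["authors"] = list(self.authors)
        return record
```

Rejecting unknown document types at construction time keeps the tag vocabulary consistent, which matters if retrieval later filters on these fields.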

Quality Checks

[ ] Sample retrieval test: query each document and verify relevant chunks are found
[ ] Check for duplicate content across documents
[ ] Verify that drug names and dosages survive chunking intact
[ ] Test with medical abbreviations and acronyms
[ ] Ensure superseded guidelines are marked or removed
[ ] Validate encoding of special characters (chemical formulas, units)
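
Two of the checks above, dosage survival and duplicate detection, lend themselves to a quick automated pass. The sketch below is illustrative only: `DOSE_PATTERN` covers a handful of common units and is an assumption, not a complete dosage grammar, and duplicates are detected only when chunks match exactly after whitespace and case normalization.

```python
import hashlib
import re

# Assumption: a small set of units; extend for your own corpus.
DOSE_PATTERN = re.compile(
    r"\b\d+(?:\.\d+)?\s?(?:mg|mcg|µg|g|mL|IU|units)(?:/(?:kg|day|mL|h))?\b",
    re.IGNORECASE,
)


def _normalize(text):
    """Collapse whitespace and lowercase so trivial differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def missing_doses(source_text, chunks):
    """Return dose strings present in the source but intact in no chunk.

    Limitation: a dose that is split in one place but appears intact
    elsewhere in the document will not be reported.
    """
    expected = {_normalize(m.group()) for m in DOSE_PATTERN.finditer(source_text)}
    found = set()
    for chunk in chunks:
        found.update(_normalize(m.group()) for m in DOSE_PATTERN.finditer(chunk))
    return sorted(expected - found)


def find_duplicates(chunks):
    """Return groups of chunk indices whose normalized text is identical."""
    seen = {}
    for i, chunk in enumerate(chunks):
        digest = hashlib.sha256(_normalize(chunk).encode("utf-8")).hexdigest()
        seen.setdefault(digest, []).append(i)
    return [indices for indices in seen.values() if len(indices) > 1]
```

A non-empty result from `missing_doses` suggests a chunk boundary fell between a number and its unit, which is a reason to revisit the chunk size or overlap for that document.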

Storage and Versioning

[ ] Store original documents in a version-controlled repository
[ ] Maintain a document inventory with metadata
[ ] Set up a process for regular document updates
[ ] Track which document versions are in the current knowledge base
[ ] Archive superseded documents with expiration dates
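
One way to track which document versions are in the knowledge base is a checksum-based inventory, sketched below. It assumes all source PDFs sit in a single folder; the function names and the inventory layout are hypothetical.

```python
import hashlib
from pathlib import Path


def file_sha256(path, block_size=65536):
    """Hash a file in blocks so large PDFs are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def build_inventory(folder):
    """Return one entry per PDF in folder (non-recursive), sorted by path."""
    entries = []
    for path in sorted(Path(folder).glob("*.pdf")):
        entries.append({
            "file": path.name,
            "sha256": file_sha256(path),
            "bytes": path.stat().st_size,
        })
    return entries


def changed_files(old_inventory, new_inventory):
    """Return names of files that are new or whose content hash changed."""
    old = {entry["file"]: entry["sha256"] for entry in old_inventory}
    return sorted(
        entry["file"]
        for entry in new_inventory
        if old.get(entry["file"]) != entry["sha256"]
    )
```

Comparing the inventory saved at the last ingestion with a fresh one shows which documents need re-chunking and re-indexing, rather than rebuilding the whole knowledge base.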

Recommended Tools

PDF parsing: RAGFlow for complex layouts, PyMuPDF for simple PDFs
OCR: Tesseract with a medical dictionary
Chunking: LangChain RecursiveCharacterTextSplitter or LlamaIndex SentenceSplitter
Validation: Manual review of a sample plus an automated retrieval quality test