Medical PDF Preparation Checklist

A step-by-step checklist for preparing medical PDFs for ingestion into a retrieval-augmented generation (RAG) pipeline.

Document Collection

  • [ ] Identify all relevant medical document sources
  • [ ] Collect clinical guidelines (NICE, AHA, ACC, IDSA, etc.)
  • [ ] Gather drug databases and prescribing information
  • [ ] Download relevant medical literature (PubMed Central open-access)
  • [ ] Include hospital-specific protocols and pathways
  • [ ] Verify all documents are from authoritative sources
  • [ ] Note the publication date of each document

Document Validation

  • [ ] Check that PDFs are text-based (not scanned images)
  • [ ] For scanned PDFs, run OCR with a medical vocabulary
  • [ ] Verify table and figure extraction quality
  • [ ] Check for encoding issues in medical notation (subscripts, Greek letters)
  • [ ] Ensure references and citations are parseable
  • [ ] Flag documents with complex layouts for manual review
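
The text-layer and encoding checks above can be partly automated. The following is a minimal sketch, assuming the text of each page has already been extracted by whatever parser is in use. The function names and the 50-character threshold are illustrative assumptions, not part of any library.

```python
import unicodedata


def needs_ocr(page_texts, min_chars=50):
    """Return 0-based indexes of pages with too little extracted text.

    A page with almost no text usually has no real text layer (scanned image).
    min_chars is an assumed threshold; tune it on your own documents.
    """
    return [i for i, text in enumerate(page_texts) if len(text.strip()) < min_chars]


def suspicious_chars(text):
    """Return the sorted distinct characters that often signal extraction damage.

    Flags U+FFFD (replacement character), private-use code points (category Co)
    and unassigned code points (category Cn). Legitimate Greek letters,
    subscripts and math symbols are not flagged.
    """
    bad = set()
    for ch in text:
        if ch == "\ufffd" or unicodedata.category(ch) in ("Co", "Cn"):
            bad.add(ch)
    return sorted(bad)
```

Pages returned by needs_ocr would go to the OCR step; any document with a non-empty suspicious_chars result is a candidate for manual review.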

Chunking Strategy

  • [ ] Choose chunk size: 500-1000 tokens for medical content
  • [ ] Set overlap: 10-20% to preserve context across boundaries
  • [ ] Split by section headings where possible
  • [ ] Keep tables intact as single chunks when possible
  • [ ] Never split drug dosage tables across chunks
  • [ ] Preserve hierarchical structure (document → section → subsection)
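
The chunking rules above can be sketched as follows. This is an illustration under stated assumptions: whitespace-separated words stand in for tokens (a real tokenizer will give different counts), and the input is assumed to be already split into sections, each given as a (heading_path, text, is_table) tuple. The function names and defaults are hypothetical.

```python
def chunk_words(words, size=750, overlap=100):
    """Split a list of words into windows of `size` words sharing `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def chunk_sections(sections, size=750, overlap=100):
    """Chunk section by section so no chunk crosses a heading boundary.

    sections: iterable of (heading_path, text, is_table).
    Tables are emitted whole, regardless of size.
    """
    out = []
    for path, text, is_table in sections:
        if is_table:
            out.append({"path": path, "text": text, "is_table": True})
            continue
        for piece in chunk_words(text.split(), size, overlap):
            out.append({"path": path, "text": " ".join(piece), "is_table": False})
    return out
```

The defaults (750 words, 100 words of overlap, about 13%) sit inside the ranges given in the checklist. Storing the heading path on every chunk keeps the document → section → subsection hierarchy available at retrieval time.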

Metadata Tagging

  • [ ] Document source (journal, organization, database)
  • [ ] Publication date and version
  • [ ] Medical specialty (cardiology, oncology, etc.)
  • [ ] Document type (guideline, research paper, drug label)
  • [ ] Evidence level (if applicable)
  • [ ] Author and institution
  • [ ] DOI or unique identifier
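
One way to keep these tags consistent is a small record type attached to every chunk. The sketch below is an assumption about how the fields might be modeled; the class name, field names and the ISO 8601 date rule are illustrative, and the example values are placeholders rather than a real document.

```python
from dataclasses import dataclass, asdict
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ChunkMetadata:
    source: str              # journal, organization, or database
    publication_date: str    # ISO 8601 date, e.g. "2024-01-31"
    version: str
    specialty: str           # e.g. "cardiology"
    document_type: str       # "guideline", "research paper", "drug label"
    identifier: str          # DOI or other unique identifier
    authors: tuple = ()
    institution: Optional[str] = None
    evidence_level: Optional[str] = None  # only where applicable

    def __post_init__(self):
        # Reject malformed dates early; raises ValueError on bad input.
        date.fromisoformat(self.publication_date)


# Placeholder values, not a real publication.
meta = ChunkMetadata(
    source="Example Society",
    publication_date="2024-01-31",
    version="1.0",
    specialty="cardiology",
    document_type="guideline",
    identifier="10.0000/example",
)
record = asdict(meta)  # plain dict, ready to store next to the chunk
```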

Quality Checks

  • [ ] Sample retrieval test: run test queries against each document and verify that the relevant chunks are retrieved
  • [ ] Check for duplicate content across documents
  • [ ] Verify that drug names and dosages survive chunking intact
  • [ ] Test with medical abbreviations and acronyms
  • [ ] Ensure superseded guidelines are marked or removed
  • [ ] Validate encoding of special characters (chemical formulas, units)
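
Two of the checks above, duplicate detection and dosage integrity, lend themselves to simple automation. The sketch below is a rough illustration: the dose pattern covers only a handful of common units and is an assumption, not a complete grammar for dosing expressions, and the duplicate check finds only exact matches after case and whitespace normalization. The drug name in the usage note is a placeholder.

```python
import hashlib
import re

# Assumed, deliberately narrow pattern: a number, a unit, an optional "/kg"-style suffix.
DOSE_RE = re.compile(
    r"\b\d+(?:\.\d+)?\s?(?:mg|mcg|g|mL|units?|IU)(?:/(?:kg|mL|day|h))?\b",
    re.IGNORECASE,
)


def missing_doses(source_text, chunks):
    """Return dose strings from the source that no single chunk contains whole."""
    doses = {m.group(0) for m in DOSE_RE.finditer(source_text)}
    return sorted(d for d in doses if not any(d in c for c in chunks))


def fingerprint(text):
    """Hash of the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(chunks):
    """Return lists of chunk indexes that share the same fingerprint."""
    seen = {}
    for i, chunk in enumerate(chunks):
        seen.setdefault(fingerprint(chunk), []).append(i)
    return [idxs for idxs in seen.values() if len(idxs) > 1]
```

For example, missing_doses("Give DrugX 500 mg orally.", ["Give DrugX 500", "mg orally."]) reports "500 mg", because the chunk boundary separates the number from its unit. Near-duplicates with small wording differences would need a fuzzier method than exact hashing.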

Storage and Versioning

  • [ ] Store original documents in a version-controlled repository
  • [ ] Maintain a document inventory with metadata
  • [ ] Set up a process for regular document updates
  • [ ] Track which document versions are in the current knowledge base
  • [ ] Archive superseded documents and record their expiration dates
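
A document inventory can be as simple as one record per document version, with a checksum that ties the record to the exact file that was ingested. The following is a minimal sketch; the record layout and function names are assumptions for illustration only.

```python
import hashlib
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class InventoryEntry:
    identifier: str                       # DOI or other unique identifier
    version: str
    sha256: str                           # checksum of the original PDF bytes
    ingested_on: date
    superseded_on: Optional[date] = None  # set when a newer version replaces it


def sha256_of(data: bytes) -> str:
    """Hex SHA-256 digest of the raw file contents."""
    return hashlib.sha256(data).hexdigest()


def active_entries(inventory, today):
    """Entries that should be in the knowledge base as of `today`."""
    return [
        e for e in inventory
        if e.superseded_on is None or e.superseded_on > today
    ]
```

Comparing active_entries against what is actually indexed shows which document versions the current knowledge base is built from, and which superseded versions still need to be removed.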

Recommended Tools

  • PDF parsing: RAGFlow for complex layouts, PyMuPDF for simple PDFs
  • OCR: Tesseract with a medical dictionary
  • Chunking: LangChain RecursiveCharacterTextSplitter or LlamaIndex SentenceSplitter
  • Validation: manual review of a sample plus an automated retrieval quality test