Document Collection
- [ ] Identify all relevant medical document sources
- [ ] Collect clinical guidelines (NICE, AHA, ACC, IDSA, etc.)
- [ ] Gather drug databases and prescribing information
- [ ] Download relevant medical literature (e.g., the PubMed Central Open Access Subset)
- [ ] Include hospital-specific protocols and pathways
- [ ] Verify all documents are from authoritative sources
- [ ] Note the publication date of each document
Document Validation
- [ ] Check that PDFs are text-based (not scanned images)
- [ ] For scanned PDFs, run OCR with a medical vocabulary
- [ ] Verify table and figure extraction quality
- [ ] Check for encoding issues in medical notation (subscripts, Greek letters)
- [ ] Ensure references and citations are parseable
- [ ] Flag documents with complex layouts for manual review
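
Two of the checks above (text-based versus scanned, and encoding damage in medical notation) can be partly automated. The following is a minimal sketch, assuming you already have one extracted string per page from whichever parser you use. The function names, the 50-character threshold, and the "more than half the pages" rule are illustrative assumptions, not established cutoffs.

```python
import unicodedata

# Assumed threshold: pages with fewer characters than this are treated as
# image-only (likely scanned). Tune it against your own corpus.
MIN_CHARS_PER_PAGE = 50


def looks_scanned(page_texts, min_chars=MIN_CHARS_PER_PAGE):
    """Return True if more than half of the pages yield little or no text."""
    if not page_texts:
        return True
    sparse = sum(1 for text in page_texts if len(text.strip()) < min_chars)
    return sparse / len(page_texts) > 0.5


def find_encoding_issues(text):
    """Return (index, character, reason) tuples for suspicious characters."""
    issues = []
    for i, ch in enumerate(text):
        category = unicodedata.category(ch)
        if ch == "\ufffd":
            issues.append((i, ch, "replacement character"))
        elif category == "Co":
            # Private-use code points often come from unmapped PDF fonts.
            issues.append((i, ch, "private-use character"))
        elif category == "Cc" and ch not in "\n\r\t":
            issues.append((i, ch, "control character"))
    return issues
```

Legitimate notation such as Greek letters, the micro sign, and Unicode subscripts passes this check untouched, so any hits are good candidates for the manual-review queue.
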
Chunking Strategy
- [ ] Choose chunk size: 500-1000 tokens for medical content
- [ ] Set overlap: 10-20% to preserve context across boundaries
- [ ] Split by section headings where possible
- [ ] Keep tables intact as single chunks when possible
- [ ] Never split drug dosage tables, even when they exceed the target chunk size
- [ ] Preserve hierarchical structure (document → section → subsection)
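
The chunking rules above can be sketched in plain Python. This is an illustration under stated assumptions rather than a replacement for a library splitter: tokens are approximated by whitespace-separated words, headings are assumed to be markdown-style `#` lines, tables are assumed to be blocks in which every line starts with `|`, and all function names are hypothetical. The defaults (800 tokens, 15% overlap) sit inside the ranges given in the checklist.

```python
import re

HEADING = re.compile(r"^#{1,6}\s+(.*)$")  # assumed: markdown-style headings


def is_table(block):
    """Treat a block as a table when every line starts with a pipe."""
    lines = block.strip().splitlines()
    return bool(lines) and all(line.lstrip().startswith("|") for line in lines)


def split_sections(text):
    """Return (heading, body) pairs; text before the first heading gets ''."""
    sections, heading, body = [], "", []
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            if any(item.strip() for item in body):
                sections.append((heading, "\n".join(body)))
            heading, body = match.group(1).strip(), []
        else:
            body.append(line)
    if any(item.strip() for item in body):
        sections.append((heading, "\n".join(body)))
    return sections


def chunk_section(heading, body, max_tokens=800, overlap_ratio=0.15):
    """Chunk one section with overlapping windows; tables are never split."""
    chunks, words = [], []
    step = max(1, max_tokens - int(max_tokens * overlap_ratio))

    def flush_prose():
        # Emit the pending prose as overlapping windows of max_tokens words.
        for start in range(0, len(words), step):
            window = words[start:start + max_tokens]
            chunks.append({"section": heading, "kind": "text",
                           "text": " ".join(window)})
            if start + max_tokens >= len(words):
                break
        words.clear()

    for block in re.split(r"\n\s*\n", body):
        if not block.strip():
            continue
        if is_table(block):
            flush_prose()
            chunks.append({"section": heading, "kind": "table",
                           "text": block.strip()})
        else:
            words.extend(block.split())
    flush_prose()
    return chunks


def chunk_document(text, max_tokens=800, overlap_ratio=0.15):
    """Split by heading first, then chunk each section independently."""
    chunks = []
    for heading, body in split_sections(text):
        chunks.extend(chunk_section(heading, body, max_tokens, overlap_ratio))
    return chunks
```

Each chunk carries its section heading, so the section level of the hierarchy survives into the index. Because windows never cross a heading, overlap applies only within a section.
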
Metadata Tagging
- [ ] Document source (journal, organization, database)
- [ ] Publication date and version
- [ ] Medical specialty (cardiology, oncology, etc.)
- [ ] Document type (guideline, research paper, drug label)
- [ ] Evidence level (if applicable)
- [ ] Author and institution
- [ ] DOI or unique identifier
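
One way to keep these tags consistent is a small typed record that is validated before a chunk is indexed. The sketch below is an assumption about how such a record might look: the class name, field names, and the allowed `document_type` values are hypothetical and should be adapted to your own vocabulary.

```python
from dataclasses import dataclass, asdict
from datetime import date
from typing import Optional

# Assumed controlled vocabulary, taken from the document types listed above.
DOCUMENT_TYPES = {"guideline", "research_paper", "drug_label"}


@dataclass(frozen=True)
class ChunkMetadata:
    source: str                 # journal, organization, or database
    publication_date: date
    version: str
    specialty: str              # e.g. "cardiology"
    document_type: str          # one of DOCUMENT_TYPES
    identifier: str             # DOI or other unique identifier
    authors: tuple = ()
    institution: Optional[str] = None
    evidence_level: Optional[str] = None  # only where applicable

    def __post_init__(self):
        if self.document_type not in DOCUMENT_TYPES:
            raise ValueError(f"unknown document_type: {self.document_type!r}")
        if not self.identifier.strip():
            raise ValueError("identifier must not be empty")

    def to_record(self):
        """Return a JSON-serializable dict for storage alongside the chunk."""
        record = asdict(self)
        record["publication_date"] = self.publication_date.isoformat()
        record["authors"] = list(self.authors)
        return record
```

Rejecting unknown document types at construction time keeps later metadata filters (for example, "guidelines only") from silently missing mislabeled chunks.
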
Quality Checks
- [ ] Run a sample retrieval test: issue at least one query per document and verify that the relevant chunks are returned
- [ ] Check for duplicate content across documents
- [ ] Verify that drug names and dosages survive chunking intact
- [ ] Test with medical abbreviations and acronyms
- [ ] Ensure superseded guidelines are marked or removed
- [ ] Validate encoding of special characters (chemical formulas, units)
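
Three of these checks lend themselves to small automated tests. The sketch below is illustrative only: the dose regular expression covers a handful of common units and is an assumption, not a complete grammar, and `search` stands for a hypothetical callable, `search(query, k)`, that returns document IDs from your retriever.

```python
import hashlib
import re

# Assumed pattern: a number followed by a common dose unit, optionally
# followed by "/kg", "/day", and so on. Extend it for your own corpus.
DOSE = re.compile(
    r"\b\d+(?:\.\d+)?\s?(?:mg|mcg|µg|g|mL|units?|IU)(?:/(?:kg|day|h|mL))*\b",
    re.IGNORECASE,
)


def doses(text):
    """Return the set of dose expressions in text, whitespace-normalized."""
    return {re.sub(r"\s+", "", m.group(0)).lower() for m in DOSE.finditer(text)}


def missing_doses(original, chunks):
    """Doses found in the original that appear whole in no single chunk."""
    found = set()
    for chunk in chunks:
        found |= doses(chunk)
    return doses(original) - found


def fingerprint(text):
    """Hash of the text after collapsing whitespace and lowercasing."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(chunks_by_doc):
    """Map fingerprint -> [(doc_id, chunk_index), ...] for repeated chunks."""
    seen = {}
    for doc_id, chunks in chunks_by_doc.items():
        for index, chunk in enumerate(chunks):
            seen.setdefault(fingerprint(chunk), []).append((doc_id, index))
    return {fp: locs for fp, locs in seen.items() if len(locs) > 1}


def retrieval_hit_rate(search, cases, k=5):
    """Fraction of (query, expected_doc_id) cases found in the top k results."""
    if not cases:
        return 0.0
    hits = sum(1 for query, doc_id in cases if doc_id in search(query, k))
    return hits / len(cases)
```

The dose check is set-based, so it catches a dose that is split in every occurrence but not one that is split once and intact elsewhere. The duplicate check finds exact matches after normalization only; near-duplicates need a fuzzier comparison.
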
Storage and Versioning
- [ ] Store original documents in a version-controlled repository
- [ ] Maintain a document inventory with metadata
- [ ] Set up a process for regular document updates
- [ ] Track which document versions are in the current knowledge base
- [ ] Archive superseded documents with expiration dates
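
A document inventory can be as simple as a list of records keyed by document ID, with a content hash to detect silent changes. The sketch below is one possible shape, not a prescribed schema: the field names, the `active`/`superseded` status values, and the function names are all assumptions.

```python
import hashlib
from datetime import date


def file_sha256(path):
    """Hash a file in blocks so that large PDFs do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def register(inventory, doc_id, path, version, published):
    """Add a document version and mark any earlier active one as superseded."""
    today = date.today().isoformat()
    for entry in inventory:
        if entry["doc_id"] == doc_id and entry["status"] == "active":
            entry["status"] = "superseded"
            entry["superseded_on"] = today
    inventory.append({
        "doc_id": doc_id,
        "version": version,
        "published": published,   # ISO date string of the source document
        "path": str(path),
        "sha256": file_sha256(path),
        "added_on": today,
        "status": "active",
    })
    return inventory


def active_versions(inventory):
    """Return {doc_id: version} for everything currently in the knowledge base."""
    return {e["doc_id"]: e["version"] for e in inventory if e["status"] == "active"}
```

Serializing the list with `json.dump` and committing it next to the originals gives a reviewable history of what was in the knowledge base at any point.
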
Recommended Tools
- PDF parsing: RAGFlow for complex layouts, PyMuPDF for simple PDFs
- OCR: Tesseract with a medical dictionary
- Chunking: LangChain RecursiveCharacterTextSplitter or LlamaIndex SentenceSplitter
- Validation: Manual review of a sample plus an automated retrieval-quality test