
Developer's Toolkit: Integrating LeapOCR for Medical Document Processing (Python SDK)

Stop wrestling with brittle PDFs. Learn how to build a scalable, schema-first medical extraction pipeline using the LeapOCR Python SDK and AsyncIO.

developer python medical sdk data-engineering asyncio
Published: January 26, 2026


Medical data streams are messy. You deal with faxed referrals, scanned insurance cards, and sprawling PDF medical records. Traditional OCR gives you “text soup”—a raw string of characters that is useless for downstream automation.

As a developer, you don’t want text; you want Data.

In this guide, we will build a production-grade extraction pipeline using the LeapOCR Python SDK. We will process a batch of patient records, enforce a strict JSON schema, and handle the asynchronous nature of long-running OCR jobs.

The Goal: From PDF to ICD-Ready JSON

We are building a service that takes a URL (e.g., a PDF stored in S3) and returns a structured JSON object containing:

  • Patient Demographics
  • Encounter Date
  • List of Diagnoses (for ICD-10 coding)

Prerequisites

First, install the SDK.

pip install leapocr

Step 1: Initialize the Client

LeapOCR uses API keys for authentication. In a production environment, never hardcode these. Use environment variables.

import os
from leapocr import LeapOCR

# Initialize with Type-Safety
client = LeapOCR(api_key=os.environ["LEAPOCR_API_KEY"])
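Before running the script, export the key in your shell. The variable name matches the snippet above; the value here is just a placeholder:

```shell
export LEAPOCR_API_KEY="your-api-key-here"
```

In deployed environments, prefer your platform's secret manager over shell profiles.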

Step 2: Define Your Schema (The “Contract”)

This is the most critical step. Unlike generic OCR, LeapOCR allows you to define the shape of the data you expect. This acts as a contract between your input documents and your database.


# Define the structure we want to extract
medical_record_schema = {
    "patient": {
        "name": "string",
        "dob": "date (YYYY-MM-DD)",
        "mrn": "string"
    },
    "encounter": {
        "date": "date (YYYY-MM-DD)",
        "provider": "string",
        "facility": "string"
    },
    "clinical_data": {
        "chief_complaint": "string",
        "diagnoses": [
            {
                "condition": "string",
                "icd_10_code": "string (optional)",
                "status": "string (active|resolved)"
            }
        ]
    }
}
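For reference, a successful extraction against this schema yields JSON shaped like the following. The values below are illustrative, not output from a real document:

```python
import json

# A sample record matching medical_record_schema (values invented)
sample_result = {
    "patient": {"name": "Jane Doe", "dob": "1984-03-12", "mrn": "MRN-004521"},
    "encounter": {
        "date": "2026-01-10",
        "provider": "Dr. A. Smith",
        "facility": "Example Clinic",
    },
    "clinical_data": {
        "chief_complaint": "Persistent cough",
        "diagnoses": [
            {"condition": "Acute bronchitis", "icd_10_code": "J20.9", "status": "active"}
        ],
    },
}

# Because the top-level keys mirror the schema, downstream code can rely on them.
print(json.dumps(sample_result, indent=2))
```

This is the "contract" in action: your database mappers and validators can assume these keys exist instead of regex-mining raw text.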

Step 3: The Async Batch Pipeline

Medical records can be hundreds of pages long. Processing them synchronously (blocking the thread) is a bad pattern. Instead, we use Python’s asyncio to submit jobs and poll for results concurrently.


Here is a robust implementation using asyncio.gather:

import asyncio
from leapocr.types import JobStatus

async def process_document(url: str):
    print(f"Submitting: {url}")

    # 1. Submit the job. As written, the SDK call is synchronous, so run it
    #    in a worker thread to keep the event loop free for other documents.
    job = await asyncio.to_thread(
        client.ocr.process_url,
        url,
        format="structured",
        model="pro-v1",  # Optimized for handwriting/medical
        schema=medical_record_schema,
    )

    # 2. Poll for completion with the helper `.wait_until_done()`
    #    (also blocking, hence off-thread). Webhooks are the better
    #    fit for high-volume production.
    print(f"Processing job {job.job_id}...")
    result = await asyncio.to_thread(client.ocr.wait_until_done, job.job_id)

    if result.status == JobStatus.FAILED:
        print(f"Job {job.job_id} failed: {result.error}")
        return None

    return result.data

async def batch_process(urls: list[str]):
    # Create tasks for all URLs
    tasks = [process_document(url) for url in urls]

    # Run them concurrently
    results = await asyncio.gather(*tasks)
    return results

# Run the pipeline
if __name__ == "__main__":
    urls = [
        "https://s3.bucket/patient_A.pdf",
        "https://s3.bucket/patient_B.pdf",
        # ... 1000s of files
    ]
    results = asyncio.run(batch_process(urls))
    print(f"Processed {len(results)} records.")

Step 4: Defense in Depth (Validation)

Even with a schema, you should validate data before it hits your database. The SDK guarantees JSON structure, but your business logic might have stricter rules.

from datetime import date, timedelta

# Business Rule: documents older than 10 years are archived.
# Compute the cutoff rather than hardcoding a date that drifts.
ARCHIVE_CUTOFF = date.today() - timedelta(days=365 * 10)

def validate_record(record: dict) -> bool:
    encounter_date = date.fromisoformat(record['encounter']['date'])
    if encounter_date < ARCHIVE_CUTOFF:
        flag_for_archive(record)
        return False

    # Business Rule: Mandatory MRN
    if not record['patient']['mrn']:
        raise ValueError("Missing Medical Record Number")

    return True
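In a batch pipeline, raising on the first bad record aborts the whole run. An alternative is to partition records and handle rejects separately; a sketch, with illustrative rule names and using the schema's field paths:

```python
def partition_records(records: list) -> tuple[list, list]:
    """Split extracted records into (valid, rejected) lists.

    Rejected entries carry a reason tag so they can be routed
    to logging or manual review instead of killing the batch.
    """
    valid, rejected = [], []
    for rec in records:
        if rec is None:  # failed OCR jobs return None in our pipeline
            rejected.append((rec, "ocr_failed"))
        elif not rec["patient"]["mrn"]:
            rejected.append((rec, "missing_mrn"))
        else:
            valid.append(rec)
    return valid, rejected

sample = [
    {"patient": {"mrn": "MRN-001"}},
    {"patient": {"mrn": ""}},
    None,
]
valid, rejected = partition_records(sample)
print(len(valid), len(rejected))  # 1 2
```

This keeps one malformed fax from failing a thousand-document run.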

Why Developer Experience Matters

We built the Python SDK to respect Python patterns.

  • Type Hinting: Methods are typed for IDE autocompletion.
  • Exceptions: We raise native exceptions (leapocr.errors.AuthenticationError) rather than generic 400s.
  • Context Managers: (Coming soon) for handling efficient file streaming.

Next Steps

You have a working pipeline. Where do you go from here?

  1. Webhooks: For high-volume production, replace polling (wait_until_done) with Webhooks to decouple submission from processing.
  2. Fine-Tuning: If you have specific internal forms, use the API to create a Custom Template for higher accuracy.
  3. Human-in-the-Loop: Use the confidence_score field in the response to route low-confidence docs to a UI for review.
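For point 3, the routing logic can be as small as a threshold check. A sketch, assuming `confidence_score` sits at the top level of each record and using an arbitrary 0.85 cutoff; check the API reference for the real field path:

```python
REVIEW_THRESHOLD = 0.85  # assumed cutoff; tune against your own error rates

def route_record(record: dict) -> str:
    """Return 'auto' for high-confidence extractions, 'review' otherwise."""
    score = record.get("confidence_score", 0.0)  # missing score => review
    return "auto" if score >= REVIEW_THRESHOLD else "review"

print(route_record({"confidence_score": 0.97}))  # auto
print(route_record({"confidence_score": 0.42}))  # review
```

Defaulting a missing score to 0.0 is deliberate: when in doubt, send the document to a human.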

Start building today. Get your API key or View the Full API Reference.

Try LeapOCR on your own documents

Start with 100 free credits and see how your workflow holds up on real files.

Eligible paid plans include a 3-day trial with 100 credits after you add a credit card, so you can test actual PDFs, scans, and forms before committing to a rollout.
