
Checklist: What to Do Before Feeding Documents to an OCR Engine

Garbage in, garbage out. A pre-flight checklist to ensure your documents are ready for high-accuracy extraction.

Tags: ocr, data-quality, checklist, best-practices
Published: December 8, 2025
Read time: 3 min

Modern OCR engines can handle handwritten notes on crumpled receipts and coffee-stained invoices, but they still struggle with poor-quality input. A low-resolution JPEG thumbnail or a sideways page will produce unreliable results.

After processing millions of pages at LeapOCR, we’ve noticed the same issues appearing repeatedly. This checklist covers what to check before sending documents through your pipeline.

1. The File Format Check

LeapOCR accepts over 100 file formats through URL or direct upload. This includes documents (PDF, DOCX, DOC, ODT, RTF, TXT), images (PNG, JPG, WEBP, TIFF, GIF, BMP, HEIC), spreadsheets (XLSX, XLS, CSV, ODS), and presentations (PPTX, PPT, ODP).

PDFs and Word documents are rasterized during processing, so any embedded text layer is ignored and the engine reads the rendered page image. Image formats are processed as-is.

Checklist:

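  • Use native formats when you have them. If you have the original digital PDF or Word document (exported from Word or an accounting system, for example), upload it rather than printing and rescanning. A digital original generally rasterizes more cleanly than a rescan of a printout.
  • For images, choose high-quality formats such as PNG or TIFF and avoid heavy compression, which smears the edges of characters. WEBP works too.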
  • Remove password protection before uploading. The API needs access to read the file contents.
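
The checks above can be roughed out as a small script. This is a sketch under assumptions, not part of the LeapOCR API: `format_problems` and `KNOWN_EXTENSIONS` are names we made up, the extension set only covers the formats named in this article (plus the common `.jpeg` and `.tif` spellings), and the encrypted-PDF test is a byte-level heuristic rather than a real PDF parser.

```python
from pathlib import Path

# Extensions for the formats named in this article. The full list of
# supported formats is longer, so treat this set as a starting point.
KNOWN_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".odt", ".rtf", ".txt",     # documents
    ".png", ".jpg", ".jpeg", ".webp", ".tiff", ".tif",   # images
    ".gif", ".bmp", ".heic",
    ".xlsx", ".xls", ".csv", ".ods",                     # spreadsheets
    ".pptx", ".ppt", ".odp",                             # presentations
}


def format_problems(path):
    """Return a list of human-readable problems; an empty list means OK."""
    path = Path(path)
    problems = []
    ext = path.suffix.lower()
    if ext not in KNOWN_EXTENSIONS:
        problems.append(f"unrecognized extension: {ext or '(none)'}")
    if ext == ".pdf" and b"/Encrypt" in path.read_bytes():
        # Heuristic only: encrypted PDFs reference an /Encrypt dictionary.
        problems.append("PDF appears to be encrypted or password-protected")
    return problems
```

If the heuristic flags a file, open it in a PDF viewer to confirm before stripping the password. For large batches, a proper PDF library will be more reliable than a byte search.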

2. The Resolution Check (DPI)

In our experience, resolution is often the single biggest factor in OCR accuracy.

FIG 1.0: Resolution matters: Screenshots vs Scans (comparison between a 72 DPI screenshot and a 300 DPI scan)

Checklist:

  • Scan at 300 DPI or higher. Below this threshold, results become unreliable.
  • Fill the frame when taking photos with your phone. Get close enough that the text is clearly readable.
  • Avoid screenshots. Screenshots of documents are captured at screen resolution, far below 300 DPI, so the text is often too pixelated for reliable character recognition.
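
To see where an image stands against the 300 DPI threshold, divide its pixel dimensions by the physical size of the page. The sketch below assumes the image frames the whole page and defaults to US Letter (8.5 x 11 inches). `effective_dpi` and `meets_minimum` are illustrative names, not part of any library.

```python
def effective_dpi(pixels_wide, pixels_high, page_width_in=8.5, page_height_in=11.0):
    """Estimate DPI as pixels per inch of paper, using the weaker axis.

    Assumes the image covers the full page; margins around the page in a
    phone photo make the real figure lower.
    """
    return min(pixels_wide / page_width_in, pixels_high / page_height_in)


def meets_minimum(pixels_wide, pixels_high, min_dpi=300, **page_size):
    """True if the estimated DPI reaches min_dpi."""
    return effective_dpi(pixels_wide, pixels_high, **page_size) >= min_dpi


# A full-page US Letter scan at 300 DPI is 2550 x 3300 px.
print(effective_dpi(2550, 3300))    # 300.0
# A 1280 px wide screenshot of the same page falls well short.
print(meets_minimum(1280, 1657))    # False
```

This is also why filling the frame matters: a photo in which the page takes up only half the width has half the effective DPI.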

3. The Visual Check (Preprocessing)

Basic image cleanup improves accuracy.

FIG 2.0: Preprocessing pipeline: Deskewing and contrast enhancement (visualizing the skew correction and binarization steps)

Checklist:

  • Fix orientation if the page is upside down or rotated. LeapOCR detects and corrects rotation automatically, but extreme skew (around 45 degrees) can cause lines to be read out of order.
  • Convert color scans of black-and-white documents to grayscale or true black-and-white. This sharpens text contrast.
  • Straighten crooked scans before processing.
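
A minimal sketch of the orientation and grayscale steps with Pillow, the same library used in the pre-flight script in the Summary. `prepare_scan` is a hypothetical helper name, and the `upside_down` flag is something the caller has to supply, since this sketch does not detect orientation on its own. Deskewing small angles needs an angle estimate first and is left out here.

```python
from PIL import Image, ImageOps


def prepare_scan(img, upside_down=False):
    """Return a grayscale, contrast-stretched copy of a page image."""
    # Apply the EXIF orientation tag that phone cameras write, if present.
    img = ImageOps.exif_transpose(img)
    if upside_down:
        # The caller decides this; nothing here inspects the page content.
        img = img.rotate(180)
    gray = img.convert("L")              # 8-bit grayscale
    return ImageOps.autocontrast(gray)   # stretch darkest/lightest toward 0/255


# Usage (file names are placeholders):
# with Image.open("scan.jpg") as page:
#     prepare_scan(page).save("scan_prepared.png")
```

Saving the result as PNG avoids adding a second round of JPEG compression on top of the original.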

4. The “Data Hygiene” Check

Consider what appears on the page before sending it to the cloud.

Checklist:

  • Redact sensitive data when necessary. If you’re processing medical records or PII and only need header information, mask the body text locally before uploading.
  • Split combined documents. A PDF containing 50 concatenated invoices makes it harder to identify where one ends and the next begins. Separate them into individual files when possible.
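
Local redaction of a page image can be as simple as painting over the regions you do not need. The sketch below uses Pillow. `redact_regions` is a hypothetical helper, and the box coordinates are assumptions you would replace with your own layout.

```python
from PIL import Image, ImageDraw


def redact_regions(img, boxes, fill="black"):
    """Return a copy of img with each (left, top, right, bottom) box painted over."""
    redacted = img.copy()
    draw = ImageDraw.Draw(redacted)
    for box in boxes:
        draw.rectangle(box, fill=fill)
    return redacted


# Usage: keep the top 20% of the page (the header) and mask everything below.
# File names are placeholders.
# with Image.open("record.png") as page:
#     w, h = page.size
#     redact_regions(page, [(0, int(h * 0.2), w, h)]).save("record_header_only.png")
```

Redact on a raster image and upload that image. Drawing a box on top of a PDF in a viewer usually leaves the original content in the file.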

Note: LeapOCR jobs auto-delete after 7 days. For sensitive documents, delete the job yourself as soon as processing finishes rather than waiting for auto-deletion.

5. The Integration Check

Review your implementation code.

FIG 3.0: Robust integration flow with idempotency checks (flow chart showing the idempotency check and API call logic)

Checklist:

  • Handle idempotency. If your network fails and you retry, you might send the same job twice. Generate a hash of the file content (SHA-256 is a safer default than MD5) and use it as your idempotency_key to prevent duplicate processing.
  • Set appropriate timeouts. A complex 10-page legal contract won’t finish in 500ms. Configure your HTTP client with a generous timeout (30+ seconds) or use asynchronous polling.
  • Use webhooks for async processing. Available on Growth+ plans through the dashboard UI, webhooks work better than polling for production flows.
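
The first two items can be sketched with the standard library alone. The hashing helper derives a stable key from the file bytes, so a retry of the same file produces the same idempotency_key. The polling helper is an assumption-heavy illustration: `get_status` stands in for whatever call your client makes to check a job, and the status strings "completed" and "failed" are placeholders, not documented LeapOCR values.

```python
import hashlib
import time


def idempotency_key_for(path, chunk_size=1 << 20):
    """SHA-256 hex digest of the file contents, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def poll_until_done(get_status, timeout_s=120.0, interval_s=2.0, sleep=time.sleep):
    """Call get_status() until it returns a terminal state or time runs out.

    get_status is a caller-supplied function (hypothetical here) returning a
    status string. Raises TimeoutError if the deadline would be exceeded.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status in ("completed", "failed"):   # placeholder terminal states
            return status
        if time.monotonic() + interval_s > deadline:
            raise TimeoutError(f"job still '{status}' after {timeout_s}s")
        sleep(interval_s)
```

Hashing the content rather than the file name means a renamed copy of the same invoice still maps to the same key. Passing `sleep` as a parameter keeps the polling loop easy to test without real delays.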

Summary

DPI and file format can make the difference between 90% and 99% accuracy.

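You can implement a simple pre-flight script that checks an image's pixel dimensions, a rough proxy for resolution, before calling the API:

from PIL import Image

MIN_SIDE_PX = 1000  # rough floor; a letter-size page at 300 DPI is about 2550 x 3300 px

def is_file_ready(path):
    """Return True if the image is large enough to be worth sending to OCR."""
    with Image.open(path) as img:
        # Pixel dimensions stand in for resolution; DPI metadata is often missing.
        if img.width < MIN_SIDE_PX or img.height < MIN_SIDE_PX:
            print(f"Warning: low-resolution image ({img.width}x{img.height} px).")
            return False
    return True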

Test your prepared documents in the LeapOCR Dashboard.

