Checklist: What to Do Before Feeding Documents to an OCR Engine
Garbage in, garbage out. A pre-flight checklist to ensure your documents are ready for high-accuracy extraction.
Modern OCR engines can handle handwritten notes on crumpled receipts and coffee-stained invoices, but they still struggle with poor-quality input. A low-resolution JPEG thumbnail or a sideways document will produce unreliable results.
After processing millions of pages at LeapOCR, we’ve noticed the same issues appearing repeatedly. This checklist covers what to check before sending documents through your pipeline.
1. The File Format Check
LeapOCR accepts over 100 file formats through URL or direct upload. This includes documents (PDF, DOCX, DOC, ODT, RTF, TXT), images (PNG, JPG, WEBP, TIFF, GIF, BMP, HEIC), spreadsheets (XLSX, XLS, CSV, ODS), and presentations (PPTX, PPT, ODP).
PDFs and Word documents are rasterized during processing, meaning embedded text layers are ignored. Image formats go through directly.
Checklist:
- Use native formats when you have them. If you have the original digital file, such as a PDF exported from an accounting system or a Word document, use it rather than printing and rescanning.
- For images, choose high-quality formats like PNG or TIFF and avoid heavy compression. WEBP works too.
- Remove password protection before uploading. The API needs access to read the file contents.
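The format checks above can be sketched as a small local gate. This is a minimal sketch, not LeapOCR's own validation: the extension set only covers the formats named in this article, and `check_format`, `looks_encrypted_pdf`, and the alias table are hypothetical helpers. The `/Encrypt` test is a byte-level heuristic for password-protected PDFs and can produce false positives.

```python
from pathlib import Path

# Extensions named in this article; the full list of 100+ formats is longer.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".odt", ".rtf", ".txt",
    ".png", ".jpg", ".webp", ".tiff", ".gif", ".bmp", ".heic",
    ".xlsx", ".xls", ".csv", ".ods",
    ".pptx", ".ppt", ".odp",
}

# Assumption: common alternate spellings map to the same formats.
ALIASES = {".jpeg": ".jpg", ".tif": ".tiff"}


def check_format(path):
    """Return (ok, detail) based only on the file extension."""
    ext = Path(path).suffix.lower()
    ext = ALIASES.get(ext, ext)
    if ext not in SUPPORTED_EXTENSIONS:
        return False, f"unsupported or unknown extension: {ext or '(none)'}"
    return True, ext


def looks_encrypted_pdf(data):
    """Heuristic: encrypted PDFs reference an /Encrypt dictionary."""
    return data.startswith(b"%PDF") and b"/Encrypt" in data
```

For example, `check_format("scan.jpeg")` returns `(True, ".jpg")`, while a `.zip` archive is rejected before any upload is attempted.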
2. The Resolution Check (DPI)
Resolution affects OCR accuracy more than any other factor.
FIG 1.0 — Resolution matters: Screenshots vs Scans
Checklist:
- Scan at 300 DPI minimum. Anything below this threshold will produce unreliable results.
- Fill the frame when taking photos with your phone. Get close enough that the text is clearly readable.
- Avoid screenshots. Document screenshots usually fail because the text becomes too pixelated for character recognition.
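To make the 300 DPI item concrete: DPI is pixels divided by physical inches, so a US Letter page (8.5 x 11 in) needs roughly 2550 x 3300 pixels. The sketch below assumes you know the physical page size; the function names and the Letter defaults are illustrative choices, not part of any API.

```python
def effective_dpi(width_px, height_px, page_width_in, page_height_in):
    """Pixels per inch implied by the image size and the physical page size."""
    # Use the weaker axis so a stretched or cropped image is not overrated.
    return min(width_px / page_width_in, height_px / page_height_in)


def meets_minimum_dpi(width_px, height_px,
                      page_width_in=8.5, page_height_in=11.0, minimum=300):
    """True if the image resolves the page at `minimum` DPI or better."""
    dpi = effective_dpi(width_px, height_px, page_width_in, page_height_in)
    return dpi >= minimum
```

A 1275 x 1650 pixel image of a Letter page works out to 150 DPI, which fails the check even though both dimensions are above 1000 pixels.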
3. The Visual Check (Preprocessing)
Basic image cleanup improves accuracy.
FIG 2.0 — Preprocessing pipeline: Deskewing and contrast enhancement
Checklist:
- Fix orientation if the page is upside down or rotated. LeapOCR detects and corrects rotation automatically, but extreme skew (around 45 degrees) can cause lines to be read out of order.
- Convert color scans of black-and-white documents to grayscale or true black-and-white. This sharpens text contrast.
- Straighten crooked scans before processing.
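For the black-and-white conversion step, one common way to pick the cut-off between ink and paper is Otsu's method. The article does not name a specific algorithm, so treat this as one possible approach. The sketch works on a flat list of 8-bit grayscale values to stay dependency-free; in practice you would likely use an imaging library's equivalent.

```python
def otsu_threshold(pixels):
    """Return the 0-255 threshold that best separates dark from light pixels."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * count for i, count in enumerate(hist))
    best_t, best_score = 0, -1.0
    w_dark, sum_dark = 0, 0

    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue  # no dark pixels yet
        w_light = total - w_dark
        if w_light == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        # Between-class variance (unnormalized); larger means cleaner split.
        score = w_dark * w_light * (mean_dark - mean_light) ** 2
        if score > best_score:
            best_score, best_t = score, t
    return best_t


def binarize(pixels, threshold):
    """Map pixels at or below the threshold to black (0), the rest to white (255)."""
    return [0 if p <= threshold else 255 for p in pixels]
```

On a scan with dark text on a light background, the threshold lands between the two clusters, so faint gray noise near the paper tone is pushed to white.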
4. The “Data Hygiene” Check
Consider what appears on the page before sending it to the cloud.
Checklist:
- Redact sensitive data when necessary. If you’re processing medical records or PII and only need header information, mask the body text locally before uploading.
- Split combined documents. A PDF containing 50 concatenated invoices makes it harder to identify where one ends and the next begins. Separate them into individual files when possible.
Note: LeapOCR jobs auto-delete after 7 days. For sensitive documents, delete immediately after processing.
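For the document-splitting item above, the bookkeeping can be sketched as turning a list of known first pages into page ranges. This assumes you already know where each invoice starts (for example from a separator sheet or a scanner log); `split_ranges` is a hypothetical helper, and the actual PDF slicing would be done by whichever PDF library you use.

```python
def split_ranges(total_pages, first_pages):
    """Turn 1-based first-page numbers into inclusive (start, end) ranges."""
    starts = sorted(set(first_pages))
    if not starts or starts[0] != 1:
        raise ValueError("the first document must start on page 1")
    if starts[-1] > total_pages:
        raise ValueError("a start page lies beyond the end of the file")

    ranges = []
    for i, start in enumerate(starts):
        # Each document ends one page before the next one begins.
        end = starts[i + 1] - 1 if i + 1 < len(starts) else total_pages
        ranges.append((start, end))
    return ranges
```

For a 7-page file with invoices starting on pages 1, 3, and 6, this yields three ranges: pages 1 to 2, 3 to 5, and 6 to 7.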
5. The Integration Check
Review your implementation code.
FIG 3.0 — Robust integration flow with idempotency checks
Checklist:
- Handle idempotency. If your network fails and you retry, you might send the same job twice. Generate a hash of the file content (MD5/SHA256) and use it as your idempotency_key to prevent duplicate processing.
- Set appropriate timeouts. A complex 10-page legal contract won’t finish in 500 ms. Configure your HTTP client with a generous timeout (30+ seconds) or use asynchronous polling.
- Use webhooks for async processing. Available on Growth+ plans through the dashboard UI, webhooks work better than polling for production flows.
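The idempotency and timeout items above can be sketched with the standard library alone. This is a sketch under assumptions: how the key is passed to the API is not shown here, `fetch_status` is a hypothetical callable that you would implement with your HTTP client, and the status strings "completed" and "failed" are placeholders rather than documented values.

```python
import hashlib
import time


def file_idempotency_key(path, chunk_size=1 << 20):
    """SHA-256 of the file content, read in chunks to keep memory flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def poll_until_done(fetch_status, timeout_s=60.0, interval_s=2.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until it reports a final state or time runs out."""
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status in ("completed", "failed"):  # placeholder status names
            return status
        if clock() + interval_s > deadline:
            raise TimeoutError(f"job still '{status}' after {timeout_s}s")
        sleep(interval_s)
```

Because the key depends only on the bytes of the file, a retry after a network failure produces the same key as the first attempt, which is what lets the server side recognize the duplicate. The `sleep` and `clock` parameters exist so the loop can be tested without real waiting.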
Summary
DPI and file format make the difference between 90% and 99% accuracy.
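You can implement a simple pre-flight script that checks image pixel dimensions, a rough proxy for resolution, before calling the API:

    from PIL import Image

    def is_file_ready(path):
        """Return False if the image is likely too small for reliable OCR."""
        with Image.open(path) as img:
            # Under 1000 px on either side usually means a thumbnail or screenshot.
            if img.width < 1000 or img.height < 1000:
                print("Warning: Low resolution image.")
                return False
        return True
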
Test your prepared documents in the LeapOCR Dashboard.