10 OCR Tips That Actually Work (We Tested Them)
Real-world OCR advice from people who've spent way too much time scanning documents. Learn from our mistakes and get better results, faster.
OCR (Optical Character Recognition) seems like magic until it isn’t. You feed it a pristine PDF, and it works perfectly. Then you feed it a crumpled receipt photo taken in a dimly lit bar, and it returns “Total: $5.00” as “Tot@l: S5.OO”.
We’ve spent countless hours debugging extraction pipelines, scanning everything from coffee-stained invoices to 19th-century handwritten letters. We’ve learned that 90% of OCR problems are input problems.
If you’re building a document processing pipeline, these 10 tips will save you from months of frustration. We’ve tested them all in production environments processing millions of pages.
1. Good Photos Beat Bad Scans
That fancy scanner in your office isn’t always the right tool. Sometimes your phone camera does better work, but only if you use it correctly.
OCR engines struggle with:
- Skew: Even a 5-degree tilt can break line segmentation.
- Glare: Reflections on glossy paper hide text.
- Shadows: Uneven lighting creates false contrasts.
The Fix: Fill the frame with text. Find decent lighting (natural light is best). Hold steady. Most importantly, shoot straight on—don’t tilt the phone.
2. Preprocessing is Non-Negotiable
Spend five minutes preprocessing and you’ll save hours of frustration later. Raw images from cameras often contain noise, shadows, and color gradients that confuse AI models.
Key Preprocessing Steps:
- Binarization: Convert color images to black and white (not grayscale). This forces clear separation between text and background.
- Deskewing: Automatically rotate the image so text lines are horizontal.
- Denoising: Remove isolated pixels (salt-and-pepper noise).
FIG 1.0 — Even basic preprocessing can boost accuracy significantly.
Here’s a simple Python snippet using OpenCV to preprocess an image:
```python
import cv2

def preprocess_image(image_path):
    # Load image directly in grayscale
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Binarize with Otsu's method (the fixed threshold is ignored when
    # THRESH_OTSU is set, so we pass 0)
    _, thresh = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Remove salt-and-pepper noise
    denoised = cv2.fastNlMeansDenoising(thresh, None, 10, 7, 21)
    return denoised
```
3. File Formats Matter
Not all files are created equal. We support over 100 formats (PDF, DOCX, PNG, HEIC, TIFF), but lossless formats always win.
- Avoid: JPG (compression artifacts look like text characters to AI).
- Prefer: PNG or TIFF (lossless, pixel-perfect).
- Resolution: Aim for 300 DPI. Anything below 200 DPI will degrade performance rapidly.
If you control the input source, force users to upload high-resolution images. If they upload a 50KB thumbnail, no amount of AI magic will recover the text.
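You can enforce that gate before OCR ever runs by estimating effective DPI from pixel dimensions alone. A minimal sketch, assuming the upload covers a full US Letter page (the page size and the 200 DPI floor are assumptions to tune for your documents):

```python
def effective_dpi(width_px: int, height_px: int,
                  page_w_in: float = 8.5, page_h_in: float = 11.0) -> float:
    """Estimate effective DPI, assuming the image spans a full page.

    The smaller of the two axis estimates is the binding constraint.
    """
    return min(width_px / page_w_in, height_px / page_h_in)

def accept_upload(width_px: int, height_px: int, min_dpi: float = 200.0) -> bool:
    """Reject uploads whose effective resolution is below the OCR floor."""
    return effective_dpi(width_px, height_px) >= min_dpi

# A 2550x3300 px scan of a Letter page is exactly 300 DPI.
```

Rejecting at upload time gives the user a chance to rescan, which is cheaper than discovering garbage output downstream.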
4. Match the Model to the Document
One size does not fit all. Using a general-purpose OCR model on a handwritten letter is like using a hammer to turn a screw.
- Standard Model: Best for printed text (invoices, contracts). Fast and cheap.
- Neural/Handwriting Model: Specialized for cursive and messy print. Slower but necessary for human inputs.
- Layout/Table Model: Essential for spreadsheets, financial statements, and forms where position matters as much as text.
FIG 2.0 — Don’t use a standard model for complex tables.
5. Explicitly Define the Language
It sounds obvious, but this is the #1 mistake we see. While modern engines can auto-detect languages, auto-detection reduces accuracy.
If you know a document is in German, tell the engine lang="de".
- It loads the correct dictionary.
- It expects specific characters (ä, ö, ß).
- It stops trying to interpret “Die” as the English word “Die”.
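The exact code depends on the engine: Tesseract, for instance, expects ISO 639-2 codes (deu, not de). A small mapping helper, sketched for pytesseract — the mapping table here covers only a few languages and is an assumption to extend:

```python
# ISO 639-1 -> Tesseract (ISO 639-2) language codes; extend as needed.
TESSERACT_LANG = {"en": "eng", "de": "deu", "fr": "fra", "es": "spa", "it": "ita"}

def tesseract_lang(iso639_1: str) -> str:
    """Translate a two-letter language code to Tesseract's convention."""
    try:
        return TESSERACT_LANG[iso639_1]
    except KeyError:
        raise ValueError(f"no Tesseract code mapped for {iso639_1!r}")

# Usage with pytesseract (needs a local Tesseract install, so not run here):
# import pytesseract
# from PIL import Image
# text = pytesseract.image_to_string(Image.open("brief.png"),
#                                    lang=tesseract_lang("de"))
```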
6. Handle Multi-Column Layouts Specifically
Newspapers, academic papers, and newsletters are OCR nightmares. Standard engines read left-to-right, ignoring columns, so you end up with “The president said / reported earnings of” (mixing two columns).
The Fix: Use a layout-aware model or pre-segment the page into regions. Process each column as a separate image block if your OCR engine supports region-of-interest (ROI) definition.
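When the column boundaries are known (or fixed per template), pre-segmenting is just array slicing. A minimal sketch assuming equal-width columns; real layout analysis would detect the gutters instead:

```python
import numpy as np

def split_columns(page: np.ndarray, n_cols: int) -> list:
    """Split a page image into n_cols equal-width vertical strips.

    Each strip can then be OCR'd as its own region, so reading
    order stays within one column at a time.
    """
    h, w = page.shape[:2]
    bounds = [i * w // n_cols for i in range(n_cols + 1)]
    return [page[:, bounds[i]:bounds[i + 1]] for i in range(n_cols)]
```

Concatenate the per-strip text in column order to recover the intended reading flow.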
7. Clean Up Margins and Borders
Scanners often leave black borders or hole-punch marks around the edges. OCR engines try to read these artifacts as text like I, l, or |.
Aggressively crop your images before processing. Removing the outer 5% of pixels is usually safe and eliminates most scanning artifacts.
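That crop is a single slicing expression on the image array. A sketch, using the article's 5% rule of thumb as the default (tune it per scanner):

```python
import numpy as np

def crop_margins(img: np.ndarray, frac: float = 0.05) -> np.ndarray:
    """Drop the outer `frac` of pixels on every side to remove border artifacts."""
    h, w = img.shape[:2]
    dy, dx = int(h * frac), int(w * frac)
    return img[dy:h - dy, dx:w - dx]
```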
8. Use Confidence Scores for Quality Control
Blindly trusting OCR output is dangerous. Every professional OCR engine returns a confidence score (0.0 to 1.0) for each extracted field.
The Strategy:
- Confidence > 0.95: Auto-approve.
- Confidence < 0.90: Flag for human review.
- Confidence < 0.50: Reject image and request re-upload.
FIG 3.0 — Use confidence thresholds to route documents for review.
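Note the thresholds above leave a gray zone between 0.90 and 0.95; a conservative default is to send it to review as well. A minimal sketch of the routing step:

```python
def route_by_confidence(confidence: float) -> str:
    """Route an extracted document based on the engine's confidence score."""
    if confidence < 0.50:
        return "reject"        # image too poor; ask for a re-upload
    if confidence > 0.95:
        return "auto_approve"
    return "human_review"      # includes the 0.90-0.95 gray zone
```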
9. Validate Data with Logic
OCR output should never go directly into your database. Validate it against known logic.
- Dates: Is the `due_date` before the `invoice_date`? (Impossible.)
- Math: Does `subtotal` + `tax` = `total`?
- Formats: Does the extracted IBAN match the regex for a valid IBAN?
If the math doesn’t add up, the OCR likely misread a digit. Use logic to auto-correct or flag these errors.
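These checks are cheap to codify. A sketch with an illustrative invoice dict; the field names and the simplified IBAN pattern (format only, no checksum) are assumptions:

```python
import re
from datetime import date

# Format-only IBAN check: country code, two check digits, 11-30 alphanumerics.
IBAN_RE = re.compile(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}")

def validate_invoice(inv: dict) -> list:
    """Return a list of logic errors; an empty list means the fields are consistent."""
    errors = []
    if inv["due_date"] < inv["invoice_date"]:
        errors.append("due_date is before invoice_date")
    if abs(inv["subtotal"] + inv["tax"] - inv["total"]) > 0.01:
        errors.append("subtotal + tax != total")
    if not IBAN_RE.fullmatch(inv["iban"]):
        errors.append("IBAN fails format check")
    return errors
```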
10. Batch Similar Documents
If you have 500 invoices from standard suppliers and 50 handwritten notes, don’t mix them.
Processing homogeneous batches allows you to:
- Tune parameters for that specific document type.
- Identify systematic errors (e.g., “Supplier X’s logo is always read as text”).
- Apply specific post-processing rules (e.g., “For Supplier Y, ignore the first 3 lines”).
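Splitting a mixed queue into homogeneous batches is a one-pass grouping. A sketch, assuming each document carries a doc_type label from a classifier or the upload form (the field name is an assumption):

```python
from collections import defaultdict

def batch_by_type(docs: list) -> dict:
    """Group documents by type so each batch gets its own model and rules."""
    batches = defaultdict(list)
    for doc in docs:
        batches[doc["doc_type"]].append(doc)
    return dict(batches)
```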
Advanced Technique: Zonal OCR
For fixed-layout forms (like government ID cards or tax forms), don’t OCR the whole page. Define specific “zones” (coordinates) where you expect the Name, ID Number, and Date.
Processing only these small rectangles:
- Reduces noise (you ignore the rest of the page).
- Increases speed (processing 5% of pixels).
- Allows field-specific configuration (Zone A is numbers only; Zone B is text only).
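In code, a zone is just a rectangle plus a per-field configuration. A minimal sketch — the coordinates and field names are hypothetical and would come from your form template:

```python
import numpy as np

# Hypothetical template: (x0, y0, x1, y1) pixel boxes per field.
ZONES = {
    "name":      {"box": (40, 30, 320, 60),  "charset": "text"},
    "id_number": {"box": (40, 70, 200, 100), "charset": "digits"},
}

def crop_zones(page: np.ndarray, zones: dict) -> dict:
    """Cut each configured zone out of the page for field-specific OCR."""
    crops = {}
    for field, cfg in zones.items():
        x0, y0, x1, y1 = cfg["box"]
        crops[field] = page[y0:y1, x0:x1]
    return crops
```

Each crop can then go to the engine with its own settings, e.g. a digits-only whitelist for the ID zone.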
Conclusion
Good OCR isn’t about finding a “perfect” AI model that reads garbage inputs. It’s about building a pipeline that ensures good inputs, selects the right tools, and validates the outputs.
Start with better images, and the text will follow.
Ready to try these tips? Get started with LeapOCR’s free tier and see the difference preprocessing makes.