PDF to JSON in Production: A Schema-First Playbook
A production-focused guide to turning PDFs and scans into schema-fit JSON without building a brittle cleanup layer after OCR.
The hard part of PDF to JSON is not serializing a payload at the end. The hard part is getting document data into a shape another system can trust.
That is why so many “PDF extraction” projects stall in the same place: the OCR step technically works, but the team still spends its time normalizing dates, fixing vendor names, repairing totals, and patching edge cases one document class at a time.
This guide walks through a more reliable approach: start from the downstream contract, keep the schema narrow, and make validation part of the pipeline instead of an afterthought.
Who this is for: developers, engineers, and technical teams who need to turn PDFs and scans into clean JSON with minimal manual cleanup.
Start From The Record You Need To Produce
Before worrying about file formats or OCR models, work backward from your goal. What decision are you trying to automate?
- Paying a vendor
- Approving an expense
- Onboarding a customer
- Triggering a contract workflow
Once you know the decision, identify the 8–12 fields that power it. For an invoice workflow, that might look like invoice_number, invoice_date, due_date, vendor.name, total, and line_items[]. Everything else is noise.
Then figure out where the data needs to go: your database, an ERP system, a CRM, a data warehouse, or a webhook handler.
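Pinning that target record down in code keeps everyone honest about the contract. Here is a minimal TypeScript sketch of the invoice record described above; the field names come from the example, while the type guard is an illustrative addition, not part of any SDK:

```typescript
// Illustrative target record for an invoice workflow, mirroring
// the 8-12 fields discussed above.
interface LineItem {
  description: string;
  total: number; // numeric amount, not a formatted string
}

interface InvoiceRecord {
  invoice_number: string;
  invoice_date: string; // ISO date, e.g. "2024-03-01"
  due_date: string;
  vendor: { name: string };
  line_items: LineItem[];
  total: number;
}

// A narrow runtime check that the critical fields are present
// before anything is written downstream.
function hasRequiredFields(record: Partial<InvoiceRecord>): boolean {
  return (
    typeof record.invoice_number === "string" &&
    typeof record.total === "number" &&
    Array.isArray(record.line_items)
  );
}
```

Everything that does not appear in this interface is, by definition, out of scope for the pipeline.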
Pick The Narrowest Output That Still Works
The right output format depends on your document structure and use case:
structured – Returns a JSON object for the document. This works well for invoices, receipts, and forms where all the information belongs together.
markdown – Returns clean markdown text. Best for search indexing, human review, or archival purposes where you don’t need structured fields.
If you need both searchability and structure, you can run both passes and store them side-by-side.
Choose The Model For The Worst Pages, Not The Best Ones
LeapOCR offers three models, and you’ll usually only need one:
standard-v1 – The default choice. Fast and cost-effective. Works well for clean documents and prototypes.
english-pro-v1 – Optimized for English-only content where accuracy matters more than speed. Good for legal documents or financial statements.
pro-v1 – Handles the difficult cases: mixed languages, handwriting, complex tables, and poor-quality scans.
The practical approach: start with standard-v1, measure where it fails, and only upgrade the workflows where harder pages justify the extra cost.
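One way to operationalize "measure where it fails" is a small, pure helper that inspects a structured result and flags jobs worth re-running on pro-v1. The required-field list is your own; nothing in this sketch is LeapOCR-specific:

```typescript
// Decide whether a standard-v1 result should be retried on pro-v1.
// A result "fails" when any required field is missing or empty.
// requiredFields is an application-level list, not an API concept.
function shouldEscalate(
  result: Record<string, unknown>,
  requiredFields: string[],
): boolean {
  return requiredFields.some((field) => {
    const value = result[field];
    return value === undefined || value === null || value === "";
  });
}
```

Logging which documents trip this check gives you the failure data to decide whether the upgrade is worth the cost.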
Define The Schema Like A Contract, Not A Wish List
You have two options for controlling the JSON structure:
JSON Schema – Define the exact structure in code. This gives you precision and lets you version control the schema. Best when you want to enforce specific types and validation rules.
Template slug – Use a predefined template configured in the LeapOCR dashboard. Useful when non-technical team members need to adjust the extraction rules without a deployment.
Either way, keep your schema focused. Mark critical fields as required, use descriptive names, and be explicit about formats like dates and numeric amounts.
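For the invoice example, a contract-style JSON Schema might look like the sketch below. Treat it as illustrative: which JSON Schema keywords are honored is up to the provider, so check the LeapOCR schema documentation for the supported subset.

```typescript
// A contract-style schema: required fields are explicit, and
// descriptions spell out expected formats. A sketch, not the
// definitive LeapOCR schema dialect.
const invoiceSchema = {
  type: "object",
  required: ["invoice_number", "invoice_date", "total"],
  properties: {
    invoice_number: { type: "string" },
    invoice_date: { type: "string", description: "YYYY-MM-DD" },
    due_date: { type: "string", description: "YYYY-MM-DD" },
    vendor: {
      type: "object",
      properties: { name: { type: "string" } },
    },
    total: { type: "number", description: "Grand total as a number, not a formatted string" },
    line_items: {
      type: "array",
      items: {
        type: "object",
        properties: {
          description: { type: "string" },
          total: { type: "number" },
        },
      },
    },
  },
};
```

Keeping this object in version control means schema changes get reviewed like any other contract change.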
Build A Pipeline That Assumes Documents Will Be Messy
Here’s the complete flow from document to database:
- Upload – Send the file via URL (/ocr/uploads/url) or direct upload (/ocr/uploads/direct)
- Process – Specify your format, model, and schema/template
- Wait – Use waitUntilDone() for simple scripts, or poll /ocr/status/{job_id} if you need progress updates
- Fetch results – Call /ocr/result/{job_id} to get the extracted JSON, metadata, and usage info
- Validate – Run sanity checks (totals match line items, required fields exist, dates are valid)
- Store – Save the JSON to your database or push it to your downstream systems
- Clean up – Call /ocr/delete/{job_id} to remove the data (jobs auto-delete after 7 days regardless)
- Monitor – Track job IDs, processing time, and failure rates for debugging
The key point is that JSON extraction is not finished at step 4. It is finished when the result survives your validation rules and is safe to write downstream.
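The validate step can be broken out into a standalone function. This is a sketch: the field names follow the invoice example, and the one-unit rounding tolerance is an illustrative choice, not API behavior.

```typescript
// Post-extraction sanity checks. Returns a list of problems; an
// empty list means the record is safe to write downstream.
interface Invoice {
  invoice_number?: string;
  invoice_date?: string;
  total?: number;
  line_items?: { total?: number }[];
}

function validateInvoice(invoice: Invoice): string[] {
  const problems: string[] = [];
  if (!invoice.invoice_number) problems.push("missing invoice_number");
  if (!/^\d{4}-\d{2}-\d{2}$/.test(invoice.invoice_date ?? "")) {
    problems.push("invoice_date is not YYYY-MM-DD");
  }
  const summed = (invoice.line_items ?? []).reduce(
    (sum, item) => sum + (item.total ?? 0),
    0,
  );
  if (typeof invoice.total !== "number") {
    problems.push("missing total");
  } else if (Math.abs(summed - invoice.total) > 1) {
    problems.push("line items do not reconcile with total");
  }
  return problems;
}
```

Records with a non-empty problem list go to a manual review queue instead of the database.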
Implementation Example
Here’s a complete TypeScript example showing the full workflow:
import { LeapOCR } from "leapocr";
const client = new LeapOCR({ apiKey: process.env.LEAPOCR_API_KEY });
// Submit the document for processing
const job = await client.ocr.processURL("https://example.com/invoice.pdf", {
format: "structured",
model: "pro-v1",
schema: {
invoice_number: "string",
invoice_date: "string",
due_date: "string",
vendor: { name: "string" },
line_items: [{ description: "string", total: "number" }],
total: "number",
},
instructions: "Return currency as numbers and dates as YYYY-MM-DD.",
});
// Wait for processing to complete
await client.ocr.waitUntilDone(job.jobId);
// Retrieve the results
const result = await client.ocr.getJobResult(job.jobId);
// Validate the extracted data
const page = result.pages[0].result as any;
const summed = (page.line_items || []).reduce(
(sum: number, item: any) => sum + (item.total || 0),
0,
);
if (Math.abs(summed - page.total) > 1) {
console.warn("Line item total doesn't match invoice total — manual review needed");
}
// Store in your database
// await db.invoices.create({ data: page });
// Clean up
await client.ocr.deleteJob(job.jobId);
// Alternative: using a template instead of inline schema
// const job = await client.ocr.processURL("https://example.com/invoice.pdf", {
// templateSlug: "invoice-extraction",
// });
// await client.ocr.waitUntilDone(job.jobId);
// const result = await client.ocr.getJobResult(job.jobId);
Handling The Failure Modes That Matter
Real-world documents come with complications. Here’s how to handle them:
Tables and multi-column layouts – Use structured format. The AI-native layout engine preserves row structure and reading order better than traditional OCR.
Mixed languages or handwriting – Switch to pro-v1 and keep your schema focused on the essential fields. The broader your schema, the more likely you’ll get noise in difficult cases.
Large files – The TypeScript SDK handles multipart uploads automatically. For custom workflows, you can use the direct upload API with resumable uploads.
Progress tracking in UIs – Poll /ocr/status/{job_id} to get status, progress, processed_pages, and total_pages. This keeps users informed during long-running jobs.
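A progress-polling loop could look like the sketch below. The getStatus callback stands in for the actual HTTP call to /ocr/status/{job_id}, and the JobStatus shape mirrors the fields listed above; the status values and interval are assumptions to adjust against the API reference.

```typescript
// Poll a job until it finishes, reporting progress along the way.
// getStatus abstracts the HTTP call to /ocr/status/{job_id}; the
// response shape and status values here are illustrative.
interface JobStatus {
  status: "pending" | "processing" | "done" | "failed";
  processed_pages: number;
  total_pages: number;
}

async function pollUntilDone(
  getStatus: () => Promise<JobStatus>,
  onProgress: (done: number, total: number) => void,
  intervalMs = 2000,
): Promise<JobStatus> {
  for (;;) {
    const s = await getStatus();
    onProgress(s.processed_pages, s.total_pages);
    if (s.status === "done" || s.status === "failed") return s;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Injecting getStatus rather than hard-coding the HTTP call also makes the loop trivial to unit-test.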
Low-quality or skewed scans – Request higher-resolution scans when possible. If you’re stuck with poor quality, use pro-v1 and narrow your schema to only the critical fields.
Searchability + structure – Run both structured and markdown passes. Store the JSON for automation and the markdown for human review and search indexing.
Where Teams Usually Create More Work For Themselves
A few common mistakes:
- extracting every possible field instead of only the fields the workflow needs
- changing the schema and the model at the same time, so failures become hard to debug
- skipping validation because the OCR output “looks mostly right”
- treating markdown and JSON as competing outputs instead of complementary ones
If your document pipeline still needs heavy cleanup after extraction, the issue is usually the contract around the OCR step, not only the OCR model itself.
Optimizing Costs and Performance
A few practical tips for production deployments:
- Structured extraction costs an extra credit per page – Worth it for the automation value, but be aware of the cost impact
- Batch similar documents together – This keeps model selection predictable and helps with cost forecasting
- Delete jobs immediately after processing – Don’t rely on the 7-day auto-deletion; clean up as soon as you’ve persisted the data
- Track key metrics – The API returns pages processed, latency, and credits used. Log these alongside your job IDs for debugging
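Assuming the result payload exposes usage fields along those lines, a flat log record is easy to derive; the exact field names below are illustrative, so map them to whatever /ocr/result actually returns.

```typescript
// Build a flat log record from a job's usage data for debugging
// and cost tracking. Field names are illustrative placeholders.
interface JobUsage {
  pages: number;
  credits: number;
  latencyMs: number;
}

function toLogRecord(jobId: string, usage: JobUsage) {
  return {
    job_id: jobId,
    pages: usage.pages,
    credits: usage.credits,
    latency_ms: usage.latencyMs,
    credits_per_page: usage.pages > 0 ? usage.credits / usage.pages : 0,
    logged_at: new Date().toISOString(),
  };
}
```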
Run A Pilot Against Real Failure Cases
Before rolling this out to production, run a small pilot:
- Gather test documents – Get 10–20 real PDFs or scans, including some messy ones that represent your worst-case scenarios
- Define your schema – Identify the 8–12 fields you actually need
- Test with standard-v1 – Run all documents through with the structured format and the standard-v1 model
- Retry failures with pro-v1 – For documents that miss critical fields, rerun with pro-v1 and compare results
- Validate the output – Check how well the JSON matches your target schema and count how many fields need manual correction
- Add validation logic – Implement basic sanity checks (totals reconcile, required fields present, dates are valid)
- Integrate with your system – Wire up the deletion step and determine where the JSON goes (database, ERP, webhook, etc.)
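For the "compare results" step, a small helper that diffs two extraction runs field by field makes the standard-v1 vs pro-v1 decision concrete. A sketch, using JSON stringification as a simple deep-equality check:

```typescript
// Compare two extraction runs (e.g. standard-v1 vs pro-v1) field
// by field, returning the names of fields where they disagree.
function diffFields(
  a: Record<string, unknown>,
  b: Record<string, unknown>,
): string[] {
  const keys = new Set([...Object.keys(a), ...Object.keys(b)]);
  return [...keys].filter(
    (key) => JSON.stringify(a[key]) !== JSON.stringify(b[key]),
  );
}
```

If the diff is consistently empty on your pilot set, the cheaper model is good enough for that document class.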
Final Take
PDF to JSON gets easier once you stop treating OCR output as the finished product. The real product is a stable record that can survive validation, review, and downstream writes.
If you start from the schema, test on ugly files early, and validate before writing, the pipeline gets much simpler to reason about.
For more details, check out the documentation on formats, models, schemas, and the API reference.
Try LeapOCR on your own documents
Start with 100 free credits and see how your workflow holds up on real files.
Eligible paid plans include a 3-day trial with 100 credits after you add a credit card, so you can test actual PDFs, scans, and forms before committing to a rollout.