From PDF to JSON: How to Turn Your Documents Into Machine-Readable Data
PDFs and scans are great for humans, bad for systems. This guide walks through a lean, production-ready way to turn those documents into JSON your apps can trust—without bolting together a dozen tools.
Who this is for: product/ops/engineering teams that need structured JSON from PDFs/scans with minimal manual cleanup.
What “Machine-Readable” Means for You
- Start with the decision you need: pay a vendor, approve an expense, onboard a customer, trigger a contract workflow.
- Define the 8–12 fields that make that decision automatic (dates, totals, names, IDs, line items, status flags).
- Decide where the JSON lands: DB, ERP/accounting, CRM/support, data warehouse, webhook processor.
Start With the Fields, Not the Files
Pick the 8–12 fields that actually drive your workflow (e.g., invoice_number, invoice_date, due_date, vendor.name, total, line_items[]). Everything else is optional and can wait.
Choose the Right Output Format
structured: One JSON object for the whole document—best for invoices, receipts, and forms.per_page_structured: One JSON object per page—best when each page is its own section (multi-step forms, mixed layouts).markdown: Clean text per page—best for search, review, or summaries (archives, legal text).- If you need both structure and search, store structured JSON and a lightweight markdown pass.
Match the Model to the Job
standard-v1— economical default for clean docs and prototypes.english-pro-v1— higher accuracy for English-only legal/financial content.pro-v1— toughest cases: mixed languages, handwriting, complex layouts and tables.- Start
standard-v1for almost everything; upgrade only when you see recurring misses (handwriting, noisy scans, heavy tables, mixed languages).
Define the Shape With a Schema or Template
- Schema (JSON Schema): Precise control from code; versionable and testable.
- Template slug: Centralized config you can tweak without redeploying.
Keep schemas tight: required fields for critical data, descriptive names, and clear date/money expectations (e.g., “YYYY-MM-DD”, “amounts as numbers”).
The Minimal Pipeline
- Ingest: URL or file upload (
/ocr/uploads/urlor/ocr/uploads/direct). - Process: Pick format, model, and schema/template.
- Wait:
waitUntilDone()for simple flows, or poll/ocr/status/{job_id}if you need progress. - Fetch:
/ocr/result/{job_id}returns pages, model, credits used, and metadata. - Validate: Totals vs line items, required fields present, date/number sanity checks.
- Deliver: Save JSON, emit events, or push to your ERP/CRM.
- Delete:
/ocr/delete/{job_id}to remove data immediately (jobs auto-delete after 7 days). - Observe: Track processed pages, failures, latency; log job_id + source for debugging.
Code Sketch (TypeScript SDK)
import { LeapOCR } from "leapocr";
const client = new LeapOCR({ apiKey: process.env.LEAPOCR_API_KEY });
const job = await client.ocr.processURL("https://example.com/invoice.pdf", {
format: "structured",
model: "pro-v1",
schema: {
invoice_number: "string",
invoice_date: "string",
due_date: "string",
vendor: { name: "string" },
line_items: [
{ description: "string", total: "number" },
],
total: "number",
},
instructions: "Return currency as numbers and dates as YYYY-MM-DD.",
});
await client.ocr.waitUntilDone(job.jobId);
const result = await client.ocr.getJobResult(job.jobId);
// Example: simple validation
const page = result.pages[0].result as any;
const summed = (page.line_items || []).reduce(
(sum: number, item: any) => sum + (item.total || 0),
0,
);
if (Math.abs(summed - page.total) > 1) {
console.warn("Totals do not reconcile — review needed.");
}
await client.ocr.deleteJob(job.jobId);
// Template path (no schema or instructions)
// const job = await client.ocr.processURL("https://example.com/invoice.pdf", {
// templateSlug: "invoice-extraction",
// });
// await client.ocr.waitUntilDone(job.jobId);
// const result = await client.ocr.getJobResult(job.jobId);Handling Edge Cases
- Tables and multi-column layouts: Prefer
structuredorper_page_structured; AI-native layout handling preserves rows and reading order. - Mixed languages or handwriting: Use
pro-v1; keep schemas focused on the essentials. - Large files: SDK handles multipart; you can also use direct upload + complete upload for custom flows.
- Long runs/UI feedback: Poll
/ocr/status/{job_id}forstatus,progress,processed_pages, andtotal_pages. - Low-res or skewed scans: Re-request better images where possible; otherwise stick to
pro-v1and keep schemas narrow. - Search plus structure: Store markdown alongside structured JSON for audit/search while keeping fields machine-ready.
Cost and Throughput Tips
- Structured extraction adds +1 credit/page; pick models per workflow to balance accuracy vs cost.
- Batch similar documents to keep model selection predictable.
- Delete jobs right after persistence; auto-deletion runs after 7 days.
- Monitor: pages processed, failures, latency, and credits used (returned in
/ocr/result/{job_id}).
A 10-Document Pilot Checklist
- Grab 10–20 real PDFs/scans (messy ones included).
- Define the 8–12 fields you need.
- Run with
structured+standard-v1; rerun tough docs withpro-v1. - Check JSON shape vs your target schema; measure manual fixes required.
- Add lightweight validation (totals, required fields, date sanity).
- Wire deletion into the flow after storing results.
- Decide where the JSON goes: ERP/accounting, CRM/support, DB/warehouse, webhook.
Takeaway
You don’t need a sprawling pipeline to get machine-readable data. Pick a format, choose a model, define a schema or template, and run a small pilot. If the JSON drops cleanly into your system with minimal fixes, you’re ready to scale from PDF to production-grade structured data. Docs to skim next: /docs/concepts/formats, /docs/concepts/models, /docs/concepts/schemas, /docs/api.
Ready to automate your document workflows?
Join thousands of developers using LeapOCR to extract data from documents with high accuracy.
Get Started for Free