From PDF to JSON: How to Turn Your Documents Into Machine-Readable Data

PDFs and scans are great for humans, bad for systems. This guide walks through a lean, production-ready way to turn those documents into JSON your apps can trust—without bolting together a dozen tools.

Who this is for: product/ops/engineering teams that need structured JSON from PDFs/scans with minimal manual cleanup.

What “Machine-Readable” Means for You

Start with the decision you need: pay a vendor, approve an expense, onboard a customer, trigger a contract workflow.
Define the 8–12 fields that make that decision automatic (dates, totals, names, IDs, line items, status flags).
Decide where the JSON lands: DB, ERP/accounting, CRM/support, data warehouse, webhook processor.

Start With the Fields, Not the Files

Pick the 8–12 fields that actually drive your workflow (e.g., invoice_number, invoice_date, due_date, vendor.name, total, line_items[]). Everything else is optional and can wait.
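
One way to pin those fields down is a small type definition plus a required-field check. The sketch below is only an illustration: the field names follow the examples above, while the types, the `REQUIRED_FIELDS` list, and the `missingFields` helper are assumptions rather than part of any SDK.

```typescript
// Hypothetical target shape for an invoice workflow. Field names follow the
// examples in the text; the types are assumptions.
interface LineItem {
  description: string;
  total: number;
}

interface InvoiceFields {
  invoice_number: string;
  invoice_date: string; // expected as YYYY-MM-DD
  due_date: string; // expected as YYYY-MM-DD
  vendor: { name: string };
  total: number;
  line_items: LineItem[];
}

// The fields this (hypothetical) workflow cannot proceed without.
const REQUIRED_FIELDS = ["invoice_number", "invoice_date", "vendor", "total"] as const;

// Returns the required fields that are missing or empty in an extracted object.
function missingFields(doc: Partial<InvoiceFields>): string[] {
  return REQUIRED_FIELDS.filter((key) => {
    const value = doc[key];
    return value === undefined || value === null || value === "";
  });
}
```

Anything not in the required list can be added later without blocking the first version of the workflow.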

Choose the Right Output Format

structured: One JSON object for the whole document—best for invoices, receipts, and forms.
per_page_structured: One JSON object per page—best when each page is its own section (multi-step forms, mixed layouts).
markdown: Clean text per page—best for search, review, or summaries (archives, legal text).
If you need both structure and search, store structured JSON and a lightweight markdown pass.

Match the Model to the Job

standard-v1 — economical default for clean docs and prototypes.
english-pro-v1 — higher accuracy for English-only legal/financial content.
pro-v1 — toughest cases: mixed languages, handwriting, complex layouts and tables.
Start with standard-v1 for almost everything; upgrade only when you see recurring misses (handwriting, noisy scans, heavy tables, mixed languages).
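
The escalation rule above could be encoded roughly as follows. The model names come from the list; the `DocTraits` flags and the `pickModel` function are hypothetical and would need to reflect whatever you actually observe in your own documents.

```typescript
type Model = "standard-v1" | "english-pro-v1" | "pro-v1";

// Hypothetical traits you might record per document type during a pilot.
interface DocTraits {
  recurringMisses: boolean; // standard-v1 has repeatedly failed on this document type
  englishOnly: boolean;
  legalOrFinancial: boolean;
  handwriting: boolean;
  mixedLanguages: boolean;
  complexTables: boolean;
}

function pickModel(t: DocTraits): Model {
  // Default: stay on the economical model until misses actually recur.
  if (!t.recurringMisses) return "standard-v1";
  // Toughest cases go straight to pro-v1.
  if (t.handwriting || t.mixedLanguages || t.complexTables) return "pro-v1";
  // English-only legal/financial content has a dedicated model.
  if (t.englishOnly && t.legalOrFinancial) return "english-pro-v1";
  return "pro-v1";
}
```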

Define the Shape With a Schema or Template

Schema (JSON Schema): Precise control from code; versionable and testable.
Template slug: Centralized config you can tweak without redeploying.

Keep schemas tight: required fields for critical data, descriptive names, and clear date/money expectations (e.g., “YYYY-MM-DD”, “amounts as numbers”).
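
As an illustration of a tight schema, here is a sketch in standard JSON Schema form for the invoice fields mentioned earlier. The choice of required fields and the description texts are assumptions, and whether the API accepts exactly this dialect should be confirmed against /docs/concepts/schemas.

```typescript
// A JSON Schema sketch for the invoice example. Required fields and
// descriptions are illustrative assumptions, not a documented contract.
const invoiceSchema = {
  type: "object",
  required: ["invoice_number", "invoice_date", "total"],
  properties: {
    invoice_number: { type: "string", description: "Invoice identifier as printed" },
    invoice_date: { type: "string", description: "Issue date, YYYY-MM-DD" },
    due_date: { type: "string", description: "Payment due date, YYYY-MM-DD" },
    vendor: {
      type: "object",
      properties: { name: { type: "string" } },
    },
    total: { type: "number", description: "Grand total as a number, no currency symbol" },
    line_items: {
      type: "array",
      items: {
        type: "object",
        properties: {
          description: { type: "string" },
          total: { type: "number" },
        },
      },
    },
  },
} as const;
```

Keeping the schema in code like this means it can be versioned and covered by tests alongside the rest of the integration.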

The Minimal Pipeline

Ingest: URL or file upload (/ocr/uploads/url or /ocr/uploads/direct).
Process: Pick format, model, and schema/template.
Wait: waitUntilDone() for simple flows, or poll /ocr/status/{job_id} if you need progress.
Fetch: /ocr/result/{job_id} returns pages, model, credits used, and metadata.
Validate: Totals vs line items, required fields present, date/number sanity checks.
Deliver: Save JSON, emit events, or push to your ERP/CRM.
Delete: /ocr/delete/{job_id} to remove data immediately (jobs auto-delete after 7 days).
Observe: Track processed pages, failures, latency; log job_id + source for debugging.
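
The Validate step can stay small. The following sketch checks required fields, date format, and whether line items reconcile with the total. The `ExtractedInvoice` shape, the helper names, and the default tolerance of 0.01 are assumptions; line items that exclude tax or shipping would need a looser rule.

```typescript
// Assumed shape of one extracted invoice; every field may be absent.
interface ExtractedInvoice {
  invoice_number?: string;
  invoice_date?: string;
  total?: number;
  line_items?: { description?: string; total?: number }[];
}

// True only for real calendar dates written as YYYY-MM-DD.
function isIsoDate(value: string): boolean {
  if (!/^\d{4}-\d{2}-\d{2}$/.test(value)) return false;
  const parsed = new Date(`${value}T00:00:00Z`);
  // Reject both unparseable dates and dates that roll over (e.g. Feb 30).
  return !Number.isNaN(parsed.getTime()) && parsed.toISOString().slice(0, 10) === value;
}

// Returns a list of human-readable problems; an empty list means the document passed.
function validateInvoice(doc: ExtractedInvoice, tolerance = 0.01): string[] {
  const problems: string[] = [];
  if (!doc.invoice_number) problems.push("missing invoice_number");
  if (!doc.invoice_date || !isIsoDate(doc.invoice_date)) {
    problems.push("invoice_date is missing or not YYYY-MM-DD");
  }
  if (typeof doc.total !== "number" || !Number.isFinite(doc.total)) {
    problems.push("total is missing or not a number");
  } else {
    const summed = (doc.line_items ?? []).reduce((sum, item) => sum + (item.total ?? 0), 0);
    if (Math.abs(summed - doc.total) > tolerance) {
      problems.push("line items do not reconcile with total");
    }
  }
  return problems;
}
```

Documents that return a non-empty list can be routed to manual review instead of being delivered downstream.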

Code Sketch (TypeScript SDK)

import { LeapOCR } from "leapocr";

const client = new LeapOCR({ apiKey: process.env.LEAPOCR_API_KEY });

const job = await client.ocr.processURL("https://example.com/invoice.pdf", {
  format: "structured",
  model: "pro-v1",
  schema: {
    invoice_number: "string",
    invoice_date: "string",
    due_date: "string",
    vendor: { name: "string" },
    line_items: [
      { description: "string", total: "number" },
    ],
    total: "number",
  },
  instructions: "Return currency as numbers and dates as YYYY-MM-DD.",
});

await client.ocr.waitUntilDone(job.jobId);

const result = await client.ocr.getJobResult(job.jobId);

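// Example: simple validation (structured format, so the first result holds the whole document)
const page = result.pages[0].result as any;
const summed = (page.line_items || []).reduce(
  (sum: number, item: any) => sum + (item.total || 0),
  0,
);
// A missing total would produce NaN and silently pass the comparison, so check it explicitly.
if (typeof page.total !== "number" || Math.abs(summed - page.total) > 1) {
  console.warn("Totals do not reconcile; review needed.");
}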

await client.ocr.deleteJob(job.jobId);

// Template path (no schema or instructions)
// const job = await client.ocr.processURL("https://example.com/invoice.pdf", {
//   templateSlug: "invoice-extraction",
// });
// await client.ocr.waitUntilDone(job.jobId);
// const result = await client.ocr.getJobResult(job.jobId);

Handling Edge Cases

Tables and multi-column layouts: Prefer structured or per_page_structured; AI-native layout handling preserves rows and reading order.
Mixed languages or handwriting: Use pro-v1; keep schemas focused on the essentials.
Large files: SDK handles multipart; you can also use direct upload + complete upload for custom flows.
Long runs/UI feedback: Poll /ocr/status/{job_id} for status, progress, processed_pages, and total_pages.
Low-res or skewed scans: Re-request better images where possible; otherwise stick to pro-v1 and keep schemas narrow.
Search plus structure: Store markdown alongside structured JSON for audit/search while keeping fields machine-ready.
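
For the polling case, a loop along these lines may be enough. This is a sketch under assumptions: the `JobStatus` shape mirrors the fields named above, but the status values ("completed", "failed") are guesses, and `getStatus` and `sleep` are injected so the function makes no claims about how you call /ocr/status/{job_id}.

```typescript
// Assumed shape of the status payload, based on the fields named in the text.
interface JobStatus {
  status: string; // "completed" and "failed" are assumed values
  processed_pages: number;
  total_pages: number;
}

async function pollUntilDone(
  getStatus: () => Promise<JobStatus>, // caller-supplied: fetches /ocr/status/{job_id}
  sleep: (ms: number) => Promise<void>, // caller-supplied delay
  onProgress: (s: JobStatus) => void = () => {},
  intervalMs = 2000,
  maxAttempts = 150,
): Promise<JobStatus> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const s = await getStatus();
    onProgress(s); // e.g. update a progress bar with processed_pages / total_pages
    if (s.status === "completed") return s;
    if (s.status === "failed") throw new Error("OCR job failed");
    await sleep(intervalMs);
  }
  throw new Error("Timed out waiting for OCR job");
}
```

In a real runtime, `sleep` could be `(ms) => new Promise((resolve) => setTimeout(resolve, ms))`; capping the number of attempts keeps a stuck job from polling forever.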

Cost and Throughput Tips

Structured extraction adds 1 credit per page; pick models per workflow to balance accuracy against cost.
Batch similar documents to keep model selection predictable.
Delete jobs right after persistence; auto-deletion runs after 7 days.
Monitor: pages processed, failures, latency, and credits used (returned in /ocr/result/{job_id}).
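
As a rough worked example of the credit arithmetic: the extra 1 credit per page for structured output comes from the tip above, but the base per-page rate is a placeholder, since per-model pricing is not stated here.

```typescript
// baseCreditsPerPage is a placeholder: look up the real per-model rate
// before relying on this estimate.
function estimateCredits(pages: number, baseCreditsPerPage: number, structured: boolean): number {
  const perPage = baseCreditsPerPage + (structured ? 1 : 0);
  return pages * perPage;
}

// Example: 500 pages at an assumed base of 1 credit/page with structured output.
const estimate = estimateCredits(500, 1, true); // 500 * (1 + 1) = 1000
```

Comparing such an estimate with the credits reported in /ocr/result/{job_id} is a quick way to catch unexpected usage.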

A 10-Document Pilot Checklist

Grab 10–20 real PDFs/scans (messy ones included).
Define the 8–12 fields you need.
Run with structured + standard-v1; rerun tough docs with pro-v1.
Check JSON shape vs your target schema; measure manual fixes required.
Add lightweight validation (totals, required fields, date sanity).
Wire deletion into the flow after storing results.
Decide where the JSON goes: ERP/accounting, CRM/support, DB/warehouse, webhook.
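
To put a number on "manual fixes required", one option is to hand-label the pilot documents and compare per field. The sketch below assumes flat field maps and exact-match comparison; `PilotDoc` and `fixRates` are illustrative names, and nested fields such as line items would need their own comparison.

```typescript
type FieldMap = Record<string, string | number | null>;

// One pilot document: what the extraction returned vs what a human says is correct.
interface PilotDoc {
  extracted: FieldMap;
  expected: FieldMap;
}

// For each field, the share of pilot documents where the extracted value needed a manual fix.
function fixRates(docs: PilotDoc[], fields: string[]): Record<string, number> {
  const rates: Record<string, number> = {};
  for (const field of fields) {
    const wrong = docs.filter((d) => d.extracted[field] !== d.expected[field]).length;
    rates[field] = docs.length === 0 ? 0 : wrong / docs.length;
  }
  return rates;
}
```

Fields with a high fix rate are the ones to revisit first, whether by tightening the schema description or rerunning those documents with a stronger model.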

Takeaway

You don’t need a sprawling pipeline to get machine-readable data. Pick a format, choose a model, define a schema or template, and run a small pilot. If the JSON drops cleanly into your system with minimal fixes, you’re ready to scale from PDF to production-grade structured data. Docs to skim next: /docs/concepts/formats, /docs/concepts/models, /docs/concepts/schemas, /docs/api.
