From PDF to JSON: How to Turn Your Documents Into Machine-Readable Data

A practical pipeline for taking PDFs and scans, extracting only the fields you care about, and delivering clean JSON to your database, CRM, or ERP.

PDFs and scans are great for humans, bad for systems. This guide walks through a lean, production-ready way to turn those documents into JSON your apps can trust—without bolting together a dozen tools.

Who this is for: product/ops/engineering teams that need structured JSON from PDFs/scans with minimal manual cleanup.

What “Machine-Readable” Means for You

  • Start with the decision you need: pay a vendor, approve an expense, onboard a customer, trigger a contract workflow.
  • Define the 8–12 fields that make that decision automatic (dates, totals, names, IDs, line items, status flags).
  • Decide where the JSON lands: DB, ERP/accounting, CRM/support, data warehouse, webhook processor.

Start With the Fields, Not the Files

Pick the 8–12 fields that actually drive your workflow (e.g., invoice_number, invoice_date, due_date, vendor.name, total, line_items[]). Everything else is optional and can wait.

Choose the Right Output Format

  • structured: One JSON object for the whole document—best for invoices, receipts, and forms.
  • per_page_structured: One JSON object per page—best when each page is its own section (multi-step forms, mixed layouts).
  • markdown: Clean text per page—best for search, review, or summaries (archives, legal text).
  • If you need both structure and search, run a lightweight markdown pass as well and store its output alongside the structured JSON.

Match the Model to the Job

  • standard-v1 — economical default for clean docs and prototypes.
  • english-pro-v1 — higher accuracy for English-only legal/financial content.
  • pro-v1 — toughest cases: mixed languages, handwriting, complex layouts and tables.
  • Start with standard-v1 for almost everything; upgrade only when you see recurring misses (handwriting, noisy scans, heavy tables, mixed languages).
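
One way to make that upgrade rule explicit is a small helper that maps the traits behind your recurring misses to a model name. This is only a sketch: the model names come from the list above, while the DocTraits shape and the pickModel function are hypothetical names introduced for illustration, not part of any SDK.

```typescript
// Model names are taken from the list above.
type OcrModel = "standard-v1" | "english-pro-v1" | "pro-v1";

// Hypothetical flags describing what has caused recurring misses for a document type.
interface DocTraits {
  handwriting?: boolean;
  mixedLanguages?: boolean;
  complexTables?: boolean;
  noisyScan?: boolean;
  englishLegalOrFinancial?: boolean;
}

// Default to the economical model and upgrade only when a known trouble trait is present.
function pickModel(traits: DocTraits): OcrModel {
  if (traits.handwriting || traits.mixedLanguages || traits.complexTables || traits.noisyScan) {
    return "pro-v1";
  }
  if (traits.englishLegalOrFinancial) {
    return "english-pro-v1";
  }
  return "standard-v1";
}
```

Keeping the rule in one function makes model choice reviewable per workflow instead of being scattered across call sites.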

Define the Shape With a Schema or Template

  • Schema (JSON Schema): Precise control from code; versionable and testable.
  • Template slug: Centralized config you can tweak without redeploying.

Keep schemas tight: required fields for critical data, descriptive names, and clear date/money expectations (e.g., “YYYY-MM-DD”, “amounts as numbers”).
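
To make "tight" concrete, here is a minimal sketch of an invoice schema written with standard JSON Schema keywords (type, required, properties, description). The field names follow the examples in this article; the descriptions and the missingRequired helper are illustrative assumptions, and whether the API accepts this exact object shape should be checked against /docs/concepts/schemas.

```typescript
// Hypothetical invoice schema in JSON Schema style.
// Descriptions spell out the date and money expectations.
const invoiceSchema = {
  type: "object",
  required: ["invoice_number", "invoice_date", "total"],
  properties: {
    invoice_number: { type: "string", description: "Invoice ID as printed on the document" },
    invoice_date: { type: "string", description: "Issue date, formatted YYYY-MM-DD" },
    due_date: { type: "string", description: "Payment due date, formatted YYYY-MM-DD" },
    vendor: {
      type: "object",
      properties: { name: { type: "string" } },
    },
    line_items: {
      type: "array",
      items: {
        type: "object",
        properties: {
          description: { type: "string" },
          total: { type: "number", description: "Line amount as a number, no currency symbol" },
        },
      },
    },
    total: { type: "number", description: "Invoice total as a number, no currency symbol" },
  },
};

// Returns the required keys that are absent (or null) in an extracted object.
function missingRequired(extracted: Record<string, unknown>): string[] {
  return invoiceSchema.required.filter(
    (key) => extracted[key] === undefined || extracted[key] === null,
  );
}
```

Because the schema is plain data, it can live in version control and be exercised by unit tests like any other code.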

The Minimal Pipeline

  1. Ingest: URL or file upload (/ocr/uploads/url or /ocr/uploads/direct).
  2. Process: Pick format, model, and schema/template.
  3. Wait: waitUntilDone() for simple flows, or poll /ocr/status/{job_id} if you need progress.
  4. Fetch: /ocr/result/{job_id} returns pages, model, credits used, and metadata.
  5. Validate: Totals vs line items, required fields present, date/number sanity checks.
  6. Deliver: Save JSON, emit events, or push to your ERP/CRM.
  7. Delete: /ocr/delete/{job_id} to remove data immediately (jobs auto-delete after 7 days).
  8. Observe: Track processed pages, failures, latency; log job_id + source for debugging.
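
The Validate step can be sketched as a pure function that returns a list of issues instead of throwing, so you can log them or route the document to review. The ExtractedInvoice shape, the function names, and the default tolerance are assumptions for illustration. Totals can legitimately differ from the line-item sum when tax or shipping is listed separately, so adjust the check to your documents.

```typescript
interface LineItem {
  description?: string;
  total?: number;
}

// Hypothetical shape of one extracted invoice; adapt to your own schema.
interface ExtractedInvoice {
  invoice_number?: string;
  invoice_date?: string;
  due_date?: string;
  total?: number;
  line_items?: LineItem[];
}

// True only for real calendar dates written as YYYY-MM-DD.
function isIsoDate(value: unknown): value is string {
  if (typeof value !== "string" || !/^\d{4}-\d{2}-\d{2}$/.test(value)) return false;
  const parsed = new Date(value + "T00:00:00Z");
  // Rejects impossible dates such as 2024-02-30, whether parsed as invalid or rolled over.
  return !isNaN(parsed.getTime()) && parsed.toISOString().slice(0, 10) === value;
}

function validateInvoice(doc: ExtractedInvoice, tolerance: number = 0.01): string[] {
  const issues: string[] = [];
  if (!doc.invoice_number) issues.push("missing invoice_number");
  if (!isIsoDate(doc.invoice_date)) issues.push("invoice_date is not a valid YYYY-MM-DD date");
  if (doc.due_date !== undefined && !isIsoDate(doc.due_date)) {
    issues.push("due_date is not a valid YYYY-MM-DD date");
  }
  // ISO dates compare correctly as strings.
  if (isIsoDate(doc.invoice_date) && isIsoDate(doc.due_date) && doc.due_date < doc.invoice_date) {
    issues.push("due_date is earlier than invoice_date");
  }
  const total = doc.total;
  if (typeof total !== "number" || !isFinite(total)) {
    issues.push("total is missing or not a number");
  } else {
    const items = doc.line_items || [];
    const summed = items.reduce((sum, item) => sum + (item.total || 0), 0);
    if (items.length > 0 && Math.abs(summed - total) > tolerance) {
      issues.push("line items sum to " + summed + " but total is " + total);
    }
  }
  return issues;
}
```

An empty array means the document can flow straight to the Deliver step; anything else is a candidate for manual review.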

Code Sketch (TypeScript SDK)

import { LeapOCR } from "leapocr";

const client = new LeapOCR({ apiKey: process.env.LEAPOCR_API_KEY });

const job = await client.ocr.processURL("https://example.com/invoice.pdf", {
  format: "structured",
  model: "pro-v1",
  schema: {
    invoice_number: "string",
    invoice_date: "string",
    due_date: "string",
    vendor: { name: "string" },
    line_items: [
      { description: "string", total: "number" },
    ],
    total: "number",
  },
  instructions: "Return currency as numbers and dates as YYYY-MM-DD.",
});

await client.ocr.waitUntilDone(job.jobId);

const result = await client.ocr.getJobResult(job.jobId);

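// Example: simple validation on the first page's structured result
const page = result.pages[0].result as any;
const summed = (page.line_items || []).reduce(
  (sum: number, item: any) => sum + (item.total || 0),
  0,
);
// Also flag a missing total: subtracting undefined yields NaN, and NaN > 1 is false,
// so a mismatch check alone would pass silently. The tolerance of 1 is deliberately loose.
if (typeof page.total !== "number" || Math.abs(summed - page.total) > 1) {
  console.warn("Totals do not reconcile; review needed.");
}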

await client.ocr.deleteJob(job.jobId);

// Template path (no schema or instructions)
// const job = await client.ocr.processURL("https://example.com/invoice.pdf", {
//   templateSlug: "invoice-extraction",
// });
// await client.ocr.waitUntilDone(job.jobId);
// const result = await client.ocr.getJobResult(job.jobId);

Handling Edge Cases

  • Tables and multi-column layouts: Prefer structured or per_page_structured; AI-native layout handling preserves rows and reading order.
  • Mixed languages or handwriting: Use pro-v1; keep schemas focused on the essentials.
  • Large files: The SDK handles multipart uploads; for custom flows you can also use direct upload followed by complete upload.
  • Long runs/UI feedback: Poll /ocr/status/{job_id} for status, progress, processed_pages, and total_pages.
  • Low-res or skewed scans: Re-request better images where possible; otherwise stick to pro-v1 and keep schemas narrow.
  • Search plus structure: Store markdown alongside structured JSON for audit/search while keeping fields machine-ready.
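
For the UI feedback case, a status payload can be turned into a short progress line. The field names below come from the polling bullet above; their types, the possible status values, and the describeProgress helper are assumptions made for this sketch, so check the real response shape in /docs/api.

```typescript
// Assumed shape of a /ocr/status/{job_id} response (types are not confirmed by this article).
interface JobStatusPayload {
  status: string;
  processed_pages?: number;
  total_pages?: number;
}

// Builds a short progress line such as "processing: 3/12 pages (25%)".
function describeProgress(payload: JobStatusPayload): string {
  const done = payload.processed_pages;
  const total = payload.total_pages;
  if (typeof done !== "number" || typeof total !== "number" || total <= 0) {
    return payload.status; // page counts not available yet
  }
  const percent = Math.min(100, Math.round((done / total) * 100));
  return payload.status + ": " + done + "/" + total + " pages (" + percent + "%)";
}
```

The percentage is derived from the page counts rather than the progress field, since this article does not say whether progress is reported as a fraction or a percentage.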

Cost and Throughput Tips

  • Structured extraction adds 1 credit per page; pick models per workflow to balance accuracy against cost.
  • Batch similar documents to keep model selection predictable.
  • Delete jobs as soon as results are persisted; otherwise they auto-delete after 7 days.
  • Monitor: pages processed, failures, latency, and credits used (returned in /ocr/result/{job_id}).
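
A quick estimate helps when comparing workflows before a batch run. The sketch below uses the one number stated above (structured extraction adds 1 credit per page). The base rate per model is not given in this article, so it is passed in as a parameter, and estimateCredits is a hypothetical helper.

```typescript
// Estimates credits for a batch. baseCreditsPerPage depends on the model and must be
// looked up in the pricing docs; it is a caller-supplied assumption here.
function estimateCredits(pages: number, baseCreditsPerPage: number, structured: boolean): number {
  const perPage = baseCreditsPerPage + (structured ? 1 : 0);
  return pages * perPage;
}

// Example with a placeholder base rate of 1 credit per page:
// 120 pages with structured extraction -> 120 * (1 + 1) = 240 credits.
const exampleEstimate = estimateCredits(120, 1, true);
```

Compare the estimate with the credits used value returned in the job result to catch drift early.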

A 10-Document Pilot Checklist

  1. Grab 10–20 real PDFs/scans (messy ones included).
  2. Define the 8–12 fields you need.
  3. Run with structured + standard-v1; rerun tough docs with pro-v1.
  4. Check JSON shape vs your target schema; measure manual fixes required.
  5. Add lightweight validation (totals, required fields, date sanity).
  6. Wire deletion into the flow after storing results.
  7. Decide where the JSON goes: ERP/accounting, CRM/support, DB/warehouse, webhook.
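
To measure "manual fixes required" in step 4, one option is to hand-label the expected values for each pilot document and count the fields that differ. The PilotSample shape and both helper functions are hypothetical names for this sketch. Comparison is strict equality on flat fields, so nested values such as line items would need their own comparison.

```typescript
type FieldValue = string | number | null;

// One pilot document: hand-labeled expected values next to what extraction returned.
interface PilotSample {
  source: string;
  expected: { [field: string]: FieldValue };
  extracted: { [field: string]: FieldValue | undefined };
}

// Lists the expected fields whose extracted value differs (or is missing).
function fieldsNeedingFix(sample: PilotSample): string[] {
  return Object.keys(sample.expected).filter(
    (field) => sample.extracted[field] !== sample.expected[field],
  );
}

// Aggregates the pilot: how many documents were clean, and what share of fields needed a fix.
function pilotSummary(samples: PilotSample[]): { documents: number; cleanDocs: number; fieldFixRate: number } {
  let totalFields = 0;
  let fixes = 0;
  let cleanDocs = 0;
  for (const sample of samples) {
    const bad = fieldsNeedingFix(sample).length;
    totalFields += Object.keys(sample.expected).length;
    fixes += bad;
    if (bad === 0) cleanDocs += 1;
  }
  return {
    documents: samples.length,
    cleanDocs: cleanDocs,
    fieldFixRate: totalFields === 0 ? 0 : fixes / totalFields,
  };
}
```

Running the same samples through standard-v1 and then pro-v1 gives a like-for-like fix rate to weigh against the extra cost.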

Takeaway

You don’t need a sprawling pipeline to get machine-readable data. Pick a format, choose a model, define a schema or template, and run a small pilot. If the JSON drops cleanly into your system with minimal fixes, you’re ready to scale from PDF to production-grade structured data. Docs to skim next: /docs/concepts/formats, /docs/concepts/models, /docs/concepts/schemas, /docs/api.
