
How to Extract Text From Scanned PDFs Without Losing Structure

A developer guide to scanned PDF OCR: how to decide between markdown and JSON, where PDF parsing fails, and how to build an extraction layer that still works on ugly real files.

Published March 23, 2026


If every file in your test set came straight out of a modern export button, scanned-PDF extraction looks easy.

Production queues are not that kind. They include office scans, phone photos, re-scanned contracts, skewed invoices, and PDFs where the selectable text only covers half the page.

At that point the problem is no longer “how do I read a PDF?” It becomes “how do I return output that a person, model, or downstream system can still trust?”

This guide walks through scanned-PDF OCR from that angle: structure first, workflow second, raw text last.

FIG 1.0 - Extraction flow from scanned document to schema-fit JSON.

The First Split: Digital PDF Versus Scanned PDF

A scanned PDF often contains little or no machine-readable text. It is closer to a bundle of images than a structured document.

That creates three practical problems:

  • reading order becomes unreliable
  • tables and sections lose structure
  • downstream systems still need fields, not a blob of text

If the file has to become an ERP record, an AP review task, or a clean payload for another service, raw OCR text is still not the finish line.
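Before routing a file, it helps to know which side of that split it is on. A minimal triage sketch, assuming you have already counted the selectable characters per page (for example with pdfjs-dist's `page.getTextContent()`); the 50-character threshold is an illustrative default, not a rule:

```typescript
// Heuristic: classify a PDF as "scanned" when most pages carry little or no
// selectable text. Assumes per-page character counts were extracted upstream.
type PdfKind = "digital" | "scanned" | "mixed";

function classifyPdf(charsPerPage: number[], minCharsPerPage = 50): PdfKind {
  const textPages = charsPerPage.filter((n) => n >= minCharsPerPage).length;
  const ratio = charsPerPage.length === 0 ? 0 : textPages / charsPerPage.length;
  if (ratio >= 0.9) return "digital";
  if (ratio <= 0.1) return "scanned";
  // The tricky middle: files where the selectable text only covers part of
  // the document. These still need OCR despite "having text".
  return "mixed";
}
```

Files classified as `mixed` are the ones that make text-layer-only parsers fail silently, so route them through OCR as well.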

Decide By Failure Mode, Not By File Type

Teams usually choose the wrong extraction tool because they ask the wrong question.

They ask:

  • Is this a PDF parser?
  • Does it support OCR?
  • Can it return text?

The better questions are:

  • What breaks first when the document quality gets worse?
  • Does the output preserve enough structure for the next step?
  • How much cleanup still exists after extraction?

That is the lens that separates a parser that looks good on demos from an OCR workflow that still holds up on bad source files.

Choose The Output Contract Before The Model

Before choosing a model or SDK, decide what the next consumer needs.

Use markdown when:

  • a human needs to review the document
  • you want readable extracted text
  • the result will be passed to an LLM or search index
  • structure like headings and tables still matters

If that is your workflow, a PDF to Markdown API is usually the cleanest fit.

Use structured JSON when:

  • another system expects stable fields
  • you need validation rules
  • you are mapping invoices, receipts, or forms into a known schema
  • manual cleanup needs to shrink over time

If that is your workflow, start with a PDF to JSON OCR API.

FIG 2.0 - Validation checklist highlighting the fields and failure modes that matter before downstream use.

A Safer Workflow For Scanned PDFs

For scanned PDFs, the safest workflow looks like this:

  1. upload the file or a source URL
  2. choose markdown or structured
  3. add instructions only if the document needs normalization
  4. wait for completion
  5. validate the result before writing downstream

What matters here is that validation is not optional. On scanned files, success from the API does not always mean success for the business process.
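Step 5 can be as small as a gate function. A dependency-free sketch, assuming the structured result carries the nullable invoice header fields used later in this guide; a `null` required field means "send to review", not "write downstream":

```typescript
// Gate an extracted record before it touches a downstream system.
// Field names mirror the invoice schema used elsewhere in this guide.
interface InvoiceFields {
  invoice_number: string | null;
  invoice_date: string | null;
  total_amount: number | null;
  currency: string | null;
}

type Verdict = { ok: true } | { ok: false; missing: string[] };

function validateInvoice(fields: InvoiceFields): Verdict {
  // Any null field means the OCR layer was not confident; fail closed.
  const missing = (Object.keys(fields) as (keyof InvoiceFields)[]).filter(
    (k) => fields[k] === null,
  );
  return missing.length === 0 ? { ok: true } : { ok: false, missing };
}
```

The `missing` list doubles as the review task description, which keeps the human step cheap.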

Example: Extract Text From A Scanned PDF Into Markdown

import { LeapOCR } from "leapocr";

const client = new LeapOCR({
  apiKey: process.env.LEAPOCR_API_KEY,
});

// Start an async OCR job from a source URL. "markdown" preserves headings
// and tables as readable structure instead of flat page text.
const job = await client.ocr.processURL("https://example.com/scanned-invoice.pdf", {
  format: "markdown",
  instructions: "Keep headings and tables intact. Normalize dates where possible.",
});

// Block until the job completes, then read the per-page markdown.
const result = await client.ocr.waitUntilDone(job.jobId);

console.log(result.pages[0].text);

This works well when your main need is readable output for review, search, or LLM context building.

Example: Extract Fields From A Scanned PDF Into JSON

import { LeapOCR } from "leapocr";
import { z } from "zod";

// Every field is nullable on purpose: a value that is not confidently
// present should surface as null, not as a guessed string.
const InvoiceSchema = z.object({
  invoice_number: z.string().nullable(),
  invoice_date: z.string().nullable(),
  total_amount: z.number().nullable(),
  currency: z.string().nullable(),
});

const client = new LeapOCR({
  apiKey: process.env.LEAPOCR_API_KEY,
});

const job = await client.ocr.processURL("https://example.com/scanned-invoice.pdf", {
  format: "structured",
  // Convert the Zod schema to JSON Schema for the API.
  schema: z.toJSONSchema(InvoiceSchema),
  instructions:
    "Extract invoice header fields. Return null when a field is not confidently present.",
});

const result = await client.ocr.waitUntilDone(job.jobId);

console.log(result.pages[0].result);

This is the better path when the document has to power automation, validation, or a downstream record.

Where Scanned-PDF Pipelines Usually Fail

1. Text is present, but structure is gone

This happens when OCR returns the words but drops the layout: tables collapse into run-on lines, field labels drift away from their values, and downstream systems still cannot consume the output.

2. The pipeline works on clean PDFs and quietly degrades on scans

A lot of parsing tools look strong until the queue includes:

  • phone photos
  • low-resolution scans
  • skewed pages
  • mixed-language invoices
  • forms with handwritten notes

3. OCR technically succeeds, but operators still rebuild the document by hand

This is the most expensive failure mode. The OCR label looks correct, but the business process still depends on manual normalization.
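One way to shrink that manual work is a narrow, testable cleanup layer instead of ad-hoc fixes per document. A sketch for date normalization; the format list is illustrative, and anything unrecognized is returned as null so the document lands in review rather than being silently guessed:

```typescript
// Normalize the handful of date formats your queue actually produces.
// Returns ISO (YYYY-MM-DD) on success, null when the format is unknown.
function normalizeDate(raw: string): string | null {
  const trimmed = raw.trim();
  // Already ISO: 2026-03-23
  if (/^\d{4}-\d{2}-\d{2}$/.test(trimmed)) return trimmed;
  // US-style: 3/23/2026 or 03/23/2026
  const us = trimmed.match(/^(\d{1,2})\/(\d{1,2})\/(\d{4})$/);
  if (us) {
    const [, m, d, y] = us;
    return `${y}-${m.padStart(2, "0")}-${d.padStart(2, "0")}`;
  }
  // Unknown format: do not guess; route to review instead.
  return null;
}
```

Each new format the queue surfaces becomes one new branch plus one new test, which is the opposite of an operator rebuilding documents by hand.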

How To Evaluate A Scanned-PDF OCR API

Use a test set that includes your worst real files, not only clean samples.

Score each option on:

  • readability of the extracted output
  • table and section preservation
  • field accuracy on messy scans
  • how much validation and cleanup still remains
  • whether the API can return both markdown and JSON
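To keep the comparison honest across vendors, turn those criteria into numbers per document and average over the whole test set. A sketch with illustrative weights (your own weighting will differ); each criterion is scored 0-5 by a reviewer:

```typescript
// Per-document scores, 0-5 each. Higher is better for every criterion,
// so "cleanupRemaining" is inverted: 5 means almost no cleanup left.
interface DocScores {
  readability: number;
  structure: number; // tables and sections preserved
  fieldAccuracy: number; // on messy scans, not clean samples
  cleanupRemaining: number;
}

// Illustrative weights: field accuracy dominates because it is what
// downstream automation actually consumes.
const WEIGHTS: Record<keyof DocScores, number> = {
  readability: 1,
  structure: 2,
  fieldAccuracy: 3,
  cleanupRemaining: 2,
};

// Weighted score normalized to 0..1 so options are directly comparable.
function weightedScore(doc: DocScores): number {
  const keys = Object.keys(WEIGHTS) as (keyof DocScores)[];
  const total = keys.reduce((sum, k) => sum + doc[k] * WEIGHTS[k], 0);
  const max = keys.reduce((sum, k) => sum + 5 * WEIGHTS[k], 0);
  return total / max;
}
```

Averaging `weightedScore` over your worst real files, rather than eyeballing demo output, is what surfaces the quiet degradation described above.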

If you are evaluating alternatives, start with LeapOCR vs PDF Vector for a close comparison on parsing-first versus extraction-first workflows.

What Good Looks Like

A good scanned-PDF workflow should give you four things:

  • readable output when humans need to inspect the result
  • stable JSON when software needs to act on it
  • a narrow place to add cleanup instructions
  • a review path for documents that still fail validation

That combination is usually strongest for:

  • invoice OCR API workflows
  • receipts and expense docs
  • scanned forms
  • logistics paperwork
  • archived PDFs that need to become searchable or structured again

Final Take

For scanned PDFs, the goal is not to extract the most text. The goal is to keep enough structure that the next step does not collapse.

That usually means choosing the output contract first, then choosing an OCR layer that can keep that contract intact across messy, real-world files.

If you want readable extracted text, start with PDF to Markdown. If you need a payload another system can trust, start with structured JSON extraction. If you are evaluating the broader workflow, use the docs and run a real scanned file through the API before deciding.

Try LeapOCR on your own documents

Start with 100 free credits and see how your workflow holds up on real files.

Eligible paid plans include a 3-day trial with 100 credits after you add a credit card, so you can test actual PDFs, scans, and forms before committing to a rollout.
