How to Build an Automated Invoice Processing System Using LeapOCR

Design and implement a real-world invoice processing pipeline with LeapOCR – from defining your data schema to handling async jobs, validation, and integrations.

Invoice processing is one of those tasks that looks simple on paper and quietly eats hours of your team’s time each week. In this guide, we’ll design a straightforward, production-minded workflow that uses the LeapOCR Python SDK to turn incoming invoices into structured data and push them into your existing tools.

We’ll focus on Python 3.9+ and the async-first SDK. If you’d rather use TypeScript, Go, or another language, or call the HTTP API directly, you can find language-specific guides and reference docs in the LeapOCR docs.

What You’ll Build: From Inbox to Structured Invoice Records

At a high level, we’re building this:

  • Invoices arrive as PDFs or images (via email, upload, or object storage)
  • A small Python service sends them to LeapOCR
  • LeapOCR returns structured JSON (invoice number, dates, totals, vendor, etc.)
  • Your service stores that data and syncs it to your accounting or ERP system

Think of this guide as a reference architecture. You can adapt the pieces to FastAPI, Django, Celery, your message queue of choice, or whatever stack you already have.

Architecture Overview: The Core Building Blocks

Let’s break the system into a few small, understandable components:

  • Ingestion layer
    How invoices enter your world: file uploads, an “invoices@company.com” email inbox, a scheduled job that watches an S3 bucket, etc.

  • Processing service (Python + LeapOCR)
    A background worker or API that:

    • Receives an invoice file or URL
    • Submits it to LeapOCR using the Python SDK
    • Waits for processing to finish
    • Normalizes and stores the structured result
  • Data store
    A database table or collection that represents “processed invoices” with fields your business cares about.

  • Integration layer
    Code that pushes those processed invoices into tools like QuickBooks, Xero, or internal finance systems.

A simple text diagram:

Inbox / Upload / Storage
           │
           ▼
 ┌─────────────────────┐
 │  Invoice Processor  │  (Python + LeapOCR SDK)
 └─────────────────────┘
           │
           ▼
   Invoice Database
           │
           ▼
Accounting / ERP / BI

We’ll focus on the Invoice Processor box and how to connect it to LeapOCR cleanly.

Step 1: Decide What Data You Actually Need From Invoices

Before writing code, decide what “done” means for an invoice.

Talk to the people who currently process invoices and list the fields they actually use. Common ones:

  • invoice_number
  • vendor_name
  • invoice_date
  • due_date
  • currency
  • subtotal
  • tax_amount
  • total_amount
  • purchase_order_number
  • line_items (description, quantity, unit_price, line_total)

To make this concrete, here’s a real example invoice we’ll use throughout this guide:

Example French invoice used in this guide

And here’s one way that same document could be represented as structured JSON:

{
  "seller_info": {
    "name": "MAISON JEAN ÉLION",
    "business_type": "EPICERIE EN GROS",
    "address": "66-68, Place Voltaire - CHATEAUROUX",
    "phone": "89"
  },
  "invoice_details": {
    "invoice_number": "1560",
    "date": "Avril 1920",
    "place_of_issue": "Châteauroux"
  },
  "buyer_info": {
    "name": "Vve LABRUNE",
    "location": "Niherne"
  },
  "line_items": [
    {
      "quantity": 24,
      "description": "1/2 tomate",
      "unit_price": 0.5,
      "total_price": 12
    },
    {
      "quantity": 24,
      "description": "4/4 tomate",
      "unit_price": 1,
      "total_price": 24
    },
    {
      "quantity": 10,
      "description": "Lessive k",
      "unit_price": 0.7,
      "total_price": 7
    },
    {
      "quantity": 1,
      "description": "Boite pains epice",
      "unit_price": null,
      "total_price": 3.4
    },
    {
      "quantity": 3,
      "description": "Tapioca",
      "unit_price": 3.25,
      "total_price": 9.75
    },
    {
      "quantity": 10,
      "description": "Paquets de bougies 6 T",
      "unit_price": 3.4,
      "total_price": 34
    },
    {
      "quantity": null,
      "description": "Sortie",
      "unit_price": null,
      "total_price": 0.1
    }
  ],
  "total_amount": 90.25
}

In practice you might normalize or flatten this a bit for your internal systems. A simple JSON shape could look like:

{
  "invoice_number": "INV-10293",
  "vendor_name": "Acme Corp",
  "invoice_date": "2025-11-02",
  "due_date": "2025-11-30",
  "currency": "USD",
  "subtotal": 1200.0,
  "tax_amount": 299.0,
  "total_amount": 1499.0,
  "purchase_order_number": "PO-5541",
  "line_items": [
    {
      "description": "SaaS subscription - November",
      "quantity": 1,
      "unit_price": 1499.0,
      "line_total": 1499.0
    }
  ]
}

This is what we’ll ultimately want out of LeapOCR for each invoice.
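
As a rough illustration of that normalization step, the sketch below maps the nested shape from the French example onto the flat field names. The helper name flatten_invoice is hypothetical (it is not part of LeapOCR), and fields the nested example does not carry are simply left as None.

```python
from typing import Any, Dict, List


def flatten_invoice(nested: Dict[str, Any]) -> Dict[str, Any]:
  """Map the nested example shape onto the flat field names (hypothetical helper)."""
  seller = nested.get("seller_info") or {}
  details = nested.get("invoice_details") or {}

  line_items: List[Dict[str, Any]] = []
  for item in nested.get("line_items") or []:
    line_items.append({
      "description": item.get("description"),
      "quantity": item.get("quantity"),
      "unit_price": item.get("unit_price"),
      # The nested shape calls this total_price; the flat shape calls it line_total.
      "line_total": item.get("total_price"),
    })

  return {
    "invoice_number": details.get("invoice_number"),
    "vendor_name": seller.get("name"),
    # Kept as the raw string: "Avril 1920" has no day, so it is not parsed here.
    "invoice_date": details.get("date"),
    # The nested example has no equivalent for the fields below.
    "due_date": None,
    "currency": None,
    "subtotal": None,
    "tax_amount": None,
    "total_amount": nested.get("total_amount"),
    "purchase_order_number": None,
    "line_items": line_items,
  }
```

Leaving missing values as None, rather than guessing a currency or due date, keeps the gaps visible to whatever validation runs later.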

Step 2: Set Up LeapOCR and Choose a Model/Format

First, install the Python SDK:

pip install leapocr
# or, with uv:
uv add leapocr

You’ll also need:

  • Python 3.9 or higher
  • A LeapOCR API key (LEAPOCR_API_KEY in your environment)

The SDK is async-first and uses async with for proper cleanup:

import asyncio
import os
from leapocr import LeapOCR, ProcessOptions, Format


async def quick_smoke_test() -> None:
  async with LeapOCR(os.getenv("LEAPOCR_API_KEY")) as client:
    job = await client.ocr.process_url(
      "https://example.com/invoice.pdf",
      options=ProcessOptions(
        format=Format.STRUCTURED,
      ),
    )

    result = await client.ocr.wait_until_done(job.job_id)
    print(f"Credits used: {result.credits_used}")
    print(f"Pages processed: {len(result.pages)}")


asyncio.run(quick_smoke_test())

For invoices, you’ll usually want:

  • Format: Format.STRUCTURED (so you get JSON-like results you can store directly)
  • Model: a higher-accuracy model such as Model.PRO_V1 or Model.ENGLISH_PRO_V1 for messy, real-world invoices

The full list of model and format options is in the LeapOCR docs.

Step 3: Define an Invoice Extraction Schema or Template

Next, we’ll tell LeapOCR what we care about by defining a schema for our invoices.

Here’s a minimal JSON schema that matches the shape we sketched earlier:

INVOICE_SCHEMA: dict = {
  "type": "object",
  "properties": {
    "invoice_number": {"type": "string"},
    "vendor_name": {"type": "string"},
    "invoice_date": {"type": "string"},
    "due_date": {"type": "string"},
    "currency": {"type": "string"},
    "subtotal": {"type": "number"},
    "tax_amount": {"type": "number"},
    "total_amount": {"type": "number"},
    "purchase_order_number": {"type": "string"},
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": {"type": "string"},
          "quantity": {"type": "number"},
          "unit_price": {"type": "number"},
          "line_total": {"type": "number"},
        },
      },
    },
  },
}

You can pass this schema directly when you process an invoice file:

import json
import os
from pathlib import Path

from leapocr import LeapOCR, ProcessOptions, Format, Model


async def extract_invoice(path: Path) -> dict:
  async with LeapOCR(os.getenv("LEAPOCR_API_KEY")) as client:
    job = await client.ocr.process_file(
      path,
      options=ProcessOptions(
        format=Format.STRUCTURED,
        model=Model.ENGLISH_PRO_V1,
        schema=INVOICE_SCHEMA,  # the schema dict defined above
        # instructions are optional if your schema and field names are descriptive enough
      ),
    )

    result = await client.ocr.wait_until_done(job.job_id)

    # First page usually holds the main invoice data
    first_page = result.pages[0].result

    # For structured format, result is typically a dict already
    if isinstance(first_page, dict):
      return first_page

    # Fallback: parse JSON string if necessary
    return json.loads(first_page)

Alternatively, if you set up a template in LeapOCR (with its own schema, instructions, and model), you can reference it by template_slug instead of sending a schema from code:

import os
from pathlib import Path

from leapocr import LeapOCR, ProcessOptions


async def extract_invoice_with_template(path: Path) -> dict:
  async with LeapOCR(os.getenv("LEAPOCR_API_KEY")) as client:
    job = await client.ocr.process_file(
      path,
      options=ProcessOptions(
        template_slug="invoice-extraction",  # configure this in LeapOCR
      ),
    )

    result = await client.ocr.wait_until_done(job.job_id)
    page = result.pages[0].result
    return page if isinstance(page, dict) else {}

Schemas vs templates:

  • Schemas in code: easier to version-control and test, good when one service owns the behavior.
  • Templates in LeapOCR: easier to tweak centrally and reuse across many services or teams.
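
Whichever route you pick, it can help to check the returned dict against the expected field types before storing it. The sketch below is a deliberately small, hand-rolled check that covers only the four type names used in this guide's schema. check_types is a hypothetical helper, not a LeapOCR function, and a full JSON Schema validator would be more thorough.

```python
from typing import Any, Dict, List

# Maps the JSON Schema type names used in this guide to Python types.
_PY_TYPES = {
  "string": str,
  "number": (int, float),
  "array": list,
  "object": dict,
}


def check_types(data: Any, schema: Dict[str, Any], path: str = "$") -> List[str]:
  """Return human-readable type mismatches; an empty list means none were found."""
  problems: List[str] = []
  expected = _PY_TYPES.get(schema.get("type", ""))
  # bool is a subclass of int in Python, so reject it explicitly.
  if expected is not None and (
    not isinstance(data, expected) or isinstance(data, bool)
  ):
    return [f"{path}: expected {schema['type']}, got {type(data).__name__}"]

  if isinstance(data, dict):
    for key, sub_schema in schema.get("properties", {}).items():
      # Missing or null fields are skipped here; presence is a separate check.
      if data.get(key) is not None:
        problems += check_types(data[key], sub_schema, f"{path}.{key}")
  elif isinstance(data, list) and "items" in schema:
    for index, element in enumerate(data):
      problems += check_types(element, schema["items"], f"{path}[{index}]")
  return problems
```

Calling check_types(invoice, INVOICE_SCHEMA) on an extracted invoice returns a list you can log or attach to a review record.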

Step 4: Build a Minimal Processing Service in Python

Now let’s wrap this into a small, reusable processing function your app can call.

Here’s a simplified “invoice processor” service function:

import asyncio
import json
import os
from pathlib import Path
from typing import Any, Dict

from leapocr import LeapOCR, ProcessOptions, Format, Model


async def process_invoice_file(path: Path) -> Dict[str, Any]:
  async with LeapOCR(os.getenv("LEAPOCR_API_KEY")) as client:
    job = await client.ocr.process_file(
      path,
      options=ProcessOptions(
        format=Format.STRUCTURED,
        model=Model.ENGLISH_PRO_V1,
        schema=INVOICE_SCHEMA,  # the schema dict from Step 3
        instructions="Extract invoice fields and line items for accounting import.",
      ),
    )

    result = await client.ocr.wait_until_done(job.job_id)
    first_page = result.pages[0].result

    # Structured results are typically a dict; fall back to parsing a JSON string
    if isinstance(first_page, dict):
      invoice = first_page
    else:
      invoice = json.loads(first_page)

    # Optional: delete job immediately instead of waiting 7 days
    await client.ocr.delete_job(job.job_id)

    return invoice


# For manual testing:
if __name__ == "__main__":
  print(asyncio.run(process_invoice_file(Path("invoice.pdf"))))

In a real system, this function might be called from:

  • A FastAPI route that accepts file uploads
  • A Celery / RQ / Dramatiq worker consuming messages from a queue
  • A scheduled job that iterates over new files in storage
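
For the scheduled-job case, one possible sketch is a scan of a local inbox directory that skips files whose content has already been handled. The helper name find_new_invoices, the suffix list, and the use of a SHA-256 content hash for de-duplication are all assumptions for illustration. With object storage you would list keys instead of calling iterdir.

```python
import hashlib
from pathlib import Path
from typing import Iterator, Set, Tuple

# Assumed set of file types; adjust to whatever your ingestion layer accepts.
INVOICE_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg"}


def find_new_invoices(inbox: Path, seen_hashes: Set[str]) -> Iterator[Tuple[Path, str]]:
  """Yield (path, sha256) for invoice files whose content has not been seen yet."""
  for path in sorted(inbox.iterdir()):
    if not path.is_file() or path.suffix.lower() not in INVOICE_SUFFIXES:
      continue
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    if digest in seen_hashes:
      continue  # Same bytes already handled, e.g. a re-sent email attachment.
    seen_hashes.add(digest)
    yield path, digest
```

In a real deployment the seen_hashes set would live in your database rather than in memory, so restarts do not cause reprocessing.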

Step 5: Handle Asynchronous Jobs and Scaling

For early prototypes or low volume, calling wait_until_done directly is fine. As volume grows, you’ll want more control.

Simple approach: wait_until_done

The examples above already use:

result = await client.ocr.wait_until_done(job.job_id)

This internally polls the job until it completes or times out.

Manual polling for more control

If you want to handle progress yourself (or store intermediate status), you can poll manually:

import asyncio
import os

from leapocr import LeapOCR, ProcessOptions, Format


async def manual_polling(max_wait: float = 300.0) -> None:
  async with LeapOCR(os.getenv("LEAPOCR_API_KEY")) as client:
    job = await client.ocr.process_url(
      "https://example.com/invoice.pdf",
      options=ProcessOptions(format=Format.MARKDOWN),
    )

    print(f"Job created: {job.job_id}")

    # Stop polling after max_wait seconds so a failed or stuck job
    # cannot keep this loop running forever.
    loop = asyncio.get_running_loop()
    deadline = loop.time() + max_wait

    while True:
      status = await client.ocr.get_job_status(job.job_id)
      print(f"Status: {status.status.value} - {status.progress:.1f}%")

      if status.status.value == "completed":
        break

      if loop.time() >= deadline:
        raise TimeoutError(f"Job {job.job_id} did not complete within {max_wait}s")

      await asyncio.sleep(2)

    result = await client.ocr.get_results(job.job_id)
    print(f"Processing complete: {len(result.pages)} pages")


asyncio.run(manual_polling())

Progress callbacks for large batches

The Python SDK also supports progress callbacks with PollOptions:

import os
from pathlib import Path

from leapocr import LeapOCR, PollOptions


async def track_progress_for_large_invoice(path: Path) -> None:
  def on_progress(status) -> None:
    print(
      f"Progress: {status.progress:.1f}% "
      f"({status.processed_pages}/{status.total_pages} pages)"
    )

  async with LeapOCR(os.getenv("LEAPOCR_API_KEY")) as client:
    job = await client.ocr.process_file(path)

    result = await client.ocr.wait_until_done(
      job.job_id,
      poll_options=PollOptions(
        poll_interval=2.0,
        max_wait=300.0,
        on_progress=on_progress,
      ),
    )

    print(f"Done! {len(result.pages)} pages processed.")

As you scale, you’ll typically:

  • Move invoice processing off the request/response path
  • Queue jobs and have workers call LeapOCR concurrently
  • Use progress tracking to give ops/finance visibility into backlogs
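
One way to have workers call LeapOCR concurrently without flooding the API is to cap the number of in-flight jobs with a semaphore. The sketch below is generic asyncio and makes no LeapOCR calls itself: process_many and its max_concurrency default are hypothetical, and the per-file coroutine is passed in as an argument.

```python
import asyncio
from pathlib import Path
from typing import Any, Awaitable, Callable, Dict, List

ProcessFn = Callable[[Path], Awaitable[Dict[str, Any]]]


async def process_many(
  paths: List[Path],
  process_one: ProcessFn,
  max_concurrency: int = 5,  # assumed default; tune to your plan's rate limits
) -> List[Any]:
  """Run process_one over paths with at most max_concurrency calls in flight."""
  semaphore = asyncio.Semaphore(max_concurrency)

  async def guarded(path: Path) -> Dict[str, Any]:
    async with semaphore:
      return await process_one(path)

  # return_exceptions=True keeps one bad invoice from failing the whole batch;
  # failures come back as exception objects in the same position as their input.
  return await asyncio.gather(*(guarded(p) for p in paths), return_exceptions=True)
```

You could pass process_invoice_file from Step 4 as process_one, then route any exception objects in the result list to a retry or review queue.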

Step 6: Validate Results and Flag Exceptions

LeapOCR will get you most of the way, but you still need business logic to catch extraction errors and unusual invoices.

At minimum, you’ll want to:

  • Check that required fields are present (invoice number, vendor, total)
  • Ensure totals and line items roughly reconcile
  • Flag obviously bad dates (far in the past/future)

Example of a simple validator:

from typing import Any, Dict, Literal, TypedDict


ReviewStatus = Literal["auto_approved", "needs_review"]


class ValidatedInvoice(TypedDict):
  data: Dict[str, Any]
  review_status: ReviewStatus
  issues: list[str]


def validate_invoice(invoice: Dict[str, Any]) -> ValidatedInvoice:
  issues: list[str] = []

  if not invoice.get("invoice_number"):
    issues.append("Missing invoice_number")
  if not invoice.get("vendor_name"):
    issues.append("Missing vendor_name")

  total = invoice.get("total_amount")
  if total is None:
    issues.append("Missing total_amount")

  # Very naive reconciliation example
  line_items = invoice.get("line_items") or []
  line_total_sum = sum(
    (item.get("line_total") or 0) for item in line_items if isinstance(item, dict)
  )

  if isinstance(total, (int, float)) and line_total_sum and abs(total - line_total_sum) > 1:
    issues.append("Total does not match sum of line_items (±1 tolerance)")

  status: ReviewStatus = "auto_approved" if not issues else "needs_review"

  return {
    "data": invoice,
    "review_status": status,
    "issues": issues,
  }

In your pipeline, you can:

  • Auto-ingest invoices with review_status == "auto_approved"
  • Send needs_review invoices to a human queue in your internal UI
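
The validator above leaves out the date check from the earlier list. A minimal sketch of that check follows, assuming dates arrive as ISO strings such as "2025-11-02". The function name and both thresholds (365 days back, 30 days ahead) are illustrative choices, not recommendations.

```python
from datetime import date, timedelta
from typing import Any, Dict, List, Optional


def check_invoice_dates(
  invoice: Dict[str, Any],
  today: Optional[date] = None,
  max_age_days: int = 365,     # assumed threshold
  max_future_days: int = 30,   # assumed threshold
) -> List[str]:
  """Return issues for unparseable or implausible invoice and due dates."""
  today = today or date.today()
  issues: List[str] = []
  parsed: Dict[str, date] = {}

  for field in ("invoice_date", "due_date"):
    raw = invoice.get(field)
    if not raw:
      continue  # Presence is handled by other checks.
    try:
      parsed[field] = date.fromisoformat(raw)
    except (TypeError, ValueError):
      issues.append(f"{field} is not an ISO date: {raw!r}")

  invoice_date = parsed.get("invoice_date")
  if invoice_date is not None:
    if invoice_date < today - timedelta(days=max_age_days):
      issues.append(f"invoice_date is more than {max_age_days} days old")
    if invoice_date > today + timedelta(days=max_future_days):
      issues.append(f"invoice_date is more than {max_future_days} days in the future")

  due_date = parsed.get("due_date")
  if invoice_date is not None and due_date is not None and due_date < invoice_date:
    issues.append("due_date is before invoice_date")

  return issues
```

The returned strings can be appended to the issues list in validate_invoice, so a bad date sends the invoice to needs_review like any other problem.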

Step 7: Syncing With Your Accounting or ERP System

Once invoices are parsed and validated, you can push them into your accounting system.

Common patterns:

  • Direct API calls: after successful OCR + validation, call QuickBooks/Xero/internal APIs immediately.
  • Outbox table + periodic sync: store processed invoices in a DB table with synced=false and have a separate job push them in batches.
  • Webhooks/events: emit an event like invoice.processed that other services subscribe to.

Whichever approach you take, keep a clear audit trail:

  • Store a reference to the original file location (e.g., S3 key or blob ID)
  • Store the LeapOCR job ID (for debugging if something looks off)
  • Keep the structured JSON you sent to the accounting API

That way, if finance asks “why is this number here?”, you can trace it all the way back.
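
As one possible shape for the outbox pattern with the audit fields listed above, here is a sketch using SQLite from the standard library. The table name, column names, and helper functions are all hypothetical. A production system would more likely use its existing database and migrations.

```python
import json
import sqlite3
from typing import Any, Dict, List, Optional, Tuple


def init_outbox(conn: sqlite3.Connection) -> None:
  """Create the (hypothetical) outbox table if it does not exist yet."""
  conn.execute(
    """
    CREATE TABLE IF NOT EXISTS invoice_outbox (
      id INTEGER PRIMARY KEY AUTOINCREMENT,
      file_ref TEXT NOT NULL,        -- original file location, e.g. S3 key or blob ID
      leapocr_job_id TEXT NOT NULL,  -- kept for debugging
      payload TEXT NOT NULL,         -- JSON that will be sent to the accounting API
      synced INTEGER NOT NULL DEFAULT 0
    )
    """
  )
  conn.commit()


def enqueue_invoice(
  conn: sqlite3.Connection, file_ref: str, job_id: str, invoice: Dict[str, Any]
) -> Optional[int]:
  """Store a processed invoice as unsynced and return its row id."""
  cursor = conn.execute(
    "INSERT INTO invoice_outbox (file_ref, leapocr_job_id, payload) VALUES (?, ?, ?)",
    (file_ref, job_id, json.dumps(invoice)),
  )
  conn.commit()
  return cursor.lastrowid


def fetch_unsynced(conn: sqlite3.Connection, limit: int = 50) -> List[Tuple[int, Dict[str, Any]]]:
  """Return up to limit unsynced rows, oldest first, as (id, invoice) pairs."""
  rows = conn.execute(
    "SELECT id, payload FROM invoice_outbox WHERE synced = 0 ORDER BY id LIMIT ?",
    (limit,),
  ).fetchall()
  return [(row_id, json.loads(payload)) for row_id, payload in rows]


def mark_synced(conn: sqlite3.Connection, row_id: int) -> None:
  """Flag a row as synced once the accounting API call has succeeded."""
  conn.execute("UPDATE invoice_outbox SET synced = 1 WHERE id = ?", (row_id,))
  conn.commit()
```

A periodic job would call fetch_unsynced, push each payload to the accounting API, and call mark_synced only after that call succeeds, so failed pushes are retried on the next run.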

Step 8: Monitoring, Logging, and Cost Awareness

Running this in production means paying attention to:

  • Logging
    Log:

    • LeapOCR job_id
    • File identifier or URL
    • Vendor name, total, and review status
    • Any validation issues
  • Metrics and alerts
    Track:

    • Number of invoices processed
    • Error rate (OCR failures, validation failures, sync failures)
    • Average processing time per invoice
  • Cost and rate limits
    Watch your usage/credits and:

    • Batch large backfills instead of sending thousands of invoices at once
    • Spread heavy tasks across off-peak hours where it makes sense
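
For the logging items above, a sketch of one structured log line per invoice might look like the following. It assumes the validated dict has the shape returned by validate_invoice in Step 6. The logger name and the build_log_line and log_invoice helpers are hypothetical.

```python
import json
import logging
from typing import Any, Dict

logger = logging.getLogger("invoice_pipeline")  # hypothetical logger name


def build_log_line(job_id: str, file_ref: str, validated: Dict[str, Any]) -> str:
  """Serialize the fields worth logging for one processed invoice as JSON."""
  invoice = validated.get("data") or {}
  return json.dumps(
    {
      "event": "invoice.processed",
      "job_id": job_id,
      "file_ref": file_ref,
      "vendor_name": invoice.get("vendor_name"),
      "total_amount": invoice.get("total_amount"),
      "review_status": validated.get("review_status"),
      "issues": validated.get("issues") or [],
    },
    sort_keys=True,
  )


def log_invoice(job_id: str, file_ref: str, validated: Dict[str, Any]) -> None:
  """Log at WARNING when validation found issues, otherwise at INFO."""
  line = build_log_line(job_id, file_ref, validated)
  if validated.get("issues"):
    logger.warning(line)
  else:
    logger.info(line)
```

Because each line is JSON, most log aggregators can derive the counts and error rates listed above from the review_status and issues fields.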

Step 9: Extending Beyond Invoices

Once you have the invoice pipeline working, extending it is mostly a matter of new schemas or templates.

Nearby use cases:

  • Bills and statements: similar to invoices but with slightly different fields
  • Receipts and expenses: smaller, more varied documents, often from phones
  • Credit notes and refunds: negative invoices that still follow a pattern
  • Purchase orders: upstream of invoices but structurally similar

Each of these can be:

  • A new schema in code
  • A new template in LeapOCR
  • Reusing the same Python processing infrastructure you just built
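
If you go the schema-in-code route, a small registry keyed by document type is one way to keep the processing code unchanged as you add document types. The field names for receipts and credit notes below are invented for illustration, and object_schema and schema_for are hypothetical helpers.

```python
from typing import Any, Dict


def object_schema(fields: Dict[str, str]) -> Dict[str, Any]:
  """Build a flat object schema from a {field_name: json_type} mapping."""
  return {
    "type": "object",
    "properties": {name: {"type": json_type} for name, json_type in fields.items()},
  }


# Illustrative field lists only; derive the real ones from your finance team's needs.
DOCUMENT_SCHEMAS: Dict[str, Dict[str, Any]] = {
  "invoice": object_schema({
    "invoice_number": "string",
    "vendor_name": "string",
    "total_amount": "number",
  }),
  "receipt": object_schema({
    "merchant_name": "string",
    "purchase_date": "string",
    "total_amount": "number",
  }),
  "credit_note": object_schema({
    "credit_note_number": "string",
    "original_invoice_number": "string",
    "total_amount": "number",
  }),
}


def schema_for(document_type: str) -> Dict[str, Any]:
  """Look up the schema for a document type, failing loudly on unknown types."""
  try:
    return DOCUMENT_SCHEMAS[document_type]
  except KeyError:
    raise ValueError(f"No schema registered for document type {document_type!r}") from None
```

The processing function from Step 4 could then take a document_type argument and pass schema_for(document_type) where it currently passes INVOICE_SCHEMA.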

Putting It All Together and Next Steps

You now have all the core pieces of an automated invoice processing system:

  • A clear idea of which fields matter
  • A schema or template that tells LeapOCR what to extract
  • A Python async service that submits invoices, waits for results, and deletes jobs
  • Validation logic to separate easy cases from edge cases
  • A path to sync data into your accounting or ERP tools
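
To show how those pieces might connect, here is a sketch of a single per-invoice handler. It takes the extract, validate, and store steps as arguments, so it makes no LeapOCR calls itself. The name handle_invoice and the store callback are hypothetical.

```python
import asyncio
from pathlib import Path
from typing import Any, Awaitable, Callable, Dict

Extract = Callable[[Path], Awaitable[Dict[str, Any]]]
Validate = Callable[[Dict[str, Any]], Dict[str, Any]]
Store = Callable[[Path, Dict[str, Any]], None]


async def handle_invoice(
  path: Path, extract: Extract, validate: Validate, store: Store
) -> str:
  """Extract, validate, and store one invoice; return its review status."""
  invoice = await extract(path)   # e.g. process_invoice_file from Step 4
  validated = validate(invoice)   # e.g. validate_invoice from Step 6
  store(path, validated)          # hypothetical: write to your invoice table or outbox
  return validated["review_status"]
```

Passing the steps in as callables also makes the handler easy to unit-test with fakes, without an API key or network access.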

From here, good next steps are:

  • Wire this into your actual ingestion path (email, uploads, storage)
  • Harden error handling, logging, and observability
  • Iterate on your schemas/templates as you see more real invoices
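
For the error-handling item, a common starting point is to retry transient failures with exponential backoff. The sketch below is generic asyncio. The exception types in retry_on are placeholders, since the errors the LeapOCR SDK raises are not covered in this guide, and with_retries is a hypothetical helper.

```python
import asyncio
from typing import Awaitable, Callable, Tuple, Type, TypeVar

T = TypeVar("T")


async def with_retries(
  operation: Callable[[], Awaitable[T]],
  attempts: int = 3,
  base_delay: float = 1.0,
  # Placeholder exception types; swap in whatever your client actually raises.
  retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
) -> T:
  """Await operation(), retrying on retry_on with delays of base_delay * 2**n."""
  if attempts < 1:
    raise ValueError("attempts must be at least 1")
  for attempt in range(1, attempts + 1):
    try:
      return await operation()
    except retry_on:
      if attempt == attempts:
        raise  # Out of attempts: surface the last error to the caller.
      await asyncio.sleep(base_delay * 2 ** (attempt - 1))
  raise AssertionError("unreachable")
```

A call might look like await with_retries(lambda: process_invoice_file(path)). Errors outside retry_on, such as a malformed file, propagate immediately instead of being retried.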

For more details on models, options, and advanced patterns, check docs.leapocr.com and the LeapOCR Python SDK examples. If you’re integrating in multiple languages, pair this with the TypeScript SDK guide so your team can speak the same “document data” language everywhere.
