How to Build an Automated Invoice Processing System Using LeapOCR
Invoice processing is one of those tasks that looks simple on paper and quietly eats hours of your team’s time each week. In this guide, we’ll design a straightforward, production-minded workflow that uses the LeapOCR Python SDK to turn incoming invoices into structured data and push them into your existing tools.
We’ll focus on Python 3.9+ and the async-first SDK. If you’d rather use TypeScript, Go, another language, or call the HTTP API directly, you can find language-specific guides and reference docs at LeapOCR docs.
What You’ll Build: From Inbox to Structured Invoice Records
At a high level, we’re building this:
- Invoices arrive as PDFs or images (via email, upload, or object storage)
- A small Python service sends them to LeapOCR
- LeapOCR returns structured JSON (invoice number, dates, totals, vendor, etc.)
- Your service stores that data and syncs it to your accounting or ERP system
Think of this guide as a reference architecture. You can adapt the pieces to FastAPI, Django, Celery, your message queue of choice, or whatever stack you already have.
Architecture Overview: The Core Building Blocks
Let’s break the system into a few small, understandable components:
Ingestion layer
How invoices enter your world: file uploads, an “invoices@company.com” email inbox, a scheduled job that watches an S3 bucket, etc.
Processing service (Python + LeapOCR)
A background worker or API that:
- Receives an invoice file or URL
- Submits it to LeapOCR using the Python SDK
- Waits for processing to finish
- Normalizes and stores the structured result
Data store
A database table or collection that represents “processed invoices” with fields your business cares about.
Integration layer
Code that pushes those processed invoices into tools like QuickBooks, Xero, or internal finance systems.
A simple text diagram:
    Inbox / Upload / Storage
               │
               ▼
    ┌─────────────────────┐
    │  Invoice Processor  │  (Python + LeapOCR SDK)
    └─────────────────────┘
               │
               ▼
        Invoice Database
               │
               ▼
     Accounting / ERP / BI
We’ll focus on the Invoice Processor box and how to connect it to LeapOCR cleanly.
Step 1: Decide What Data You Actually Need From Invoices
Before writing code, decide what “done” means for an invoice.
Talk to the people who currently process invoices and list the fields they actually use. Common ones:
- invoice_number
- vendor_name
- invoice_date
- due_date
- currency
- subtotal
- tax_amount
- total_amount
- purchase_order_number
- line_items (description, quantity, unit_price, line_total)
To make this concrete, here’s a real example invoice we’ll use throughout this guide:

And here’s one way that same document could be represented as structured JSON:
    {
      "seller_info": {
        "name": "MAISON JEAN ÉLION",
        "business_type": "EPICERIE EN GROS",
        "address": "66-68, Place Voltaire - CHATEAUROUX",
        "phone": "89"
      },
      "invoice_details": {
        "invoice_number": "1560",
        "date": "Avril 1920",
        "place_of_issue": "Châteauroux"
      },
      "buyer_info": {
        "name": "Vve LABRUNE",
        "location": "Niherne"
      },
      "line_items": [
        {
          "quantity": 24,
          "description": "1/2 tomate",
          "unit_price": 0.5,
          "total_price": 12
        },
        {
          "quantity": 24,
          "description": "4/4 tomate",
          "unit_price": 1,
          "total_price": 24
        },
        {
          "quantity": 10,
          "description": "Lessive k",
          "unit_price": 0.7,
          "total_price": 7
        },
        {
          "quantity": 1,
          "description": "Boite pains epice",
          "unit_price": null,
          "total_price": 3.4
        },
        {
          "quantity": 3,
          "description": "Tapioca",
          "unit_price": 3.25,
          "total_price": 9.75
        },
        {
          "quantity": 10,
          "description": "Paquets de bougies 6 T",
          "unit_price": 3.4,
          "total_price": 34
        },
        {
          "quantity": null,
          "description": "Sortie",
          "unit_price": null,
          "total_price": 0.1
        }
      ],
      "total_amount": 90.25
    }
In practice you might normalize or flatten this a bit for your internal systems. A simple JSON shape could look like:
    {
      "invoice_number": "INV-10293",
      "vendor_name": "Acme Corp",
      "invoice_date": "2025-11-02",
      "due_date": "2025-11-30",
      "currency": "USD",
      "subtotal": 1200.0,
      "tax_amount": 299.0,
      "total_amount": 1499.0,
      "purchase_order_number": "PO-5541",
      "line_items": [
        {
          "description": "SaaS subscription - November",
          "quantity": 1,
          "unit_price": 1499.0,
          "line_total": 1499.0
        }
      ]
    }
This is what we’ll ultimately want out of LeapOCR for each invoice.
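As a rough illustration of the flattening step, the sketch below maps the nested shape of the 1920 example onto the flat shape. The function name `flatten_invoice` and the mapping choices (for instance, `seller_info.name` becoming `vendor_name`) are assumptions made for this example, not something LeapOCR prescribes.

```python
from typing import Any, Dict, List


def flatten_invoice(nested: Dict[str, Any]) -> Dict[str, Any]:
    """Map the nested example shape onto the flat shape (hypothetical mapping)."""
    seller = nested.get("seller_info") or {}
    details = nested.get("invoice_details") or {}

    line_items: List[Dict[str, Any]] = []
    for item in nested.get("line_items") or []:
        line_items.append(
            {
                "description": item.get("description"),
                "quantity": item.get("quantity"),
                "unit_price": item.get("unit_price"),
                # The nested shape calls this field total_price.
                "line_total": item.get("total_price"),
            }
        )

    return {
        "invoice_number": details.get("invoice_number"),
        "vendor_name": seller.get("name"),
        # Passed through unchanged; date parsing is a separate concern.
        "invoice_date": details.get("date"),
        "total_amount": nested.get("total_amount"),
        "line_items": line_items,
    }


example = {
    "seller_info": {"name": "MAISON JEAN ÉLION"},
    "invoice_details": {"invoice_number": "1560", "date": "Avril 1920"},
    "line_items": [
        {"quantity": 3, "description": "Tapioca", "unit_price": 3.25, "total_price": 9.75}
    ],
    "total_amount": 90.25,
}
flat = flatten_invoice(example)
print(flat["vendor_name"], flat["line_items"][0]["line_total"])
```

Fields the nested document does not carry (due date, currency, PO number) are simply absent here; whether to emit them as null is a choice for your own data model.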
Step 2: Set Up LeapOCR and Choose a Model/Format
First, install the Python SDK:
    pip install leapocr
    # or, with uv:
    uv add leapocr
You’ll also need:
- Python 3.9 or higher
- A LeapOCR API key (LEAPOCR_API_KEY in your environment)
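Because `os.getenv` returns `None` when a variable is missing, it can help to fail early with a clear message instead of passing an empty key along. A minimal sketch, where `require_api_key` is a hypothetical helper and not part of the SDK:

```python
import os


def require_api_key(env_var: str = "LEAPOCR_API_KEY") -> str:
    """Return the API key from the environment or raise a clear error."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set; export it before starting the service")
    return key
```

Calling this once at startup surfaces a misconfigured deployment before the first invoice arrives.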
The SDK is async-first and uses async with for proper cleanup:
    import asyncio
    import os

    from leapocr import LeapOCR, ProcessOptions, Format


    async def quick_smoke_test() -> None:
        async with LeapOCR(os.getenv("LEAPOCR_API_KEY")) as client:
            job = await client.ocr.process_url(
                "https://example.com/invoice.pdf",
                options=ProcessOptions(
                    format=Format.STRUCTURED,
                ),
            )
            result = await client.ocr.wait_until_done(job.job_id)
            print(f"Credits used: {result.credits_used}")
            print(f"Pages processed: {len(result.pages)}")


    asyncio.run(quick_smoke_test())
For invoices, you’ll usually want:
- Format: Format.STRUCTURED (so you get JSON-like results you can store directly)
- Model: a higher-accuracy model such as Model.PRO_V1 or Model.ENGLISH_PRO_V1 for messy, real-world invoices
Full model/format options are documented at LeapOCR docs.
Step 3: Define an Invoice Extraction Schema or Template
Next, we’ll tell LeapOCR what we care about by defining a schema for our invoices.
Here’s a minimal JSON schema that matches the shape we sketched earlier:
    INVOICE_SCHEMA: dict = {
        "type": "object",
        "properties": {
            "invoice_number": {"type": "string"},
            "vendor_name": {"type": "string"},
            "invoice_date": {"type": "string"},
            "due_date": {"type": "string"},
            "currency": {"type": "string"},
            "subtotal": {"type": "number"},
            "tax_amount": {"type": "number"},
            "total_amount": {"type": "number"},
            "purchase_order_number": {"type": "string"},
            "line_items": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "description": {"type": "string"},
                        "quantity": {"type": "number"},
                        "unit_price": {"type": "number"},
                        "line_total": {"type": "number"},
                    },
                },
            },
        },
    }
You can pass this schema directly when you process an invoice file:
    import json
    import os
    from pathlib import Path

    from leapocr import LeapOCR, ProcessOptions, Format, Model


    async def extract_invoice(path: Path) -> dict:
        async with LeapOCR(os.getenv("LEAPOCR_API_KEY")) as client:
            job = await client.ocr.process_file(
                path,
                options=ProcessOptions(
                    format=Format.STRUCTURED,
                    model=Model.ENGLISH_PRO_V1,
                    schema=INVOICE_SCHEMA,
                    # instructions are optional if your schema and field names are descriptive enough
                ),
            )
            result = await client.ocr.wait_until_done(job.job_id)
            # First page usually holds the main invoice data
            first_page = result.pages[0].result
            # For structured format, result is typically a dict already
            if isinstance(first_page, dict):
                return first_page
            # Fallback: parse JSON string if necessary
            return json.loads(first_page)
Alternatively, if you set up a template in LeapOCR (with its own schema, instructions, and model), you can reference it by template_slug instead of sending a schema from code:
    import os
    from pathlib import Path

    from leapocr import LeapOCR, ProcessOptions


    async def extract_invoice_with_template(path: Path) -> dict:
        async with LeapOCR(os.getenv("LEAPOCR_API_KEY")) as client:
            job = await client.ocr.process_file(
                path,
                options=ProcessOptions(
                    template_slug="invoice-extraction",  # configure this in LeapOCR
                ),
            )
            result = await client.ocr.wait_until_done(job.job_id)
            page = result.pages[0].result
            # Returns an empty dict when the page result is not structured data
            return page if isinstance(page, dict) else {}
Schemas vs templates:
- Schemas in code: easier to version-control and test, good when one service owns the behavior.
- Templates in LeapOCR: easier to tweak centrally and reuse across many services or teams.
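One practical benefit of keeping the schema in code is that you can unit-test results against it. The sketch below is a deliberately small checker, not a full JSON Schema validator; `check_top_level_types` is a hypothetical helper, and a trimmed copy of the schema is inlined so the snippet runs on its own.

```python
from typing import Any, Dict, List

# Trimmed copy of the invoice schema so this snippet is self-contained.
SCHEMA: Dict[str, Any] = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
        "line_items": {"type": "array"},
    },
}

# Python types accepted for each JSON Schema type name used here.
_PY_TYPES = {
    "string": (str,),
    "number": (int, float),
    "array": (list,),
    "object": (dict,),
}


def check_top_level_types(data: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
    """Return mismatches between data and the schema's top-level properties."""
    problems: List[str] = []
    for name, spec in schema.get("properties", {}).items():
        if name not in data or data[name] is None:
            continue  # Missing values are left to business validation.
        value = data[name]
        expected = _PY_TYPES[spec["type"]]
        # bool is a subclass of int, so reject it explicitly; none of the
        # types listed above should accept a boolean.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"{name}: expected {spec['type']}, got {type(value).__name__}")
    return problems


print(check_top_level_types({"invoice_number": "A1", "total_amount": "10"}, SCHEMA))
```

For nested checks or stricter rules, a dedicated JSON Schema library is likely a better fit than extending this helper.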
Step 4: Build a Minimal Processing Service in Python
Now let’s wrap this into a small, reusable processing function your app can call.
Here’s a simplified “invoice processor” service function:
    import asyncio
    import json
    import os
    from pathlib import Path
    from typing import Any, Dict

    from leapocr import LeapOCR, ProcessOptions, Format, Model


    async def process_invoice_file(path: Path) -> Dict[str, Any]:
        async with LeapOCR(os.getenv("LEAPOCR_API_KEY")) as client:
            job = await client.ocr.process_file(
                path,
                options=ProcessOptions(
                    format=Format.STRUCTURED,
                    model=Model.ENGLISH_PRO_V1,
                    schema=INVOICE_SCHEMA,
                    instructions="Extract invoice fields and line items for accounting import.",
                ),
            )
            result = await client.ocr.wait_until_done(job.job_id)
            first_page = result.pages[0].result
            if isinstance(first_page, dict):
                invoice = first_page
            else:
                invoice = json.loads(first_page)
            # Optional: delete job immediately instead of waiting 7 days
            await client.ocr.delete_job(job.job_id)
            return invoice


    # For manual testing:
    if __name__ == "__main__":
        asyncio.run(process_invoice_file(Path("invoice.pdf")))
In a real system, this function might be called from:
- A FastAPI route that accepts file uploads
- A Celery / RQ / Dramatiq worker consuming messages from a queue
- A scheduled job that iterates over new files in storage
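To make the scheduled-job option concrete, here is a minimal sketch that scans a directory for PDFs it has not handled yet. The names `process_new_files` and `fake_processor` are hypothetical; the fake stands in for `process_invoice_file` so the example runs without network access, and the in-memory `seen` set would be a database table in a real deployment.

```python
import asyncio
import tempfile
from pathlib import Path
from typing import Any, Awaitable, Callable, Dict, List, Set

Processor = Callable[[Path], Awaitable[Dict[str, Any]]]


async def process_new_files(
    inbox: Path, seen: Set[str], processor: Processor
) -> List[Dict[str, Any]]:
    """Process every PDF in inbox whose name is not in seen yet."""
    results: List[Dict[str, Any]] = []
    for path in sorted(inbox.glob("*.pdf")):
        if path.name in seen:
            continue
        results.append(await processor(path))
        seen.add(path.name)  # Mark as seen only after processing succeeds.
    return results


async def fake_processor(path: Path) -> Dict[str, Any]:
    # Stand-in for process_invoice_file; returns a tiny fake result.
    return {"invoice_number": path.stem}


def demo() -> List[Dict[str, Any]]:
    with tempfile.TemporaryDirectory() as tmp:
        inbox = Path(tmp)
        (inbox / "a.pdf").write_bytes(b"")
        (inbox / "b.pdf").write_bytes(b"")
        # a.pdf was handled on an earlier run, so only b.pdf is processed.
        return asyncio.run(process_new_files(inbox, {"a.pdf"}, fake_processor))


print(demo())
```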
Step 5: Handle Asynchronous Jobs and Scaling
For early prototypes or low volume, calling wait_until_done directly is fine. As volume grows, you’ll want more control.
Simple approach: wait_until_done
The examples above already use:
    result = await client.ocr.wait_until_done(job.job_id)
This internally polls the job until it completes or times out.
Manual polling for more control
If you want to handle progress yourself (or store intermediate status), you can poll manually:
    import asyncio
    import os

    from leapocr import LeapOCR, ProcessOptions, Format


    async def manual_polling() -> None:
        async with LeapOCR(os.getenv("LEAPOCR_API_KEY")) as client:
            job = await client.ocr.process_url(
                "https://example.com/invoice.pdf",
                options=ProcessOptions(format=Format.MARKDOWN),
            )
            print(f"Job created: {job.job_id}")
            # Bound the loop so a job that never completes cannot poll forever.
            for _ in range(150):  # roughly 5 minutes at 2 seconds per poll
                status = await client.ocr.get_job_status(job.job_id)
                print(f"Status: {status.status.value} - {status.progress:.1f}%")
                if status.status.value == "completed":
                    break
                await asyncio.sleep(2)
            else:
                raise TimeoutError(f"Job {job.job_id} did not complete in time")
            result = await client.ocr.get_results(job.job_id)
            print(f"Processing complete: {len(result.pages)} pages")


    asyncio.run(manual_polling())
Progress callbacks for large batches
The Python SDK also supports progress callbacks with PollOptions:
    import os
    from pathlib import Path

    from leapocr import LeapOCR, PollOptions


    async def track_progress_for_large_invoice(path: Path) -> None:
        def on_progress(status) -> None:
            print(
                f"Progress: {status.progress:.1f}% "
                f"({status.processed_pages}/{status.total_pages} pages)"
            )

        async with LeapOCR(os.getenv("LEAPOCR_API_KEY")) as client:
            job = await client.ocr.process_file(path)
            result = await client.ocr.wait_until_done(
                job.job_id,
                poll_options=PollOptions(
                    poll_interval=2.0,
                    max_wait=300.0,
                    on_progress=on_progress,
                ),
            )
            print(f"Done! {len(result.pages)} pages processed.")
As you scale, you’ll typically:
- Move invoice processing off the request/response path
- Queue jobs and have workers call LeapOCR concurrently
- Use progress tracking to give ops/finance visibility into backlogs
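One common way to have workers call LeapOCR concurrently without overwhelming anything is to cap the number of in-flight calls with a semaphore. The sketch below uses only `asyncio`; `process_many` and `fake_processor` are hypothetical names, and the fake stands in for a real LeapOCR call.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List

Processor = Callable[[str], Awaitable[Dict[str, Any]]]


async def process_many(
    items: List[str], processor: Processor, max_concurrent: int = 5
) -> List[Any]:
    """Run processor over items with at most max_concurrent calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(item: str) -> Dict[str, Any]:
        async with semaphore:
            return await processor(item)

    # return_exceptions=True keeps one failed invoice from cancelling the rest;
    # failures come back as exception objects in the result list.
    return await asyncio.gather(*(guarded(i) for i in items), return_exceptions=True)


async def fake_processor(name: str) -> Dict[str, Any]:
    # Stand-in for a real OCR call; fails for one input to show error handling.
    await asyncio.sleep(0)
    if name == "bad.pdf":
        raise ValueError("unreadable file")
    return {"file": name}


results = asyncio.run(
    process_many(["a.pdf", "bad.pdf", "c.pdf"], fake_processor, max_concurrent=2)
)
for item in results:
    print(type(item).__name__, item)
```

The right value for `max_concurrent` depends on your plan's rate limits, which this sketch does not know about.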
Step 6: Validate Results and Flag Exceptions
LeapOCR will get you close, but you still need some business logic to catch weird cases.
At minimum, you’ll want to:
- Check that required fields are present (invoice number, vendor, total)
- Ensure totals and line items roughly reconcile
- Flag obviously bad dates (far in the past/future)
Example of a simple validator:
    from typing import Any, Dict, Literal, TypedDict

    ReviewStatus = Literal["auto_approved", "needs_review"]


    class ValidatedInvoice(TypedDict):
        data: Dict[str, Any]
        review_status: ReviewStatus
        issues: list[str]


    def validate_invoice(invoice: Dict[str, Any]) -> ValidatedInvoice:
        issues: list[str] = []
        if not invoice.get("invoice_number"):
            issues.append("Missing invoice_number")
        if not invoice.get("vendor_name"):
            issues.append("Missing vendor_name")
        total = invoice.get("total_amount")
        if total is None:
            issues.append("Missing total_amount")
        # Very naive reconciliation example
        line_items = invoice.get("line_items") or []
        line_total_sum = sum(
            (item.get("line_total") or 0) for item in line_items if isinstance(item, dict)
        )
        if isinstance(total, (int, float)) and line_total_sum and abs(total - line_total_sum) > 1:
            issues.append("Total does not match sum of line_items (±1 tolerance)")
        status: ReviewStatus = "auto_approved" if not issues else "needs_review"
        return {
            "data": invoice,
            "review_status": status,
            "issues": issues,
        }
In your pipeline, you can:
- Auto-ingest invoices with review_status == "auto_approved"
- Send needs_review invoices to a human queue in your internal UI
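The validator above does not cover the date check from the list of minimum checks. A minimal sketch of one follows; `check_invoice_date` is a hypothetical helper, it assumes dates arrive as ISO `YYYY-MM-DD` strings, and the one-year and thirty-day thresholds are arbitrary placeholders to tune for your business.

```python
from datetime import date, timedelta
from typing import List, Optional


def check_invoice_date(
    value: Optional[str],
    today: date,
    max_age_days: int = 365,
    max_future_days: int = 30,
) -> List[str]:
    """Return issues for a missing, unparseable, or implausible invoice date."""
    if not value:
        return ["Missing invoice_date"]
    try:
        parsed = date.fromisoformat(value)
    except ValueError:
        return [f"Unparseable invoice_date: {value!r}"]
    if parsed < today - timedelta(days=max_age_days):
        return ["invoice_date is too far in the past"]
    if parsed > today + timedelta(days=max_future_days):
        return ["invoice_date is too far in the future"]
    return []


today = date(2025, 11, 15)
print(check_invoice_date("2025-11-02", today))
print(check_invoice_date("Avril 1920", today))
```

Passing `today` in explicitly, rather than calling `date.today()` inside the function, keeps the check easy to unit-test. The returned issues can be appended to the `issues` list that `validate_invoice` builds.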
Step 7: Syncing With Your Accounting or ERP System
Once invoices are parsed and validated, you can push them into your accounting system.
Common patterns:
- Direct API calls: after successful OCR + validation, call QuickBooks/Xero/internal APIs immediately.
- Outbox table + periodic sync: store processed invoices in a DB table with synced=false and have a separate job push them in batches.
- Webhooks/events: emit an event like invoice.processed that other services subscribe to.
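The outbox pattern is the least obvious of the three, so here is a minimal sketch using SQLite from the standard library. The table name `invoice_outbox`, its columns, and the `push` callback are all assumptions for illustration; in practice `push` would wrap your accounting system's API client.

```python
import json
import sqlite3
from typing import Any, Callable, Dict, List


def create_outbox(conn: sqlite3.Connection) -> None:
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS invoice_outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            payload TEXT NOT NULL,
            synced INTEGER NOT NULL DEFAULT 0
        )
        """
    )


def enqueue(conn: sqlite3.Connection, invoice: Dict[str, Any]) -> None:
    conn.execute("INSERT INTO invoice_outbox (payload) VALUES (?)", (json.dumps(invoice),))
    conn.commit()


def sync_pending(
    conn: sqlite3.Connection,
    push: Callable[[Dict[str, Any]], None],
    batch_size: int = 50,
) -> int:
    """Push up to batch_size unsynced invoices, marking each synced after push succeeds."""
    rows = conn.execute(
        "SELECT id, payload FROM invoice_outbox WHERE synced = 0 ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    for row_id, payload in rows:
        push(json.loads(payload))  # An exception here leaves the row unsynced for retry.
        conn.execute("UPDATE invoice_outbox SET synced = 1 WHERE id = ?", (row_id,))
        conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
create_outbox(conn)
enqueue(conn, {"invoice_number": "INV-10293", "total_amount": 1499.0})
pushed: List[Dict[str, Any]] = []
first_run = sync_pending(conn, pushed.append)
second_run = sync_pending(conn, pushed.append)
print(first_run, second_run)
```

Because a row is marked synced only after `push` returns, a crash between the two steps can cause a duplicate push on retry, so the receiving side should tolerate repeats (for example, by keying on invoice number).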
Whichever approach you take, keep a clear audit trail:
- Store a reference to the original file location (e.g., S3 key or blob ID)
- Store the LeapOCR job ID (for debugging if something looks off)
- Keep the structured JSON you sent to the accounting API
That way, if finance asks “why is this number here?”, you can trace it all the way back.
Step 8: Monitoring, Logging, and Cost Awareness
Running this in production means paying attention to:
Logging
Log:
- LeapOCR job_id
- File identifier or URL
- Vendor name, total, and review status
- Any validation issues
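A sketch of what one such log entry could look like, using the standard `logging` module and one JSON object per line. The helper names and the example job ID and file location are hypothetical.

```python
import json
import logging
from typing import Any, Dict, List

logger = logging.getLogger("invoice_processor")


def build_log_record(
    job_id: str,
    source: str,
    invoice: Dict[str, Any],
    review_status: str,
    issues: List[str],
) -> Dict[str, Any]:
    """Collect the fields worth logging for one processed invoice."""
    return {
        "job_id": job_id,
        "source": source,
        "vendor_name": invoice.get("vendor_name"),
        "total_amount": invoice.get("total_amount"),
        "review_status": review_status,
        "issues": issues,
    }


def log_invoice(record: Dict[str, Any]) -> None:
    # One JSON object per line is easy to search and aggregate later.
    logger.info(json.dumps(record, sort_keys=True))


record = build_log_record(
    job_id="job-123",  # hypothetical job ID
    source="s3://invoices/2025/11/acme.pdf",  # hypothetical file location
    invoice={"vendor_name": "Acme Corp", "total_amount": 1499.0},
    review_status="auto_approved",
    issues=[],
)
log_invoice(record)
```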
Metrics and alerts
Track:
- Number of invoices processed
- Error rate (OCR failures, validation failures, sync failures)
- Average processing time per invoice
Cost and rate limits
Watch your usage/credits and:
- Batch large backfills instead of sending thousands of invoices at once
- Spread heavy tasks across off-peak hours where it makes sense
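Batching a backfill can be as simple as slicing the file list and pausing between slices. In this sketch `chunked`, `run_backfill`, and `fake_submit` are hypothetical names, and the batch size and pause are placeholders rather than values derived from LeapOCR's actual limits.

```python
import asyncio
from typing import Awaitable, Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of items with at most size elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


async def run_backfill(
    items: Sequence[str],
    submit_batch: Callable[[List[str]], Awaitable[None]],
    batch_size: int = 100,
    pause_seconds: float = 5.0,
) -> int:
    """Submit items batch by batch, pausing between batches; return the batch count."""
    batches = 0
    for batch in chunked(items, batch_size):
        await submit_batch(batch)
        batches += 1
        await asyncio.sleep(pause_seconds)  # Breathing room between batches.
    return batches


submitted: List[List[str]] = []


async def fake_submit(batch: List[str]) -> None:
    # Stand-in for code that sends one batch of files for processing.
    submitted.append(batch)


batch_count = asyncio.run(
    run_backfill(["a", "b", "c", "d", "e"], fake_submit, batch_size=2, pause_seconds=0)
)
print(batch_count, submitted)
```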
Step 9: Extending Beyond Invoices
Once you have the invoice pipeline working, extending it is mostly a matter of new schemas or templates.
Nearby use cases:
- Bills and statements: similar to invoices but with slightly different fields
- Receipts and expenses: smaller, more varied documents, often from phones
- Credit notes and refunds: negative invoices that still follow a pattern
- Purchase orders: upstream of invoices but structurally similar
Each of these can be:
- A new schema in code
- A new template in LeapOCR
- Reusing the same Python processing infrastructure you just built
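If you take the schema-in-code route, a small registry keyed by document type lets the same processing function serve every case. The receipt fields below are invented for illustration, and `schema_for` is a hypothetical helper.

```python
from typing import Any, Dict

# Trimmed schemas; the receipt fields are illustrative assumptions.
INVOICE_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
    },
}

RECEIPT_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "properties": {
        "merchant_name": {"type": "string"},
        "total_amount": {"type": "number"},
    },
}

SCHEMAS: Dict[str, Dict[str, Any]] = {
    "invoice": INVOICE_SCHEMA,
    "receipt": RECEIPT_SCHEMA,
}


def schema_for(document_type: str) -> Dict[str, Any]:
    """Look up the extraction schema for a document type, failing loudly if unknown."""
    try:
        return SCHEMAS[document_type]
    except KeyError:
        raise ValueError(f"No schema registered for document type {document_type!r}") from None


print(sorted(schema_for("receipt")["properties"]))
```

The processing function can then take a `document_type` argument and pass `schema_for(document_type)` wherever it currently passes the invoice schema.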
Putting It All Together and Next Steps
You now have all the core pieces of an automated invoice processing system:
- A clear idea of which fields matter
- A schema or template that tells LeapOCR what to extract
- A Python async service that submits invoices, waits for results, and deletes jobs
- Validation logic to separate easy cases from edge cases
- A path to sync data into your accounting or ERP tools
From here, good next steps are:
- Wire this into your actual ingestion path (email, uploads, storage)
- Harden error handling, logging, and observability
- Iterate on your schemas/templates as you see more real invoices
For more details on models, options, and advanced patterns, check docs.leapocr.com and the LeapOCR Python SDK examples. If you’re integrating in multiple languages, pair this with the TypeScript SDK guide so your team can speak the same “document data” language everywhere.