Automating Logistics: Extracting Data from Bills of Lading and Shipping Manifests
How to use AI to untangle the messy, paper-heavy world of global supply chain documentation.
Global trade still runs on paper. Freight forwarders, 3PLs, and shippers spend their days swapping PDFs—bills of lading (BOLs), manifests, packing lists, EDIFACT guides. Then someone manually types container numbers, HS codes, weights, and references into TMS/ERP systems. The work is slow, mistakes happen, and the costs add up.
This guide shows how to extract reliable structured data from those documents using layout-aware OCR. LeapOCR accepts files via URL or direct upload and supports over 100 formats, including documents (PDF, DOCX, DOC, ODT, RTF, TXT), images (PNG, JPG, WEBP, TIFF, GIF, BMP, HEIC), spreadsheets (XLSX, XLS, CSV, ODS), presentations (PPTX, PPT, ODP), and more. PDFs and Word documents are rasterized first (embedded text layers are ignored), while image formats are processed directly.
Why Logistics Documents Challenge Generic OCR
Standard OCR tools stumble on logistics paperwork for several reasons:
- Dense multi-page tables with small fonts and merged cells
- Stamps, signatures, and handwriting covering key fields
- Multi-column layouts and footers that disrupt reading order
- Template variations across carriers—columns and headers shift constantly
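When you rely on basic text extraction plus regex patterns, a single column change breaks everything. Layout-aware table extraction handles these variations better because it reads each value in the context of its row, column, and header rather than its position in a stream of text.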
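To make the failure mode concrete, here is a minimal sketch. The sample rows, the `parseByPosition` regex parser, and the `parseByHeader` helper are all hypothetical illustrations, not LeapOCR APIs. The point is only that a position-based pattern keeps "working" when a carrier swaps two columns, while returning the wrong field.

```typescript
// Hypothetical container rows as a text-only OCR pass might emit them.
// Carrier B prints the same data but swaps the seal and type columns.
const carrierARow = "MSCU1234567  SL00123  40HC  120  18500";
const carrierBRow = "MSCU1234567  40HC  SL00123  120  18500";

interface ContainerRow {
  id: string;
  seal: string;
  type: string;
  packages: number;
  weightKg: number;
}

// Position-based parsing: the column order is baked into the regex.
function parseByPosition(line: string): ContainerRow | null {
  const m = /^(\S+)\s+(\S+)\s+(\S+)\s+(\d+)\s+(\d+)$/.exec(line);
  if (!m) return null;
  const [, id = "", seal = "", type = "", packages = "", weight = ""] = m;
  return { id, seal, type, packages: Number(packages), weightKg: Number(weight) };
}

// Header-keyed parsing: each cell is named by the header above it,
// so a reordered table still maps to the right fields.
function parseByHeader(headers: string[], cells: string[]): Record<string, string> {
  const row: Record<string, string> = {};
  headers.forEach((header, i) => {
    row[header.toLowerCase()] = cells[i] ?? "";
  });
  return row;
}

const wrong = parseByPosition(carrierBRow);
const right = parseByHeader(
  ["ID", "Type", "Seal", "Packages", "Weight_kg"],
  carrierBRow.split(/\s+/),
);

console.log(wrong?.seal); // "40HC": the container type lands in the seal field, with no error
console.log(right["seal"]); // "SL00123"
```

The header-keyed version still assumes something has already recovered the table's headers and cell boundaries, which is the part layout-aware extraction is meant to provide.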
What Data You Actually Need From BOLs and Manifests
Most teams manually re-key these fields:
- Parties: shipper, consignee, notify party
- Routing: ports, vessel/voyage, ETD/ETA
- Containers: IDs, seal numbers, type/size, counts
- Line items: SKU/description, packages, weights/measure, HS codes
- References: booking numbers, customer references, Incoterms
These map naturally to a structured schema:
FIG 1.0 — Mapping unstructured BOLs to structured entities
{
  "shipper": { "name": "string", "address": "string" },
  "consignee": { "name": "string", "address": "string" },
  "routing": {
    "port_of_loading": "string",
    "port_of_discharge": "string",
    "vessel": "string",
    "voyage": "string"
  },
  "containers": [
    {
      "id": "string",
      "seal": "string",
      "type": "string",
      "packages": "number",
      "weight_kg": "number"
    }
  ],
  "line_items": [
    {
      "description": "string",
      "hs_code": "string",
      "packages": "number",
      "weight_kg": "number"
    }
  ],
  "references": {
    "booking": "string",
    "customer": "string",
    "incoterms": "string"
  }
}
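If your downstream code is in TypeScript, it can help to mirror the schema as types so the compiler catches missing or misnamed fields. The sketch below is one possible mapping; the interface names and every value in `exampleBol` are invented for illustration and are not output from a real document.

```typescript
interface Party {
  name: string;
  address: string;
}

interface Routing {
  port_of_loading: string;
  port_of_discharge: string;
  vessel: string;
  voyage: string;
}

interface Container {
  id: string;
  seal: string;
  type: string;
  packages: number;
  weight_kg: number;
}

interface LineItem {
  description: string;
  hs_code: string; // kept as a string so leading zeros survive
  packages: number;
  weight_kg: number;
}

interface References {
  booking: string;
  customer: string;
  incoterms: string;
}

interface BillOfLading {
  shipper: Party;
  consignee: Party;
  routing: Routing;
  containers: Container[];
  line_items: LineItem[];
  references: References;
}

// Hypothetical payload shaped like the schema above.
const exampleBol: BillOfLading = {
  shipper: { name: "Example Exports Ltd", address: "1 Harbour Road" },
  consignee: { name: "Sample Imports Inc", address: "200 Dock Street" },
  routing: {
    port_of_loading: "Shanghai",
    port_of_discharge: "Rotterdam",
    vessel: "EXAMPLE VESSEL",
    voyage: "001W",
  },
  containers: [
    { id: "CSQU3054383", seal: "SL00123", type: "40HC", packages: 120, weight_kg: 18500 },
  ],
  line_items: [
    { description: "Leather footwear", hs_code: "640399", packages: 120, weight_kg: 18500 },
  ],
  references: { booking: "BK-001", customer: "PO-42", incoterms: "FOB" },
};

console.log(JSON.stringify(exampleBol.routing));
```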
How LeapOCR Handles Logistics Documents
LeapOCR takes a different approach:
- Layout-aware extraction: Preserves columns, table boundaries, and reading order instead of flattening everything to raw text
- Schema-first design: Define the fields you need (schema alone, or schema + instructions). LeapOCR returns structured JSON, not text blobs
- Handles messy documents: Table extraction and structured output still work when pages carry stamps, handwriting, or tables that continue across pages
- Automatic cleanup: Jobs are deleted automatically after 7 days by default, or you can delete them sooner
Visual Example: Bill of Lading Pages
The bill of lading sample below shows what makes these documents challenging:
FIG 3.0 — Preserving complex table structures in logistics documents
Look at the dense container and line-item tables. That’s where layout-aware extraction makes the difference—maintaining table structure instead of treating everything as unstructured text.
Sample Request (TypeScript)
import { LeapOCR } from "leapocr";

const client = new LeapOCR({ apiKey: process.env.LEAPOCR_API_KEY });

// Submit the document by URL together with the fields we want back.
const job = await client.ocr.processURL("https://your-bucket.example.com/bill-of-lading.pdf", {
  format: "structured",
  model: "pro-v1", // use pro for noisy/stamped docs
  schema: {
    shipper: { name: "string" },
    consignee: { name: "string" },
    routing: {
      port_of_loading: "string",
      port_of_discharge: "string",
      vessel: "string",
      voyage: "string",
    },
    containers: [
      {
        id: "string",
        seal: "string",
        type: "string",
        packages: "number",
        weight_kg: "number",
      },
    ],
    line_items: [
      {
        description: "string",
        hs_code: "string",
        packages: "number",
        weight_kg: "number",
      },
    ],
    references: {
      booking: "string",
      customer: "string",
      incoterms: "string",
    },
  },
  instructions: "Return numbers as numbers; keep HS codes as strings.",
});

// Poll until the job finishes, then fetch the structured result.
// Optional: configure webhooks (Growth+ plans) to avoid polling in production flows
await client.ocr.waitUntilDone(job.jobId);
const result = await client.ocr.getJobResult(job.jobId);
console.log(result.output);

// Delete the job once the result is in hand instead of waiting for automatic cleanup.
await client.ocr.deleteJob(job.jobId);
Validation and Guardrails
Once you have structured data, add validation to catch issues:
- Totals verification: Sum of line-item weights should match container totals
- Presence checks: HS codes exist for all items; containers have seals
- Routing completeness: Ports and vessel/voyage are populated; ETD/ETA present when available
- Exception handling: Route suspect documents to a review queue instead of auto-posting to your TMS
- Async delivery: For production workflows, enable webhooks (Growth+ plans) to receive completion callbacks instead of polling
FIG 4.0 — Data integrity gates preventing bad data from entering TMS
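The checks above can be sketched as a single function that returns a list of issues; an empty list means the document can be posted, anything else goes to review. This is an illustration under assumptions: `validateBol`, `hasValidCheckDigit`, the 1 kg tolerance, and the sample data are all hypothetical. The container check digit test is an extra guardrail not listed above; it follows the ISO 6346 scheme (letters valued from A=10 upward, skipping multiples of 11; positions weighted by powers of two; sum mod 11, then mod 10).

```typescript
interface Container { id: string; seal: string; type: string; packages: number; weight_kg: number }
interface LineItem { description: string; hs_code: string; packages: number; weight_kg: number }
interface Routing { port_of_loading: string; port_of_discharge: string; vessel: string; voyage: string }
interface BolPayload { routing: Routing; containers: Container[]; line_items: LineItem[] }

// ISO 6346 letter values: A=10, then counting up while skipping 11, 22, 33.
function letterValue(letter: string): number {
  let value = 10;
  for (let code = 65; code < letter.charCodeAt(0); code++) {
    value++;
    if (value % 11 === 0) value++;
  }
  return value;
}

// True when the 11th character matches the check digit computed from the first 10.
function hasValidCheckDigit(containerId: string): boolean {
  const id = containerId.trim().toUpperCase();
  if (!/^[A-Z]{4}\d{7}$/.test(id)) return false;
  let sum = 0;
  for (let i = 0; i < 10; i++) {
    const ch = id.charAt(i);
    const value = i < 4 ? letterValue(ch) : Number(ch);
    sum += value * 2 ** i;
  }
  return (sum % 11) % 10 === Number(id.charAt(10));
}

function validateBol(bol: BolPayload, toleranceKg = 1): string[] {
  const issues: string[] = [];
  const sum = (values: number[]) => values.reduce((a, b) => a + b, 0);

  // Totals verification: line-item weights should add up to container weights.
  const containerKg = sum(bol.containers.map((c) => c.weight_kg));
  const itemKg = sum(bol.line_items.map((item) => item.weight_kg));
  if (Math.abs(containerKg - itemKg) > toleranceKg) {
    issues.push(`weight mismatch: containers ${containerKg} kg vs line items ${itemKg} kg`);
  }

  // Presence checks: HS codes on every item, seals on every container.
  bol.line_items.forEach((item, i) => {
    if (!item.hs_code.trim()) issues.push(`line item ${i + 1}: missing HS code`);
  });
  bol.containers.forEach((c) => {
    if (!c.seal.trim()) issues.push(`container ${c.id}: missing seal`);
    if (!hasValidCheckDigit(c.id)) issues.push(`container ${c.id}: failed check digit`);
  });

  // Routing completeness.
  const routingFields: (keyof Routing)[] = ["port_of_loading", "port_of_discharge", "vessel", "voyage"];
  for (const field of routingFields) {
    if (!bol.routing[field].trim()) issues.push(`routing.${field} is empty`);
  }
  return issues;
}

// Hypothetical sample: one container, one line item, everything consistent.
const sampleBol: BolPayload = {
  routing: { port_of_loading: "Shanghai", port_of_discharge: "Rotterdam", vessel: "EXAMPLE VESSEL", voyage: "001W" },
  containers: [{ id: "CSQU3054383", seal: "SL00123", type: "40HC", packages: 120, weight_kg: 18500 }],
  line_items: [{ description: "Leather footwear", hs_code: "640399", packages: 120, weight_kg: 18500 }],
};

const issues = validateBol(sampleBol);
console.log(issues.length === 0 ? "post to TMS" : `send to review: ${issues.join("; ")}`);
```

Returning a list of issues rather than throwing on the first failure gives a reviewer the full picture of what looked wrong in one pass.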
Case Study Flow (BOL → JSON → TMS)
Here’s a typical automation pipeline:
FIG 2.0 — End-to-end logistics automation workflow
- Ingest PDF from an S3 bucket or email dropbox
- Call LeapOCR with the schema defined above
- Validate totals and required fields
- Post the structured payload into your TMS/ERP (containers, line items, references)
- Store the original PDF alongside the structured JSON for audit trails
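The steps above can be wired together roughly as follows. Every name here (`PipelineDeps`, `processDocument`, the stub functions) is hypothetical; storage, extraction, and the TMS client are passed in as plain functions so the sketch does not assume any particular SDK or TMS API. One deliberate deviation from the list order: the extracted JSON is archived before the routing decision, so documents sent to review also leave an audit trail.

```typescript
interface PipelineDeps<T> {
  getDocumentUrl(key: string): Promise<string>; // e.g. a signed URL for the stored PDF
  extract(url: string): Promise<T>; // e.g. a wrapper around the OCR call
  validate(payload: T): string[]; // empty array means "clean"
  postToTms(payload: T): Promise<void>;
  sendToReview(payload: T, issues: string[]): Promise<void>;
  archive(key: string, payload: T): Promise<void>; // store JSON next to the original PDF
}

type Outcome = { status: "posted" } | { status: "review"; issues: string[] };

async function processDocument<T>(key: string, deps: PipelineDeps<T>): Promise<Outcome> {
  const url = await deps.getDocumentUrl(key);
  const payload = await deps.extract(url);
  const issues = deps.validate(payload);

  // Archive before routing so both outcomes are auditable.
  await deps.archive(key, payload);

  if (issues.length > 0) {
    await deps.sendToReview(payload, issues);
    return { status: "review", issues };
  }
  await deps.postToTms(payload);
  return { status: "posted" };
}

// In-memory stubs, for illustration only.
const posted: string[] = [];
const stubDeps: PipelineDeps<{ booking: string }> = {
  getDocumentUrl: async (key) => `https://example.invalid/${key}`,
  extract: async () => ({ booking: "BK-001" }),
  validate: (payload) => (payload.booking ? [] : ["missing booking reference"]),
  postToTms: async (payload) => {
    posted.push(payload.booking);
  },
  sendToReview: async () => {},
  archive: async () => {},
};

processDocument("inbox/bol-001.pdf", stubDeps).then((outcome) => console.log(outcome.status));
```

Injecting the dependencies keeps the control flow testable without network access and makes it straightforward to swap polling for webhook-driven completion later.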
Extending Beyond BOLs
The same approach works for other logistics documents:
- Shipping manifests: Similar schema with more line items
- Packing lists: Use the same table extraction with different field definitions
- EDIFACT mapping: Map the returned JSON to EDIFACT segments (containers → EQD/SEL, line items → GID/FTX)
- Reference example: /assets/blog/pdf-images/edifact-guide-20160630-001.png,-002.png
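As a rough illustration of the EDIFACT mapping, the sketch below turns extracted containers and line items into EQD/SEL and GID/FTX segment strings. Treat it as a starting point under assumptions: the helper names are hypothetical, the element positions reflect a plain reading of the UN/EDIFACT segment directory (`CN` as the container qualifier, `AAA` as the goods-description text qualifier), and real messages must follow your trading partner's implementation guide. Note also that an extracted type such as "40HC" is not necessarily the ISO size/type code a partner expects in EQD.

```typescript
interface EdiContainer { id: string; seal: string; type: string }
interface EdiLineItem { description: string; packages: number }

// Escape EDIFACT service characters (+ : ' ?) with the release character "?".
const escapeEdifact = (text: string): string => text.replace(/[?+:']/g, "?$&");

function containerSegments(c: EdiContainer): string[] {
  return [
    `EQD+CN+${escapeEdifact(c.id)}+${escapeEdifact(c.type)}'`, // equipment details
    `SEL+${escapeEdifact(c.seal)}'`, // seal number
  ];
}

function lineItemSegments(item: EdiLineItem, index: number): string[] {
  return [
    `GID+${index + 1}+${item.packages}'`, // goods item number and package count
    `FTX+AAA+++${escapeEdifact(item.description)}'`, // free-text goods description
  ];
}

// Hypothetical input values.
const segments = [
  ...containerSegments({ id: "CSQU3054383", seal: "SL00123", type: "45G1" }),
  ...lineItemSegments({ description: "Men's shoes", packages: 10 }, 0),
];

console.log(segments.join("\n"));
```

Escaping matters in practice: goods descriptions routinely contain apostrophes and plus signs, and an unescaped one terminates the segment early.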
Takeaways
- Bills of lading contain dense tables and messy markings—generic OCR can’t maintain structure
- A schema-first, layout-aware approach converts BOLs into reliable JSON output
- Validation gates keep bad data out of your TMS, and automatic job deletion limits how long documents are retained
Try the approach with your own BOL PDFs and feed the validated JSON into your TMS/ERP. If you need help with a tailored schema or want more sample templates, we can provide those.