
Automating Logistics: Extracting Data from Bills of Lading and Shipping Manifests

How to use AI to untangle the messy, paper-heavy world of global supply chain documentation.

logistics supply-chain ocr automation case-study
Published
December 8, 2025



Global trade still runs on paper. Freight forwarders, 3PLs, and shippers spend their days swapping PDFs—bills of lading (BOLs), manifests, packing lists, EDIFACT guides. Then someone manually types container numbers, HS codes, weights, and references into TMS/ERP systems. The work is slow, mistakes happen, and the costs add up.

This guide shows how to extract reliable structured data from those documents using layout-aware OCR. LeapOCR accepts files via URL or direct upload, supporting over 100 formats including documents (PDF, DOCX, DOC, ODT, RTF, TXT), images (PNG, JPG, WEBP, TIFF, GIF, BMP, HEIC), spreadsheets (XLSX, XLS, CSV, ODS), presentations (PPTX, PPT, ODP), and more. PDFs and Word documents are rasterized (embedded text layers are ignored), while image formats process directly.

Why Logistics Documents Challenge Generic OCR

Standard OCR tools stumble on logistics paperwork for several reasons:

  • Dense multi-page tables with small fonts and merged cells
  • Stamps, signatures, and handwriting covering key fields
  • Multi-column layouts and footers that disrupt reading order
  • Template variations across carriers—columns and headers shift constantly

When you rely on basic text extraction plus regex patterns, a single column change breaks everything. Layout-aware table extraction handles these variations better.
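To see why position-based parsing is fragile, consider a minimal sketch (the header strings and sample lines are hypothetical, not from a real carrier template). A regex anchored to one carrier's field label works until the next carrier words the header differently:

```typescript
// Hypothetical example: a regex tied to one carrier's header wording.
// Carrier A labels the field "Container No."; Carrier B uses "Cntr/Seal No.".
const carrierA = "Container No. MSCU1234567  Seal 884512";
const carrierB = "Cntr/Seal No. MSCU1234567 / 884512";

const containerField = /Container No\.\s+([A-Z]{4}\d{7})/;

console.log(containerField.exec(carrierA)?.[1]); // "MSCU1234567"
console.log(containerField.exec(carrierB)?.[1]); // undefined -- the template changed

// Matching the ISO 6346 container-number shape (4 letters + 7 digits) directly
// is more robust, but without layout context it still cannot tell a container
// ID apart from any other code that happens to share the pattern.
const iso6346 = /[A-Z]{4}\d{7}/g;
console.log(carrierB.match(iso6346)); // ["MSCU1234567"]
```

Layout-aware extraction sidesteps this class of breakage by reading the table structure rather than pattern-matching flattened text.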

What Data You Actually Need From BOLs and Manifests

Most teams manually re-key these fields:

  • Parties: shipper, consignee, notify party
  • Routing: ports, vessel/voyage, ETD/ETA
  • Containers: IDs, seal numbers, type/size, counts
  • Line items: SKU/description, packages, weights/measure, HS codes
  • References: booking numbers, customer references, Incoterms

These map naturally to a structured schema:

Data mapping schema showing relationship between bills of lading, routing, containers, and line items FIG 1.0 — Mapping unstructured BOLs to structured entities

{
  "shipper": { "name": "string", "address": "string" },
  "consignee": { "name": "string", "address": "string" },
  "routing": {
    "port_of_loading": "string",
    "port_of_discharge": "string",
    "vessel": "string",
    "voyage": "string"
  },
  "containers": [
    {
      "id": "string",
      "seal": "string",
      "type": "string",
      "packages": "number",
      "weight_kg": "number"
    }
  ],
  "line_items": [
    {
      "description": "string",
      "hs_code": "string",
      "packages": "number",
      "weight_kg": "number"
    }
  ],
  "references": {
    "booking": "string",
    "customer": "string",
    "incoterms": "string"
  }
}
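If you consume the payload in TypeScript, mirroring the schema as interfaces keeps downstream code type-safe. This is a sketch that follows the field names above; the interface names themselves are our own:

```typescript
// Type-level mirror of the extraction schema above.
interface Party { name: string; address: string; }

interface Routing {
  port_of_loading: string;
  port_of_discharge: string;
  vessel: string;
  voyage: string;
}

interface ContainerLine {
  id: string;
  seal: string;
  type: string;
  packages: number;
  weight_kg: number;
}

interface LineItem {
  description: string;
  hs_code: string; // keep as string -- HS codes can carry leading zeros
  packages: number;
  weight_kg: number;
}

interface BillOfLading {
  shipper: Party;
  consignee: Party;
  routing: Routing;
  containers: ContainerLine[];
  line_items: LineItem[];
  references: { booking: string; customer: string; incoterms: string };
}
```

Typing the payload once means every consumer (validation, TMS posting, archiving) gets compile-time checks instead of discovering a missing field at runtime.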

How LeapOCR Handles Logistics Documents

LeapOCR takes a different approach:

  • Layout-aware extraction: Preserves columns, table boundaries, and reading order instead of flattening everything to raw text
  • Schema-first design: Define the fields you need (schema alone, or schema + instructions). LeapOCR returns structured JSON, not text blobs
  • Handles messy documents: Stamps, handwriting, and dense multi-page tables are handled through table extraction and structured output
  • Automatic cleanup: Jobs are deleted after 7 days by default, or you can delete them sooner

Visual Example: Bill of Lading Pages

The bill of lading sample below shows what makes these documents challenging:

Layout aware extraction visual showing complex table handling FIG 3.0 — Preserving complex table structures in logistics documents

Look at the dense container and line-item tables. That’s where layout-aware extraction makes the difference—maintaining table structure instead of treating everything as unstructured text.

Sample Request (TypeScript)

import { LeapOCR } from "leapocr";

const client = new LeapOCR({ apiKey: process.env.LEAPOCR_API_KEY });

const job = await client.ocr.processURL("https://your-bucket.example.com/bill-of-lading.pdf", {
  format: "structured",
  model: "pro-v1", // use pro for noisy/stamped docs
  schema: {
    shipper: { name: "string" },
    consignee: { name: "string" },
    routing: {
      port_of_loading: "string",
      port_of_discharge: "string",
      vessel: "string",
      voyage: "string",
    },
    containers: [
      {
        id: "string",
        seal: "string",
        type: "string",
        packages: "number",
        weight_kg: "number",
      },
    ],
    line_items: [
      {
        description: "string",
        hs_code: "string",
        packages: "number",
        weight_kg: "number",
      },
    ],
    references: {
      booking: "string",
      customer: "string",
      incoterms: "string",
    },
  },
  instructions: "Return numbers as numbers; keep HS codes as strings.",
});

// Optional: configure webhooks (Growth+ plans) to avoid polling in production flows
await client.ocr.waitUntilDone(job.jobId);
const result = await client.ocr.getJobResult(job.jobId);
await client.ocr.deleteJob(job.jobId); // clean up ahead of the 7-day auto-delete

console.log(result.output);

Validation and Guardrails

Once you have structured data, add validation to catch issues:

  • Totals verification: Sum of line-item weights should match container totals
  • Presence checks: HS codes exist for all items; containers have seals
  • Routing completeness: Ports and vessel/voyage are populated; ETD/ETA present when available
  • Exception handling: Route suspect documents to a review queue instead of auto-posting to your TMS
  • Async delivery: For production workflows, enable webhooks (Growth+ plans) to receive completion callbacks instead of polling
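A minimal validation pass over the returned JSON might look like the sketch below. Field names follow the schema defined earlier; the weight tolerance and issue messages are illustrative choices, not fixed rules:

```typescript
interface ValidationIssue { field: string; message: string; }

// Validates a parsed BOL payload shaped like the schema above.
function validateBol(bol: {
  containers: { id: string; seal: string; weight_kg: number }[];
  line_items: { hs_code: string; weight_kg: number }[];
  routing: { port_of_loading: string; port_of_discharge: string; vessel: string };
}): ValidationIssue[] {
  const issues: ValidationIssue[] = [];

  // Totals verification: line-item weights should match container totals,
  // within a small tolerance for rounding.
  const containerTotal = bol.containers.reduce((sum, c) => sum + c.weight_kg, 0);
  const lineTotal = bol.line_items.reduce((sum, l) => sum + l.weight_kg, 0);
  if (Math.abs(containerTotal - lineTotal) > 0.5) {
    issues.push({
      field: "weight_kg",
      message: `container total ${containerTotal} != line-item total ${lineTotal}`,
    });
  }

  // Presence checks: HS codes on every item, seals on every container.
  bol.line_items.forEach((l, i) => {
    if (!l.hs_code) issues.push({ field: `line_items[${i}].hs_code`, message: "missing HS code" });
  });
  bol.containers.forEach((c, i) => {
    if (!c.seal) issues.push({ field: `containers[${i}].seal`, message: "missing seal" });
  });

  // Routing completeness.
  for (const key of ["port_of_loading", "port_of_discharge", "vessel"] as const) {
    if (!bol.routing[key]) issues.push({ field: `routing.${key}`, message: "missing" });
  }

  return issues; // non-empty => route to a review queue instead of auto-posting
}
```

A non-empty issue list is the signal to divert the document to human review rather than post it to the TMS.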

Validation guardrails showing pass/fail logic for data integrity FIG 4.0 — Data integrity gates preventing bad data from entering TMS

Case Study Flow (BOL → JSON → TMS)

Here’s a typical automation pipeline:

Process pipeline diagram from PDF ingestion to TMS entry FIG 2.0 — End-to-end logistics automation workflow

  1. Ingest PDF from an S3 bucket or email dropbox
  2. Call LeapOCR with the schema defined above
  3. Validate totals and required fields
  4. Post the structured payload into your TMS/ERP (containers, line items, references)
  5. Store the original PDF alongside the structured JSON for audit trails
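The steps above can be sketched as a single handler. The extraction, validation, TMS, and archive functions are injected here so the flow can be exercised without live calls; all of the dependency names are placeholders for your own integrations:

```typescript
// Hypothetical pipeline wiring: each step from the list above is a dependency,
// so the control flow (post vs. divert to review) can be tested in isolation.
interface PipelineDeps {
  extract: (pdfUrl: string) => Promise<unknown>;               // steps 1-2: ingest + LeapOCR call
  validate: (output: unknown) => string[];                     // step 3: returns a list of issues
  postToTms: (output: unknown) => Promise<void>;               // step 4
  archive: (pdfUrl: string, output: unknown) => Promise<void>; // step 5: audit trail
  review: (pdfUrl: string, issues: string[]) => Promise<void>; // exception path
}

async function processBol(pdfUrl: string, deps: PipelineDeps): Promise<"posted" | "review"> {
  const output = await deps.extract(pdfUrl);

  const issues = deps.validate(output);
  if (issues.length > 0) {
    await deps.review(pdfUrl, issues);
    return "review"; // suspect documents never reach the TMS
  }

  await deps.postToTms(output);
  await deps.archive(pdfUrl, output);
  return "posted";
}
```

In production, `extract` would wrap the LeapOCR call from the sample request above, and `review` would feed whatever queue your operations team already works from.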

Extending Beyond BOLs

The same approach works for other logistics documents:

  • Shipping manifests: Similar schema with more line items
  • Packing lists: Use the same table extraction with different field definitions
  • EDIFACT mapping: Map the returned JSON to EDIFACT segments (containers → EQD/SEL, line items → GID/FTX)
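The container-to-segment mapping can be sketched as a small function. Segment layouts vary by message type (IFTMIN, COARRI, and so on) and by partner implementation guide, so treat the composite positions here as illustrative only:

```typescript
// Sketch: map extracted containers to EDIFACT EQD/SEL segments.
// Composite positions are simplified; consult your partner's guide for the
// exact layout your message type requires.
interface Container { id: string; seal: string; type: string; }

function toEdifactSegments(containers: Container[]): string[] {
  return containers.flatMap((c) => [
    `EQD+CN+${c.id}+${c.type}'`, // CN = container, then ID and ISO size/type code
    `SEL+${c.seal}'`,            // seal number
  ]);
}

console.log(toEdifactSegments([{ id: "MSCU1234567", seal: "884512", type: "45G1" }]));
// → ["EQD+CN+MSCU1234567+45G1'", "SEL+884512'"]
```

Because the OCR output is already structured JSON, this mapping stays a pure transformation with no parsing of its own.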

Takeaways

  • Bills of lading contain dense tables and messy markings—generic OCR can’t maintain structure
  • A schema-first, layout-aware approach converts BOLs into reliable JSON output
  • Built-in validation and automatic deletion give you control over the automation process

Try the approach with your own BOL PDFs and plug the JSON directly into your TMS/ERP. If you need help with a tailored schema or want more sample templates, we can add those.

Try LeapOCR on your own documents

Start with 100 free credits and see how your workflow holds up on real files.

Eligible paid plans include a 3-day trial with 100 credits after you add a credit card, so you can test actual PDFs, scans, and forms before committing to a rollout.
