
Automating Logistics: Extracting Data from Bills of Lading and Shipping Manifests

Global trade still runs on paper. Freight forwarders, 3PLs, and shippers spend their days swapping PDFs—bills of lading (BOLs), manifests, packing lists, EDIFACT guides. Then someone manually types container numbers, HS codes, weights, and references into TMS/ERP systems. The work is slow, mistakes happen, and the costs add up.

This guide shows how to extract reliable structured data from those documents using layout-aware OCR. LeapOCR accepts files via URL or direct upload and supports over 100 formats, including documents (PDF, DOCX, DOC, ODT, RTF, TXT), images (PNG, JPG, WEBP, TIFF, GIF, BMP, HEIC), spreadsheets (XLSX, XLS, CSV, ODS), presentations (PPTX, PPT, ODP), and more. PDFs and Word documents are rasterized before processing, so embedded text layers are ignored; image formats are processed directly.

Why Logistics Documents Challenge Generic OCR

Standard OCR tools stumble on logistics paperwork for several reasons:

Dense multi-page tables with small fonts and merged cells
Stamps, signatures, and handwriting covering key fields
Multi-column layouts and footers that disrupt reading order
Template variations across carriers—columns and headers shift constantly

When you rely on basic text extraction plus regex patterns, a single column change breaks everything. Layout-aware table extraction handles these variations better.

What Data You Actually Need From BOLs and Manifests

Most teams manually re-key these fields:

Parties: shipper, consignee, notify party
Routing: ports, vessel/voyage, ETD/ETA
Containers: IDs, seal numbers, type/size, counts
Line items: SKU/description, packages, weights/measure, HS codes
References: booking numbers, customer references, Incoterms

These map naturally to a structured schema:

FIG 1.0: Mapping unstructured BOLs to structured entities (diagram: data mapping schema showing the relationship between bills of lading, routing, containers, and line items)

{
  "shipper": { "name": "string", "address": "string" },
  "consignee": { "name": "string", "address": "string" },
  "routing": {
    "port_of_loading": "string",
    "port_of_discharge": "string",
    "vessel": "string",
    "voyage": "string"
  },
  "containers": [
    {
      "id": "string",
      "seal": "string",
      "type": "string",
      "packages": "number",
      "weight_kg": "number"
    }
  ],
  "line_items": [
    {
      "description": "string",
      "hs_code": "string",
      "packages": "number",
      "weight_kg": "number"
    }
  ],
  "references": {
    "booking": "string",
    "customer": "string",
    "incoterms": "string"
  }
}
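If the consuming code is TypeScript, the schema above can be mirrored as types so the compiler catches field-name drift. This is a minimal sketch: the interface names and the `isBillOfLading` guard are illustrative assumptions, not part of the LeapOCR SDK, and the guard only checks the top-level shape.

```typescript
// Hypothetical types mirroring the JSON schema in this guide.
interface Party { name: string; address: string }
interface Routing {
  port_of_loading: string;
  port_of_discharge: string;
  vessel: string;
  voyage: string;
}
interface Container {
  id: string;
  seal: string;
  type: string;
  packages: number;
  weight_kg: number;
}
interface LineItem {
  description: string;
  hs_code: string; // kept as a string so leading zeros and dots survive
  packages: number;
  weight_kg: number;
}
interface References { booking: string; customer: string; incoterms: string }

interface BillOfLading {
  shipper: Party;
  consignee: Party;
  routing: Routing;
  containers: Container[];
  line_items: LineItem[];
  references: References;
}

// True for plain (non-null, non-array) objects.
function isPlainObject(x: unknown): boolean {
  return typeof x === "object" && x !== null && !Array.isArray(x);
}

// Shallow runtime guard: verifies the top-level sections exist with the
// right container types before unknown JSON is treated as a BillOfLading.
function isBillOfLading(value: unknown): value is BillOfLading {
  if (!isPlainObject(value)) return false;
  const v = value as Record<string, unknown>;
  return (
    isPlainObject(v.shipper) &&
    isPlainObject(v.consignee) &&
    isPlainObject(v.routing) &&
    Array.isArray(v.containers) &&
    Array.isArray(v.line_items) &&
    isPlainObject(v.references)
  );
}
```

A shallow guard like this is enough to reject an obviously wrong payload early; field-level checks belong in the validation step.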

How LeapOCR Handles Logistics Documents

LeapOCR takes a different approach:

Layout-aware extraction: Preserves columns, table boundaries, and reading order instead of flattening everything to raw text
Schema-first design: Define the fields you need (schema alone, or schema + instructions). LeapOCR returns structured JSON, not text blobs
Handles messy documents: Table extraction and structured output still work on pages with stamps, handwriting, and tables that span multiple pages
Automatic cleanup: Jobs are deleted after 7 days by default, or you can delete them sooner

Visual Example: Bill of Lading Pages

The bill of lading sample below shows what makes these documents challenging:

FIG 3.0: Preserving complex table structures in logistics documents (visual: layout-aware extraction handling a complex table)

Look at the dense container and line-item tables. That’s where layout-aware extraction makes the difference—maintaining table structure instead of treating everything as unstructured text.

Sample Request (TypeScript)

import { LeapOCR } from "leapocr";

const client = new LeapOCR({ apiKey: process.env.LEAPOCR_API_KEY });

const job = await client.ocr.processURL("https://your-bucket.example.com/bill-of-lading.pdf", {
  format: "structured",
  model: "pro-v1", // use pro for noisy/stamped docs
  schema: {
    shipper: { name: "string" },
    consignee: { name: "string" },
    routing: {
      port_of_loading: "string",
      port_of_discharge: "string",
      vessel: "string",
      voyage: "string",
    },
    containers: [
      {
        id: "string",
        seal: "string",
        type: "string",
        packages: "number",
        weight_kg: "number",
      },
    ],
    line_items: [
      {
        description: "string",
        hs_code: "string",
        packages: "number",
        weight_kg: "number",
      },
    ],
    references: {
      booking: "string",
      customer: "string",
      incoterms: "string",
    },
  },
  instructions: "Return numbers as numbers; keep HS codes as strings.",
});

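// Wait for the job to finish, then fetch the structured result.
// In production flows, webhooks (Growth+ plans) can replace this polling step.
await client.ocr.waitUntilDone(job.jobId);
const result = await client.ocr.getJobResult(job.jobId);

console.log(result.output);

// Optional: delete the job now instead of waiting for automatic cleanup.
await client.ocr.deleteJob(job.jobId);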

Validation and Guardrails

Once you have structured data, add validation to catch issues:

Totals verification: Sum of line-item weights should match container totals
Presence checks: HS codes exist for all items; containers have seals
Routing completeness: Ports and vessel/voyage are populated; ETD/ETA present when available
Exception handling: Route suspect documents to a review queue instead of auto-posting to your TMS
Async delivery: For production workflows, enable webhooks (Growth+ plans) to receive completion callbacks instead of polling
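The first three checks above can be sketched as a single function that returns a list of issues; an empty list means the document can be auto-posted, anything else goes to the review queue. The `ExtractedBol` type, the `validateBol` name, and the 1 kg tolerance are assumptions for illustration, not LeapOCR features.

```typescript
// Hypothetical trimmed-down view of the extracted JSON. Fields are nullable
// because extraction may legitimately leave a field empty.
interface ExtractedBol {
  routing: {
    port_of_loading?: string | null;
    port_of_discharge?: string | null;
    vessel?: string | null;
    voyage?: string | null;
  };
  containers: { id: string; seal?: string | null; weight_kg: number }[];
  line_items: { description: string; hs_code?: string | null; weight_kg: number }[];
}

const isBlank = (s: string | null | undefined): boolean => !s || s.trim() === "";

function validateBol(doc: ExtractedBol, toleranceKg = 1): string[] {
  const issues: string[] = [];

  // Totals verification: line-item weights should add up to container weights.
  const itemTotal = doc.line_items.reduce((sum, item) => sum + item.weight_kg, 0);
  const containerTotal = doc.containers.reduce((sum, c) => sum + c.weight_kg, 0);
  const diff = Math.abs(itemTotal - containerTotal);
  if (!Number.isFinite(diff) || diff > toleranceKg) {
    issues.push(`weight mismatch: line items ${itemTotal} kg, containers ${containerTotal} kg`);
  }

  // Presence checks: every item has an HS code, every container has a seal.
  doc.line_items.forEach((item, index) => {
    if (isBlank(item.hs_code)) issues.push(`line item ${index + 1}: missing HS code`);
  });
  doc.containers.forEach((c) => {
    if (isBlank(c.seal)) issues.push(`container ${c.id}: missing seal`);
  });

  // Routing completeness: ports and vessel/voyage are populated.
  const routingFields = ["port_of_loading", "port_of_discharge", "vessel", "voyage"] as const;
  routingFields.forEach((field) => {
    if (isBlank(doc.routing[field])) issues.push(`routing: missing ${field}`);
  });

  return issues;
}
```

Returning every issue, rather than failing on the first, gives the reviewer the full picture in one pass. The tolerance exists because printed weights are often rounded per line.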

FIG 4.0: Data integrity gates preventing bad data from entering TMS (diagram: validation guardrails with pass/fail logic)
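One further gate in the same spirit, beyond the checks listed in this section: container IDs that follow ISO 6346 carry a check digit, so a misread character can often be caught without any reference data. The sketch below assumes IDs in the usual four-letters-plus-seven-digits form; `isValidContainerId` is an illustrative name, and the function accepts any fourth letter rather than restricting it to the standard's category identifiers.

```typescript
// ISO 6346 letter values start at A = 10 and skip multiples of 11
// (11, 22, 33), ending at Z = 38.
const LETTER_VALUES: Record<string, number> = {};
{
  const alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
  let next = 10;
  for (let i = 0; i < alphabet.length; i++) {
    if (next % 11 === 0) next++;
    LETTER_VALUES[alphabet.charAt(i)] = next++;
  }
}

function isValidContainerId(raw: string): boolean {
  // Tolerate spaces and hyphens that often appear in printed IDs.
  const id = raw.replace(/[\s-]/g, "").toUpperCase();
  if (!/^[A-Z]{4}\d{7}$/.test(id)) return false;

  // Weight each of the first 10 characters by 2^position and sum.
  let sum = 0;
  for (let i = 0; i < 10; i++) {
    const ch = id.charAt(i);
    const value = i < 4 ? LETTER_VALUES[ch] : Number(ch);
    sum += value * Math.pow(2, i);
  }

  // The check digit is the sum modulo 11, with 10 mapped to 0.
  const expected = (sum % 11) % 10;
  return expected === Number(id.charAt(10));
}
```

A failed check digit is a good reason to route the document to review rather than reject it outright, since the error may be in the source document rather than in the extraction.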

Case Study Flow (BOL → JSON → TMS)

Here’s a typical automation pipeline:

FIG 2.0: End-to-end logistics automation workflow (diagram: process pipeline from PDF ingestion to TMS entry)

Ingest PDF from an S3 bucket or email dropbox
Call LeapOCR with the schema defined above
Validate totals and required fields
Post the structured payload into your TMS/ERP (containers, line items, references)
Store the original PDF alongside the structured JSON for audit trails
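The steps above can be sketched as one orchestration function. Everything here is an assumption for illustration: `processBol` and the `PipelineDeps` callbacks are hypothetical names, and the actual OCR call, TMS client, and storage are injected so the flow stays testable without network access.

```typescript
// Hypothetical dependency bundle: each callback wraps one external system.
interface PipelineDeps<T> {
  extract: (pdfUrl: string) => Promise<T>;        // e.g. a wrapper around the OCR request
  validate: (doc: T) => string[];                 // returns a list of issues, empty if clean
  postToTms: (doc: T) => Promise<void>;           // writes containers, items, references
  sendToReview: (pdfUrl: string, doc: T, issues: string[]) => Promise<void>;
  archive: (pdfUrl: string, doc: T) => Promise<void>; // stores PDF reference + JSON for audit
}

type Outcome = "posted" | "review";

async function processBol<T>(pdfUrl: string, deps: PipelineDeps<T>): Promise<Outcome> {
  // Step 2: extract structured data from the ingested PDF.
  const doc = await deps.extract(pdfUrl);

  // Step 3: validate totals and required fields.
  const issues = deps.validate(doc);

  let outcome: Outcome;
  if (issues.length > 0) {
    // Suspect documents go to a human instead of the TMS.
    await deps.sendToReview(pdfUrl, doc, issues);
    outcome = "review";
  } else {
    // Step 4: post the clean payload.
    await deps.postToTms(doc);
    outcome = "posted";
  }

  // Step 5: keep the audit trail for both outcomes.
  await deps.archive(pdfUrl, doc);
  return outcome;
}
```

Archiving in both branches is a design choice in this sketch: a document that was sent to review still needs its original PDF and extracted JSON on record.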

Extending Beyond BOLs

The same approach works for other logistics documents:

Shipping manifests: Similar schema with more line items
Packing lists: Use the same table extraction with different field definitions
EDIFACT mapping: Map the returned JSON to EDIFACT segments (containers → EQD/SEL, line items → GID/FTX)
- Reference example: /assets/blog/pdf-images/edifact-guide-20160630-001.png, -002.png
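As a rough illustration of the EDIFACT mapping, the sketch below turns containers and line items into EQD/SEL and GID/FTX segment strings using the default UN/EDIFACT separators. It is deliberately simplified and the function names are hypothetical: a real message also needs its envelope (UNH/UNT and friends), and the exact qualifiers and element positions depend on the message type and your partner's implementation guide.

```typescript
// Hypothetical input shapes, reduced to the fields used in the mapping.
interface Container { id: string; seal: string; type: string }
interface LineItem { description: string; packages: number }

// Escape EDIFACT service characters with the release character "?".
function escapeEdifact(value: string): string {
  return value.replace(/([?+:'])/g, "?$1");
}

// Join a tag and its data elements with "+" and close with "'".
function segment(tag: string, ...elements: string[]): string {
  return tag + "+" + elements.join("+") + "'";
}

// containers → EQD (equipment details) and SEL (seal number).
function containerSegments(c: Container): string[] {
  const segments = [segment("EQD", "CN", escapeEdifact(c.id), escapeEdifact(c.type))];
  if (c.seal.trim() !== "") {
    segments.push(segment("SEL", escapeEdifact(c.seal)));
  }
  return segments;
}

// line items → GID (goods item details) and FTX (free text description).
// "AAA" is the text subject qualifier for a goods description; the two
// empty elements after it are left unused in this sketch.
function lineItemSegments(item: LineItem, itemNumber: number): string[] {
  return [
    segment("GID", String(itemNumber), String(item.packages)),
    segment("FTX", "AAA", "", "", escapeEdifact(item.description)),
  ];
}
```

Escaping matters more than it looks: descriptions extracted from BOLs routinely contain apostrophes and plus signs, which would otherwise terminate or split a segment.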

Takeaways

Bills of lading contain dense tables and messy markings—generic OCR can’t maintain structure
A schema-first, layout-aware approach converts BOLs into reliable JSON output
Validation checks and automatic job deletion keep you in control of the automation process

Try the approach with your own BOL PDFs and plug the JSON directly into your TMS/ERP. If you need help with a tailored schema or want more sample templates, we can add those.

