Document digitization

Turn messy paperwork into trusted records.

Put receipts, dense forms, multilingual invoices, and archival scans through one intake path without flattening the page into a text blob. Keep readable OCR for people, stable fields for software, and explicit review states when a document should pause instead of posting bad data.

Digitization lane

One intake path for scans, photos, multilingual paperwork, and archive pages without dropping the document logic people still need to trust.

input mix

Phone photos, scanned PDFs, multilingual invoices, field-heavy forms, and fading freight paperwork can share one digitization lane.

handoff

Deliver searchable markdown, structured JSON, or an explicit exception without losing the OCR layer that explains how the record was built.

control

Review routing stays page-specific, so one ambiguous scan does not stall the batch or quietly degrade the downstream system.
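As a rough illustration of page-specific routing, the sketch below splits a batch on a per-page confidence score. `PageResult`, the `confidence` field, and the 0.85 threshold are assumptions made for this example, not a documented interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageResult:
    """One digitized page. Every field name here is hypothetical."""
    page_id: str
    confidence: float  # assumed 0.0-1.0 score produced by the OCR step
    fields: dict


def route_batch(pages, threshold=0.85):
    """Split a batch so low-confidence pages pause without blocking the rest."""
    ready, exceptions = [], []
    for page in pages:
        (ready if page.confidence >= threshold else exceptions).append(page)
    return ready, exceptions


batch = [
    PageResult("inv-001", 0.97, {"total": "5.00"}),
    PageResult("rcpt-002", 0.61, {"total": "14.50"}),  # e.g. a handwritten amount
    PageResult("form-003", 0.92, {"quantity": "500"}),
]
ready, exceptions = route_batch(batch)
print([p.page_id for p in ready])       # ['inv-001', 'form-003']
print([p.page_id for p in exceptions])  # ['rcpt-002']
```

The two clean pages move on while the ambiguous one waits for a reviewer, which is the behavior the control card describes.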

From scan to record

Show the raw page, keep the OCR layer visible, and route only the exceptions that need a human decision.

One intake lane for documents that do not share a template, language, or capture quality.

Multilingual invoices

Source page

public-domain web scan

Scanned invoice with Chinese text, official stamps, and printed totals

Extraction contract

An invoice in another language, carrying official marks, codes, and totals, still has to resolve to a stable record.

Handoff behavior

Operators can inspect the OCR layer while the downstream contract stays clean enough for finance and routing systems.

invoice_contract.json
{
  "merchant": "Chengdu Ito-Yokado",
  "invoice_code": "151012071004",
  "invoice_number": "00231103",
  "currency": "CNY",
  "total_received": 5.00
}
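A downstream system could enforce a contract like this at the point of intake. The following is a minimal sketch, assuming the five fields shown above are all required; `InvoiceContract` and `load_invoice` are hypothetical names chosen for the example.

```python
import json
from dataclasses import dataclass
from decimal import Decimal

REQUIRED = ("merchant", "invoice_code", "invoice_number", "currency", "total_received")


@dataclass(frozen=True)
class InvoiceContract:
    merchant: str
    invoice_code: str
    invoice_number: str
    currency: str
    total_received: Decimal


def load_invoice(raw: str) -> InvoiceContract:
    """Parse the JSON and fail loudly on missing or malformed fields."""
    # parse_float=Decimal keeps 5.00 as Decimal("5.00") rather than a binary float.
    data = json.loads(raw, parse_float=Decimal)
    missing = [key for key in REQUIRED if key not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    # Codes stay strings so leading zeros such as "00231103" survive.
    for key in ("invoice_code", "invoice_number"):
        if not (isinstance(data[key], str) and data[key].isdigit()):
            raise ValueError(f"{key} must be a string of digits")
    total = data["total_received"]
    if isinstance(total, bool) or not isinstance(total, (int, Decimal)):
        raise ValueError("total_received must be a number")
    fields = {key: data[key] for key in REQUIRED}
    fields["total_received"] = Decimal(total)
    return InvoiceContract(**fields)


raw = (
    '{"merchant": "Chengdu Ito-Yokado", "invoice_code": "151012071004", '
    '"invoice_number": "00231103", "currency": "CNY", "total_received": 5.00}'
)
invoice = load_invoice(raw)
print(invoice.invoice_number, invoice.total_received)  # 00231103 5.00
```

Treating the invoice code and number as strings and the amount as a `Decimal` is a design choice for this sketch: it avoids dropped leading zeros and float rounding in finance systems.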

Review state

Assembling OCR

Ready to post

Exception routed
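The three review states can be modeled as a small state machine. The state names come from the labels above; the allowed transitions are an assumption made for this sketch, in particular the idea that a reviewed exception can later become ready to post.

```python
from enum import Enum


class ReviewState(Enum):
    ASSEMBLING_OCR = "Assembling OCR"
    READY_TO_POST = "Ready to post"
    EXCEPTION_ROUTED = "Exception routed"


# Assumed transitions: a page is assembled first, then either posts or pauses.
ALLOWED = {
    ReviewState.ASSEMBLING_OCR: {ReviewState.READY_TO_POST, ReviewState.EXCEPTION_ROUTED},
    ReviewState.EXCEPTION_ROUTED: {ReviewState.READY_TO_POST},
    ReviewState.READY_TO_POST: set(),
}


def advance(current: ReviewState, target: ReviewState) -> ReviewState:
    """Move to the target state, rejecting transitions the table does not list."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value!r} to {target.value!r}")
    return target


state = advance(ReviewState.ASSEMBLING_OCR, ReviewState.EXCEPTION_ROUTED)
state = advance(state, ReviewState.READY_TO_POST)
print(state.value)  # Ready to post
```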

Where OCR projects usually break

Extraction is the easy part. Production trust is the hard part.

Plain text drops the document logic.

Totals separate from labels. Checkboxes lose their meaning. Stamps, signatures, and side notes collapse into a wall of text that nobody can trust in production.

Template maintenance does not scale.

A clean invoice from one vendor becomes three different layouts next quarter. A freight document from one carrier becomes six after an acquisition. The page keeps changing while the system expects stability.

The real cost shows up after extraction.

People still have to reconcile totals, review low-confidence pages, and explain why a field landed in the wrong system. Digitization is only done when the record is usable downstream.

Proof from actual pages

Three difficult documents, shown as source image and downstream record.

Each row pairs the original page with the form of output that matters most for that workflow.

Receipt photo with handwritten tip total

Receipt proof

A mobile capture can still become a structured expense record.

This image carries perspective distortion, uneven lighting, and a handwritten amount. The useful output is not the raw text block. It is a record with merchant, items, tax, and total that can survive audit and reimbursement.

Normalized JSON

{
  "merchant": "Contoso",
  "timestamp": "2019-06-10T13:54:00",
  "items": [
    { "name": "Cappuccino", "quantity": 1, "amount": 2.2 },
    {
      "name": "BACON & EGGS",
      "quantity": 1,
      "amount": 9.5,
      "modifier": "Sunny-side-up"
    }
  ],
  "subtotal": 11.7,
  "tax": 1.17,
  "total": 14.5
}
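A record like this can be checked arithmetically before it posts. In the sample, the items add up to the subtotal, but the total exceeds subtotal plus tax by 1.63. That gap may be the handwritten tip, yet the record has no tip field, so the sketch below flags the page for review rather than guessing. The function names and the shape of the result are assumptions for illustration.

```python
from decimal import Decimal


def to_decimal(value):
    # Go through str() so 11.7 becomes Decimal("11.7"), not a long binary fraction.
    return Decimal(str(value))


def reconcile(record):
    """Compare line items, subtotal, tax, and total; report any gap."""
    items_sum = sum((to_decimal(i["amount"]) for i in record["items"]), Decimal("0"))
    subtotal = to_decimal(record["subtotal"])
    expected_total = subtotal + to_decimal(record["tax"])
    remainder = to_decimal(record["total"]) - expected_total
    return {
        "items_match_subtotal": items_sum == subtotal,
        "unexplained_remainder": remainder,
        "needs_review": items_sum != subtotal or remainder != 0,
    }


receipt = {
    "items": [{"name": "Cappuccino", "amount": 2.2},
              {"name": "BACON & EGGS", "amount": 9.5}],
    "subtotal": 11.7,
    "tax": 1.17,
    "total": 14.5,
}
result = reconcile(receipt)
print(result["items_match_subtotal"])   # True
print(result["unexplained_remainder"])  # 1.63
print(result["needs_review"])           # True
```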
Proposal form with typed values and checkboxes

Form proof

Dense forms need field relationships, not just recognized characters.

The proposal scan mixes labels, checkboxes, quantities, tooling charges, and dates in a cramped layout. The value comes from keeping the field map intact so people and systems read the same page the same way.

Searchable markdown

# Proposal
- Vendor: STOUT INDUSTRIES, INC.
- Customer: Lorillard Corporation
- Date: October 16, 1987
- Product: Metal "Pack" Plaque
- Quantity: 500 Plaques
- Unit price: $9.18 each
- One-time tooling: $3,015.00
- Steel tips: $1,045.00
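One way to keep people and systems reading the same page is to render both views from a single field map. The sketch below does that for the proposal fields above; `to_markdown`, `to_json`, and the JSON envelope keys are hypothetical choices, not a documented output format.

```python
import json

# Field map taken from the proposal sample; insertion order is the reading order.
proposal = {
    "Vendor": "STOUT INDUSTRIES, INC.",
    "Customer": "Lorillard Corporation",
    "Date": "October 16, 1987",
    "Product": 'Metal "Pack" Plaque',
    "Quantity": "500 Plaques",
    "Unit price": "$9.18 each",
    "One-time tooling": "$3,015.00",
    "Steel tips": "$1,045.00",
}


def to_markdown(title, fields):
    """Readable layer: a heading plus one bullet per field."""
    lines = [f"# {title}"]
    lines += [f"- {label}: {value}" for label, value in fields.items()]
    return "\n".join(lines)


def to_json(title, fields):
    """System layer: the same fields, serialized for downstream use."""
    return json.dumps({"document": title, "fields": fields}, indent=2)


print(to_markdown("Proposal", proposal))
print(to_json("Proposal", proposal))
```

Because both outputs are derived from the same dictionary, a correction made during review shows up in the markdown and the JSON at once.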
Scanned invoice with Chinese text, official stamps, and printed totals

Multilingual proof

Digitization has to hold up when the page is in another language, not just another layout.

This invoice adds another failure mode: the fields, labels, and totals are not in English. A useful digitization layer still has to preserve document structure, extract the critical values, and hand off a record the rest of the workflow can use.

Chinese + English OCR

中文
四川通用机打发票
国家税务总局
四川省税务局
发票联
发票代码 151012071004
发票号码 00231103
成都伊藤洋华堂有限公司锦华店
购货方名称: [██████████]
地址: 二环路东五段29号
电话: 028-84192000
2020年06月09日 13:14 <011>
512 百事可乐桂花                             5.00
(6906946288923)                              )
商品数量                                     1个
小计                                         5.00
支付宝优惠                                   0.00
实收(支付宝)                                 5.00
支付宝合计                                   5.00
收款条码
交易号
(c)c064114587001469dc1
机号: 0001    收银员: 012    页码: 1/1
91510400669627198P
退货提示,如发生退货必须退回原发票
请顾客核对发票内容是否相符
底部红色印章内容:
成都伊藤洋华堂有限公司锦华店
发票专用章
(1)
5101040066913
凡持购物卡购买商品金额部分不作报销
左侧垂直文本:
210万份(7000卷 × 300)# 00000001-02100000
成都印钞有限公司2020年3月印制
右侧垂直文本:
除购货方名称外手写无效
---
English
Sichuan General Machine-Printed Invoice
State Taxation Administration
Sichuan Provincial Tax Bureau
Invoice Copy
Invoice code 151012071004
Invoice number 00231103
Chengdu Ito-Yokado Co., Ltd., Jinhua Store
Purchaser name: [██████████]
Address: No. 29, East 5th Section, 2nd Ring Road
Phone: 028-84192000
2020-06-09 13:14 <011>
512 Pepsi Cola Osmanthus                            5.00
(6906946288923)                                    )
Item quantity                                      1
Subtotal                                           5.00
Alipay discount                                    0.00
Amount received (Alipay)                           5.00
Alipay total                                       5.00
Payment barcode
Transaction number
(c)c064114587001469dc1
Machine no.: 0001    Cashier: 012    Page: 1/1
91510400669627198P
Return notice: if a return occurs, the original invoice must be brought back
Please verify that the invoice details match the purchase
Bottom red seal:
Chengdu Ito-Yokado Co., Ltd., Jinhua Store
Invoice seal
(1)
5101040066913
Amounts paid with shopping cards are not reimbursable
Left vertical text:
2.1 million copies (7000 rolls × 300) # 00000001-02100000
Printed by Chengdu Banknote Company in March 2020
Right vertical text:
Handwriting is invalid except for the purchaser name
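To show how critical values could be pulled from an OCR layer like this one, the sketch below anchors on the Chinese labels for invoice code (发票代码), invoice number (发票号码), and amount received (实收). Label-anchored regular expressions are only one illustrative technique and are brittle in practice; the patterns and the `extract_invoice_fields` name are assumptions for this example.

```python
import re
from decimal import Decimal

# A few lines excerpted from the OCR text above.
OCR_TEXT = """\
发票代码 151012071004
发票号码 00231103
成都伊藤洋华堂有限公司锦华店
小计                                         5.00
实收(支付宝)                                 5.00
"""

PATTERNS = {
    "invoice_code": r"发票代码\s*(\d+)",
    "invoice_number": r"发票号码\s*(\d+)",
    # Skip the payment-method suffix after the label, then read the amount.
    "total_received": r"实收\S*\s+(\d+\.\d{2})",
}


def extract_invoice_fields(text):
    """Return matched values; a label that is not found maps to None."""
    found = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        found[name] = match.group(1) if match else None
    if found["total_received"] is not None:
        found["total_received"] = Decimal(found["total_received"])
    return found


fields = extract_invoice_fields(OCR_TEXT)
print(fields["invoice_code"])    # 151012071004
print(fields["invoice_number"])  # 00231103
print(fields["total_received"])  # 5.00
```

A `None` value is a natural trigger for the exception path: the page pauses for review instead of posting a record with a missing code or total.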

Delivery surface

Digitization is a three-layer handoff: readable, structured, and reviewable.

Readable archive layer

Use markdown when operations teams, auditors, and reviewers still need to read the page in a human way.

System-ready record layer

Use structured JSON when AP, logistics, records, or compliance workflows need stable fields and predictable contracts.

Exception layer

Keep flagged pages visible when handwriting, poor scans, or ambiguous totals need a deliberate human decision.

Archive to system

Use one pipeline for document intake, OCR, normalization, and exception routing.

capture

PDFs, scans, phone photos, mixed batches

parse

Reading order, field anchors, handwriting, totals

normalize

Markdown for people, JSON for systems

review

Low-confidence pages continue as exceptions, not silent failures
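The four stages can be sketched as plain functions chained together, with failures recorded as exceptions on the outcome rather than dropped. Everything below is a toy stand-in: `parse` reads "label: value" lines instead of running OCR, and the `Outcome` type and stage signatures are assumptions for illustration.

```python
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation
from typing import Optional


@dataclass
class Outcome:
    doc_id: str
    record: Optional[dict] = None
    exception: Optional[str] = None  # set when the page needs review


def parse(text):
    """Stand-in for OCR and field anchoring: reads 'label: value' lines."""
    pairs = (line.split(":", 1) for line in text.splitlines() if ":" in line)
    return {label.strip().lower(): value.strip() for label, value in pairs}


def normalize(fields):
    """Produce a typed record, or raise ValueError if the total is unusable."""
    if "total" not in fields:
        raise ValueError("no total found")
    try:
        total = Decimal(fields["total"])
    except InvalidOperation:
        raise ValueError(f"unreadable total: {fields['total']!r}") from None
    return {**fields, "total": total}


def run(captured):
    """Process every page; a failure becomes an exception record, not a crash."""
    outcomes = []
    for doc_id, text in captured:
        try:
            outcomes.append(Outcome(doc_id, record=normalize(parse(text))))
        except ValueError as err:
            outcomes.append(Outcome(doc_id, exception=str(err)))
    return outcomes


captured = [
    ("a", "Merchant: Contoso\nTotal: 14.50"),
    ("b", "Merchant: Contoso\nTotal: 14.5o"),  # OCR misread the final digit
    ("c", "Merchant: Contoso"),
]
for outcome in run(captured):
    print(outcome.doc_id, outcome.exception or "ready")
```

Every input page produces exactly one outcome, so nothing leaves the pipeline without either a record or a stated reason for review.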

Final step

Start with the pages people still send around as attachments. Those are the records worth digitizing first.

If the document still needs a person to translate it before a system can act on it, the digitization layer is the missing part of the workflow.
