AI Development

AI Document Understanding API Tutorial: From PDFs to Structured Data in Production

Build a production‑ready pipeline for AI document understanding: upload, OCR, schema‑based extraction, tables, QA, validation, and storage.

ASOasis

Apr 17, 2026

9 min read

AI Document Understanding API Tutorial: From PDFs to Structured Data in Production

Image used for representation purposes only.

Overview

AI document understanding turns messy, multi‑page PDFs, scans, and images into clean, structured data your apps can trust. In this tutorial you’ll build an end‑to‑end pipeline that:

Uploads documents (PDFs, TIFFs, JPEGs)
Runs OCR + layout analysis
Extracts fields and tables with a schema
Answers questions about the document
Validates and stores results
Handles webhooks, retries, and monitoring for production

All examples are provider‑agnostic and use a hypothetical base URL. Replace endpoints and headers with your vendor’s equivalents.

What you’ll build

We’ll process invoices and produce a normalized JSON payload ready for a database, then query the document (“What’s the due date?”) using retrieval‑augmented QA.

Prerequisites

An AI document understanding API key (export as DOC_API_KEY)
Python 3.10+ and Node.js 18+
curl and jq for quick tests
A few sample documents: one clean digital PDF and one photographed/low‑contrast scan

Optional but helpful:

A Postgres instance for storing results
ngrok or a public HTTPS endpoint to receive webhooks

Architecture at a glance

[Client] -> Upload -> [API: /documents]
            -> OCR/Layout -> [API: /pipelines/ocr]
            -> Extract -> [API: /extract] --(schema)--> JSON
            -> Tables -> [API: /tables]
            -> QA -> [API: /query]
            -> Webhook -> [Your service]
            -> Validate/Store -> [DB]

Key ideas:

Separate ingestion from extraction so you can re‑use uploaded pages with different prompts/schemas.
Use webhooks for long‑running jobs; poll only when webhooks aren’t possible.
Keep extraction prompts and schemas versioned to guarantee reproducibility.

Step 1 — Upload a document

Use multipart upload; request an ID you’ll reuse downstream.

curl

curl -X POST \
  https://api.example.ai/v1/documents \
  -H "Authorization: Bearer $DOC_API_KEY" \
  -H "Idempotency-Key: $(uuidgen)" \
  -F file=@./samples/invoice.pdf \
  -F metadata='{"source":"tutorial","doctype":"invoice"}'

Example response

{
  "id": "doc_12345",
  "status": "uploaded",
  "pages": 3,
  "content_type": "application/pdf"
}

Notes

For large files, chunked uploads or pre‑signed URLs may be required.
Store the mapping between your internal record and doc_id.

Step 2 — Run OCR and layout analysis

Trigger OCR with language hints and quality controls.

curl

curl -X POST \
  https://api.example.ai/v1/pipelines/ocr \
  -H "Authorization: Bearer $DOC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "document_id": "doc_12345",
    "languages": ["en"],
    "features": {
      "tables": true,
      "handwriting": false,
      "barcodes": true
    },
    "webhook_url": "https://yourapp.com/webhooks/ocr"
  }'

Example response

{
  "job_id": "job_ocr_7788",
  "status": "queued"
}

Polling (if no webhook)

curl -H "Authorization: Bearer $DOC_API_KEY" \
  https://api.example.ai/v1/jobs/job_ocr_7788

When complete, you’ll have page‑level text, tokens, bounding boxes, and reading order—crucial for reliable extraction and table parsing.

Step 3 — Define a structured extraction schema

Treat extraction as “programming with data contracts.” Define the fields, types, and validations you expect. Keep this schema in version control.

invoice.schema.json

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://schemas.yourapp.com/invoice.v1.json",
  "title": "Invoice v1",
  "type": "object",
  "required": ["invoice_number", "invoice_date", "seller", "buyer", "total"],
  "properties": {
    "invoice_number": {"type": "string"},
    "invoice_date": {"type": "string", "format": "date"},
    "due_date": {"type": "string", "format": "date"},
    "seller": {"type": "string"},
    "buyer": {"type": "string"},
    "currency": {"type": "string", "pattern": "^[A-Z]{3}$"},
    "subtotal": {"type": "number"},
    "tax": {"type": "number"},
    "total": {"type": "number"}
  }
}

Extraction request

curl -X POST \
  https://api.example.ai/v1/extract \
  -H "Authorization: Bearer $DOC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "document_id": "doc_12345",
    "schema": {"$ref": "https://schemas.yourapp.com/invoice.v1.json"},
    "prompt": "Extract invoice fields. Prefer values on the first page if duplicates appear.",
    "webhook_url": "https://yourapp.com/webhooks/extract"
  }'

Example response payload

{
  "job_id": "job_extract_9012",
  "status": "queued"
}

Job result (abridged)

{
  "document_id": "doc_12345",
  "status": "succeeded",
  "output": {
    "invoice_number": {"value": "INV-2026-0417", "page": 1, "bbox": [92, 118, 250, 136]},
    "invoice_date": {"value": "2026-04-01", "page": 1},
    "due_date": {"value": "2026-04-30", "page": 1},
    "seller": {"value": "Acme Components LLC"},
    "buyer": {"value": "Northwind Traders"},
    "currency": {"value": "USD"},
    "subtotal": {"value": 1249.5},
    "tax": {"value": 112.46},
    "total": {"value": 1361.96}
  },
  "confidence": 0.94
}

Tips

Ask for bounding boxes and page numbers to support review UIs.
Prefer ISO‑8601 dates and three‑letter ISO‑4217 currencies for easy validation.

Step 4 — Extract line items as a table

Tables often cause the most pain. Request structured rows with column mappings and tolerances for merged cells.

curl -X POST \
  https://api.example.ai/v1/tables \
  -H "Authorization: Bearer $DOC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "document_id": "doc_12345",
    "table_hint": "Line items table usually below the header on page 1",
    "columns": [
      {"name": "description", "type": "string"},
      {"name": "qty", "type": "number"},
      {"name": "unit_price", "type": "number"},
      {"name": "line_total", "type": "number"}
    ]
  }'

Example table output

{
  "rows": [
    {"description": "Gear Assembly A2", "qty": 5, "unit_price": 199.90, "line_total": 999.50},
    {"description": "Shipping", "qty": 1, "unit_price": 250.00, "line_total": 250.00}
  ],
  "page": 1,
  "confidence": 0.91
}

Best practices

Provide column names and types to reduce hallucinations.
Enable header detection and tolerance for rotated text when scanning.
Consider a “strict mode” to fail extraction when any cell is low confidence.

Step 5 — Ask questions about the document

Use retrieval QA to answer user queries grounded in the uploaded pages.

curl -X POST \
  https://api.example.ai/v1/query \
  -H "Authorization: Bearer $DOC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "document_id": "doc_12345",
    "question": "What is the payment due date and accepted methods?",
    "citations": true
  }'

Example answer

{
  "answer": "Due 2026-04-30. Accepted methods: ACH, wire, major credit cards.",
  "citations": [
    {"page": 1, "bbox": [80, 640, 420, 700], "snippet": "Payment due 30 Apr 2026..."}
  ]
}

Step 6 — Receive results with webhooks

Long extractions benefit from webhooks.

Node.js (Express)

import express from 'express';
import crypto from 'crypto';

const app = express();
app.use(express.json({ type: '*/*' }));

function verify(signature, body, secret) {
  const h = crypto.createHmac('sha256', secret).update(body).digest('hex');
  return crypto.timingSafeEqual(Buffer.from(signature), Buffer.from(h));
}

app.post('/webhooks/extract', (req, res) => {
  const sig = req.header('X-Signature') || '';
  const raw = JSON.stringify(req.body);
  if (!verify(sig, raw, process.env.WEBHOOK_SECRET)) return res.sendStatus(401);

  // Upsert job status and results
  // enqueue post-processing
  res.sendStatus(200);
});

app.listen(3000, () => console.log('Webhook listening'));

Step 7 — Validate and normalize

Add domain checks before writing to your systems.

Python

from decimal import Decimal, ROUND_HALF_UP
from datetime import datetime

def normalize_amount(x):
    return Decimal(str(x)).quantize(Decimal('0.01'), rounding=ROUND_HALF_UP)

def validate_invoice(o):
    # Required fields
    for k in ["invoice_number","invoice_date","seller","buyer","total"]:
        assert k in o, f"Missing {k}"

    # Dates
    inv_date = datetime.fromisoformat(o["invoice_date"]).date()
    if due := o.get("due_date"):
        assert datetime.fromisoformat(due).date() >= inv_date, "Due before invoice date"

    # Arithmetic
    subtotal = normalize_amount(o.get("subtotal", 0))
    tax = normalize_amount(o.get("tax", 0))
    total = normalize_amount(o["total"])
    assert subtotal + tax == total, "Subtotal + tax != total"

    # Currency
    if cur := o.get("currency"):
        assert len(cur) == 3 and cur.isupper(), "Invalid currency code"

    return {**o, "total": float(total)}

Step 8 — Store results

SQL

create table invoices (
  id text primary key,
  invoice_number text not null,
  invoice_date date not null,
  due_date date,
  seller text,
  buyer text,
  currency char(3),
  subtotal numeric(12,2),
  tax numeric(12,2),
  total numeric(12,2) not null,
  raw jsonb not null,
  created_at timestamptz default now()
);

Python insert

import psycopg2

def save_invoice(conn, doc_id, parsed):
    with conn.cursor() as cur:
        cur.execute(
            """
            insert into invoices (id, invoice_number, invoice_date, due_date, seller, buyer,
                                  currency, subtotal, tax, total, raw)
            values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
            on conflict (id) do update set raw = excluded.raw
            """,
            (
              doc_id,
              parsed.get('invoice_number'),
              parsed.get('invoice_date'),
              parsed.get('due_date'),
              parsed.get('seller'),
              parsed.get('buyer'),
              parsed.get('currency'),
              parsed.get('subtotal'),
              parsed.get('tax'),
              parsed.get('total'),
              json.dumps(parsed)
            )
        )
        conn.commit()

Step 9 — Evaluate and benchmark

Create a golden set of documents and expected outputs. Compute field‑level precision/recall and table accuracy.

Metrics

Exact‑match accuracy per field
Numeric tolerance (e.g., ±0.01) for amounts
Edit distance for IDs and names
Table row recall and column accuracy

Process

Run the pipeline on the golden set after each change.
Diff against the ground truth; fail the build if KPIs drop.
Track confidence scores and add a “human review required” threshold (e.g., < 0.85).

Step 10 — Production hardening

Idempotency: send an Idempotency‑Key on uploads and extraction calls.
Rate limits: implement exponential backoff and jitter; parallelize within vendor quotas.
Retries: retry 429/5xx with backoff; don’t retry 4xx except 408/409 when safe.
Webhook security: verify HMAC signatures; reject unsigned payloads.
Redaction: mask PII you don’t need (SSNs, account numbers) before logging.
Data residency: select regions and disable training on your data if required.
Observability: log job_id, document_id, model version, latency, token usage, and confidence.

Cost and performance tuning

Prefer text‑understanding for digital PDFs; use vision‑heavy OCR only for scans.
Crop to content: remove blank margins to cut tokens and speed up.
Page selection: extract only header + totals for invoices when tables aren’t needed.
Caching: memoize stable results by document hash.
Schema scoping: smaller schemas increase accuracy and reduce hallucinations.
Batch mode: group multiple documents to amortize overhead and save on requests.

Common errors and fixes

400 Invalid schema: validate locally against JSON Schema before calling the API.
413 Payload too large: switch to chunked uploads or pre‑signed URLs.
415 Unsupported media type: convert images to PNG/JPEG and PDFs to PDF/A when possible.
429 Too many requests: respect Retry‑After; implement client‑side rate limiting.
499/408 Client timeout: use webhooks or increase request timeouts for long jobs.
Mis‑extracted numbers: strip thousands separators; normalize decimals with a locale‑aware parser.
Wrong tables: provide explicit column headers and page/table hints; enable rotated text detection.

End‑to‑end Python example

import os, time, json, requests
BASE = "https://api.example.ai/v1"
KEY = os.environ["DOC_API_KEY"]
H = {"Authorization": f"Bearer {KEY}"}

# 1) Upload
def upload(path):
    with open(path, 'rb') as f:
        r = requests.post(f"{BASE}/documents", headers=H, files={"file": f})
    r.raise_for_status(); return r.json()["id"]

# 2) OCR

def ocr(doc_id):
    r = requests.post(f"{BASE}/pipelines/ocr", headers={**H, "Content-Type":"application/json"},
        data=json.dumps({"document_id": doc_id, "languages":["en"], "features":{"tables":True}}))
    r.raise_for_status(); return r.json()["job_id"]

# Helper: poll

def wait_job(job_id):
    while True:
        r = requests.get(f"{BASE}/jobs/{job_id}", headers=H)
        r.raise_for_status(); j = r.json()
        if j["status"] in ("succeeded","failed"): return j
        time.sleep(1.2)

# 3) Extract

def extract(doc_id):
    payload = {
        "document_id": doc_id,
        "schema": {"$ref": "https://schemas.yourapp.com/invoice.v1.json"},
        "prompt": "Extract invoice fields precisely; output ISO dates and 2dp numbers."
    }
    r = requests.post(f"{BASE}/extract", headers={**H, "Content-Type":"application/json"}, data=json.dumps(payload))
    r.raise_for_status(); job = wait_job(r.json()["job_id"]) ; return job["output"]

if __name__ == "__main__":
    doc = upload("./samples/invoice.pdf")
    print("doc:", doc)
    job = ocr(doc)
    print(wait_job(job)["status"])  # ensure OCR done before extraction
    result = extract(doc)
    print(json.dumps(result, indent=2))

Troubleshooting checklist

If confidence is low, try: better scans (300+ DPI), deskewing, or enabling table detection.
If dates parse incorrectly, specify expected locale/format in the prompt and normalize post‑extraction.
If totals don’t reconcile, extract line items and recompute; prefer numeric parsing over text copying.
For multilingual docs, pass language hints or enable auto‑detect; consider fallback OCR models.

Where to go next

Add a human‑in‑the‑loop review UI wired to page bboxes.
Build templates for common vendors and route docs by classifier.
Introduce active learning: send low‑confidence fields to annotation, retrain periodically.
Expand to receipts, ID cards, utility bills, and bank statements with tailored schemas and QA.

Summary

You now have a repeatable, production‑grade approach to document understanding: upload, OCR, extract with a contract, validate, and store—augmented with tables, QA, and strong operational practices. Adapt the endpoints to your provider, keep schemas versioned, and measure quality continuously. That’s the difference between a demo and a dependable pipeline.

DeepSeek API Integration Tutorial: From First Call to Production

Step-by-step DeepSeek API integration: base URL, models, cURL/Python/Node code, streaming, thinking mode, tool calls, errors, and production tips.

ASOasis

Mar 26, 2026

REST API Multipart File Upload: A Practical, End-to-End Guide

A concise, end-to-end guide to designing, implementing, and securing multipart/form-data file uploads in REST APIs, with examples in Node, Python, Java, .NET, and Go.

ASOasis

Apr 1, 2026

LangGraph Multi‑Agent Workflow Tutorial: From Supervisor Routing to Tool Execution

Build a production-ready LangGraph multi-agent workflow with a supervisor, tools, checkpointing, and streaming—step-by-step with tested Python code.

ASOasis

Mar 28, 2026

AI Document Understanding API Tutorial: From PDFs to Structured Data in Production

Overview

What you’ll build

Prerequisites

Architecture at a glance

Step 1 — Upload a document

Step 2 — Run OCR and layout analysis

Step 3 — Define a structured extraction schema

Step 4 — Extract line items as a table

Step 5 — Ask questions about the document

Step 6 — Receive results with webhooks

Step 7 — Validate and normalize

Step 8 — Store results

Step 9 — Evaluate and benchmark

Step 10 — Production hardening

Cost and performance tuning

Common errors and fixes

End‑to‑end Python example

Troubleshooting checklist

Where to go next

Summary

Tags

Related Posts

DeepSeek API Integration Tutorial: From First Call to Production

REST API Multipart File Upload: A Practical, End-to-End Guide

LangGraph Multi‑Agent Workflow Tutorial: From Supervisor Routing to Tool Execution

Services

Products

Company

Legal