
Programmatic PDF Table Extraction and OCR with Adobe PDF Services REST: The Auth, the Extract Call, and Parsing the Output

Why Adobe's table-layout extraction holds up on messy real-world PDFs where pdfplumber and pypdf fall apart, with the full S2S auth, extract, and parse path I run in production.

Terminal showing a structuredData.json table extraction from a scanned PDF via Adobe PDF Services REST

I keep a folder of PDFs that break parsers: a scanned two-column vendor invoice, a card statement with rotated column headers, and a utility bill where the amount table sits inside a bordered box. I ran all three through pdfplumber first, because it is free, runs offline, and is usually the first tool I reach for. On a clean digital invoice it was fine. On the scan it returned nothing, because pdfplumber reads the existing text layer and a scan has no text layer at all. On the statement with merged header cells, extract_table() handed back ragged rows where two columns had collapsed into one and the totals row drifted into the wrong column. pypdf was worse for this, because it is built for page text and structure, not for grid reconstruction.

Then I ran the same three files through Adobe PDF Services Extract over REST, which I have wrapped as a local tool on my Linux box. The scan came back with real rows and columns, because Adobe runs OCR before extraction. The merged-header statement kept its grid. The bordered utility table came out as a clean CSV with the totals row in the right place. That gap is the entire reason this tutorial exists.

Why the table-layout extraction wins on messy input

Adobe documents the behavior in their Extract how-to, but the short version is that Extract reconstructs table structure and reading order instead of just dumping a text layer. On messy input that is a different and much harder job.

| Capability | pdfplumber | pypdf | Adobe PDF Services Extract |
| --- | --- | --- | --- |
| Reads scanned image PDF | No text layer, returns nothing | Same | OCR runs first, returns text plus tables |
| Merged or rotated headers | Often collapses columns | Not built for tables | Keeps cell structure with NumRow, NumCol |
| Reading order, multi-column | Manual, sort by x-coordinate | Manual | Reconstructed into the output Path |
| Output format | Python lists | Page text | structuredData.json plus CSV and XLSX |
| Cost | Free, offline | Free, offline | Free tier, about 500 transactions a month |
| Runs offline | Yes | Yes | No, cloud REST call |

The trade is plain. pdfplumber and pypdf are free and run on your own machine. For a clean digital PDF with a real text layer they are the right tool, and I still reach for them first. Adobe earns its place the moment the input is scanned, rotated, multi-column, or carries the kind of merged-cell tables that real statements and invoices ship with.

The S2S auth flow

Adobe retired JWT service-account auth, so any project you create now uses OAuth Server-to-Server. You provision a client ID and client secret in the Adobe Developer Console, store them in a file outside your repo with chmod 600, and exchange them for a bearer token against the IMS endpoint. The grant type is client_credentials. No browser, no redirect, no logged-in user.

curl -s -X POST 'https://ims-na1.adobelogin.com/ims/token/v3' \
  -H 'Content-Type: application/x-www-form-urlencoded' \
  -d 'grant_type=client_credentials' \
  -d "client_id=$ADOBE_CLIENT_ID" \
  -d "client_secret=$ADOBE_CLIENT_SECRET" \
  -d 'scope=openid,AdobeID,DCAPI'

The token is good for roughly 24 hours. In production I cache it to a chmod 600 file and refresh it 300 seconds before expiry, so a long batch never dies mid-run on a token that expired between files. Every later call carries two headers: Authorization: Bearer <token> and x-api-key: <client_id>. Drop the x-api-key and you get a 401 that never says the API key is the missing piece, which makes it easy to misdiagnose the first time.

import json, time, os, httpx
from pathlib import Path

IMS = "https://ims-na1.adobelogin.com/ims/token/v3"
CACHE = Path("~/.cache/adobe-pdf/token.json").expanduser()

def get_token(client_id: str, client_secret: str) -> str:
    # Reuse the cached token while it is still inside its safety margin.
    if CACHE.exists():
        c = json.loads(CACHE.read_text())
        if c["expires_at"] > int(time.time()):
            return c["access_token"]
    # Otherwise exchange the client credentials for a fresh bearer token.
    r = httpx.post(IMS, data={
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": "openid,AdobeID,DCAPI",
    }, timeout=60.0)
    r.raise_for_status()
    body = r.json()
    CACHE.parent.mkdir(parents=True, exist_ok=True)
    # Record the expiry 300 seconds early so the refresh happens before the real deadline.
    CACHE.write_text(json.dumps({
        "access_token": body["access_token"],
        "expires_at": int(time.time()) + int(body["expires_in"]) - 300,
    }))
    os.chmod(CACHE, 0o600)  # owner read/write only
    return body["access_token"]
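
Because every later call needs both headers, one way to avoid the unhelpful 401 is to build them in a single place and fail locally when either value is missing. This is a minimal sketch: auth_headers is a hypothetical helper name of my own, not part of any Adobe SDK.

```python
def auth_headers(token: str, client_id: str) -> dict[str, str]:
    """Build the two headers every PDF Services call carries (hypothetical helper)."""
    if not token or not client_id:
        # Fail here rather than wait for a 401 that does not name the missing header.
        raise ValueError("both a bearer token and a client ID are required")
    return {"Authorization": f"Bearer {token}", "x-api-key": client_id}
```

The functions in this article build the same dictionary inline, so a helper like this is optional tidying rather than a required step.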

Upload, extract, poll

Every operation follows the same three beats. Upload the file to get an asset ID, start the operation, then poll a job URL until it reports done. The API base is https://pdf-services.adobe.io. Upload is itself two steps: ask for a pre-signed URL, then PUT the bytes to it.

BASE = "https://pdf-services.adobe.io"

def upload(token, client_id, path, media_type="application/pdf"):
    h = {"Authorization": f"Bearer {token}", "x-api-key": client_id}
    # Step 1: ask for an asset ID and a pre-signed upload URL.
    init = httpx.post(f"{BASE}/assets",
                      json={"mediaType": media_type},
                      headers={**h, "Content-Type": "application/json"}, timeout=60.0)
    init.raise_for_status()
    j = init.json()
    asset_id, put_uri = j["assetID"], j["uploadUri"]
    # Step 2: PUT the raw bytes to the pre-signed URL, which takes no auth headers here.
    with open(path, "rb") as f:
        httpx.put(put_uri, content=f.read(),
                  headers={"Content-Type": media_type}, timeout=180.0).raise_for_status()
    return asset_id

def extract(token, client_id, asset_id, max_tries=80, interval=1.5):
    h = {"Authorization": f"Bearer {token}", "x-api-key": client_id}
    op = httpx.post(f"{BASE}/operation/extractpdf",
                    json={"assetID": asset_id,
                          "elementsToExtract": ["text", "tables"],
                          "elementsToExtractRenditions": ["tables"]},
                    headers={**h, "Content-Type": "application/json"}, timeout=60.0)
    if op.status_code >= 400:
        raise RuntimeError(f"extract HTTP {op.status_code}: {op.text[:300]}")
    # The job URL to poll comes back in the Location header.
    job = op.headers["location"]
    # 80 tries at 1.5 seconds each caps a single job at 120 seconds.
    for _ in range(max_tries):
        s = httpx.get(job, headers=h, timeout=60.0)
        s.raise_for_status()
        b = s.json()
        if b["status"] == "done":
            return b
        if b["status"] == "failed":
            raise RuntimeError(b.get("error", b))
        time.sleep(interval)
    raise TimeoutError(f"extract job still running after {max_tries} polls")

The poll interval I run is 1.5 seconds, and the client gives up after 80 tries, which caps any single job at 120 seconds. In my testing, most extract jobs on a multi-page document finish inside the first few cycles, call it roughly 7 seconds end to end including the upload, and a single-page file is quicker. The elementsToExtract list is the lever. Pass ["text"] for plain reading order, add "tables" for cell structure, and set "elementsToExtractRenditions": ["tables"] to also receive every table as a real CSV and XLSX file. With renditions on, the result is a ZIP instead of a bare JSON, so I sniff the first four bytes for PK\x03\x04 to tell them apart rather than trusting the filename extension.
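
The byte sniff described above can be sketched in a few lines. The names is_zip and result_kind are hypothetical and exist only for this illustration; the only fact relied on is that a non-empty ZIP archive opens with the four-byte local file header signature PK\x03\x04.

```python
ZIP_MAGIC = b"PK\x03\x04"  # local file header signature at the start of a non-empty ZIP

def is_zip(payload: bytes) -> bool:
    """Return True when the payload starts with the ZIP signature."""
    return payload[:4] == ZIP_MAGIC

def result_kind(payload: bytes) -> str:
    """Classify a downloaded result by its content rather than its filename."""
    return "zip" if is_zip(payload) else "json"
```

Checking the bytes means the same download code works whether or not renditions were requested, without trusting the URL or a Content-Type header.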

Parsing the structured output

The done response carries a download URL for the result. Inside the ZIP are structuredData.json and a tables/ folder. The JSON has an elements array, and the two fields that matter most are Path and Text. Path reads like an HTML tree, for example //Document/Table[2]/TR[3]/TD[1]/P, so filtering for table cells is a substring match. Tables exported as renditions carry a filePaths attribute pointing at the CSV and XLSX inside the ZIP. Bounds are reported as a bounding box in 72 dpi PDF coordinates, which is what you want for laying text back onto a page. The field-by-field breakdown is in this output reference.
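
When a plain substring match is not enough, the Path can be turned into table, row, and column numbers. The sketch below is one way to do that with a regular expression. It assumes that a segment without a bracketed index means the first element (so Table is read as Table[1]) and that header cells appear as TH; both are assumptions worth checking against your own output. cell_position is a hypothetical helper name.

```python
import re

# Matches the Table, TR, and TD/TH segments of a Path.
# A missing [n] is treated as index 1 (an assumption, see the lead-in).
_CELL = re.compile(r"/Table(?:\[(\d+)\])?/TR(?:\[(\d+)\])?/T[DH](?:\[(\d+)\])?")

def cell_position(path: str):
    """Return (table, row, col) as 1-based ints, or None if path is not a table cell."""
    m = _CELL.search(path)
    if m is None:
        return None
    return tuple(int(g) if g else 1 for g in m.groups())
```

Grouping elements by the first two numbers gives rows per table, which is enough to rebuild a grid when the CSV rendition is not available.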

import json, zipfile

def read_tables(zip_path):
    with zipfile.ZipFile(zip_path) as z:
        data = json.loads(z.read("structuredData.json"))
        cells = [e for e in data["elements"]
                 if "Table" in e.get("Path", "") and e.get("Text")]
        for e in cells:
            print(e["Path"], "->", e["Text"])
        csvs = [n for n in z.namelist() if n.endswith(".csv")]
    return cells, csvs

For most pipelines I skip walking the JSON tree by hand and load the CSV rendition straight into pandas, because Adobe already did the grid reconstruction. The JSON is there when I need bounding boxes or reading order. The CSV is there when I only want the rows.
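
As a dependency-free illustration of the CSV route, here is a sketch that reads every CSV rendition out of the result ZIP using the standard csv module in place of pandas. load_csv_tables is a hypothetical name, and the sketch assumes the CSV files are UTF-8 encoded, which I have not verified against Adobe's documentation.

```python
import csv
import io
import zipfile

def load_csv_tables(zip_path):
    """Map each CSV name inside the result ZIP to its rows (lists of strings)."""
    tables = {}
    with zipfile.ZipFile(zip_path) as z:
        for name in sorted(z.namelist()):
            if not name.lower().endswith(".csv"):
                continue
            # utf-8-sig also strips a byte-order mark if one is present.
            text = z.read(name).decode("utf-8-sig")
            tables[name] = list(csv.reader(io.StringIO(text, newline="")))
    return tables
```

Each value is a list of rows, so the first row can be treated as the header, or the same text can be handed to pandas.read_csv when a DataFrame is more convenient.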

OCR for the scans

Extract runs OCR on a scanned PDF on its own. When I only want a searchable PDF back rather than structured data, I call the OCR operation directly at /operation/ocr with an ocrLang of en-US or hi-IN. It returns a new PDF with a real text layer behind the image. I run that as a normalizing step before anything else touches the file, because a scan with a text layer behaves like a digital PDF for every downstream tool, including the free offline ones.

# same upload and poll pattern, only the operation endpoint changes
POST https://pdf-services.adobe.io/operation/ocr
{ "assetID": "<id>", "ocrLang": "hi-IN" }

Real limits and what broke

The free tier is about 500 document transactions a month, which has comfortably covered a steady flow of invoices and statements for me without spending a rupee. Each operation is one transaction, so an extract plus a separate OCR pass on the same file counts as two. My Acrobat access happens to ride on an unrelated subscription, but the PDF Services free allotment is the standard developer tier and does not depend on that.

Two things bit me. First, I tried installing the official pdfservices-sdk and hit the PEP 668 externally-managed-environment block on Python 3.12.3. Rather than fight pip, I dropped the SDK and went straight to REST with httpx, which is leaner anyway and is the code above. Second, the extract response does not always put the download URL in the same place. Depending on the options, it lands under asset.downloadUri, or content.downloadUri, or a bare downloadUri, so I check all three rather than assume one shape. I run the whole thing wrapped behind the [email protected] server SDK so my agent can call extract and OCR as named tools, but the REST flow above is the entire engine and works fine as a plain script.
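
The three-way check for the download URL can be sketched like this. find_download_uri is a hypothetical helper name, and the lookup order simply follows the order the three shapes are listed in above; it assumes asset and content, when present, are JSON objects.

```python
def find_download_uri(body: dict):
    """Return the first downloadUri found in the three known shapes, else None."""
    candidates = (
        (body.get("asset") or {}).get("downloadUri"),
        (body.get("content") or {}).get("downloadUri"),
        body.get("downloadUri"),
    )
    for uri in candidates:
        if uri:
            return uri
    return None
```

Returning None instead of raising lets the caller log the whole response body when no URL is found, which is the quickest way to spot a fourth shape.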

When to skip it

If your input is a clean digital PDF with a real text layer and simple tables, pdfplumber is free, offline, and good enough, and I use it there every day. Reach for Adobe Extract when the file is scanned, when the tables carry merged or rotated headers, or when you need OCR and structure in a single call. On that messy class of document, the gap over the open-source libraries is wide enough that the cloud round trip pays for itself, and at about 500 free transactions a month, a small automation pipeline never reaches the meter.
