Programmatic PDF Table Extraction and OCR with Adobe PDF Services REST: The Auth, the Extract Call, and Parsing the Output
Why Adobe's table-layout extraction holds up on messy real-world PDFs where pdfplumber and pypdf fall apart, with the full S2S auth, extract, and parse path I run in production.

I keep a folder of PDFs that break parsers. A scanned two-column vendor invoice, a card statement with rotated column headers, a utility bill where the amount table sits inside a bordered box. I ran all three through pdfplumber first, because it is free, runs offline, and is usually my first reach. On a clean digital invoice it was fine. On the scan it returned nothing, because pdfplumber reads the existing text layer and a scan has no text layer at all. On the statement with merged header cells, extract_table() handed back ragged rows where two columns had collapsed into one and the totals row drifted into the wrong column. pypdf was worse for this, because it is built for page text and structure, not for grid reconstruction.
Then I ran the same three files through Adobe PDF Services Extract over REST, which I have wrapped as a local tool on my Linux box. The scan came back with real rows and columns, because Adobe runs OCR before extraction. The merged-header statement kept its grid. The bordered utility table came out as a clean CSV with the totals row in the right place. That gap is the entire reason this tutorial exists.
Why the table-layout extraction wins on messy input
Adobe documents the behavior in their Extract how-to, but the short version is that Extract reconstructs table structure and reading order instead of just dumping a text layer. On messy input that is a different and much harder job.
| Capability | pdfplumber | pypdf | Adobe PDF Services Extract |
|---|---|---|---|
| Reads scanned image PDF | No text layer, returns nothing | Same | OCR runs first, returns text plus tables |
| Merged or rotated headers | Often collapses columns | Not built for tables | Keeps cell structure with NumRow, NumCol |
| Reading order, multi-column | Manual, sort by x-coordinate | Manual | Reconstructed into the output Path |
| Output format | Python lists | Page text | structuredData.json plus CSV and XLSX |
| Cost | Free, offline | Free, offline | Free tier, about 500 transactions a month |
| Runs offline | Yes | Yes | No, cloud REST call |
The trade is plain. pdfplumber and pypdf are free and run on your own machine, and for a clean digital PDF with a real text layer they are the right tool, and I still reach for them first. Adobe earns its place the moment the input is scanned, rotated, multi-column, or carries the kind of merged-cell tables that real statements and invoices ship with.
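The routing rule in that paragraph can be sketched as a small pure function. This is a minimal sketch under assumptions: `pick_extractor` and the 20-character threshold are hypothetical, and the per-page strings are whatever your offline library returned for each page (for example the result of pdfplumber's `page.extract_text()`).

```python
def pick_extractor(page_texts, min_chars_per_page=20):
    """Return "offline" when every page has a usable text layer, else "adobe".

    page_texts: one string (or None) per page from an offline extractor.
    min_chars_per_page is an illustrative guess, not a tuned value.
    """
    if not page_texts:
        # Nothing was extracted at all, so treat the file as a scan.
        return "adobe"
    for text in page_texts:
        if len((text or "").strip()) < min_chars_per_page:
            # At least one page has no real text layer: it needs OCR.
            return "adobe"
    return "offline"
```

This only catches missing text layers. It cannot tell that a digital PDF has merged or rotated headers, so those files still need a manual rule or a sanity check on the extracted rows.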
The S2S auth flow
Adobe retired JWT service-account auth, so any project you create now uses OAuth Server-to-Server. You provision a client ID and client secret in the Adobe Developer Console, store them in a file outside your repo with chmod 600, and exchange them for a bearer token against the IMS endpoint. The grant type is client_credentials. No browser, no redirect, no logged-in user.
curl -s -X POST 'https://ims-na1.adobelogin.com/ims/token/v3' \
  -H 'Content-Type: application/x-www-form-urlencoded' \
  -d 'grant_type=client_credentials' \
  -d "client_id=$ADOBE_CLIENT_ID" \
  -d "client_secret=$ADOBE_CLIENT_SECRET" \
  -d 'scope=openid,AdobeID,DCAPI'
The token is good for roughly 24 hours. In production I cache it to a chmod 600 file and refresh it 300 seconds before expiry, so a long batch never dies mid-run on a token that expired between files. Every later call carries two headers: Authorization: Bearer <token> and x-api-key: <client_id>. Drop the x-api-key and you get a 401 that never names the missing header, which makes the failure confusing the first time you hit it.
import json, time, os, httpx
from pathlib import Path

IMS = "https://ims-na1.adobelogin.com/ims/token/v3"
CACHE = Path("~/.cache/adobe-pdf/token.json").expanduser()

def get_token(client_id: str, client_secret: str) -> str:
    # Reuse the cached token while it is still inside its safety margin.
    if CACHE.exists():
        c = json.loads(CACHE.read_text())
        if c["expires_at"] > int(time.time()):
            return c["access_token"]
    # Otherwise exchange the client credentials for a fresh bearer token.
    r = httpx.post(IMS, data={
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": "openid,AdobeID,DCAPI",
    }, timeout=60.0)
    r.raise_for_status()
    body = r.json()
    CACHE.parent.mkdir(parents=True, exist_ok=True)
    # Store the expiry 300 seconds early so the token is refreshed before it lapses.
    CACHE.write_text(json.dumps({
        "access_token": body["access_token"],
        "expires_at": int(time.time()) + int(body["expires_in"]) - 300,
    }))
    os.chmod(CACHE, 0o600)
    return body["access_token"]
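Because a missing x-api-key surfaces as an unhelpful 401, it can be worth failing locally before the request ever leaves the machine. A minimal sketch, assuming a hypothetical helper named `auth_headers` that is not part of the original tool:

```python
def auth_headers(token: str, client_id: str) -> dict:
    """Build the two headers every PDF Services call needs.

    Raises ValueError early instead of letting the API answer with a 401.
    """
    if not token:
        raise ValueError("missing bearer token")
    if not client_id:
        raise ValueError("missing client_id for the x-api-key header")
    return {"Authorization": f"Bearer {token}", "x-api-key": client_id}
```

Every later request can then pass `headers=auth_headers(token, client_id)` rather than rebuilding the dictionary by hand.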
Upload, extract, poll
Every operation follows the same three beats. Upload the file to get an asset ID, start the operation, then poll a job URL until it reports done. The API base is https://pdf-services.adobe.io. Upload is itself two steps: ask for a pre-signed URL, then PUT the bytes to it.
# Uses httpx and time, imported with the token helper above.
BASE = "https://pdf-services.adobe.io"

def upload(token, client_id, path, media_type="application/pdf"):
    h = {"Authorization": f"Bearer {token}", "x-api-key": client_id}
    # Step 1: ask for an asset ID and a pre-signed upload URL.
    init = httpx.post(f"{BASE}/assets",
                      json={"mediaType": media_type},
                      headers={**h, "Content-Type": "application/json"}, timeout=60.0)
    init.raise_for_status()
    j = init.json()
    asset_id, put_uri = j["assetID"], j["uploadUri"]
    # Step 2: PUT the raw bytes to the pre-signed URL (no auth headers here).
    with open(path, "rb") as f:
        httpx.put(put_uri, content=f.read(),
                  headers={"Content-Type": media_type}, timeout=180.0).raise_for_status()
    return asset_id

def extract(token, client_id, asset_id):
    h = {"Authorization": f"Bearer {token}", "x-api-key": client_id}
    op = httpx.post(f"{BASE}/operation/extractpdf",
                    json={"assetID": asset_id,
                          "elementsToExtract": ["text", "tables"],
                          "elementsToExtractRenditions": ["tables"]},
                    headers={**h, "Content-Type": "application/json"}, timeout=60.0)
    if op.status_code >= 400:
        raise RuntimeError(f"extract HTTP {op.status_code}: {op.text[:300]}")
    # The job URL to poll comes back in the Location header.
    job = op.headers["location"]
    # 80 tries at 1.5 seconds each caps a single job at 120 seconds.
    for _ in range(80):
        s = httpx.get(job, headers=h, timeout=60.0)
        s.raise_for_status()
        b = s.json()
        if b["status"] == "done":
            return b
        if b["status"] == "failed":
            raise RuntimeError(b.get("error", b))
        time.sleep(1.5)
    raise TimeoutError(f"extract job still running after 80 polls: {job}")
The poll interval I run is 1.5 seconds, and the client gives up after 80 tries, which caps any single job at 120 seconds. In my testing, most extract jobs on a multi-page document finish inside the first few cycles, call it roughly 7 seconds end to end including the upload, and a single-page file is quicker. The elementsToExtract list is the lever. Pass ["text"] for plain reading order, add "tables" for cell structure, and set "elementsToExtractRenditions": ["tables"] to also receive every table as a real CSV and XLSX file. With renditions on, the result is a ZIP instead of a bare JSON, so I sniff the first four bytes for PK\x03\x04 to tell them apart rather than trusting the filename extension.
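The magic-byte check described above can be sketched as follows. The helper names `is_zip` and `result_suffix` are hypothetical, chosen for illustration. The four bytes are the standard ZIP local-file-header signature.

```python
ZIP_MAGIC = b"PK\x03\x04"  # local file header signature of a non-empty ZIP

def is_zip(payload: bytes) -> bool:
    """True when the downloaded bytes start with the ZIP signature."""
    return payload[:4] == ZIP_MAGIC

def result_suffix(payload: bytes) -> str:
    """Pick a file suffix from the content, not from the URL or filename."""
    return ".zip" if is_zip(payload) else ".json"
```

An empty archive starts with a different signature (`PK\x05\x06`), which this sketch deliberately ignores because an extract result with renditions always contains at least structuredData.json.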
Parsing the structured output
The done response carries a download URL for the result. Inside the ZIP are structuredData.json and a tables/ folder. The JSON has an elements array, and the two fields that matter most are Path and Text. Path reads like an HTML tree, for example //Document/Table[2]/TR[3]/TD[1]/P, so filtering for table cells is a substring match. Tables exported as renditions carry a filePaths attribute pointing at the CSV and XLSX inside the ZIP. Bounds are reported as a bounding box in 72 dpi PDF coordinates, which is what you want for laying text back onto a page. The field-by-field breakdown is in Adobe's Extract output reference.
import json, zipfile

def read_tables(zip_path):
    with zipfile.ZipFile(zip_path) as z:
        data = json.loads(z.read("structuredData.json"))
        # Every CSV rendition shipped alongside the JSON.
        csvs = [n for n in z.namelist() if n.endswith(".csv")]
    # Table cells are the elements whose Path passes through a Table node.
    cells = [e for e in data["elements"]
             if "Table" in e.get("Path", "") and e.get("Text")]
    for e in cells:
        print(e["Path"], "->", e["Text"])
    return cells, csvs
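When the substring match is not enough, the Path can be parsed into table, row, and column indices to rebuild a grid. This is a minimal sketch with hypothetical names (`cell_position`, `build_grids`). It assumes a step without a bracketed index, such as /Table or /TR, means the first sibling, it treats TH and TD cells alike, and it ignores row and column spans.

```python
import re
from collections import defaultdict

# Matches .../Table[2]/TR[3]/TD[1]/...; a missing [n] is treated as 1 (assumption).
CELL_RE = re.compile(r"/Table(?:\[(\d+)\])?/TR(?:\[(\d+)\])?/T[DH](?:\[(\d+)\])?")

def cell_position(path):
    """Return (table, row, col) as 1-based ints, or None for non-cell paths."""
    m = CELL_RE.search(path)
    if m is None:
        return None
    return tuple(int(g) if g else 1 for g in m.groups())

def build_grids(elements):
    """Group cell text by table and lay it out as a list of rows per table."""
    parts = defaultdict(list)
    for e in elements:
        pos = cell_position(e.get("Path", ""))
        if pos and e.get("Text"):
            parts[pos].append(e["Text"].strip())
    tables = {}
    for table in sorted({t for t, _, _ in parts}):
        cells = {(r, c): " ".join(v) for (t, r, c), v in parts.items() if t == table}
        n_rows = max(r for r, _ in cells)
        n_cols = max(c for _, c in cells)
        # Missing cells (for example under a merged header) become empty strings.
        tables[table] = [[cells.get((r, c), "") for c in range(1, n_cols + 1)]
                         for r in range(1, n_rows + 1)]
    return tables
```

Several paragraphs inside one cell share the same position, so their text is joined with a space rather than overwriting each other.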
For most pipelines I skip walking the JSON tree by hand and load the CSV rendition straight into pandas, because Adobe already did the grid reconstruction. The JSON is there when I need bounding boxes or reading order. The CSV is there when I only want the rows.
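Reading the CSV renditions straight out of the result ZIP might look like the sketch below. It uses the standard-library csv module so it runs without extra installs; the same decoded text can be handed to pandas.read_csv through an io.StringIO if you prefer a DataFrame. The function name `read_csv_renditions` is hypothetical, and the UTF-8 decoding with an optional byte-order mark is an assumption about the files, not something the source confirms.

```python
import csv
import io
import zipfile

def read_csv_renditions(zip_source):
    """Return {member name: list of rows} for every .csv inside the result ZIP.

    zip_source can be a path or a binary file-like object.
    """
    tables = {}
    with zipfile.ZipFile(zip_source) as z:
        for name in sorted(z.namelist()):
            if not name.lower().endswith(".csv"):
                continue
            # utf-8-sig strips a byte-order mark if present and is harmless otherwise.
            text = z.read(name).decode("utf-8-sig")
            tables[name] = list(csv.reader(io.StringIO(text, newline="")))
    return tables
```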
OCR for the scans
Extract runs OCR on a scanned PDF on its own. When I only want a searchable PDF back rather than structured data, I call the OCR operation directly at /operation/ocr with an ocrLang of en-US or hi-IN. It returns a new PDF with a real text layer behind the image. I run that as a normalizing step before anything else touches the file, because a scan with a text layer behaves like a digital PDF for every downstream tool, including the free offline ones.
# same upload and poll pattern, only the operation endpoint changes
POST https://pdf-services.adobe.io/operation/ocr
{ "assetID": "<id>", "ocrLang": "hi-IN" }
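Since only the endpoint and the body change between operations, the request can be built by one small function and fed to the same post-then-poll code. A minimal sketch, assuming a hypothetical helper named `operation_request`; the endpoint names and the ocrLang field are the ones used in this article.

```python
BASE = "https://pdf-services.adobe.io"

def operation_request(operation, asset_id, **params):
    """Return (url, json_body) for an operation such as "ocr" or "extractpdf"."""
    if not asset_id:
        raise ValueError("asset_id is required; upload the file first")
    url = f"{BASE}/operation/{operation}"
    body = {"assetID": asset_id, **params}
    return url, body
```

For example, `operation_request("ocr", asset_id, ocrLang="en-US")` yields the URL and body shown above, ready to pass to an HTTP client as `json=body`.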
Real limits and what broke
The free tier is about 500 document transactions a month, which has comfortably covered a steady flow of invoices and statements for me without spending a rupee. Each operation is one transaction, so an extract plus a separate OCR pass on the same file counts as two. My Acrobat access happens to ride on an unrelated subscription, but the PDF Services free allotment is the standard developer tier and does not depend on that.
Two things bit me. First, I tried installing the official pdfservices-sdk and hit the PEP 668 externally-managed-environment block on Python 3.12.3. Rather than fight pip, I dropped the SDK and went straight to REST with httpx, which is leaner anyway and is the code above. Second, the extract response does not always put the download URL in the same place. Depending on the options, it lands under asset.downloadUri, or content.downloadUri, or a bare downloadUri, so I check all three rather than assume one shape. I run the whole thing wrapped behind the [email protected] server SDK so my agent can call extract and OCR as named tools, but the REST flow above is the entire engine and works fine as a plain script.
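Checking all three locations for the download URL can be sketched like this. The function name `find_download_uri` is hypothetical, and the lookup order simply follows the order the shapes are listed in above.

```python
def find_download_uri(body: dict) -> str:
    """Return the result URL from a finished job response, whichever shape it has."""
    candidates = []
    for key in ("asset", "content"):
        section = body.get(key)
        if isinstance(section, dict):
            candidates.append(section.get("downloadUri"))
    # Last resort: a bare top-level downloadUri.
    candidates.append(body.get("downloadUri"))
    for uri in candidates:
        if uri:
            return uri
    raise KeyError(f"no downloadUri in response; top-level keys: {sorted(body)}")
```

Raising with the top-level keys in the message makes a new, unexpected response shape easy to spot in the logs.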
When to skip it
If your input is a clean digital PDF with a real text layer and simple tables, pdfplumber is free, offline, and good enough, and I use it there every day. Reach for Adobe Extract when the file is scanned, when the tables carry merged or rotated headers, or when you need OCR and structure in a single call. On that messy class of document, the gap over the open-source libraries is wide enough that the cloud round trip pays for itself, and at about 500 free transactions a month, a small automation pipeline never reaches the meter.