Published on January 15, 2026

by AlamedaDev Team

Extracting Massive Tables from PDFs with Reedy

Document extraction fails in predictable ways: not because the model can't read the document, but because we ask it to do too much in one shot, with no way to enforce completeness.

If you ask an LLM for "a JSON array of every row in this 50-page table," it will often return something that looks reasonable, passes a spot-check, and is quietly incomplete. Sometimes it truncates. Sometimes it smooths over missing segments with plausible noise. Either way, the downstream system inherits the error. That's fine for summaries. It's unacceptable for ETL.

Document-level structured output is misaligned with repeating entities

Most extraction pipelines start with a clean idea: define one schema for the document, run a single call, and get a single JSON object back. It's neat, it's easy to wire up, and it works well when you're extracting a handful of fields.

The trouble starts when the "document" is effectively a database rendered as PDF: provider directories, coverage grids, rate sheets, product catalogs. In those cases, completeness matters, and completeness is exactly what a document-level JSON blob is worst at delivering, because the output grows large, the content is repetitive, and the model has no built-in incentive to enumerate every item. If you need all rows, "close enough" isn't close enough.

Extract the unit that repeats

When the content repeats, your extraction strategy should repeat too.

Instead of asking the model to produce one giant list, define a schema for one entity (one row / one listing / one block) and extract entity-by-entity, then aggregate. Repeating-entity extraction isn't a prompt trick. It's an alignment choice: you match the extraction granularity to the structure of the document. In Reedy, this is Page Mode.

What Page Mode does

Page Mode isn't "split by page," and it's not naive chunking. It's segmentation plus validation-friendly structure:

  • It identifies the repeating unit: table rows, list entries, repeated blocks.

  • It segments the document into small, entity-sized segments (often a handful of entities at a time).

  • It applies your per-entity schema to each segment.

  • It aggregates the results into a list you can count, store, and query.

You're no longer betting that the model will stay disciplined across a massive list. You're giving it a smaller job, repeatedly, and making it possible to verify that it did the whole job.
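In rough Python terms, the shape of that loop is simple. A minimal sketch, where segments and extract_fn stand in for whatever segmentation and per-segment extraction your pipeline actually uses:

from typing import Callable, Iterable, TypeVar

from pydantic import BaseModel

E = TypeVar("E", bound=BaseModel)

def extract_repeating(segments: Iterable[str], extract_fn: Callable[[str], list[E]]) -> list[E]:
    # Give the model a small, bounded job per segment...
    entities: list[E] = []
    for segment in segments:
        entities.extend(extract_fn(segment))
    # ...and aggregate into a concrete list you can count, store, and query.
    return entities

The helper names don't matter; what matters is that each generation is small and the aggregate is a list whose length you can check.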

Example: extracting 380+ rows from a hospital coverage PDF

A good stress test is a document that's "just a table," stretched across many pages: a hospital-by-county listing with plan coverage for each hospital. The table is large enough that naive document-level extraction tends to return the first chunk and then degrade.

Page Mode is a good fit because each entity is local: one hospital row contains the county, hospital name, and plan names.

Define the schema for one hospital:

from pydantic import BaseModel, Field

class Hospital(BaseModel):
    county: str = Field(description="County name")
    hospital_name: str = Field(description="Hospital name")
    plan_names: list[str] = Field(description="Plans accepted at this hospital")

Then extract per row:

import os

from reedy import ReedyClient  # replace with your real SDK import

reedy = ReedyClient(api_key=os.environ["REEDY_API_KEY"])  # key from the environment, not hard-coded

result = reedy.extract(
    file="hospital-coverage.pdf",
    schema=Hospital,
    mode="page"
)

print(len(result.items), result.items[:2])  # total row count plus a small sample to spot-check

Two operational rules make this production-safe:

  • Count the extracted entities and compare the total to an expected count (from the PDF index, a known row total, or a sampled tally), as in the sketch below.

  • Treat any shortfall as a failure, not 'good enough.'
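A minimal version of that check, continuing from the result object in the extraction call above (the expected total comes from whatever source of truth you trust):

EXPECTED_ROWS = 380  # from the PDF index, a known row total, or a sampled tally

extracted = len(result.items)
if extracted < EXPECTED_ROWS:
    # Fail the pipeline loudly instead of shipping a silently truncated dataset.
    raise ValueError(
        f"Incomplete extraction: got {extracted} rows, expected at least {EXPECTED_ROWS}"
    )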

Page Mode's value isn't that it makes extraction possible; it's that it makes completeness testable.

Not just for tables: catalogs and repeated blocks

A lot of PDFs aren't strict grids, but they still have repeated structure: each product has a code, a name, a few specs, a paragraph of description. Visually, it's consistent; semantically, it's a repeating entity.

Page Mode works there too, as long as the entity boundary is discoverable and the information is mostly local to the block.

from pydantic import BaseModel, Field

class CatalogItem(BaseModel):
    section: str = Field(description="Category / section header")
    sku: str = Field(description="Product code / SKU")
    name: str = Field(description="Product name")
    specs: str = Field(description="Key specs")
    description: str = Field(description="Description text")

Same workflow: extract per entity, aggregate into a list, verify count, then ETL.
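As one sketch of that ETL step, a count-verified list of CatalogItem objects loads cleanly into a relational table; here's what that might look like with SQLite (the table name and column layout are illustrative):

import sqlite3

def load_catalog(items: list[CatalogItem], db_path: str = "catalog.db") -> None:
    # Create a table mirroring the per-entity schema and bulk-insert the rows.
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            """CREATE TABLE IF NOT EXISTS catalog_items (
                   section TEXT, sku TEXT, name TEXT, specs TEXT, description TEXT
               )"""
        )
        conn.executemany(
            "INSERT INTO catalog_items VALUES (?, ?, ?, ?, ?)",
            [(i.section, i.sku, i.name, i.specs, i.description) for i in items],
        )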

Reedy's UI workflow is designed to handle that pattern without needing an API.

Step 1: Upload the document

Open the General Prompt Agent and upload your PDF.

Once it appears under Select Documents and the status indicates it's ready, you can run extraction.

Step 2: Use a prompt that matches the entity you want

For repeating tables, the prompt should define what one row represents and what fields you want per entry.

Here's an example for a hospital-by-county coverage document:

Extract a list of hospitals organized by county, showing which BSC (Blue Shield of California)
health plans are available at each hospital. For each hospital entry, provide the county,
hospital name, and list of available plans (Trio HMO, SaveNet, Access+ HMO, BlueHPN PPO,
Tandem PPO, PPO).

Step 3: Enable "Process by Page"

For long documents, turn on Process by Page.

This is the UI equivalent of "don't do one giant generation." It keeps the extraction bounded, reduces the chance of "first chunk only" results, and makes it easier to spot where things degrade.

Step 4: Set Output Format to JSON and run

Choose Output Format: JSON, then click Submit Prompt.

You'll get a JSON array back. Depending on your prompt, Reedy may return either:

  • a normalized structure (e.g., plan_names: [...]), or

  • a table-like structure (e.g., one field per plan with blanks when unavailable).

Both are usable. If you're loading into a database, the normalized shape is usually easier.
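If you get the table-like shape and want the normalized one, the conversion is a small post-processing step. A sketch, assuming the parsed JSON rows are dicts keyed by county, hospital_name, and the plan column names from the prompt above, with blank values meaning the plan isn't available:

# Plan column names from the example prompt; adjust to your document.
PLAN_COLUMNS = ["Trio HMO", "SaveNet", "Access+ HMO", "BlueHPN PPO", "Tandem PPO", "PPO"]

def normalize_row(row: dict) -> dict:
    # Collapse one-column-per-plan output into a single plan_names list.
    return {
        "county": row["county"],
        "hospital_name": row["hospital_name"],
        "plan_names": [plan for plan in PLAN_COLUMNS if row.get(plan)],
    }

# table_like_rows: the JSON array returned by Reedy, parsed into dicts.
normalized = [normalize_row(r) for r in table_like_rows]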

What this enables

Once you can reliably extract every row, PDFs stop being dead artifacts and start being data sources:

  • ETL: transform extracted data into structured database formats.

  • Queries like "show all hospitals in County X with Plan Y" (see the sketch below).

  • Change tracking: what changed this month?

  • Views and interfaces that match how users actually use the information.
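For instance, once the rows land in a database, the "County X with Plan Y" question becomes an ordinary query. A sketch against a hypothetical hospital_plans table with one row per county/hospital/plan combination:

import sqlite3

def hospitals_with_plan(db_path: str, county: str, plan: str) -> list[str]:
    # Assumes a hospital_plans(county, hospital_name, plan) table built from the
    # normalized extraction output, exploded to one row per hospital/plan pair.
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT hospital_name FROM hospital_plans WHERE county = ? AND plan = ?",
            (county, plan),
        ).fetchall()
    return [name for (name,) in rows]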

Page Mode is the bridge from "a PDF that looks like a table" to "a dataset you can trust."

Page Mode assumes entity locality. If the meaning of a row depends heavily on a header, a legend, or global context that isn't repeated near the row, you have two common options:

  • Inject the shared context (the header, legend, or other globals) into each segment, so the entity extraction remains local.

  • Run two passes: one doc-level pass for globals, one row-level pass for items (sketched below).
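The two-pass version might look like this, reusing the hypothetical client and Hospital schema from earlier. DocumentGlobals, its fields, and the shape of the result objects are illustrative assumptions, not a documented API:

from pydantic import BaseModel, Field

class DocumentGlobals(BaseModel):
    legend: str = Field(description="Abbreviation/symbol legend from the document header")
    effective_date: str = Field(description="Date the coverage information applies to")

# Pass 1: one document-level call for the context that isn't repeated near each row.
doc_globals = reedy.extract(file="hospital-coverage.pdf", schema=DocumentGlobals)

# Pass 2: the usual row-level extraction.
rows = reedy.extract(file="hospital-coverage.pdf", schema=Hospital, mode="page")

# Merge at aggregation time: attach the globals to every extracted row.
records = [
    {**hospital.model_dump(), "effective_date": doc_globals.items[0].effective_date}
    for hospital in rows.items
]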

The point is to make completeness and correctness something your pipeline can enforce—not something you discover after shipping.

Try it today

Visit usereedy.com to get started with Reedy.
