Structuring 10-K filings for medical sales enablement
A sales enablement startup in the medical space needed structured financial and operational data from thousands of hospital and medical company 10-K filings. Doclo built a multi-layered extraction pipeline that processes 100-page documents for under $1 each, with 95%+ accuracy on financial fields.
95%+ accuracy on objective fields
<$1 cost per 10-K processed
50+ fields extracted per document
5,000 10-Ks processed

The problem with 10-K filings
Public companies file annual 10-K reports with the SEC. These documents contain a comprehensive picture of the business: audited financial statements, revenue breakdowns, risk factors, property holdings, legal proceedings, management discussion, and more. For anyone trying to understand a company's financial health and operational profile, the 10-K is the authoritative source.
For a sales enablement startup serving the medical industry, this data was exactly what their clients needed. Sales teams selling into hospitals and healthcare systems want to know the financial position, growth trajectory, risk profile, and operational footprint of the organizations they are targeting. That information exists in 10-K filings. The problem is getting it out.
A typical 10-K runs 80 to 150 pages. The format varies significantly between companies. Financial tables may be structured or embedded in narrative text. Risk factors are written in dense legal prose. Property and facility data can appear in tables, lists, or buried in footnotes. There is no universal template.
The startup had explored feeding 10-Ks directly into large language models, but the results were not reliable enough to present to paying customers. LLMs would return plausible-sounding data that was wrong in ways that were difficult to catch at scale: numbers pulled from the wrong fiscal year, figures attributed to the wrong line item, or risk factors summarized in ways that lost critical nuance. For a product built on trust in the underlying data, "mostly right" was not good enough.
Without a consistent, scalable way to extract this data, the startup simply could not offer it. The feature was on the roadmap but blocked on the extraction problem. Their goal was to process approximately 5,000 10-K filings across the medical and hospital sector.
What we built
Doclo built a multi-layered extraction pipeline designed around the specific challenges of 10-K filings: length, format variation, and the mix of structured financial data alongside less standardized narrative content.
Document parsing and HTML conversion
The first step converts each 10-K into a clean HTML representation. SEC filings arrive in a range of formats, from XBRL-tagged documents to plain HTML to PDF exports. The parsing layer normalizes all of these into a consistent structure that preserves tables, headings, and document hierarchy. This gives every downstream step a reliable, searchable representation of the full document.
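The normalization step can be sketched with the standard library alone. This is a minimal illustration, assuming the filing has already been fetched as HTML (real filings also arrive as XBRL or PDF, which would need their own converters); the `FilingOutliner` class and its output shape are hypothetical, not the production parser.

```python
# Walk a filing's HTML and record headings and table cells in document
# order, so downstream layers can search a consistent representation.
from html.parser import HTMLParser

class FilingOutliner(HTMLParser):
    """Collects (kind, text) pairs for headings and table cells."""
    def __init__(self):
        super().__init__()
        self.outline = []   # (kind, text) pairs in document order
        self._stack = []    # open tags we care about

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3", "td", "th"):
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            kind = "heading" if self._stack[-1].startswith("h") else "cell"
            self.outline.append((kind, text))

def outline_filing(html_text: str):
    parser = FilingOutliner()
    parser.feed(html_text)
    return parser.outline

sample = "<h2>Item 7. MD&amp;A</h2><table><tr><td>Revenue</td><td>$1,200</td></tr></table>"
print(outline_filing(sample))
# [('heading', 'Item 7. MD&A'), ('cell', 'Revenue'), ('cell', '$1,200')]
```

The key design point is that every downstream layer consumes the same ordered outline, regardless of whether the filing started life as XBRL, HTML, or PDF.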
Layer one: deterministic extraction for financial data
GAAP-standard financial fields follow predictable patterns. Revenue, net income, total assets, operating expenses, and similar line items appear in financial statements that, while formatted differently, use consistent terminology governed by accounting standards.
The first extraction layer uses regex and keyword matching to locate and extract these objective financial fields. This approach is fast, inexpensive, and highly accurate for data that follows known patterns. It handles the bulk of the structured financial data without requiring any language model inference at all.
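A minimal sketch of this deterministic layer, with regex patterns keyed to GAAP terminology pulling dollar figures out of normalized statement text. The field names and patterns here are illustrative assumptions, not the production pattern set.

```python
# Deterministic extraction: known GAAP line items matched by pattern,
# no language model inference involved.
import re

FIELD_PATTERNS = {
    "total_revenue": re.compile(r"(?:total\s+)?(?:net\s+)?revenues?\s*\$?\s*([\d,]+)", re.I),
    "net_income": re.compile(r"net\s+income\s*(?:\(loss\))?\s*\$?\s*([\d,]+)", re.I),
    "total_assets": re.compile(r"total\s+assets\s*\$?\s*([\d,]+)", re.I),
}

def extract_financials(text: str) -> dict:
    """Return {field: int} for every pattern that matches; commas stripped."""
    out = {}
    for field, pattern in FIELD_PATTERNS.items():
        m = pattern.search(text)
        if m:
            out[field] = int(m.group(1).replace(",", ""))
    return out

statement = "Total revenues $ 4,218,305  Net income (loss) $ 312,447  Total assets $ 9,102,660"
print(extract_financials(statement))
# {'total_revenue': 4218305, 'net_income': 312447, 'total_assets': 9102660}
```

Because every value is anchored to a literal match in the source text, this layer cannot hallucinate a number that is not in the document.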
Layer two: agentic search for semi-structured data
Not everything in a 10-K is a clean financial table. Fields like the number of properties under management, bed counts, facility locations, and other operational metrics may appear in narrative sections, management discussion, or supplementary tables that vary in format from company to company.
The second layer uses an agentic approach: targeted search and extraction that scans the document for specific types of information, locating relevant sections and pulling out structured values. This layer handles the fields that are too variable for regex but still have a relatively clear "right answer" somewhere in the document.
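The search step of this layer can be sketched as keyword scoring over the parsed sections, with the best-scoring snippets handed to an extraction model. The field definitions and scoring function are illustrative assumptions; the production agent iterates (search, read, refine) rather than making a single pass.

```python
# Locate candidate sections for a semi-structured field by scoring each
# section of the parsed filing against that field's keywords.
FIELD_KEYWORDS = {
    "bed_count": ["licensed beds", "bed count", "beds in service"],
    "facility_locations": ["facilities located", "properties in", "operates hospitals"],
}

def find_candidate_sections(sections, field, top_k=3):
    """sections: list of (heading, body) pairs. Return up to top_k
    (heading, body) pairs whose text mentions the field's keywords."""
    keywords = FIELD_KEYWORDS[field]
    scored = []
    for heading, body in sections:
        text = (heading + " " + body).lower()
        score = sum(text.count(kw) for kw in keywords)
        if score:
            scored.append((score, heading, body))
    scored.sort(reverse=True)
    return [(h, b) for _, h, b in scored[:top_k]]

sections = [
    ("Item 1. Business", "We operate 46 hospitals with approximately 7,700 licensed beds."),
    ("Item 1A. Risk Factors", "Competition may reduce admissions."),
]
hits = find_candidate_sections(sections, "bed_count")
print(hits[0][0])   # the Business section scores highest for bed_count
```

Narrowing the document to a few candidate sections before any model call is also what keeps this layer cheap relative to feeding the whole filing to an LLM.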
Layer three: agentic extraction for subjective fields
The final layer tackles the most unstructured content: risk profile analysis, competitive positioning, strategic priorities, and similar fields where the answer is not a single number but a synthesis of narrative content.
This layer uses more capable models to read relevant sections and produce structured outputs. To guard against single-pass errors, the pipeline runs multiple independent extraction passes and compares them. If the passes agree on a characterization, confidence is high; if they diverge, the field is flagged for review. Every extracted value includes citations back to the specific sections of the source document, so end users can verify any data point against the original filing.
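The consensus check can be sketched as a majority vote over pass outputs, with the model call itself stubbed out. The `consensus` function and its return shape are illustrative; in the real pipeline each pass also carries its section citations.

```python
# Accept a subjective field only when independent extraction passes agree;
# otherwise flag it for human review rather than guessing.
from collections import Counter

def consensus(values, min_agreement=2):
    """values: outputs of independent extraction passes for one field."""
    top, count = Counter(values).most_common(1)[0]
    if count >= min_agreement:
        return {"value": top, "status": "accepted", "agreement": count}
    return {"value": None, "status": "flagged_for_review", "candidates": values}

# Three hypothetical passes over the same risk-factor section:
print(consensus(["high regulatory risk"] * 2 + ["moderate regulatory risk"]))
print(consensus(["a", "b", "c"]))  # no agreement -> flagged
```

The design choice is to trade a small amount of recall (flagged fields need review) for precision on the fields that ship to customers.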
Cost control through appropriate model selection
Using the most capable (and expensive) models for every field in a 100-page document would make the pipeline uneconomical at scale. By layering the approach, deterministic methods handle the high-volume structured data, mid-tier models handle semi-structured extraction, and the most capable models are reserved for the fields that genuinely require them. This keeps the total cost under $1 per 10-K filing, even for documents exceeding 100 pages with 50+ fields extracted.
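The economics of layering can be sketched with a back-of-the-envelope calculation. The field counts and per-field costs below are illustrative assumptions, not measured figures; the point is simply that most fields never touch an expensive model.

```python
# Hypothetical per-filing cost under tiered routing: each tier handles a
# share of the ~50 fields at a different assumed cost per field.
TIERS = {
    # tier: (fields handled, assumed cost per field in dollars)
    "deterministic": (30, 0.0),     # regex/keyword: no inference cost
    "mid_tier_model": (15, 0.01),   # semi-structured agentic search
    "frontier_model": (5, 0.08),    # subjective fields, multiple passes
}

def cost_per_filing(tiers):
    return sum(n * c for n, c in tiers.values())

total = cost_per_filing(TIERS)
fields = sum(n for n, _ in TIERS.values())
print(f"~${total:.2f} per filing for {fields} fields")
# ~$0.55 per filing for 50 fields
```

Inverting the routing, so that a frontier model saw all 50 fields, would multiply the inference cost several times over for no accuracy gain on the deterministic fields.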
Results
95%+ accuracy on objective financial fields
GAAP-standard financial data is extracted with high accuracy through deterministic methods, validated against the source document structure. No hallucinated numbers, no wrong fiscal years, no misattributed line items.
Well-cited subjective values
Risk profiles, strategic priorities, and operational characteristics are extracted with full citations to the source sections of the 10-K. Users can trace any data point back to the exact language in the original filing, which is critical for a product where customers are making business decisions based on the data.
Under $1 per 10-K processed
Processing a 100-page filing and extracting 50+ structured fields costs less than a dollar. At 5,000 filings, the total extraction cost for the startup's entire target dataset was a fraction of what a single analyst would cost for a month of manual work.
50+ fields per document
Each processed 10-K produces a structured dataset covering financial performance, operational metrics, risk factors, property and facility data, and strategic positioning. This is the data that was previously locked inside dense filings and unavailable to the startup's customers.
From blocked roadmap to shipped feature
The most significant result is not a metric. The startup had a feature that their customers wanted and that their sales team needed, but no way to build it reliably. The extraction pipeline unblocked that entirely, turning 5,000 dense regulatory filings into structured, trustworthy data that their application could surface to end users.
Ready to see similar results?
Talk to our team about how Doclo can fit into your workflow. No commitment, just a conversation.







