
Industry · 9 min read

The million-token moment for document intelligence

Every frontier model now claims a million-token context window. Only one can actually use it. Here is what that means for the future of document processing.


In March 2026, the last domino fell. Within the space of a month, all three frontier model families shipped million-token context windows: Gemini 3.1 Pro on February 19, GPT-5.4 on March 5, and Claude Opus 4.6's general availability announcement on March 13. The context window arms race, which consumed most of 2024 and 2025, appears to have reached its current ceiling.

For document intelligence, this convergence is the most significant infrastructure shift since transformer-based OCR replaced template matching. But the headline number hides a more interesting story. A million tokens is not a million tokens is not a million tokens. The gap between what a model accepts and what it can actually reason over is where the real battle for document AI is being fought.

The coherence gap

Here is the uncomfortable truth the marketing materials leave out: most models claiming a million-token context window cannot reliably use it.

Chroma's context rot research studied 18 leading LLMs and found that model reliability degrades at every context length increment, not just near the limit. A model with a 1M-token window still exhibits measurable degradation at 50K tokens. Performance drops are non-uniform and often sudden rather than gradual.

The benchmark that makes this concrete is MRCR v2, an 8-needle retrieval task developed by OpenAI. It buries eight similar pieces of information across a long synthetic conversation and asks the model to retrieve a specific one by ordinal position. The needles are generated from the same distribution as the surrounding text, so the model cannot rely on stylistic differences to locate them. It has to actually read, track, and distinguish.
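
To see the shape of the task, here is a toy sketch of a multi-needle probe. It is not the actual MRCR v2 harness, and the templates, topics, and make_probe helper are all invented for illustration; the point is that the needles are style-matched to the filler, so the model has to count occurrences rather than spot an outlier.

```python
import random

# Toy sketch of an 8-needle ordinal-retrieval probe (NOT the real MRCR v2
# harness). Needles use the same request templates as the filler, so style
# alone cannot reveal them; the topic "capybaras" is kept out of the filler
# pool only so this toy example stays unambiguous.
TEMPLATES = ["write a poem about {}", "state a fact about {}", "tell a story about {}"]
TOPICS = ["tapirs", "glaciers", "sourdough", "lighthouses", "origami",
          "tidal pools", "accordions", "mangroves", "fireflies", "meteor showers"]

def make_probe(n_needles: int = 8, n_filler: int = 200, seed: int = 0):
    rng = random.Random(seed)
    lines = [rng.choice(TEMPLATES).format(rng.choice(TOPICS)) for _ in range(n_filler)]
    needle = "write a poem about capybaras"
    for pos in sorted(rng.sample(range(n_filler), n_needles)):
        lines[pos] = needle                    # eight near-identical needles
    question = f"Return the 3rd occurrence of '{needle}', counting from the top."
    return "\n".join(lines), question
```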

At 1M tokens, the scores tell a stark story:

  • Claude Opus 4.6: 76–78.3%
  • Gemini 3.1 Pro: 26.3%
  • GPT-5.4: Strong single-needle performance, weaker on multi-needle retrieval

At 128K tokens, the field is much closer. Both Gemini 3.1 Pro and Opus 4.6 score around 84.9%. The divergence happens as context grows, and it happens fast.

This matters enormously for document processing. A mortgage application package, a legal discovery set, or a trade compliance bundle is not a single-needle problem. You need to cross-reference the borrower's stated income on page 12 with the tax return on page 47 and the bank statement on page 93. You need to find the clause on page 5 that contradicts the amendment on page 300. These are multi-needle tasks by nature.

What a million tokens actually holds

Let's be specific about what fits in a million-token window, because the abstraction obscures the practical reality:

  • Roughly 1,300 pages of dense text
  • Up to 600 PDF pages with layout information (Opus 4.6's new limit, up from 100)
  • An entire mortgage underwriting package with all supporting documents
  • A full regulatory filing with exhibits and appendices
  • A complete set of trade documentation for a shipment: bill of lading, commercial invoice, packing list, certificate of origin, letter of credit, insurance certificate, and customs declarations

That last example is worth pausing on. Trade documentation has historically required splitting documents across multiple extraction calls, then stitching the results together with custom logic to resolve cross-document references. With a coherent million-token window, you can process the entire package in a single pass. The bill of lading says 450 units. The commercial invoice says 450 units. The packing list says 445. A model that can hold all three documents simultaneously catches the discrepancy without orchestration code.
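
In code, the single-pass pattern is almost embarrassingly simple. This is a minimal sketch: call_llm stands in for whichever chat-completions API you use, and the JSON-list output format is an assumption (in production you would pin it down with structured outputs).

```python
import json

# Minimal sketch of single-pass cross-document reconciliation.
# call_llm() is a hypothetical helper wrapping your LLM API of choice.
def check_shipment_package(docs: dict[str, str]) -> list[dict]:
    """docs maps document names ('bill_of_lading', ...) to their extracted text."""
    sections = "\n\n".join(f"=== {name} ===\n{text}" for name, text in docs.items())
    prompt = (
        "You are auditing a trade documentation package. Compare quantities, "
        "parties, dates, and totals across ALL documents below. Report every "
        "discrepancy as a JSON list of objects with keys: "
        "field, documents, values, severity.\n\n" + sections
    )
    # One call, whole package in context: a 450 vs 445 unit-count mismatch
    # across documents surfaces here without any orchestration code.
    return json.loads(call_llm(prompt))
```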

Context engineering is eating RAG

The rise of million-token context windows is accelerating a shift that was already underway. Gartner has called 2026 "the year of context," and the architecture that most document AI systems rely on, Retrieval-Augmented Generation (RAG), is being absorbed into something broader: context engineering.

The old RAG pattern was simple. Chunk your documents. Embed them. Retrieve the relevant chunks. Feed them to the model. This worked when context windows were 4K to 32K tokens and you had no choice but to be selective about what the model saw.
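
In code, the old pattern is only a few lines; embed() and call_llm() below are hypothetical stand-ins for your embedding and completion APIs.

```python
import numpy as np

# The classic RAG loop: chunk, embed, retrieve, feed.
# embed() and call_llm() are hypothetical helpers, not a real library API.
def rag_answer(question: str, documents: list[str], k: int = 5) -> str:
    # Fixed-width chunking: the boundary that can cut a table in half.
    chunks = [doc[i:i + 2000] for doc in documents for i in range(0, len(doc), 2000)]
    chunk_vecs = np.array([embed(c) for c in chunks])   # indexed offline in practice
    q = np.array(embed(question))
    sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
    top = [chunks[i] for i in np.argsort(sims)[-k:][::-1]]                # retrieve
    return call_llm(f"Context:\n{''.join(top)}\n\nQuestion: {question}")  # feed
```

Every line of that function is a place for information to fall out, which is exactly the failure surface described next.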

With a million tokens, the calculus changes. For document sets that fit in context, RAG introduces unnecessary complexity and a new failure mode: retrieving the wrong chunks. If your entire document package is 200K tokens, stuffing it directly into context gives the model complete information. No chunking boundaries to split a table across two segments. No embedding failures on domain-specific terminology. No retrieval misses on the paragraph that happened to use different vocabulary than the query.

But RAG is not dead. It is evolving into what practitioners now call agentic RAG, where retrieval becomes one step in a broader reasoning loop. The agent decides what to retrieve, evaluates whether the retrieved information is sufficient, identifies gaps, pulls additional context, and adapts its strategy based on what it finds. Retrieval is no longer the architecture. It is a tool the model reaches for when it needs to.
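
The loop shape matters more than any particular framework. A sketch, again with hypothetical retrieve() and call_llm() helpers:

```python
# Sketch of an agentic retrieval loop: retrieval is a tool inside a
# reasoning loop, not the architecture itself. retrieve() and call_llm()
# are hypothetical helpers.
def agentic_answer(question: str, max_rounds: int = 4) -> str:
    context: list[str] = []
    query = question
    for _ in range(max_rounds):
        context += retrieve(query, k=5)            # agent decides what to pull
        verdict = call_llm(
            "Using the context, either answer the question, or reply "
            "NEED:<follow-up query> if information is missing.\n\n"
            f"Context:\n{''.join(context)}\n\nQuestion: {question}"
        )
        if not verdict.startswith("NEED:"):
            return verdict                         # sufficient: answer and stop
        query = verdict.removeprefix("NEED:")      # gap found: adapt the query
    return call_llm(f"Answer as best you can.\nContext:\n{''.join(context)}\n\nQuestion: {question}")
```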

For document intelligence specifically, this means a hybrid approach is emerging (a minimal routing sketch follows the list):

  • Full-context processing for document packages under ~500K tokens where cross-document reasoning matters
  • Targeted extraction for high-throughput, single-document workloads where speed and cost matter more than cross-referencing
  • Agentic retrieval for document corpora too large for any context window, where the system dynamically retrieves and reasons across hundreds of documents
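
The three modes reduce to a routing decision. The thresholds below are illustrative, and count_tokens() is a stand-in for your tokenizer:

```python
# Illustrative routing logic for the hybrid approach. The 500K threshold
# mirrors the rule of thumb above; count_tokens() is a hypothetical helper.
FULL_CONTEXT_LIMIT = 500_000

def route(package: list[str], needs_cross_referencing: bool) -> str:
    total = sum(count_tokens(doc) for doc in package)
    if total <= FULL_CONTEXT_LIMIT and needs_cross_referencing:
        return "full_context"         # one pass, whole package in the window
    if len(package) == 1 and not needs_cross_referencing:
        return "targeted_extraction"  # fast, cheap, purpose-built model
    return "agentic_retrieval"        # corpus larger than any window
```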

The IDP market is splitting in two

The intelligent document processing (IDP) market, projected at $3.17 billion in 2026, is undergoing a structural split that long context windows are accelerating.

On one side: high-volume, template-driven extraction. Invoices, receipts, purchase orders. Documents with predictable layouts where the goal is speed, cost, and field-level accuracy. Traditional IDP vendors with optimized OCR pipelines still have an edge here, and adding a million-token LLM to this workflow would be like using a crane to pick up a pencil.

On the other side: complex, variable, multi-document reasoning. Contract analysis, regulatory compliance, underwriting, due diligence. Documents where the layout varies, the relationships between fields span pages, and the value comes from understanding context rather than extracting coordinates. This is where long-context LLMs are rewriting the rules.

The traditional IDP stack for complex documents looked like this: OCR, then layout analysis, then field extraction, then classification, then validation rules, then cross-document reconciliation. Each step was a separate model or rules engine. Each step introduced latency, error propagation, and maintenance burden.

A coherent million-token model collapses several of those steps into one. It reads the document as a human would, understanding layout, context, and cross-references simultaneously. The accuracy gap between traditional OCR and LLM-based extraction on complex documents is widening in the LLM's favor.

But the LLM approach has its own weaknesses. Field-level repeatability, structured output guarantees, and cost efficiency at high volume are all areas where purpose-built extraction models still outperform general-purpose LLMs. The winning architecture is not one or the other. It is knowing when to use which.

The pricing signal

Anthropic's decision to remove the long-context pricing premium is worth paying attention to. Claude Opus 4.6 processes a 900K-token request at the same per-token rate as a 9K-token request: $5 per million input tokens, $25 per million output tokens. No multiplier. No surcharge.

This is a market-shaping move. It signals that Anthropic sees long-context processing not as a premium feature but as baseline infrastructure. For document AI teams, it removes one of the practical barriers to full-context processing. A 500-page mortgage package at roughly 300K tokens costs about $1.50 to process in a single pass. That is competitive with multi-step extraction pipelines that split the same work across dozens of API calls.
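
The arithmetic is worth making explicit, using the rates quoted above (the output token count is an assumption for a typical structured extraction):

```python
# Sanity check on the single-pass cost claim at Opus 4.6's quoted rates.
INPUT_PER_M, OUTPUT_PER_M = 5.00, 25.00   # USD per million tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens / 1e6 * INPUT_PER_M + output_tokens / 1e6 * OUTPUT_PER_M

# 500-page mortgage package: ~300K tokens in, ~4K tokens of structured output.
print(f"${request_cost(300_000, 4_000):.2f}")   # $1.60, of which $1.50 is input
```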

The cost math gets even more interesting when you factor in engineering time. Maintaining a chunking pipeline, embedding index, retrieval logic, and stitching layer is not free. If a single API call replaces that entire stack for documents under 500K tokens, the total cost of ownership shifts dramatically, even if the per-token price is higher than a smaller model.

What this does not solve

It would be easy to read the million-token story as "just throw the whole document at the model and you're done." That framing is wrong, and teams that adopt it will repeat the same mistakes that have killed most document AI pilots before them.

Structured output reliability. An LLM that can reason across 600 pages can still hallucinate a field value. Production document pipelines need validation layers, confidence scoring, and human-in-the-loop fallbacks regardless of how capable the extraction model is.
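
Even a minimal validation layer catches a lot. The field names and thresholds below are invented for illustration; the pattern is what matters:

```python
# Minimal post-extraction validation sketch (field rules are illustrative).
# A non-empty result routes the document to a human-in-the-loop queue
# instead of straight into the system of record.
def validate_extraction(fields: dict) -> list[str]:
    issues = []
    if not fields.get("loan_amount") or fields["loan_amount"] <= 0:
        issues.append("loan_amount missing or non-positive")
    if fields.get("stated_income") and fields.get("tax_return_income"):
        drift = abs(fields["stated_income"] - fields["tax_return_income"])
        if drift / fields["tax_return_income"] > 0.10:    # >10% cross-document mismatch
            issues.append("income mismatch: stated vs. tax return")
    if fields.get("confidence", 0.0) < 0.90:              # extraction confidence floor
        issues.append("low extraction confidence")
    return issues
```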

Latency. Processing a million tokens is not instant. For workflows that need sub-second field extraction (think: real-time data entry assistance), a purpose-built model processing a single page will always beat a frontier LLM processing 600.

Auditability. Regulated industries need to trace every extracted value back to a specific location in the source document. LLMs are getting better at citation, but the audit trail for a single-model extraction is less granular than a traditional pipeline where each step produces intermediate outputs.

Volume economics. If you process 10,000 invoices per day and each one is a single page, you do not need a million-token context window. You need a fast, cheap, accurate extraction model. Long context is a capability, not a requirement.

Where this is heading

The convergence on million-token windows is not the end state. It is the beginning of a new architectural era for document intelligence.

The models that can reason coherently across their full context, not just accept tokens but actually use them, will enable document processing workflows that were previously impossible without human reviewers. Complete contract suites analyzed for internal contradictions. Entire loan files underwritten in a single pass. Regulatory filings cross-referenced against the full text of applicable regulations.

The gap between "can accept" and "can reason over" is the metric that matters now. And as of March 2026, that gap varies by a factor of three across frontier models. Choose accordingly.

Ready to solve your document challenges?

Talk to our team about how Doclo can fit into your workflow. No commitment, just a conversation.