Engineering · 9 min read
Multimodal embeddings are about to change everything in document AI
Two new embedding models landed in the same week. Together, they signal a fundamental shift in how we will search, classify, and retrieve documents.

For years, document AI has relied on a sequential pipeline: scan the document, run OCR, extract text, embed the text, then search. Every step introduces errors. Every step adds latency. And every step throws away information that the original document contained, like layout, color, typography, and visual context. Two models released in the same week suggest that era is ending.
On March 10, Google released Gemini Embedding 2, its first natively multimodal embedding model. Two days later, Mixedbread announced Wholembed v3, a unified omnimodal retrieval model that outperforms Gemini on several document benchmarks. Both models embed text, images, audio, video, and PDFs into a shared vector space. Both represent a generational leap.
Why text-only embeddings were always a compromise
Traditional document search works by converting everything to text first. A scanned invoice becomes an OCR transcript. A technical diagram becomes alt text, or worse, gets ignored entirely. A contract with handwritten annotations becomes a noisy string of characters with an error rate that compounds through every downstream step.
This pipeline made sense when embedding models could only process text. But it came with structural limitations that no amount of engineering could fully solve.
Layout and spatial relationships disappear. A table in a PDF carries meaning through its rows, columns, and alignment. Flatten that to text and you lose the structure that makes the data interpretable. A signature block at the bottom of a contract is semantically different from the same text appearing in the body, but text-only embeddings cannot tell the difference.
Visual elements get discarded. Stamps, logos, checkboxes, diagrams, charts, color-coded highlights. These carry information that matters for classification, compliance checking, and fraud detection. An OCR pipeline treats them as noise.
OCR errors propagate. Field-level accuracy in production environments is significantly lower than the headline character accuracy rates suggest: at 99% character accuracy, a 20-character field comes out flawless only about 82% of the time (0.99^20 ≈ 0.82). Small character errors compound through post-processing, and the embedding model faithfully encodes the errors along with the content.
What Gemini Embedding 2 brings to the table
Gemini Embedding 2 is built on a shared transformer backbone that processes all modalities through a unified representation. Unlike CLIP-style architectures that pair separate encoders and align them after the fact, Gemini learns cross-modal understanding intrinsically during training.
The practical implications for document AI are significant:
Direct PDF embedding. You can pass a PDF (up to six pages) directly to the model and get back a vector that captures both the textual content and the visual layout. No OCR step. No text extraction. No information loss.
Cross-modal retrieval. A text query like "invoice with handwritten corrections" can match against embedded document images, even if no text extraction was performed. The model understands the visual concept of handwritten annotations overlaid on printed text.
Matryoshka dimensionality. The 3,072-dimensional output can be truncated to 768 dimensions or lower with minimal quality loss. This is not just a storage optimization: it means you can run fast approximate searches at low dimensions and re-rank with full-precision vectors, cutting retrieval latency without sacrificing accuracy. A sketch of that two-stage pattern follows this list.
Scale and cost. At $0.20 per million tokens with a 50% batch discount, the economics work for high-volume document processing. The model supports 8,192 input tokens, four times the context window of its predecessor.
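Here is a minimal sketch of the truncate-then-rerank pattern, using NumPy with random placeholder vectors in place of real embeddings. The corpus size, shortlist size, and dimensions are illustrative assumptions, not Gemini API calls.

```python
import numpy as np

def normalize(v):
    # L2-normalize so a dot product equals cosine similarity
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Placeholder full-precision embeddings: one 3072-dim vector per document page.
rng = np.random.default_rng(0)
doc_vectors = normalize(rng.normal(size=(10_000, 3072)).astype(np.float32))
query = normalize(rng.normal(size=3072).astype(np.float32))

# Stage 1: coarse search on truncated (Matryoshka) prefixes of the same vectors.
DIM_FAST = 768
coarse_scores = normalize(doc_vectors[:, :DIM_FAST]) @ normalize(query[:DIM_FAST])
shortlist = np.argsort(-coarse_scores)[:200]            # top-200 candidates

# Stage 2: re-rank only the shortlist with the full-precision vectors.
fine_scores = doc_vectors[shortlist] @ query
top_10 = shortlist[np.argsort(-fine_scores)[:10]]
```

The same stored vectors power both stages; the truncated version is just the first 768 dimensions of the full embedding, so no second index is needed.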
On benchmarks, Gemini Embedding 2 ranks first on the MTEB multilingual leaderboard with a score of 69.9, and leads on code embeddings at 84.0. Early adopters report 70% latency reductions and 20% recall improvements over conventional multi-model pipelines.
Where Wholembed v3 takes a different path
Mixedbread's Wholembed v3 makes a fundamentally different architectural bet. Instead of producing a single vector per document, it uses ColBERT-style late interaction, generating token-level vector representations and scoring via MaxSim. For each query token, the model finds the best matching document token, then sums those scores.
This matters more than it might sound. A single embedding vector has a theoretical ceiling on how much information it can encode. Google DeepMind formalized this in the LIMIT benchmark, which stress-tests retrieval tasks that are provably hard for fixed-dimension embeddings. Previous state-of-the-art models scored below 20 on recall@100. Wholembed v3 is the first semantic model to beat BM25 on LIMIT, sidestepping the dimensionality bottleneck entirely.
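As a rough illustration of the late-interaction scoring described above (a generic ColBERT-style MaxSim, not Mixedbread's actual implementation):

```python
import numpy as np

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """ColBERT-style MaxSim over L2-normalized token vectors.

    query_tokens: (num_query_tokens, dim)
    doc_tokens:   (num_doc_tokens, dim)
    """
    sim = query_tokens @ doc_tokens.T     # cosine similarity, query x doc tokens
    return float(sim.max(axis=1).sum())   # best doc token per query token, summed

# Toy example: rank two documents against a 4-token query.
rng = np.random.default_rng(0)
unit = lambda m: m / np.linalg.norm(m, axis=1, keepdims=True)
query = unit(rng.normal(size=(4, 128)))
docs = {
    "dense_report": unit(rng.normal(size=(180, 128))),
    "short_receipt": unit(rng.normal(size=(25, 128))),
}
ranking = sorted(docs, key=lambda name: -maxsim_score(query, docs[name]))
```

Because each query token gets to pick its own best match, a document is never forced to compress everything it knows into one vector before being scored.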
Three capabilities stand out for document workloads:
Dynamic vector allocation. The model estimates the information density of each input and allocates more vectors to complex documents, fewer to simple ones. A dense financial report gets a richer representation than a single-line receipt. This is a more honest approach to the fundamental problem that documents vary enormously in complexity.
Specialized modality handling. Audio inputs go through dedicated quality preprocessing. Code inputs are parsed into abstract syntax trees. Documents are processed with layout-aware tokenization. Each modality gets purpose-built preprocessing before entering the shared latent space.
Production-scale late interaction. Mixedbread has deployed Wholembed v3 at over one billion indexed documents with sub-50ms P50 latency at 500+ queries per second. Late interaction has historically been dismissed as too expensive for production. These numbers suggest otherwise.
On BrowseComp-Plus, Wholembed v3 achieves 64.82% answer accuracy, ahead of Voyage (61.6%), Gemini Embedding 2 (58.6%), and Cohere Embed 4 (57.1%). On specialized-domain PDF search, multilingual PDF retrieval, and fine-grained document matching, it leads Gemini by meaningful margins.
What changes for document search and retrieval
The immediate impact is the elimination of the OCR bottleneck. When you can embed a document image directly, the question shifts from "how do we extract text accurately?" to "how do we organize a unified vector space across modalities?"
Semantic search over visual documents. Search for "table showing quarterly revenue breakdown" and retrieve the right page from a PDF, even if the table has no caption and the text never uses the word "revenue." The embedding captures the visual structure of a table alongside the numeric content.
Cross-format retrieval. A single query can match across scanned documents, born-digital PDFs, photographs of whiteboards, and transcribed audio recordings. The unified embedding space makes format irrelevant to the search experience; a sketch of that pattern follows this list.
Retrieval-augmented generation with full context. When you feed a multimodal embedding into a RAG pipeline, the retrieved chunks carry visual context that text-only chunks cannot. A language model receiving a retrieved document image alongside its text can reason about layout, emphasis, and spatial relationships.
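A minimal sketch of a unified index under these assumptions. The embed() helper is a hypothetical stand-in for whichever multimodal embedding API you adopt; here it returns fake vectors so the example runs end to end.

```python
import numpy as np

def embed(item: str) -> np.ndarray:
    # Hypothetical multimodal embedding call: one normalized vector per input,
    # whether the input is a PDF page, a photo, or a plain text query.
    # Faked here with a seeded random vector so the sketch is runnable.
    rng = np.random.default_rng(abs(hash(item)) % (2**32))
    v = rng.normal(size=512)
    return v / np.linalg.norm(v)

# Index time: embed every asset as-is. No OCR step, no per-format branching.
corpus = ["inv-0042-scan.pdf", "whiteboard-03.jpg", "q3-report.pdf", "standup.mp3"]
matrix = np.stack([embed(doc) for doc in corpus])          # (num_docs, dim)

# Query time: a plain text query lands in the same space as the document vectors.
def search(query_text: str, k: int = 3):
    scores = matrix @ embed(query_text)                    # cosine similarity
    top = np.argsort(-scores)[:k]
    return [(corpus[i], float(scores[i])) for i in top]

search("table showing quarterly revenue breakdown")
```

In production the in-memory matrix would be a vector database, but the shape of the problem is the same: one space, one query path, any format.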
What changes for classification and detection
Classification has traditionally required labeled training data for each document type. Multimodal embeddings open a different path.
Zero-shot classification. Embed your document categories as text descriptions. Embed incoming documents as images. The nearest category in the shared vector space is your classification. No training data required. No fine-tuning. When new document types appear, you add a text description and the system adapts. A sketch of this pattern follows this list.
Anomaly detection by visual similarity. Fraudulent documents often look subtly wrong. The font is slightly off. The logo is a low-resolution copy. The layout does not match the template. Text-only analysis misses these signals entirely. Multimodal embeddings encode visual appearance alongside content, making it possible to flag documents that are textually plausible but visually anomalous.
Compliance checking at the layout level. Regulatory documents often have specific formatting requirements: where the signature goes, how disclosures are displayed, what font sizes are used for mandatory warnings. Multimodal embeddings can encode these structural properties, enabling compliance checks that go beyond keyword matching.
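A minimal sketch of that zero-shot pattern, again with a hypothetical embed() standing in for a real multimodal embedding call. The category descriptions and file paths are made up for illustration.

```python
import numpy as np

def embed(item: str) -> np.ndarray:
    # Hypothetical multimodal embedding: text descriptions and page images land
    # in the same normalized vector space. Faked here so the sketch runs.
    rng = np.random.default_rng(abs(hash(item)) % (2**32))
    v = rng.normal(size=512)
    return v / np.linalg.norm(v)

# Categories are plain text descriptions; adding a new document type is one line.
categories = {
    "invoice":        "an invoice with line items, totals, and payment terms",
    "purchase_order": "a purchase order listing requested goods and quantities",
    "contract":       "a signed agreement with numbered clauses and signature blocks",
}
names = list(categories)
category_matrix = np.stack([embed(desc) for desc in categories.values()])

def classify(page_image_path: str) -> str:
    scores = category_matrix @ embed(page_image_path)   # cosine similarity per category
    return names[int(np.argmax(scores))]

classify("scans/incoming/2024-03-12-0001.png")
```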
The architectural shift ahead
These models do not just improve existing pipelines. They enable a fundamentally different architecture.
The traditional stack looks like this: Ingest, OCR, Extract, Embed, Index, Search. Each step is a separate service with its own failure modes, latency budget, and maintenance burden. (We covered how to build these pipelines for production reliability in an earlier article.)
The emerging stack collapses the middle: Ingest, Embed, Index, Search. The document goes directly from raw input to vector representation. Classification, extraction, and retrieval all operate on the same embedding space. Error propagation from OCR disappears because OCR disappears.
This does not mean OCR becomes irrelevant overnight. Structured data extraction still requires text. Downstream systems still need field-level values. But the role of OCR shifts from the foundation of the pipeline to a targeted extraction step that runs only when specific text values are needed, not as a prerequisite for every operation.
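In code, that shift can be as small as making text extraction a lazy, cached call rather than an ingest-time stage. A sketch using pytesseract as a stand-in OCR engine (any engine works; the point is where it sits in the pipeline):

```python
from functools import lru_cache

from PIL import Image
import pytesseract  # invoked only when a caller actually needs field-level text

@lru_cache(maxsize=1024)
def page_text(image_path: str) -> str:
    # OCR runs on demand and is cached per page. Search, classification, and
    # routing all operate on embeddings of the raw image and never hit this path.
    return pytesseract.image_to_string(Image.open(image_path))
```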
What to watch for
Both models are in their early days. Gemini Embedding 2 is in public preview. Wholembed v3 is available through Mixedbread's API but not as open weights. The benchmarks are promising, but production performance on your specific document types is what matters.
A few questions worth tracking:
Consistency across document types. Benchmarks aggregate performance. Your documents are specific. How well do these models handle the particular combination of languages, layouts, quality levels, and edge cases in your pipeline?
Retrieval latency at scale. Gemini's single-vector approach and Wholembed's late-interaction approach make very different tradeoffs between index size, query speed, and retrieval quality; a rough back-of-envelope comparison follows this list. The right choice depends on your volume and latency requirements.
Integration patterns. Both models offer API access, but the ecosystem of vector databases, orchestration frameworks, and RAG tooling is still catching up to multimodal inputs. LangChain, LlamaIndex, and major vector databases have announced Gemini Embedding 2 integrations. Wholembed v3 integrations are earlier-stage.
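To make the index-size side of that tradeoff concrete, here is a back-of-envelope comparison. Every figure is an illustrative assumption (page count, tokens per page, dimensions, no compression), not a published number for either model.

```python
# Raw, uncompressed index footprints for 10 million document pages.
PAGES = 10_000_000
BYTES_PER_FLOAT32 = 4

# Single-vector (Gemini-style): one vector per page.
full_3072d     = PAGES * 3072 * BYTES_PER_FLOAT32              # ~123 GB
truncated_768d = PAGES * 768 * BYTES_PER_FLOAT32               # ~31 GB

# Late interaction (ColBERT-style): assume ~200 token vectors per page at 128 dims.
late_interaction = PAGES * 200 * 128 * BYTES_PER_FLOAT32       # ~1,024 GB

for label, size in [("single vector, 3072d", full_3072d),
                    ("single vector, 768d truncated", truncated_768d),
                    ("late interaction, ~200 x 128d", late_interaction)]:
    print(f"{label:32s} {size / 1e9:7.0f} GB")
```

Production late-interaction systems typically compress token vectors aggressively, so the real gap is narrower than these raw numbers suggest, but the storage and query-cost profiles of the two approaches remain genuinely different.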
The direction is clear. Document AI is moving from text-first to multimodal-native. The models that landed this week are the first production-grade implementations of that shift. The organizations that start experimenting now will have a meaningful head start when these capabilities become table stakes.
Ready to solve your document challenges?
Talk to our team about how Doclo can fit into your workflow. No commitment, just a conversation.


