# documentai is solved

Even though we live "in the times of AI", quantum computers, driverless cars, and are about to visit the moon again – PDFs, scanned documents, and handwritten forms just refuse to die. They are an amusing relic from the not-so-distant past, a stronghold of the atavism of yesterday. I have re-visited the topic multiple times in my career, starting with Decision Trees, fine-tuning Encoder based models, and training LoRA adapters for LLMs.

Time for an update.

A lot of business-critical information is still locked in these kinds of documents. We have (largely forgotten about) OpenClaw, but we have not opened documents. Over three years ago, I wrote a piece on the same topic called DocumentAI: Challenges and Real World Usage. There, I traced back the approaches used to extract key-value pairs from PDFs: from "traditional" approaches with feature engineering, over to encoder- and encoder + decoder based approaches. The developments in this field are very reminiscent of the general path that Natural Language Processing (NLP) took over the last couple of years.

a very brief history

Historically, extracting data from documents meant most of the time classifying words (or characters or tokens respectively). But because most of the models were able to process positional information, they were also able to return from where in the document they got the value.

Traditional ML + feature engineering -> Encoder-only classifier (text, 2d position information, visual information) -> Encoder-Decoder -> Decoder-only LLMs -> Decoder-only (multi-modal) -> ?

To me, the most notable milestones are:

LayoutLM v1 in 2019 by Microsoft Research with their combination of BERT, Faster R-CNN, and positional embeddings (2D coordinates)
LayoutLM v2 a year later, which added cross-modality interaction during training
LAMBERT, ApplicaAI (now part of Snowflake) from 2021, which achieved nearly the same performance as LayoutLM v2 with a much simpler approach (only using positional embeddings on top of BERT)
LayoutLM v3 from 2021, swapping the Faster R-CNN with an Image Transformer
TILT, also from ApplicaAI from 2021 that supported position-image-text modalities – but now with an Encoder-Decoder approach based on the T5 model

And then?

recent developments

The decoder took over the world. All these specialised, carefully crafted niche-models are now basically extinct (except Microsoft's Azure Document Intelligence, which may still be powered by a fine-tuned LayoutLM v3). As in most areas nowadays, LLMs dominate everything. While you very recently needed to OCR a document before sending it to your (text-only) LLM, Gemini 2.0 in early 2025 could handle image input very reliably. Even supporting not very well-known things like 3D bounding box estimations (which really blew me away!).

3D bounding box example

Okay nice. But how does this help us with documents?

It returns the value location for the extraction. How? You just ask it (taken from the Google cookbook referenced above).

prompt = """
Point to the blue brush and a list of points covering the region of particles with no more than 10 items.
The answer should follow the JSON format: [{"point": , "label": }, ...].
The points are in [y, x] format normalized to 0-1000.
"""

For documents, this then looks something like this (generated with Gemini 3.1 Pro):

Gemini 3.1 Pro document bounding boxes

This may not look like much. But models like Gemini now allow us not only to classify words, but also to express document content in JSON, while also doing interpretation, translation, normalising, and much more. No OCR is needed, the model can do everything in a single step. Even if the input is only an image. Even with drawings, handwritten text, plots... And even OCR models are nowadays LLMs.

so documentai is solved?

Currently, the Gemini model family provides the best all-around performance for these kinds of use cases. Bounding boxes generated by it are nearly perfect. They also do very well at extracting text. Recently (beginning of 2026), an interesting benchmark and accompanying paper ExtractBench was released. The authors compared several state-of-the-art models on PDF-to-JSON extraction. Interestingly, even the large models like Opus 4.5 struggled with longer extractions.

verdict

Is DocumentAI solved? I would say mostly yes. Longer and more complex schemas remain a challenge. But with the new multi-modal LLMs, most of what is needed simply works. It is fascinating how LLMs have conquered all the niches that were previously held by specialized models. And I think most people don't realize how far we have gotten.