Documents are text arranged in space, and the arrangement carries meaning that the text alone does not. A heading is a heading because of where it sits and how it is set on the page. A label and its value live in a directional relationship before they live in a semantic one. A multi-column page has a reading order that the page geometry imposes on the language, not the other way around.
A common instinct in the current LLM/VLM wave pulls against that. Models have gotten much better at reasoning over text, and then better again at reasoning over images, so maybe document understanding is now mostly a matter of pointing a multimodal model at a page and asking for what you want. If that were true, much of the document-understanding stack would collapse into one model boundary.
That stack looks something like this: document → text with positions → structure → extraction → … For scanned inputs, the text-with-positions layer usually comes from OCR; for digital PDFs or web pages, a parser takes its place.
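In code, that handoff is a small contract between stages. A sketch with illustrative types, not any particular library's API:

```python
from dataclasses import dataclass

@dataclass
class Word:
    """A recognized token and where it sits on the page."""
    text: str
    x0: float
    top: float
    x1: float
    bottom: float

@dataclass
class Block:
    """A structural unit (heading, paragraph, table cell, ...) built from words."""
    kind: str
    words: list[Word]
    reading_index: int  # position in the page's reading order

def page_text_in_reading_order(blocks: list[Block]) -> str:
    """A downstream consumer that depends on structure, not just on characters."""
    ordered = sorted(blocks, key=lambda b: b.reading_index)
    return "\n\n".join(" ".join(w.text for w in b.words) for b in ordered)
```

Everything after this point in the pipeline consumes that shape, whether it was produced by OCR plus layout analysis or by a parser over a digital file.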
Documents carry at least two kinds of structure. One is spatial: position, adjacency, alignment, table cells, line grouping, and reading order. The other is semantic and discourse structure: section hierarchy, clause roles, and schema. This piece is about the first one.
Across rule-based, classical-ML, deep-learning, and multimodal waves, the handoff between positioned text and downstream extraction has been implemented in very different ways. It has been explicit, absorbed, re-exposed, and rebuilt. It has not gone away. The real question is where it lives.
Where the Layer Lives
In older stacks, the layer was explicit. You had OCR words or PDF spans, and then a programmable layer that reconstructed lines, grouped blocks, followed reading order, found phrases across line breaks, looked to the right of an anchor, or recovered a table from visible boundaries and aligned text. Classical layout-analysis methods such as XY-Cut, RLSA, and Voronoi segmentation all lived here. So do many of the practical PDF and post-OCR tools engineers still reach for: pdfplumber, Camelot, and similar libraries that expose words, boxes, crops, rows, columns, and region queries directly.
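pdfplumber, for example, exposes exactly that programmable surface. A minimal sketch of the look-to-the-right-of-an-anchor pattern; the file name and label are made up for the example:

```python
import pdfplumber

# Find whatever is printed to the right of a "Total" label on the first page.
with pdfplumber.open("invoice.pdf") as pdf:
    page = pdf.pages[0]
    words = page.extract_words()  # dicts with "text", "x0", "x1", "top", "bottom", ...

    anchors = [w for w in words if w["text"].rstrip(":").lower() == "total"]
    if anchors:
        a = anchors[0]
        # Same visual line: vertical overlap with the anchor; strictly to its right.
        to_the_right = [
            w for w in words
            if w["x0"] > a["x1"]
            and not (w["bottom"] < a["top"] or w["top"] > a["bottom"])
        ]
        to_the_right.sort(key=lambda w: w["x0"])
        print(" ".join(w["text"] for w in to_the_right))
```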
The deep-learning era did not remove this work. It moved it inward. Layout-aware model families such as LayoutLM, DocLLM, and LayoutLLM explicitly fused text with layout. The claim there was not “geometry no longer matters.” It was the opposite: layout was important enough to become part of the representation itself.
Then came the OCR-free and end-to-end push. Donut and Nougat made a stronger move: fold detection, recognition, order, and extraction into one model, and emit linearized markup or task output directly. This is the moment when the field came closest to actually hiding the spatial layer from the user.
But the explicit side never went away. Surya keeps OCR, layout detection, and reading order as separate components rather than fusing them into one model, and exposes structured layout as a clean handoff to downstream code. Marker, unstructured, and Docling sit alongside it as pipeline-shaped libraries.
The newest document VLMs sharpen the point. They internalize much more of the old pipeline, but many useful systems keep re-exposing a structured spatial contract downstream.
dots.ocr makes the pattern especially visible: a roughly 1.7B-parameter document VLM that handles detection, recognition, and reading order inside one model, but still emits structured layout for downstream code. Its default output is typed bounding boxes in reading order, with Markdown as a derived view and tables/formulas preserved in richer forms. In other words, it has a Donut-shaped engine and a Surya-shaped contract.
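The downstream side of that contract stays ordinary code. A sketch of consuming such output, assuming a JSON list of elements already in reading order; the field and category names are illustrative, not dots.ocr's exact schema:

```python
import json

with open("page_layout.json") as f:
    elements = json.load(f)  # assumed: one typed element per region, in reading order

markdown = []
for el in elements:
    if el["category"] == "section_header":  # illustrative category names
        markdown.append(f"## {el['text']}")
    elif el["category"] == "table":
        markdown.append(el["text"])         # table kept in its richer form
    else:
        markdown.append(el["text"])
    x0, y0, x1, y1 = el["bbox"]             # grounding stays available alongside the derived view

print("\n\n".join(markdown))
```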
That is not unique to dots.ocr. The same downstream shape now shows up across very different kinds of systems, from explicit components and pipeline libraries to the newer wave of grounded document VLMs. Some keep the structure layer explicit. Some push it into the model. Some pull it back out again as typed elements, bounding boxes, reading order, table regions, or grounded spans. The field keeps moving the layer around, but it keeps rebuilding the same handoff.
Category Mismatch
The spatial work always lands somewhere. The sharper question is whether the substrate carrying it is actually suited to the work.
That is where document-understanding systems often get muddled. Semantic interpretation and spatial grounding tend to get treated as two difficulty levels of one task. They are not. One is about meaning. The other is about position, order, and anchored lookup. A model can be excellent at the first and still unreliable at the second if the system has not given it a stable spatial substrate to work over.
Coordinate hallucination is one especially clear symptom of that mismatch. Ask a multimodal model for the bounding box of an invoice total and it may return something that looks plausible on the page and is still wrong in the document. Nothing about the response necessarily signals failure. You only discover the error when you go back to the page itself.
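Which is why, in practice, grounded outputs get checked against the positioned-text layer rather than trusted on their own. A sketch, assuming OCR words with boxes are available; the shapes here are illustrative, not a real API:

```python
def box_is_grounded(claimed_text, claimed_box, words, min_overlap=0.5):
    """Does the model's claimed box actually contain the claimed text?

    `words` are dicts with "text", "x0", "top", "x1", "bottom";
    `claimed_box` is (x0, top, x1, bottom) in the same coordinates.
    """
    cx0, ct, cx1, cb = claimed_box
    inside = []
    for w in words:
        ix = max(0.0, min(cx1, w["x1"]) - max(cx0, w["x0"]))     # horizontal intersection
        iy = max(0.0, min(cb, w["bottom"]) - max(ct, w["top"]))  # vertical intersection
        area = (w["x1"] - w["x0"]) * (w["bottom"] - w["top"])
        if area > 0 and (ix * iy) / area >= min_overlap:
            inside.append(w["text"])
    return claimed_text.replace(" ", "") in "".join(inside).replace(" ", "")
```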
The same mismatch shows up in quieter ways too. A phrase that crosses a line break has to be recovered as one phrase, not two fragments. A multi-column page can contain all the right text and still fail because the reading order is wrong. A table can preserve every token and still lose the row and column structure that made those tokens meaningful. Even something as ordinary as binding a value to its label is often a directional and regional question before it is a semantic one.
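The multi-column case is the easiest to make concrete. A deliberately naive sketch, assuming words with boxes and a known column split; real pages need real layout analysis, not a fixed threshold:

```python
def two_column_reading_order(words, split_x):
    """Left column top-to-bottom, then right column top-to-bottom.

    `words` are dicts with "text", "x0", "top"; `split_x` separates the columns.
    """
    left = [w for w in words if w["x0"] < split_x]
    right = [w for w in words if w["x0"] >= split_x]
    ordered = (sorted(left, key=lambda w: (w["top"], w["x0"]))
               + sorted(right, key=lambda w: (w["top"], w["x0"])))
    return " ".join(w["text"] for w in ordered)
```

Concatenating the same words in raw stream order would contain every token and still read as nonsense, which is exactly the failure described above.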
Multimodal systems are clearly getting better at layout-sensitive reasoning, and that gap will keep narrowing with better grounding, reinforcement learning on coordinate prediction, and architectural changes that fold layout in more natively. The point is narrower than “models cannot do this.” It is that when the task itself is spatial and you need reliability, grounding, or control, the question of substrate becomes unavoidable. Hallucinated coordinates are not a problem to be tuned away. They are a category mismatch.
Composition, Not Replacement
This is why the composition of modules in the system matters more than any particular implementation. The model can decide what kind of thing you are looking for and how it should map into your taxonomy. The spatial layer can find it on the page, bind it to the right anchor, and tell you where it came from. That is a cleaner division of labor than asking one inference step to do semantic interpretation, spatial lookup, schema mapping, and grounding all at once.
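In code, that division of labor can be as plain as two functions with different jobs. The semantic step below is a stand-in for a model call (the mapping is a toy), and the spatial step reuses the anchored lookup from earlier; all names are hypothetical:

```python
def classify_field(label_text: str) -> str:
    """Semantic step: map a label string into your schema (stand-in for a model call)."""
    mapping = {"invoice no": "invoice_number", "total": "total_amount"}
    return mapping.get(label_text.strip(": ").lower(), "unknown")

def ground_field(label_word, words):
    """Spatial step: find the value to the right of the label and keep its box."""
    same_line = [
        w for w in words
        if w["x0"] > label_word["x1"]
        and not (w["bottom"] < label_word["top"] or w["top"] > label_word["bottom"])
    ]
    same_line.sort(key=lambda w: w["x0"])
    if not same_line:
        return None
    return {
        "field": classify_field(label_word["text"]),
        "value": " ".join(w["text"] for w in same_line),
        "bbox": (same_line[0]["x0"],
                 min(w["top"] for w in same_line),
                 same_line[-1]["x1"],
                 max(w["bottom"] for w in same_line)),
    }
```

The record that comes out carries the value, the schema field it maps to, and where on the page it came from.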
Once structure handling moves inside model inference, you may lose inspectability, direct grounding, and the ability to compose spatial lookup with semantic reasoning as separate operations. Sometimes that tradeoff is fine. Sometimes it is exactly the problem.
This is not an argument for older explicit geometry pipelines over newer multimodal models. Recent multimodal systems have pushed deeper into the stack than ever before, absorbing more of text acquisition, ordering, and layout into the model itself.
The argument here is also not prescriptive about implementation. The choice between explicit, internalized, or re-exposed structure handling is a fit question. The narrower claim is that whatever choice you make, the output contract still has to carry spatial work somehow.
Document understanding keeps rebuilding a spatial-structure layer because documents keep being structured objects. The practical contract exposed downstream is still shaped like a spatial layer: positioned output, grounded spans, reading order, table structure, and anchored regions.
The implementation will keep moving. The contract is what survives.