OCR is legacy tech
August 26, 2025

Author: Filip Rejmus
The first idea resembling OCR was developed in 1870 as a reading machine for the blind - the Optophone. It was the first step toward solving a problem that sounds pretty simple: how do we get writing on paper into a computer?
150 years of research, engineering breakthroughs, and hundreds of IDP products later, we were finally able to scan a receipt and have the fields filled out - provided it looked nice and friendly enough to the OCR model. Eureka.
Unfortunately for Tesseract, ABBYY, and co., they suffer from the complication that documents are written by humans. And humans love to do things like:
- Stamp over the most critical data because it feels like the right spot
- Organize data in tables with four nested levels of groupings
- Disagree on data standards, abandoning any attempt at standardization and simply sending their very own format around
- Add handwritten comments in their own language
- Create documents that require at least a college degree in the relevant field to understand correctly.
This meant OCR models were basically just helpers for data scientists, who still had to handle cleanups, routing, and post-validation to get anything even vaguely close to real automation.
Multimodal LLMs enter the scene
With the launch of multimodal LLMs, most prominently Gemini 2.0 Flash, many AI fields with decades-long track records of active research became obsolete. Image classification? Solved. Answering questions over images, text, and tables? Auf Wiedersehen. And OCR? Exactly.
What gives LLMs the power to declare victory in dozens of areas that were previously considered specialist domains comes down to two characteristics of the Transformer architecture and its training:
- The model’s architecture gives it global context over everything in the input: it can, for example, understand that a table row belongs to headers printed five pages earlier.
- The model is trained on the entire internet, so it understands concepts like stamps, order forms with handwriting over them, and the fact that invoices can have as many layouts as there are people on this planet.
With that intuition in mind, it’s not surprising that they easily solve problems OCR gave up on. Instead of just transforming pixel patterns into words, they look at the whole document at once and can even reach into their corpus of human knowledge to make sense of it on a conceptual level. Not only that, they can return information about an embedded image with barely any text at all - think of a technical drawing. You can pull data from that, where OCR would give you a transcript of the labels if you’re lucky.
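To make that concrete, here is a minimal sketch of prompt-based extraction with a multimodal model. It assumes the google-genai Python SDK and an API key in the environment; the file name, field list, and prompt wording are my own illustrations, not something from this post.

```python
# Sketch: extract structured fields from a scanned invoice with a multimodal LLM.
# Assumes `pip install google-genai` and a GEMINI_API_KEY environment variable.
# The field names below are made up for illustration.
import os

from google import genai
from google.genai import types

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

with open("invoice_scan.png", "rb") as f:  # hypothetical input file
    image_bytes = f.read()

prompt = (
    "Extract the following fields from this invoice and answer with JSON only: "
    "supplier_name, invoice_number, invoice_date, currency, total_amount, "
    "line_items (description, quantity, unit_price). "
    "If a value is stamped over or handwritten, read it anyway."
)

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
        prompt,
    ],
)

print(response.text)  # JSON string with the extracted fields
```

The point is that the prompt describes the fields conceptually; there is no template, zoning, or per-layout rule anywhere in that snippet.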
Where OCR keeps an ace up its sleeve - for now
Remember the part where I said language models are awesome, with their huge training set and global context?
Unfortunately, that also makes them a bit expensive. A PDF with a few hundred pages and embedded images will cost you a couple of dollars, depending on the model. In document processing we also suffer from relatively small output context windows: 64 thousand tokens may be enough to write a small codebase in Cursor, but they run out fast when you try to extract just a couple dozen pages of tables from a document. There’s also something reassuring about clear, explainable rules compared to the black-box nature of LLMs.
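As a rough back-of-envelope sketch of why the output window bites: the tokens-per-page figure below is an assumption I picked for illustration, not a measurement from the post.

```python
# Back-of-envelope: how many dense table pages fit into one response?
# TOKENS_PER_TABLE_PAGE is an illustrative assumption, not a measured value.
OUTPUT_WINDOW_TOKENS = 64_000    # output limit mentioned in the text
TOKENS_PER_TABLE_PAGE = 2_500    # assumed: one dense table page serialized as JSON

pages_per_response = OUTPUT_WINDOW_TOKENS // TOKENS_PER_TABLE_PAGE
print(f"Roughly {pages_per_response} table pages fit in a single response")
# -> roughly 25, which is why "a couple dozen pages of tables" already hurts
```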
With all that said, I still believe that processing documents will be a solved problem in a couple of years’ time. It feels like we’re 95% there. Models will get cheaper and more efficient, with longer context windows. The focus will shift to automating the flow from document to System of Record, and AI agents are already starting to be helpful here too.
My company, cloudsquid, works on this problem. If you’re interested in working with us or simply want to chat about this post, contact me through our website, at filip@cloudsquid.io, or on LinkedIn.