OCR is legacy tech
August 26, 2025

Author: Filip Rejmus
The first idea resembling OCR was developed in 1870 as a reading machine for the blind - the Optophone. It was the first step toward solving a problem that sounds pretty simple: how do we get writing on paper into a computer?
150 years of research, engineering breakthroughs, and hundreds of IDP products later, we were finally able to scan a receipt and have the fields filled out - provided it looked nice and friendly enough to the OCR model. Eureka.
Unfortunately for Tesseract, ABBYY, and co., they suffer from the complication that documents are written by humans. And humans love to do things like:
- Stamp over the most critical data because it feels like the right spot
- Organize data in tables with four nested levels of groupings
- Disagree on data standards, abandoning any attempt at standardization and simply sending their very own format around
- Add handwritten comments in their own language
- Create documents that require at least a college degree in the relevant field to understand correctly.
This meant OCR models were basically just helpers for data scientists, who still had to handle cleanups, routing, and post-validation to get anything even vaguely close to real automation.
Multimodal LLMs enter the scene
With the launch of multimodal LLMs, most prominently Gemini 2.0 Flash, many AI fields with decades-long track records of active research became obsolete. Image classification? Solved. Answering questions over images, text, and tables? Auf Wiedersehen. And OCR? Exactly.
What gives LLMs the power to declare victory in dozens of areas that were previously considered specialist domains comes down to two characteristics of the Transformer architecture and its training:
- The model’s architecture gives it global context over everything in the input: it can, for example, understand that a table row belongs to headers printed five pages earlier.
- The model is trained on the entire internet, so it understands concepts like stamps, order forms with handwriting over them, and the fact that invoices can have as many layouts as there are people on this planet.
With that intuition in mind, it’s not surprising that they easily solve problems OCR gave up on. Instead of just transforming pixel patterns into words, they look at the whole document at once and can even reach into their corpus of human knowledge to make sense of it on a conceptual level. Not only that, they can return information about an embedded image with barely any text at all - think of a technical drawing. You can pull data from that, where OCR would give you a transcript of the labels if you’re lucky.
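To make that concrete, here is a minimal sketch of prompt-based extraction with a multimodal model. It assumes the google-genai Python SDK and an API key in the environment; the file name, field list, and prompt wording are my own illustrations, not something from this post.

```python
# Sketch: extract structured fields from a scanned invoice with a multimodal LLM.
# Assumes `pip install google-genai` and a GEMINI_API_KEY environment variable.
# The field names below are made up for illustration.
import os

from google import genai
from google.genai import types

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

with open("invoice_scan.png", "rb") as f:  # hypothetical input file
    image_bytes = f.read()

prompt = (
    "Extract the following fields from this invoice and answer with JSON only: "
    "supplier_name, invoice_number, invoice_date, currency, total_amount, "
    "line_items (description, quantity, unit_price). "
    "If a value is stamped over or handwritten, read it anyway."
)

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
        prompt,
    ],
)

print(response.text)  # JSON string with the extracted fields
```

The point is that the prompt describes the fields conceptually; there is no template, zoning, or per-layout rule anywhere in that snippet.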
Where OCR keeps an ace up its sleeve - for now
Remember the part where I said language models are awesome, with their huge training set and global context?
Unfortunately, that also makes them a bit expensive. A PDF with a few hundred pages and embedded images will cost you a couple of dollars, depending on the model. In document processing we also suffer from relatively small output context windows: 64 thousand tokens may be enough to write a small codebase in Cursor, but they run out fast when you try to extract just a couple dozen pages of tables from a document. There’s also something reassuring about clear, explainable rules compared to the black-box nature of LLMs.
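As a rough back-of-envelope sketch of why the output window bites: the tokens-per-page figure below is an assumption I picked for illustration, not a measurement from the post.

```python
# Back-of-envelope: how many dense table pages fit into one response?
# TOKENS_PER_TABLE_PAGE is an illustrative assumption, not a measured value.
OUTPUT_WINDOW_TOKENS = 64_000    # output limit mentioned in the text
TOKENS_PER_TABLE_PAGE = 2_500    # assumed: one dense table page serialized as JSON

pages_per_response = OUTPUT_WINDOW_TOKENS // TOKENS_PER_TABLE_PAGE
print(f"Roughly {pages_per_response} table pages fit in a single response")
# -> roughly 25, which is why "a couple dozen pages of tables" already hurts
```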
With all that said, I still believe that processing documents will be a solved problem in a couple of years’ time. It feels like we’re 95% there. Models will get cheaper and more efficient, with longer context windows. The focus will shift to automating the flow from document to System of Record, and AI agents are already starting to be helpful here too.
My company, cloudsquid, works on this problem. If you’re interested in working with us or simply want to chat about this post, contact me through our website, at filip@cloudsquid.io, or on LinkedIn.