An honest document AI benchmark
August 26, 2025

Author: Filip Rejmus
This blog post first appeared on our Substack.
What running a benchmark taught us about benchmarking document AI
We did it. The pressure from the market became too immense. We didn't want to do it, but enough was enough. We gave in, went into the lab, and created a document AI benchmark.
Yes, that's right: another in-depth technical blog post showcasing an array of bar charts, with cloudsquid predictably ranked as the number one vendor in the space and the "most accurate AI ever developed."
"How is that possible?" you may ask. "How are there 1000 document AI vendors who all claim to have the most accurate AI on the planet?" "People wouldn't just go on the internet and lie, would they?"
Our benchmark approach
Alright, let's drop the pretense. We ran a benchmark, and we did pretty well. Here's the thing: we're not reinventing the wheel with something brand new in AI. We offer different AI pipelines optimized for different use cases, and we wanted to objectively evaluate how they perform on a challenging open-source document dataset.
Methodology
We used the Contract Understanding Atticus Dataset (CUAD), a collection of 500 legal contracts with over 13,000 annotations across 41 different legal provisions. For our evaluation:
- We tested our two main document AI pipelines (Advanced and Flash)
- Each document was processed through our unmodified production pipelines
- We extracted the same data points defined in the CUAD ground truth
- We used identical prompts across all models with zero customization for this dataset
- Accuracy was measured as the percentage of correctly extracted data points compared to the ground truth
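To make that last point concrete, here is a minimal sketch of the kind of field-level scoring we're describing. The normalization rules, the 0.9 similarity threshold, and the use of Python's difflib are illustrative choices for this sketch, not a verbatim copy of our evaluation script.

```python
# Minimal sketch of field-level extraction scoring (illustrative only).
# The normalization, threshold, and similarity measure are assumptions,
# not CUAD's official evaluation code or our exact production script.
from difflib import SequenceMatcher


def normalise(value: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences don't count as errors."""
    return " ".join(value.lower().split())


def score_document(predicted: dict[str, str], ground_truth: dict[str, str],
                   threshold: float = 0.9) -> dict[str, float]:
    """Compare one document's extracted fields against its annotations.

    A predicted field counts as correct when its similarity to the
    ground-truth value is at or above `threshold`.
    """
    tp = fp = fn = 0
    similarities = []
    for field, truth in ground_truth.items():
        pred = predicted.get(field)
        if pred is None:
            fn += 1  # annotated field the pipeline failed to extract
            continue
        sim = SequenceMatcher(None, normalise(pred), normalise(truth)).ratio()
        similarities.append(sim)
        if sim >= threshold:
            tp += 1
        else:
            fp += 1  # extracted something, but it doesn't match the annotation
    # Fields the pipeline invented that have no annotation also count as false positives.
    fp += len(set(predicted) - set(ground_truth))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    avg_sim = sum(similarities) / len(similarities) if similarities else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "similarity": avg_sim}
```

Aggregating these per-document scores across the 500 contracts yields the overall precision, recall, F1, and similarity figures reported at the end of this post.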
Findings
Several interesting patterns emerged from our evaluation:
- OCR remains critical - Even in a legal contract setting with primarily text content, adding an OCR step to the pipeline instead of relying solely on an LLM-based approach improved accuracy by approximately 10%.
- Our Advanced pipeline achieved 98.5% extraction accuracy across the 500 documents and thousands of data points (detailed figures at the end of this post).
- Gemini 2.0 Flash, despite the substantial OCR-killing hype, still lags in standalone accuracy. It remains a solid option for simpler use cases, but the performance gap is noticeable in complex document understanding tasks.
Document markdown performance is a commodity
Let's get to the heart of the matter. Cloudsquid isn't doing anything particularly novel on the model and OCR side. Our pipeline is straightforward: best-in-class OCR + best-in-class model. We ran this benchmark with zero additional optimizations tuned for this dataset.
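For readers who want the shape of that pipeline in code, the sketch below shows the two stages. The `ocr_client` and `llm_client` objects and their methods are hypothetical placeholders for whichever best-in-class providers you plug in, not our production integrations.

```python
# Minimal sketch of the two-stage pipeline shape described above (OCR, then LLM).
# `ocr_client` and `llm_client` are hypothetical placeholders, not real vendor APIs.
import json


def extract(document_bytes: bytes, schema: dict, ocr_client, llm_client) -> dict:
    """Convert the document to markdown first, then ask the model for the fields."""
    # Stage 1: OCR / layout parsing. Feeding the model clean markdown instead of
    # raw pages is where the ~10% accuracy gain noted in the findings came from.
    markdown = ocr_client.to_markdown(document_bytes)

    # Stage 2: a single, generic extraction prompt -- no dataset-specific tuning.
    prompt = (
        "Extract the following fields from the contract below. "
        "Return JSON with exactly these keys, using null when a field is absent.\n\n"
        f"Fields: {json.dumps(schema)}\n\nContract:\n{markdown}"
    )
    response = llm_client.complete(prompt)
    # Assumes the model returns valid JSON; production code would validate and retry.
    return json.loads(response)
```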
This leads to several observations about the current state of document AI:
- Marginal performance differences - Different approaches in AI pipeline setup are currently separated by tiny increments in performance. While these incremental gains might still matter in certain contexts, they're rapidly eroding as foundation models improve.
- Technical moat illusion - Startups that emphasize fine-tuning, preprocessing, training their own OCR, or any of the half dozen other esoteric techniques you've seen discussed in Hacker News threads are often engaged in marketing exercises designed to make document AI sound more complicated than it is. They want to create the illusion of a technical moat so that developers won't try to build it themselves.
- Benchmark skepticism - In our experience, customers don't particularly trust accuracy studies. They generally assume everyone is using similar technology and base their purchasing decisions on other factors like price, integration capabilities, user experience, and scalability.
Where the real innovation lies
What's really interesting about the CUAD benchmark is the number and diversity of extractions required and the quality of the prompts used to test against the ground truth. Together they make it an efficient way to evaluate models.
Achieving nearly 99% accuracy with zero tweaks to our pipeline or prompt adjustments feels impressive. However, the more compelling area for innovation lies in building systems around the model to:
- Create high-quality prompts tuned to the document
- Iterate and evaluate results quickly
- Adapt to different document types automatically
- Tightly integrate outputs into workflows and business logic
The majority of inaccuracies in data extraction still stem from shortcomings at the prompting and evaluation level rather than from converting documents into machine-readable formats. That is where engineering effort should be focused: not on marginal OCR improvements or custom model training, but on intelligent systems that optimize the interaction between documents, models, and the enterprise workflows and systems they feed into.
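To illustrate what "building systems around the model" can look like, here is a sketch of a greedy prompt-refinement loop driven by evaluation scores; it reuses `score_document` from the scoring sketch above. The `run_pipeline` and `propose_prompt_variants` helpers are hypothetical placeholders, not features of our product.

```python
# Illustrative sketch of evaluation-driven prompt iteration: try prompt variants
# against a small labelled sample and keep the best-scoring one.
def tune_prompt(base_prompt: str, labelled_sample: list[tuple[bytes, dict]],
                run_pipeline, propose_prompt_variants, rounds: int = 3) -> str:
    """Greedy prompt refinement driven by evaluation, not by OCR or model tweaks."""
    best_prompt = base_prompt
    best_score = evaluate(base_prompt, labelled_sample, run_pipeline)
    for _ in range(rounds):
        for candidate in propose_prompt_variants(best_prompt):
            score = evaluate(candidate, labelled_sample, run_pipeline)
            if score > best_score:
                best_prompt, best_score = candidate, score
    return best_prompt


def evaluate(prompt: str, labelled_sample, run_pipeline) -> float:
    """Average field-level F1 over the sample, using score_document from the earlier sketch."""
    scores = [score_document(run_pipeline(doc, prompt), truth)["f1"]
              for doc, truth in labelled_sample]
    return sum(scores) / len(scores) if scores else 0.0
```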
Conclusion
While we could have produced yet another "we're #1" benchmark post, we believe greater value comes from transparency about where the real challenges and opportunities in document AI truly lie. The performance ceiling of current technologies is very high, but consistent real-world implementation remains challenging and limited for LLM-based vendors.
For the pragmatic and experienced evaluator, the question shouldn’t be "which vendor has the best accuracy on paper?" but rather "which solution can consistently deliver high accuracy across my specific document types with minimal configuration overhead?"
We're betting that the future belongs to systems that can intelligently adapt to document variations and integrate with business workflows and logic, rather than to those claiming a few extra fractions of a percentage point on controlled benchmarks.
If you're interested in how we ran this, we're happy to share the specifics. We can also walk you through exactly how we set up our AI pipelines in the product; there's no secret sauce.
Detailed extraction performance

All figures are overall scores across the full dataset.

| Pipeline | Precision | Recall | F1 score | Similarity score |
| --- | --- | --- | --- | --- |
| OCR + LLM | 0.985 | 0.946 | 0.964 | 0.919 |
| Gemini Flash | 0.986 | 0.874 | 0.923 | 0.90 |