An honest document AI benchmark
August 26, 2025

Author: Filip Rejmus
This blog post first appeared on our Substack.
What running a benchmark taught us about benchmarking document AI
We did it. The pressure from the market became too immense. We didn't want to do it, but enough was enough. We gave in, went into the lab, and created a document AI benchmark.
Yes, that's right: another in-depth technical blog post showcasing an array of bar charts, with cloudsquid predictably ranked as the number one vendor in the space and the "most accurate AI ever developed."
"How is that possible?" you may ask. "How are there 1000 document AI vendors who all claim to have the most accurate AI on the planet?" "People wouldn't just go on the internet and lie, would they?"
Our benchmark approach
Alright, let's drop the pretense. We ran a benchmark, and we did pretty well. Here's the thing: we're not reinventing the wheel with something brand new in AI. We offer different AI pipelines optimized for different use cases, and we wanted to objectively evaluate how they perform on a challenging open-source document dataset.
Methodology
We used the Contract Understanding Atticus Dataset (CUAD), a collection of 500 legal contracts with over 13,000 annotations across 41 different legal provisions. For our evaluation:
- We tested our two main document AI pipelines (Advanced and Flash)
- Each document was processed through our unmodified production pipelines
- We extracted the same data points defined in the CUAD ground truth
- We used identical prompts across all models with zero customization for this dataset
- Accuracy was measured as the percentage of correctly extracted data points compared to the ground truth
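To make that last point concrete, here is a minimal sketch of the kind of field-level scoring we're describing. The normalization rules, the 0.9 similarity threshold, and the use of Python's difflib are illustrative choices for this sketch, not a verbatim copy of our evaluation script.

```python
# Minimal sketch of field-level extraction scoring (illustrative only).
# The normalization, threshold, and similarity measure are assumptions,
# not CUAD's official evaluation code or our exact production script.
from difflib import SequenceMatcher


def normalise(value: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences don't count as errors."""
    return " ".join(value.lower().split())


def score_document(predicted: dict[str, str], ground_truth: dict[str, str],
                   threshold: float = 0.9) -> dict[str, float]:
    """Compare one document's extracted fields against its annotations.

    A predicted field counts as correct when its similarity to the
    ground-truth value is at or above `threshold`.
    """
    tp = fp = fn = 0
    similarities = []
    for field, truth in ground_truth.items():
        pred = predicted.get(field)
        if pred is None:
            fn += 1  # annotated field the pipeline failed to extract
            continue
        sim = SequenceMatcher(None, normalise(pred), normalise(truth)).ratio()
        similarities.append(sim)
        if sim >= threshold:
            tp += 1
        else:
            fp += 1  # extracted something, but it doesn't match the annotation
    # Fields the pipeline invented that have no annotation also count as false positives.
    fp += len(set(predicted) - set(ground_truth))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    avg_sim = sum(similarities) / len(similarities) if similarities else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "similarity": avg_sim}
```

Aggregating these per-document scores across the 500 contracts yields the overall precision, recall, F1, and similarity figures reported at the end of this post.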
Findings
Several interesting patterns emerged from our evaluation:
- OCR remains critical - Even in a legal contract setting with primarily text content, adding an OCR step to the pipeline instead of relying solely on an LLM-based approach improved accuracy by approximately 10%.
- Our Advanced pipeline achieved 98.5% extraction accuracy across the 500 documents and thousands of data points (detailed figures at the end of this post).
- Gemini 2.0 Flash, despite the substantial OCR-killing hype, still lags in standalone accuracy. It remains a solid option for simpler use cases, but the performance gap is noticeable in complex document understanding tasks.
Document markdown performance is a commodity
Let's get to the heart of the matter. Cloudsquid isn't doing anything particularly novel on the model and OCR side. Our pipeline is straightforward: best-in-class OCR + best-in-class model. We ran this benchmark with zero additional optimizations tuned for this dataset.
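For readers who want the shape of that pipeline in code, the sketch below shows the two stages. The `ocr_client` and `llm_client` objects and their methods are hypothetical placeholders for whichever best-in-class providers you plug in, not our production integrations.

```python
# Minimal sketch of the two-stage pipeline shape described above (OCR, then LLM).
# `ocr_client` and `llm_client` are hypothetical placeholders, not real vendor APIs.
import json


def extract(document_bytes: bytes, schema: dict, ocr_client, llm_client) -> dict:
    """Convert the document to markdown first, then ask the model for the fields."""
    # Stage 1: OCR / layout parsing. Feeding the model clean markdown instead of
    # raw pages is where the ~10% accuracy gain noted in the findings came from.
    markdown = ocr_client.to_markdown(document_bytes)

    # Stage 2: a single, generic extraction prompt -- no dataset-specific tuning.
    prompt = (
        "Extract the following fields from the contract below. "
        "Return JSON with exactly these keys, using null when a field is absent.\n\n"
        f"Fields: {json.dumps(schema)}\n\nContract:\n{markdown}"
    )
    response = llm_client.complete(prompt)
    # Assumes the model returns valid JSON; production code would validate and retry.
    return json.loads(response)
```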
This leads to several observations about the current state of document AI:
- Marginal performance differences - Different approaches in AI pipeline setup are currently separated by tiny increments in performance. While these incremental gains might still matter in certain contexts, they're rapidly eroding as foundation models improve.
- Technical moat illusion - Startups that emphasize fine-tuning, preprocessing, training their own OCR, or any of the half dozen other esoteric techniques you've seen discussed in Hacker News threads are often engaged in marketing exercises designed to make document AI sound more complicated than it is. They want to create the illusion of a technical moat so that developers won't try to build it themselves.
- Benchmark skepticism - In our experience, customers don't particularly trust accuracy studies. They generally assume everyone is using similar technology and base their purchasing decisions on other factors like price, integration capabilities, user experience, and scalability.
Where the real innovation lies
What's really interesting about the CUAD benchmark is the number and diversity of extractions required and the quality of the prompts used to test against the ground truth. Together they make it an efficient way to evaluate models.
Achieving nearly 99% accuracy with zero tweaks to our pipeline or prompt adjustments feels impressive. However, the more compelling area for innovation lies in building systems around the model to:
- Create high-quality prompts tuned to the document
- Iterate and evaluate results quickly
- Adapt to different document types automatically
- Tightly integrate outputs into workflows and business logic
The majority of inaccuracies in data extraction still stem from shortcomings at the prompting and evaluation level rather than from converting documents into machine-readable formats. That is where engineering effort should be focused: not on marginal OCR improvements or custom model training, but on intelligent systems that optimize the interaction between documents, models, and the enterprise workflows and systems they feed into.
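To illustrate what "building systems around the model" can look like, here is a sketch of a greedy prompt-refinement loop driven by evaluation scores; it reuses `score_document` from the scoring sketch above. The `run_pipeline` and `propose_prompt_variants` helpers are hypothetical placeholders, not features of our product.

```python
# Illustrative sketch of evaluation-driven prompt iteration: try prompt variants
# against a small labelled sample and keep the best-scoring one.
def tune_prompt(base_prompt: str, labelled_sample: list[tuple[bytes, dict]],
                run_pipeline, propose_prompt_variants, rounds: int = 3) -> str:
    """Greedy prompt refinement driven by evaluation, not by OCR or model tweaks."""
    best_prompt = base_prompt
    best_score = evaluate(base_prompt, labelled_sample, run_pipeline)
    for _ in range(rounds):
        for candidate in propose_prompt_variants(best_prompt):
            score = evaluate(candidate, labelled_sample, run_pipeline)
            if score > best_score:
                best_prompt, best_score = candidate, score
    return best_prompt


def evaluate(prompt: str, labelled_sample, run_pipeline) -> float:
    """Average field-level F1 over the sample, using score_document from the earlier sketch."""
    scores = [score_document(run_pipeline(doc, prompt), truth)["f1"]
              for doc, truth in labelled_sample]
    return sum(scores) / len(scores) if scores else 0.0
```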
Conclusion
While we could have produced yet another "we're #1" benchmark post, we believe greater value comes from transparency about where the real challenges and opportunities in document AI truly lie. The performance ceiling of current technologies is very high, but consistent real-world implementation remains challenging and limited for LLM-based vendors.
For the pragmatic and experienced evaluator, the question shouldn’t be "which vendor has the best accuracy on paper?" but rather "which solution can consistently deliver high accuracy across my specific document types with minimal configuration overhead?"
We're betting that the future belongs to systems that can intelligently adapt to document variations and integrate with business workflows and logic, rather than to those claiming a few extra fractions of a percentage point on controlled benchmarks.
If you're interested in how we ran this, we're happy to share the specifics. We can also walk you through exactly how we set up our AI pipelines in the product; there's no secret sauce.
Detailed extraction performance

All figures are overall scores across the full dataset.

| Pipeline | Precision | Recall | F1 score | Similarity score |
| --- | --- | --- | --- | --- |
| OCR + LLM | 0.985 | 0.946 | 0.964 | 0.919 |
| Gemini Flash | 0.986 | 0.874 | 0.923 | 0.90 |