Benchmarking AI Vision with Puzzles

Vision models are a core component in our agent flow. Ana uses vision to inspect every figure she generates, and to learn business context from user datasets and dashboards.

As part of a wider assessment of AI vision, we recently benchmarked five multimodal AI models from OpenAI, Google and Anthropic on their ability to solve puzzles provided as image files. We built the benchmark from Twitter posts, drawing on 10 accounts that post daily puzzles. The puzzles mainly feature:

simple algebra and linear algebra questions,
matchstick puzzles,
trigonometry and geometry questions,
spot-the-pattern puzzles, and
chess puzzles.

Benchmark Result — Model accuracy across puzzle types

The best-performing model was Google’s gemini-2.5-pro-preview, which answered 73 of 75 questions correctly on its first try.
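For readers who want to reproduce this kind of score, here is a minimal sketch of how first-try accuracy could be tallied per puzzle type and overall. The record format (a list of `(puzzle_type, is_correct)` pairs) and the function name `first_try_accuracy` are assumptions made for illustration; they are not the layout of our dataset or the code we ran.

```python
from collections import defaultdict


def first_try_accuracy(results):
    """Return (per-type accuracy dict, overall accuracy) for one model.

    `results` is a hypothetical list of (puzzle_type, is_correct) pairs,
    one pair per puzzle, recording only the model's first attempt.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for puzzle_type, is_correct in results:
        totals[puzzle_type] += 1
        correct[puzzle_type] += int(is_correct)

    if not totals:
        raise ValueError("results must contain at least one puzzle")

    per_type = {t: correct[t] / totals[t] for t in totals}
    overall = sum(correct.values()) / sum(totals.values())
    return per_type, overall


# The headline figure from the text: 73 correct answers out of 75 puzzles.
print(f"{73 / 75:.1%}")  # 97.3%
```

Scoring only the first attempt keeps the comparison between models simple: every model gets exactly one answer per puzzle, so accuracy is just correct answers divided by puzzles attempted.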

The dataset is now public on Kaggle.