Benchmarking AI Vision with Puzzles
Vision models are a core component of our agent flow. Ana uses vision to inspect every figure she generates and to learn business context from user datasets and dashboards.
As part of a wider assessment of AI vision, we recently benchmarked five multimodal AI models from OpenAI, Google, and Anthropic on their ability to solve puzzles provided as image files. We built the benchmark from Twitter posts, drawing on 10 accounts that post daily puzzles. The puzzles mainly feature:
- simple algebra and linear algebra questions,
- matchstick puzzles,
- trigonometry and geometry questions,
- spot-the-pattern puzzles, and
- chess puzzles.
The best-performing model was Google’s gemini-2.5-pro-preview, which answered 73 out of 75 questions (about 97%) correctly on its first try.
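For readers who want to reproduce a first-try score like this one, here is a minimal sketch of how it could be computed. It assumes answers are graded by exact match after light normalization, which may not match how we actually graded responses. The `Attempt` record, the `normalize` helper, and the placeholder rows are all hypothetical and are not taken from the dataset.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """One model response to one puzzle (hypothetical record layout)."""
    puzzle_id: str
    expected: str
    first_answer: str


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences don't count as errors."""
    return " ".join(answer.strip().lower().split())


def first_try_accuracy(attempts: list[Attempt]) -> tuple[int, int, float]:
    """Return (correct, total, fraction correct) using only each puzzle's first answer."""
    if not attempts:
        raise ValueError("no attempts to score")
    correct = sum(
        normalize(a.first_answer) == normalize(a.expected) for a in attempts
    )
    return correct, len(attempts), correct / len(attempts)


# Placeholder data shaped like the headline result: 75 puzzles, 73 answered correctly.
attempts = [Attempt(f"p{i}", "42", "42") for i in range(73)]
attempts += [Attempt(f"p{i}", "42", "41") for i in range(73, 75)]

correct, total, fraction = first_try_accuracy(attempts)
print(f"{correct}/{total} = {fraction:.1%}")  # prints "73/75 = 97.3%"
```

Exact matching is the simplest grading rule, but it is strict: puzzles with several valid answers, such as some matchstick puzzles, would need a more forgiving comparison.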
The dataset is now public on Kaggle.