Most AI benchmarks don’t tell us much. They ask questions that can be solved with rote memorization, or cover topics that aren’t relevant to the majority of users.
So some AI enthusiasts are turning to games as a way to test AIs’ problem-solving skills.
Paul Calcraft, a freelance AI developer, has built an app where two AI models can play a Pictionary-like game with each other. One model doodles, while the other model tries to guess what the doodle represents.
“I thought this sounded super fun and potentially interesting from a model capabilities point of view,” Calcraft told TechCrunch in an interview. “So I sat indoors on a cloudy Saturday and got it done.”
Calcraft was inspired by a similar project by British programmer Simon Willison that tasked models with rendering a vector drawing of a pelican riding a bicycle. Willison, like Calcraft, chose a challenge he believed would force models to “think” beyond the contents of their training data.
“The idea is to have a benchmark that’s un-gameable,” Calcraft said. “A benchmark that can’t be beaten by memorizing specific answers or simple patterns that have been seen before during training.”
Minecraft is in this “un-gameable” category as well, or so believes 16-year-old Adonis Singh. He’s created a tool, Mcbench, that gives a model control over a Minecraft character and tests its ability to design structures, along the lines of Microsoft’s Project Malmo.
“I believe Minecraft tests the models on resourcefulness and gives them more agency,” he told TechCrunch. “It’s not nearly as restricted and saturated as [other] benchmarks.”
Using games to benchmark AI is nothing new. The idea dates back decades: Mathematician Claude Shannon argued in 1949 that games like chess were a worthy challenge for “intelligent” software. More recently, Alphabet’s DeepMind developed a model that could play Pong and Breakout; OpenAI trained AI to compete in Dota 2 matches; and Meta designed an algorithm that could hold its own against professional Texas hold ’em players.
But what’s different now is that enthusiasts are hooking up large language models (LLMs) — models with the ability to analyze text, images and more — to games to probe how good they are at logic.
There’s an abundance of LLMs out there, from Gemini and Claude to GPT-4o, and they all have different “vibes,” so to speak. They “feel” different from one interaction to the next, a phenomenon that can be difficult to quantify.
“LLMs are known to be sensitive to particular ways questions are asked, and just generally unreliable and hard to predict,” Calcraft said.
In contrast to text-based benchmarks, games provide a visual, intuitive way to compare how a model performs and behaves, said Matthew Guzdial, an AI researcher and professor at the University of Alberta.
“We can think of every benchmark as giving us a different simplification of reality focused on particular types of problems, like reasoning or communication,” he said. “Games are just other ways you can do decision-making with AI, so folks are using them like any other approach.”
Those familiar with the history of generative AI will note how similar Pictionary is to generative adversarial networks (GANs), in which a generator model produces images and a discriminator model judges whether each one is real or generated.
Calcraft believes that Pictionary can capture an LLM’s ability to understand concepts like shapes, colors and prepositions (e.g., the meaning of “in” versus “on”). He wouldn’t go so far as to say that the game is a reliable test of reasoning, but he argued that winning requires strategy and the ability to understand clues — neither of which models find easy.
“I also really like the almost adversarial nature of the Pictionary game, similar to GANs, where you have the two different roles: one draws and the other guesses,” he said. “The best one to draw is not the most artistic, but the one that can most clearly convey the idea to the audience of other LLMs (including to the faster, much less capable models!).”
“Pictionary is a toy problem that’s not immediately practical or realistic,” Calcraft cautioned. “That said, I do think spatial understanding and multimodality are critical elements for AI advancement, so LLM Pictionary could be a small, early step on that journey.”
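To make the drawer-and-guesser setup concrete, here is a minimal sketch in Python of how such a game could be scored. It is not Calcraft's code: the function names, the idea of representing a drawing as a plain string, and the toy stand-in "models" are all assumptions made for illustration. A drawer's score is simply the share of guesses, across every word and every guesser, that match the secret word.

```python
from typing import Callable, Dict, List

# Hypothetical interfaces (assumptions, not taken from the app in the article):
# a drawer turns a secret word into a drawing, here just a string, and a
# guesser turns a drawing into a guessed word.
Drawer = Callable[[str], str]
Guesser = Callable[[str], str]


def score_drawers(
    words: List[str],
    drawers: Dict[str, Drawer],
    guessers: Dict[str, Guesser],
) -> Dict[str, float]:
    """Return, for each drawer, the fraction of guesses that were correct."""
    scores: Dict[str, float] = {}
    for name, draw in drawers.items():
        correct = 0
        total = 0
        for word in words:
            drawing = draw(word)
            for guess in guessers.values():
                total += 1
                # Case-insensitive exact match counts as a correct guess.
                if guess(drawing).strip().lower() == word.lower():
                    correct += 1
        scores[name] = correct / total if total else 0.0
    return scores


# Toy stand-ins for real models, so the sketch runs without any API calls.
def clear_drawer(word: str) -> str:
    return f"label:{word}"  # a "drawing" that conveys the word plainly


def vague_drawer(word: str) -> str:
    return "label:blob"  # a "drawing" that conveys nothing useful


def literal_guesser(drawing: str) -> str:
    return drawing.split(":", 1)[1]  # reads the drawing faithfully


def lazy_guesser(drawing: str) -> str:
    return "cat"  # a weak guesser that always answers "cat"


scores = score_drawers(
    words=["cat", "house"],
    drawers={"clear": clear_drawer, "vague": vague_drawer},
    guessers={"literal": literal_guesser, "lazy": lazy_guesser},
)
print(scores)
```

In a real setup, each stand-in function would wrap a call to an actual model, and the exact-match check would likely need to be looser (synonyms, plurals). Including a deliberately weak guesser mirrors Calcraft's point that a good drawing should get through even to less capable models.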
Singh believes that Minecraft is a useful benchmark, too, and can measure reasoning in LLMs. “From the models I’ve tested so far, the results literally perfectly align with how much I trust the model for something reasoning-related,” he said.
Others aren’t so sure.
Mike Cook, a research fellow at Queen Mary University of London specializing in AI, doesn’t think Minecraft is particularly special as an AI testbed.
“I think some of the fascination with Minecraft comes from people outside of the games sphere who maybe think that, because it looks like ‘the real world,’ it has a closer connection to real-world reasoning or action,” Cook told TechCrunch. “From a problem-solving perspective, it’s not so dissimilar to a video game like Fortnite, Stardew Valley or World of Warcraft. It’s just got a different dressing on top that makes it look more like an everyday set of tasks like building things or exploring.”
To Cook’s point, even the best game-playing AI systems generally don’t adapt well to new environments, and can’t easily solve problems they haven’t seen before. For example, it’s unlikely a model that excels at Minecraft will play Doom with any real skill.
“I think the good qualities Minecraft does have from an AI perspective are extremely weak reward signals and a procedural world, which means unpredictable challenges,” Cook continued. “But it’s not really that much more representative of the real world than any other video game.”
Even so, there sure is something fascinating about watching LLMs build castles.