Artificial intelligence is taking on a new kind of challenge—Minecraft build competitions. A high school senior, Adi Singh, has launched MC-Bench, an innovative website where AI models compete head-to-head to build structures in Minecraft. Users vote on their favorite builds before discovering which AI was responsible for each one. This creative project offers a fresh way to evaluate AI performance and progress.
Why Minecraft?
Minecraft, the best-selling video game of all time, is widely recognized for its signature blocky graphics, making it an ideal medium for AI-generated designs. Even those unfamiliar with the game can easily compare two structures and decide which one looks better.
“Minecraft allows people to see AI development progress much more easily,” Singh explained. “People are used to Minecraft, used to the look and the vibe.”
By leveraging a game millions of people are already familiar with, MC-Bench provides an intuitive and engaging way to assess AI creativity and problem-solving abilities.
How MC-Bench Works
MC-Bench operates with the help of eight dedicated volunteers. Although major AI firms such as Anthropic, Google, OpenAI, and Alibaba contribute access to their models, they are not officially affiliated with the project.
Currently, the platform focuses on simple build tasks, serving as a way to track AI progress from earlier models like GPT-3 to the latest versions. However, Singh envisions expanding the competition to include more complex building challenges in the future.
“Right now, we are just doing simple builds to reflect on how far we’ve come since the GPT-3 era,” Singh said. “But we could scale to longer-form plans and goal-oriented tasks.”
Video Games as AI Testing Grounds
Evaluating AI performance can be difficult. Traditional AI benchmarks often favor models that have been trained to solve specific problems. For instance, an AI might score in the 88th percentile on the LSAT yet struggle with simple tasks, like counting the R's in “strawberry.”
Anthropic’s Claude 3.7 Sonnet is an example—it excels in software engineering assessments but struggles to play Pokémon as well as a five-year-old.
Because of this discrepancy, researchers often turn to video games as AI testing environments. Games like Pokémon Red, Street Fighter, and Pictionary have been used to assess AI reasoning and adaptability in controlled settings. MC-Bench is the latest addition to this growing trend, offering a unique way to measure AI’s problem-solving capabilities.
Why MC-Bench Matters
Technically, MC-Bench functions as a programming benchmark, since AI models generate code to construct their Minecraft builds. However, unlike traditional code-based tests, MC-Bench allows users to visually judge AI performance. It’s much easier for someone to evaluate the quality of a Minecraft snowman than to analyze lines of complex programming.
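To make the idea concrete, here is a toy sketch of the kind of build script a model might emit for a prompt like “build a snowman.” The command format, coordinates, and function name are illustrative assumptions, not MC-Bench's actual interface:

```python
def snowman_commands(x=0, y=64, z=0):
    """Return hypothetical Minecraft /setblock commands for a minimal snowman.

    Stacks two snow blocks topped with a carved pumpkin for the head.
    The coordinate scheme and command syntax here are assumptions for
    illustration, not MC-Bench's real build API.
    """
    blocks = ["snow_block", "snow_block", "carved_pumpkin"]
    return [
        f"setblock {x} {y + dy} {z} minecraft:{block}"
        for dy, block in enumerate(blocks)
    ]

for cmd in snowman_commands():
    print(cmd)
```

A human voter never needs to read this code: they simply look at the resulting snowman in-game and compare it against a rival model's attempt, which is the whole point of the benchmark.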
According to Singh, the leaderboard rankings on MC-Bench align closely with real-world AI performance, making it a valuable tool for researchers and AI developers.
“The current leaderboard reflects quite closely to my own experience of using these models, which is unlike a lot of pure text benchmarks,” he noted. “Maybe MC-Bench could help companies see if they’re heading in the right direction.”
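A standard way to turn head-to-head votes like MC-Bench's into a leaderboard is an Elo-style rating update; whether MC-Bench uses exactly this scheme is an assumption, but the sketch below shows how pairwise preferences can produce a ranking:

```python
def elo_update(rating_a, rating_b, a_won, k=32):
    """Update two Elo ratings after one head-to-head vote.

    a_won is 1.0 if build A was preferred, 0.0 if build B was.
    k controls how much a single vote moves the ratings.
    Note: this is a generic Elo sketch, not MC-Bench's confirmed method.
    """
    # Expected score for A given the current rating gap.
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    rating_a += k * (a_won - expected_a)
    rating_b += k * ((1 - a_won) - (1 - expected_a))
    return rating_a, rating_b

# Two models start equal; model A wins the first vote.
a, b = elo_update(1000, 1000, 1.0)
print(round(a), round(b))  # → 1016 984
```

Because the update is zero-sum, ratings stay comparable as more votes accumulate, and upsets against higher-rated models move the leaderboard more than expected wins do.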
As AI continues to evolve, MC-Bench could become a key platform for tracking AI creativity, reasoning, and decision-making skills in a fun and accessible way. Whether it’s simple structures today or large-scale architectural projects in the future, this unique competition is already making waves in the AI and gaming communities.