
We're sure that, like us, you're all pretty sick of hearing the term 'AI' everywhere you look, so forgive us for spotlighting yet another example here—but this one is relevant to all of our interests, honest.
TechCrunch has reported that 1985's NES classic Super Mario Bros. is being used to benchmark the problem-solving performance of modern AI models. Hao AI Lab, a research organisation based at the University of California San Diego, selected four AI models and tasked them with taking on Nintendo's iconic 8-bit platformer.
Anthropic’s Claude 3.7 came out on top, with its relative Claude 3.5 coming second. Google’s Gemini 1.5 Pro and OpenAI’s GPT-4o were at the back of the pack and seemed to struggle quite badly.
To be clear, the researchers tinkered with the setup slightly. The game ran under emulation, and Hao's GamingAgent framework, which lets an AI control the on-screen action, was used to feed instructions to each model. The model would then generate its inputs in the form of Python code.
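To get a feel for what that loop might look like, here's a minimal sketch. To be clear, every name below (`parse_actions`, `step`, the `press("...")` convention) is hypothetical and invented for illustration; it is not Hao AI Lab's actual GamingAgent code, which we haven't examined.

```python
import re

# Hypothetical sketch of an agent loop: the model replies with Python-style
# input code, and a harness parses it into controller state for the emulator.
# None of these names come from the real GamingAgent framework.

NES_BUTTONS = {"A", "B", "up", "down", "left", "right", "start", "select"}

def parse_actions(model_output: str) -> list[str]:
    """Pull button presses out of model-generated code like press("right")."""
    presses = re.findall(r'press\(["\'](\w+)["\']\)', model_output)
    return [b for b in presses if b in NES_BUTTONS]

def step(model_output: str) -> dict[str, bool]:
    """Turn one model reply into a controller state for the next frame."""
    state = {b: False for b in NES_BUTTONS}
    for button in parse_actions(model_output):
        state[button] = True
    return state

# Example: the model 'responds' with input code to dodge an enemy.
reply = 'press("right")\npress("A")  # jump over the Goomba'
state = step(reply)
```

The latency point in the next paragraph falls out of this structure: every frame the agent acts on is only as fresh as the model's last reply, so a slow "reasoning" pass means stale inputs.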
The interesting thing here is that Hao found "reasoning" models like OpenAI's o1 fared worse than "non-reasoning" models, despite generally performing better on other, non-gaming benchmarks. This is because reasoning models—as the name suggests—take a little time to pick an action, and that delay can be the difference between life and a game over in a title like Super Mario Bros., as we know all too well.
Before we place too much stock in this research, it's worth noting that, as TechCrunch reports, some people in the industry don't think video games are a good way to benchmark AI at all.