Benchmark Gaming: Microsoft's MAI-Code and the Eval Problem
Microsoft released MAI-Code-1-Flash, positioning it as a strong coding model. The HN thread immediately surfaced the key problem: it appears to have been trained on the SWE Bench Pro eval set, which would make its impressive scores suspect. The comment 'So it's trained on the SWE Bench Pro evalset' landed with no pushback, just grim recognition.
This is part of a broader pattern: every major lab is chasing benchmark numbers, and the benchmarks are turning into training targets rather than honest measurements, which makes them worthless as evidence. HN commenters also noted that Gemma 4 26B-A4B scored well with 20% fewer parameters, suggesting the efficiency race matters more than headline scores anyway.
The debate here is not really about MAI-Code specifically. It is about whether any published coding benchmark can be trusted at face value anymore. The answer from this crowd is: probably not.
So what?
If you are choosing AI coding tools for your team based on published benchmarks, you are flying blind. The only signal that matters now is production performance on your actual codebase. Run your own evals on representative tasks before committing to any model.
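A minimal sketch of what "run your own evals" can look like, in Python. Everything here is hypothetical and for illustration only: the Task structure, the pass_rate and compare helpers, and the two stub models stand in for real API clients and for tasks drawn from your own codebase. It is not any vendor's harness, just one plausible shape for a private comparison.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical type: a "model" is any function mapping a prompt to generated text.
Model = Callable[[str], str]


@dataclass
class Task:
    name: str
    prompt: str
    check: Callable[[str], bool]  # True if the model output is acceptable


def pass_rate(model: Model, tasks: List[Task]) -> float:
    """Fraction of tasks whose check passes; an exception counts as a failure."""
    if not tasks:
        raise ValueError("need at least one task")
    passed = 0
    for task in tasks:
        try:
            ok = task.check(model(task.prompt))
        except Exception:
            ok = False
        passed += int(ok)
    return passed / len(tasks)


def compare(models: Dict[str, Model], tasks: List[Task]) -> Dict[str, float]:
    """Run every model on the same private task set."""
    return {name: pass_rate(fn, tasks) for name, fn in models.items()}


def check_add(output: str) -> bool:
    """Execute the generated code and test its behavior.

    Assumption: in real use, untrusted model output runs in a sandbox.
    """
    namespace: dict = {}
    exec(output, namespace)
    return namespace["add"](2, 3) == 5


# Stub models standing in for real API calls (assumptions, not real clients).
def stub_model_a(prompt: str) -> str:
    return "def add(a, b):\n    return a + b"


def stub_model_b(prompt: str) -> str:
    return "def add(a, b):\n    return a - b"


if __name__ == "__main__":
    tasks = [Task("add", "Write a function add(a, b).", check_add)]
    print(compare({"model_a": stub_model_a, "model_b": stub_model_b}, tasks))
```

The point of the design is that the task set stays private and comes from your own repository, so a model cannot have been trained on it. A single toy task like the one above proves nothing; a few dozen representative tasks with behavioral checks give a far more honest signal than a published leaderboard.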