AI benchmarks are already obsolete
The Senior SWE-Bench thread introduced a benchmark that tries to evaluate AI agents at the level of a senior software engineer. The top solve rate right now is 24%, held by Claude Opus 4.8. But the discussion immediately turned to two problems: nobody knows what a senior engineer would score on these tasks, and the tasks themselves are too subjective to be truly informative.
Kimi K2.7 Code landing in GitHub Copilot the same day adds context. Commenters there are excited that a mainstream IDE finally offers a non-OpenAI, non-Anthropic model, but they quickly ask when DeepSeek will be available and whether Copilot will allow custom model integration. The benchmark and the product launch are both symptoms of the same thing: the field is moving faster than the infrastructure for measuring it.
The 'Fall of the Theorem Economy' essay also surfaced today, arguing that mathematical research culture is being distorted by AI hype. Commenters called it unusually well-reasoned. Taken together, these threads suggest that evaluation, whether of AI code agents, mathematical proofs, or individual engineer output, is becoming one of the hardest unsolved problems in the industry.
So what?
If you're selling AI coding tools or agents, your benchmark claims are going to be scrutinized harder as the limits of these benchmarks (no human baseline, subjective tasks) become common knowledge. Invest in evaluation methodology that your customers understand and trust, not just top-line numbers. The companies that win will be the ones whose customers can verify the claims themselves.
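One way to make a top-line number checkable is to publish per-task outcomes alongside it and report the uncertainty, not just the headline rate. The following is a minimal sketch under stated assumptions: the task IDs, the 100-task sample size, and the `summarize` helper are all hypothetical and are not taken from Senior SWE-Bench or any vendor's methodology. It uses the standard Wilson score interval for a binomial proportion.

```python
import math


def wilson_interval(solved: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = solved / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def summarize(results: dict[str, bool]) -> dict[str, float]:
    """Turn per-task pass/fail outcomes into a solve rate with an interval."""
    solved = sum(results.values())
    total = len(results)
    low, high = wilson_interval(solved, total)
    return {"solve_rate": solved / total, "ci_low": low, "ci_high": high}


# Hypothetical data: 100 made-up tasks, 24 of them solved.
# The real benchmark's task count is not stated in the thread.
results = {f"task-{i}": i < 24 for i in range(100)}
report = summarize(results)
print(report)
```

With these made-up numbers, a 24% solve rate comes with an interval of roughly 17% to 33%, which is the kind of context a bare headline figure hides. A customer who has the per-task results can rerun the same calculation and check the claim.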