AI June 8, 2026 bearish ⇧ 659 pts across 2 threads

LLM benchmark credibility is collapsing

A thread claiming DeepSeek V4 Pro beats GPT-5.5 Pro got picked apart immediately. The benchmark was 4 tasks, judged by Grok, with a final score of 38 to 33. HN commenters were unsparing. One noted that GPT-5.5 Pro also struggles with structured output, randomly adding fields and changing types, which is a real production pain point rather than a benchmark artifact. Another shared a vulnerability-scanning benchmark in which GPT-5.5 Pro underperformed as well.
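The structured-output complaint is the kind of failure you can check mechanically. Below is a minimal sketch, not anything from the thread: `EXPECTED_SCHEMA` and `schema_violations` are hypothetical names, and the flat field-to-type mapping is an assumed stand-in for whatever schema your feature really uses.

```python
from typing import Any

# Hypothetical expected shape for one model response (assumption for illustration).
EXPECTED_SCHEMA: dict[str, type] = {"title": str, "severity": int, "tags": list}


def schema_violations(payload: dict[str, Any], schema: dict[str, type]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the payload conforms."""
    problems: list[str] = []
    # Missing fields and changed types.
    for key, expected in schema.items():
        if key not in payload:
            problems.append(f"missing field: {key}")
        elif type(payload[key]) is not expected:
            # Exact type match, so a bool is not silently accepted as an int.
            got = type(payload[key]).__name__
            problems.append(f"wrong type for {key}: expected {expected.__name__}, got {got}")
    # Fields the model added on its own.
    for key in payload:
        if key not in schema:
            problems.append(f"unexpected field: {key}")
    return problems
```

Running this over a few hundred real responses and counting non-empty results gives a drift rate you can compare across models, which is more useful than a headline score.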

The pattern is that LLM benchmarks are now so easy to game, or to accidentally design badly, that the community has largely stopped trusting them at face value. Every claim gets interrogated: who was the judge, how many tasks, what was the prompt, who ran it. This is healthy skepticism, but it also means model selection for production use is increasingly anecdotal and task-specific.

The Lathe project (using LLMs to learn domains rather than skip them) is a counterpoint: it reflects a more grounded, use-case-first view of what LLMs are actually good for, rather than chasing leaderboard positions.


So what?

If you're choosing a model for a production feature, ignore headline benchmarks and run your own evals on your actual data and tasks. A four-task benchmark judged by a competing model tells you almost nothing about how a model will behave on your workload. Build a small benchmark suite specific to your use case and rerun it every time a new model drops.
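A use-case-specific suite does not need to be elaborate. The sketch below is one assumed shape for it: `EvalCase`, `run_suite`, `pass_rate`, the two sample cases, and `stub_model` are all hypothetical, and the stub stands in for a real API call. Checks are plain deterministic functions, so no judge model is involved.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # deterministic pass/fail on the raw model output


def run_suite(model: Callable[[str], str], cases: list[EvalCase]) -> dict[str, bool]:
    """Run every case against one model and record pass/fail per case name."""
    return {case.name: case.check(model(case.prompt)) for case in cases}


def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of cases passed; 0.0 for an empty suite."""
    return sum(results.values()) / len(results) if results else 0.0


# Hypothetical cases; replace with prompts and checks drawn from your own data.
CASES = [
    EvalCase(
        "extracts_total",
        "Invoice total is $41.50. Reply with the number only.",
        lambda out: out.strip() == "41.50",
    ),
    EvalCase(
        "admits_unknown",
        "What is the order ID? Reply UNKNOWN if it was not given.",
        lambda out: out.strip().upper() == "UNKNOWN",
    ),
]


def stub_model(prompt: str) -> str:
    """Placeholder for a real model call; swap in your provider's client here."""
    return "41.50" if "total" in prompt else "I think it is 1234"


results = run_suite(stub_model, CASES)
```

Keeping the per-case results, not just the aggregate, makes it easy to see which specific behaviors regress when you rerun the suite against a new model.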

Read these