LLMs disagree with each other 67 percent of the time on facts
A Hacker News post shared research showing that five frontier LLMs (GPT-4, Claude Opus, Gemini Pro, Gemini Pro with Search, and Sonar Pro) disagree with each other on 67 percent of 1,000 real-world fact-check claims. The author reports a 95 percent confidence interval of 64 to 70 percent, so the estimate is reasonably tight for a sample of this size. One commenter responded 'They get more human by the day,' which captures the darkly funny implication: models are replicating the epistemic disagreement of human experts, not resolving it.
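As a sanity check on that interval, here is a minimal sketch of the arithmetic. It assumes a simple normal-approximation (Wald) interval for a proportion; the post does not say how the author computed theirs, so the function name and method here are illustrative only.

```python
import math


def wald_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation (Wald) confidence interval for a proportion.

    Assumption: this is one common textbook method, not necessarily
    the one used in the original research.
    """
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    margin = z * standard_error
    return p_hat - margin, p_hat + margin


# 67 percent disagreement observed across 1,000 claims (figures from the post).
low, high = wald_interval(0.67, 1000)
print(f"95% CI: {low:.3f} to {high:.3f}")
```

Under this assumption the bounds land at roughly 64 and 70 percent, which is consistent with the range the author reports.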
The finding connects directly to a separate Reddit thread arguing that most AI product failures are not about model quality but about memory and context handling: in production, the AI forgets earlier context, repeats itself, contradicts previous answers, and loses track of workflows. Both threads point at the same underlying problem: LLMs are not reliable truth machines, and building products that depend on them being reliably correct is risky.
The pattern for founders is that AI products need to be designed around model unreliability rather than in spite of it. Human-in-the-loop checkpoints, confidence scoring, ensemble approaches, and explicit uncertainty disclosure are not nice-to-haves; they are the product.
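To make that concrete, here is a rough sketch of how an ensemble check could feed a human-in-the-loop checkpoint. Everything in it is hypothetical: the `aggregate` function, the `Verdict` type, the model keys, and the 0.8 agreement threshold are illustrative assumptions, not anything described in either thread. It also assumes each model's answer has already been reduced to a short label such as "true" or "false".

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Verdict:
    label: str          # the most common answer across models
    agreement: float    # share of models that gave that answer
    needs_review: bool  # True when agreement is below the threshold


def aggregate(answers: dict[str, str], min_agreement: float = 0.8) -> Verdict:
    """Combine per-model labels into one verdict with an agreement score.

    `answers` maps a (hypothetical) model name to its label for one claim.
    The 0.8 default threshold is an arbitrary illustrative choice.
    """
    if not answers:
        raise ValueError("at least one model answer is required")
    # Normalize case and whitespace so "True " and "true" count as the same label.
    counts = Counter(label.strip().lower() for label in answers.values())
    label, votes = counts.most_common(1)[0]
    agreement = votes / len(answers)
    return Verdict(label, agreement, needs_review=agreement < min_agreement)


# Hypothetical labels from five models for a single claim.
verdict = aggregate({
    "model_a": "true",
    "model_b": "true",
    "model_c": "true",
    "model_d": "false",
    "model_e": "false",
})
if verdict.needs_review:
    print(f"Escalate to a human: only {verdict.agreement:.0%} of models agree")
else:
    print(f"Show '{verdict.label}' with {verdict.agreement:.0%} agreement")
```

The agreement score doubles as an uncertainty signal for the UI: the product can show it next to the answer instead of presenting a split verdict as settled fact.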
So what?
If your AI product makes factual claims, you have both a liability problem and a trust problem. You have three options: design explicit uncertainty signals into your UX, add retrieval-augmented grounding to reduce hallucination rates, or scope your product to tasks where being wrong is low-stakes. Launching a fact-dependent AI product without addressing model disagreement is building on sand.