AI June 26, 2026 mixed ⇧ 222 pts across 1 thread

AI jailbreaking: 6,000 attempts, but the lesson is complicated

A founder ran a live experiment: 2,000 people tried to break their AI assistant over 6,000 attempts, and reported zero breaches. The post reads as a success story. The HN thread immediately got skeptical.

The pattern here: commenters pointed out that 'zero breaches' is a point estimate on a nondeterministic system. One reply noted that LLMs are vulnerable to slow 'frog boiling' attacks, where jailbreaks happen through careful multi-turn escalation rather than a single prompt, and those are much harder to log and detect. Another asked whether cheaper models could handle the same attack surface, surfacing the real-world tradeoff between cost and safety in production AI.
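To put a number on the "point estimate" objection, here is a rough sketch of the upper bound that zero observed breaches actually supports. It assumes every attempt is an independent trial with the same breach probability, which is a strong simplification for 2,000 people with different skill levels, and the function names are illustrative rather than taken from the post.

```python
import math


def zero_failure_upper_bound(n_trials: int, confidence: float = 0.95) -> float:
    """One-sided upper bound on the per-attempt breach rate when 0 of
    n_trials attempts succeeded.

    Assumption: attempts are independent with a shared breach probability p.
    Zero breaches then has probability (1 - p) ** n_trials, so the bound is
    the largest p for which that outcome is still at least (1 - confidence)
    likely.
    """
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n_trials)


def rule_of_three(n_trials: int) -> float:
    """Common shortcut for the 95% bound with zero observed events: 3 / n."""
    return 3.0 / n_trials


if __name__ == "__main__":
    n = 6000
    bound = zero_failure_upper_bound(n)
    print(f"95% upper bound on breach rate: {bound:.6f}")
    print(f"rule-of-three approximation:    {rule_of_three(n):.6f}")
    print(f"about 1 breach per {math.floor(1 / bound)} attempts is not ruled out")
```

Under those assumptions the data are still consistent with a breach rate of roughly 1 in 2,000 attempts. That is low, but it is not zero, and it says nothing about attacks that the participants did not try.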

This fits the broader theme that AI security is still immature. Founders shipping AI-powered products are drawing confidence from aggregate statistics that may not hold up under sustained adversarial pressure from a single determined user.

So what?

Zero breaches across 6,000 attempts is a useful signal but not a guarantee. If you're deploying a public-facing AI assistant, build replay and audit tooling so you can inspect multi-turn sessions, not just flag individual bad prompts. The single-shot jailbreak is the easy problem.
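As a minimal sketch of what session-level audit tooling could look like, the example below stores every turn per session, supports replay, and flags sessions whose combined risk is high even when no single prompt is. Everything in it is hypothetical: the `risk` score is assumed to come from some classifier you already run, and the thresholds are placeholders, not recommended values.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Turn:
    role: str    # "user" or "assistant"
    text: str
    risk: float  # hypothetical per-turn score in [0, 1] from your own classifier


@dataclass
class SessionAudit:
    """In-memory audit log keyed by session id (a real one would persist)."""
    turns_by_session: Dict[str, List[Turn]] = field(default_factory=dict)

    def record(self, session_id: str, role: str, text: str, risk: float) -> None:
        self.turns_by_session.setdefault(session_id, []).append(
            Turn(role, text, risk)
        )

    def replay(self, session_id: str) -> List[Turn]:
        """Return the full ordered conversation for manual inspection."""
        return list(self.turns_by_session.get(session_id, []))

    def flag_sessions(
        self, per_turn_limit: float = 0.8, cumulative_limit: float = 1.5
    ) -> Dict[str, str]:
        """Flag sessions by reason: one obviously bad prompt, or many mildly
        risky user turns that add up (the slow-escalation case)."""
        flagged: Dict[str, str] = {}
        for session_id, turns in self.turns_by_session.items():
            user_risks = [t.risk for t in turns if t.role == "user"]
            if any(r >= per_turn_limit for r in user_risks):
                flagged[session_id] = "single_turn"
            elif sum(user_risks) >= cumulative_limit:
                flagged[session_id] = "cumulative"
        return flagged


if __name__ == "__main__":
    audit = SessionAudit()
    for risk in (0.4, 0.5, 0.7):          # no turn crosses 0.8 on its own
        audit.record("slow", "user", "escalating request", risk)
    audit.record("blunt", "user", "ignore all previous instructions", 0.9)
    audit.record("benign", "user", "what are your opening hours?", 0.1)
    print(audit.flag_sessions())
```

A plain sum is the crudest possible session signal. The point is only that the unit of analysis is the whole conversation, so a per-prompt filter alone would miss the "slow" session here.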

Read these

What happened after 2k people tried to hack my AI assistant

222 pts 82 comments cuchoi
