Infrastructure May 29, 2026 mixed ⇧ 294 pts across 3 threads

LLM Inference Speed Race Heating Up on Commodity Hardware

A thread on real-time LLM inference hitting 3,000 tokens per second per request on standard GPUs drew genuine interest on HN, along with healthy skepticism. The main pushback: the benchmark uses a 2B model, and comparing it to Groq is unfair because Groq runs much larger models. Still, the directional signal is real: people are actively working to make fast inference available without specialized hardware.

This sits alongside the Claude Code configuration thread, which surfaced many undocumented knobs for tuning AI agent behavior, and the AI code review thread, where teams are running orchestrated multi-model review pipelines locally. The common theme: builders are moving from 'use the cloud API' to 'run this thing myself, configure it precisely, and own the infrastructure.'

The Groq comparison debate is the interesting part. Groq's business model depends on inference speed being a moat. If commodity GPUs close the gap, even if only on smaller models, inference providers will face significant pricing pressure.

So what?

If you're building on top of inference APIs, watch this space. The cost-per-token story is going to keep improving, and the right answer for your architecture in 12 months may look very different from today. If you're selling inference or AI infrastructure, the window where speed is a defensible moat is narrowing faster than the public narrative suggests.
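To make the cost-per-token point concrete, here is a rough back-of-the-envelope sketch. It assumes a single request stream sustaining the 3,000 tokens per second quoted in the thread title on one rented GPU, with no batching or idle time. The hourly prices are hypothetical placeholders, not quotes from any provider, and the function name is made up for illustration.

```python
def cost_per_million_tokens(tokens_per_second: float, gpu_price_per_hour: float) -> float:
    """Dollars per one million generated tokens for one sustained stream.

    Simplifying assumptions: the GPU is fully utilized by a single stream,
    and the hourly price is the only cost (no networking, storage, or ops).
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_price_per_hour / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # 3,000 tokens/s comes from the thread title; the prices are hypothetical.
    for price in (0.50, 1.00, 2.00):
        cost = cost_per_million_tokens(3000, price)
        print(f"${price:.2f}/hour -> ${cost:.3f} per million tokens")
```

Under these assumptions, cost scales linearly with the hourly price and inversely with throughput, so doubling tokens per second halves the cost per token. Real deployments would also need to account for utilization and batching, which this sketch ignores.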

Read these

Real-time LLM Inference on Standard GPUs: 3k tokens/s per request

62 pts 37 comments NicoConstant

Claude Code – Everything You Can Configure That the Docs Don't Tell You

189 pts 33 comments ankitg12

Orchestrating AI code review at scale

43 pts 16 comments pramodbiligiri
