Infrastructure July 4, 2026 bullish ⇧ 636 pts across 2 threads

Running SOTA LLMs Locally Is Getting Real

A guide to running state-of-the-art LLMs locally landed on the front page and sparked a sharp hardware debate. The specific fight: 2x RTX 3090s for 48GB VRAM at around $3k versus an M5 MacBook Pro with unified memory that delivers similar capacity at the same price point with less friction. The middle ground, 96GB VRAM from something like GMKtec's EVO-X2, has no clean answer yet.

The pattern here is that local inference is no longer a hobbyist curiosity. People are seriously comparing price-per-VRAM-gigabyte across consumer GPUs, Apple Silicon, and mini-PC form factors, and the thread on 'Performance per dollar is getting faster and cheaper' reinforced it. Someone flagged that agentic coding workloads are a 'massive unlock' for underutilized compute architectures.

The counterpoint: someone questioned whether Qwen qualifies as SOTA at all, which is a fair reminder that 'local SOTA' is doing a lot of work in these conversations.

So what?

If you are building tools or products that depend on cloud inference costs, the local compute trend is coming for your margin assumptions. The hardware is getting good enough that serious developers will run models locally for privacy, latency, and cost reasons. Build your architecture to work with either endpoint.

Read these

Jamesob's guide to running SOTA LLMs locally

365 pts 167 comments livestyle

Performance per dollar is getting faster and cheaper

271 pts 100 comments latchkey

← Back to today