Local AI inference on consumer hardware is genuinely viable now
A thread on running local models surfaced real enthusiasm from builders who have moved away from AWS toward self-hosted setups. The discussion is practical: what a 96GB Mac Studio can actually run, how it compares to Claude Opus, and what speeds are realistic for embedding, image, video, and audio workloads. The tone is not theoretical. These are people who have already made the switch and are reporting back.
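The "what can it actually run" question usually starts with a back-of-the-envelope memory check: weight size is roughly parameter count times bits per weight, divided by eight. The sketch below is only that rough check. The 70B model size, the quantization levels, and the 25% headroom reserved for the OS, KV cache, and activations are illustrative assumptions, not figures from the thread.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in decimal GB.

    1 billion parameters at 8 bits per weight is about 1 GB.
    """
    return params_billions * bits_per_weight / 8


def fits_in_memory(
    params_billions: float,
    bits_per_weight: float,
    ram_gb: float,
    headroom_fraction: float = 0.25,  # assumed reserve for OS, KV cache, activations
) -> bool:
    """Return True if the weights fit in RAM after reserving headroom."""
    usable_gb = ram_gb * (1 - headroom_fraction)
    return weights_gb(params_billions, bits_per_weight) <= usable_gb


if __name__ == "__main__":
    # Hypothetical 70B-parameter model on a 96GB machine.
    for bits in (4, 8, 16):
        size = weights_gb(70, bits)
        verdict = "fits" if fits_in_memory(70, bits, 96) else "does not fit"
        print(f"70B at {bits}-bit: ~{size:.0f} GB of weights, {verdict}")
```

This ignores context length, which can add a lot of memory for long prompts, so treat a "fits" result as a starting point rather than a guarantee.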
The convergence of relatively affordable Mac Studio hardware, capable open-weights models like GLM-5.2, and frustration with cloud API costs is creating a real alternative to the hosted AI stack. The person who started the thread made a point of calling out budget constraints, pushing back on "just buy a 64GB Mac" advice that treats hardware cost as trivial. But the underlying signal is that the hardware is now good enough that the math can work for serious builders.
This ties directly to the GLM-5.2 thread. Better open models, viable local hardware, and the self-hosting experience builders gained after moving off AWS add up to a genuinely different infrastructure picture than the one that existed 18 months ago.
So what?
For founders building AI products, local inference is no longer just a privacy or latency story. It is also a cost story. If your product can tolerate slightly lower throughput, self-hosted open-weights models may now be cheaper per token than hosted APIs. Run the numbers before assuming you need OpenAI or Anthropic.
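One way to run those numbers is to amortize the hardware over its expected life, add electricity, and divide by the tokens the machine actually generates. The sketch below does only that. Every input (hardware price, lifetime, power draw, electricity rate, generation speed, utilization, and the hosted price it is compared against) is a placeholder assumption to be replaced with your own measurements and your provider's current pricing.

```python
HOURS_PER_MONTH = 730  # average hours in a month


def self_hosted_usd_per_million_tokens(
    hardware_cost_usd: float,
    lifetime_months: float,
    power_watts: float,
    electricity_usd_per_kwh: float,
    tokens_per_second: float,
    utilization: float,
) -> float:
    """Rough cost per million generated tokens on owned hardware.

    Assumes the machine is powered on all month but generates tokens
    only for the `utilization` fraction of that time (0 to 1).
    """
    amortized_hardware = hardware_cost_usd / lifetime_months
    energy = power_watts / 1000 * HOURS_PER_MONTH * electricity_usd_per_kwh
    tokens_per_month = tokens_per_second * 3600 * HOURS_PER_MONTH * utilization
    return (amortized_hardware + energy) / tokens_per_month * 1_000_000


if __name__ == "__main__":
    # All values below are hypothetical placeholders, not measurements.
    local = self_hosted_usd_per_million_tokens(
        hardware_cost_usd=3600,
        lifetime_months=36,
        power_watts=100,
        electricity_usd_per_kwh=0.10,
        tokens_per_second=20,
        utilization=0.5,
    )
    hosted = 10.0  # hypothetical hosted API price in USD per million tokens
    print(f"self-hosted: ${local:.2f} per million tokens")
    print("self-hosted is cheaper" if local < hosted else "hosted is cheaper")
```

Utilization dominates the result: the same machine at 5% utilization costs ten times more per token than at 50%, which is why the comparison depends on how steady your workload is. The sketch also leaves out engineering time and any quality gap between models.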