LLM tool builders are filling perception gaps in video and the browser
Two Show HN projects landed close together and point at the same gap. Claude-real-video was built on the premise that no LLM 'actually sees' a video: Claude rejects video files, ChatGPT only reads transcripts, and Gemini samples at 1 fps. The builder's answer is to extract frame-by-frame detail so that any LLM can work from what is actually on screen. The second project, a Safari MCP server for web developers, wires LLMs into browser dev tools so that an AI can observe and interact with a live browser session.
Both projects exist because the flagship LLM products have obvious, annoying gaps that third-party builders are rushing to fill. The video gap is particularly stark: video is everywhere, and none of the major models handles it well natively.
Commenters on the video tool noted limits even with the workaround: LLMs still struggle to infer specific animations and motion design details, even from dense frame samples. The gap is smaller but not closed.
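To make the sampling-rate point concrete, here is a minimal sketch of which moments a fixed-rate sampler actually captures. It is not the tool's pipeline, which the post does not describe; the function name sample_timestamps and the 0.4-second clip length are hypothetical, chosen only for illustration.

```python
import math


def sample_timestamps(duration_s: float, fps: float) -> list[float]:
    """Return the times (in seconds) at which a fixed-rate sampler grabs frames.

    Hypothetical helper for illustration only; not taken from any real tool.
    """
    if duration_s < 0 or fps <= 0:
        raise ValueError("duration_s must be >= 0 and fps must be > 0")
    # Number of sample points that fall before the end of the clip.
    count = math.ceil(duration_s * fps)
    return [round(i / fps, 3) for i in range(count)]


# A 0.4-second UI transition, sampled at two different rates.
sparse = sample_timestamps(0.4, 1)   # one frame: the motion is invisible
dense = sample_timestamps(0.4, 10)   # four frames: the motion is at least present
```

At 1 fps the whole transition collapses into a single still, so there is no motion for the model to reason about. Denser sampling puts the intermediate frames in front of the model, but, as the commenters note, seeing the frames is not the same as inferring the animation from them.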
So what?
There's a real market right now in 'LLM middleware' that patches perception gaps in the major models. Video understanding, browser interaction, and real-time data access are all underserved. If you can ship a tight, reliable solution for one of these gaps, you have a window of maybe 12-18 months before the model providers close it natively and it becomes a commodity.