AI Briefing — 2026-05-05
Tuesday, 5 May 2026
Covering Tue 05 May 00:00 → Wed 06 May 00:00 (24h)
OpenAI launched GPT-5.5 Instant as the new default ChatGPT model, Ollama shipped Gemma 4 MTP speculative decoding for Mac with 2x speed gains, and a practical r/LocalLLaMA post quantified exactly when local models beat cloud — directly relevant to your hybrid routing setup.
Must read
- Ollama v0.23.1: Gemma 4 MTP speculative decoding on Mac — Over 2x speed increase for Gemma 4 31B coding tasks on Apple Silicon via MTP in the MLX runner — directly upgrades your local coding model options.
- Practitioner measured 10 days of coding tasks: 72% could run locally — Concrete data on local-vs-cloud routing thresholds for coding workflows — validates and refines the hybrid routing decisions you’re making through LiteLLM.
- GPT-5.5 Instant: OpenAI’s new default model — New model tier in the API with reduced hallucinations and improved personalisation — worth evaluating against Claude via your LiteLLM gateway for cost/quality tradeoffs.
- Stop Sending IDE-Catchable AI Code Errors to Review — Directly addresses your 22,000-line PR verification problem — argues for pre-review automated gates to catch AI-generated error patterns before they hit human reviewers.
- Transformers v5.8.0 adds DeepSeek-V4 support — Official HuggingFace support for DeepSeek-V4’s new architecture means easier local experimentation and fine-tuning paths for your team.
Tools & Frameworks
Stop Sending IDE-Catchable AI Code Errors to Review
JetBrains argues that AI-generated code carries distinct error patterns (unused imports, type mismatches, dead code) that IDEs already detect. They propose automated pre-review gates to filter these before human reviewers see them.
Why this matters: Actionable pattern for your code review augmentation layer — catches the leaf-node quality issues from parallel agent workstreams before they burden reviewers.
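A minimal sketch of such a gate, assuming a Python codebase; ruff and mypy stand in here for the IDE-grade checks JetBrains describes (the tool choice is ours, not theirs):

```python
#!/usr/bin/env python3
"""Pre-review gate: fail the pipeline on IDE-catchable errors.

Sketch only, assuming a Python repo; swap in your own toolchain.
"""
import subprocess
import sys

# Each check targets an AI-typical error class from the post:
# F401 = unused imports, F841 = dead local variables, mypy = type mismatches.
CHECKS = [
    ["ruff", "check", "--select", "F401,F841", "."],
    ["mypy", "--no-error-summary", "."],
]

def main() -> int:
    # Run every check so the author sees all failures at once.
    results = [subprocess.run(cmd).returncode for cmd in CHECKS]
    if any(code != 0 for code in results):
        print("Gate failed: fix IDE-catchable errors before requesting review.")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

Wired in as a required CI step or pre-push hook, flagged changes never reach a reviewer's queue.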
Vercel CLI now exposes observability metrics for coding agents
New vercel metrics CLI command lets agents query performance, reliability, and security data programmatically. Explicitly designed for coding agent consumption.
Why this matters: Your stack includes Vercel — your headless agents can now pull production metrics directly when debugging or optimising deployments.
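A sketch of how a headless agent might consume it; the bare argument list and JSON output shape are assumptions, so check vercel metrics --help for the real interface:

```python
import json
import subprocess

def fetch_vercel_metrics() -> dict:
    """Pull deployment metrics for an agent to reason over.

    The command name comes from the release; the argument list and
    JSON output schema below are assumptions.
    """
    out = subprocess.run(
        ["vercel", "metrics"],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out)

if __name__ == "__main__":
    print(json.dumps(fetch_vercel_metrics(), indent=2))
```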
SGLang v0.5.11: Speculative Decoding V2 by default, CUDA 13
Spec V2 with overlap scheduling is now the default, reducing per-step CPU cost. The release also adds CUDA 13 and PyTorch 2.11 support across the stack.
Why this matters: If you’re self-hosting any inference (even for eval), SGLang’s spec decoding improvements materially reduce latency for MoE models like DeepSeek-V4.
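A minimal launch sketch; the model id and --tp degree are placeholders (--model-path, --tp, and --port are standard SGLang server flags), and per the release notes Spec V2 needs no extra flags:

```python
import subprocess

# Spec V2 with overlap scheduling is on by default in v0.5.11, so no
# speculative-decoding flags are needed. Model id is a hypothetical Hub id.
subprocess.run([
    "python", "-m", "sglang.launch_server",
    "--model-path", "deepseek-ai/DeepSeek-V4",
    "--tp", "8",       # tensor parallelism for a large MoE; tune to your GPUs
    "--port", "30000",
], check=True)
```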
Open Models & Local
Ollama v0.23.1: Gemma 4 MTP on MLX
Multi-token prediction speculative decoding for Gemma 4 31B is now supported on Macs via the MLX runner, delivering a more than 2x speedup on coding tasks. Single command: ollama run gemma4:31b-coding-mtp-bf16.
Why this matters: This makes Gemma 4 31B a genuinely competitive local coding model on your Apple Silicon hardware — test against your current Qwen/DeepSeek local setup.
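A quick way to measure the gain yourself, using Ollama's standard /api/generate endpoint and its eval_count/eval_duration response fields; the Qwen tag below is a placeholder for whatever you currently run:

```python
import requests

def tokens_per_sec(model: str, prompt: str) -> float:
    # /api/generate with stream=False returns eval_count (tokens generated)
    # and eval_duration (nanoseconds) in its response body.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    ).json()
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)

prompt = "Write a Python function that merges two sorted lists."
for model in ("gemma4:31b-coding-mtp-bf16", "qwen3.6:27b"):  # second tag assumed
    print(f"{model}: {tokens_per_sec(model, prompt):.1f} tok/s")
```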
MTP on Strix Halo with llama.cpp PR #22673
Community testing of multi-token prediction in llama.cpp on an AMD AI Max 395 with a Qwen3.6-35B MTP GGUF shows significant throughput gains. The PR is not yet merged but is functional.
Why this matters: MTP support landing in llama.cpp means the technique will soon be available beyond Ollama/MLX — watch for the merge if you run llama.cpp directly.
10-day local vs cloud routing experiment with real coding tasks
A developer logged 150 coding tasks over 10 days, then re-ran each on a local Qwen 3.6 27B and on a cloud frontier model. 72% of tasks (file reads, refactors, explanations) ran locally with equivalent quality; only complex multi-file reasoning needed the frontier model.
Why this matters: Empirical routing data for your LiteLLM gateway config — suggests aggressive local-first routing is viable for the majority of agent sub-tasks.
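A minimal LiteLLM Router sketch of that local-first policy, with cloud fallback on local failure; the model names and Ollama base URL are placeholders for your own config:

```python
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "coding-local",
            "litellm_params": {
                "model": "ollama/qwen3.6:27b",        # assumed local tag
                "api_base": "http://localhost:11434",
            },
        },
        {
            "model_name": "coding-frontier",
            # Placeholder frontier deployment; use your actual provider/model.
            "litellm_params": {"model": "anthropic/claude-sonnet-4-5"},
        },
    ],
    # On local errors or timeouts, retry the request on the frontier model.
    fallbacks=[{"coding-local": ["coding-frontier"]}],
)

resp = router.completion(
    model="coding-local",  # local-first for the ~72% of routine sub-tasks
    messages=[{"role": "user", "content": "Explain what this diff changes: ..."}],
)
print(resp.choices[0].message.content)
```

Routing by task class (file reads and refactors local, multi-file reasoning to the frontier alias) is the refinement the experiment's breakdown suggests.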
HuggingFace Transformers v5.8.0: DeepSeek-V4 architecture support
Adds full model support for DeepSeek-V4’s new hybrid attention and routing architecture. Also includes several other model additions.
Why this matters: Official transformers support accelerates the path to local quantised DeepSeek-V4 variants — relevant as the model reportedly closes the gap with frontier.
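Loading it should follow the standard transformers path once v5.8.0 lands; the Hub id below is a guess, not from the release notes:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-V4"  # hypothetical Hub id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # shard across available devices (needs accelerate)
)

inputs = tok("def quicksort(xs):", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```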
Industry & Trends
GPT-5.5 Instant launched as ChatGPT default
New model tier positioned between GPT-5 and GPT-5.5 on the cost/capability curve. OpenAI claims reduced hallucinations and improved personalisation. Available via the API.
Why this matters: Another model to benchmark in your LiteLLM routing layer — the ‘Instant’ tier suggests a latency/cost sweet spot that may compete with Claude Haiku for agent sub-tasks.
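A minimal A/B probe through LiteLLM; the model ids are guesses at the API names, so confirm them against each provider's published model list before wiring into routing:

```python
import time

import litellm

prompt = [{"role": "user", "content": "Refactor this function to be pure: ..."}]

# Model ids below are assumed names for the new tiers.
for model in ("openai/gpt-5.5-instant", "anthropic/claude-haiku-4-5"):
    start = time.perf_counter()
    resp = litellm.completion(model=model, messages=prompt)
    latency = time.perf_counter() - start
    print(f"{model}: {latency:.2f}s, {resp.usage.completion_tokens} completion tokens")
```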
GPT-5.5 Instant System Card
Technical safety evaluation and capability details for the new model, including coding benchmarks and tool-use performance.
Why this matters: Primary source for evaluating whether GPT-5.5 Instant is worth adding to your model roster — check the coding and tool-use sections specifically.
200M tokens in 5 days: the economics of always-on local agents
A user running a Hermes agent on a local Qwen-397B consumed 200M tokens in 5 days of routine tasks. At cloud pricing that volume would cost $250+ per week; local hardware amortises to near-zero marginal cost.
Why this matters: Validates your overnight-agent-factory economics — at agent-scale token volumes, local inference ROI is dramatic even with modest hardware.
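A back-of-envelope check on those numbers; the blended cloud price per million tokens is an assumption, so plug in your actual mix:

```python
# Reported: 200M tokens over 5 days of always-on agent work.
tokens, days = 200e6, 5
price_per_m = 1.25  # assumed blended $/1M tokens; substitute your real rates

weekly_tokens = tokens / days * 7                    # ~280M tokens/week
weekly_cloud_cost = weekly_tokens / 1e6 * price_per_m
print(f"{weekly_tokens / 1e6:.0f}M tok/week -> ${weekly_cloud_cost:.0f}/week at cloud rates")
# Even at $1/M this clears the post's $250+/week figure, while local
# marginal cost is essentially electricity once the hardware is amortised.
```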
Auto-curated daily by Claude Opus 4.7 from Apple ML research, Ben’s Bites, Don’t Worry About the Vase (Zvi), GitHub: ggml-org/llama.cpp, GitHub: huggingface/transformers, GitHub: langchain-ai/langchain, GitHub: langchain-ai/langgraph, GitHub: ollama/ollama, GitHub: sgl-project/sglang, JetBrains AI blog, Last Week in AI, Latent Space, Lenny’s Newsletter, NVIDIA developer blog, OpenAI blog, Simon Willison, TLDR AI, The Pragmatic Engineer (Gergely Orosz), Vercel blog, r/ClaudeAI top, r/LocalLLaMA top, r/MachineLearning top. Source list and editorial profile maintained by Daniel.