AI Briefing — 2026-05-06
Wednesday, 6 May 2026
Covering Tue 05 May 18:53 → Wed 06 May 18:53 (24h)
A major day for Claude Code and local inference: Anthropic held their ‘Code w/ Claude 2026’ event (live-blogged by Simon Willison), removed peak-hours throttling on Claude Code Pro/Max via a SpaceX compute deal, and shipped new plugin/dispatch features. Meanwhile, MTP (Multi-Token Prediction) is delivering 2-2.5x speedups for Qwen 3.6 27B locally, and Ollama shipped MTP support for Gemma 4 on Mac.
Must read
- Live blog: Code w/ Claude 2026 — Primary source coverage of Anthropic’s developer event — expect announcements on Claude Code features, skills, and headless/dispatch modes directly relevant to your overnight-agent-factory workflow.
- Higher usage limits for Claude and a compute deal with SpaceX — Peak-hours throttling removed for Claude Code Pro/Max and API rate limits raised for Opus — directly affects your team’s throughput on headless agent runs.
- Claude Code v2.1.129 — plugin-url flag, auto-update, monitors — The new `--plugin-url` flag for session plugins and the monitors/themes manifest changes expand your dispatch and customisation options for Claude Code.
- 2.5x faster inference with Qwen 3.6 27B using MTP — 262k context on 48GB — MTP draft heads on quantised GGUFs make Qwen 3.6 27B a genuinely viable local agentic coding model — relevant to your hybrid local/cloud routing decisions via LiteLLM.
- Ollama v0.23.1 — Gemma 4 MTP speculative decoding on Mac — Over 2x speed increase for Gemma 4 31B coding tasks on Apple Silicon via the MLX runner — directly usable on your team’s Macs today.
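The MTP-style speedups above can be sanity-checked with the standard speculative-decoding expectation: with a draft length k and per-token acceptance probability p, each verification step yields (1 − p^(k+1)) / (1 − p) tokens. A minimal sketch — the draft length and acceptance rate below are illustrative assumptions, not measured values from these releases:

```python
def expected_tokens_per_step(k: int, p: float) -> float:
    """Expected tokens produced per verification step in speculative
    decoding: the draft proposes k tokens, each accepted i.i.d. with
    probability p, plus the verifier's own token on first rejection."""
    if p == 1.0:
        return k + 1.0
    return (1.0 - p ** (k + 1)) / (1.0 - p)

# Illustrative only: 4 draft tokens at an 80% acceptance rate yields
# ~3.36 tokens per step — consistent in shape with the reported 2-2.5x
# speedups, since verification steps also cost a little more than
# plain decode steps, pulling the wall-clock gain below the raw ratio.
print(round(expected_tokens_per_step(4, 0.8), 2))  # 3.36
```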
Tools & Frameworks
Cursor 3.3 Changelog (May 6, 2026)
New Cursor release published today; the snippet didn’t include full changelog details.
Why this matters: You use Cursor daily — check for agent mode, Composer, or model-selection changes.
Vibe coding and agentic engineering are getting closer than I’d like
Simon Willison reflects on the convergence of vibe coding and agentic engineering in his own practice, noting the discipline gap is narrowing uncomfortably.
Why this matters: Directly maps to your published thinking on vibe coding as a management problem and the skills-framework discipline layer.
Stop Sending IDE-Catchable AI Code Errors to Review
JetBrains argues that AI-generated PRs carry new error patterns that should be caught pre-review by IDE tooling, not by human reviewers drowning in volume.
Why this matters: Relevant to your 22,000-line PR verification problem — a concrete approach to filtering AI-generated noise before it hits review.
SGLang v0.5.11 — Spec Decoding V2 default, CUDA 13
Speculative Decoding V2 with overlap scheduling is now default, reducing per-step CPU cost. Also moves to CUDA 13 and PyTorch 2.11.
Why this matters: If you self-host any inference (even for eval), the spec-decode improvements materially reduce latency for agentic multi-turn workloads.
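The "reducing per-step CPU cost" claim is easy to picture: serial scheduling pays CPU scheduling time plus GPU compute time every step, while overlap scheduling hides the CPU work behind the previous GPU step. A toy model with invented timings (not SGLang's actual numbers):

```python
def step_time_ms(cpu_ms: float, gpu_ms: float, overlap: bool) -> float:
    """Per-step decode latency: serial scheduling pays CPU + GPU;
    overlap scheduling hides CPU-side scheduling behind GPU compute,
    so the step costs only the longer of the two."""
    return max(cpu_ms, gpu_ms) if overlap else cpu_ms + gpu_ms

# Illustrative figures only: 4 ms of CPU scheduling per step against
# a 20 ms GPU step vanishes entirely once scheduling is overlapped.
print(step_time_ms(4, 20, overlap=False))  # 24
print(step_time_ms(4, 20, overlap=True))   # 20
```

The gain matters most for agentic multi-turn workloads precisely because they issue many short decode steps, where fixed CPU overhead is a larger fraction of each step.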
Open Models & Local
DeepSeek V4 at 17x cheaper — measuring what actually needs cloud vs local
A developer logged 10 days of coding tasks and found ~72% could have been handled by local Qwen 3.6 27B with no quality loss, reserving cloud only for complex multi-file reasoning.
Why this matters: Concrete data for your local-plus-cloud routing decisions through LiteLLM — validates the hybrid approach you’re building.
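A hedged sketch of the routing heuristic that kind of log suggests — the thresholds, task fields, and model labels here are invented for illustration; in practice you would hand the chosen name to LiteLLM, which routes on model identifiers:

```python
def route_model(task: dict) -> str:
    """Toy local-vs-cloud router in the spirit of the ~72% finding:
    keep routine tasks on a local model and reserve cloud for complex
    multi-file reasoning. All thresholds are illustrative assumptions."""
    needs_cloud = (
        task.get("files_touched", 1) > 3          # multi-file change
        or task.get("context_tokens", 0) > 32_000  # long-context reasoning
        or task.get("kind") == "architecture"      # high-stakes design work
    )
    return "cloud-frontier-model" if needs_cloud else "local-qwen-27b"

print(route_model({"files_touched": 1, "context_tokens": 4_000}))  # local-qwen-27b
print(route_model({"files_touched": 8, "kind": "architecture"}))   # cloud-frontier-model
```

The design point is that the router only has to be right about the expensive 28%; misrouting a simple task to the cloud costs money, while misrouting a hard task locally costs a retry.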
Quality comparison between Qwen 3.6 27B quantizations (BF16 through IQ3_XXS)
Systematic test of quantisation levels on a structured reasoning task, showing Q5_K_XL as the sweet spot for 16GB VRAM setups with minimal degradation.
Why this matters: Actionable guidance for choosing the right quant if you’re running Qwen locally on team hardware.
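Back-of-envelope weight-size arithmetic helps interpret results like this: a quant's footprint is roughly parameters × bits-per-weight ÷ 8. The bits-per-weight figures below are approximate (actual GGUF sizes vary by tensor mix), and note that a Q5-class quant of a 27B model exceeds 16 GB on its own, so a "16GB VRAM setup" presumably offloads some layers to CPU:

```python
def quant_size_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate quantised weight footprint in GB: parameters (in
    billions) x bits per weight / 8. Ignores KV cache and runtime
    overhead, which add several GB on top at long contexts."""
    return params_b * bits_per_weight / 8

# Approximate bits-per-weight for common GGUF quant levels:
for name, bpw in [("BF16", 16.0), ("Q5_K", 5.5), ("IQ3_XXS", 3.1)]:
    print(f"{name}: ~{quant_size_gb(27, bpw):.1f} GB")
# BF16: ~54.0 GB, Q5_K: ~18.6 GB, IQ3_XXS: ~10.5 GB
```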
Bleeding Llama: Critical Unauthenticated Memory Leak in Ollama
Security researchers disclose a critical unauthenticated memory leak vulnerability in Ollama that can expose sensitive data from the host.
Why this matters: If you run Ollama on any network-accessible machine (dev servers, shared infra), patch immediately — especially relevant given your MCP server architecture.
Apple drops high-memory Mac Studio configs (256GB gone)
Apple has quietly removed the 256GB Mac Studio option, leaving 96GB as the maximum. Supply constraints cited.
Why this matters: Bad news for local large-model inference on Apple Silicon — if you were planning high-RAM Mac purchases for the team, act now on remaining stock.
Industry & Trends
GPT-5.5 Instant: smarter, clearer, and more personalized
OpenAI’s new default ChatGPT model claims reduced hallucinations and improved accuracy. System card published alongside.
Why this matters: Worth evaluating via LiteLLM as a potential routing target for lower-stakes tasks — watch for API availability and pricing details.
Anthropic launches 10 finance/insurance AI agents (KYC, pitchbooks, month-end)
Anthropic ships production-ready agents for financial services via Claude Cowork, Claude Code, and Managed Agents — including KYC screening.
Why this matters: Directly in your identity/fraud/RegTech space — worth examining whether these agents or their architecture patterns are applicable to your own KYC workflows.
Hugging Face Transformers v5.8.0 — DeepSeek-V4 support
Adds DeepSeek-V4 model support with its new hybrid attention architecture, plus other model additions.
Why this matters: If you evaluate or fine-tune open models via HF, DeepSeek-V4 is now first-class — relevant to your frontier-vs-local cost analysis.
Auto-curated daily by Claude Opus 4.7 from Apple ML research, Ben’s Bites, Cursor changelog, Don’t Worry About the Vase (Zvi), GitHub: BerriAI/litellm, GitHub: anthropics/claude-code, GitHub: ggml-org/llama.cpp, GitHub: huggingface/transformers, GitHub: langchain-ai/langchain, GitHub: langchain-ai/langgraph, GitHub: ollama/ollama, GitHub: sgl-project/sglang, Hugging Face blog, JetBrains AI blog, LangChain blog, Last Week in AI, Latent Space, Lenny’s Newsletter, NVIDIA developer blog, OpenAI blog, Simon Willison, TLDR AI, The Pragmatic Engineer (Gergely Orosz), Vercel blog, r/ClaudeAI top, r/LocalLLaMA top, r/MachineLearning top. Source list and editorial profile maintained by Daniel.