DeepSeek V4 Full Paper, Qwen 3.6 MTP Breakthroughs, Sonnet 4.5 Retiring
Sunday, 10 May 2026 - AI News · (last 24h)
DeepSeek V4’s full paper drops with FP4 QAT details showing a 2× QK speedup at 99.7% recall, while the local LLM community hits 80–135 tok/sec on consumer GPUs with Qwen 3.6.
Must read
- DeepSeek V4 full paper: FP4 QAT details and stability tricks — FP4 quantization-aware training yields 2× QK speedup at 99.7% recall — sets the bar for what you’ll route to via LiteLLM.
- 80 tok/sec and 128K context on 12GB VRAM with Qwen 3.6 35B via llama.cpp MTP — MTP speculative decoding on consumer hardware closes the gap with cloud for your local-plus-cloud hybrid routing decisions.
- Sonnet 4.5 is being retired — If your LiteLLM gateway or Claude Code config still references Sonnet 4.5, you need to migrate model selections now.
- The unreasonable effectiveness of HTML when using Claude Code — Simon Willison’s pattern for Claude Code output — directly relevant to your skills/spec framework thinking.
- Autoharness: Claude improved agent harness by 40.7% overnight — Meta-agent that optimises your agent harness via evals — maps directly to your overnight-agent-factory pattern.
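For the Sonnet 4.5 retirement item above, a minimal sketch of how a gateway config could be audited and migrated. It assumes the config has already been parsed into a dict shaped like a LiteLLM proxy config (`model_list` entries with `model_name` and `litellm_params.model`). The aliases, the `anthropic/claude-sonnet-4-5` string, and the replacement ID are illustrative placeholders, not values taken from the announcement.

```python
from copy import deepcopy


def find_model_references(config, retired_substring):
    """Return the aliases whose underlying model string mentions the retired ID."""
    hits = []
    for entry in config.get("model_list", []):
        target = entry.get("litellm_params", {}).get("model", "")
        if retired_substring in target:
            hits.append(entry.get("model_name", "<unnamed>"))
    return hits


def migrate_models(config, retired_substring, replacement_model):
    """Return a copy of the config with retired targets pointed at the replacement."""
    updated = deepcopy(config)  # leave the caller's config untouched
    for entry in updated.get("model_list", []):
        params = entry.get("litellm_params", {})
        if retired_substring in params.get("model", ""):
            params["model"] = replacement_model
    return updated


# Hypothetical config: aliases and model strings are placeholders.
example_config = {
    "model_list": [
        {"model_name": "coding-default",
         "litellm_params": {"model": "anthropic/claude-sonnet-4-5"}},
        {"model_name": "local-qwen",
         "litellm_params": {"model": "openai/qwen-local"}},
    ]
}

stale_aliases = find_model_references(example_config, "sonnet-4-5")
migrated = migrate_models(
    example_config, "sonnet-4-5", "anthropic/REPLACEMENT-MODEL-ID"  # placeholder ID
)
print(stale_aliases)
print(migrated["model_list"][0]["litellm_params"]["model"])
```

Keeping the alias (`model_name`) stable and changing only the underlying target means callers of the gateway do not need to change anything.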
Tools & Frameworks
Claude Code v2.1.138
Minor internal-fixes release; no user-facing features announced.
Why this matters: Track for changelog completeness; nothing to act on.
LangChain: The Agent Development Lifecycle
LangChain publishes a framework for building, evaluating, and iterating on agents through structured lifecycle stages.
Why this matters: Compare against your skills-framework approach to disciplined agent engineering.
Community thread: best CLAUDE.md files for Claude Code
Crowdsourced collection of effective CLAUDE.md configurations across languages and project types.
Why this matters: Directly feeds your skills/spec framework for Claude Code projects.
Claude Desktop now shows context usage (macOS)
New UI indicator displays remaining context window in the Claude desktop app on macOS.
Why this matters: Useful for gauging when to split conversations; minor UX win.
Open Models & Local
BeeLlama.cpp: Qwen 3.6 27B Q5 at 200K context on 3090, peak 135 tps
Fork adds DFlash and TurboQuant, enabling Qwen 3.6 27B Q5 at 200K context with a 2–3× speedup over baseline on a single 3090.
Why this matters: If you’re evaluating local coding models on consumer GPUs, this changes the viability calculus.
HF co-founder: Qwen 3.6 27B local approaches Opus in Claude Code
Hugging Face co-founder claims Qwen 3.6 27B running offline comes close to the latest Opus in quality for coding tasks.
Why this matters: Validates your local-plus-cloud hybrid thesis — worth benchmarking against your own evals.
NVIDIA Star Elastic: one checkpoint containing 30B, 23B, and 12B reasoning models
Single checkpoint supports zero-shot slicing into multiple model sizes (30B/23B/12B) — nested architecture, no retraining.
Why this matters: Elastic inference could simplify your model gateway routing between quality tiers.
MiniMax 2.7 at 100K context on Strix Halo
Detailed llama-server config for running MiniMax 2.7 IQ3_XXS at 100K context on AMD Strix Halo unified memory.
Why this matters: Watch-but-don’t-act unless you’re evaluating AMD unified-memory hardware for local inference.
llama.cpp b9085: flash attention MMA for MiMo-V2.5
Adds flash attention MMA/Tiles support for MiMo-V2.5’s d_kq=192, d_v=128 architecture.
Why this matters: Enables efficient local MiMo-V2.5 inference if you’re testing that model.
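To unpack the d_kq=192, d_v=128 notation: queries and keys share one head dimension while values use another, so attention scores are computed over 192-wide vectors and each output row is 128 wide. Below is a rough pure-Python sketch of single-head scaled dot-product attention with those sizes. It is illustrative only and says nothing about how the llama.cpp kernels are implemented; the sequence length and random inputs are arbitrary.

```python
import math
import random

D_KQ = 192  # query/key head dimension, as quoted in the release note
D_V = 128   # value head dimension, as quoted in the release note


def attention(queries, keys, values):
    """Single-head scaled dot-product attention (no masking) on plain lists."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    d_v = len(values[0])
    outputs = []
    for q in queries:
        # Scores depend only on the shared query/key dimension.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # The output row is a weighted average of value rows, so it has d_v entries.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
        )
    return outputs


rng = random.Random(0)
seq_len = 4  # arbitrary toy sequence length
Q = [[rng.gauss(0, 1) for _ in range(D_KQ)] for _ in range(seq_len)]
K = [[rng.gauss(0, 1) for _ in range(D_KQ)] for _ in range(seq_len)]
V = [[rng.gauss(0, 1) for _ in range(D_V)] for _ in range(seq_len)]

out = attention(Q, K, V)
print(len(out), len(out[0]))  # 4 128
```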
Industry & Trends
DeepSeek rejects Alibaba investment, prioritises independence
DeepSeek turned down Alibaba’s funding offer to maintain corporate independence from big-tech ecosystems.
Why this matters: Signals DeepSeek will remain an independent model provider — relevant for your LiteLLM routing options.
Apple removes 256GB M3 Ultra Mac Studio from store
Apple pulled the 256GB M3 Ultra Mac Studio; the community is worried that the M5 Ultra’s maximum RAM will shrink further.
Why this matters: Directly threatens the local-LLM-on-Apple-Silicon strategy if high-RAM options disappear.
Lenny’s Newsletter: when non-PMs ship directly to production via Claude Code
Discussion of Claude Code’s pricing A/B test and the organisational challenges that arise when non-engineers ship code directly.
Why this matters: Directly relevant to your ‘vibe coding as a management problem’ framing.
Auto-curated daily by Claude Opus 4.7 from Ben’s Bites, Exponential View (Azeem Azhar), GitHub: anthropics/claude-code, GitHub: ggml-org/llama.cpp, Hugging Face blog, LangChain blog, Lenny’s Newsletter, r/ClaudeAI top, r/LocalLLaMA top, r/MachineLearning top. Source list and editorial profile maintained by Daniel.