DeepSeek V4 Full Paper, Qwen 3.6 MTP Breakthroughs, Sonnet 4.5 Retiring

DeepSeek V4’s full paper drops with FP4 QAT details showing a 2× QK speedup at 99.7% recall, while the local LLM community hits 80–135 tok/sec on consumer GPUs with Qwen 3.6.

Must read

DeepSeek V4 full paper: FP4 QAT details and stability tricks — FP4 quantization-aware training yields a 2× QK speedup at 99.7% recall, which sets the bar for what you’ll route to via LiteLLM.
80 tok/sec and 128K context on 12GB VRAM with Qwen 3.6 35B via llama.cpp MTP — MTP speculative decoding on consumer hardware closes the gap with cloud for your local-plus-cloud hybrid routing decisions.
Sonnet 4.5 is being retired — If your LiteLLM gateway or Claude Code config still references Sonnet 4.5, you need to migrate model selections now.
The unreasonable effectiveness of HTML when using Claude Code — Simon Willison’s pattern for Claude Code output, directly relevant to your skills/spec framework thinking.
Autoharness: Claude improved agent harness by 40.7% overnight — Meta-agent that optimises your agent harness via evals — maps directly to your overnight-agent-factory pattern.
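
For the Sonnet 4.5 item above, a minimal sketch of how you might scan a gateway config for stale references before migrating. Everything here is an assumption for illustration: the regex, the helper name find_stale_references, and the sample config with its made-up model strings. Check the exact model IDs your own configs use.

```python
import re

# Hypothetical pattern: matches spellings such as "sonnet-4.5",
# "sonnet-4-5", or "Sonnet 4.5". Adjust it to the IDs in your configs.
STALE_MODEL = re.compile(r"sonnet[-_ ]?4[-._]5", re.IGNORECASE)


def find_stale_references(config_text: str) -> list[tuple[int, str]]:
    """Return (line_number, stripped_line) pairs that mention the retiring model."""
    hits = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        if STALE_MODEL.search(line):
            hits.append((number, line.strip()))
    return hits


# Illustrative config text only; the model strings are made up.
SAMPLE_CONFIG = """\
model_list:
  - model_name: fast
    litellm_params:
      model: example/sonnet-4-5
  - model_name: deep
    litellm_params:
      model: example/opus-latest
"""

for number, line in find_stale_references(SAMPLE_CONFIG):
    print(f"line {number}: {line}")
```

Running the same function over each config file in a repo gives a quick checklist of entries to update; it only flags text matches and does not know which replacement model you should pick.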

Tools & Frameworks

Claude Code v2.1.138

Minor internal-fixes release; no user-facing features announced.

Why this matters: Track for changelog completeness; nothing to act on.

LangChain: The Agent Development Lifecycle

LangChain publishes a framework for building, evaluating, and iterating on agents through structured lifecycle stages.

Why this matters: Compare against your skills-framework approach to disciplined agent engineering.

Community thread: best CLAUDE.md files for Claude Code

Crowdsourced collection of effective CLAUDE.md configurations across languages and project types.

Why this matters: Directly feeds your skills/spec framework for Claude Code projects.

Claude Desktop now shows context usage (macOS)

New UI indicator displays remaining context window in the Claude desktop app on macOS.

Why this matters: Useful for gauging when to split conversations; minor UX win.

Open Models & Local

BeeLlama.cpp: Qwen 3.6 27B Q5 at 200K context on 3090, peak 135 tps

Fork adds DFlash and TurboQuant, enabling Qwen 3.6 27B Q5 with 200K context and 2–3× the baseline speed on a single 3090.

Why this matters: If you’re evaluating local coding models on consumer GPUs, this changes the viability calculus.

HF co-founder: Qwen 3.6 27B local approaches Opus in Claude Code

Hugging Face co-founder claims Qwen 3.6 27B running offline is close to the latest Opus in quality for coding tasks.

Why this matters: Validates your local-plus-cloud hybrid thesis — worth benchmarking against your own evals.

NVIDIA Star Elastic: one checkpoint containing 30B, 23B, and 12B reasoning models

Single checkpoint supports zero-shot slicing into multiple model sizes (30B/23B/12B) — nested architecture, no retraining.

Why this matters: Elastic inference could simplify your model gateway routing between quality tiers.
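
A toy sketch of what "nested" could mean for routing between tiers: each smaller model is read out as a leading slice of the largest weights, with no retraining step. The prefix-slicing rule, the 6×6 matrix, and the tier names are all assumptions for illustration; the item above does not describe the checkpoint's real layout.

```python
# Toy "nested" weights: each smaller model is a leading slice of the
# largest one. Sizes and the slicing rule are illustrative assumptions.
FULL_WIDTH = 6
full_weights = [
    [row * FULL_WIDTH + col for col in range(FULL_WIDTH)]
    for row in range(FULL_WIDTH)
]

# Hypothetical quality tiers mapped to slice widths.
TIERS = {"large": 6, "medium": 4, "small": 2}


def slice_model(weights, width):
    """Keep the first `width` rows and columns of a square weight matrix."""
    if not 0 < width <= len(weights):
        raise ValueError("width must be between 1 and the full width")
    return [row[:width] for row in weights[:width]]


def model_for_tier(tier):
    """Pick a sub-model for a routing tier by slicing the shared weights."""
    return slice_model(full_weights, TIERS[tier])


for tier in TIERS:
    sub_model = model_for_tier(tier)
    print(tier, len(sub_model), "x", len(sub_model[0]))
```

The property worth noticing is that the small slice is also the top-left corner of the medium slice, so one set of stored weights serves every tier.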

MiniMax 2.7 at 100K context on Strix Halo

Detailed llama-server config for running MiniMax 2.7 IQ3_XXS at 100K context on AMD Strix Halo unified memory.

Why this matters: Watch-but-don’t-act unless you’re evaluating AMD unified-memory hardware for local inference.

llama.cpp b9085: flash attention MMA for MiMo-V2.5

Adds flash attention MMA/Tiles support for MiMo-V2.5’s d_kq=192 d_v=128 architecture.

Why this matters: Enables efficient local MiMo-V2.5 inference if you’re testing that model.
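
To make the d_kq=192 d_v=128 detail concrete, here is a plain single-head attention sketch in which queries and keys share one width while values use another. This is ordinary scaled dot-product attention on Python lists, not the flash attention MMA/Tiles kernel from the release, and the input values are made up.

```python
import math


def attention(queries, keys, values):
    """Single-head scaled dot-product attention on plain lists."""
    d_kq = len(keys[0])   # width shared by queries and keys
    d_v = len(values[0])  # width of values, and therefore of the output
    scale = 1.0 / math.sqrt(d_kq)
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d_v)]
        )
    return outputs


D_KQ, D_V, SEQ = 192, 128, 4  # head dims from the item above; SEQ is arbitrary
queries = [[0.01 * (t + 1)] * D_KQ for t in range(SEQ)]
keys = [[0.01 * (t + 1)] * D_KQ for t in range(SEQ)]
values = [[float(t)] * D_V for t in range(SEQ)]

out = attention(queries, keys, values)
print(len(out), len(out[0]))  # 4 128
```

The output width follows d_v rather than d_kq, which is why a kernel tuned for equal head dimensions needs separate support for this shape.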

Industry & Trends

DeepSeek rejects Alibaba investment, prioritises independence

DeepSeek turned down Alibaba’s funding offer to maintain corporate independence from big-tech ecosystems.

Why this matters: Signals DeepSeek will remain an independent model provider — relevant for your LiteLLM routing options.

Apple removes 256GB M3 Ultra Mac Studio from store

Apple pulled the 256GB M3 Ultra Mac Studio; the community worries that the M5 Ultra’s maximum RAM will shrink further.

Why this matters: Directly threatens the local-LLM-on-Apple-Silicon strategy if high-RAM options disappear.

Discussion on Claude Code’s pricing A/B test and organisational challenges when non-engineers ship code directly.

Why this matters: Directly relevant to your ‘vibe coding as a management problem’ framing.

Auto-curated daily by Claude Opus 4.7 from Ben’s Bites, Exponential View (Azeem Azhar), GitHub: anthropics/claude-code, GitHub: ggml-org/llama.cpp, Hugging Face blog, LangChain blog, Lenny’s Newsletter, r/ClaudeAI top, r/LocalLLaMA top, r/MachineLearning top. Source list and editorial profile maintained by Daniel.