SGLang DeepSeek V4, llama.cpp MTP Support, LiteLLM v1.85

SGLang v0.5.12 ships full DeepSeek V4 inference with expert parallelism and disaggregated prefill-decode, while llama.cpp lands native MTP speculative decoding.

Must read

SGLang v0.5.12: DeepSeek V4 day-0 support — Full DeepSeek V4 inference path with TP/EP/CP, prefill-decode disaggregation, and HiSparse KV offloading — relevant if you route via LiteLLM to self-hosted endpoints.
llama.cpp b9180: Native MTP speculative decoding — Gemma 4’s multi-token prediction drafting now works in llama.cpp — directly speeds up local Apple Silicon inference for your coding workflows.
Raschka: KV Sharing, mHC, and Compressed Attention in recent LLMs — Technical deep-dive on how Gemma 4 and DeepSeek V4 cut long-context costs — useful context for routing decisions in your model gateway.
LiteLLM v1.85.0 — You run LiteLLM as your model gateway; this release adds cosign-verified Docker images — check changelog for new model support.
Claude Code persistent memory across 200 sessions — community experiment — Directly relevant to your overnight-agent-factory pattern: extracted signals, cross-session reflection, emergent frameworks.

Tools & Frameworks

SGLang v0.5.12

Day-0 DeepSeek V4 support with TP/EP/CP parallelism, prefill-decode disaggregation, HiSparse CPU offload, and FlashMLA kernels across Nvidia B300/H200/H100 and AMD MI35X.

Why this matters: Self-hosted DeepSeek V4 behind your LiteLLM gateway is now viable at scale.

LiteLLM v1.85.0

New release with cosign-verified Docker images for supply-chain security; check for DeepSeek V4 model routing additions.

Why this matters: Direct dependency in your stack — verify and upgrade.

Cline CLI v3.0.5

Plugin-provided tools and slash commands now hydrate properly in CLI settings; fixes disappearing tools on toggle.

Why this matters: Competitor signal for Claude Code CLI patterns.

Generator-Evaluator harness replicated with Kiro CLI

Community implementation of Anthropic’s multi-agent GAN-style harness for long-running code generation — 12 adversarial iterations to ship a site.

Why this matters: Pattern maps to your headless overnight agents with eval loops.

Open Models & Local

llama.cpp b9180: MTP speculative decoding

Native multi-token prediction support lands — enables Gemma 4 MTP draft models for faster speculative decoding on Apple Silicon.

Why this matters: Directly accelerates local coding inference on your Mac.

Open Artifacts #21: Gemma 4, DeepSeek V4, Kimi K2.6, MiMo 2.5, GLM-5.1

Nathan Lambert’s assessment of the open-model wave — covers architecture choices, benchmark positioning, and CAISI’s V4 evaluation methodology.

Why this matters: Useful landscape view for choosing which models to route locally vs cloud.

Technical breakdown of how new architectures (mHC attention, KV sharing) reduce long-context memory costs by 2-4× vs standard GQA.

Why this matters: Informs quantisation and context-window trade-offs for local runs.

Industry & Trends

Claude Code persistent memory — 200-session experiment

Developer built cross-session signal extraction and periodic self-reflection for Claude Code; after 200 sessions it developed autonomous correction frameworks.

Why this matters: Validates your skills-framework approach to persistent agent memory.

Claude API elevated error rates (2026-05-16)

Multi-model elevated error rates reported ~18:08 UTC on 16 May; check status.claude.com for resolution.

Why this matters: If your overnight agents hit this window, check for failed runs.

Sources unavailable today: r/LocalLLaMA top

Auto-curated daily by Claude Opus 4.7 from Exponential View (Azeem Azhar), GitHub: BerriAI/litellm, GitHub: cline/cline, GitHub: ggml-org/llama.cpp, GitHub: sgl-project/sglang, Interconnects (Nathan Lambert), Lenny’s Newsletter, OpenAI blog, Sebastian Raschka, Simon Willison, r/ClaudeAI top, r/MachineLearning top. Source list and editorial profile maintained by Daniel.