SGLang DeepSeek V4, llama.cpp MTP Support, LiteLLM v1.85
Sonntag, 17. Mai 2026 - AI News · (letzte 24h)
SGLang v0.5.12 ships full DeepSeek V4 inference with expert parallelism and disaggregated prefill-decode, while llama.cpp lands native MTP speculative decoding.
Must read
- SGLang v0.5.12: DeepSeek V4 day-0 support — Full DeepSeek V4 inference path with TP/EP/CP, prefill-decode disaggregation, and HiSparse KV offloading — relevant if you route via LiteLLM to self-hosted endpoints.
- llama.cpp b9180: Native MTP speculative decoding — Gemma 4’s multi-token prediction drafting now works in llama.cpp — directly speeds up local Apple Silicon inference for your coding workflows.
- Raschka: KV Sharing, mHC, and Compressed Attention in recent LLMs — Technical deep-dive on how Gemma 4 and DeepSeek V4 cut long-context costs — useful context for routing decisions in your model gateway.
- LiteLLM v1.85.0 — You run LiteLLM as your model gateway; this release adds cosign-verified Docker images — check changelog for new model support.
- Claude Code persistent memory across 200 sessions — community experiment — Directly relevant to your overnight-agent-factory pattern: extracted signals, cross-session reflection, emergent frameworks.
Tools & Frameworks
SGLang v0.5.12
Day-0 DeepSeek V4 support with TP/EP/CP parallelism, prefill-decode disaggregation, HiSparse CPU offload, and FlashMLA kernels across Nvidia B300/H200/H100 and AMD MI35X.
Why this matters: Self-hosted DeepSeek V4 behind your LiteLLM gateway is now viable at scale.
LiteLLM v1.85.0
New release with cosign-verified Docker images for supply-chain security; check for DeepSeek V4 model routing additions.
Why this matters: Direct dependency in your stack — verify and upgrade.
Cline CLI v3.0.5
Plugin-provided tools and slash commands now hydrate properly in CLI settings; fixes disappearing tools on toggle.
Why this matters: Competitor signal for Claude Code CLI patterns.
Generator-Evaluator harness replicated with Kiro CLI
Community implementation of Anthropic’s multi-agent GAN-style harness for long-running code generation — 12 adversarial iterations to ship a site.
Why this matters: Pattern maps to your headless overnight agents with eval loops.
Open Models & Local
llama.cpp b9180: MTP speculative decoding
Native multi-token prediction support lands — enables Gemma 4 MTP draft models for faster speculative decoding on Apple Silicon.
Why this matters: Directly accelerates local coding inference on your Mac.
Open Artifacts #21: Gemma 4, DeepSeek V4, Kimi K2.6, MiMo 2.5, GLM-5.1
Nathan Lambert’s assessment of the open-model wave — covers architecture choices, benchmark positioning, and CAISI’s V4 evaluation methodology.
Why this matters: Useful landscape view for choosing which models to route locally vs cloud.
KV Sharing and Compressed Attention in Gemma 4 / DeepSeek V4
Technical breakdown of how new architectures (mHC attention, KV sharing) reduce long-context memory costs by 2-4× vs standard GQA.
Why this matters: Informs quantisation and context-window trade-offs for local runs.
Industry & Trends
Claude Code persistent memory — 200-session experiment
Developer built cross-session signal extraction and periodic self-reflection for Claude Code; after 200 sessions it developed autonomous correction frameworks.
Why this matters: Validates your skills-framework approach to persistent agent memory.
Claude API elevated error rates (2026-05-16)
Multi-model elevated error rates reported ~18:08 UTC on 16 May; check status.claude.com for resolution.
Why this matters: If your overnight agents hit this window, check for failed runs.
Sources unavailable today: r/LocalLLaMA top
Auto-curated daily by Claude Opus 4.7 from Exponential View (Azeem Azhar), GitHub: BerriAI/litellm, GitHub: cline/cline, GitHub: ggml-org/llama.cpp, GitHub: sgl-project/sglang, Interconnects (Nathan Lambert), Lenny’s Newsletter, OpenAI blog, Sebastian Raschka, Simon Willison, r/ClaudeAI top, r/MachineLearning top. Source list and editorial profile maintained by Daniel.