AI Briefing

SGLang DeepSeek V4, llama.cpp MTP Support, LiteLLM v1.85

Sunday 17 May 2026 - AI News · (last 24 hours)

SGLang v0.5.12 ships full DeepSeek V4 inference with expert parallelism and disaggregated prefill-decode, while llama.cpp lands native MTP speculative decoding.

Must read

Tools & Frameworks

SGLang v0.5.12

Day-0 DeepSeek V4 support with TP/EP/CP parallelism, prefill-decode disaggregation, HiSparse CPU offload, and FlashMLA kernels across Nvidia B300/H200/H100 and AMD MI35X.

Why this matters: Self-hosted DeepSeek V4 behind your LiteLLM gateway is now viable at scale.

LiteLLM v1.85.0

New release with cosign-verified Docker images for supply-chain security; check for DeepSeek V4 model routing additions.

Why this matters: Direct dependency in your stack — verify and upgrade.

Cline CLI v3.0.5

Plugin-provided tools and slash commands now hydrate properly in CLI settings; fixes disappearing tools on toggle.

Why this matters: Competitor signal for Claude Code CLI patterns.

Generator-Evaluator harness replicated with Kiro CLI

Community implementation of Anthropic’s multi-agent GAN-style harness for long-running code generation — 12 adversarial iterations to ship a site.

Why this matters: Pattern maps to your headless overnight agents with eval loops.
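
For reference, the generate-then-critique loop behind this kind of harness can be sketched in a few lines. This is a minimal illustration, not the community implementation or Anthropic's harness: `run_harness`, `toy_generate`, and `toy_evaluate` are hypothetical names, and the default budget of 12 simply mirrors the iteration count reported above.

```python
from typing import Callable, Tuple


def run_harness(
    generate: Callable[[str, str], str],
    evaluate: Callable[[str], Tuple[bool, str]],
    task: str,
    max_iters: int = 12,  # assumption: budget mirrors the 12 iterations in the write-up
) -> Tuple[str, int]:
    """Alternate generation and critique until the evaluator passes or the budget runs out."""
    feedback = ""
    artifact = ""
    for i in range(1, max_iters + 1):
        artifact = generate(task, feedback)      # generator sees the last critique
        passed, feedback = evaluate(artifact)    # evaluator returns a verdict plus notes
        if passed:
            return artifact, i
    return artifact, max_iters


# Toy stand-ins for the two agents (hypothetical, for illustration only).
def toy_generate(task: str, feedback: str) -> str:
    return "header" + (" footer" if "footer" in feedback else "")


def toy_evaluate(artifact: str) -> Tuple[bool, str]:
    ok = "footer" in artifact
    return ok, ("" if ok else "missing footer")


result = run_harness(toy_generate, toy_evaluate, "site")
print(result)
```

In a real overnight run the two callables would wrap agent invocations, and the evaluator's notes would be the only channel between iterations.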

Open Models & Local

llama.cpp b9180: MTP speculative decoding

Native multi-token prediction support lands — enables Gemma 4 MTP draft models for faster speculative decoding on Apple Silicon.

Why this matters: Directly accelerates local coding inference on your Mac.
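
As a refresher on why a draft model speeds things up, here is a toy sketch of one greedy draft-then-verify round. It is not llama.cpp's implementation and does not model MTP heads specifically: `speculative_step`, `toy_target`, and `toy_draft` are hypothetical, and a real engine verifies all drafted tokens in a single batched forward pass of the target model instead of calling it token by token.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int = 4) -> List[int]:
    """Return the tokens accepted in one draft-then-verify round (greedy variant)."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The target model checks each proposal; keep the longest agreeing prefix.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok != expected:
            accepted.append(expected)  # target's own token replaces the first mismatch
            return accepted
        accepted.append(tok)

    # 3. Every draft token matched, so the target adds one extra token.
    accepted.append(target(context + accepted))
    return accepted


# Toy models over digits: the target counts upward, the draft errs after a 3.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


print(speculative_step(toy_target, toy_draft, [0]))  # [1, 2, 3, 4]
```

The output always matches what the target alone would have produced; the gain comes from accepting several tokens per target pass when the draft agrees.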

Open Artifacts #21: Gemma 4, DeepSeek V4, Kimi K2.6, MiMo 2.5, GLM-5.1

Nathan Lambert’s assessment of the open-model wave — covers architecture choices, benchmark positioning, and CAISI’s V4 evaluation methodology.

Why this matters: Useful landscape view for choosing which models to route locally vs cloud.

KV Sharing and Compressed Attention in Gemma 4 / DeepSeek V4

Technical breakdown of how new architectures (mHC attention, KV sharing) reduce long-context memory costs by 2-4× vs standard GQA.

Why this matters: Informs quantisation and context-window trade-offs for local runs.
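
To put the 2-4× figure in context, a back-of-envelope estimate of a standard GQA cache is easy to compute. The model shape below is hypothetical and chosen only to make the arithmetic clean; it is not the configuration of Gemma 4 or DeepSeek V4, and the reduction factors are applied as plain divisors.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence under standard GQA."""
    # The leading 2 covers the separate key and value tensors per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


GIB = 1024 ** 3

# Hypothetical shape: 48 layers, 8 KV heads of dim 128, 128K context, fp16 cache.
baseline = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128, seq_len=131072)
print(f"GQA baseline: {baseline / GIB:.1f} GiB")  # 24.0 GiB

for factor in (2, 4):
    print(f"{factor}x reduction: {baseline / factor / GIB:.1f} GiB")  # 12.0, then 6.0
```

On a memory-constrained machine the cache competes with the weights for the same budget, so a 2-4× smaller cache can be spent on a longer context or a less aggressive quantisation.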

Claude Code persistent memory — 200-session experiment

Developer built cross-session signal extraction and periodic self-reflection for Claude Code; after 200 sessions it developed autonomous correction frameworks.

Why this matters: Validates your skills-framework approach to persistent agent memory.

Claude API elevated error rates (2026-05-16)

Multi-model elevated error rates reported ~18:08 UTC on 16 May; check status.claude.com for resolution.

Why this matters: If your overnight agents hit this window, check for failed runs.


Sources unavailable today: r/LocalLLaMA top

Auto-curated daily by Claude Opus 4.7 from Exponential View (Azeem Azhar), GitHub: BerriAI/litellm, GitHub: cline/cline, GitHub: ggml-org/llama.cpp, GitHub: sgl-project/sglang, Interconnects (Nathan Lambert), Lenny’s Newsletter, OpenAI blog, Sebastian Raschka, Simon Willison, r/ClaudeAI top, r/MachineLearning top. Source list and editorial profile maintained by Daniel.