llama.cpp MTP Speedup, Claude Context Tools, DeepSeek V4 1M Context
Monday, May 18, 2026 - AI News · (last 24h)
llama.cpp ships an MTP prompt-decode optimisation while the community benchmarks DeepSeek V4’s 1M context window across real codebases.
Must read
- llama.cpp: avoid copying logits during prompt decode in MTP — Direct speed improvement for MTP inference — relevant if you’re running Qwen3.6-MTP locally via llama.cpp on Apple Silicon.
- Anthropic’s 4 context tools: when each one wins — Practical guide to /clear, /compact, and two newer context tools — directly applicable to your long Claude Code sessions.
- DeepSeek V4’s 1M context: real-world codebase stress test — Tested on 45k–520k token codebases; recall degrades past 150k. Useful routing data for your LiteLLM gateway decisions.
- Anthropic Mythos Preview: first public macOS M5 kernel exploit in 5 days — Demonstrates frontier-model capability for security research — and a signal on what agentic coding can achieve in constrained timeframes.
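The 150k-token recall threshold from the DeepSeek V4 item above could feed a simple routing rule. The following is a minimal sketch, not a LiteLLM configuration: the function name `pick_model` and both model aliases are hypothetical placeholders, and the only figure taken from the digest is the 150k cutoff.

```python
# Threshold taken from the digest item above (recall reportedly degrades past 150k tokens).
RECALL_SAFE_TOKENS = 150_000


def pick_model(prompt_tokens: int,
               long_context_model: str = "deepseek-v4",     # hypothetical alias
               fallback_model: str = "fallback-model") -> str:  # hypothetical alias
    """Return the model alias to use for a prompt of the given token count."""
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    # Degradation was reported only past the threshold, so stay on the long-context model.
    if prompt_tokens <= RECALL_SAFE_TOKENS:
        return long_context_model
    # Past the threshold, route elsewhere (or chunk the input before sending it).
    return fallback_model


for tokens in (45_000, 150_000, 520_000):
    print(tokens, "->", pick_model(tokens))
```

A hard cutoff keeps the rule easy to audit. If your own tests show a gradual decline rather than a cliff, a lower and more conservative threshold may be the safer choice.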
Tools & Frameworks
4-month Claude Pro vs ChatGPT Plus comparison by task type
Opus 4.7 wins on writing and code reasoning; GPT-5 wins on search and data analysis; both roughly equal on generation.
Why this matters: Useful for model-routing decisions in your gateway.
Opus spawns 3 subagents for frontend PageSpeed fixes across 41 files
User wrote a playbook ADR, pointed Opus at 9 pages; it self-organised subagents and hit near-perfect Lighthouse scores in 15 minutes.
Why this matters: Pattern for your overnight-agent-factory: ADR + subagent delegation.
Structured workflows with small local models as agent loops
Home-rolled agent loop with local models reaches self-editing capability; author documents progressive-disclosure patterns.
Why this matters: Aligns with your skills-framework thinking for local models.
Cline CLI v3.0.6–3.0.7: ChatGPT OAuth fix, GPT-5.x model support
Adds GPT-5.2, 5.4, 5.4-mini to Cline CLI model list; removes startup OAuth round-trip.
Why this matters: Minor but shows Cline keeping pace with new OpenAI models.
Open Models & Local
llama.cpp MTP benchmarks on Qwen3.6 with RTX 5090
Qwen3.6-27B Q5_K_M with MTP on: ~1.8× generation speedup over MTP off at 128k context on RTX 5090.
Why this matters: MTP is the key local-inference acceleration; validates Qwen3.6-MTP GGUFs.
Abliterlitics: 85 GPU-hours comparing 5 abliteration methods on Qwen3.6-27B
Open-source forensics toolkit benchmarks safety removal techniques with weight-level analysis on Qwen3.6-27B variants.
Why this matters: If you run uncensored local models, this quantifies the quality trade-offs.
Qwopus3.5-9B-Coder-GGUF: agentic coding fine-tune
9B dense model optimised for tool calling and agentic coding; runs at Q8 on 16 GB RAM devices including Mac mini.
Why this matters: Potential local coding model for your Apple Silicon fleet at minimal VRAM cost.
Dual GPU llama.cpp: quantized KV cache with tensor parallelism
Fork fixes llama.cpp’s tensor-split limitation — enables quantized KV caches across multiple GPUs for larger contexts.
Why this matters: Watch-but-don’t-act unless you run multi-GPU Linux boxes.
M5 vs DGX Spark vs Strix Halo vs RTX 6000: standardised local LLM benchmarks
RTX 6000 dominates at ~1,800 GB/s memory bandwidth, versus ~600 GB/s for M5 and ~256 GB/s for DGX Spark. Full repo published.
Why this matters: Hard numbers for your next hardware purchase decision for local inference.
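To put the bandwidth figures in perspective, here is a rough back-of-envelope sketch. It relies on the common rule of thumb that single-stream decoding of a dense model is memory-bandwidth bound, so tokens per second cannot exceed bandwidth divided by the bytes of weights read per token. The 20 GB model footprint is an assumption chosen for illustration, not a number from the benchmark repo. The estimate ignores KV-cache reads, batching, and speculative techniques such as MTP.

```python
# Bandwidth figures (GB/s) as quoted in the digest item; treat them as approximate.
BANDWIDTH_GBPS = {"RTX 6000": 1800.0, "M5": 600.0, "DGX Spark": 256.0}

# Assumed weight footprint in GB; hypothetical, not from the source.
ASSUMED_MODEL_GB = 20.0


def decode_tps_ceiling(bandwidth_gbps: float, model_size_gb: float) -> float:
    """Rough upper bound on decode tokens/s for a dense model.

    Each generated token reads all weights once, so tps <= bandwidth / size.
    """
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gbps / model_size_gb


for name, bandwidth in BANDWIDTH_GBPS.items():
    ceiling = decode_tps_ceiling(bandwidth, ASSUMED_MODEL_GB)
    print(f"{name}: <= {ceiling:.1f} tok/s (theoretical ceiling)")
```

Measured throughput will land below these ceilings. The useful part is the ratio between machines, which tracks the bandwidth ratio regardless of the assumed model size.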
Industry & Trends
Sonnet 4.5 discontinuation pushed to May 18
Anthropic extended the Sonnet 4.5 EOL by 3 days (was May 15, now May 18), so removal is likely imminent.
Why this matters: Check any pipelines still pinned to Sonnet 4.5 via your LiteLLM gateway.
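One way to run that check is a quick text scan over your config files. The sketch below is illustrative only: the regex is a guess at how the model might be spelled, and the sample config is a made-up fragment, not the actual LiteLLM schema. Adjust both to match the identifiers your gateway really uses.

```python
import re

# Hypothetical pattern: matches "sonnet-4-5" or "sonnet-4.5", case-insensitive.
# Adjust to the exact model identifiers your configs actually contain.
PINNED = re.compile(r"sonnet-4[-.]5", re.IGNORECASE)


def find_pinned_lines(config_text: str) -> list[tuple[int, str]]:
    """Return (1-based line number, stripped line) for each line mentioning the model."""
    return [
        (number, line.strip())
        for number, line in enumerate(config_text.splitlines(), start=1)
        if PINNED.search(line)
    ]


# Made-up config fragment for demonstration; keys and values are placeholders.
sample = """model_list:
  - model_name: fast
    model: example/claude-sonnet-4-5
  - model_name: other
    model: example/some-other-model
"""

for number, line in find_pinned_lines(sample):
    print(f"line {number}: {line}")
```

Reading each config file with `pathlib.Path.read_text()` and passing the result to `find_pinned_lines` would extend this to a whole repository.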
swyx keynote: AI Engineer Singapore — The Agentic Nation
Closing keynote at first AI Engineer Singapore event; swyx shares opinionated takes on national agentic strategy.
Why this matters: Context for AI Engineer World’s Fair — same community, same speaker circuit.
Auto-curated daily by Claude Opus 4.7 from GitHub: cline/cline, GitHub: ggml-org/llama.cpp, Lenny’s Newsletter, Simon Willison, r/ClaudeAI top, r/LocalLLaMA top, r/MachineLearning top, swyx.io. Source list and editorial profile maintained by Daniel.