AI Briefing

llama.cpp MTP Speedup, Claude Context Tools, DeepSeek V4 1M Context

Monday, May 18, 2026 - AI News · (last 24 hours)

llama.cpp ships MTP prompt-decode optimisation while community benchmarks DeepSeek V4’s 1M context window across real codebases.

Must read

Tools & Frameworks

4-month Claude Pro vs ChatGPT Plus comparison by task type

Opus 4.7 wins on writing and code reasoning; GPT-5 wins on search and data analysis; both roughly equal on generation.

Why this matters: Useful for model-routing decisions in your gateway.
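
A minimal sketch of how those task-type verdicts could feed a routing table. The model labels, the task-type keys, and the tie-break default are all illustrative assumptions, not real gateway or API identifiers.

```python
# Hypothetical routing table built from the comparison's task-type verdicts.
# Labels are placeholders; swap in whatever names your gateway actually uses.
ROUTES = {
    "writing": "opus-4.7",
    "code_reasoning": "opus-4.7",
    "search": "gpt-5",
    "data_analysis": "gpt-5",
}

# Assumption: tasks where the two were rated roughly equal (or that are
# unknown to the table) fall back to a single default.
DEFAULT_MODEL = "opus-4.7"


def pick_model(task_type: str) -> str:
    """Return the model label for a task type, falling back to the default."""
    return ROUTES.get(task_type, DEFAULT_MODEL)


if __name__ == "__main__":
    for task in ("writing", "search", "generation"):
        print(task, "->", pick_model(task))
```

Keeping the table as plain data means a new comparison only changes the dictionary, not the routing logic.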

Opus spawns 3 subagents for frontend PageSpeed fixes across 41 files

User wrote a playbook ADR, pointed Opus at 9 pages; it self-organised subagents and hit near-perfect Lighthouse scores in 15 minutes.

Why this matters: Pattern for your overnight-agent-factory: ADR + subagent delegation.

Structured workflows with small local models as agent loops

Home-rolled agent loop with local models reaches self-editing capability; author documents progressive-disclosure patterns.

Why this matters: Aligns with your skills-framework thinking for local models.

Cline CLI v3.0.6–3.0.7: ChatGPT OAuth fix, GPT-5.x model support

Adds GPT-5.2, 5.4, 5.4-mini to Cline CLI model list; removes startup OAuth round-trip.

Why this matters: Minor but shows Cline keeping pace with new OpenAI models.

Open Models & Local

llama.cpp MTP benchmarks on Qwen3.6 with RTX 5090

Qwen3.6-27B Q5_K_M generates ~1.8× faster with MTP enabled than with it disabled, measured at 128k context on an RTX 5090.

Why this matters: MTP is the key local-inference acceleration; validates Qwen3.6-MTP GGUFs.

Abliterlitics: 85 GPU-hours comparing 5 abliteration methods on Qwen3.6-27B

Open-source forensics toolkit benchmarks safety removal techniques with weight-level analysis on Qwen3.6-27B variants.

Why this matters: If you run uncensored local models, this quantifies the quality trade-offs.

Qwopus3.5-9B-Coder-GGUF: agentic coding fine-tune

9B dense model optimised for tool calling and agentic coding; runs at Q8 on 16 GB RAM devices including Mac mini.

Why this matters: Potential local coding model for your Apple Silicon fleet at minimal VRAM cost.

Dual GPU llama.cpp: quantized KV cache with tensor parallelism

Fork fixes llama.cpp’s tensor-split limitation — enables quantized KV caches across multiple GPUs for larger contexts.

Why this matters: Watch-but-don’t-act unless you run multi-GPU Linux boxes.

M5 vs DGX Spark vs Strix Halo vs RTX 6000: standardised local LLM benchmarks

RTX 6000 at ~1,800 GB/s bandwidth dominates; M5 at ~600 GB/s; DGX Spark at ~256 GB/s. Full repo published.

Why this matters: Hard numbers for your next hardware purchase decision for local inference.
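
To put the bandwidth figures in perspective, here is a rough sketch of the usual rule of thumb for dense models: each generated token reads all weights once, so decode speed is capped near bandwidth divided by weight size. The 27B parameter count and the 5.5 bits-per-weight figure (an approximation for a Q5_K_M-style quant) are assumptions chosen for illustration, and the estimate ignores KV cache reads, compute limits, and speculative techniques such as MTP.

```python
def decode_ceiling_tokens_per_s(bandwidth_gb_s: float,
                                params_billion: float,
                                bits_per_weight: float) -> float:
    """Upper-bound decode speed for a dense model that is memory-bandwidth bound."""
    # Weight footprint in GB: billions of parameters times bytes per parameter.
    weight_gb = params_billion * bits_per_weight / 8
    return bandwidth_gb_s / weight_gb


# Bandwidth figures as quoted in the benchmark summary above.
DEVICES = {"RTX 6000": 1800, "M5": 600, "DGX Spark": 256}

if __name__ == "__main__":
    for name, bandwidth in DEVICES.items():
        # Assumed model: 27B parameters at about 5.5 bits per weight.
        ceiling = decode_ceiling_tokens_per_s(bandwidth, 27, 5.5)
        print(f"{name}: at most ~{ceiling:.0f} tokens/s")
```

Under these assumptions the ceilings come out at roughly 97, 32, and 14 tokens/s respectively; measured numbers in the published repo will sit below that.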

Sonnet 4.5 discontinuation pushed to May 18

Anthropic extended the Sonnet 4.5 end-of-life date by 3 days, from May 15 to May 18; removal is likely imminent.

Why this matters: Check any pipelines still pinned to Sonnet 4.5 via your LiteLLM gateway.
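
One quick way to run that check is a text scan over your gateway config. This is a sketch only: the regex assumes pinned model IDs contain "sonnet-4-5" or "sonnet-4.5", and the sample config is a made-up fragment in the general shape of a LiteLLM `model_list`, not your actual file.

```python
import re

# Assumption: pinned model IDs contain "sonnet-4-5" or "sonnet-4.5".
PINNED = re.compile(r"sonnet-4[-.]5", re.IGNORECASE)


def find_pinned_lines(config_text: str) -> list[tuple[int, str]]:
    """Return (line_number, stripped_line) for every line matching the pattern."""
    return [
        (number, line.strip())
        for number, line in enumerate(config_text.splitlines(), start=1)
        if PINNED.search(line)
    ]


# Hypothetical config fragment used only to demonstrate the scan.
SAMPLE_CONFIG = """model_list:
  - model_name: default-chat
    litellm_params:
      model: anthropic/claude-sonnet-4-5
  - model_name: heavy
    litellm_params:
      model: anthropic/claude-opus
"""

if __name__ == "__main__":
    for number, line in find_pinned_lines(SAMPLE_CONFIG):
        print(f"line {number}: {line}")
```

A plain-text scan avoids a YAML dependency and also catches pins hiding in comments, environment files, or scripts.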

swyx keynote: AI Engineer Singapore — The Agentic Nation

Closing keynote at first AI Engineer Singapore event; swyx shares opinionated takes on national agentic strategy.

Why this matters: Context for AI Engineer World’s Fair — same community, same speaker circuit.


Auto-curated daily by Claude Opus 4.7 from GitHub: cline/cline, GitHub: ggml-org/llama.cpp, Lenny’s Newsletter, Simon Willison, r/ClaudeAI top, r/LocalLLaMA top, r/MachineLearning top, swyx.io. Source list and editorial profile maintained by Daniel.