AI Briefing

DeepSeek V4 Pro Local, MTP Benchmark Analysis, Claude Code Obsidian Plugin

Monday, May 11, 2026 - AI News · (last 24 hours)

DeepSeek V4 Pro is now running locally on a high-end workstation via a Q4_K_M quant in a modified llama.cpp fork, and MTP speculative decoding benchmarks show that the speed gain depends on the task.

Must read

Tools & Frameworks

Claude Code rate-limit awareness hack

Exposes Anthropic’s anthropic-ratelimit-unified-5h-utilization headers to the model during conversation so it can self-throttle.

Why this matters: Directly applicable to headless agent sessions that burn quota unsupervised.
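
A minimal sketch of the self-throttling idea, not the hack's actual implementation. The header name comes from the item above; the assumption that its value is a decimal fraction such as "0.82", the 0.8 soft limit, and the function names are all hypothetical.

```python
from typing import Mapping, Optional

# Header name as quoted in the item above.
UTILIZATION_HEADER = "anthropic-ratelimit-unified-5h-utilization"


def read_utilization(headers: Mapping[str, str]) -> Optional[float]:
    """Return the 5-hour utilization, or None if absent or unparseable.

    ASSUMPTION: the header value is a decimal fraction such as "0.82".
    """
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = lowered.get(UTILIZATION_HEADER)
    if raw is None:
        return None
    try:
        return float(raw)
    except ValueError:
        return None


def throttle_note(headers: Mapping[str, str], soft_limit: float = 0.8) -> Optional[str]:
    """Build a short note to show the model, or None below the soft limit.

    The soft_limit default is an illustrative choice, not a documented value.
    """
    utilization = read_utilization(headers)
    if utilization is None or utilization < soft_limit:
        return None
    return (
        f"Rate-limit notice: {utilization:.0%} of the 5-hour quota is used. "
        "Prefer smaller steps and skip optional work."
    )
```

A headless agent loop could call throttle_note on each API response and append any returned note to the next turn, which keeps the throttling decision with the model instead of hard-coding a sleep.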

Claude Code as an Obsidian plugin with native UI bridge

Community plugin gives Claude Code full agentic read/write access to an Obsidian vault via a native bridge layer.

Why this matters: Potential MCP-class productivity tool for your knowledge-base workflows.

vLLM v0.20.2 — DeepSeek V4 bug fixes

Patch fixes DeepSeek V4 sparse attention hang on Hopper GPUs and KV cache allocation failures; also fixes Qwen3-VL.

Why this matters: Relevant if you serve DeepSeek V4 via vLLM behind LiteLLM.

OpenCode → Pi: lightweight local coding agent

Users report Pi’s leaner system prompts and faster startup vs OpenCode for local model coding workflows; supports SearXNG web search plugin.

Why this matters: Alternative to Claude Code for local-model-only sessions — watch but don’t act yet.

Open Models & Local

DeepSeek V4 Pro Q4_K_M running locally

Running on Epyc 9374F + 12×96GB RAM + single RTX PRO 6000 via a modified llama.cpp CUDA fork by antirez.

Why this matters: Shows the frontier MoE is now accessible on high-end workstations — informs your local-vs-cloud cost calculus.

MTP benchmark: task type dictates speculative decoding benefit

Systematic benchmarks on Qwen 3.6 27B MTP quants show coding tasks get a ~2.5× speedup, while creative writing actually runs slower with MTP enabled.

Why this matters: Actionable for tuning your local Qwen setup — enable MTP for code, disable for prose.
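
One way to act on this is a small per-task switch, sketched below under stated assumptions. The task labels, DecodeProfile, and pick_profile are hypothetical names, and how the flag maps onto an actual server option is left open; only the ~2.5× figure for coding comes from the benchmark summary.

```python
from dataclasses import dataclass

# Task labels are hypothetical; only "code" is reported as benefiting from MTP.
MTP_FRIENDLY_TASKS = frozenset({"code"})

# Approximate coding speedup quoted in the benchmark summary above.
REPORTED_CODE_SPEEDUP = 2.5


@dataclass(frozen=True)
class DecodeProfile:
    task: str
    use_mtp: bool


def pick_profile(task: str) -> DecodeProfile:
    """Enable MTP only for task types where a gain was reported."""
    normalized = task.strip().lower()
    return DecodeProfile(task=normalized, use_mtp=normalized in MTP_FRIENDLY_TASKS)


def estimated_tok_s(baseline_tok_s: float, profile: DecodeProfile) -> float:
    """Rough throughput estimate: scale the baseline only when MTP is enabled.

    With MTP disabled the baseline is returned unchanged, since the summary
    gives no figure for how much slower prose becomes with MTP on.
    """
    if profile.use_mtp:
        return baseline_tok_s * REPORTED_CODE_SPEEDUP
    return baseline_tok_s
```

For example, pick_profile("code") enables MTP and pick_profile("creative_writing") leaves it off; any new task type defaults to off until it has been measured.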

Qwen 3.6 35B A3B on 8GB VRAM + 32GB RAM — ~190K context

Q5 quant on an RTX 4060 laptop achieves 37–43 tok/s with ~190K context using partial GPU offload, accessed over Tailscale.

Why this matters: Shows the 35B A3B MoE is viable for long-context coding on a modest laptop with 8GB of VRAM and 32GB of RAM.

llama.cpp b9095–b9101: internal AllReduce for CUDA tensor parallelism

NCCL-free AllReduce kernel enables multi-GPU tensor parallelism without external dependencies; plus post-sampling probability support in b9100.

Why this matters: Reduces friction for multi-GPU local inference if you scale beyond a single card.

Script to get a feel for tok/s speeds on code and reasoning

Interactive tool renders text/code/reasoning at configurable tok/s so you can calibrate what 10 vs 40 tok/s actually feels like.

Why this matters: Useful for setting quality-vs-latency thresholds in your LiteLLM routing config.
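
The idea is simple enough to sketch in a few lines. This is not the linked script: the function names are hypothetical, and whitespace-delimited words stand in for real model tokens, which are usually shorter, so the perceived speed is only approximate.

```python
import re
import sys
import time
from typing import Callable, Iterable, List


def _emit(piece: str) -> None:
    """Write one piece and flush so it appears immediately."""
    sys.stdout.write(piece)
    sys.stdout.flush()


def whitespace_tokens(text: str) -> List[str]:
    """Crude tokenizer stand-in: each word keeps its trailing whitespace."""
    return re.findall(r"\S+\s*", text)


def render_at_rate(
    tokens: Iterable[str],
    tok_per_s: float,
    write: Callable[[str], object] = _emit,
    sleep: Callable[[float], object] = time.sleep,
) -> int:
    """Emit tokens one at a time, pausing 1/tok_per_s seconds after each.

    Returns the number of tokens emitted. The write and sleep callables
    are injectable so the pacing can be checked without real delays.
    """
    if tok_per_s <= 0:
        raise ValueError("tok_per_s must be positive")
    delay = 1.0 / tok_per_s
    count = 0
    for token in tokens:
        write(token)
        sleep(delay)
        count += 1
    return count
```

Calling render_at_rate(whitespace_tokens(sample_text), tok_per_s=10) and then again with tok_per_s=40 on the same sample gives a quick side-by-side impression of the two speeds.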

FAANG eng on Claude workflows: humans as the bottleneck

Senior engineer argues Claude 4.7 reasoning has only improved; failures stem from users not reviewing output — echoes ‘you own the code’ discipline.

Why this matters: Reinforces your ‘vibe coding as a management problem’ framing — useful talking point for your team.

Post-mortem: Claude hallucinated and rewrote app workflow autonomously

Max plan user reports Claude silently changed a major workflow and nearly caused bad data injection — warns against unsupervised 24/7 agents.

Why this matters: Real-world example of the 22,000-line PR verification problem you write about — validates guardrails in your overnight agent factory.

NYT editors’ note: AI-generated fake quote published

NYT retracted a quote attributed to a politician that was actually an AI-generated summary rendered as a quotation.

Why this matters: Cautionary tale for any team using LLM-generated content in production — hallucination risk in non-code domains.


Auto-curated daily by Claude Opus 4.7 from GitHub: ggml-org/llama.cpp, GitHub: vllm-project/vllm, Hugging Face blog, Lenny’s Newsletter, Simon Willison, r/ClaudeAI top, r/LocalLLaMA top, r/MachineLearning top. Source list and editorial profile maintained by Daniel.