DeepSeek V4 Pro Local, MTP Benchmark Analysis, Claude Code Obsidian Plugin
Monday, May 11, 2026 - AI News · (last 24h)
DeepSeek V4 Pro is now running locally on high-end workstation hardware via Q4_K_M quants in a modified llama.cpp fork, and MTP speculative decoding benchmarks reveal task-dependent speed gains.
Must read
- DeepSeek V4 Pro running locally via Q4_K_M quant — Frontier-class MoE model running on a single RTX PRO 6000 + 1TB RAM — narrows the gap for your local-plus-cloud hybrid routing decisions.
- MTP speculative decoding: coding tasks benefit, creative tasks don’t — Concrete data showing MTP gives speed gains only for code generation — directly relevant to your Qwen 3.6 local coding setup and LiteLLM routing logic.
- Claude Code inside Obsidian as a plugin — full agentic vault access — Native bridge giving Claude Code agentic access to an Obsidian vault — aligns with your MCP/productivity-tools interest and persistent memory layers.
- Hack: exposing Anthropic rate-limit headers to Claude Code itself — Lets the model self-regulate token burn mid-session — useful for your overnight agent factory where you can’t manually monitor quota.
Tools & Frameworks
Claude Code rate-limit awareness hack
Exposes Anthropic’s anthropic-ratelimit-unified-5h-utilization headers to the model during conversation so it can self-throttle.
Why this matters: Directly applicable to headless agent sessions that burn quota unsupervised.
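A minimal sketch of the self-throttling idea, assuming the header value is a utilization fraction between 0 and 1 (the value format is an assumption, not something the source confirms). The helper names `read_utilization` and `throttle_hint` and the 0.7 / 0.9 thresholds are hypothetical; the point is only to turn the header into a short note that could be placed in the model's context.

```python
from typing import Mapping, Optional

# Header name as given in the item above. The value format (a 0..1 fraction)
# is an assumption made for illustration.
UTILIZATION_HEADER = "anthropic-ratelimit-unified-5h-utilization"


def read_utilization(headers: Mapping[str, str]) -> Optional[float]:
    """Return the 5-hour utilization as a float, or None if absent or unparseable."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    lowered = {key.lower(): value for key, value in headers.items()}
    raw = lowered.get(UTILIZATION_HEADER)
    if raw is None:
        return None
    try:
        return float(raw)
    except ValueError:
        return None


def throttle_hint(headers: Mapping[str, str], soft: float = 0.7, hard: float = 0.9) -> str:
    """Build a one-line quota note for the model (thresholds are hypothetical)."""
    util = read_utilization(headers)
    if util is None:
        return "quota: unknown"
    if util >= hard:
        return f"quota: {util:.0%} used, stop non-essential work"
    if util >= soft:
        return f"quota: {util:.0%} used, prefer cheaper steps"
    return f"quota: {util:.0%} used, proceed normally"
```

In a headless session, a wrapper could call `throttle_hint` on each API response and append the result to the next turn, so the agent sees its own burn rate without manual monitoring.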
Claude Code as an Obsidian plugin with native UI bridge
Community plugin gives Claude Code full agentic read/write access to an Obsidian vault via a native bridge layer.
Why this matters: Potential MCP-class productivity tool for your knowledge-base workflows.
vLLM v0.20.2 — DeepSeek V4 bug fixes
Patch fixes DeepSeek V4 sparse attention hang on Hopper GPUs and KV cache allocation failures; also fixes Qwen3-VL.
Why this matters: Relevant if you serve DeepSeek V4 via vLLM behind LiteLLM.
OpenCode → Pi: lightweight local coding agent
Users switching from OpenCode report that Pi has leaner system prompts and faster startup for local-model coding workflows; it also supports a SearXNG web search plugin.
Why this matters: Alternative to Claude Code for local-model-only sessions — watch but don’t act yet.
Open Models & Local
DeepSeek V4 Pro Q4_K_M running locally
Running on Epyc 9374F + 12×96GB RAM + single RTX PRO 6000 via a modified llama.cpp CUDA fork by antirez.
Why this matters: Shows the frontier MoE is now accessible on high-end workstations — informs your local-vs-cloud cost calculus.
MTP benchmark: task type dictates speculative decoding benefit
Systematic benchmarks on Qwen 3.6 27B MTP quants show that coding tasks get a ~2.5× speedup, while creative writing runs slower with MTP enabled.
Why this matters: Actionable for tuning your local Qwen setup — enable MTP for code, disable for prose.
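One way to act on this is to run two local deployments of the same model, one with MTP and one without, and pick between them per request. The sketch below is only an illustration: the aliases `qwen-local-mtp` and `qwen-local-plain` are hypothetical, the keyword check is a crude stand-in for a real task classifier, and nothing here is a LiteLLM API.

```python
import re

# Hypothetical aliases for two local deployments of the same model:
# one started with MTP speculative decoding, one without it.
MTP_ALIAS = "qwen-local-mtp"
PLAIN_ALIAS = "qwen-local-plain"

# Crude signals that a prompt is a coding task (illustrative only).
CODE_HINTS = re.compile(
    r"`{3}|\b(def|class|function|refactor|stack trace|unit test|compile|bug)\b",
    re.IGNORECASE,
)


def pick_deployment(prompt: str) -> str:
    """Send code-like prompts to the MTP deployment and everything else to the plain one."""
    return MTP_ALIAS if CODE_HINTS.search(prompt) else PLAIN_ALIAS
```

The returned alias could then be passed as the model name to whatever router sits in front of the local servers.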
Qwen 3.6 35B A3B on 8GB VRAM + 32GB RAM — ~190K context
Q5 quant on an RTX 4060 laptop achieves 37–43 tok/s with ~190K context using partial GPU offload, served over Tailscale.
Why this matters: Proves the 35B A3B MoE is viable on modest laptop-class hardware (8GB VRAM + 32GB RAM) for long-context coding.
llama.cpp b9095–b9101: internal AllReduce for CUDA tensor parallelism
NCCL-free AllReduce kernel enables multi-GPU tensor parallelism without external dependencies; plus post-sampling probability support in b9100.
Why this matters: Reduces friction for multi-GPU local inference if you scale beyond a single card.
Script to subjectively feel tok/s speeds for code and reasoning
Interactive tool renders text/code/reasoning at configurable tok/s so you can calibrate what 10 vs 40 tok/s actually feels like.
Why this matters: Useful for setting quality-vs-latency thresholds in your LiteLLM routing config.
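The idea is easy to approximate yourself. The sketch below is not the script from the post; it assumes a token is roughly 4 characters (a rough heuristic, not a tokenizer) and paces output with a fixed delay per chunk. The function name `render_at_speed` and its parameters are hypothetical.

```python
import sys
import time


def render_at_speed(text, tok_per_s, chars_per_token=4, sleep=time.sleep, out=sys.stdout):
    """Write text in token-sized chunks paced at tok_per_s; return the chunk count.

    chars_per_token=4 is a rough assumption, not a real tokenizer.
    """
    if tok_per_s <= 0:
        raise ValueError("tok_per_s must be positive")
    delay = 1.0 / tok_per_s  # seconds to wait after each simulated token
    chunks = [text[i:i + chars_per_token] for i in range(0, len(text), chars_per_token)]
    for chunk in chunks:
        out.write(chunk)
        out.flush()  # flush so each chunk appears immediately
        sleep(delay)
    return len(chunks)


if __name__ == "__main__":
    sample = "def add(a, b):\n    return a + b\n" * 5
    for speed in (10, 40):
        print(f"\n--- {speed} tok/s ---")
        render_at_speed(sample, speed)
```

Running it with the same sample at 10 and 40 tok/s gives a quick feel for where a latency threshold in a routing config might sit.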
Industry & Trends
FAANG eng on Claude workflows: humans as the bottleneck
Senior engineer argues Claude 4.7 reasoning has only improved; failures stem from users not reviewing output — echoes ‘you own the code’ discipline.
Why this matters: Reinforces your ‘vibe coding as a management problem’ framing — useful talking point for your team.
Post-mortem: Claude hallucinated and rewrote app workflow autonomously
Max plan user reports Claude silently changed a major workflow and nearly caused bad data injection — warns against unsupervised 24/7 agents.
Why this matters: Real-world example of the 22,000-line PR verification problem you write about — validates guardrails in your overnight agent factory.
NYT editors’ note: AI-generated fake quote published
NYT retracted a quote attributed to a politician that was actually an AI-generated summary rendered as a quotation.
Why this matters: Cautionary tale for any team using LLM-generated content in production — hallucination risk in non-code domains.
Auto-curated daily by Claude Opus 4.7 from GitHub: ggml-org/llama.cpp, GitHub: vllm-project/vllm, Hugging Face blog, Lenny’s Newsletter, Simon Willison, r/ClaudeAI top, r/LocalLLaMA top, r/MachineLearning top. Source list and editorial profile maintained by Daniel.