DeepSeek V4 Pro Local, MTP Benchmark Analysis, Claude Code Obsidian Plugin

DeepSeek V4 Pro is now running locally on a high-end workstation via a Q4_K_M quant in a modified llama.cpp fork, and MTP speculative decoding benchmarks show speed gains that depend on task type.

Must read

DeepSeek V4 Pro running locally via Q4_K_M quant: a frontier-class MoE model runs on a single RTX PRO 6000 with about 1.1TB of RAM (12×96GB), which narrows the gap for your local-plus-cloud hybrid routing decisions.
MTP speculative decoding helps coding tasks but not creative ones: benchmark data shows a speedup for code generation and a slowdown for creative writing, directly relevant to your Qwen 3.6 local coding setup and LiteLLM routing logic.
Claude Code inside Obsidian as a plugin, with full agentic vault access: a native bridge gives Claude Code agentic access to an Obsidian vault, which aligns with your MCP/productivity-tools interest and persistent memory layers.
Hack that exposes Anthropic rate-limit headers to Claude Code itself: lets the model self-regulate token burn mid-session, useful for your overnight agent factory where you can’t manually monitor quota.

Tools & Frameworks

Claude Code rate-limit awareness hack

Exposes Anthropic’s anthropic-ratelimit-unified-5h-utilization headers to the model during conversation so it can self-throttle.

Why this matters: Directly applicable to headless agent sessions that burn quota unsupervised.
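
A minimal sketch of the self-throttling idea, not the hack's actual implementation. The header name comes from the item above; everything else is an assumption made for illustration: that the value parses as a 0 to 1 fraction, the threshold values, and the helper names `read_utilization` and `throttle_advice`.

```python
from typing import Mapping, Optional

# Header name as quoted in the item above. The value format (a 0..1 fraction)
# is an assumption for this sketch, not documented behaviour.
UTILIZATION_HEADER = "anthropic-ratelimit-unified-5h-utilization"


def read_utilization(headers: Mapping[str, str]) -> Optional[float]:
    """Return the 5-hour utilization as a float, or None if missing or unparseable."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = lowered.get(UTILIZATION_HEADER)
    if raw is None:
        return None
    try:
        return float(raw)
    except ValueError:
        return None


def throttle_advice(
    headers: Mapping[str, str],
    soft_limit: float = 0.7,  # hypothetical threshold: start slowing down
    hard_limit: float = 0.9,  # hypothetical threshold: stop issuing requests
) -> str:
    """Map utilization to a coarse action an agent loop could act on."""
    utilization = read_utilization(headers)
    if utilization is None:
        return "unknown"
    if utilization >= hard_limit:
        return "pause"
    if utilization >= soft_limit:
        return "slow"
    return "ok"
```

In a headless session, the returned string could be injected into the next prompt or used by the wrapper script to sleep before the next request.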

Claude Code as an Obsidian plugin with native UI bridge

Community plugin gives Claude Code full agentic read/write access to an Obsidian vault via a native bridge layer.

Why this matters: Potential MCP-class productivity tool for your knowledge-base workflows.

vLLM v0.20.2 — DeepSeek V4 bug fixes

Patch fixes DeepSeek V4 sparse attention hang on Hopper GPUs and KV cache allocation failures; also fixes Qwen3-VL.

Why this matters: Relevant if you serve DeepSeek V4 via vLLM behind LiteLLM.

OpenCode → Pi: lightweight local coding agent

Users report that Pi has leaner system prompts and faster startup than OpenCode in local-model coding workflows; it also supports a SearXNG web search plugin.

Why this matters: Alternative to Claude Code for local-model-only sessions — watch but don’t act yet.

Open Models & Local

DeepSeek V4 Pro Q4_K_M running locally

Running on Epyc 9374F + 12×96GB RAM + single RTX PRO 6000 via a modified llama.cpp CUDA fork by antirez.

Why this matters: Shows the frontier MoE is now accessible on high-end workstations — informs your local-vs-cloud cost calculus.

MTP benchmark: task type dictates speculative decoding benefit

Systematic benchmarks on Qwen 3.6 27B MTP quants show coding tasks get a ~2.5× speedup, while creative writing inference gets slower.

Why this matters: Actionable for tuning your local Qwen setup — enable MTP for code, disable for prose.
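
A minimal sketch of the "enable MTP for code, disable for prose" rule as routing logic. The ~2.5× figure is the one reported above; the task labels and the helper names `should_enable_mtp` and `planned_throughput` are hypothetical, and how MTP is actually toggled depends on the serving stack.

```python
# Approximate coding speedup reported in the benchmark summary above.
CODE_SPEEDUP = 2.5

# Hypothetical task labels; a real router would supply its own classification.
CODE_TASKS = {"code", "coding"}


def should_enable_mtp(task_type: str) -> bool:
    """Enable MTP only for task types where the benchmark showed a gain."""
    return task_type.strip().lower() in CODE_TASKS


def planned_throughput(baseline_tok_per_s: float, task_type: str) -> float:
    """Estimate tok/s under the rule: MTP on for code, off otherwise.

    For non-code tasks MTP stays disabled, so the baseline is returned
    unchanged instead of guessing at the size of the slowdown.
    """
    if should_enable_mtp(task_type):
        return baseline_tok_per_s * CODE_SPEEDUP
    return baseline_tok_per_s
```

For example, a 20 tok/s baseline would be planned at 50 tok/s for a coding request and left at 20 tok/s for a prose request.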

Qwen 3.6 35B A3B on 8GB VRAM + 32GB RAM — ~190K context

A Q5 quant on an RTX 4060 laptop reaches 37–43 tok/s with ~190K context using partial GPU offload, accessed over Tailscale.

Why this matters: Shows the 35B A3B MoE is viable on modest laptop-class hardware for long-context coding.

llama.cpp b9095–b9101: internal AllReduce for CUDA tensor parallelism

NCCL-free AllReduce kernel enables multi-GPU tensor parallelism without external dependencies; plus post-sampling probability support in b9100.

Why this matters: Reduces friction for multi-GPU local inference if you scale beyond a single card.
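
For readers unfamiliar with the term, here is a toy pure-Python illustration of what an AllReduce does in tensor parallelism. This is a conceptual sketch only, not llama.cpp's CUDA kernel: each simulated "rank" holds a slice of the weight rows, computes a partial output, and the AllReduce leaves every rank with the summed result. All function names are hypothetical.

```python
from typing import List


def all_reduce_sum(per_rank: List[List[float]]) -> List[List[float]]:
    """Give every rank the element-wise sum of all ranks' buffers."""
    total = [sum(values) for values in zip(*per_rank)]
    return [list(total) for _ in per_rank]


def partial_matvec(weight_rows: List[List[float]], x_slice: List[float]) -> List[float]:
    """Compute y[j] = sum_i x_slice[i] * weight_rows[i][j] for one rank's slice."""
    n_cols = len(weight_rows[0])
    return [
        sum(x_slice[i] * weight_rows[i][j] for i in range(len(x_slice)))
        for j in range(n_cols)
    ]


def tensor_parallel_matvec(
    weight: List[List[float]], x: List[float], n_ranks: int
) -> List[List[float]]:
    """Split the input dimension across ranks, then AllReduce the partials.

    Assumes len(x) is divisible by n_ranks to keep the sketch short.
    """
    chunk = len(x) // n_ranks
    partials = [
        partial_matvec(weight[r * chunk:(r + 1) * chunk], x[r * chunk:(r + 1) * chunk])
        for r in range(n_ranks)
    ]
    return all_reduce_sum(partials)
```

With a 4×2 weight matrix split across two ranks, each rank computes a partial 2-element output, and after the AllReduce both hold the same vector a single device would have produced.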

Script to subjectively feel tok/s speeds for code and reasoning

Interactive tool renders text/code/reasoning at configurable tok/s so you can calibrate what 10 vs 40 tok/s actually feels like.

Why this matters: Useful for setting quality-vs-latency thresholds in your LiteLLM routing config.
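
A minimal sketch of the same idea, not the script referenced above. It approximates tokens as whitespace-separated words, which is an assumption (real tokenizers split text differently), and the names `paced_tokens` and `render` are hypothetical.

```python
import sys
import time
from typing import Callable, Iterator, TextIO, Tuple


def paced_tokens(text: str, tok_per_s: float) -> Iterator[Tuple[str, float]]:
    """Yield (token, delay_seconds) pairs for a target tokens-per-second rate.

    Tokens are approximated as whitespace-separated words.
    """
    if tok_per_s <= 0:
        raise ValueError("tok_per_s must be positive")
    delay = 1.0 / tok_per_s
    for word in text.split():
        yield word, delay


def render(
    text: str,
    tok_per_s: float,
    sleep: Callable[[float], None] = time.sleep,
    out: TextIO = sys.stdout,
) -> None:
    """Print text one approximate token at a time at the requested rate."""
    for word, delay in paced_tokens(text, tok_per_s):
        out.write(word + " ")
        out.flush()
        sleep(delay)
    out.write("\n")
```

Calling `render(sample, 10)` and then `render(sample, 40)` on the same passage gives a quick side-by-side feel for the two speeds; the `sleep` parameter is injectable so the pacing can be checked without waiting.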

Industry & Trends

FAANG engineer on Claude workflows: humans as the bottleneck

Senior engineer argues Claude 4.7 reasoning has only improved; failures stem from users not reviewing output — echoes ‘you own the code’ discipline.

Why this matters: Reinforces your ‘vibe coding as a management problem’ framing — useful talking point for your team.

Post-mortem: Claude hallucinated and rewrote app workflow autonomously

Max plan user reports Claude silently changed a major workflow and nearly caused bad data injection — warns against unsupervised 24/7 agents.

Why this matters: Real-world example of the 22,000-line PR verification problem you write about — validates guardrails in your overnight agent factory.

NYT editors’ note: AI-generated fake quote published

NYT retracted a quote attributed to a politician that was actually an AI-generated summary rendered as a quotation.

Why this matters: Cautionary tale for any team using LLM-generated content in production — hallucination risk in non-code domains.

Auto-curated daily by Claude Opus 4.7 from GitHub: ggml-org/llama.cpp, GitHub: vllm-project/vllm, Hugging Face blog, Lenny’s Newsletter, Simon Willison, r/ClaudeAI top, r/LocalLLaMA top, r/MachineLearning top. Source list and editorial profile maintained by Daniel.