Grok Build 0.1, Mellum2 12B Open-Source, Claude Code v2.1.160

xAI launched Grok Build 0.1 for agentic coding at $1/M input tokens, while JetBrains open-sourced Mellum2 12B for production AI routing.

Must read

Grok Build 0.1 on API — New agentic coding model at $1/$2 per M tokens, 100+ tok/s, integrates with Cursor — a viable routing option for your LiteLLM gateway.
Mellum2: 12B MoE Model for AI Workflows (Apache 2.0) — 12B model designed for routing, sub-agents, and Q&A — small enough for local Apple Silicon and purpose-built for the orchestration layer you run.
Verifying Agentic Development at Scale (Cognition/Devin) — Cognition now runs more async than interactive Devin sessions; their 10-20 parallel agent pattern mirrors your overnight-agent-factory setup.
Claude Code v2.1.160 — Security hardening: prompts before writing to shell startup files and build-tool configs; grep-then-edit no longer needs a separate Read step.
MiniMax M3: Open-Weight Frontier Coding Model (1M context) — Open-weights model with 1M token context, desktop computer-use, and frontier coding scores — potential local/hybrid candidate worth evaluating.
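
For a rough sense of what the Grok Build 0.1 pricing means per agent run, here is a minimal Python sketch of the arithmetic. The prices ($1 input, $2 output per million tokens) and the 100 tok/s floor come from the item above; the function names and the example token counts are hypothetical, chosen only for illustration.

```python
# Prices from the Grok Build 0.1 item: $1 per million input tokens,
# $2 per million output tokens.
INPUT_USD_PER_M = 1.0
OUTPUT_USD_PER_M = 2.0
# The item says "100+ tok/s", so 100 is treated as a lower bound on speed.
TOKENS_PER_SECOND_FLOOR = 100


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Token cost in USD for one run at the listed prices."""
    total = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total / 1_000_000


def max_generation_seconds(output_tokens: int) -> float:
    """Upper bound on pure generation time at the 100 tok/s floor."""
    return output_tokens / TOKENS_PER_SECOND_FLOOR


# Hypothetical overnight run: 2M tokens read, 150k tokens written.
cost = estimate_cost_usd(2_000_000, 150_000)
seconds = max_generation_seconds(150_000)
print(f"${cost:.2f}, at most {seconds:.0f} s of generation")
```

The sketch ignores caching discounts, retries, and tool-call latency, none of which the item specifies.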

Tools & Frameworks

pi-dynamic-workflows: Fan-out Subagent Orchestration

Pi extension that lets an assistant write JS to dispatch isolated subagents for audits, refactors, and research, then synthesise results.

Why this matters: Directly applicable to your headless overnight-agent-factory pattern.
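
The fan-out pattern described above can be sketched in a few lines. This is a Python analogue for illustration, not the extension's JS API: run_subagent, fan_out, and synthesise are hypothetical names, and the subagent body is a placeholder where a real model or tool call would go.

```python
import asyncio


async def run_subagent(task: str) -> str:
    """Hypothetical stand-in for one isolated subagent run."""
    await asyncio.sleep(0)  # placeholder for real model or tool calls
    return f"findings for {task}"


async def fan_out(tasks: list[str], max_parallel: int = 4) -> list[str]:
    """Dispatch one subagent per task, capped at max_parallel at a time."""
    semaphore = asyncio.Semaphore(max_parallel)

    async def bounded(task: str) -> str:
        async with semaphore:
            return await run_subagent(task)

    # gather returns results in the same order as the input tasks.
    return await asyncio.gather(*(bounded(t) for t in tasks))


def synthesise(results: list[str]) -> str:
    """Combine per-subagent findings into a single report."""
    return "\n".join(f"- {r}" for r in results)


def run(tasks: list[str]) -> str:
    return synthesise(asyncio.run(fan_out(tasks)))


print(run(["audit", "refactor", "research"]))
```

The semaphore is one simple way to keep an unattended overnight batch from launching every subagent at once; the cap of 4 is arbitrary.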

ECC: Multi-Harness Agent Workflows with Skills & Memory

Comprehensive system for agent workflows featuring skills, instincts, memory optimisation, and security scanning.

Why this matters: Aligns with your skills-framework thinking for disciplined agentic dev.

LangGraph 1.2.3: v3 Streaming & Named Subagents

Adds v3 streaming to RemoteGraph, tool-dispatched subagent naming via lc_agent_name, and SDK interleave projections.

Why this matters: If you evaluate LangGraph for orchestration, named subagents improve observability.

LiteLLM v1.87.0

New release of your model gateway with cosign-verified Docker images; incremental update.

Why this matters: You run LiteLLM — keep images current.

Open Models & Local

Mellum2 on Hugging Face

12B MoE model optimised for latency and throughput in routing and sub-agent tasks, Apache 2.0, designed for private deployment.

Why this matters: Small enough for Apple Silicon inference via llama.cpp or MLX.

llama.cpp b9459: f16 Metal GLU Kernels

Templated GLU kernels now load/store in native f16, saving memory bandwidth on Apple Silicon while keeping float ALU compute.

Why this matters: Direct throughput improvement for local model inference on your Mac.

llama.cpp b9455: Quantised KV Cache with Tensor Parallelism

Adds quantised KV cache support to tensor-parallel inference, reducing VRAM pressure for large-context local runs.

Why this matters: Enables longer context windows on constrained local hardware.
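
To see why a quantised KV cache matters for long contexts, here is a back-of-envelope Python sketch. The model dimensions are hypothetical, not those of any model in this digest. The bytes-per-element figures follow the ggml block layouts (q8_0 stores 32 values in 34 bytes, q4_0 stores 32 values in 18 bytes). The even split across devices is a simplifying assumption about tensor parallelism, not a statement of how b9455 partitions the cache.

```python
# Bytes per cached element for each cache type.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values plus one f16 scale per block
    "q4_0": 18 / 32,  # 32 4-bit values plus one f16 scale per block
}


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 n_ctx: int, cache_type: str, n_devices: int = 1) -> float:
    """Approximate KV cache size in GiB per device.

    Counts one K and one V vector per layer, KV head, and position.
    Assumes the cache divides evenly across n_devices (simplification).
    """
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    total_bytes = elements * BYTES_PER_ELEMENT[cache_type]
    return total_bytes / n_devices / 2**30


# Hypothetical model: 40 layers, 8 KV heads, head_dim 128, 128k context.
for cache_type in ("f16", "q8_0", "q4_0"):
    single = kv_cache_gib(40, 8, 128, 131072, cache_type)
    split = kv_cache_gib(40, 8, 128, 131072, cache_type, n_devices=2)
    print(f"{cache_type}: {single:.3f} GiB total, {split:.3f} GiB per device on 2")
```

Under these assumptions the f16 cache alone is 20 GiB at 128k context, while q8_0 brings it to roughly 10.6 GiB and q4_0 to roughly 5.6 GiB, before any split across devices.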

Qwen 3.7 Plus on Vercel AI Gateway (Free Until 4 Jun)

Qwen 3.7 Plus unifies vision and language for GUI/CLI agent tasks; free for paid AI Gateway users until 4 June.

Why this matters: Zero-cost window to benchmark a visual-agent model against your stack.

Industry & Trends

OpenAI Models & Codex Now GA on AWS

OpenAI frontier models and Codex available through AWS with native IAM, procurement, and environment controls.

Why this matters: Your infra is AWS — you can now access Codex without a separate billing relationship.

Claude Opus 4.8 System Card Analysis

244-page system card released six weeks after Opus 4.7; incremental capability gains, still behind Mythos; detailed safety/capability breakdown.

Why this matters: Useful context for deciding when to upgrade your Claude Code model pin.

NVIDIA Computex 2026: N1X Laptop Chip Preview

N1X features 20 ARM cores and an RTX 5070-equivalent GPU with improved VRAM allocation for AI; the Vera Rubin datacenter platform is also expected.

Why this matters: Better VRAM on laptops directly affects what you can run locally — watch for specs.

AI Agent Bottleneck Is Permissions, Not Performance

Enterprise agents stall on permissioning; Workday uses its system of record as a governance layer, integrating with Gemini for regulated sectors.

Why this matters: Relevant to your RegTech context — governance-as-infrastructure is the pattern.

Open vs Closed Models: Different Exponentials

Nathan Lambert argues open and closed models follow diverging scaling curves, with marginal intelligence gains mattering differently by use case.

Why this matters: Frames your local-plus-cloud routing decisions with a structural argument.

Sources unavailable today: r/ChatGPTCoding top, r/ClaudeAI top, r/LocalLLaMA top, r/MachineLearning top

Auto-curated daily by Claude Opus 4.7 from Don’t Worry About the Vase (Zvi), Exponential View (Azeem Azhar), GitHub: BerriAI/litellm, GitHub: anthropics/claude-code, GitHub: cline/cline, GitHub: ggml-org/llama.cpp, GitHub: langchain-ai/langgraph, Hugging Face blog, Import AI (Jack Clark), Interconnects (Nathan Lambert), JetBrains AI blog, LangChain blog, Latent Space, Lenny’s Newsletter, NVIDIA developer blog, OpenAI blog, SaaStr (Jason Lemkin), Simon Willison, TLDR AI, The Algorithmic Bridge (Alberto Romero), Tomasz Tunguz, Understanding AI (Timothy B. Lee), Vercel blog, smol.ai news. Source list and editorial profile maintained by Daniel.