AI Briefing

Codex in Chrome, Codex /goal Persistence, GitHub Token Efficiency

Saturday, May 9, 2026 - AI News · (last 24h)

OpenAI shipped Codex as a browser-native agent running directly in Chrome tabs on macOS and Windows.

Must read

Tools & Frameworks

Claude Code v2.1.136

Adds hard_deny auto-mode rules, fixes MCP servers disappearing after /clear in VS Code and JetBrains, and adds an OTEL survey flag for enterprises.

Why this matters: The bug where MCP servers disappeared after /clear likely hit your team directly; this release fixes it.

AlphaEvolve scaling impact across fields

Gemini-powered coding agent now explains physics and designs algorithms; post details real-world deployments beyond maths.

Why this matters: Shows where autonomous code-generation agents are heading beyond dev tooling.

Deploy any HuggingFace model via Goose + Together

One-prompt deployment of any HF model to production GPU containers using the Goose agent and Together’s Dedicated Container Inference.

Why this matters: Useful for fast model eval without provisioning your own infra.

Grammar-Constrained Bash Generation for Small LMs

NVIDIA shows grammar-constrained decoding dramatically improves bash correctness in small models used as agent tool-callers.

Why this matters: Relevant if you route shell tasks to local small models in your hybrid workflow.
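To make the idea concrete, here is a minimal sketch of grammar-constrained decoding. It is not NVIDIA's implementation: the toy grammar, the token names, and the `fake_scores` stand-in for model logits are all hypothetical. The point is only the mechanism: at each step, tokens the grammar does not allow are excluded before the best-scoring token is picked.

```python
import math

# Hypothetical toy grammar as a finite-state machine:
# state -> {allowed token: next state}. Real systems compile a full
# bash grammar and work on subword tokens; this is illustration only.
GRAMMAR = {
    "start": {"ls": "ls_args", "cat": "cat_arg"},
    "ls_args": {"-l": "ls_args", "-a": "ls_args", "|": "start", "<eos>": "done"},
    "cat_arg": {"file.txt": "after_cmd"},
    "after_cmd": {"|": "start", "<eos>": "done"},
}


def fake_scores(prefix):
    """Stand-in for model logits (assumed): it always prefers an invalid token."""
    return {
        "rm -rf /": 5.0, "cat": 2.0, "file.txt": 1.5, "ls": 1.0,
        "-l": 0.5, "-a": 0.2, "<eos>": 0.0, "|": -1.0,
    }


def constrained_decode(score_fn, max_steps=16):
    """Greedy decoding that only considers tokens the grammar allows."""
    state, tokens = "start", []
    for _ in range(max_steps):
        if state == "done":
            break
        allowed = GRAMMAR[state]            # tokens valid in this state
        scores = score_fn(tokens)
        # Mask step: disallowed tokens are never candidates.
        best = max(allowed, key=lambda tok: scores.get(tok, -math.inf))
        tokens.append(best)
        state = allowed[best]
    return tokens


result = constrained_decode(fake_scores)
print(result)
```

Unconstrained greedy decoding over the same scores would emit the top-scoring token "rm -rf /" first; with the mask, the output can only be a sequence the grammar accepts.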

Open Models & Local

ds4.c — native DeepSeek V4 Flash inference engine

Antirez’s single-file Metal-only inference engine for DeepSeek V4 Flash; alpha but intentionally narrow and end-to-end.

Why this matters: Metal-only = Apple Silicon native; watch for when it stabilises.

Gemma 4 26B DFlash speculative decoding

z-lab’s DFlash draft model pushes Gemma 4 26B to 600 tok/s on a single RTX 5090 via stateful parallel block diffusion drafting.

Why this matters: DFlash’s stateful drafting may port to Apple Silicon vLLM paths — worth tracking.

Qwen 3.6 35B-A3B usable on 12 GB VRAM

Benchmarks show Qwen3.6-35B-A3B MTP IQ4_XS runs well on RTX 3060 12 GB with 16–32k context; practical MoE offloading tips included.

Why this matters: Validates MoE quants as viable local coding models on modest hardware.

Qwen3.6-27B: 80+ t/s at 262K context on RTX 4090

MTP + TurboQuant 4.25 bpv KV cache yields 80–87 t/s with 73% draft acceptance on a single 4090 at 262K context.

Why this matters: Shows what’s possible for long-context local inference — relevant to your hybrid routing decisions.

llama.cpp b9080 — Gemma4 NVFP4 support

Adds support for Gemma4_26B_A4B NVFP4 checkpoint conversion to GGUF format.

Why this matters: Unlocks running Gemma 4 26B in NVFP4 quant locally via llama.cpp.

Serving DeepSeek V4: million-token context as an inference problem

Together AI details compressed KV layouts, prefix caching, and kernel work needed to serve DeepSeek V4 at 1M tokens on HGX B200.

Why this matters: Explains the systems constraints you’d hit routing long-context tasks to V4 via your LiteLLM gateway.

OpenAI Realtime Audio Models released

Three new API models: GPT-Realtime-2 (conversational), GPT-Realtime-Translate (live multilingual), GPT-Realtime-Whisper (streaming transcription).

Why this matters: New API primitives — watch but don’t act unless voice enters your product roadmap.

Meta prepares Hatch AI Agent

Consumer-grade agent with image/video generation, shopping, and learning integrated into Instagram/Facebook; internal tests expected in June.

Why this matters: Signals big-tech consumer agent competition — context for your RegTech positioning.

DeepSeek seeking $7.35B funding, V4.1 next month

DeepSeek is reportedly raising RMB 50B (~$7.35B), described as the largest single AI round ever, with a V4.1 update planned for June.

Why this matters: V4.1 could shift your local/cloud routing calculus; funding signals sustained open-weight investment.

Anthropic growing 10x/year while others lay off

Anthropic reportedly growing revenue 10x year-over-year while competitors cut 10%+ of staff.

Why this matters: Your primary tooling vendor is thriving — good signal for Claude Code investment continuity.

Perplexity Personal Computer for all Mac users

Perplexity’s Mac desktop agent now accesses local files, apps, connectors, and web for all users — not just Pro.

Why this matters: Another desktop agent entrant; interesting as a research/browsing complement to your coding agents.


Auto-curated daily by Claude Opus 4.7 from Apple ML research, Don’t Worry About the Vase (Zvi), GitHub: anthropics/claude-code, GitHub: ggml-org/llama.cpp, GitHub: langchain-ai/langchain, Hugging Face blog, Latent Space, NVIDIA developer blog, OpenAI blog, Simon Willison, TLDR AI, Together AI blog, Vercel blog, r/ClaudeAI top, r/LocalLLaMA top, r/MachineLearning top, smol.ai news. Source list and editorial profile maintained by Daniel.