Codex in Chrome, Codex /goal Persistence, GitHub Token Efficiency
Saturday, 9 May 2026 - AI News · (last 24h)
OpenAI shipped Codex as a browser-native agent running directly in Chrome tabs on macOS and Windows.
Must read
- Codex now works directly in Chrome on macOS and Windows — Browser-native coding agent running in parallel across tabs — a direct competitor to your Claude Code headless workflows.
- The Six-Hour Codex Run That Survived a Five-Hour Pause — Codex /goal persists state across sleep/restart — mirrors your overnight-agent-factory pattern but with built-in resume.
- Improving token efficiency in GitHub Agentic Workflows — Concrete optimisations for agentic CI costs — directly applicable to your GitHub Actions + LiteLLM gateway setup.
- Running Codex safely at OpenAI — Sandboxing, approvals, and agent-native telemetry patterns — useful reference for your in-house MCP server security model.
- Using Claude Code: The Unreasonable Effectiveness of HTML — Anthropic’s Claude Code team advocates HTML over Markdown as output format — immediately testable in your Claude Code workflows.
Tools & Frameworks
Claude Code v2.1.136
Adds hard_deny auto-mode rules, fixes MCP servers disappearing after /clear in VS Code and JetBrains, and adds an OTEL survey flag for enterprises.
Why this matters: The bug where MCP servers disappeared after /clear likely hit your team directly; this release fixes it.
AlphaEvolve scaling impact across fields
Gemini-powered coding agent now explains physics and designs algorithms; post details real-world deployments beyond maths.
Why this matters: Shows where autonomous code-generation agents are heading beyond dev tooling.
Deploy any HuggingFace model via Goose + Together
One-prompt deployment of any HF model to production GPU containers using Goose agent and Together’s Dedicated Container Inference.
Why this matters: Useful for fast model eval without provisioning your own infra.
Grammar-Constrained Bash Generation for Small LMs
NVIDIA shows grammar-constrained decoding dramatically improves bash correctness in small models used as agent tool-callers.
Why this matters: Relevant if you route shell tasks to local small models in your hybrid workflow.
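For readers new to the idea, here is a minimal sketch of grammar-constrained decoding in general, not NVIDIA's implementation. The grammar, the token set, and the `toy_logits` scorer are all hypothetical stand-ins chosen for illustration: at each step the decoder only considers tokens the grammar allows in the current state, so a high-scoring but invalid token can never be emitted.

```python
# Toy grammar as a state machine: state -> {allowed token: next state}.
# Hypothetical; it only covers commands shaped like `ls [-l|-a]... <path>`.
GRAMMAR = {
    "start": {"ls": "args"},
    "args": {"-l": "args", "-a": "args", "/tmp": "end", ".": "end"},
    "end": {"<eos>": "done"},
}


def toy_logits(prefix):
    """Stand-in for a small LM: fixed scores that favour an invalid token."""
    scores = {"rm": 3.0, "ls": 2.0, "-l": 1.5, "-a": 0.2,
              "/tmp": 1.0, ".": 0.5, "<eos>": 0.1}
    # Penalise repeats so the toy model does not loop on one flag.
    return {tok: s - 2.0 * prefix.count(tok) for tok, s in scores.items()}


def constrained_decode(max_steps=10):
    """Greedy decoding restricted to tokens the grammar permits."""
    state, out = "start", []
    for _ in range(max_steps):
        allowed = GRAMMAR[state]
        logits = toy_logits(out)
        # Mask step: only grammar-legal tokens are candidates.
        token = max(allowed, key=lambda t: logits[t])
        if token == "<eos>":
            break
        out.append(token)
        state = allowed[token]
    return out


def unconstrained_first_token():
    """What plain greedy decoding would pick first, with no grammar mask."""
    logits = toy_logits([])
    return max(logits, key=logits.get)


if __name__ == "__main__":
    print("unconstrained first token:", unconstrained_first_token())
    print("constrained command:", " ".join(constrained_decode()))
```

In this toy setup the unconstrained scorer would start with `rm`, while the constrained decoder can only produce a command the grammar accepts. Real systems apply the same mask to model logits over a full tokenizer vocabulary.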
Open Models & Local
ds4.c — native DeepSeek V4 Flash inference engine
Antirez’s single-file Metal-only inference engine for DeepSeek V4 Flash; alpha but intentionally narrow and end-to-end.
Why this matters: Metal-only = Apple Silicon native; watch for when it stabilises.
Gemma 4 26B DFlash speculative decoding
z-lab’s DFlash draft model pushes Gemma 4 26B to 600 tok/s on a single RTX 5090 via stateful parallel block diffusion drafting.
Why this matters: DFlash’s stateful drafting may port to Apple Silicon vLLM paths — worth tracking.
Qwen 3.6 35B-A3B usable on 12 GB VRAM
Benchmarks show Qwen3.6-35B-A3B MTP IQ4_XS runs well on RTX 3060 12 GB with 16–32k context; practical MoE offloading tips included.
Why this matters: Validates MoE quants as viable local coding models on modest hardware.
Qwen3.6-27B: 80+ t/s at 262K context on RTX 4090
MTP + TurboQuant 4.25 bpv KV cache yields 80–87 t/s with 73% draft acceptance on a single 4090 at 262K context.
Why this matters: Shows what’s possible for long-context local inference — relevant to your hybrid routing decisions.
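To put the 73% acceptance figure in context, here is a rough back-of-envelope sketch. It uses the standard speculative-decoding estimate, which assumes each drafted token is accepted independently with the same probability. The draft lengths below are hypothetical, since the item does not state how many tokens MTP drafts per step, and the post may define "acceptance" differently.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model pass.

    Assumes independent, identical per-token acceptance (a simplification):
    1 + a + a^2 + ... + a^k = (1 - a^(k+1)) / (1 - a).
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_rate == 1.0:
        return float(draft_len + 1)
    return (1 - accept_rate ** (draft_len + 1)) / (1 - accept_rate)


if __name__ == "__main__":
    # Hypothetical draft lengths; 0.73 is the acceptance rate quoted above.
    for k in (1, 2, 3):
        print(k, round(expected_tokens_per_step(0.73, k), 3))
```

Under these assumptions, longer drafts give diminishing returns: each extra drafted token only counts if every token before it was accepted.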
llama.cpp b9080 — Gemma4 NVFP4 support
Adds support for Gemma4_26B_A4B NVFP4 checkpoint conversion to GGUF format.
Why this matters: Unlocks running Gemma 4 26B in NVFP4 quant locally via llama.cpp.
Serving DeepSeek V4: million-token context as an inference problem
Together AI details compressed KV layouts, prefix caching, and kernel work needed to serve DeepSeek V4 at 1M tokens on HGX B200.
Why this matters: Explains the systems constraints you’d hit routing long-context tasks to V4 via your LiteLLM gateway.
Industry & Trends
OpenAI Realtime Audio Models released
Three new API models: GPT-Realtime-2 (conversational), GPT-Realtime-Translate (live multilingual), GPT-Realtime-Whisper (streaming transcription).
Why this matters: New API primitives — watch but don’t act unless voice enters your product roadmap.
Meta prepares Hatch AI Agent
Consumer-grade agent with image/video gen, shopping, and learning integrated into Instagram/Facebook; internal tests expected June.
Why this matters: Signals big-tech consumer agent competition — context for your RegTech positioning.
DeepSeek seeking $7.35B funding, V4.1 next month
DeepSeek is reportedly raising RMB 50B (~$7.35B), billed as the largest single AI round ever, with a V4.1 update planned for June.
Why this matters: V4.1 could shift your local/cloud routing calculus; funding signals sustained open-weight investment.
Anthropic growing 10x/year while others lay off
Anthropic reportedly growing revenue 10x year-over-year while competitors cut 10%+ of staff.
Why this matters: Your primary tooling vendor is thriving — good signal for Claude Code investment continuity.
Perplexity Personal Computer for all Mac users
Perplexity’s Mac desktop agent now accesses local files, apps, connectors, and web for all users — not just Pro.
Why this matters: Another desktop agent entrant; interesting as a research/browsing complement to your coding agents.
Auto-curated daily by Claude Opus 4.7 from Apple ML research, Don’t Worry About the Vase (Zvi), GitHub: anthropics/claude-code, GitHub: ggml-org/llama.cpp, GitHub: langchain-ai/langchain, Hugging Face blog, Latent Space, NVIDIA developer blog, OpenAI blog, Simon Willison, TLDR AI, Together AI blog, Vercel blog, r/ClaudeAI top, r/LocalLLaMA top, r/MachineLearning top, smol.ai news. Source list and editorial profile maintained by Daniel.