Monthly AI Digest

GPT-5.5 and Codex Superapp, Claude Code w/ SpaceX Compute, DeepSeek V4 Million-Token Context

Friday 8 May 2026 - Monthly AI Briefing · May 2026

April was the month the coding-agent war went full-stack. OpenAI shipped GPT-5.5 (2× the price of 5.4, but fewer output tokens per task) alongside a radically expanded Codex desktop app — now a general-purpose agent surface with computer use, plugins, skills, memory, and Symphony orchestration turning issue trackers into agent control planes. Anthropic countered with Opus 4.7 (SWE-bench Verified 87.6%, new tokenizer, xhigh reasoning tier), then doubled Claude Code rate limits overnight via a SpaceX/Colossus compute deal, shipped Managed Agents with dreaming and multiagent orchestration, and launched Bugcrawl for repo-wide vulnerability scanning. Meanwhile DeepSeek V4 dropped as a 1.6T MoE with 1M-token context and 75% price cuts, and Moonshot’s Kimi K2.6 (1T MoE, 32B active, 256K context) arrived as the new open-weight coding leader. For builders, the practical upshot: frontier agentic performance is now available at three price tiers, harness engineering matters as much as model choice, and overnight agent factories just got meaningfully cheaper to run.

Launches & releases this month

Models

  • GPT-5.5 — OpenAI’s new flagship: $5/$30 per M tokens, improved agentic reasoning and tool use, 2× GPT-5.4 pricing. (OpenAI blog)
  • GPT-5.5 Instant — New default ChatGPT/API model with reduced hallucinations, stronger factuality, and personalisation controls. (OpenAI blog)
  • Claude Opus 4.7 — Anthropic’s strongest Opus: SWE-bench Verified 87.6%, new tokenizer, xhigh reasoning tier, 3.75MP image input. (smol.ai news)
  • DeepSeek V4 Pro & Flash — 1.6T MoE, 49B active params, 1M-token context, hybrid attention; 75% price cut within days of launch. (Hugging Face blog)
  • Moonshot Kimi K2.6 — 1T-param open-weight MoE with 32B active, 384 experts, 256K context, INT4 quant; SWE-Bench Pro 58.6%. (smol.ai news)
  • Mistral Medium 3.5 + Vibe Agents — 128B dense model powering remote async coding agents from CLI or Le Chat; runs on 4 GPUs. (TLDR AI)
  • Granite 4.1 (IBM) — Apache 2.0 family in 3B/8B/30B; strong coding benchmarks, runs locally on Apple Silicon via MLX. (Hugging Face blog)
  • Qwen3.6-27B — Dense Apache 2.0 coding model with thinking modes; outperforms Qwen3.5-397B on SWE-bench and Terminal-Bench. (smol.ai news)
  • xAI Grok 4.3 — Improved cost-per-intelligence over Grok 4.20; 1M context, strong instruction following and tool calling. (TLDR AI)
  • Subquadratic 12M Context — New model with 12M-token context window outperforms GPT-5.5 on retrieval; 50M planned next. (TLDR AI)
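
To make the GPT-5.5 pricing trade-off from the list above concrete, here is a minimal sketch of the break-even arithmetic. It assumes the $5/$30 figure means input/output price per million tokens, and it infers GPT-5.4 prices as half of that from the "2×" claim rather than from a quoted price list. The function names and the token counts in the note below are hypothetical, not measurements.

```python
# Prices in USD per million tokens. GPT-5.5 figures are from the digest;
# GPT-5.4 figures are ASSUMED to be exactly half, based on the "2x" claim.
GPT_55 = {"input": 5.00, "output": 30.00}
GPT_54 = {"input": GPT_55["input"] / 2, "output": GPT_55["output"] / 2}


def task_cost(prices, input_tokens, output_tokens):
    """Cost in USD of one task at the given per-million-token prices."""
    spend = input_tokens * prices["input"] + output_tokens * prices["output"]
    return spend / 1_000_000


def breakeven_output_tokens(input_tokens, old_output_tokens):
    """Output tokens GPT-5.5 may use to match the GPT-5.4 cost of the same task.

    Assumes both models receive the same number of input tokens.
    Returns 0.0 when the doubled input price alone already exceeds the old cost.
    """
    old_spend = input_tokens * GPT_54["input"] + old_output_tokens * GPT_54["output"]
    new_input_spend = input_tokens * GPT_55["input"]
    return max(0.0, (old_spend - new_input_spend) / GPT_55["output"])
```

Under these assumptions, a purely output-bound task breaks even when GPT-5.5 uses half the output tokens. A task with 100,000 input tokens and 20,000 output tokens on GPT-5.4 would need GPT-5.5 to finish in roughly 1,667 output tokens to cost the same, because the doubled input price consumes most of the budget.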

Features & Tools

  • Claude Managed Agents — Dreaming, outcomes-based self-correction, multiagent orchestration, and filesystem-based persistent memory. (TLDR AI)
  • Codex CLI /goal Loop — Codex CLI 0.128.0 adds /goal — agent loops until goal met or token budget exhausted (Ralph loop pattern). (Simon Willison)
  • Responses API WebSockets — WebSocket mode with connection-scoped caching cuts Codex agent-loop latency up to 40%. (OpenAI blog)
  • Cursor 3.3 + Harness Blog — Continuous agent harness updates; vision-driven dev, A/B testing, dynamic context adaptation across models. (TLDR AI)
  • JetBrains Skill Manager — Install trusted skills once, reuse across agents and projects; new skill repository for discovery. (JetBrains AI blog)
  • Gemini API File Search (Multimodal RAG) — Multimodal support, custom metadata filtering, and page-level citations for verifiable RAG pipelines. (TLDR AI)
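
The /goal entry above describes an agent that loops until the goal is met or the token budget is exhausted. The following is a minimal sketch of that control flow only. It is not Codex CLI's implementation: `StepResult`, `run_until_goal`, and the `step` callback are hypothetical names, and the `max_steps` cap is an added safety assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class StepResult:
    tokens_used: int  # tokens consumed by this step
    goal_met: bool    # whether the goal check passed after this step


def run_until_goal(
    step: Callable[[int], StepResult],
    token_budget: int,
    max_steps: int = 100,
) -> Dict[str, object]:
    """Call `step` repeatedly until the goal is met or the budget runs out."""
    spent = 0
    for i in range(max_steps):
        # Stop before starting a new step once the budget is used up.
        if spent >= token_budget:
            return {"status": "budget_exhausted", "steps": i, "tokens": spent}
        result = step(i)
        spent += result.tokens_used
        if result.goal_met:
            return {"status": "goal_met", "steps": i + 1, "tokens": spent}
    # Hypothetical safety cap so a cheap but never-finishing step cannot spin forever.
    return {"status": "max_steps", "steps": max_steps, "tokens": spent}
```

In this sketch the budget is checked before each step, so the final step may overshoot the budget slightly. A stricter harness could estimate the cost of the next step first.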

Products

  • Codex Desktop Superapp — Codex macOS/Windows adds computer use, browsing, image gen, memory, plugins — 42% faster agent loops. (OpenAI blog)

Deals & Partnerships

  • Anthropic SpaceX Compute Deal — Access to Colossus (220K+ GPUs); Claude Code rate limits doubled for Pro/Max/Team/Enterprise immediately. (TLDR AI)
  • OpenAI on AWS (Bedrock) — GPT models, Codex, and Managed Agents now available on AWS Bedrock; ends Azure exclusivity. (OpenAI blog)

Other Releases

  • Symphony Orchestration Spec — Open-source spec turning issue trackers into always-on agent control planes; up to 5× PR throughput. (OpenAI blog)
  • OpenAI Agents SDK Update — Native sandbox execution, model-native harness, open-source; supports long-running durable agents across files/tools. (OpenAI blog)
  • OpenAI Privacy Filter — Open-weight 1.5B PII detection/redaction model; runs locally, context-aware, state-of-the-art accuracy. (OpenAI blog)
  • Gemma 4 MTP Drafters — Multi-token prediction drafters deliver up to 3× speculative decoding speedup with zero quality loss. (TLDR AI)
  • LangChain Open SWE — Open-source internal coding agent framework built on Deep Agents and LangGraph with model-specific profiles. (LangChain blog)
  • Vercel deepsec — Open-source security harness powered by coding agents; runs locally with Claude or Codex subscriptions. (Vercel blog)

Stories of the month

Harness engineering eclipses model selection

The month’s clearest signal: the harness that wraps the model now matters as much as the model itself. Cursor published data showing a harness change alone moved them from ‘Top 30 to Top 5’ on Terminal-Bench. OpenAI shipped WebSocket-mode caching (up to 40% faster agent loops) and Symphony orchestration. LangChain released model-specific profiles yielding 10–20 point jumps on τ²-bench. Anthropic’s Managed Agents added dreaming and self-correction. For teams running overnight agent factories, investing in harness tuning (tool schemas, memory rituals, retry logic) now delivers more ROI than waiting for the next model drop.

Open-weight models close the coding gap

DeepSeek V4 (1M context, 75% price cut), Kimi K2.6 (open 1T MoE), Qwen3.6-27B (outperforms the much larger Qwen3.5-397B on SWE-bench and Terminal-Bench), and Granite 4.1 (Apache 2.0, MLX-friendly) all landed within weeks of each other. LangChain’s evals confirmed that open models now match closed frontier models on core agent tasks (file ops, tool use, instruction following) at a fraction of the cost. For teams routing via LiteLLM, the practical implication is that local-plus-cloud hybrid workflows are no longer a compromise; they’re a cost-optimisation strategy with minimal quality loss on well-scoped tasks.
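
One way to picture the local-plus-cloud hybrid described above is a small routing rule that sends well-scoped tasks to a local open-weight model and everything else to a cloud frontier model. This is a sketch under stated assumptions: the tier labels, the context limit, and the `Task` fields are all hypothetical, and it deliberately makes no LiteLLM calls, so the returned label would still need to be mapped to a real deployment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    well_scoped: bool    # clear inputs, outputs, and acceptance check
    context_tokens: int  # estimated prompt size


# Hypothetical tier labels and limit; substitute your own deployment names.
LOCAL_TIER = "local-open-weight"
CLOUD_TIER = "cloud-frontier"
LOCAL_CONTEXT_LIMIT = 128_000


def choose_tier(task: Task) -> str:
    """Route well-scoped tasks that fit the local context window to the local tier."""
    if task.well_scoped and task.context_tokens <= LOCAL_CONTEXT_LIMIT:
        return LOCAL_TIER
    return CLOUD_TIER
```

The rule is intentionally conservative: anything open-ended or too large for the assumed local window falls through to the cloud tier, which matches the claim that quality loss stays minimal only on well-scoped tasks.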

Skills and memory become first-class agent infrastructure

The discipline layer above vibe coding got real infrastructure this month. JetBrains shipped a Skill Manager and repository for cross-agent reuse. OpenAI’s Codex added plugins/skills with automations and config imports from other agents. Anthropic’s Managed Agents gained filesystem-based persistent memory that compounds across sessions. An in-depth post on SKILL.md internals explained how runtime execution shapes what you write at the surface. For CTOs managing agent-augmented teams, skills are becoming the unit of verifiable, shareable capability — the progressive disclosure layer that makes overnight agent work auditable.

Compute economics reshape the agent stack

Anthropic’s SpaceX deal, OpenAI’s AWS/Bedrock expansion, Cursor’s rumoured xAI acquisition ($60B), and DeepSeek’s 75% price cut all point to the same thing: compute access is the new moat. B200 spot prices surged 114% in six weeks. Pragmatic Engineer reported AI token spending ‘out of control’ across 15 companies. Cursor’s negative gross margins revealed the inverted SaaS economics of agent-heavy products. For engineering leaders, the implication is clear: routing intelligence (LiteLLM, Batch API for fleets, KV cache locality) is now a cost-centre discipline, not a nice-to-have.

Eval and observability hit production scale

As agent output volume grows, verification becomes the bottleneck. Hugging Face published data showing eval costs now rival training costs. LangSmith shipped OpenTelemetry support, test-run comparisons, and regression testing. Vercel open-sourced deepsec for agent-powered security scanning. JetBrains demonstrated that IDE-native search tools made agents faster and cheaper. Harvey released an open Legal Agent Benchmark. The pattern: teams are building eval into the agent loop itself rather than treating it as a post-hoc step — exactly the discipline needed to manage leaf-node risk in 22,000-line PRs.
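
Building eval into the agent loop, as described above, can be sketched as a generate-check-retry cycle in which failed check names are fed back into the next attempt. Everything here is hypothetical scaffolding: `run_with_evals`, the `generate` callback, and the check tuples are illustrative names, not the API of LangSmith, deepsec, or any other tool mentioned.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# (check name, predicate on the candidate output)
Check = Tuple[str, Callable[[str], bool]]


def run_with_evals(
    generate: Callable[[List[str]], str],
    checks: Sequence[Check],
    max_attempts: int = 3,
) -> Dict[str, object]:
    """Generate output, run every check, and retry with the failed check names."""
    failed: List[str] = []
    for attempt in range(1, max_attempts + 1):
        # The generator sees which checks failed last time (empty on the first try).
        candidate = generate(failed)
        failed = [name for name, check in checks if not check(candidate)]
        if not failed:
            return {"accepted": True, "attempts": attempt, "output": candidate}
    # Nothing passed within the attempt limit; report what still fails.
    return {"accepted": False, "attempts": max_attempts, "failed_checks": failed}
```

In practice the checks would be things like a test run, a linter, or a security scan, and a rejected result would be escalated to a human rather than merged.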

What I’m watching into next month

nexu-io/open-design

33.2k★ · TypeScript · agent-skills ai-agents ai-design byok claude · Local-first, open-source alternative to Anthropic’s Claude Design. 19 Skills and 71 brand-grade Design Systems. Generates web, desktop, and mobile prototypes, slides, images, videos, and HyperFrames. Sandboxed preview with HTML/PDF/PPTX/MP4 export. Runs on Claude Code / Codex / Cursor / Gemini / OpenCode / Qwen / Copilot / Hermes / Kimi CLI.

EvoLinkAI/awesome-gpt-image-2-API-and-Prompts

13.3k★ · Python · api awesome awesome-list chatgpt generative- · GPT-Image-2 API and Prompts

alchaincyf/huashu-design

12.6k★ · HTML · Huashu Design · HTML-native design skill for Claude Code · High-fidelity prototypes / slides / animations + 20 design philosophies + 5-dimension review + MP4 export · Agent-agnostic

kyegomez/OpenMythos

12.3k★ · Python · ai anthropic attention claude claude-ai · A theoretical reconstruction of the Claude Mythos architecture, built from first principles using the available research literature.

google-labs-code/design.md

12.1k★ · TypeScript · A format specification for describing a visual identity to coding agents. DESIGN.md gives agents a persistent, structured understanding of a design system.

browser-use/browser-harness

11.5k★ · Python · Browser Harness: a self-healing harness that enables LLMs to complete any task.

h4ckf0r0day/obscura

11.1k★ · Rust · The headless browser for AI agents and web scraping.

browser-use/video-use

6.9k★ · Python · Edit videos with coding agents.

nashsu/llm_wiki

6.4k★ · TypeScript · LLM Wiki is a cross-platform desktop application that automatically turns your documents into an organized, interlinked knowledge base. Instead of traditional RAG (retrieve-and-answer from scratch every time), the LLM incrementally builds and maintains a persistent wiki from your sources.

Robbyant/lingbot-map

5.9k★ · Python · A feed-forward 3D foundation model for reconstructing scenes from streaming data.

Read this month

Vibe coding and agentic engineering are getting closer than I’d like

Simon Willison articulates the uncomfortable convergence you’ve been writing about — the boundary between casual AI-assisted coding and serious agentic engineering is dissolving in practice, raising exactly the verification and governance questions your leaf-nodes/overnight-factory framing addresses. Essential context heading into AI Engineer World’s Fair.

Quote of the month

People who come from the world of agentic coding have a certain digital smell that is not obvious to them but is obvious to everyone else.

Andrew Kelley (Zig project lead)


Auto-curated monthly by Claude Opus 4.7 from AI Tidbits (Sahar Mor), Apple ML research, Ben’s Bites, Cursor changelog, Don’t Worry About the Vase (Zvi), Eric Jang, Eugene Yan, Every — Chain of Thought (Dan Shipper), Exponential View (Azeem Azhar), Google DeepMind blog, Hacker News (AI), Hugging Face blog, Import AI (Jack Clark), Interconnects (Nathan Lambert), JetBrains AI blog, LangChain blog, Last Week in AI, Latent Space, Lenny’s Newsletter, NVIDIA developer blog, One Useful Thing (Ethan Mollick), OpenAI blog, Sebastian Raschka, Simon Willison, Sourcegraph blog, TLDR AI, The Algorithmic Bridge (Alberto Romero), The Pragmatic Engineer (Gergely Orosz), Together AI blog, Understanding AI (Timothy B. Lee), Vercel blog, smol.ai news, swyx.io. Source list and editorial profile maintained by Daniel.