GPT-5.5 and Codex Superapp, Claude Code w/ SpaceX Compute, DeepSeek V4 Million-Token Context

April was the month the coding-agent war went full-stack. OpenAI shipped GPT-5.5 (2× the price of 5.4, but fewer output tokens per task) alongside a radically expanded Codex desktop app — now a general-purpose agent surface with computer use, plugins, skills, memory, and Symphony orchestration turning issue trackers into agent control planes. Anthropic countered with Opus 4.7 (SWE-bench Verified 87.6%, new tokenizer, xhigh reasoning tier), then doubled Claude Code rate limits overnight via a SpaceX/Colossus compute deal, shipped Managed Agents with dreaming and multiagent orchestration, and launched Bugcrawl for repo-wide vulnerability scanning. Meanwhile DeepSeek V4 dropped as a 1.6T MoE with 1M-token context and 75% price cuts, and Moonshot’s Kimi K2.6 (1T MoE, 32B active, 256K context) arrived as the new open-weight coding leader. For builders, the practical upshot: frontier agentic performance is now available at three price tiers, harness engineering matters as much as model choice, and overnight agent factories just got meaningfully cheaper to run.

Launches & releases this month

Models

GPT-5.5 — OpenAI’s new flagship: $5 input / $30 output per M tokens (2× GPT-5.4 pricing), with improved agentic reasoning and tool use. (OpenAI blog)
GPT-5.5 Instant — New default ChatGPT/API model with reduced hallucinations, stronger factuality, and personalisation controls. (OpenAI blog)
Claude Opus 4.7 — Anthropic’s strongest Opus: SWE-bench Verified 87.6%, new tokenizer, xhigh reasoning tier, 3.75MP image input. (smol.ai news)
DeepSeek V4 Pro & Flash — 1.6T MoE, 49B active params, 1M-token context, hybrid attention; 75% price cut within days of launch. (Hugging Face blog)
Moonshot Kimi K2.6 — 1T-param open-weight MoE with 32B active, 384 experts, 256K context, INT4 quant; SWE-Bench Pro 58.6%. (smol.ai news)
Mistral Medium 3.5 + Vibe Agents — 128B dense model powering remote async coding agents from CLI or Le Chat; runs on 4 GPUs. (TLDR AI)
Granite 4.1 (IBM) — Apache 2.0 family in 3B/8B/30B; strong coding benchmarks, runs locally on Apple Silicon via MLX. (Hugging Face blog)
Qwen3.6-27B — Dense Apache 2.0 coding model with thinking modes; outperforms Qwen3.5-397B on SWE-bench and Terminal-Bench. (smol.ai news)
xAI Grok 4.3 — Improved cost-per-intelligence over Grok 4.20; 1M context, strong instruction following and tool calling. (TLDR AI)
Subquadratic 12M Context — New model with 12M-token context window outperforms GPT-5.5 on retrieval; 50M planned next. (TLDR AI)

Features & Tools

Claude Managed Agents — Dreaming, outcomes-based self-correction, multiagent orchestration, and filesystem-based persistent memory. (TLDR AI)
Codex CLI /goal Loop — Codex CLI 0.128.0 adds /goal — agent loops until goal met or token budget exhausted (Ralph loop pattern). (Simon Willison)
Responses API WebSockets — WebSocket mode with connection-scoped caching cuts Codex agent-loop latency by up to 40%. (OpenAI blog)
Cursor 3.3 + Harness Blog — Continuous agent harness updates; vision-driven dev, A/B testing, dynamic context adaptation across models. (TLDR AI)
JetBrains Skill Manager — Install trusted skills once, reuse across agents and projects; new skill repository for discovery. (JetBrains AI blog)
Gemini API File Search (Multimodal RAG) — Multimodal support, custom metadata filtering, and page-level citations for verifiable RAG pipelines. (TLDR AI)
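
The /goal entry above describes a loop that runs until the goal is met or the token budget is exhausted. A minimal sketch of that control flow follows. It is not Codex CLI's implementation: `run_goal_loop`, `LoopResult`, and `fake_step` are hypothetical names, and `step` stands in for one agent turn that reports whether the goal check passed and how many tokens it spent.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class LoopResult:
    done: bool         # True if the goal check passed
    iterations: int    # number of agent turns taken
    tokens_used: int   # total tokens spent across all turns


def run_goal_loop(
    step: Callable[[int], Tuple[bool, int]],
    token_budget: int,
) -> LoopResult:
    """Call `step` until it reports the goal is met or the budget is spent.

    `step(i)` is a hypothetical stand-in for one agent turn and returns
    (goal_met, tokens_spent_this_turn).
    """
    tokens_used = 0
    iterations = 0
    while tokens_used < token_budget:
        goal_met, spent = step(iterations)
        iterations += 1
        tokens_used += spent
        if goal_met:
            return LoopResult(True, iterations, tokens_used)
    # Budget exhausted before the goal check passed.
    return LoopResult(False, iterations, tokens_used)


def fake_step(i: int) -> Tuple[bool, int]:
    # Pretend each turn costs 1,000 tokens and the goal is met on the third turn.
    return (i == 2, 1_000)


result = run_goal_loop(fake_step, token_budget=10_000)
print(result)  # LoopResult(done=True, iterations=3, tokens_used=3000)
```

The budget is checked before each turn, so a single expensive turn can overshoot it; a stricter harness would also cap tokens per turn.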

Products

Codex Desktop Superapp — Codex macOS/Windows adds computer use, browsing, image gen, memory, plugins — 42% faster agent loops. (OpenAI blog)

Deals & Partnerships

Anthropic SpaceX Compute Deal — Access to Colossus (220K+ GPUs); Claude Code rate limits doubled for Pro/Max/Team/Enterprise immediately. (TLDR AI)
OpenAI on AWS (Bedrock) — GPT models, Codex, and Managed Agents now available on AWS Bedrock; ends Azure exclusivity. (OpenAI blog)

Other Releases

Symphony Orchestration Spec — Open-source spec turning issue trackers into always-on agent control planes; up to 5× PR throughput. (OpenAI blog)
OpenAI Agents SDK Update — Native sandbox execution, model-native harness, open-source; supports long-running durable agents across files/tools. (OpenAI blog)
OpenAI Privacy Filter — Open-weight 1.5B PII detection/redaction model; runs locally, context-aware, state-of-the-art accuracy. (OpenAI blog)
Gemma 4 MTP Drafters — Multi-token prediction drafters deliver up to 3× speculative decoding speedup with zero quality loss. (TLDR AI)
LangChain Open SWE — Open-source internal coding agent framework built on Deep Agents and LangGraph with model-specific profiles. (LangChain blog)
Vercel deepsec — Open-source security harness powered by coding agents; runs locally with Claude or Codex subscriptions. (Vercel blog)

Stories of the month

Harness engineering eclipses model selection

The month’s clearest signal: which harness wraps the model now matters as much as which model you pick. Cursor published data showing a harness change alone moved them from ‘Top 30 to Top 5’ on Terminal-Bench. OpenAI shipped WebSocket-mode caching (agent-loop latency down by up to 40%) and Symphony orchestration. LangChain released model-specific profiles yielding 10–20 point jumps on τ²-bench. Anthropic’s Managed Agents added dreaming and self-correction. For teams running overnight agent factories, investing in harness tuning (tool schemas, memory rituals, retry logic) now delivers more ROI than waiting for the next model drop.

Cursor: Continually improving our agent harness — Vision-driven dev and A/B testing across models; harness change alone moved ranking dramatically. (TLDR AI)
Speeding up agentic workflows with WebSockets — Connection-scoped caching in the Responses API cuts Codex agent-loop latency by up to 40%. (OpenAI blog)
Tuning Deep Agents to Work Well with Different Models — Model-specific profiles in Deep Agents yield 10–20 point benchmark jumps. (LangChain blog)
Symphony: open-source orchestration spec — Turns issue trackers into agent control planes; up to 5× PR throughput. (OpenAI blog)
Model-Harness-Fit analysis — Labs post-train models against specific harnesses; tool schemas baked into weights. (TLDR AI)

Open-weight models close the coding gap

DeepSeek V4 (1M context, 75% price cut), Kimi K2.6 (open 1T MoE), Qwen3.6-27B (outperforms its own 397B sibling), and Granite 4.1 (Apache 2.0, MLX-friendly) all landed within weeks. LangChain’s evals confirmed open models now match closed frontier on core agent tasks — file ops, tool use, instruction following — at a fraction of cost. For teams routing via LiteLLM, the practical implication is that local-plus-cloud hybrid workflows are no longer a compromise; they’re a cost-optimisation strategy with minimal quality loss on well-scoped tasks.

DeepSeek-V4: a million-token context that agents can actually use — 1.6T MoE with 1M context; hybrid attention and compressed KV for major memory savings. (Hugging Face blog)
Kimi K2.6 on AI Gateway — Open-weight 1T MoE; 32B active params, 256K context, INT4 quant, SWE-Bench Pro 58.6%. (Vercel blog)
Open Models have crossed a threshold — GLM-5 and MiniMax M2.7 match closed frontier on core agent tasks at lower cost. (LangChain blog)
Granite 4.1 LLMs: How They’re Built — Apache 2.0, 3B/8B/30B sizes; runs on Apple Silicon via MLX. (Hugging Face blog)
Qwen3.6-27B release — Dense 27B coding model outperforms 397B sibling on SWE-bench; Apache 2.0. (smol.ai news)
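
The local-plus-cloud routing idea above can be sketched as a simple scope check. This is an illustration only: the model identifiers, the `Task` fields, and the five-file threshold are all assumptions, and no LiteLLM API is called here. In practice the returned string would be passed to whichever gateway you use.

```python
from dataclasses import dataclass

# Hypothetical model identifiers; substitute whatever your gateway exposes.
LOCAL_MODEL = "local/open-weight-coder"
CLOUD_MODEL = "cloud/frontier-model"


@dataclass
class Task:
    description: str
    files_touched: int        # rough proxy for how well-scoped the task is
    needs_long_context: bool  # e.g. repo-wide refactors


def pick_model(task: Task, max_local_files: int = 5) -> str:
    """Route well-scoped tasks to the local model and everything else to the cloud."""
    well_scoped = (
        task.files_touched <= max_local_files and not task.needs_long_context
    )
    return LOCAL_MODEL if well_scoped else CLOUD_MODEL


tasks = [
    Task("rename a helper function", files_touched=2, needs_long_context=False),
    Task("migrate the ORM layer", files_touched=40, needs_long_context=True),
]
routes = [pick_model(t) for t in tasks]
print(routes)  # ['local/open-weight-coder', 'cloud/frontier-model']
```

A static rule like this is the cheapest starting point; teams with eval data could replace the threshold with measured pass rates per task type.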

Skills and memory become first-class agent infrastructure

The discipline layer above vibe coding got real infrastructure this month. JetBrains shipped a Skill Manager and repository for cross-agent reuse. OpenAI’s Codex added plugins/skills with automations and config imports from other agents. Anthropic’s Managed Agents gained filesystem-based persistent memory that compounds across sessions. An in-depth post on SKILL.md internals explained how runtime execution shapes what you write at the surface. For CTOs managing agent-augmented teams, skills are becoming the unit of verifiable, shareable capability — the progressive disclosure layer that makes overnight agent work auditable.

JetBrains Skill Manager and Skill Repository — Install trusted skills once; reuse across agents and projects with discovery layer. (JetBrains AI blog)
Codex Plugins and Skills — Connect tools, access data, follow repeatable workflows to automate tasks. (OpenAI blog)
Anthropic launches Memory in Claude Agents — Filesystem-based memory layer; agents accumulate knowledge across sessions. (TLDR AI)
What you’re actually writing when you write a SKILL.md — Understanding the runtime changes everything you do at the surface. (TLDR AI)
Evaluating Skills — Best practices for defining tasks, measuring performance, and iterating with LangSmith. (LangChain blog)
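
To make "filesystem-based memory that accumulates across sessions" concrete, here is a minimal sketch of the general pattern. It is not Anthropic's format or API: the `FileMemory` class, the JSON-lines layout, and the `topic`/`note` fields are assumptions chosen for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import List


class FileMemory:
    """Append-only notes stored as JSON lines (a hypothetical layout)."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def remember(self, topic: str, note: str) -> None:
        # Appending keeps earlier sessions' notes intact.
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps({"topic": topic, "note": note}) + "\n")

    def recall(self, topic: str) -> List[str]:
        if not self.path.exists():
            return []
        notes = []
        with self.path.open(encoding="utf-8") as f:
            for line in f:
                entry = json.loads(line)
                if entry["topic"] == topic:
                    notes.append(entry["note"])
        return notes


memory_file = Path(tempfile.mkdtemp()) / "memory.jsonl"

# Session 1 writes a note...
FileMemory(memory_file).remember("build", "run tests with -x to fail fast")
# ...and a later session, using a fresh object, reads it back from disk.
recalled = FileMemory(memory_file).recall("build")
print(recalled)  # ['run tests with -x to fail fast']
```

Because the store is a plain file, it can be diffed and reviewed like any other artifact, which is the auditability property the section describes.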

Compute economics reshape the agent stack

Anthropic’s SpaceX deal, OpenAI’s AWS/Bedrock expansion, Cursor’s rumoured xAI acquisition ($60B), and DeepSeek’s 75% price cut all point to the same thing: compute access is the new moat. B200 spot prices surged 114% in six weeks. Pragmatic Engineer reported AI token spending ‘out of control’ across 15 companies. Cursor’s negative gross margins revealed the inverted SaaS economics of agent-heavy products. For engineering leaders, the implication is clear: routing intelligence (LiteLLM, Batch API for fleets, KV cache locality) is now a cost-centre discipline, not a nice-to-have.

Anthropic SpaceX compute deal — 220K+ GPUs via Colossus; Claude Code limits doubled immediately. (TLDR AI)
GPU Spot Prices Surge 114% — B200 rental hit $4.95/hr driven by GPT-5.5 and DeepSeek V4 demand. (TLDR AI)
GPT-5.5 Price Increase: What It Actually Costs — The 2× list-price rise works out to a 49–92% effective increase because tasks use fewer output tokens. (TLDR AI)
Cursor’s $60B Escape Hatch — Negative 23% gross margins; best customers are most expensive to serve. (TLDR AI)
Batch API: terrible for one agent, great for a fleet — 50% discount viable when pooling requests across parallel agents. (TLDR AI)
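
The pricing arithmetic behind the entries above can be worked through with a small calculator. The prices come from this digest ($5/$30 per M tokens for GPT-5.5, read here as input/output, and half that for GPT-5.4); the workload numbers (200k input tokens, 40k output tokens falling to 30k) are hypothetical, as is the `task_cost` helper.

```python
def task_cost(
    input_tokens: int,
    output_tokens: int,
    in_price: float,
    out_price: float,
    discount: float = 0.0,
) -> float:
    """Cost in USD for one task; prices are per million tokens."""
    cost = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    return cost * (1.0 - discount)


# List prices per million tokens, as quoted in this digest.
NEW_IN, NEW_OUT = 5.00, 30.00
OLD_IN, OLD_OUT = NEW_IN / 2, NEW_OUT / 2  # "2x GPT-5.4 pricing"

# Hypothetical task: same input, but the newer model emits 25% fewer output tokens.
old = task_cost(200_000, 40_000, OLD_IN, OLD_OUT)
new = task_cost(200_000, 30_000, NEW_IN, NEW_OUT)
increase_pct = (new / old - 1) * 100

# The same task at the 50% batch discount mentioned above.
batched = task_cost(200_000, 30_000, NEW_IN, NEW_OUT, discount=0.5)

print(f"old ${old:.2f}, new ${new:.2f}, increase {increase_pct:.1f}%")
# old $1.10, new $1.90, increase 72.7%
print(f"batched ${batched:.2f}")  # batched $0.95
```

Under these assumed numbers the effective increase lands inside the 49–92% range reported above; the exact figure depends on each workload's input/output mix.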

Eval and observability hit production scale

As agent output volume grows, verification becomes the bottleneck. Hugging Face published data showing eval costs now rival training costs. LangSmith shipped OpenTelemetry support, test-run comparisons, and regression testing. Vercel open-sourced deepsec for agent-powered security scanning. JetBrains demonstrated that IDE-native search tools made agents faster and cheaper. Harvey released an open Legal Agent Benchmark. The pattern: teams are building eval into the agent loop itself rather than treating it as a post-hoc step — exactly the discipline needed to manage leaf-node risk in 22,000-line PRs.

AI evals are becoming the new compute bottleneck — Eval costs now rival or exceed training costs; some runs cost tens of thousands. (TLDR AI)
Agent Observability: Monitor and Evaluate in Production — Tracing, evaluation, and improvement of AI agents at scale with LangSmith. (LangChain blog)
Vercel deepsec: security harness for codebases — Open-source agent-powered vulnerability scanner; runs locally with existing subscriptions. (Vercel blog)
IDE-Native Search Tools made agents faster and cheaper — Prebundled tooling reduced latency, cost, and budget overruns across models. (JetBrains AI blog)
Harvey’s Legal Agent Benchmark — Open-source benchmark for assessing AI agents on legal tasks. (TLDR AI)

What I’m watching into next month

Anthropic prior restraint / Mythos access — White House ordered Anthropic not to expand Mythos access; a full prior-restraint regime for frontier models is under serious discussion.
- The AI Ad-Hoc Prior Restraint Era Begins (Don’t Worry About the Vase (Zvi))
- Why Anthropic believes its latest model is too dangerous to release (Understanding AI (Timothy B. Lee))
Cursor/xAI acquisition and IDE economics — If the $60B deal closes, it reshapes which models power the dominant coding IDE and what happens to Anthropic’s biggest distribution channel.
- Cursor’s $60 Billion Escape Hatch (TLDR AI)
- Cursor’s war chest, xAI’s redemption (TLDR AI)
EU AI Act developer obligations (Aug 2026) — LangChain published compliance mapping; deadline is 2 Aug 2026 — teams deploying agents in regulated industries need to start now.
- How LangSmith helps meet EU AI Act requirements (LangChain blog)
Vibe coding converging with agentic engineering — Simon Willison flagged that the boundary between casual vibe coding and serious agentic engineering is dissolving — implications for team governance and verification.
- Vibe coding and agentic engineering are getting closer than I’d like (Simon Willison)
- Tokenmaxxing as a weird new trend (The Pragmatic Engineer (Gergely Orosz))

nexu-io/open-design

33.2k★ · TypeScript · agent-skills ai-agents ai-design byok claude 🎨 Local-first, open-source alternative to Anthropic’s Claude Design. ⚡ 19 Skills · ✨ 71 brand-grade Design Systems 🖼 Generate web · desktop · mobile prototypes · slides · images · videos · HyperFrames 📦 Sandboxed preview · HTML/PDF/PPTX/MP4 export 🤖 Runs on Claude Code / Codex / Cursor / Gemini / OpenCode / Qwen / Copilot / Hermes / Kimi CLI.

EvoLinkAI/awesome-gpt-image-2-API-and-Prompts

13.3k★ · Python · api awesome awesome-list chatgpt generative- GPT-Image-2 API and Prompts

alchaincyf/huashu-design

12.6k★ · HTML Huashu Design · HTML-native design skill for Claude Code · High-fidelity prototypes / slides / animations + 20 design philosophies + 5-dimension review + MP4 export · Agent-agnostic

kyegomez/OpenMythos

12.3k★ · Python · ai anthropic attention claude claude-ai A theoretical reconstruction of the Claude Mythos architecture, built from first principles using the available research literature.

google-labs-code/design.md

12.1k★ · TypeScript A format specification for describing a visual identity to coding agents. DESIGN.md gives agents a persistent, structured understanding of a design system.

browser-use/browser-harness

11.5k★ · Python Browser Harness | Self-healing harness that enables LLMs to complete any task.

h4ckf0r0day/obscura

11.1k★ · Rust The headless browser for AI agents and web scraping

browser-use/video-use

6.9k★ · Python Edit videos with coding agents

nashsu/llm_wiki

6.4k★ · TypeScript LLM Wiki is a cross-platform desktop application that turns your documents into an organized, interlinked knowledge base — automatically. Instead of traditional RAG (retrieve-and-answer from scratch every time), the LLM incrementally builds and maintains a persistent wiki from your sources.

Robbyant/lingbot-map

5.9k★ · Python A feed-forward 3D foundation model for reconstructing scenes from streaming data

Read this month

Vibe coding and agentic engineering are getting closer than I’d like

Simon Willison articulates the uncomfortable convergence you’ve been writing about — the boundary between casual AI-assisted coding and serious agentic engineering is dissolving in practice, raising exactly the verification and governance questions your leaf-nodes/overnight-factory framing addresses. Essential context heading into AI Engineer World’s Fair.

Quote of the month

People who come from the world of agentic coding have a certain digital smell that is not obvious to them but is obvious to everyone else.

— Andrew Kelley (Zig project lead)

Auto-curated monthly by Claude Opus 4.7 from AI Tidbits (Sahar Mor), Apple ML research, Ben’s Bites, Cursor changelog, Don’t Worry About the Vase (Zvi), Eric Jang, Eugene Yan, Every — Chain of Thought (Dan Shipper), Exponential View (Azeem Azhar), Google DeepMind blog, Hacker News (AI), Hugging Face blog, Import AI (Jack Clark), Interconnects (Nathan Lambert), JetBrains AI blog, LangChain blog, Last Week in AI, Latent Space, Lenny’s Newsletter, NVIDIA developer blog, One Useful Thing (Ethan Mollick), OpenAI blog, Sebastian Raschka, Simon Willison, Sourcegraph blog, TLDR AI, The Algorithmic Bridge (Alberto Romero), The Pragmatic Engineer (Gergely Orosz), Together AI blog, Understanding AI (Timothy B. Lee), Vercel blog, smol.ai news, swyx.io. Source list and editorial profile maintained by Daniel.