Running Gemma 4 Locally: Frontier-Class, Zero Cost
Google's Gemma 4 26B runs locally on Apple Silicon, scoring within single digits of GPT-5.2 and Claude Opus 4.5 on reasoning benchmarks — reshaping the local-plus-cloud AI coding workflow.
Two days ago, Google DeepMind released Gemma 4, and I’ve been running it locally on my MacBook ever since. The short version: an open-weight model under Apache 2.0, running entirely offline on consumer hardware, is now competitive with the commercial models I pay monthly subscriptions for. That’s a sentence I didn’t expect to write in 2026.
What Gemma 4 actually is
Gemma 4 is a family of four models (E2B, E4B, 26B A4B, and 31B Dense) built from the same research stack as Google's proprietary Gemini 3. The one I'm running is the 26B A4B, a Mixture-of-Experts model with 25.2 billion total parameters but only 3.8 billion active during inference. In practice, that means it runs at roughly the speed of a 4B model while delivering intelligence in the 27B–31B class.
The architecture uses 128 small experts, activating eight per token plus one shared always-on expert. This isn’t just a benchmark curiosity — it directly translates to lower memory pressure and faster token generation on unified memory hardware like Apple Silicon.
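The routing step is easy to picture in code. Here is a toy sketch of top-k expert selection using the 128-expert, 8-active-plus-one-shared configuration described above; the router logits are random and nothing else here reflects Gemma 4's actual implementation.

```python
import math
import random

NUM_EXPERTS = 128    # small experts, per the reported architecture
TOP_K = 8            # experts activated per token
SHARED_EXPERT = -1   # the always-on shared expert, given a sentinel id

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(logits):
    """Pick the top-k experts by router probability, plus the shared expert."""
    probs = softmax(logits)
    ranked = sorted(range(NUM_EXPERTS), key=lambda i: probs[i], reverse=True)
    return [SHARED_EXPERT] + ranked[:TOP_K]

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]
active = route_token(logits)
print(len(active))  # 9 experts touch this token; the other 120 stay idle
```

Only 9 of 129 expert blocks do any work per token, which is why the memory bandwidth and latency profile looks like a much smaller model.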
On my MacBook Pro M5 Max with 128 GB unified memory, the Q8_0 quantized version (around 27 GB) loads comfortably and leaves plenty of headroom for running an IDE, browser, and other tools simultaneously. Even the unquantized 31B Dense would fit, though the MoE variant is the smarter choice for interactive coding where latency matters.
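That 27 GB figure is easy to sanity-check with back-of-envelope arithmetic. The bits-per-weight values below are rough approximations for llama.cpp-style quant formats (Q8_0 is 8-bit weights plus per-block scale factors), not exact on-disk sizes:

```python
BITS_PER_WEIGHT = {   # approximate effective sizes, llama.cpp-style formats
    "F16": 16.0,
    "Q8_0": 8.5,      # 8-bit weights plus per-block scale overhead
    "Q4_K_M": 4.8,    # rough average across layer types
}

def est_gb(params_billion, quant):
    """Estimated model file size in GB for a given quantization format."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billion * 1e9 * bits / 8 / 1e9

print(f"{est_gb(25.2, 'Q8_0'):.1f} GB")  # ~26.8 GB, matching the ~27 GB file
```

The same arithmetic says a Q4_K_M build would land around 15 GB, which is why even 36 GB or 48 GB machines can run this class of model with room to spare.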
The benchmarks tell a clear story
Here’s where things get interesting. I compared Gemma 4’s reported scores against the two commercial models I use daily — Claude Opus 4.5 and GPT-5.2 — plus Gemma 3 to show the generational leap:
| Benchmark | Gemma 4 26B A4B | Gemma 4 31B | Gemma 3 27B | Claude Opus 4.5 | GPT-5.2 |
|---|---|---|---|---|---|
| MMLU Pro | 82.6% | 85.2% | 67.6% | 89.5% | 75.4% |
| GPQA Diamond | 82.3% | 84.3% | 42.4% | 87.0% | 92.4% |
| AIME 2025/26 | 88.3% | 89.2% | 20.8% | ~87% | 100% |
| LiveCodeBench v6 | 77.1% | 80.0% | 29.1% | — | — |
| Codeforces Elo | 1718 | 2150 | 110 | — | — |
| SWE-bench Verified | — | — | — | 80.9% | 55.6% |
| MMMU Pro (Vision) | 73.8% | 76.9% | 49.7% | — | — |
| BigBench Extra Hard | 64.8% | 74.4% | 19.3% | — | — |
| Tau2 Agentic (avg) | 68.2% | 76.9% | 16.2% | — | — |
A few things jump out. Gemma 4 26B A4B scores 82.3% on GPQA Diamond (graduate-level science reasoning) compared to 87% for Claude Opus 4.5 and 92.4% for GPT-5.2. That's not parity, but the gap is now measured in percentage points rather than capability tiers. On AIME math competition problems, it hits 88.3% versus GPT-5.2's perfect 100%. On LiveCodeBench and Codeforces, direct comparisons aren't possible for a simpler reason: Anthropic and OpenAI don't report results on those benchmarks.
The jump from Gemma 3 is staggering. AIME went from 20.8% to 88.3%. LiveCodeBench went from 29.1% to 77.1%. GPQA nearly doubled. Codeforces Elo went from 110 (barely functional) to 1718 (expert level). The thinking mode, where the model reasons step by step before responding, is the primary driver of these gains.
Important caveats: benchmark versions differ across providers (AIME 2025 vs 2026), not all models report on the same benchmarks, and self-reported scores should always be taken with some skepticism.
My coding workflow: Local Gemma 4 + Claude Code + Codex
The real value of a strong local model isn’t replacing cloud-based AI — it’s creating a hybrid workflow where you use the right model for the right task.
Here’s how I’ve been working:
Gemma 4 via Ollama handles the high-frequency, low-stakes work: quick code completions, boilerplate generation, refactoring suggestions, explaining unfamiliar code, writing tests for well-defined functions, generating documentation. This is the stuff that happens dozens of times per hour. Running it locally means no API round trips, no token costs, no rate limits, and my code never leaves my machine. For a CPTO at an identity verification company, that last point matters.
Claude Code handles the complex, multi-step engineering tasks. Architecture decisions, debugging subtle issues across multiple files, working through git worktrees in parallel, building features that require understanding the full codebase context. Claude Code’s agentic capabilities — running commands, editing files, managing workflows — are still ahead of what a local model can do reliably. The 200K context window and the quality of long-form reasoning justify the subscription for this tier of work.
OpenAI’s Codex fills the gap for rapid prototyping and throwaway scripts. When I need a quick utility, a data transformation pipeline, or a one-off automation script, Codex is fast and good enough.
The practical setup is straightforward. Ollama runs Gemma 4 locally, exposed as an API endpoint. Claude Code connects to Anthropic’s API for the heavy lifting. The mental model is simple: if I’d feel comfortable delegating the task to a competent junior developer, Gemma 4 handles it. If it requires senior-level judgment, I route to Claude Code.
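The "junior developer vs. senior judgment" rule can even be made mechanical. A minimal sketch of that dispatcher, assuming a crude keyword classifier (a real setup might ask the local model itself to triage); the task categories and endpoint labels are illustrative, not a tested taxonomy:

```python
# Naive task router: local model for junior-level work, cloud for the rest.
# The category sets below are illustrative assumptions, not a real classifier.
LOCAL_TASKS = {"completion", "boilerplate", "tests", "docs", "refactor", "explain"}
CLOUD_TASKS = {"architecture", "multi-file-debug", "feature", "migration"}

def route(task_kind: str) -> str:
    if task_kind in LOCAL_TASKS:
        return "ollama"       # local Gemma 4: free, private, no rate limits
    if task_kind in CLOUD_TASKS:
        return "claude-code"  # cloud tier for senior-level judgment
    return "claude-code"      # when in doubt, escalate to the stronger model

print(route("tests"))         # ollama
print(route("architecture"))  # claude-code
```

The useful property is the default: ambiguous tasks escalate to the cloud tier, so the local model only ever absorbs work you were confident delegating anyway.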
With Gemma 4’s native function-calling support and 256K context window, the local tier has gotten meaningfully more capable. I can pass entire files or even small repositories into context. The model handles structured JSON output reliably, which is essential for any tooling integration.
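Ollama's `/api/chat` endpoint accepts a JSON schema in its `format` field and constrains decoding to match it, which is how I get reliable structured output. A sketch of that request, using only the standard library; the model tag is hypothetical (check `ollama list` for the actual name on your machine), and the review schema is just an example:

```python
import json
import urllib.request

# Hypothetical model tag -- check `ollama list` for the real one.
MODEL = "gemma4:26b-a4b-q8_0"

# Example schema for a code-review summary the tooling can consume directly.
REVIEW_SCHEMA = {
    "type": "object",
    "properties": {
        "severity": {"type": "string", "enum": ["info", "warn", "error"]},
        "summary": {"type": "string"},
    },
    "required": ["severity", "summary"],
}

def build_payload(prompt, schema):
    """Assemble an Ollama /api/chat request with schema-constrained output."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "format": schema,   # Ollama constrains decoding to this JSON schema
        "stream": False,
    }

def chat_json(prompt, schema, url="http://localhost:11434/api/chat"):
    """POST to a local Ollama server and parse the structured reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt, schema)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return json.loads(body["message"]["content"])

payload = build_payload("Review this diff and summarize as JSON.", REVIEW_SCHEMA)
print(sorted(payload))  # format, messages, model, stream
```

Because the output is schema-constrained rather than merely prompted, downstream tooling can `json.loads` the reply without defensive parsing.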
Why Apache 2.0 matters
Gemma 3 shipped with the “Gemma Open” license — usable, but with Google-specific terms and restrictions. Gemma 4 switches to Apache 2.0. No usage restrictions, no monthly active user limits, no acceptable use policies beyond standard Apache terms.
For anyone building products or internal tools on top of these models, this is the real headline. You can fine-tune it, embed it, ship it commercially, and distribute derivatives without legal overhead. The licensing playing field between Gemma 4, Qwen, and Mistral models is now level. Meta’s Llama 4 community license, with its 700M MAU limit, is more restrictive by comparison.
The bottom line
A model that runs on a MacBook, costs nothing per token, keeps all data local, ships under Apache 2.0, and scores within single-digit percentage points of GPT-5.2 and Claude Opus 4.5 on most reasoning benchmarks — that’s a genuine inflection point for how we build with AI.
I’m not abandoning Claude Code or Codex. The commercial models are still better at the hardest tasks, and the difference matters when you’re working on production systems. But the floor has risen dramatically. The routine 80% of AI-assisted coding work can now happen entirely on-device, privately, for free.
For CTOs evaluating their AI tooling stack: the hybrid local-plus-cloud approach isn’t a compromise anymore. It’s the architecture that makes the most sense — economically, practically, and from a data governance perspective.
If you want the practical follow-up — how to actually wire Gemma 4 into a Claude Code session via LM Studio so the two models work together in the same loop — see Wiring Gemma 4 Into Claude Code.
Sources: Google Gemma 4 model card (April 2, 2026), Anthropic Claude Opus 4.5 documentation, OpenAI GPT-5.2 release notes, Vellum AI benchmarks, Artificial Analysis, Hugging Face.