GenAI · 9 min read

Wiring Gemma 4 Into Claude Code — A Practical Local-Plus-Cloud Setup

A step-by-step guide to running Gemma 4 locally via LM Studio and calling it from inside a Claude Code session — for offline reasoning, batch processing, privacy-sensitive data, and zero-cost first-pass drafts that hand off to Claude when you need the heavy hitter.

In a previous article I wrote about why Gemma 4 is suddenly worth taking seriously as a standalone local model. This is the practical follow-up: how to wire it into your daily Claude Code workflow so that the two models work together — Gemma 4 handling the cheap, local, privacy-sensitive grunt work, Claude handling the hard reasoning and multi-step agentic tasks.

The whole setup takes about five minutes. You get an OpenAI-compatible endpoint on localhost:1234, and any inline ! curl or ! python3 block inside a Claude Code session can talk to it. No new SDK, no cloud dependency, no token budget for the local hops.

Why Wire Them Together at All

It’s tempting to treat local LLMs and frontier cloud LLMs as competitors. They aren’t — they’re complementary, and the most interesting workflows put them in the same loop:

  • Pre-filtering — let Gemma 4 read 200 documents locally and surface the 5 that matter, then hand those to Claude for the deep reasoning step
  • Privacy-sensitive data — keep customer PII, medical data, or proprietary code on-device for the sensitive parts; only send the abstracted question to Claude
  • Batch jobs overnight — summarise, classify, or transform thousands of files locally without burning a single API token
  • First-pass drafts — generate rough outlines, code skeletons, or test stubs locally, then ask Claude to refine them
  • Cost ceiling — for tasks where “good enough” really is good enough, route them away from your Claude budget entirely

The pattern is simple: Gemma for breadth and privacy, Claude for depth and complexity. Wiring them together means you can do both inside a single Claude Code session without leaving the terminal.
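If you want that split as something executable rather than a rule of thumb, a toy router is enough. A sketch, with the function name, flags, and keyword markers all invented for illustration rather than taken from any real API:

```python
# Toy routing heuristic: names, flags, and markers are illustrative only.
def route(task: str, sensitive: bool = False, bulk: bool = False) -> str:
    """Return which model should handle a task: local Gemma or cloud Claude."""
    if sensitive or bulk:
        return "gemma-local"   # privacy and breadth stay on-device
    hard_markers = ("refactor", "architecture", "review", "multi-step")
    if any(marker in task.lower() for marker in hard_markers):
        return "claude-cloud"  # depth and complexity go to the frontier model
    return "gemma-local"       # default to the free local pass
```

In practice the router lives in your head, not your repo, but the shape of the decision is exactly this: two cheap checks for privacy and volume, then a complexity test.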

Prerequisites

  • LM Studio installed (brew install --cask lm-studio or download from lmstudio.ai)
  • A Gemma 4 variant downloaded in LM Studio. I run the 26B A4B (MoE) on an M2 Ultra; the 4B dense variant works fine on a base M-series MacBook
  • Claude Code running on macOS, authenticated, and inside any project where you’re comfortable running shell commands
  • A few minutes of patience the first time you load the model into VRAM

Step 1 — Start the Local Server in LM Studio

  1. Open LM Studio
  2. Load your Gemma 4 model from the model picker (top center)
  3. Switch to the Local Server tab in the left sidebar
  4. Click Start Server

LM Studio binds to http://localhost:1234 by default and exposes an OpenAI-compatible REST API. That last part is the magic — anything that can talk to the OpenAI SDK can talk to Gemma 4 with a single base URL change. No custom client, no special wrapper, no glue code.

You’ll see a green indicator and a log line confirming the model is loaded. Leave LM Studio running in the background.

Step 2 — Verify the Connection From Claude Code

Inside a Claude Code session, you can run shell commands inline by prefixing them with !. Use that to confirm the server is reachable:

! curl -s http://localhost:1234/v1/models | jq

You should see a JSON response listing the loaded model. Note the exact model name — you’ll need it for the API calls below. LM Studio uses identifiers like google/gemma-4-26b-a4b rather than a friendly gemma-4, and a typo here is the most common reason Step 3 fails.
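If you would rather grab the exact id programmatically than eyeball curl output, a small stdlib-only helper does it. A sketch (the function names are mine; the response shape is the standard OpenAI /v1/models format):

```python
import json
import urllib.request

def list_model_ids(payload: dict) -> list[str]:
    """Pull the id of every loaded model from an OpenAI-style /v1/models response."""
    return [model["id"] for model in payload.get("data", [])]

def fetch_model_ids(base_url: str = "http://localhost:1234") -> list[str]:
    """Query the running LM Studio server (requires Step 1 to be up)."""
    with urllib.request.urlopen(f"{base_url}/v1/models") as resp:
        return list_model_ids(json.load(resp))
```

Call `fetch_model_ids()` with the server running and copy the id it prints into your scripts verbatim.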

Then test a completion:

! curl -s http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-26b-a4b",
    "messages": [{"role": "user", "content": "Explain RAG in one sentence."}],
    "temperature": 0.5
  }' | jq -r '.choices[0].message.content'
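That jq filter has a one-line Python equivalent you will reuse once you move past curl (the function name is my own):

```python
def first_message(response: dict) -> str:
    """Python equivalent of jq -r '.choices[0].message.content'."""
    return response["choices"][0]["message"]["content"]
```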

If you get a coherent sentence back, you’re done with the plumbing. The rest is workflow.

Step 3 — Call Gemma 4 From a Python Script

For anything beyond a one-shot curl, drop a small Python helper into your project. Ask Claude Code to write it for you, or paste this in directly:

# scripts/gemma.py
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio",  # any non-empty string works
)

def ask_gemma(prompt: str, system: str = "You are a concise technical assistant.") -> str:
    response = client.chat.completions.create(
        model="google/gemma-4-26b-a4b",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
        temperature=0.3,
        max_tokens=2048,
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    import sys

    # Combine a CLI-argument instruction with piped stdin, so
    # `cat file | python3 scripts/gemma.py "Summarise this"` sends both.
    prompt = " ".join(sys.argv[1:])
    if not sys.stdin.isatty():
        prompt = (prompt + "\n\n" + sys.stdin.read()).strip()
    print(ask_gemma(prompt))

Now you can pipe anything to it from inside Claude Code:

! cat README.md | python3 scripts/gemma.py "Summarise this in 5 bullets"

The first time you run it, Claude Code will offer to install the openai package if it’s missing — accept it.

Step 4 — Use Gemma 4 as a Subagent Inside a Claude Session

This is where the workflow gets interesting. Instead of you orchestrating Gemma calls manually, let Claude dispatch them as subtasks.

A prompt pattern that works well:

Use the local Gemma 4 API at localhost:1234 (OpenAI-compatible, model google/gemma-4-26b-a4b) to summarise every .md file under docs/ into a single bullet per file. Save the result as docs-index.md. Don’t read the files yourself — delegate the per-file summarisation to Gemma to keep your context clean.

Claude will generate the loop, hit the local endpoint for each file, write the index, and never burn its own context window on the raw document contents. You just spent zero cloud tokens on the bulk read pass.

The key phrase is “don’t read the files yourself — delegate to Gemma.” Without it, Claude will helpfully read everything itself, which defeats the point.
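For the curious, the loop Claude generates for that prompt boils down to something like this. A sketch, assuming the ask_gemma helper from Step 3; `build_index` and its signature are my invention:

```python
from pathlib import Path
from typing import Callable

def build_index(docs_dir: str, summarise: Callable[[str], str]) -> str:
    """One bullet per .md file. Pass ask_gemma from scripts/gemma.py as `summarise`."""
    bullets = []
    for md_file in sorted(Path(docs_dir).rglob("*.md")):
        summary = summarise(f"Summarise this in one bullet:\n\n{md_file.read_text()}")
        bullets.append(f"- **{md_file.name}**: {summary.strip()}")
    return "\n".join(bullets) + "\n"
```

Claude would finish by writing `build_index("docs", ask_gemma)` out to docs-index.md. Injecting the summariser as a callable also makes the loop trivial to dry-run with a stub before burning GPU time on the real pass.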

Other delegation patterns I use:

  • “Run Gemma over every commit message in the last 30 days and categorise them as feature / fix / refactor / chore”
  • “Use Gemma to extract the function signatures from these 40 Python files and write them to api-surface.txt”
  • “Translate these German UI strings to French via Gemma and review the result yourself before writing the file”

That last one is the prettiest version of the pattern: Gemma drafts, Claude reviews. You get bulk throughput from the local model and quality control from the cloud one, in a single agentic loop.

Step 5 — Pipe Output Directly Into Your Notes or Repo

A one-liner that sends a Gemma response straight into your Obsidian inbox:

! python3 scripts/gemma.py "Brief intro to MoE routing" \
  > ~/Library/Mobile\ Documents/iCloud~md~obsidian/Documents/The\ Vault/Inbox/moe-intro.md

Or generate a draft article skeleton in your repo and let Claude refine it:

! python3 scripts/gemma.py "Outline a 6-section article about local LLMs vs. cloud LLMs in 2026" \
  > drafts/local-vs-cloud-outline.md

Then say to Claude: “Read drafts/local-vs-cloud-outline.md and turn it into a finished article matching the voice of src/content/tech/gemma4-local-coding.md.” You’ve just done the rough pass for free and the polish pass at full quality.

LM Studio Settings That Actually Matter

| Setting | Recommended | Why |
| --- | --- | --- |
| Context length | 8192+ (Gemma 4 supports up to 128K) | Enough headroom for multi-document summarisation |
| GPU offload | Max layers (Metal on Mac) | Without this, inference is painfully slow |
| Temperature | 0.3 for extraction, 0.7 for drafting | Lower for “facts,” higher for “ideas” |
| Server port | 1234 (default) | Matches every example you’ll find online |
| Keep model in memory | On | Avoids the 5–15 second reload between calls |

The single biggest performance win on Apple Silicon is GPU offload set to max. If your Gemma 4 feels sluggish, that’s almost always why.

Troubleshooting

Connection refused on port 1234 — LM Studio’s server isn’t running. Check the Local Server tab and confirm the green indicator. If the port is taken, change it in LM Studio settings and pass the new URL to your scripts.
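Before kicking off an overnight batch, a two-second reachability check saves you from waking up to a folder of empty output files. A sketch (the helper name is mine):

```python
import socket

def server_up(host: str = "localhost", port: int = 1234, timeout: float = 1.0) -> bool:
    """True if something is listening on the LM Studio port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Gate your batch script on `server_up()` and fail loudly instead of silently producing nothing.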

Model not found error — The model name in your API call doesn’t match what LM Studio is serving. Run curl -s http://localhost:1234/v1/models | jq and copy the exact id field. LM Studio identifiers contain slashes and version suffixes that are easy to mistype.

Slow inference — Three usual culprits: GPU offload not maxed, another GPU-heavy app eating VRAM (Final Cut, Blender, Stable Diffusion), or you loaded a quantization that’s too aggressive for your hardware. Try a 4-bit or 5-bit MLX build for the best speed/quality tradeoff on Mac.

openai package not installed — Run ! pip install openai from inside Claude Code. The OpenAI SDK is OpenAI’s official Python client, but it works against any OpenAI-compatible endpoint, including LM Studio, Ollama, vLLM, and Together.

Garbage output / repetition loops — Almost always a temperature problem. Drop to 0.3 for extraction tasks, and make sure max_tokens is set so the model doesn’t run away.

What This Unlocks

Once you have this wired up, your mental model of “what Claude Code can do” expands. Suddenly the question isn’t should I run this task at all? but which model should run it? Most days, my split looks roughly like:

  • Gemma 4 (local) — bulk reads, summarisation, classification, extraction, format conversion, first drafts, anything privacy-sensitive
  • Claude (cloud) — architectural decisions, multi-step refactors, code review, anything where being wrong is expensive
  • Both, in the same loop — Gemma drafts → Claude reviews → you ship

This is the hybrid AI workflow I’ve been writing about for the past year, but in its most concrete form: two models, one terminal, no friction. The frontier-vs-local debate is over. The interesting question now is how you orchestrate them.

Key Takeaways

  • LM Studio exposes an OpenAI-compatible API — no custom client, no glue code
  • Claude Code can call it with ! curl or ! python3 inline blocks, no plugin needed
  • Delegate bulk work to Gemma, keep Claude’s context clean for hard reasoning
  • Gemma 4 on Apple Silicon (Metal, max GPU offload) is fast enough for interactive use at 4B–26B
  • The cost of running this hybrid setup is roughly zero — every Gemma call is free
  • The most powerful pattern is Gemma drafts, Claude reviews in a single agentic loop
claude-code gemma4 lm-studio local-llm hybrid-ai openai-compatible apple-silicon tutorial