GenAI · 9 min read

Wiring Gemma 4 Into Claude Code — A Practical Local-Plus-Cloud Setup

A step-by-step guide to running Gemma 4 locally via LM Studio and calling it from inside a Claude Code session — for offline reasoning, batch processing, privacy-sensitive data, and zero-cost first-pass drafts that hand off to Claude when you need the heavy hitter.

In a previous article I wrote about why Gemma 4 is suddenly worth taking seriously as a standalone local model. This is the practical follow-up: how to wire it into your daily Claude Code workflow so that the two models work together — Gemma 4 handling the cheap, local, privacy-sensitive grunt work, Claude handling the hard reasoning and multi-step agentic tasks.

The whole setup takes about five minutes. You get an OpenAI-compatible endpoint on localhost:1234, and any inline ! curl or ! python3 block inside a Claude Code session can talk to it. No new SDK, no cloud dependency, no token budget for the local hops.

Why Wire Them Together at All

It’s tempting to treat local LLMs and frontier cloud LLMs as competitors. They aren’t — they’re complementary, and the most interesting workflows put them in the same loop:

  • Pre-filtering — let Gemma 4 read 200 documents locally and surface the 5 that matter, then hand those to Claude for the deep reasoning step
  • Privacy-sensitive data — keep customer PII, medical data, or proprietary code on-device for the sensitive parts; only send the abstracted question to Claude
  • Batch jobs overnight — summarise, classify, or transform thousands of files locally without burning a single API token
  • First-pass drafts — generate rough outlines, code skeletons, or test stubs locally, then ask Claude to refine them
  • Cost ceiling — for tasks where “good enough” really is good enough, route them away from your Claude budget entirely

The pattern is simple: Gemma for breadth and privacy, Claude for depth and complexity. Wiring them together means you can do both inside a single Claude Code session without leaving the terminal.
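If you want that split as something executable rather than a rule of thumb, a toy router is enough. A sketch, with the function name, flags, and keyword markers all invented for illustration rather than taken from any real API:

```python
# Toy routing heuristic: names, flags, and markers are illustrative only.
def route(task: str, sensitive: bool = False, bulk: bool = False) -> str:
    """Return which model should handle a task: local Gemma or cloud Claude."""
    if sensitive or bulk:
        return "gemma-local"   # privacy and breadth stay on-device
    hard_markers = ("refactor", "architecture", "review", "multi-step")
    if any(marker in task.lower() for marker in hard_markers):
        return "claude-cloud"  # depth and complexity go to the frontier model
    return "gemma-local"       # default to the free local pass
```

In practice the router lives in your head, not your repo, but the shape of the decision is exactly this: two cheap checks for privacy and volume, then a complexity test.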

Prerequisites

  • LM Studio installed (brew install --cask lm-studio or download from lmstudio.ai)
  • A Gemma 4 variant downloaded in LM Studio. I run the 26B A4B (MoE) on an M2 Ultra; the 4B dense variant works fine on a base M-series MacBook
  • Claude Code running on macOS, authenticated, and inside any project where you’re comfortable running shell commands
  • A few minutes of patience the first time you load the model into VRAM

Step 1 — Start the Local Server in LM Studio

  1. Open LM Studio
  2. Load your Gemma 4 model from the model picker (top center)
  3. Switch to the Local Server tab in the left sidebar
  4. Click Start Server

LM Studio binds to http://localhost:1234 by default and exposes an OpenAI-compatible REST API. That last part is the magic — anything that can talk to the OpenAI SDK can talk to Gemma 4 with a single base URL change. No custom client, no special wrapper, no glue code.

You’ll see a green indicator and a log line confirming the model is loaded. Leave LM Studio running in the background.

Step 2 — Verify the Connection From Claude Code

Inside a Claude Code session, you can run shell commands inline by prefixing them with !. Use that to confirm the server is reachable:

! curl -s http://localhost:1234/v1/models | jq

You should see a JSON response listing the loaded model. Note the exact model name — you’ll need it for the API calls below. LM Studio uses identifiers like google/gemma-4-26b-a4b rather than a friendly gemma-4, and a typo here is the most common reason Step 3 fails.
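If you would rather grab the exact id programmatically than eyeball curl output, a small stdlib-only helper does it. A sketch (the function names are mine; the response shape is the standard OpenAI /v1/models format):

```python
import json
import urllib.request

def list_model_ids(payload: dict) -> list[str]:
    """Pull the id of every loaded model from an OpenAI-style /v1/models response."""
    return [model["id"] for model in payload.get("data", [])]

def fetch_model_ids(base_url: str = "http://localhost:1234") -> list[str]:
    """Query the running LM Studio server (requires Step 1 to be up)."""
    with urllib.request.urlopen(f"{base_url}/v1/models") as resp:
        return list_model_ids(json.load(resp))
```

Call `fetch_model_ids()` with the server running and copy the id it prints into your scripts verbatim.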

Then test a completion:

! curl -s http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-26b-a4b",
    "messages": [{"role": "user", "content": "Explain RAG in one sentence."}],
    "temperature": 0.5
  }' | jq -r '.choices[0].message.content'
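That jq filter has a one-line Python equivalent you will reuse once you move past curl (the function name is my own):

```python
def first_message(response: dict) -> str:
    """Python equivalent of jq -r '.choices[0].message.content'."""
    return response["choices"][0]["message"]["content"]
```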

If you get a coherent sentence back, you’re done with the plumbing. The rest is workflow.

Step 3 — Call Gemma 4 From a Python Script

For anything beyond a one-shot curl, drop a small Python helper into your project. Ask Claude Code to write it for you, or paste this in directly:

# scripts/gemma.py
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio",  # any non-empty string works
)

def ask_gemma(prompt: str, system: str = "You are a concise technical assistant.") -> str:
    response = client.chat.completions.create(
        model="google/gemma-4-26b-a4b",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
        temperature=0.3,
        max_tokens=2048,
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    import sys

    # Combine a CLI-argument instruction with piped stdin, so
    # `cat file | python3 scripts/gemma.py "Summarise this"` sends both.
    prompt = " ".join(sys.argv[1:])
    if not sys.stdin.isatty():
        prompt = (prompt + "\n\n" + sys.stdin.read()).strip()
    print(ask_gemma(prompt))

Now you can pipe anything to it from inside Claude Code:

! cat README.md | python3 scripts/gemma.py "Summarise this in 5 bullets"

The first time you run it, Claude Code will offer to install the openai package if it’s missing — accept it.

Step 4 — Use Gemma 4 as a Subagent Inside a Claude Session

This is where the workflow gets interesting. Instead of you orchestrating Gemma calls manually, let Claude dispatch them as subtasks.

A prompt pattern that works well:

Use the local Gemma 4 API at localhost:1234 (OpenAI-compatible, model google/gemma-4-26b-a4b) to summarise every .md file under docs/ into a single bullet per file. Save the result as docs-index.md. Don’t read the files yourself — delegate the per-file summarisation to Gemma to keep your context clean.

Claude will generate the loop, hit the local endpoint for each file, write the index, and never burn its own context window on the raw document contents. You just spent zero cloud tokens on the bulk read pass.

The key phrase is “don’t read the files yourself — delegate to Gemma.” Without it, Claude will helpfully read everything itself, which defeats the point.
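For the curious, the loop Claude generates for that prompt boils down to something like this. A sketch, assuming the ask_gemma helper from Step 3; `build_index` and its signature are my invention:

```python
from pathlib import Path
from typing import Callable

def build_index(docs_dir: str, summarise: Callable[[str], str]) -> str:
    """One bullet per .md file. Pass ask_gemma from scripts/gemma.py as `summarise`."""
    bullets = []
    for md_file in sorted(Path(docs_dir).rglob("*.md")):
        summary = summarise(f"Summarise this in one bullet:\n\n{md_file.read_text()}")
        bullets.append(f"- **{md_file.name}**: {summary.strip()}")
    return "\n".join(bullets) + "\n"
```

Claude would finish by writing `build_index("docs", ask_gemma)` out to docs-index.md. Injecting the summariser as a callable also makes the loop trivial to dry-run with a stub before burning GPU time on the real pass.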

Other delegation patterns I use:

  • “Run Gemma over every commit message in the last 30 days and categorise them as feature / fix / refactor / chore”
  • “Use Gemma to extract the function signatures from these 40 Python files and write them to api-surface.txt”
  • “Translate these German UI strings to French via Gemma and review the result yourself before writing the file”

That last one is the prettiest version of the pattern: Gemma drafts, Claude reviews. You get bulk throughput from the local model and quality control from the cloud one, in a single agentic loop.

Step 5 — Pipe Output Directly Into Your Notes or Repo

A one-liner that sends a Gemma response straight into your Obsidian inbox:

! python3 scripts/gemma.py "Brief intro to MoE routing" \
  > ~/Library/Mobile\ Documents/iCloud~md~obsidian/Documents/The\ Vault/Inbox/moe-intro.md

Or generate a draft article skeleton in your repo and let Claude refine it:

! python3 scripts/gemma.py "Outline a 6-section article about local LLMs vs. cloud LLMs in 2026" \
  > drafts/local-vs-cloud-outline.md

Then say to Claude: “Read drafts/local-vs-cloud-outline.md and turn it into a finished article matching the voice of src/content/tech/gemma4-local-coding.md.” You’ve just done the rough pass for free and the polish pass at full quality.

LM Studio Settings That Actually Matter

| Setting | Recommended | Why |
| --- | --- | --- |
| Context length | 8192+ (Gemma 4 supports up to 128K) | Enough headroom for multi-document summarisation |
| GPU offload | Max layers (Metal on Mac) | Without this, inference is painfully slow |
| Temperature | 0.3 for extraction, 0.7 for drafting | Lower for “facts,” higher for “ideas” |
| Server port | 1234 (default) | Matches every example you’ll find online |
| Keep model in memory | On | Avoids the 5–15 second reload between calls |

The single biggest performance win on Apple Silicon is GPU offload set to max. If your Gemma 4 feels sluggish, that’s almost always why.

Troubleshooting

Connection refused on port 1234 — LM Studio’s server isn’t running. Check the Local Server tab and confirm the green indicator. If the port is taken, change it in LM Studio settings and pass the new URL to your scripts.
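Before kicking off an overnight batch, a two-second reachability check saves you from waking up to a folder of empty output files. A sketch (the helper name is mine):

```python
import socket

def server_up(host: str = "localhost", port: int = 1234, timeout: float = 1.0) -> bool:
    """True if something is listening on the LM Studio port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Gate your batch script on `server_up()` and fail loudly instead of silently producing nothing.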

Model not found error — The model name in your API call doesn’t match what LM Studio is serving. Run curl -s http://localhost:1234/v1/models | jq and copy the exact id field. LM Studio identifiers contain slashes and version suffixes that are easy to mistype.

Slow inference — Three usual culprits: GPU offload not maxed, another GPU-heavy app eating VRAM (Final Cut, Blender, Stable Diffusion), or you loaded a quantization that’s too aggressive for your hardware. Try a 4-bit or 5-bit MLX build for the best speed/quality tradeoff on Mac.

openai package not installed — Run ! pip install openai from inside Claude Code. The OpenAI SDK is OpenAI’s official Python client, but it works against any OpenAI-compatible endpoint, including LM Studio, Ollama, vLLM, and Together.

Garbage output / repetition loops — Almost always a temperature problem. Drop to 0.3 for extraction tasks, and make sure max_tokens is set so the model doesn’t run away.

What This Unlocks

Once you have this wired up, your mental model of “what Claude Code can do” expands. Suddenly the question isn’t should I run this task at all? but which model should run it? Most days, my split looks roughly like:

  • Gemma 4 (local) — bulk reads, summarisation, classification, extraction, format conversion, first drafts, anything privacy-sensitive
  • Claude (cloud) — architectural decisions, multi-step refactors, code review, anything where being wrong is expensive
  • Both, in the same loop — Gemma drafts → Claude reviews → you ship

This is the hybrid AI workflow I’ve been writing about for the past year, but in its most concrete form: two models, one terminal, no friction. The frontier-vs-local debate is over. The interesting question now is how you orchestrate them.

Key Takeaways

  • LM Studio exposes an OpenAI-compatible API — no custom client, no glue code
  • Claude Code can call it with ! curl or ! python3 inline blocks, no plugin needed
  • Delegate bulk work to Gemma, keep Claude’s context clean for hard reasoning
  • Gemma 4 on Apple Silicon (Metal, max GPU offload) is fast enough for interactive use at 4B–26B
  • The cost of running this hybrid setup is roughly zero — every Gemma call is free
  • The most powerful pattern is Gemma drafts, Claude reviews in a single agentic loop
claude-code gemma4 lm-studio local-llm hybrid-ai openai-compatible apple-silicon tutorial