
Running Local LLMs — Qwen3-Coder, DeepSeek, MLX, and Apple Silicon

A practical guide to running powerful language models locally on Apple Silicon — model selection, MLX vs Ollama, quantization trade-offs, and when local beats cloud.

Running Local LLMs on Apple Silicon

The ability to run capable language models locally has reached an inflection point. With Apple Silicon’s unified memory architecture and frameworks like MLX and Ollama, models that would have required a data center five years ago now run on a laptop. Here’s my setup, what I’ve learned, and practical guidance for getting started.

Why Local?

There are three compelling reasons to run models locally rather than exclusively through cloud APIs:

Privacy: Process sensitive code, proprietary documents, and customer data without sending anything off your machine. This isn’t theoretical — if you work with regulated data, client code, or internal business logic, local inference removes the data processing question entirely.

Latency and Availability: No network dependency, no rate limits, instant responses even on a plane. Local models respond in milliseconds for autocomplete and seconds for longer generation. There’s no queue, no API outage, and no “you’ve exceeded your rate limit” at 2 AM.

Cost: After the hardware investment, inference is essentially free. If you’re running hundreds of queries per day for autocomplete, code explanation, and quick tasks, the cloud API costs add up. Local inference amortizes to nearly zero over time.

The trade-off is clear: local models are smaller and less capable than frontier cloud models. You’re trading raw intelligence for privacy, speed, and cost. The art is knowing which queries need frontier capability and which are well-served by a local 14B or 32B parameter model.

My Hardware Setup

Mac Studio M2 Ultra (192GB unified memory): My primary local inference machine. The massive unified memory is the key — it allows running 70B+ parameter models at full quality, or multiple smaller models simultaneously. The M2 Ultra’s memory bandwidth (~800 GB/s) directly determines token generation speed for LLMs.

MacBook Pro M4 Pro (48GB): My portable setup. Runs 14B models comfortably, 32B models in Q4 quantization. Good enough for code autocomplete and quick queries on the go.

Why Apple Silicon: The unified memory architecture means GPU and CPU share the same memory pool. A 70B Q4 model needs ~40GB of memory — on a discrete GPU setup, you’d need multiple NVIDIA cards. On a Mac Studio with 192GB unified memory, it just works. The memory bandwidth is lower than dedicated GPUs, but the total capacity and ease of use are unmatched for local inference.
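The memory math is easy to sanity-check yourself. A minimal sketch, counting weights only (KV cache and runtime overhead come on top, which is where the gap between ~39GB of weights and the ~40GB figure above comes from):

```python
def quantized_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights alone; KV cache and runtime
    overhead are extra. K-quants average slightly above their nominal bits."""
    return params_billion * bits_per_weight / 8

# 70B at ~4.5 effective bits/weight (Q4_K_M averages a bit above 4 bits)
print(round(quantized_weight_gb(70, 4.5), 1))  # prints 39.4
```

The same formula explains the 14B entries later in this post: 14 × 4.5 / 8 ≈ 8GB of weights, or roughly 9GB once overhead is counted.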

Software Stack

MLX (Apple’s ML Framework)

MLX is Apple’s machine learning framework, designed specifically for Apple Silicon. It’s become the fastest way to run models locally on a Mac.

# Install MLX and the LM module
pip install mlx mlx-lm

# Run a model directly
mlx_lm.generate --model mlx-community/Qwen3-Coder-8B-4bit \
  --prompt "Write a Python function that validates an email address"

# Start a local server (OpenAI-compatible API)
mlx_lm.server --model mlx-community/Qwen3-Coder-8B-4bit --port 8080

Key advantages: native Apple Silicon optimization, lazy evaluation (only computes what's needed), and pre-converted weights from the mlx-community organization on Hugging Face, so you don't need to convert models yourself.
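Once the server is up, anything that speaks the OpenAI API can talk to it. A minimal client sketch using only the standard library — the port and model name match the command above; adjust them to your setup:

```python
import json
import urllib.request

def chat_request(prompt: str,
                 model: str = "mlx-community/Qwen3-Coder-8B-4bit",
                 base: str = "http://localhost:8080") -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for the local server."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }).encode()
    return urllib.request.Request(
        f"{base}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

# With the server running:
# with urllib.request.urlopen(chat_request("Explain lazy evaluation")) as r:
#     print(json.load(r)["choices"][0]["message"]["content"])
```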

Ollama

Ollama is the easiest way to get started with local models. Period.

# Install
brew install ollama

# Run a model (downloads automatically on first use)
ollama run qwen3-coder
ollama run deepseek-r1:14b
ollama run llama4-scout

# API access (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-coder", "messages": [{"role": "user", "content": "Hello"}]}'

Key advantages: Dead simple setup, automatic model management, built-in OpenAI-compatible API, works with most AI tools (Cursor, Continue.dev, Cline) out of the box.

llama.cpp

The foundational inference engine that made the local LLM revolution possible. Most other tools (including Ollama) build on top of it. Direct usage gives you the most control:

# Build from source for maximum performance
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make -j

# Run a GGUF model
./llama-cli -m ~/models/qwen3-coder-14b-q5_k_m.gguf \
  -p "Explain the three-tier AI architecture pattern" \
  -n 512 --temp 0.7

When to use llama.cpp directly: When you need maximum control over inference parameters, are benchmarking, or are building custom inference pipelines.

Model Selection Guide

For Code Tasks

| Model | Parameters | Memory (Q4) | Speed | Quality | Notes |
|---|---|---|---|---|---|
| Qwen3-Coder 8B | 8B | ~5GB | Fast | Good | My default for autocomplete. Fast enough for inline suggestions. |
| Qwen3-Coder 14B | 14B | ~9GB | Good | Very Good | Sweet spot for code generation on 48GB machines. |
| Qwen3-Coder 32B | 32B | ~20GB | Moderate | Excellent | Strong code understanding; runs well on 48GB+. |
| DeepSeek-Coder V3 16B | 16B | ~10GB | Good | Very Good | Excellent for complex code reasoning. |
| Codestral 22B | 22B | ~14GB | Good | Very Good | Mistral's code model; strong multilingual support. |

For General Tasks

| Model | Parameters | Memory (Q4) | Speed | Quality | Notes |
|---|---|---|---|---|---|
| Llama 4 Scout | 17B active (MoE) | ~12GB | Good | Very Good | Meta's latest; strong general capability. |
| Qwen3 14B | 14B | ~9GB | Good | Very Good | Excellent multilingual (including German). |
| DeepSeek R1 14B | 14B | ~9GB | Moderate | Excellent | Reasoning model; thinks before answering. Slower but much more capable for complex tasks. |
| Gemma 3 12B | 12B | ~8GB | Fast | Good | Google's open model; good for general tasks. |
| Mistral Small 24B | 24B | ~15GB | Moderate | Very Good | Strong European language support. |

For Reasoning

| Model | Parameters | Memory (Q4) | Notes |
|---|---|---|---|
| DeepSeek R1 14B | 14B | ~9GB | Best reasoning model at this size. Chain-of-thought is built in. |
| DeepSeek R1 32B | 32B | ~20GB | Significantly stronger reasoning. Worth the memory if you have it. |
| QwQ 32B | 32B | ~20GB | Alibaba's reasoning model. Competitive with R1 on many benchmarks. |

Understanding Quantization

Quantization reduces model precision to fit in less memory. The trade-off is quality vs. memory usage:

| Quantization | Quality Loss | Memory (vs FP16) | When to Use |
|---|---|---|---|
| Q8 | Negligible | ~50% | When you have plenty of memory. Closest to original quality. |
| Q6_K | Minimal | ~40% | Good balance for large-memory machines. |
| Q5_K_M | Small | ~35% | My default. Best quality-to-memory ratio. |
| Q4_K_M | Noticeable on complex tasks | ~25% | When memory is tight. Still very usable. |
| Q3_K | Significant | ~20% | Only when you must fit a larger model. |

Rule of thumb: use the highest-precision quantization that fits in your available memory, leaving ~5GB of headroom for the OS and other applications.
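That rule of thumb can be sketched as a small helper. The size fractions come from the table above and are rough averages, not exact figures for any particular model:

```python
# Memory as a fraction of FP16 size, highest quality first (approximate)
QUANTS = [("Q8_0", 0.50), ("Q6_K", 0.40), ("Q5_K_M", 0.35),
          ("Q4_K_M", 0.25), ("Q3_K", 0.20)]

def pick_quant(params_billion: float, free_memory_gb: float,
               headroom_gb: float = 5.0):
    """Return the highest-quality quant whose weights fit with headroom.
    FP16 weights are ~2 bytes per parameter; KV cache is not counted."""
    fp16_gb = params_billion * 2
    budget = free_memory_gb - headroom_gb
    for name, fraction in QUANTS:
        if fp16_gb * fraction <= budget:
            return name
    return None  # model too large even at Q3_K

print(pick_quant(32, 24))  # prints Q4_K_M
```

For example, a 32B model with 24GB free lands on Q4_K_M: 64GB of FP16 weights × 0.25 = 16GB, which fits inside the 19GB budget after headroom.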

Integration with Development Tools

Cursor

Cursor supports local models via its OpenAI-compatible API settings:

  1. Start Ollama: ollama serve
  2. In Cursor Settings > Models > OpenAI API Base: http://localhost:11434/v1
  3. Add model name: qwen3-coder
  4. Use for autocomplete; keep cloud models for complex agent tasks

Continue.dev

Continue.dev has first-class Ollama support:

// ~/.continue/config.json
{
  "models": [
    {
      "title": "Qwen3-Coder (Local)",
      "provider": "ollama",
      "model": "qwen3-coder"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen3-Coder 8B",
    "provider": "ollama",
    "model": "qwen3-coder:8b"
  }
}

Cline

Cline supports Ollama models as a backend. Configure in the extension settings with the Ollama API endpoint. Works well for smaller tasks; for complex multi-file edits, cloud models (Claude) still outperform.

Performance Tuning

Memory Management

# Check available memory
sysctl hw.memsize

# Monitor during inference
# Activity Monitor > Memory tab > Memory Pressure (should stay green)

# For Ollama: control how many models stay loaded
export OLLAMA_MAX_LOADED_MODELS=2
export OLLAMA_NUM_PARALLEL=4

Context Length vs Speed

Longer context windows use more memory and slow down inference. For local models:

  • 4K context: Fast, good for autocomplete and short queries
  • 8K context: Standard, good balance for most tasks
  • 16K-32K context: Slow on local hardware, use only when you need to process longer documents
  • 128K+ context: Impractical locally except on very high-memory machines. Use cloud models.
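With Ollama, the context window can be pinned per request through the num_ctx option on the native API. A standard-library sketch (the model name follows the earlier examples):

```python
import json
import urllib.request

def generate_request(prompt: str, num_ctx: int = 8192,
                     model: str = "qwen3-coder") -> urllib.request.Request:
    """Build a request for Ollama's /api/generate endpoint with an explicit
    context window; larger num_ctx costs memory and slows inference."""
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }).encode()
    return urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )

# Keep autocomplete snappy with a small window; raise it only for long inputs
req = generate_request("Complete: def fib(n):", num_ctx=4096)
```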

Batch Processing

For tasks like processing multiple files or generating documentation across a codebase, batch requests to the local API rather than sending them one by one:

import mlx_lm

# Load the model once, then reuse it for every inference
model, tokenizer = mlx_lm.load("mlx-community/Qwen3-Coder-14B-4bit")

results = []
for file_content in files:
    results.append(mlx_lm.generate(
        model, tokenizer,
        prompt=f"Add docstrings to this code:\n{file_content}",
        max_tokens=2048,
    ))

When Local Beats Cloud

| Use Case | Local | Cloud | Winner |
|---|---|---|---|
| Code autocomplete | Fast, free, private | Slight latency, per-token cost | Local |
| Quick code explanations | Instant, offline | Better quality | Local |
| Complex multi-file refactoring | Limited by model size | Frontier models excel | Cloud |
| Processing sensitive code | No data leaves machine | Data sent to provider | Local |
| Offline development (travel, plane) | Always available | Requires internet | Local |
| High-volume batch processing | Free after hardware | $$$$ at scale | Local |
| Novel architecture decisions | Weaker reasoning | Frontier reasoning | Cloud |
| Long-context analysis (>32K tokens) | Very slow, memory hungry | Handles natively | Cloud |

The Hybrid Approach

The best setup uses both. My routing:

  • Local (Qwen3-Coder via Ollama/MLX): Autocomplete, quick questions, code explanations, simple refactoring, documentation generation, offline work, sensitive code
  • Cloud (Claude Opus via Claude Code): Complex feature implementation, architecture decisions, multi-file refactors, anything requiring deep reasoning or long context
  • Cloud (Claude Sonnet via Cursor): Medium-complexity tasks, code review, UI component implementation
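This routing is easy to encode. A hypothetical sketch — the task names and the 32K threshold are illustrative choices, not a real library:

```python
def route(task_type: str, sensitive: bool, context_tokens: int) -> str:
    """Decide whether a query goes to a local model or a cloud model."""
    if sensitive:
        return "local"   # sensitive code never leaves the machine
    if context_tokens > 32_000:
        return "cloud"   # long context is impractical locally
    if task_type in {"autocomplete", "explain", "docstring", "simple-refactor"}:
        return "local"   # high-volume, low-complexity: free and fast
    if task_type in {"architecture", "multi-file-refactor", "feature"}:
        return "cloud"   # needs frontier reasoning
    return "local"       # default to the free option

print(route("autocomplete", sensitive=False, context_tokens=2_000))  # local
```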

The cost savings are real. I estimate local inference handles 60-70% of my daily AI queries by volume, and those would cost $50-100/month in API fees. The Mac Studio paid for itself within the first year just in saved API costs — and that’s before counting the privacy and availability benefits.

Getting Started

If you’re new to local LLMs, here’s the fastest path:

  1. Install Ollama: brew install ollama
  2. Pull a model: ollama pull qwen3-coder (or ollama pull deepseek-r1:14b for reasoning)
  3. Try it: ollama run qwen3-coder
  4. Integrate: Point your IDE (Cursor, Continue.dev) at http://localhost:11434/v1
  5. Experiment: Try different models and sizes for your specific use cases

Total time to first local inference: about 5 minutes plus model download time. There’s never been a lower barrier to running powerful AI locally.

local-llm qwen3 deepseek mlx ollama apple-silicon privacy developer-tools