Running Local LLMs — Qwen3-Coder, DeepSeek, MLX, and Apple Silicon
A practical guide to running powerful language models locally on Apple Silicon — model selection, MLX vs Ollama, quantization trade-offs, and when local beats cloud.
Running Local LLMs on Apple Silicon
The ability to run capable language models locally has reached an inflection point. With Apple Silicon’s unified memory architecture and frameworks like MLX and Ollama, models that would have required a data center five years ago now run on a laptop. Here’s my setup, what I’ve learned, and practical guidance for getting started.
Why Local?
There are three compelling reasons to run models locally rather than exclusively through cloud APIs:
Privacy: Process sensitive code, proprietary documents, and customer data without sending anything off your machine. This isn’t theoretical — if you work with regulated data, client code, or internal business logic, local inference removes the data processing question entirely.
Latency and Availability: No network dependency, no rate limits, instant responses even on a plane. Local models respond in milliseconds for autocomplete and seconds for longer generation. There’s no queue, no API outage, and no “you’ve exceeded your rate limit” at 2 AM.
Cost: After the hardware investment, inference is essentially free. If you’re running hundreds of queries per day for autocomplete, code explanation, and quick tasks, the cloud API costs add up. Local inference amortizes to nearly zero over time.
The trade-off is clear: local models are smaller and less capable than frontier cloud models. You’re trading raw intelligence for privacy, speed, and cost. The art is knowing which queries need frontier capability and which are well-served by a local 14B or 32B parameter model.
My Hardware Setup
Mac Studio M2 Ultra (192GB unified memory): My primary local inference machine. The massive unified memory is the key — it allows running 70B+ parameter models at full quality, or multiple smaller models simultaneously. The M2 Ultra’s memory bandwidth (~800 GB/s) directly determines token generation speed for LLMs.
MacBook Pro M4 Pro (48GB): My portable setup. Runs 14B models comfortably, 32B models in Q4 quantization. Good enough for code autocomplete and quick queries on the go.
Why Apple Silicon: The unified memory architecture means GPU and CPU share the same memory pool. A 70B Q4 model needs ~40GB of memory — on a discrete GPU setup, you’d need multiple NVIDIA cards. On a Mac Studio with 192GB unified memory, it just works. The memory bandwidth is lower than dedicated GPUs, but the total capacity and ease of use are unmatched for local inference.
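As a rough sanity check on why bandwidth matters: during decoding, each generated token requires streaming essentially all model weights from memory once, so memory bandwidth divided by model size gives an upper bound on tokens per second. A minimal sketch (the function name is illustrative):

```python
def max_decode_tps(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough ceiling on decode speed: each token reads every weight
    from memory once, so throughput is bandwidth-bound at roughly
    bandwidth / model size. Real speeds land below this bound."""
    return bandwidth_gb_s / model_size_gb

# M2 Ultra (~800 GB/s) running a 70B Q4 model (~40GB):
print(round(max_decode_tps(800, 40)))  # → 20 tokens/sec, at best
```

This is why a Mac Studio generates tokens slower than a discrete GPU of equal capacity, and why smaller quantized models feel so much snappier.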
Software Stack
MLX (Apple’s ML Framework)
MLX is Apple’s machine learning framework, designed specifically for Apple Silicon. It’s become the fastest way to run models locally on a Mac.
```bash
# Install MLX and the LM module
pip install mlx mlx-lm

# Run a model directly
mlx_lm.generate --model mlx-community/Qwen3-Coder-8B-4bit \
  --prompt "Write a Python function that validates an email address"

# Start a local server (OpenAI-compatible API)
mlx_lm.server --model mlx-community/Qwen3-Coder-8B-4bit --port 8080
```
Key advantages: native Apple Silicon optimization, lazy evaluation (only computes what’s needed), and the mlx-community organization on Hugging Face, which provides pre-converted model weights so you don’t have to convert them yourself.
Ollama
Ollama is the easiest way to get started with local models. Period.
```bash
# Install
brew install ollama

# Run a model (downloads automatically on first use)
ollama run qwen3-coder
ollama run deepseek-r1:14b
ollama run llama4:scout

# API access (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-coder", "messages": [{"role": "user", "content": "Hello"}]}'
```
Key advantages: Dead simple setup, automatic model management, built-in OpenAI-compatible API, works with most AI tools (Cursor, Continue.dev, Cline) out of the box.
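If you’d rather script against the local server than shell out to curl, the same OpenAI-compatible endpoint works from Python’s standard library. A minimal sketch, assuming Ollama is running on its default port (the `chat_payload` and `chat` helpers are illustrative names, not part of any library):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def chat_payload(model: str, prompt: str) -> dict:
    # Minimal OpenAI-style chat request body, as accepted by Ollama.
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def chat(model: str, prompt: str) -> str:
    # POST to the local Ollama server (must be running: `ollama serve`).
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(chat_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Because the API shape matches OpenAI’s, swapping between local and cloud backends is usually just a base-URL change.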
llama.cpp
The foundational inference engine that made the local LLM revolution possible. Most other tools (including Ollama) build on top of it. Direct usage gives you the most control:
```bash
# Build from source for maximum performance
# (llama.cpp has moved from Makefiles to CMake)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

# Run a GGUF model
./build/bin/llama-cli -m ~/models/qwen3-coder-14b-q5_k_m.gguf \
  -p "Explain the three-tier AI architecture pattern" \
  -n 512 --temp 0.7
```
When to use llama.cpp directly: When you need maximum control over inference parameters, are benchmarking, or are building custom inference pipelines.
Model Selection Guide
For Code Tasks
| Model | Parameters | Memory (Q4) | Speed | Quality | Notes |
|---|---|---|---|---|---|
| Qwen3-Coder 8B | 8B | ~5GB | Fast | Good | My default for autocomplete. Fast enough for inline suggestions. |
| Qwen3-Coder 14B | 14B | ~9GB | Good | Very Good | Sweet spot for code generation on 48GB machines. |
| Qwen3-Coder 32B | 32B | ~20GB | Moderate | Excellent | Strong code understanding, runs well on 48GB+ |
| DeepSeek-Coder V2 Lite 16B | 16B (MoE, 2.4B active) | ~10GB | Good | Very Good | Excellent for complex code reasoning |
| Codestral 22B | 22B | ~14GB | Good | Very Good | Mistral’s code model, strong multilingual support |
For General Tasks
| Model | Parameters | Memory (Q4) | Speed | Quality | Notes |
|---|---|---|---|---|---|
| Llama 4 Scout | 109B total, 17B active (MoE) | ~60GB | Good | Very Good | Meta’s latest, strong general capability. Decodes fast (few active params), but all experts must fit in memory — high-memory machines only. |
| Qwen3 14B | 14B | ~9GB | Good | Very Good | Excellent multilingual (including German) |
| DeepSeek R1 14B | 14B | ~9GB | Moderate | Excellent | Reasoning model — thinks before answering. Slower but much more capable for complex tasks. |
| Gemma 3 12B | 12B | ~8GB | Fast | Good | Google’s open model, good for general tasks |
| Mistral Small 24B | 24B | ~15GB | Moderate | Very Good | Strong European language support |
For Reasoning
| Model | Parameters | Memory (Q4) | Notes |
|---|---|---|---|
| DeepSeek R1 14B | 14B | ~9GB | Best reasoning model at this size. Chain-of-thought is built in. |
| DeepSeek R1 32B | 32B | ~20GB | Significantly stronger reasoning. Worth the memory if you have it. |
| QwQ 32B | 32B | ~20GB | Alibaba’s reasoning model. Competitive with R1 on many benchmarks. |
Understanding Quantization
Quantization reduces model precision to fit in less memory. The trade-off is quality vs. memory usage:
| Quantization | Quality Loss | Memory Reduction | When to Use |
|---|---|---|---|
| Q8 | Negligible | ~50% of FP16 | When you have plenty of memory. Closest to original quality. |
| Q6_K | Minimal | ~40% of FP16 | Good balance for large-memory machines. |
| Q5_K_M | Small | ~35% of FP16 | My default. Best quality-to-memory ratio. |
| Q4_K_M | Noticeable on complex tasks | ~25% of FP16 | When memory is tight. Still very usable. |
| Q3_K | Significant | ~20% of FP16 | Only when you must fit a larger model. |
Rule of thumb: use the least aggressive quantization (the most bits per weight) that fits in your available memory, leaving ~5GB of headroom for the OS and other applications.
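The memory figures in the tables above come from simple arithmetic: parameters times bits per weight, divided by 8, plus some overhead for embeddings, quantization scales, and runtime buffers. A rough sketch (the ~15% overhead factor is a heuristic assumption, not an exact figure):

```python
def est_memory_gb(params_b: float, bits: float, overhead: float = 1.15) -> float:
    """Estimate weight memory for a quantized model:
    parameters (in billions) * bits per weight / 8 bytes,
    plus ~15% overhead (heuristic) for embeddings, quantization
    scales, and runtime buffers."""
    return params_b * bits / 8 * overhead

# 14B model at Q4_K_M (~4.5 effective bits per weight):
print(round(est_memory_gb(14, 4.5), 1))  # → 9.1
```

At ~4.5 effective bits for Q4_K_M, a 14B model lands near the ~9GB figure in the tables above; plug in your own model size and quantization level to see what fits.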
Integration with Development Tools
Cursor
Cursor supports local models via its OpenAI-compatible API settings:
- Start Ollama: `ollama serve`
- In Cursor Settings > Models > OpenAI API Base, enter: `http://localhost:11434/v1`
- Add the model name: `qwen3-coder`
- Use it for autocomplete; keep cloud models for complex agent tasks
Continue.dev
Continue.dev has first-class Ollama support:
```json
// ~/.continue/config.json
{
  "models": [
    {
      "title": "Qwen3-Coder (Local)",
      "provider": "ollama",
      "model": "qwen3-coder"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen3-Coder 8B",
    "provider": "ollama",
    "model": "qwen3-coder:8b"
  }
}
```
Cline
Cline supports Ollama models as a backend. Configure in the extension settings with the Ollama API endpoint. Works well for smaller tasks; for complex multi-file edits, cloud models (Claude) still outperform.
Performance Tuning
Memory Management
```bash
# Check available memory
sysctl hw.memsize

# Monitor during inference:
# Activity Monitor > Memory tab > Memory Pressure (should stay green)

# For Ollama: control how many models stay loaded
export OLLAMA_MAX_LOADED_MODELS=2
export OLLAMA_NUM_PARALLEL=4
```
Context Length vs Speed
Longer context windows use more memory and slow down inference. For local models:
- 4K context: Fast, good for autocomplete and short queries
- 8K context: Standard, good balance for most tasks
- 16K-32K context: Slow on local hardware, use only when you need to process longer documents
- 128K+ context: Impractical locally except on very high-memory machines. Use cloud models.
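Much of the slowdown and memory growth comes from the KV cache, which scales linearly with context length. A back-of-the-envelope sketch, using a hypothetical 14B-class configuration (40 layers, 8 KV heads with grouped-query attention, head dimension 128 — illustrative numbers, not any specific model’s):

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_len: int, bytes_per: int = 2) -> float:
    """KV cache size: 2 tensors (K and V) per layer, each holding
    n_kv_heads * head_dim values per token, at bytes_per bytes each
    (2 for fp16)."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per / 1e9

# Hypothetical 14B-class config: 40 layers, 8 KV heads (GQA), head_dim 128
for ctx in (4096, 32768):
    print(ctx, round(kv_cache_gb(40, 8, 128, ctx), 2))  # 0.67 GB → 5.37 GB
```

In this configuration, going from 4K to 32K context adds roughly 5GB on top of the model weights, which is exactly why long context hurts on memory-constrained machines.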
Batch Processing
For tasks like processing multiple files or generating documentation across a codebase, batch requests to the local API rather than sending them one by one:
```python
from mlx_lm import load, generate

# Load the model once, then run many inferences against it
model, tokenizer = load("mlx-community/Qwen3-Coder-14B-4bit")

for file_content in files:
    result = generate(
        model, tokenizer,
        prompt=f"Add docstrings to this code:\n{file_content}",
        max_tokens=2048,
    )
```
When Local Beats Cloud
| Use Case | Local | Cloud | Winner |
|---|---|---|---|
| Code autocomplete | Fast, free, private | Slight latency, per-token cost | Local |
| Quick code explanations | Instant, offline | Better quality | Local |
| Complex multi-file refactoring | Limited by model size | Frontier models excel | Cloud |
| Processing sensitive code | No data leaves machine | Data sent to provider | Local |
| Offline development (travel, plane) | Always available | Requires internet | Local |
| High-volume batch processing | Free after hardware | $$$$ at scale | Local |
| Novel architecture decisions | Weaker reasoning | Frontier reasoning | Cloud |
| Long-context analysis (>32K tokens) | Very slow, memory hungry | Handles natively | Cloud |
The Hybrid Approach
The best setup uses both. My routing:
- Local (Qwen3-Coder via Ollama/MLX): Autocomplete, quick questions, code explanations, simple refactoring, documentation generation, offline work, sensitive code
- Cloud (Claude Opus via Claude Code): Complex feature implementation, architecture decisions, multi-file refactors, anything requiring deep reasoning or long context
- Cloud (Claude Sonnet via Cursor): Medium-complexity tasks, code review, UI component implementation
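This routing logic is simple enough to sketch in a few lines (the task names and the 32K-token threshold are illustrative, not a real API):

```python
# Hypothetical router: sensitive code stays local, long context goes to
# the cloud, and everything else routes by task type.
LOCAL_TASKS = {"autocomplete", "explain", "docstring", "quick_refactor"}

def route(task: str, context_tokens: int = 0, sensitive: bool = False) -> str:
    if sensitive:
        return "local"   # regulated/client code never leaves the machine
    if context_tokens > 32_000:
        return "cloud"   # long context is impractical locally
    return "local" if task in LOCAL_TASKS else "cloud"

print(route("autocomplete"))         # → local
print(route("architecture_review"))  # → cloud
```

The key design choice is that privacy trumps everything else: a sensitive task routes local even when a cloud model would handle it better.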
The cost savings are real. I estimate local inference handles 60-70% of my daily AI queries by volume; sent to cloud APIs instead, those would cost $50-100/month. Over the machine’s working life that covers a substantial share of the Mac Studio’s price — and that’s before counting the privacy and availability benefits.
Getting Started
If you’re new to local LLMs, here’s the fastest path:
- Install Ollama: `brew install ollama`
- Pull a model: `ollama pull qwen3-coder` (or `ollama pull deepseek-r1:14b` for reasoning)
- Try it: `ollama run qwen3-coder`
- Integrate: point your IDE (Cursor, Continue.dev) at `http://localhost:11434/v1`
- Experiment: try different models and sizes for your specific use cases
Total time to first local inference: about 5 minutes plus model download time. There’s never been a lower barrier to running powerful AI locally.