Running Local LLMs — Qwen3-Coder, DeepSeek, MLX, and Apple Silicon
A practical guide to running powerful language models locally on Apple Silicon — model selection, MLX vs Ollama, quantization trade-offs, and when local beats cloud.
Running Local LLMs on Apple Silicon
The ability to run capable language models locally has reached an inflection point. With Apple Silicon’s unified memory architecture and frameworks like MLX and Ollama, models that would have required a data center five years ago now run on a laptop. Here’s my setup, what I’ve learned, and practical guidance for getting started.
Why Local?
There are three compelling reasons to run models locally rather than exclusively through cloud APIs:
Privacy: Process sensitive code, proprietary documents, and customer data without sending anything off your machine. This isn’t theoretical — if you work with regulated data, client code, or internal business logic, local inference removes the data processing question entirely.
Latency and Availability: No network dependency, no rate limits, instant responses even on a plane. Local models respond in milliseconds for autocomplete and seconds for longer generation. There’s no queue, no API outage, and no “you’ve exceeded your rate limit” at 2 AM.
Cost: After the hardware investment, inference is essentially free. If you’re running hundreds of queries per day for autocomplete, code explanation, and quick tasks, the cloud API costs add up. Local inference amortizes to nearly zero over time.
The trade-off is clear: local models are smaller and less capable than frontier cloud models. You’re trading raw intelligence for privacy, speed, and cost. The art is knowing which queries need frontier capability and which are well-served by a local 14B or 32B parameter model.
My Hardware Setup
Mac Studio M2 Ultra (192GB unified memory): My primary local inference machine. The massive unified memory is the key — it allows running 70B+ parameter models at full quality, or multiple smaller models simultaneously. The M2 Ultra’s memory bandwidth (~800 GB/s) directly determines token generation speed for LLMs.
MacBook Pro M4 Pro (48GB): My portable setup. Runs 14B models comfortably, 32B models in Q4 quantization. Good enough for code autocomplete and quick queries on the go.
Why Apple Silicon: The unified memory architecture means GPU and CPU share the same memory pool. A 70B Q4 model needs ~40GB of memory — on a discrete GPU setup, you’d need multiple NVIDIA cards. On a Mac Studio with 192GB unified memory, it just works. The memory bandwidth is lower than dedicated GPUs, but the total capacity and ease of use are unmatched for local inference.
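As a rough sanity check on why bandwidth matters: during decoding, each generated token requires streaming essentially all model weights from memory once, so memory bandwidth divided by model size gives an upper bound on tokens per second. A minimal sketch (the function name is illustrative):

```python
def max_decode_tps(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough ceiling on decode speed: each token reads every weight
    from memory once, so throughput is bandwidth-bound at roughly
    bandwidth / model size. Real speeds land below this bound."""
    return bandwidth_gb_s / model_size_gb

# M2 Ultra (~800 GB/s) running a 70B Q4 model (~40GB):
print(round(max_decode_tps(800, 40)))  # → 20 tokens/sec, at best
```

This is why a Mac Studio generates tokens slower than a discrete GPU of equal capacity, and why smaller quantized models feel so much snappier.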
Software Stack
MLX (Apple’s ML Framework)
MLX is Apple’s machine learning framework, designed specifically for Apple Silicon. It’s become the fastest way to run models locally on a Mac.
```bash
# Install MLX and the LM module
pip install mlx mlx-lm

# Run a model directly
mlx_lm.generate --model mlx-community/Qwen3-Coder-8B-4bit \
  --prompt "Write a Python function that validates an email address"

# Start a local server (OpenAI-compatible API)
mlx_lm.server --model mlx-community/Qwen3-Coder-8B-4bit --port 8080
```
Key advantages: native Apple Silicon optimization, lazy evaluation (only computes what’s needed), and the mlx-community organization on Hugging Face, which provides pre-converted model weights so you don’t have to convert them yourself.
Ollama
Ollama is the easiest way to get started with local models. Period.
```bash
# Install
brew install ollama

# Run a model (downloads automatically on first use)
ollama run qwen3-coder
ollama run deepseek-r1:14b
ollama run llama4:scout

# API access (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-coder", "messages": [{"role": "user", "content": "Hello"}]}'
```
Key advantages: Dead simple setup, automatic model management, built-in OpenAI-compatible API, works with most AI tools (Cursor, Continue.dev, Cline) out of the box.
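If you’d rather script against the local server than shell out to curl, the same OpenAI-compatible endpoint works from Python’s standard library. A minimal sketch, assuming Ollama is running on its default port (the `chat_payload` and `chat` helpers are illustrative names, not part of any library):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def chat_payload(model: str, prompt: str) -> dict:
    # Minimal OpenAI-style chat request body, as accepted by Ollama.
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def chat(model: str, prompt: str) -> str:
    # POST to the local Ollama server (must be running: `ollama serve`).
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(chat_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Because the API shape matches OpenAI’s, swapping between local and cloud backends is usually just a base-URL change.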
llama.cpp
The foundational inference engine that made the local LLM revolution possible. Most other tools (including Ollama) build on top of it. Direct usage gives you the most control:
```bash
# Build from source for maximum performance
# (llama.cpp has moved from Makefiles to CMake)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

# Run a GGUF model
./build/bin/llama-cli -m ~/models/qwen3-coder-14b-q5_k_m.gguf \
  -p "Explain the three-tier AI architecture pattern" \
  -n 512 --temp 0.7
```
When to use llama.cpp directly: When you need maximum control over inference parameters, are benchmarking, or are building custom inference pipelines.
Model Selection Guide
For Code Tasks
| Model | Parameters | Memory (Q4) | Speed | Quality | Notes |
|---|---|---|---|---|---|
| Qwen3-Coder 8B | 8B | ~5GB | Fast | Good | My default for autocomplete. Fast enough for inline suggestions. |
| Qwen3-Coder 14B | 14B | ~9GB | Good | Very Good | Sweet spot for code generation on 48GB machines. |
| Qwen3-Coder 32B | 32B | ~20GB | Moderate | Excellent | Strong code understanding, runs well on 48GB+ |
| DeepSeek-Coder V2 Lite 16B | 16B (MoE, 2.4B active) | ~10GB | Good | Very Good | Excellent for complex code reasoning |
| Codestral 22B | 22B | ~14GB | Good | Very Good | Mistral’s code model, strong multilingual support |
For General Tasks
| Model | Parameters | Memory (Q4) | Speed | Quality | Notes |
|---|---|---|---|---|---|
| Llama 4 Scout | 109B total, 17B active (MoE) | ~60GB | Good | Very Good | Meta’s latest, strong general capability. Decodes fast (few active params), but all experts must fit in memory — high-memory machines only. |
| Qwen3 14B | 14B | ~9GB | Good | Very Good | Excellent multilingual (including German) |
| DeepSeek R1 14B | 14B | ~9GB | Moderate | Excellent | Reasoning model — thinks before answering. Slower but much more capable for complex tasks. |
| Gemma 3 12B | 12B | ~8GB | Fast | Good | Google’s open model, good for general tasks |
| Mistral Small 24B | 24B | ~15GB | Moderate | Very Good | Strong European language support |
For Reasoning
| Model | Parameters | Memory (Q4) | Notes |
|---|---|---|---|
| DeepSeek R1 14B | 14B | ~9GB | Best reasoning model at this size. Chain-of-thought is built in. |
| DeepSeek R1 32B | 32B | ~20GB | Significantly stronger reasoning. Worth the memory if you have it. |
| QwQ 32B | 32B | ~20GB | Alibaba’s reasoning model. Competitive with R1 on many benchmarks. |
Understanding Quantization
Quantization reduces model precision to fit in less memory. The trade-off is quality vs. memory usage:
| Quantization | Quality Loss | Memory Reduction | When to Use |
|---|---|---|---|
| Q8 | Negligible | ~50% of FP16 | When you have plenty of memory. Closest to original quality. |
| Q6_K | Minimal | ~40% of FP16 | Good balance for large-memory machines. |
| Q5_K_M | Small | ~35% of FP16 | My default. Best quality-to-memory ratio. |
| Q4_K_M | Noticeable on complex tasks | ~25% of FP16 | When memory is tight. Still very usable. |
| Q3_K | Significant | ~20% of FP16 | Only when you must fit a larger model. |
Rule of thumb: use the least aggressive quantization (the most bits per weight) that fits in your available memory, leaving ~5GB of headroom for the OS and other applications.
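The memory figures in the tables above come from simple arithmetic: parameters times bits per weight, divided by 8, plus some overhead for embeddings, quantization scales, and runtime buffers. A rough sketch (the ~15% overhead factor is a heuristic assumption, not an exact figure):

```python
def est_memory_gb(params_b: float, bits: float, overhead: float = 1.15) -> float:
    """Estimate weight memory for a quantized model:
    parameters (in billions) * bits per weight / 8 bytes,
    plus ~15% overhead (heuristic) for embeddings, quantization
    scales, and runtime buffers."""
    return params_b * bits / 8 * overhead

# 14B model at Q4_K_M (~4.5 effective bits per weight):
print(round(est_memory_gb(14, 4.5), 1))  # → 9.1
```

At ~4.5 effective bits for Q4_K_M, a 14B model lands near the ~9GB figure in the tables above; plug in your own model size and quantization level to see what fits.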
Integration with Development Tools
Cursor
Cursor supports local models via its OpenAI-compatible API settings:
- Start Ollama: `ollama serve`
- In Cursor Settings > Models > OpenAI API Base, enter: `http://localhost:11434/v1`
- Add the model name: `qwen3-coder`
- Use it for autocomplete; keep cloud models for complex agent tasks
Continue.dev
Continue.dev has first-class Ollama support:
```json
// ~/.continue/config.json
{
  "models": [
    {
      "title": "Qwen3-Coder (Local)",
      "provider": "ollama",
      "model": "qwen3-coder"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen3-Coder 8B",
    "provider": "ollama",
    "model": "qwen3-coder:8b"
  }
}
```
Cline
Cline supports Ollama models as a backend. Configure in the extension settings with the Ollama API endpoint. Works well for smaller tasks; for complex multi-file edits, cloud models (Claude) still outperform.
Performance Tuning
Memory Management
```bash
# Check available memory
sysctl hw.memsize

# Monitor during inference:
# Activity Monitor > Memory tab > Memory Pressure (should stay green)

# For Ollama: control how many models stay loaded
export OLLAMA_MAX_LOADED_MODELS=2
export OLLAMA_NUM_PARALLEL=4
```
Context Length vs Speed
Longer context windows use more memory and slow down inference. For local models:
- 4K context: Fast, good for autocomplete and short queries
- 8K context: Standard, good balance for most tasks
- 16K-32K context: Slow on local hardware, use only when you need to process longer documents
- 128K+ context: Impractical locally except on very high-memory machines. Use cloud models.
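Much of the slowdown and memory growth comes from the KV cache, which scales linearly with context length. A back-of-the-envelope sketch, using a hypothetical 14B-class configuration (40 layers, 8 KV heads with grouped-query attention, head dimension 128 — illustrative numbers, not any specific model’s):

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_len: int, bytes_per: int = 2) -> float:
    """KV cache size: 2 tensors (K and V) per layer, each holding
    n_kv_heads * head_dim values per token, at bytes_per bytes each
    (2 for fp16)."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per / 1e9

# Hypothetical 14B-class config: 40 layers, 8 KV heads (GQA), head_dim 128
for ctx in (4096, 32768):
    print(ctx, round(kv_cache_gb(40, 8, 128, ctx), 2))  # 0.67 GB → 5.37 GB
```

In this configuration, going from 4K to 32K context adds roughly 5GB on top of the model weights, which is exactly why long context hurts on memory-constrained machines.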
Batch Processing
For tasks like processing multiple files or generating documentation across a codebase, batch requests to the local API rather than sending them one by one:
```python
from mlx_lm import load, generate

# Load the model once, then run many inferences against it
model, tokenizer = load("mlx-community/Qwen3-Coder-14B-4bit")

for file_content in files:
    result = generate(
        model, tokenizer,
        prompt=f"Add docstrings to this code:\n{file_content}",
        max_tokens=2048,
    )
```
When Local Beats Cloud
| Use Case | Local | Cloud | Winner |
|---|---|---|---|
| Code autocomplete | Fast, free, private | Slight latency, per-token cost | Local |
| Quick code explanations | Instant, offline | Better quality | Local |
| Complex multi-file refactoring | Limited by model size | Frontier models excel | Cloud |
| Processing sensitive code | No data leaves machine | Data sent to provider | Local |
| Offline development (travel, plane) | Always available | Requires internet | Local |
| High-volume batch processing | Free after hardware | $$$$ at scale | Local |
| Novel architecture decisions | Weaker reasoning | Frontier reasoning | Cloud |
| Long-context analysis (>32K tokens) | Very slow, memory hungry | Handles natively | Cloud |
The Hybrid Approach
The best setup uses both. My routing:
- Local (Qwen3-Coder via Ollama/MLX): Autocomplete, quick questions, code explanations, simple refactoring, documentation generation, offline work, sensitive code
- Cloud (Claude Opus via Claude Code): Complex feature implementation, architecture decisions, multi-file refactors, anything requiring deep reasoning or long context
- Cloud (Claude Sonnet via Cursor): Medium-complexity tasks, code review, UI component implementation
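This routing logic is simple enough to sketch in a few lines (the task names and the 32K-token threshold are illustrative, not a real API):

```python
# Hypothetical router: sensitive code stays local, long context goes to
# the cloud, and everything else routes by task type.
LOCAL_TASKS = {"autocomplete", "explain", "docstring", "quick_refactor"}

def route(task: str, context_tokens: int = 0, sensitive: bool = False) -> str:
    if sensitive:
        return "local"   # regulated/client code never leaves the machine
    if context_tokens > 32_000:
        return "cloud"   # long context is impractical locally
    return "local" if task in LOCAL_TASKS else "cloud"

print(route("autocomplete"))         # → local
print(route("architecture_review"))  # → cloud
```

The key design choice is that privacy trumps everything else: a sensitive task routes local even when a cloud model would handle it better.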
The cost savings are real. I estimate local inference handles 60-70% of my daily AI queries by volume; sent to cloud APIs instead, those would cost $50-100/month. Over the machine’s working life that covers a substantial share of the Mac Studio’s price — and that’s before counting the privacy and availability benefits.
Getting Started
If you’re new to local LLMs, here’s the fastest path:
- Install Ollama: `brew install ollama`
- Pull a model: `ollama pull qwen3-coder` (or `ollama pull deepseek-r1:14b` for reasoning)
- Try it: `ollama run qwen3-coder`
- Integrate: point your IDE (Cursor, Continue.dev) at `http://localhost:11434/v1`
- Experiment: try different models and sizes for your specific use cases
Total time to first local inference: about 5 minutes plus model download time. There’s never been a lower barrier to running powerful AI locally.