Running Claude Code Locally with LM Studio on Apple Silicon
Most guides for running Anthropic’s Claude Code against a local model point you at Ollama, tell you to set a couple of env vars, and consider it done. And it looks like it works - you can chat with it - but the agentic loop is silently broken. The model can talk to you, but it can’t actually do anything: no file reads, no real tool calls, no multi-step task execution. Just a polite chatbot wearing Claude Code’s UI as a costume (and turning my MacBook into a mini space heater).
This post walks through the setup I eventually landed on that actually works on Apple Silicon: LM Studio + the Unsloth GGUF version of the Qwen3-Coder-30B-A3B model, running entirely locally on a 14” M5 Pro MacBook Pro with 64GB of unified memory. Full agentic loop, no API costs, no rate limits, no data leaving the machine.
Why bother running it locally
One of the key advantages of running local large language models (LLMs) is privacy, which can also be a key component in DFIR work. When dealing with sensitive client info, or even malware, cloud models introduce risk and restrictions. I wanted to test for myself what these local models could do, which is what led me to test Claude Code locally on my MacBook Pro in the first place. If you also have capable hardware (and hate the direction of Anthropic’s pricing and plans), it can be worthwhile to explore local models for lighter agentic work.
While this post doesn’t cover it, one of the other benefits to local models can be the ability to run “abliterated” versions, which are models that have had their refusal & safety behavior weakened or removed after training. These can be very useful for malware decoding and analysis where normal cloud-based models, like OpenAI’s ChatGPT and Google’s Gemini, will refuse. These would be run independently, not via the Claude Code process outlined below.
Why LM Studio, not Ollama
This is the part I learned the hard way, after about an hour of wondering why Ollama was spitting back gibberish after chewing on the question “what files are in this directory?” for 5-10 minutes.
Claude Code is built around Anthropic’s Messages API, which uses structured tool_use and tool_result blocks for every agentic action - essentially every Bash command, file read, and edit. The model’s response isn’t just text; it’s a sequence of typed content blocks that the CLI parses and dispatches.
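For context, this is roughly what one of those tool_use blocks looks like inside a Messages API response (simplified for illustration; the exact tool names and payloads Claude Code uses will vary):

```json
{
  "role": "assistant",
  "content": [
    { "type": "text", "text": "I'll list the files in this directory." },
    {
      "type": "tool_use",
      "id": "toolu_01A",
      "name": "Bash",
      "input": { "command": "ls -la" }
    }
  ]
}
```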
Ollama serves an OpenAI-compatible endpoint and translates Anthropic-shaped requests on the fly. That translation layer doesn’t preserve the tool-call blocks cleanly. The model emits something that looks like a tool call, the adapter mangles it, Claude Code can’t parse it, and the agentic loop breaks. You get a model that says “I’ll check that file for you” and then nothing happens. Super fun.
LM Studio 0.4.1 added a native Anthropic Messages API at /v1/messages. Claude Code talks to it the same way it talks to Anthropic’s hosted API, and tool calls round-trip correctly. No adapter or translation needed.
Prerequisites
- macOS (this walkthrough was done on macOS 26.4)
- LM Studio 0.4.1 or later — earlier versions don’t expose the native Anthropic endpoint
- An Anthropic account — Pro, Max, Team, Enterprise, or Console. Free tier doesn’t include Claude Code access. You only need to authenticate once, then redirect everything local via env vars.
- Terminal access (Terminal, iTerm2, whatever you use)
- Apple Silicon with enough unified memory for the model you want. I’m on a 14” M5 Pro with 64GB; the model recommendations below scale by RAM tier.
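If you’re not sure how much unified memory your machine has, a quick check from the terminal:

```bash
# Total unified memory, reported in bytes and converted to GB
sysctl -n hw.memsize | awk '{ printf "%.0f GB\n", $1 / 1024^3 }'
```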
Step-by-step setup
Step 1: Install Claude Code
```bash
curl -fsSL https://claude.ai/install.sh | bash
```
Verify the install:
```bash
claude doctor
```
This checks installation health and surfaces config issues. On the first run it’ll authenticate against Anthropic and that’s expected. The redirect to local happens via env vars in Step 5.
Step 2: Pick and download a model
The sad reality is that what you can run depends on how much unified memory you have. Here’s a rough guide:
| RAM | Recommended model | Quant | Size |
|---|---|---|---|
| 24GB | Qwen3.5-35B-A3B | Q4_K_M | ~22GB |
| 64GB (my setup) | Qwen3-Coder-30B-A3B | UD Q4_K_XL | ~17.67GB |
| 64GB (more intensive) | Qwen3.5-27B dense | Q8_0 | ~30GB |
| 128GB+ | Qwen3-Coder-Next 80B | Q4_K_M | ~48GB |
I went with Qwen3-Coder-30B-A3B for the agentic Claude Code use case. A few reasons for this:
- Purpose-built for agentic coding, tool calling, and multi-file reasoning
- Mixture of Experts (MoE) architecture - 30B total params but only 3B active per token, so prefill is fast
- Native 256K context support
- No “thinking” mode - less overhead per turn, which matters when you’re firing off tool calls in a loop
In LM Studio’s model search, look for Qwen3-Coder-30B-A3B and pick:
- Author: `unsloth`
- Repo: `Qwen3-Coder-30B-A3B-Instruct-GGUF`
- Quant: `UD Q4_K_XL` (~17.67GB)
UD refers to Unsloth’s “dynamic” quantization, which uses layer-aware compression to retain more model quality than standard Q4 while staying around the same file size. On a 64GB MacBook such as mine, that should leave roughly 46GB available for macOS, KV cache, and other apps running alongside the model.
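If you want to sanity-check that headroom once the model is loaded, `vm_stat` gives a rough estimate (a quick sketch; Apple Silicon uses 16KB pages):

```bash
# Estimate available memory: free + inactive pages x 16KB page size
vm_stat | awk '/Pages free|Pages inactive/ { gsub(/\./, "", $NF); sum += $NF }
               END { printf "~%.1f GB available\n", sum * 16384 / 1e9 }'
```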
Step 3: Configure the model in LM Studio
Once downloaded, open the model’s settings panel. Two tabs matter, Load and Inference (plus the prompt template). I arrived at most of these settings through a combination of trial and error and research, then sanity-checked them by asking Claude.
Load tab:
| Setting | Value | Notes |
|---|---|---|
| Context Length | 32768 | 32K is the sweet spot. Push to 65536 if you keep hitting limits. |
| GPU Offload | Max / -1 | Full Metal offload so model fits in unified memory. |
| Evaluation Batch Size | 1024 | Default is 512. Doubling this noticeably speeds up prefill, which is relevant for Claude Code’s large system prompt. |
| Unified KV Cache | On | Default, leave it. |
| Offload KV Cache to GPU | On | Default, leave it. |
| Keep Model in Memory | On | Avoids cold-load delays between sessions. |
| Flash Attention | On | Reduces memory pressure at long contexts. |
| K/V Cache Quantization | Off | Experimental, leave off for stability. |
| Try mmap() | On | Default |
| Number of Experts | 8 | Correct for this model, don’t change. |
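If you prefer setting this up headlessly, LM Studio’s `lms` CLI can load the model with some of these options from the terminal (the flag names here are assumptions based on recent versions; verify with `lms load --help`):

```bash
# Load with full GPU offload and a 32K context window
# (flags assumed from recent lms versions; check `lms load --help`)
lms load qwen3-coder-30b-a3b-instruct --gpu max --context-length 32768
```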
Inference tab:
| Setting | Value | Notes |
|---|---|---|
| Temperature | 0.7 | Qwen’s official recommendation for the Coder series. |
| Top K | 20 | Down from default 40 - keeps tool calls tight. |
| Top P | 0.80 | Down from default 0.95 - also Qwen’s recommendation. |
| Repeat Penalty | 1.05 | Down from default 1.1 - discourages repetition in long sessions. |
| Min P | Off / 0 | Disable. Can interfere with tool-call format. |
| Reasoning Section Parsing | Off | This model has no <think> blocks. |
| Structured Output | Off | Claude Code handles its own structure, enabling this breaks tool calls. |
Prompt Template tab:
The default Jinja template included with the GGUF version of this model uses an unsupported `safe` filter, which causes an error on the first prompt. This one was a major headache to identify, but luckily an easy fix:
```
[ERROR] Error rendering prompt with jinja template:
"Unknown StringValue filter: safe"
```
Fix it manually:
- Switch from Template (Jinja) to Manual
- Pick ChatML from the dropdown
- Confirm the start/end tags populate as `<|im_start|>` / `<|im_end|>` for system, user, and assistant
- Confirm the stop strings include `<|im_start|>` and `<|im_end|>`
- Eject and reload the model
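For reference, the ChatML format those tags produce looks like this on the wire:

```
<|im_start|>system
...Claude Code's system prompt...<|im_end|>
<|im_start|>user
what files are in this directory?<|im_end|>
<|im_start|>assistant
```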
Step 4: Start the LM Studio server
Either flip the server toggle on in LM Studio’s Developer tab, or start it from the terminal:
```bash
lms server start
```
Default port is 1234. Verify the model is loaded and reachable:
```bash
curl http://localhost:1234/v1/models
```
Expected output:
```json
{
  "data": [
    {
      "id": "qwen3-coder-30b-a3b-instruct",
      "object": "model",
      "owned_by": "organization_owner"
    }
  ]
}
```
Write down the exact `id` value - it should match the model you chose, and you’ll need it character-for-character for the env var in the next step.
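Before involving Claude Code at all, you can also hit the Anthropic-style endpoint directly (a minimal sketch; since LM Studio doesn’t validate auth locally, the dummy key is arbitrary):

```bash
curl http://localhost:1234/v1/messages \
  -H "content-type: application/json" \
  -H "x-api-key: lmstudio" \
  -d '{
        "model": "qwen3-coder-30b-a3b-instruct",
        "max_tokens": 64,
        "messages": [{ "role": "user", "content": "Reply with OK" }]
      }'
```

A JSON message back (rather than an error) means the endpoint is live.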
While you’re in the Developer tab, two server settings worth tweaking:
- Just-in-Time Model Loading: `Off` — keeps the model resident in memory between Claude Code prompts instead of reloading on each request.
- Require Authentication: `Off` — a dummy token works fine locally, so there’s no need for the overhead.
Step 5: Set environment variables
Open your shell rc file:
```bash
nano ~/.zshrc
```
Add these three lines at the bottom (using the model you chose):
```bash
export ANTHROPIC_BASE_URL=http://localhost:1234
export ANTHROPIC_AUTH_TOKEN=lmstudio
export ANTHROPIC_MODEL=qwen3-coder-30b-a3b-instruct
```
The AUTH_TOKEN value doesn’t matter (LM Studio doesn’t validate it locally) but Claude Code refuses to start without one set. The MODEL value must match the id from Step 4 exactly.
Save (Control+O, Enter, Control+X), then apply to the current shell:
```bash
source ~/.zshrc
```
Verify:
```bash
echo $ANTHROPIC_BASE_URL
# http://localhost:1234
```
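One side effect of these exports: every `claude` run in that shell is now routed local. A small helper function (purely a convenience sketch, not part of Claude Code) lets you fall back to the hosted API on demand:

```bash
# Run Claude Code against Anthropic's hosted API for this invocation only.
# The subshell keeps the local-routing vars intact in your main shell.
claude-cloud() {
  ( unset ANTHROPIC_BASE_URL ANTHROPIC_AUTH_TOKEN ANTHROPIC_MODEL; claude "$@" )
}
```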
Step 6: Launch Claude Code
```bash
cd /your/project
claude
```
The very first thing to check: the bottom-left of the Claude Code UI. If it shows the LM Studio model id (e.g. qwen3-coder-30b-a3b-instruct), you’re routed local. If it still shows something like Sonnet 4.6 · API Usage Billing, the env vars didn’t take in this terminal session - back to Step 5 you go.
From what I’ve seen, it’s advisable to set effort to low for routine tasks - local models can’t match hosted Sonnet at high effort, and low is the sweet spot for prefill speed:
```
/effort low
```
Smoke-test the agentic loop with something concrete:
```
what files are in this directory?
```
Watch LM Studio’s developer logs, which are visible under the Developer tab. You should see prefill, generation, and a tool call go out. Claude Code should also come back with actual filenames, not a description of what it would do if it could read files. If the model just narrates what it’s about to do without anything happening, the tool-call format is broken and you’ll need to re-check the GGUF (Step 2) and the prompt template (Step 3). The speed at which Claude Code responds will also be heavily dependent on your hardware and the model you chose.
Finally, you can generate a CLAUDE.md for your project if you’re already in the folder you wish to code within:
```
/init
```
Claude Code reads this file on every session start, which lets the model skip a chunk of the cold-start exploration it did initially.
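The generated file is plain markdown. A stripped-down illustration of the kind of content it holds (hypothetical project details; `/init` builds a richer version from your actual repo):

```markdown
# CLAUDE.md
- Python 3.12 project; dependencies pinned in requirements.txt
- Run tests with `pytest -q` before committing
- Entry point: src/main.py; CLI parsing lives in src/cli.py
```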
Performance expectations
For my setup (M5 Pro, 64GB, Qwen3-Coder-30B-A3B UD Q4_K_XL), these figures line up with the consensus I found online, which helps me confirm everything is working as it should:
| Metric | Value |
|---|---|
| Model size in memory | ~17.67GB |
| macOS overhead | ~8–10GB |
| Total memory pressure | ~26–28GB (comfortable on 64GB) |
| GPU utilization during inference | ~100% |
| GPU power draw | ~33W |
| GPU temperature under load | ~91°C (safe — M5 Pro throttles around 105°C) |
| Prefill speed | ~100 tok/s |
| First response time (cold) | 20–30 seconds (after /effort low) |
| Subsequent responses | Faster - KV cache holds the session context |
Why first responses feel slow: Claude Code sends a 10–40K token system prompt at the start of every session. All of that has to be prefilled before your first answer comes back. Subsequent prompts in the same session reuse the KV cache and respond noticeably faster, which is why /init and re-using the same session both pay off.
The unified memory architecture is doing a lot of heavy lifting here, which is also why Apple’s Mac Mini and Mac Studio products have been flying off shelves lately. GPU and CPU share the same pool, so there’s no transfer bottleneck between discrete VRAM and system RAM the way there would be on a desktop with a dedicated card, such as a gaming PC.
Troubleshooting (pain I felt)
Below are some of the specific problems I hit during the initial setup and iterations, and what fixed them.
Issue 1: Claude Code still shows Sonnet 4.6 after setting env vars
Symptom: Bottom-left of the UI still says Sonnet 4.6 · API Usage Billing.
Cause: Env vars not live in the current terminal, or ANTHROPIC_MODEL wasn’t set.
Fix: echo $ANTHROPIC_BASE_URL to confirm it’s set; source ~/.zshrc if not. Confirm LM Studio is up with curl http://localhost:1234/v1/models. Make sure ANTHROPIC_MODEL matches the id returned by that curl, character-for-character. Relaunch claude from the same terminal.
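A quick way to check all of that at once (assuming `jq` is installed; otherwise eyeball the raw JSON):

```bash
# Print the env vars, then list the model ids LM Studio is actually serving
echo "BASE_URL=$ANTHROPIC_BASE_URL MODEL=$ANTHROPIC_MODEL"
curl -s http://localhost:1234/v1/models | jq -r '.data[].id'
```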
Issue 2: First response takes 5+ minutes
Symptom: Claude Code hangs for several minutes on the first prompt of a session.
Causes & Fixes:
- Multiple models loaded in LM Studio - combined weight pushed past available RAM into swap. Eject everything except the model you’re using.
- High effort mode — run `/effort low`.
- Cold prefill of the 10–40K-token system prompt — this is normal, especially on the first prompt. Subsequent prompts are faster and `/init` can help reduce it further.
- Default batch size of 512 — bump to `1024` in LM Studio Load settings.
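You can audit and eject loaded models from the terminal as well (command names from LM Studio’s `lms` CLI; check `lms --help` if your version differs):

```bash
lms ps            # list models currently loaded in memory
lms unload --all  # eject everything, then reload only the one you need
```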
Issue 3: Jinja template error on first prompt
Symptom: LM Studio dev logs show Unknown StringValue filter: safe.
Cause: The Qwen3-Coder GGUF ships a Jinja template that uses a filter LM Studio’s template engine doesn’t support.
Fix: Switch the prompt template from Jinja to Manual → ChatML, confirm the `<|im_start|>`/`<|im_end|>` tags and stop strings, then eject and reload the model. Full steps in Step 3.
Issue 4: Model describes actions but doesn’t actually execute them
Symptom: The model says “I’ll read that file for you” and then…nothing. No file actually opened, no tool call in the LM Studio logs. Just confusion.
Cause: Either you’re behind Ollama’s translation layer, or the GGUF you’re using predates the Unsloth tool-calling fixes.
Fix: Use LM Studio (not Ollama) for the native Anthropic endpoint, and use the unsloth GGUF specifically (not lmstudio-community or mradermacher). This is the whole reason the post exists.
Issue 5: ANTHROPIC_MODEL value doesn’t take effect
Symptom: Claude Code routes to the wrong model or errors out at startup.
Fix: Copy the id straight from the curl /v1/models response. The display name in LM Studio’s UI is sometimes formatted differently (version suffixes, capitalization) and the env var has to match the API id exactly.
TL;DR
Essentially, most guides point Claude Code at Ollama, set a couple of env vars, and call it done. It looks like it works, but the agentic loop is silently broken. LM Studio’s native Anthropic API + the Unsloth Qwen3-Coder GGUF are what separated my final working setup from a demo.
I have been playing around with this since getting it set up, and it has been incredibly useful for local coding tasks with an agentic boost. While it will never be as powerful as a full cloud model, not every task needs it to be (which also saves me usage and a few dollars in API credits).
My next step, independent of Claude Code, is exploring static malware analysis with “abliterated” models, such as deobfuscating complicated Base64-encoded commands to determine their functionality. I am also hoping these models will let me dive deeper into ethical research on different attack methodologies via malware generation.
Enjoy those tokens.
ZB