Running Claude Code Locally with LM Studio on Apple Silicon
Most guides for running Anthropic’s Claude Code against a local model point you at Ollama, tell you to set a couple of env vars, and consider it done. And it looks like it works - you can chat with it - but the agentic loop is silently broken. The model can talk to you, but it can’t actually do anything: no file reads, no real tool calls, no multi-step task execution. Just a polite chatbot wearing Claude Code’s UI as a costume (and turning my MacBook into a mini space heater).
This post walks through the setup I eventually landed on that actually works on Apple Silicon: LM Studio + the Unsloth GGUF version of the Qwen3-Coder-30B-A3B model, running entirely locally on a 14” M5 Pro MacBook Pro with 64GB of unified memory. Full agentic loop, no API costs, no rate limits, no data leaving the machine.
Why bother running it locally
One of the key advantages of running local large language models (LLMs) is privacy, which can also be a key component in DFIR work. When dealing with sensitive client info, or even malware, cloud models introduce risk and restrictions. I wanted to test for myself what these local models could do, which is what led me to test Claude Code locally on my MacBook Pro in the first place. If you also have capable hardware (and hate the direction of Anthropic’s pricing and plans), it can be worthwhile to explore local models for lighter agentic work.
While this post doesn’t cover it, one of the other benefits to local models can be the ability to run “abliterated” versions, which are models that have had their refusal & safety behavior weakened or removed after training. These can be very useful for malware decoding and analysis where normal cloud-based models, like OpenAI’s ChatGPT and Google’s Gemini, will refuse. These would be run independently, not via the Claude Code process outlined below.
Why LM Studio, not Ollama
This is the part I learned the hard way, after about an hour of wondering why Ollama was spitting back gibberish after chewing on the question “what files are in this directory?” for 5-10 minutes.
Claude Code is built around Anthropic’s Messages API, which uses structured tool_use and tool_result blocks for every agentic action - essentially every Bash command, file read, and edit. The model’s response isn’t just text; it’s a sequence of typed content blocks that the CLI parses and dispatches.
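For context, this is roughly what one of those tool_use blocks looks like inside a Messages API response (simplified for illustration; the exact tool names and payloads Claude Code uses will vary):

```json
{
  "role": "assistant",
  "content": [
    { "type": "text", "text": "I'll list the files in this directory." },
    {
      "type": "tool_use",
      "id": "toolu_01A",
      "name": "Bash",
      "input": { "command": "ls -la" }
    }
  ]
}
```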
Ollama serves an OpenAI-compatible endpoint and translates Anthropic-shaped requests on the fly. That translation layer doesn’t preserve the tool-call blocks cleanly. The model emits something that looks like a tool call, the adapter mangles it, Claude Code can’t parse it, and the agentic loop breaks. You get a model that says “I’ll check that file for you” and then nothing happens. Super fun.
LM Studio 0.4.1 added a native Anthropic Messages API at /v1/messages. Claude Code talks to it the same way it talks to Anthropic’s hosted API, and tool calls round-trip correctly. No adapter or translation needed.
Prerequisites
- macOS (this walkthrough was done on macOS 26.4)
- LM Studio 0.4.1 or later — earlier versions don’t expose the native Anthropic endpoint
- An Anthropic account — Pro, Max, Team, Enterprise, or Console. Free tier doesn’t include Claude Code access. You only need to authenticate once, then redirect everything local via env vars.
- Terminal access (Terminal, iTerm2, whatever you use)
- Apple Silicon with enough unified memory for the model you want. I’m on a 14” M5 Pro with 64GB; the model recommendations below scale by RAM tier.
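If you’re not sure how much unified memory your machine has, a quick check from the terminal:

```bash
# Total unified memory, reported in bytes and converted to GB
sysctl -n hw.memsize | awk '{ printf "%.0f GB\n", $1 / 1024^3 }'
```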
Step-by-step setup
Step 1: Install Claude Code
```bash
curl -fsSL https://claude.ai/install.sh | bash
```
Verify the install:
```bash
claude doctor
```
This checks installation health and surfaces config issues. On the first run it’ll authenticate against Anthropic and that’s expected. The redirect to local happens via env vars in Step 5.
Step 2: Pick and download a model
The sad reality is that what you can run depends on how much unified memory you have. Here’s a rough guide:
| RAM | Recommended model | Quant | Size |
|---|---|---|---|
| 24GB | Qwen3.5-35B-A3B | Q4_K_M | ~22GB |
| 64GB (my setup) | Qwen3-Coder-30B-A3B | UD Q4_K_XL | ~17.67GB |
| 64GB (more intensive) | Qwen3.5-27B dense | Q8_0 | ~30GB |
| 128GB+ | Qwen3-Coder-Next 80B | Q4_K_M | ~48GB |
I went with Qwen3-Coder-30B-A3B for the agentic Claude Code use case. A few reasons for this:
- Purpose-built for agentic coding, tool calling, and multi-file reasoning
- Mixture of Experts (MoE) architecture - 30B total params but only 3B active per token, so prefill is fast
- Native 256K context support
- No “thinking” mode - less overhead per turn, which matters when you’re firing off tool calls in a loop
In LM Studio’s model search, look for Qwen3-Coder-30B-A3B and pick:
- Author: `unsloth`
- Repo: `Qwen3-Coder-30B-A3B-Instruct-GGUF`
- Quant: `UD Q4_K_XL` (~17.67GB)
UD refers to Unsloth’s “dynamic” quantization, which uses layer-aware compression to retain more model quality than standard Q4 while staying around the same file size. On a 64GB MacBook such as mine, that should leave roughly 46GB available for macOS, KV cache, and other apps running alongside the model.
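If you want to sanity-check that headroom once the model is loaded, `vm_stat` gives a rough estimate (a quick sketch; Apple Silicon uses 16KB pages):

```bash
# Estimate available memory: free + inactive pages x 16KB page size
vm_stat | awk '/Pages free|Pages inactive/ { gsub(/\./, "", $NF); sum += $NF }
               END { printf "~%.1f GB available\n", sum * 16384 / 1e9 }'
```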
Step 3: Configure the model in LM Studio
Once downloaded, open the model’s settings panel. Two tabs matter, Load and Inference (plus the prompt template). I arrived at most of these settings through a combination of trial and error and research, then sanity-checked them by asking Claude.
Load tab:
| Setting | Value | Notes |
|---|---|---|
| Context Length | 32768 | 32K is the sweet spot. Push to 65536 if you keep hitting limits. |
| GPU Offload | Max / -1 | Full Metal offload so model fits in unified memory. |
| Evaluation Batch Size | 1024 | Default is 512. Doubling this noticeably speeds up prefill, which is relevant for Claude Code’s large system prompt. |
| Unified KV Cache | On | Default, leave it. |
| Offload KV Cache to GPU | On | Default, leave it. |
| Keep Model in Memory | On | Avoids cold-load delays between sessions. |
| Flash Attention | On | Reduces memory pressure at long contexts. |
| K/V Cache Quantization | Off | Experimental, leave off for stability. |
| Try mmap() | On | Default |
| Number of Experts | 8 | Correct for this model, don’t change. |
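If you prefer setting this up headlessly, LM Studio’s `lms` CLI can load the model with some of these options from the terminal (the flag names here are assumptions based on recent versions; verify with `lms load --help`):

```bash
# Load with full GPU offload and a 32K context window
# (flags assumed from recent lms versions; check `lms load --help`)
lms load qwen3-coder-30b-a3b-instruct --gpu max --context-length 32768
```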
Inference tab:
| Setting | Value | Notes |
|---|---|---|
| Temperature | 0.7 | Qwen’s official recommendation for the Coder series. |
| Top K | 20 | Down from default 40 - keeps tool calls tight. |
| Top P | 0.80 | Down from default 0.95 - also Qwen’s recommendation. |
| Repeat Penalty | 1.05 | Down from default 1.1 - discourages repetition in long sessions. |
| Min P | Off / 0 | Disable. Can interfere with tool-call format. |
| Reasoning Section Parsing | Off | This model has no <think> blocks. |
| Structured Output | Off | Claude Code handles its own structure, enabling this breaks tool calls. |
Prompt Template tab:
The default Jinja template included with the GGUF version of this model uses an unsupported `safe` filter, which causes an error on the first prompt. This one was a major headache to identify, but luckily an easy fix:
```
[ERROR] Error rendering prompt with jinja template:
"Unknown StringValue filter: safe"
```
Fix it manually:
- Switch from Template (Jinja) to Manual
- Pick ChatML from the dropdown
- Confirm the start/end tags populate as `<|im_start|>` / `<|im_end|>` for system, user, and assistant
- Confirm the stop strings include `<|im_start|>` and `<|im_end|>`
- Eject and reload the model
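For reference, the ChatML format those tags produce looks like this on the wire:

```
<|im_start|>system
...Claude Code's system prompt...<|im_end|>
<|im_start|>user
what files are in this directory?<|im_end|>
<|im_start|>assistant
```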
Step 4: Start the LM Studio server
Either flip the server toggle on in LM Studio’s Developer tab, or start it from the terminal:
```bash
lms server start
```
Default port is 1234. Verify the model is loaded and reachable:
```bash
curl http://localhost:1234/v1/models
```
Expected output:
```json
{
  "data": [
    {
      "id": "qwen3-coder-30b-a3b-instruct",
      "object": "model",
      "owned_by": "organization_owner"
    }
  ]
}
```
Write down the exact `id` value - it should match the model you chose, and you’ll need it character-for-character for the env var in the next step.
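Before involving Claude Code at all, you can also hit the Anthropic-style endpoint directly (a minimal sketch; since LM Studio doesn’t validate auth locally, the dummy key is arbitrary):

```bash
curl http://localhost:1234/v1/messages \
  -H "content-type: application/json" \
  -H "x-api-key: lmstudio" \
  -d '{
        "model": "qwen3-coder-30b-a3b-instruct",
        "max_tokens": 64,
        "messages": [{ "role": "user", "content": "Reply with OK" }]
      }'
```

A JSON message back (rather than an error) means the endpoint is live.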
While you’re in the Developer tab, two server settings worth tweaking:
- Just-in-Time Model Loading: `Off` — keeps the model resident in memory between Claude Code prompts instead of reloading on each request.
- Require Authentication: `Off` — a dummy token works fine locally, so there’s no need for the overhead.
Step 5: Set environment variables
Open your shell rc file:
```bash
nano ~/.zshrc
```
Add these three lines at the bottom (using the model you chose):
```bash
export ANTHROPIC_BASE_URL=http://localhost:1234
export ANTHROPIC_AUTH_TOKEN=lmstudio
export ANTHROPIC_MODEL=qwen3-coder-30b-a3b-instruct
```
The AUTH_TOKEN value doesn’t matter (LM Studio doesn’t validate it locally) but Claude Code refuses to start without one set. The MODEL value must match the id from Step 4 exactly.
Save (Control+O, Enter, Control+X), then apply to the current shell:
```bash
source ~/.zshrc
```
Verify:
```bash
echo $ANTHROPIC_BASE_URL
# http://localhost:1234
```
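One side effect of these exports: every `claude` run in that shell is now routed local. A small helper function (purely a convenience sketch, not part of Claude Code) lets you fall back to the hosted API on demand:

```bash
# Run Claude Code against Anthropic's hosted API for this invocation only.
# The subshell keeps the local-routing vars intact in your main shell.
claude-cloud() {
  ( unset ANTHROPIC_BASE_URL ANTHROPIC_AUTH_TOKEN ANTHROPIC_MODEL; claude "$@" )
}
```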
Step 6: Launch Claude Code
```bash
cd /your/project
claude
```
The very first thing to check: the bottom-left of the Claude Code UI. If it shows the LM Studio model id (e.g. qwen3-coder-30b-a3b-instruct), you’re routed local. If it still shows something like Sonnet 4.6 · API Usage Billing, the env vars didn’t take in this terminal session - back to Step 5 you go.
From what I’ve seen, it’s advisable to set effort to low for routine tasks - local models can’t match hosted Sonnet at high effort, and low is the sweet spot for prefill speed:
```
/effort low
```
Smoke-test the agentic loop with something concrete:
```
what files are in this directory?
```
Watch LM Studio’s developer logs, which are visible under the Developer tab. You should see prefill, generation, and a tool call go out. Claude Code should also come back with actual filenames, not a description of what it would do if it could read files. If the model just narrates what it’s about to do without anything happening, the tool-call format is broken and you’ll need to re-check the GGUF (Step 2) and the prompt template (Step 3). The speed at which Claude Code responds will also be heavily dependent on your hardware and the model you chose.
Finally, you can generate a CLAUDE.md for your project if you’re already in the folder you wish to code within:
```
/init
```
Claude Code reads this file on every session start, which lets the model skip a chunk of the cold-start exploration it did initially.
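The generated file is plain markdown. A stripped-down illustration of the kind of content it holds (hypothetical project details; `/init` builds a richer version from your actual repo):

```markdown
# CLAUDE.md
- Python 3.12 project; dependencies pinned in requirements.txt
- Run tests with `pytest -q` before committing
- Entry point: src/main.py; CLI parsing lives in src/cli.py
```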
Performance expectations
For my setup (M5 Pro, 64GB, Qwen3-Coder-30B-A3B UD Q4_K_XL), these figures line up with the consensus I found online, which helps me confirm everything is working as it should:
| Metric | Value |
|---|---|
| Model size in memory | ~17.67GB |
| macOS overhead | ~8–10GB |
| Total memory pressure | ~26–28GB (comfortable on 64GB) |
| GPU utilization during inference | ~100% |
| GPU power draw | ~33W |
| GPU temperature under load | ~91°C (safe — M5 Pro throttles around 105°C) |
| Prefill speed | ~100 tok/s |
| First response time (cold) | 20–30 seconds (after /effort low) |
| Subsequent responses | Faster - KV cache holds the session context |
Why first responses feel slow: Claude Code sends a 10–40K token system prompt at the start of every session. All of that has to be prefilled before your first answer comes back. Subsequent prompts in the same session reuse the KV cache and respond noticeably faster, which is why /init and re-using the same session both pay off.
The unified memory architecture is doing a lot of heavy lifting here, which is also why Apple’s Mac Mini and Mac Studio products have been flying off shelves lately. GPU and CPU share the same pool, so there’s no transfer bottleneck between discrete VRAM and system RAM the way there would be on a desktop with a dedicated card, such as a gaming PC.
Troubleshooting (pain I felt)
Below are some of the specific problems I hit during the initial setup and iterations, and what fixed them.
Issue 1: Claude Code still shows Sonnet 4.6 after setting env vars
Symptom: Bottom-left of the UI still says Sonnet 4.6 · API Usage Billing.
Cause: Env vars not live in the current terminal, or ANTHROPIC_MODEL wasn’t set.
Fix: echo $ANTHROPIC_BASE_URL to confirm it’s set; source ~/.zshrc if not. Confirm LM Studio is up with curl http://localhost:1234/v1/models. Make sure ANTHROPIC_MODEL matches the id returned by that curl, character-for-character. Relaunch claude from the same terminal.
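A quick way to check all of that at once (assuming `jq` is installed; otherwise eyeball the raw JSON):

```bash
# Print the env vars, then list the model ids LM Studio is actually serving
echo "BASE_URL=$ANTHROPIC_BASE_URL MODEL=$ANTHROPIC_MODEL"
curl -s http://localhost:1234/v1/models | jq -r '.data[].id'
```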
Issue 2: First response takes 5+ minutes
Symptom: Claude Code hangs for several minutes on the first prompt of a session.
Causes & Fixes:
- Multiple models loaded in LM Studio - combined weight pushed past available RAM into swap. Eject everything except the model you’re using.
- High effort mode — run `/effort low`.
- Cold prefill of the 10–40K-token system prompt — this is normal, especially on the first prompt. Subsequent prompts are faster and `/init` can help reduce it further.
- Default batch size of 512 — bump to `1024` in LM Studio Load settings.
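You can audit and eject loaded models from the terminal as well (command names from LM Studio’s `lms` CLI; check `lms --help` if your version differs):

```bash
lms ps            # list models currently loaded in memory
lms unload --all  # eject everything, then reload only the one you need
```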
Issue 3: Jinja template error on first prompt
Symptom: LM Studio dev logs show Unknown StringValue filter: safe.
Cause: The Qwen3-Coder GGUF ships a Jinja template that uses a filter LM Studio’s template engine doesn’t support.
Fix: Switch the prompt template from Jinja to Manual → ChatML, confirm the `<|im_start|>`/`<|im_end|>` tags and stop strings, then eject and reload the model. Full steps in Step 3.
Issue 4: Model describes actions but doesn’t actually execute them
Symptom: The model says “I’ll read that file for you” and then…nothing. No file actually opened, no tool call in the LM Studio logs. Just confusion.
Cause: Either you’re behind Ollama’s translation layer, or the GGUF you’re using predates the Unsloth tool-calling fixes.
Fix: Use LM Studio (not Ollama) for the native Anthropic endpoint, and use the unsloth GGUF specifically (not lmstudio-community or mradermacher). This is the whole reason the post exists.
Issue 5: ANTHROPIC_MODEL value doesn’t take effect
Symptom: Claude Code routes to the wrong model or errors out at startup.
Fix: Copy the id straight from the curl /v1/models response. The display name in LM Studio’s UI is sometimes formatted differently (version suffixes, capitalization) and the env var has to match the API id exactly.
TL;DR
Essentially, most guides point Claude Code at Ollama, set a couple of env vars, and call it done. It looks like it works, but the agentic loop is silently broken. LM Studio’s native Anthropic API + the Unsloth Qwen3-Coder GGUF are what separated my final working setup from a demo.
I have been playing around with this since getting it set up, and it has been incredibly useful for local coding tasks with an agentic boost. While it will never be as powerful as a full cloud model, not every task needs it to be (which also saves me usage and a few dollars in API credits).
My next step, independent of Claude Code, is exploring static malware analysis with “abliterated” models, such as deobfuscating complicated Base64-encoded commands to determine their functionality. I am also hoping these models will let me dive deeper into ethical research on different attack methodologies via malware generation.
Enjoy those tokens.
ZB