How to Set Up a Local LLM Inference Server with llama.cpp or vLLM
Running a local LLM inference server on your own GPU gives you a private, always-on API with no token costs and no rate limits. This guide covers the full setup: picking an inference engine (llama.cpp or vLLM), choosing the right model for your hardware, exposing a secure public endpoint via Cloudflare Tunnel, and connecting Claude Code to your self-hosted model.
The setup is the same whether you’re on a consumer RTX card or a rack-mounted DGX with H100s. The hardware tier changes which models you can run and how fast they run - the architecture stays the same.
What You’ll End Up With
- A local LLM server running on your GPU
- A public HTTPS endpoint via Cloudflare Tunnel (no port forwarding, no static IP required)
- Bearer token auth so only you can use it
- Claude Code configured to route requests to your local model
How Much GPU VRAM Do You Need to Run a Local LLM?
The same stack runs across a wide range of hardware. What changes is which models fit and how fast they run.
| Hardware | Unified/VRAM | Notes |
|---|---|---|
| Apple M4 | 16-32 GB unified | ~60-80 tok/s on 8B, 153 GB/s memory bandwidth |
| Apple M4 Pro | up to 64 GB unified | ~80-100 tok/s on 8B, 273 GB/s bandwidth |
| Apple M4 Max | up to 128 GB unified | Fits 70B models, 546 GB/s bandwidth |
| Apple M5 | up to 32 GB unified | ~19% faster than M4 in token generation, 153.6 GB/s |
| Apple M5 Pro | up to 64 GB unified | 307 GB/s bandwidth, Fusion Architecture |
| Apple M5 Max | up to 128 GB unified | 614 GB/s bandwidth, fastest consumer Apple Silicon for inference |
| GTX 1080 | 8 GB VRAM | No tensor cores, Pascal architecture - usable but slow |
| RTX 4070 | 12 GB VRAM | Ada Lovelace, 4th gen tensor cores - solid single-user server |
| RTX 4090 | 24 GB VRAM | Fits 34B+ models quantized, ~120 tok/s on 8B |
| RTX 5090 (Blackwell) | 32 GB VRAM | Latest consumer flagship, FP4 support |
| A100 / H100 | 40-80 GB VRAM | Data center cards - run 70B models at full precision |
| NVIDIA DGX | 320-640 GB VRAM | Multi-GPU systems, large model inference at scale |
Apple Silicon deserves a specific callout. The M-series chips use unified memory shared between CPU and GPU - a MacBook Pro M5 Max with 128 GB can fit a 70B model that wouldn’t come close to fitting on most discrete GPUs. The M5 Pro and M5 Max use Apple’s new Fusion Architecture (two 3nm dies) which pushes memory bandwidth to 307-614 GB/s - bandwidth being the primary bottleneck for LLM token generation.
The M5 generation is roughly 19-27% faster than M4 for token generation, and Apple’s own benchmarks show 4x faster time-to-first-token on Qwen3-14B. If you’re on Apple Silicon, the Metal backend in llama.cpp is well-optimized and included automatically in the Homebrew install.
For NVIDIA, the RTX 4070 hits the sweet spot for a personal inference server - 12 GB fits 8B-14B models comfortably, tensor cores are fast, and it runs on a standard desktop PSU. If you’re on a DGX or high-end data center card, vLLM with tensor parallelism is the right serving layer rather than llama.cpp.
This guide uses the RTX 4070 and GTX 1080 as concrete examples, but the steps apply regardless of what you’re running on.
What Models Fit
RTX 4070 (12 GB VRAM)
| Model | Quantization | VRAM Used | Speed |
|---|---|---|---|
| Llama 3.1 8B | Q4_K_M | ~5 GB | ~65 tok/s |
| Llama 3.1 8B | Q5_K_M | ~6 GB | ~55 tok/s |
| Mistral 7B | Q4_K_M | ~5 GB | ~58 tok/s |
| Qwen2.5 14B | Q4_K_M | ~9 GB | ~35 tok/s |
GTX 1080 (8 GB VRAM)
| Model | Quantization | VRAM Used | Speed |
|---|---|---|---|
| Llama 3.1 8B | Q4_K_M | ~5 GB | ~20 tok/s |
| Qwen2.5 7B | Q4_K_M | ~5 GB | ~18 tok/s |
| Gemma 3 4B | Q4_K_M | ~3 GB | ~28 tok/s |
The 1080 can run useful models - 20 tokens/second is readable in real time - but you’re limited to 7B-8B models at 4-bit quantization. Anything larger won’t fit without CPU offloading, which tanks throughput.
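The VRAM figures in these tables follow a simple rule of thumb: the weights take roughly (parameters × bits per weight) / 8 bytes, plus headroom for the KV cache and compute buffers. A back-of-the-envelope sketch - the ~4.8 bits/weight figure for Q4_K_M and the 1 GB overhead at a 4K context are rough assumptions, not exact numbers:

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     overhead_gb: float = 1.0) -> float:
    """Rough VRAM estimate for a quantized model.

    Weights: params (in billions) * bits per weight / 8 gives GB.
    Overhead: rough allowance for KV cache and compute buffers.
    """
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb

# Q4_K_M averages roughly 4.8 bits per weight
print(round(estimate_vram_gb(8, 4.8), 1))   # Llama 3.1 8B -> about 5.8 GB
print(round(estimate_vram_gb(14, 4.8), 1))  # Qwen2.5 14B  -> about 9.4 GB
```

If the estimate lands near your card’s limit, drop to a lower-bit quantization or plan on offloading some layers to CPU.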
llama.cpp vs vLLM: Which Inference Engine Should You Use?
Use llama.cpp if:
- Single user (just you)
- Consumer GPU, especially the GTX 1080
- You want minimal setup and low memory overhead
- You need quantized models (GGUF format)
Use vLLM if:
- Multiple concurrent users or requests
- RTX 4070 or better - vLLM requires CUDA compute capability 7.0+ (Volta or newer), which rules out the Pascal-era GTX 1080 (compute 6.1)
- You want OpenAI-compatible API out of the box with continuous batching
For the GTX 1080, use llama.cpp. For the RTX 4070 with single-user use, either works - llama.cpp is simpler. For production or multi-user, vLLM on the 4070.
Option A: llama.cpp Server
Install
The easiest path is via a package manager - these track releases automatically and handle build dependencies for you.
macOS / Linux (Homebrew)
brew install llama.cpp
Windows
winget install llama.cpp
macOS (MacPorts)
sudo port install llama.cpp
Linux (Nix)
nix profile install nixpkgs#llama-cpp
On Apple Silicon, the Homebrew build includes Metal support by default - no extra flags needed. On Linux/Windows with NVIDIA, if your package manager build doesn’t include CUDA support, build from source:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)
Full build docs: llama.cpp install guide
If you hit CUDA errors, verify your toolkit matches your driver:
nvcc --version
nvidia-smi
Download a model
# Using huggingface-cli
pip install huggingface_hub
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
--include "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf" \
--local-dir ./models
Start the server
./build/bin/llama-server \
-m ./models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
--host 127.0.0.1 \
--port 8080 \
-ngl 99 \
--ctx-size 4096
-ngl 99 offloads all layers to GPU. Reduce this number if you run out of VRAM - each layer you move to CPU costs throughput. With the RTX 4070 and an 8B Q4 model, 99 fits comfortably.
Test it:
curl http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "local",
"messages": [{"role": "user", "content": "Say hello"}]
}'
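The same endpoint is just as easy to call from a script with nothing but the standard library. A minimal sketch, assuming the server from the previous step is running on 127.0.0.1:8080 - the helper names here are illustrative, not part of any API:

```python
import json
from urllib import request

BASE_URL = "http://127.0.0.1:8080"  # assumes llama-server from the step above

def build_payload(prompt: str) -> dict:
    """OpenAI-style chat completion request body."""
    return {"model": "local", "messages": [{"role": "user", "content": prompt}]}

def extract_reply(response: dict) -> str:
    """Pull the assistant text out of an OpenAI-style response."""
    return response["choices"][0]["message"]["content"]

def chat(prompt: str) -> str:
    """POST one chat message and return the model's reply."""
    req = request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return extract_reply(json.load(resp))

if __name__ == "__main__":
    print(chat("Say hello"))
```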
Option B: vLLM Server (RTX 4070)
pip install vllm
Start the server:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--host 127.0.0.1 \
--port 8080 \
--dtype float16 \
--max-model-len 4096
vLLM downloads models from Hugging Face automatically. Set your token first:
export HF_TOKEN=your_token_here
Expose via Cloudflare Tunnel
No static IP, no router config, no open ports. Cloudflare Tunnel creates an outbound-only connection from your machine to Cloudflare’s edge.
Install cloudflared
# macOS
brew install cloudflared
# Linux
sudo curl -L https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64 \
-o /usr/local/bin/cloudflared
sudo chmod +x /usr/local/bin/cloudflared
Authenticate and create a tunnel
cloudflared tunnel login
cloudflared tunnel create llm-server
Configure the tunnel
Create ~/.cloudflared/config.yml:
tunnel: llm-server
credentials-file: /home/youruser/.cloudflared/<tunnel-id>.json
ingress:
- hostname: llm.yourdomain.com
service: http://127.0.0.1:8080
- service: http_status:404
If you don’t have a custom domain, skip the named tunnel and use a quick tunnel instead - Cloudflare assigns a random *.trycloudflare.com URL (ephemeral; it changes on every run):
cloudflared tunnel --url http://127.0.0.1:8080
Point your domain
In your Cloudflare DNS dashboard, add a CNAME:
- Name: llm
- Target: <tunnel-id>.cfargotunnel.com
Start the tunnel
cloudflared tunnel run llm-server
Your server is now reachable at https://llm.yourdomain.com.
Add Bearer Token Auth
llama.cpp’s built-in --api-key flag is the simplest approach:
./build/bin/llama-server \
-m ./models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
--host 127.0.0.1 \
--port 8080 \
-ngl 99 \
--api-key your-secret-token-here
Any request without Authorization: Bearer your-secret-token-here gets a 401. Test it:
curl https://llm.yourdomain.com/v1/chat/completions \
-H "Authorization: Bearer your-secret-token-here" \
-H "Content-Type: application/json" \
-d '{"model":"local","messages":[{"role":"user","content":"Hello"}]}'
For vLLM, use the same flag:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--host 127.0.0.1 \
--port 8080 \
--api-key your-secret-token-here
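Don’t hand-pick the token - generate a long random one. One way, using Python’s standard library (`openssl rand -hex 32` works just as well):

```python
import secrets

# 64 hex characters (32 random bytes) - paste this into --api-key
print(secrets.token_hex(32))
```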
Point Claude Code at Your Server
Claude Code reads its endpoint and key from environment variables rather than dedicated settings keys. One caveat: Claude Code speaks the Anthropic Messages API, while llama.cpp and vLLM expose the OpenAI-style /v1/chat/completions endpoint, so you may need a translation proxy such as LiteLLM in front of your server. Set the variables in ~/.claude/settings.json via the env block:
{
"env": {
"ANTHROPIC_BASE_URL": "https://llm.yourdomain.com",
"ANTHROPIC_API_KEY": "your-secret-token-here"
}
}
Or set it via environment variables before launching Claude Code:
export ANTHROPIC_BASE_URL=https://llm.yourdomain.com
export ANTHROPIC_API_KEY=your-secret-token-here
Claude Code will route requests to your local server instead of Anthropic’s API.
When to Use Your Local Server vs Claude Code’s Default
Your local 8B model is fast, private and free to run. Claude is better at complex reasoning, large context windows and multi-step tasks. The practical split:
| Task | Local Model | Claude |
|---|---|---|
| Quick drafts, summarization | Yes | Overkill |
| Code autocomplete, small edits | Yes | Overkill |
| Architecture decisions | No | Yes |
| Debugging complex issues | No | Yes |
| Sensitive data you can’t send to the cloud | Yes | No |
| Burning through tokens on repetitive tasks | Yes | Expensive |
Claude Code remains the better choice for anything complex. The local server earns its place for fast, private, repetitive work where you don’t want to hit the API.
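If you script against both backends, the table above can be encoded as a tiny dispatch helper - the task labels here are illustrative, not any tool’s API:

```python
# Tasks the local 8B model handles well; everything else goes to Claude.
# These labels are made up for this sketch - adapt them to your workflow.
LOCAL_TASKS = {"draft", "summarize", "autocomplete", "small-edit", "sensitive", "bulk"}

def pick_backend(task: str) -> str:
    """Route a task label to 'local' or 'claude' per the split above."""
    return "local" if task in LOCAL_TASKS else "claude"
```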
Run It as a Service
To keep the server running after logout, create a systemd service (Linux):
# /etc/systemd/system/llm-server.service
[Unit]
Description=llama.cpp LLM Server
After=network.target
[Service]
ExecStart=/home/youruser/llama.cpp/build/bin/llama-server \
-m /home/youruser/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
--host 127.0.0.1 --port 8080 -ngl 99 \
--api-key your-secret-token-here
Restart=always
User=youruser
[Install]
WantedBy=multi-user.target
sudo systemctl enable llm-server
sudo systemctl start llm-server
Do the same for cloudflared:
sudo cloudflared service install
Frequently Asked Questions
Does llama.cpp work on Apple Silicon / Mac? Yes, and Apple Silicon is one of the best platforms for local LLM inference. llama.cpp’s Metal backend runs natively on M-series chips. The unified memory architecture is the key advantage - an M5 Max with 128 GB can run a 70B model that won’t fit on a 24 GB discrete GPU. The M5 Pro and M5 Max push memory bandwidth to 307-614 GB/s, which directly translates to faster token generation. The Homebrew install includes Metal support automatically - no extra configuration needed. Expect 60-100+ tok/s on 8B models depending on your chip.
Can I run a local LLM without a GPU? Yes, llama.cpp supports CPU-only inference. It’s much slower - expect 3-8 tokens/second on a modern CPU vs 20-65+ on a GPU - but functional for testing or low-frequency use.
What is the minimum VRAM to run a local LLM? 8 GB VRAM is the practical minimum for useful models. With 8 GB you can run 7B-8B models at Q4 quantization. Less than 8 GB forces you to smaller models or heavy CPU offloading.
Is llama.cpp or vLLM better for a single user on a consumer GPU? llama.cpp. It’s simpler to set up, supports GGUF quantized models, has lower overhead and performs well for single-user workloads. vLLM’s advantages (continuous batching, tensor parallelism) only matter when serving multiple concurrent users.
Can I use my local LLM server with Claude Code?
Yes, with one caveat: Claude Code expects the Anthropic Messages API, while llama.cpp and vLLM serve the OpenAI-compatible format, so a translation proxy (e.g. LiteLLM) typically sits in between. Set ANTHROPIC_BASE_URL to your endpoint and ANTHROPIC_API_KEY to your bearer token before launching.
How do I expose my local LLM server to the internet safely?
Cloudflare Tunnel is the recommended approach - it creates an outbound-only encrypted connection with no open ports or static IP required. Pair it with the --api-key flag in llama.cpp or vLLM to require bearer token authentication on every request.
What models work best on an RTX 4070? Llama 3.1 8B at Q4_K_M or Q5_K_M quantization gives the best balance of speed (~55-65 tok/s) and quality. Qwen2.5 14B fits at Q4_K_M (~9 GB VRAM) if you want a larger model at the cost of some throughput.
Up Next
Multi-GPU setups introduce a real architectural choice: run one large model split across both GPUs using tensor parallelism, or run two separate models and route between them. The right answer depends on your VRAM, the models you’re running and your use case. That’s the next guide.
Sources:
- GPU Ranking for Local LLMs - Puget Systems
- LLM Consumer GPU Benchmarks - LocalScore
- Llama 3.1 8B Benchmarks - Red Hat: vLLM or llama.cpp