#LLM #llama.cpp #vLLM #Cloudflare #Claude Code #Self-hosted #AI

How to Set Up a Local LLM Inference Server with llama.cpp or vLLM

Cavan Page

Running a local LLM inference server on your own GPU gives you a private, always-on API with no token costs and no rate limits. This guide covers the full setup: picking an inference engine (llama.cpp or vLLM), choosing the right model for your hardware, exposing a secure public endpoint via Cloudflare Tunnel and connecting Claude Code to your self-hosted model.

The setup is the same whether you’re on a consumer RTX card or a rack-mounted DGX with H100s. The hardware tier changes what models you can run and how fast - the architecture stays the same.

What You’ll End Up With

  • A local LLM server running on your GPU
  • A public HTTPS endpoint via Cloudflare Tunnel (no port forwarding, no static IP required)
  • Bearer token auth so only you can use it
  • Claude Code configured to route requests to your local model

How Much GPU VRAM Do You Need to Run a Local LLM?

The same stack runs across a wide range of hardware. What changes is which models fit and how fast they run.

Hardware | Unified/VRAM | Notes
Apple M4 | 16-32 GB unified | ~60-80 tok/s on 8B, 153 GB/s memory bandwidth
Apple M4 Pro | up to 64 GB unified | ~80-100 tok/s on 8B, 273 GB/s bandwidth
Apple M4 Max | up to 128 GB unified | Fits 70B models, 546 GB/s bandwidth
Apple M5 | up to 32 GB unified | ~19% faster than M4 in token generation, 153.6 GB/s
Apple M5 Pro | up to 64 GB unified | 307 GB/s bandwidth, Fusion Architecture
Apple M5 Max | up to 128 GB unified | 614 GB/s bandwidth, fastest consumer Apple Silicon for inference
GTX 1080 | 8 GB VRAM | No tensor cores, Pascal architecture - usable but slow
RTX 4070 | 12 GB VRAM | Ada Lovelace, 4th gen tensor cores - solid single-user server
RTX 4090 | 24 GB VRAM | Fits 34B+ models quantized, ~120 tok/s on 8B
RTX 5090 (Blackwell) | 32 GB VRAM | Latest consumer flagship, FP4 support
A100 / H100 | 40-80 GB VRAM | Data center cards - run 70B models at full precision
NVIDIA DGX | 320-640 GB VRAM | Multi-GPU systems, large model inference at scale

Apple Silicon deserves a specific callout. The M-series chips use unified memory shared between CPU and GPU - a MacBook Pro M5 Max with 128 GB can fit a 70B model that wouldn’t come close to fitting on most discrete GPUs. The M5 Pro and M5 Max use Apple’s new Fusion Architecture (two 3nm dies), which pushes memory bandwidth to 307-614 GB/s - and memory bandwidth is the primary bottleneck for LLM token generation.

The M5 generation is roughly 19-27% faster than M4 for token generation, and Apple’s own benchmarks show 4x faster time-to-first-token on Qwen3-14B. If you’re on Apple Silicon, the Metal backend in llama.cpp is well-optimized and included automatically in the Homebrew install.
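If you want a quick number for your own machine, llama.cpp ships a benchmarking tool; a minimal check, assuming you’ve already downloaded a GGUF model to ./models (the download step is covered below, and the path here is just an example):

# Runs the default prompt-processing and token-generation benchmarks.
# On Apple Silicon the Metal backend is picked automatically.
llama-bench -m ./models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf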

For NVIDIA, the RTX 4070 hits the sweet spot for a personal inference server - 12 GB fits 8B-14B models comfortably, tensor cores are fast, and it runs on a standard desktop PSU. If you’re on a DGX or high-end data center card, vLLM with tensor parallelism is the right serving layer rather than llama.cpp.

This guide uses the RTX 4070 and GTX 1080 as concrete examples, but the steps apply regardless of what you’re running on.

What models fit

RTX 4070 (12 GB VRAM)

Model | Quantization | VRAM Used | Speed
Llama 3.1 8B | Q4_K_M | ~5 GB | ~65 tok/s
Llama 3.1 8B | Q5_K_M | ~6 GB | ~55 tok/s
Mistral 7B | Q4_K_M | ~5 GB | ~58 tok/s
Qwen2.5 14B | Q4_K_M | ~9 GB | ~35 tok/s

GTX 1080 (8 GB VRAM)

Model | Quantization | VRAM Used | Speed
Llama 3.1 8B | Q4_K_M | ~5 GB | ~20 tok/s
Qwen2.5 7B | Q4_K_M | ~5 GB | ~18 tok/s
Gemma 3 4B | Q4_K_M | ~3 GB | ~28 tok/s

The 1080 can run useful models - 20 tokens/second is readable in real time - but you’re limited to 7B-8B models at 4-bit quantization. Anything larger won’t fit without CPU offloading, which tanks throughput.
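If you’re unsure whether a particular model and quantization will fit your card, the simplest check is to load it and watch memory usage; a sketch for NVIDIA GPUs:

# Poll VRAM usage once per second while the model loads and during generation.
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total --format=csv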


llama.cpp vs vLLM: Which Inference Engine Should You Use?

Use llama.cpp if:

  • Single user (just you)
  • Consumer GPU, especially the GTX 1080
  • You want minimal setup and low memory overhead
  • You need quantized models (GGUF format)

Use vLLM if:

  • Multiple concurrent users or requests
  • RTX 4070 or better (vLLM requires CUDA compute capability 7.0+; the Pascal-era GTX 1080 is 6.1 and isn’t supported)
  • You want OpenAI-compatible API out of the box with continuous batching

For the GTX 1080, use llama.cpp. For the RTX 4070 with single-user use, either works - llama.cpp is simpler. For production or multi-user, vLLM on the 4070.


Option A: llama.cpp Server

Install

The easiest path is via a package manager - these track releases automatically and handle build dependencies for you.

macOS / Linux (Homebrew)

brew install llama.cpp

Windows

winget install llama.cpp

macOS (MacPorts)

sudo port install llama.cpp

Linux (Nix)

nix profile install nixpkgs#llama-cpp

On Apple Silicon, the Homebrew build includes Metal support by default - no extra flags needed. On Linux/Windows with NVIDIA, if your package manager build doesn’t include CUDA support, build from source:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

Full build docs: llama.cpp install guide

If you hit CUDA errors, verify your toolkit matches your driver:

nvcc --version
nvidia-smi

Download a model

# Using huggingface-cli
pip install huggingface_hub
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  --include "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf" \
  --local-dir ./models

Start the server

./build/bin/llama-server \
  -m ./models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --host 127.0.0.1 \
  --port 8080 \
  -ngl 99 \
  --ctx-size 4096

-ngl 99 offloads all layers to GPU. Reduce this number if you run out of VRAM - each layer you move to CPU costs throughput. With the RTX 4070 and an 8B Q4 model, 99 fits comfortably.

Test it:

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "messages": [{"role": "user", "content": "Say hello"}]
  }'
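Most clients stream responses, so it’s worth checking that path too - the same request with streaming enabled, which returns server-sent events:

curl -N http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "messages": [{"role": "user", "content": "Say hello"}],
    "stream": true
  }'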

Option B: vLLM Server (RTX 4070)

pip install vllm

Start the server:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --host 127.0.0.1 \
  --port 8080 \
  --dtype float16 \
  --max-model-len 4096

vLLM downloads models from Hugging Face automatically. The Llama repos are gated, so set your Hugging Face token before starting the server:

export HF_TOKEN=your_token_here
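The same test request from the llama.cpp section works here, with one difference: vLLM checks the model field, so it has to match the model you’re serving (or a name you set with --served-model-name):

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Say hello"}]
  }'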

Expose via Cloudflare Tunnel

No static IP, no router config, no open ports. Cloudflare Tunnel creates an outbound-only connection from your machine to Cloudflare’s edge.

Install cloudflared

# macOS
brew install cloudflared

# Linux
sudo curl -L https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64 \
  -o /usr/local/bin/cloudflared
sudo chmod +x /usr/local/bin/cloudflared

Authenticate and create a tunnel

cloudflared tunnel login
cloudflared tunnel create llm-server

Configure the tunnel

Create ~/.cloudflared/config.yml:

tunnel: llm-server
credentials-file: /home/youruser/.cloudflared/<tunnel-id>.json

ingress:
  - hostname: llm.yourdomain.com
    service: http://127.0.0.1:8080
  - service: http_status:404
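cloudflared can sanity-check the ingress rules in this file before you start anything:

cloudflared tunnel ingress validate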

If you don’t have a custom domain, Cloudflare will generate one for you (*.trycloudflare.com) - just run:

cloudflared tunnel --url http://127.0.0.1:8080

Point your domain

In your Cloudflare DNS dashboard, add a CNAME:

  • Name: llm
  • Target: <tunnel-id>.cfargotunnel.com

Start the tunnel

cloudflared tunnel run llm-server

Your server is now reachable at https://llm.yourdomain.com.
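llama.cpp’s server exposes a /health endpoint, which makes a quick end-to-end check easy:

# Should return a small JSON status object if both the tunnel and the server are up.
curl https://llm.yourdomain.com/health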


Add Bearer Token Auth

llama.cpp’s built-in --api-key flag is the simplest approach:

./build/bin/llama-server \
  -m ./models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --host 127.0.0.1 \
  --port 8080 \
  -ngl 99 \
  --api-key your-secret-token-here

Any request without Authorization: Bearer your-secret-token-here gets a 401. Test it:

curl https://llm.yourdomain.com/v1/chat/completions \
  -H "Authorization: Bearer your-secret-token-here" \
  -H "Content-Type: application/json" \
  -d '{"model":"local","messages":[{"role":"user","content":"Hello"}]}'

For vLLM, use the same flag:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --host 127.0.0.1 \
  --port 8080 \
  --api-key your-secret-token-here

Point Claude Code at Your Server

Claude Code reads its endpoint and credentials from environment variables. You can set them globally via the env block in your Claude Code settings (~/.claude/settings.json):

{
  "env": {
    "ANTHROPIC_BASE_URL": "https://llm.yourdomain.com",
    "ANTHROPIC_AUTH_TOKEN": "your-secret-token-here",
    "ANTHROPIC_MODEL": "local"
  }
}

ANTHROPIC_AUTH_TOKEN is sent as the Authorization: Bearer header, which is what the --api-key check expects. ANTHROPIC_MODEL sets the model name in requests - llama.cpp ignores it, while vLLM expects it to match the model you’re serving.

Or export the same variables in your shell before launching Claude Code:

export ANTHROPIC_BASE_URL=https://llm.yourdomain.com
export ANTHROPIC_AUTH_TOKEN=your-secret-token-here

Claude Code will route requests to your local server instead of Anthropic’s API.


When to Use Your Local Server vs Claude Code’s Default

Your local 8B model is fast, private and free to run. Claude is better at complex reasoning, large context windows and multi-step tasks. The practical split:

Task | Local Model | Claude
Quick drafts, summarization | Yes | Overkill
Code autocomplete, small edits | Yes | Overkill
Architecture decisions | No | Yes
Debugging complex issues | No | Yes
Sensitive data you can’t send to the cloud | Yes | No
Burning through tokens on repetitive tasks | Yes | Expensive

Claude Code remains the better choice for anything complex. The local server earns its place for fast, private, repetitive work where you don’t want to hit the API.


Run It as a Service

To keep the server running after logout, create a systemd service (Linux):

# /etc/systemd/system/llm-server.service
[Unit]
Description=llama.cpp LLM Server
After=network.target

[Service]
ExecStart=/home/youruser/llama.cpp/build/bin/llama-server \
  -m /home/youruser/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --host 127.0.0.1 --port 8080 -ngl 99 \
  --api-key your-secret-token-here
Restart=always
User=youruser

[Install]
WantedBy=multi-user.target

Reload systemd, then enable and start it:

sudo systemctl daemon-reload
sudo systemctl enable llm-server
sudo systemctl start llm-server
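To confirm the service came up and to watch it at runtime:

systemctl status llm-server
journalctl -u llm-server -f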

Do the same for cloudflared:

sudo cloudflared service install

Frequently Asked Questions

Does llama.cpp work on Apple Silicon / Mac? Yes, and Apple Silicon is one of the best platforms for local LLM inference. llama.cpp’s Metal backend runs natively on M-series chips. The unified memory architecture is the key advantage - an M5 Max with 128 GB can run a 70B model that won’t fit on a 24 GB discrete GPU. The M5 Pro and M5 Max push memory bandwidth to 307-614 GB/s, which directly translates to faster token generation. The Homebrew install includes Metal support automatically - no extra configuration needed. Expect 60-100+ tok/s on 8B models depending on your chip.

Can I run a local LLM without a GPU? Yes, llama.cpp supports CPU-only inference. It’s much slower - expect 3-8 tokens/second on a modern CPU vs 20-65+ on a GPU - but functional for testing or low-frequency use.
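A CPU-only run is the same server command with GPU offload disabled; a sketch (the thread count is just a starting point - tune it for your CPU):

llama-server \
  -m ./models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --host 127.0.0.1 --port 8080 \
  -ngl 0 --threads 8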

What is the minimum VRAM to run a local LLM? 8 GB VRAM is the practical minimum for useful models. With 8 GB you can run 7B-8B models at Q4 quantization. Less than 8 GB forces you to smaller models or heavy CPU offloading.

Is llama.cpp or vLLM better for a single user on a consumer GPU? llama.cpp. It’s simpler to set up, supports GGUF quantized models, has lower overhead and performs well for single-user workloads. vLLM’s advantages (continuous batching, tensor parallelism) only matter when serving multiple concurrent users.

Can I use my local LLM server with Claude Code? Yes. Claude Code can be pointed at a custom endpoint. Set ANTHROPIC_BASE_URL to your server URL and ANTHROPIC_AUTH_TOKEN to your bearer token before launching.

How do I expose my local LLM server to the internet safely? Cloudflare Tunnel is the recommended approach - it creates an outbound-only encrypted connection with no open ports or static IP required. Pair it with the --api-key flag in llama.cpp or vLLM to require bearer token authentication on every request.

What models work best on an RTX 4070? Llama 3.1 8B at Q4_K_M or Q5_K_M quantization gives the best balance of speed (~55-65 tok/s) and quality. Qwen2.5 14B fits at Q4_K_M (~9 GB VRAM) if you want a larger model at the cost of some throughput.


Up Next

Multi-GPU setups introduce a real architectural choice: run one large model split across both GPUs using tensor parallelism, or run two separate models and route between them. The right answer depends on your VRAM, the models you’re running and your use case. That’s the next guide.


Sources:

  • GPU Ranking for Local LLMs - Puget Systems
  • LLM Consumer GPU Benchmarks - LocalScore
  • Llama 3.1 8B Benchmarks - Red Hat: vLLM or llama.cpp