// concrete recipe

Build a homelab AI cluster in 30 minutes

Four machines. One 70B model that fits on none of them alone. Start with two boxes, add machines whenever. The dusty 2070 in the closet and the no-GPU server in the basement both pull their weight — and you point your chat app at one URL.


// the setup

Your junk drawer of compute, unified

Most "run a 70B at home" guides assume two matching 3090s. Real homelabs aren't like that. You have a gaming rig, an old PC you almost sold, a laptop, maybe a headless server with no GPU at all — different vendors, different generations, different OSes. Tightwad pools all of it into a single OpenAI-compatible endpoint, then uses speculative decoding so the pooled model is actually usable instead of painfully slow.

This recipe runs a real four-machine cluster: three boxes pool 52 GB of mixed VRAM (NVIDIA + Apple Metal) to host Llama 3.3 70B, and a fourth laptop runs a Llama 3.1 8B draft model plus the Tightwad proxy. Under greedy decoding the output is mathematically identical to running the 70B alone; you just see it faster. No Docker Compose with 300 env vars, no Kubernetes. Python and one config file.
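Why is the output identical under greedy decoding? A toy sketch makes it concrete. The two "models" below are made-up stand-in functions (not Tightwad code and not real LLMs), and the target is called one token at a time for clarity where a real system would verify a whole draft in one batched pass. The point is only the accept/reject rule: every token that ends up in the output is the one the target would have picked anyway.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: the target model picks every token, one at a time."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq[len(prompt):]


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: draft proposes k tokens, target verifies."""
    seq = list(prompt)
    out: List[int] = []
    while len(out) < n_new:
        # 1. The cheap draft model proposes k tokens in a row.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))
        # 2. The target checks them in order. Matching tokens are kept;
        #    the first mismatch is replaced by the target's own choice.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(seq + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)
                break
        else:
            # Every draft token matched, so the target adds one bonus token.
            accepted.append(target(seq + accepted))
        seq.extend(accepted)
        out.extend(accepted)
    return out[:n_new]


# Hypothetical stand-ins for the 70B target and the 8B draft.
def toy_target(seq: List[int]) -> int:
    return (sum(seq) * 31 + len(seq)) % 50


def toy_draft(seq: List[int]) -> int:
    guess = toy_target(seq)
    # Disagree with the target every fifth position to force rejections.
    return (guess + 1) % 50 if len(seq) % 5 == 0 else guess


baseline = greedy_decode(toy_target, [1, 2, 3], 20)
speculated = speculative_decode(toy_target, toy_draft, [1, 2, 3], 20)
print("identical:", baseline == speculated)
```

A bad draft model only costs speed (more rejected rounds), never correctness, which is why the draft can run on a laptop.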

// 4 machines · 1 endpoint

Homelab setup, step by step

Three machines pool a 70B. A fourth drafts and proxies. Start with two, add the rest anytime — the cluster grows.

⚡ Draft Brain

💻 MacBook Air M4 · Llama 3.1 8B · Apple Silicon · Tightwad proxy :8088
Drafts 33 tokens/round locally, sends token IDs

propose → (WiFi) ← verify

🖥️ GPU Pool (Target Model): Llama 3.3 70B · 52 GB VRAM distributed

🖥️ Desktop · RTX 4070 Ti Super + RTX 3060 · 28 GB VRAM
🖥️ Old Gaming PC · RTX 2070 · 8 GB VRAM
💻 MacBook Air M2 · Apple Metal · 16 GB unified

3 machines · 52 GB total · rpc-server :50052

1.86× speedup · 4.1 tok/s (was 2.2 tok/s) · output identical to 70B alone · $0, runs fully local

On Machines A, B, C: start the llama.cpp RPC workers

Pool Workers

bash (on each pool machine)

# Machine A — Desktop (4070 Ti Super + 3060, 28GB, CUDA)
$ rpc-server --host 0.0.0.0 --port 50052
# Machine B — Old Gaming PC (RTX 2070, 8GB, CUDA)
$ rpc-server --host 0.0.0.0 --port 50052
# Machine C — MacBook Air M2 (Metal). Restrict to the GPU only:
$ ./rpc-server --host 0.0.0.0 --port 50052 --device MTL0

Grab prebuilt rpc-server binaries from the llama.cpp releases, or build from source. All workers and the coordinator must be the same llama.cpp build — version mismatches fail silently. On macOS, --device MTL0 stops llama.cpp from exposing the CPU as a second device and breaking the tensor split. Open port 50052 in your firewall.
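Before moving on, it can save time to confirm that port 50052 is actually reachable from the machine that will coordinate the pool. Here is a minimal stdlib-only sketch. The helper names `port_open` and `check_workers` are made up for this example and are not part of Tightwad or llama.cpp. A successful TCP connect only proves the port is open, not that the builds match.

```python
import socket
from typing import Dict, List, Tuple


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_workers(workers: List[Tuple[str, int]]) -> Dict[str, bool]:
    """Map 'host:port' to reachability for each RPC worker."""
    return {f"{host}:{port}": port_open(host, port) for host, port in workers}


# Example addresses taken from the cluster.yaml in this recipe; swap in your own.
EXAMPLE_WORKERS = [("192.168.1.20", 50052), ("192.168.1.30", 50052)]
```

Call `check_workers(EXAMPLE_WORKERS)` from a Python shell on the coordinator; any `False` usually points at a firewall rule or an rpc-server that is not bound to 0.0.0.0.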

On Machine D: start the draft model

Machine D

bash

# Machine D — MacBook Air M4 (runs the draft + proxy)
$ ollama run llama3.1:8b
# Confirm it's loaded:
$ ollama ps
✓ llama3.1:8b  running

The draft and target must be the same model family (here, Llama 3.x → Llama 3.3). Cross-family drafting collapses acceptance and ends up slower than no speculation. Tightwad checks the families at startup and in tightwad doctor.

Install Tightwad (on whichever machine runs the proxy)

Either

bash

$ python3 -m venv .venv && source .venv/bin/activate
$ pip install tightwad

Edit configs/cluster.yaml

Either

configs/cluster.yaml — combined mode (pool + speculation)

proxy:
  host: 0.0.0.0
  port: 8088
  max_draft_tokens: auto            # auto-tunes from rolling acceptance
  draft:
    url: http://localhost:11434   # Machine D (M4, local draft)
    model_name: llama3.1:8b
    backend: ollama

coordinator:
  host: 0.0.0.0
  port: 8090
  model: Llama-3.3-70B-Q4_K_M.gguf
  gpus:                          # local GPUs on the coordinator
    - { name: "RTX 4070 Ti Super", vram_gb: 16 }
    - { name: "RTX 3060",          vram_gb: 12 }

workers:
  - host: 192.168.1.20         # Machine B (RTX 2070)
    gpus: [ { name: "RTX 2070", vram_gb: 8, rpc_port: 50052 } ]
  - host: 192.168.1.30         # Machine C (M2 Metal)
    gpus: [ { name: "Apple M2 Metal", vram_gb: 11, rpc_port: 50052 } ]

Find your IPs with ip addr (Linux), ipconfig (Windows), or ipconfig getifaddr en0 (macOS). For Apple Silicon use recommendedMaxWorkingSetSize (printed by rpc-server at startup), not total unified memory. The coordinator needs enough system RAM for the whole GGUF — a 70B Q4_K_M (~40GB) wants ~44GB RAM. Add more workers anytime; the cluster grows. Or skip the hand-editing and run tightwad init to auto-discover LAN servers.
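A quick sanity check on the numbers in that config, as a rough sketch. The `vram_gb` values come from the cluster.yaml above and the ~40 GB model size from the note. The proportional split is an assumption made for illustration: llama.cpp's `--tensor-split` does take per-device proportions, but how Tightwad derives them is not shown here. Note that with 11 GB listed for the M2 the config totals 47 GB, while the 52 GB headline counts the M2's full 16 GB of unified memory.

```python
# (name, vram_gb) pairs copied from configs/cluster.yaml in this recipe.
GPUS = [
    ("RTX 4070 Ti Super", 16),
    ("RTX 3060", 12),
    ("RTX 2070", 8),
    ("Apple M2 Metal", 11),  # usable working set, not the full 16 GB
]
MODEL_GB = 40  # approximate size of a 70B Q4_K_M GGUF, per the note above

total_gb = sum(vram for _, vram in GPUS)
headroom_gb = total_gb - MODEL_GB  # what is left for KV cache and buffers

# Assumption: each device takes a share of the model proportional to its VRAM.
shares = [(name, vram / total_gb) for name, vram in GPUS]

for (name, vram), (_, share) in zip(GPUS, shares):
    print(f"{name:<18} {vram:>2} GB  {share:.1%} of the model")
print(f"pool total: {total_gb} GB, headroom after weights: {headroom_gb} GB")
```

If the headroom comes out near zero or negative for your hardware, use a smaller quantization or add another worker before starting the proxy.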

Start the proxy

Either

bash

$ tightwad proxy start
✓ Draft model healthy  (llama3.1:8b @ localhost:11434) — Machine D
✓ Pool: 3 workers online (52GB VRAM total) — A + B + C
✓ Target: Llama-3.3-70B distributed across pool
✓ Proxy listening on http://localhost:8088

Run tightwad doctor first if anything looks off — it checks config, binaries, network, build versions, and model families.

Test it

Either

bash

$ curl http://localhost:8088/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "What is 17 * 24?"}], "max_tokens": 100}'

# Check acceptance rate + throughput
$ tightwad proxy status
→ Acceptance rate: N% | Rounds: N | Tokens/round: N
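The same test from Python, using only the standard library. This is a sketch: the request body mirrors the curl call above, and the response parsing assumes the usual OpenAI chat-completions shape (`choices[0].message.content`), which is what "OpenAI-compatible" implies but is not shown in this recipe. `build_request` and `ask` are names invented for the example.

```python
import json
import urllib.request

PROXY_URL = "http://localhost:8088/v1/chat/completions"


def build_request(prompt: str, max_tokens: int = 100,
                  url: str = PROXY_URL) -> urllib.request.Request:
    """Build the same POST request as the curl example."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, timeout: float = 300.0) -> str:
    """Send the prompt to the proxy and return the reply text.

    Assumes an OpenAI-style response body; the long timeout allows for
    a slow first response from a large pooled model.
    """
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the proxy running, `print(ask("What is 17 * 24?"))` should return the model's answer.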

Point your chat app at it

Done ✓

In Open WebUI, ChatBot UI, or any OpenAI-compatible app, change the base URL from:

http://192.168.1.10:11434 → http://192.168.1.10:8088

That's it. Four machines, one endpoint. Same app, same model name, same output quality. Machines A, B, and C pool a 70B that fits on no single machine; Machine D drafts and proxies. Under greedy decoding you see 4.1 tok/s instead of 2.2 — for output that's identical to running the 70B alone.

What to expect with this setup

Measured on this exact cluster — Llama 3.1 8B drafting for Llama 3.3 70B across a 4-GPU RPC pool (52GB VRAM over WiFi), greedy decoding (temp 0):

Metric	Result
⚡ Output equivalence (greedy)	Identical to the 70B alone
🚀 Speedup	1.86×
💬 Tokens per round	33
⏱️ Speed (pool only)	2.2 tok/s
⏱️ Speed (pool + speculation)	4.1 tok/s

519 tokens in 127 s with speculation vs 512 tokens in 231 s pool-only. Your numbers will vary with hardware, network, and model pairing. See the full methodology on the benchmarks page.
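To see where the table values come from, here is the arithmetic on the raw counts quoted above. One thing worth noticing: the unrounded rates give a speedup of about 1.84×, while the 1.86× headline matches the ratio of the rounded rates (4.1 / 2.2). That reading is inferred from the numbers, not stated on the benchmarks page.

```python
# Raw measurements quoted in this section.
spec_tokens, spec_seconds = 519, 127   # pool + speculation
base_tokens, base_seconds = 512, 231   # pool only

spec_rate = spec_tokens / spec_seconds  # tokens per second with speculation
base_rate = base_tokens / base_seconds  # tokens per second without

speedup_raw = spec_rate / base_rate
speedup_rounded = round(spec_rate, 1) / round(base_rate, 1)

print(f"pool + speculation: {spec_rate:.2f} tok/s")
print(f"pool only:          {base_rate:.2f} tok/s")
print(f"speedup (raw):      {speedup_raw:.2f}x")
print(f"speedup (rounded):  {speedup_rounded:.2f}x")
```

Swap in the token counts and timings from your own `tightwad proxy status` runs to compare against this cluster.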

Want the why behind the pooling, or the raw benchmark logs? Start here:

How GPU pooling works → See the benchmarks → GitHub →

Get started: pip install tightwad
