// concrete recipe

Build a homelab AI cluster in 30 minutes

Four machines. One 70B model that fits on none of them alone. Start with two boxes, add machines whenever. The dusty 2070 in the closet and the no-GPU server in the basement both pull their weight, and you point your chat app at one URL.

Your junk drawer of compute, unified

Most "run a 70B at home" guides assume two matching 3090s. Real homelabs aren't like that. You have a gaming rig, an old PC you almost sold, a laptop, maybe a headless server with no GPU at all: different vendors, different generations, different OSes. Tightwad pools all of it into a single OpenAI-compatible endpoint, then uses speculative decoding so the pooled model is actually usable instead of painfully slow.

This recipe runs a real four-machine cluster: three boxes pool 52 GB of mixed VRAM (NVIDIA + Apple Metal) to host Llama 3.3 70B, and a fourth laptop runs a Llama 3.1 8B draft model plus the Tightwad proxy. Under greedy decoding the output is mathematically identical to running the 70B alone; you just see it faster. No Docker Compose with 300 env vars, no Kubernetes. Python and one config file.
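To see why greedy speculative decoding cannot change the output, here is a minimal toy sketch. It is not Tightwad's implementation: `toy_target` and `toy_draft` are hypothetical stand-in functions that return the next token for a context, and the verification loop calls the target once per position where a real system would batch the check into a single forward pass.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in: a "model" is any function mapping a context
# to its greedy (argmax) next token.
NextToken = Callable[[List[int]], int]


def target_only(target: NextToken, prompt: List[int], n: int) -> List[int]:
    """Plain greedy decoding with the target model alone."""
    out = list(prompt)
    for _ in range(n):
        out.append(target(out))
    return out[len(prompt):]


def speculative(target: NextToken, draft: NextToken,
                prompt: List[int], n: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (new tokens, number of rounds)."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n:
        rounds += 1
        # 1) The draft proposes k tokens autoregressively.
        ctx, proposal = list(out), []
        for _ in range(k):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2) The target checks each position. We always keep the target's own
        #    token, so a wrong guess costs speed, never correctness.
        for guess in proposal:
            truth = target(out)
            out.append(truth)
            if truth != guess or len(out) - len(prompt) >= n:
                break
        else:
            # Every guess matched: the target contributes one extra token.
            out.append(target(out))
    return out[len(prompt):], rounds


def toy_target(ctx: List[int]) -> int:
    return (sum(ctx) * 31 + len(ctx)) % 50


def toy_draft(ctx: List[int]) -> int:
    # Agrees with the target except at every fifth context length.
    tok = toy_target(ctx)
    return (tok + 1) % 50 if len(ctx) % 5 == 0 else tok


prompt = [7, 3, 9]
baseline = target_only(toy_target, prompt, 40)
tokens, rounds = speculative(toy_target, toy_draft, prompt, 40)
```

In this toy, `tokens` equals `baseline` by construction, while `rounds` is smaller than the number of tokens produced. That gap is where the speedup comes from when each target call is expensive.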

Homelab setup, step by step

Three machines pool a 70B. A fourth drafts and proxies. Start with two, add the rest anytime; the cluster grows.

⚡ Draft Brain
💻 MacBook Air M4
Llama 3.1 8B · Apple Silicon
Tightwad proxy :8088
Drafts 33 tokens/round locally, sends token IDs

propose → (WiFi) ← verify

🖥️ GPU Pool: Target Model
Llama 3.3 70B · 52 GB VRAM distributed
🖥️ Desktop: RTX 4070 Ti Super + RTX 3060, 28 GB VRAM
🖥️ Old Gaming PC: RTX 2070, 8 GB VRAM
💻 MacBook Air M2: Apple Metal, 16 GB unified
3 machines · 52 GB total · rpc-server :50052

1.86× speedup · 4.1 tok/s (was 2.2 tok/s) · output identical to 70B alone · $0, runs fully local

1
On Machines A, B, C: start the llama.cpp RPC workers
Pool Workers
bash (on each pool machine)
# Machine A: Desktop (4070 Ti Super + 3060, 28GB, CUDA)
$ rpc-server --host 0.0.0.0 --port 50052
# Machine B: Old Gaming PC (RTX 2070, 8GB, CUDA)
$ rpc-server --host 0.0.0.0 --port 50052
# Machine C: MacBook Air M2 (Metal). Restrict to the GPU only:
$ ./rpc-server --host 0.0.0.0 --port 50052 --device MTL0

Grab prebuilt rpc-server binaries from the llama.cpp releases, or build from source. All workers and the coordinator must run the same llama.cpp build; version mismatches fail silently. On macOS, --device MTL0 stops llama.cpp from exposing the CPU as a second device and breaking the tensor split. Open port 50052 in your firewall.
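Before moving on, it can help to confirm that each worker's port is actually reachable from the coordinator. The sketch below is a generic TCP check using only the Python standard library. It is not part of Tightwad, the helper names are made up for illustration, and a successful connection says nothing about whether the llama.cpp builds match.

```python
import socket
from typing import Dict, Iterable, Tuple


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def check_workers(workers: Iterable[Tuple[str, int]]) -> Dict[str, bool]:
    """Map 'host:port' to reachability for each worker."""
    return {f"{host}:{port}": port_open(host, port) for host, port in workers}


# Example addresses (assumptions; substitute your own LAN IPs):
# check_workers([("192.168.1.20", 50052), ("192.168.1.30", 50052)])
```

A `False` entry usually points at a firewall rule, a wrong IP, or an rpc-server that is bound to localhost instead of 0.0.0.0.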

2
On Machine D: start the draft model
Machine D
bash
# Machine D โ€” MacBook Air M4 (runs the draft + proxy)
$ ollama run llama3.1:8b
# Confirm it's loaded:
$ ollama ps
 llama3.1:8b  running

The draft and target must be the same model family (here, Llama 3.x → Llama 3.3). Cross-family drafting collapses acceptance and ends up slower than no speculation. Tightwad checks the families at startup and in tightwad doctor.
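The "confirm it's loaded" check can also be scripted. This is a rough sketch, assuming Ollama's `/api/ps` endpoint returns a JSON object with a `models` list whose entries carry a `name` field. The helper names and the sample payload are illustrative, not taken from Tightwad or from real output.

```python
import json
import urllib.request
from typing import Any, Dict


def is_loaded(ps_payload: Dict[str, Any], model_name: str) -> bool:
    """True if model_name appears in an Ollama /api/ps-style payload."""
    return any(m.get("name") == model_name for m in ps_payload.get("models", []))


def fetch_ps(base_url: str = "http://localhost:11434") -> Dict[str, Any]:
    """Fetch the running-model list from a local Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        return json.load(resp)


# Illustrative payload only (shape assumed, values invented for the example).
sample = {"models": [{"name": "llama3.1:8b", "details": {"family": "llama"}}]}
```

On Machine D, `is_loaded(fetch_ps(), "llama3.1:8b")` should return `True` once the draft model is running.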

3
Install Tightwad (on whichever machine runs the proxy)
Either
bash
$ python3 -m venv .venv && source .venv/bin/activate
$ pip install tightwad
4
Edit configs/cluster.yaml
Either
configs/cluster.yaml: combined mode (pool + speculation)
proxy:
  host: 0.0.0.0
  port: 8088
  max_draft_tokens: auto            # auto-tunes from rolling acceptance
  draft:
    url: http://localhost:11434   # Machine D (M4, local draft)
    model_name: llama3.1:8b
    backend: ollama

coordinator:
  host: 0.0.0.0
  port: 8090
  model: Llama-3.3-70B-Q4_K_M.gguf
  gpus:                          # local GPUs on the coordinator
    - { name: "RTX 4070 Ti Super", vram_gb: 16 }
    - { name: "RTX 3060",          vram_gb: 12 }

workers:
  - host: 192.168.1.20         # Machine B (RTX 2070)
    gpus: [ { name: "RTX 2070", vram_gb: 8, rpc_port: 50052 } ]
  - host: 192.168.1.30         # Machine C (M2 Metal)
    gpus: [ { name: "Apple M2 Metal", vram_gb: 11, rpc_port: 50052 } ]

Find your IPs with ip addr (Linux), ipconfig (Windows), or ipconfig getifaddr en0 (macOS). For Apple Silicon use recommendedMaxWorkingSetSize (printed by rpc-server at startup), not total unified memory. The coordinator needs enough system RAM for the whole GGUF: a 70B Q4_K_M (~40GB) wants ~44GB RAM. Add more workers anytime; the cluster grows. Or skip the hand-editing and run tightwad init to auto-discover LAN servers.
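A quick sanity check on the numbers in this config: the sketch below sums the `vram_gb` values and compares them with the ~40GB model size quoted above. The values are copied into plain dicts so no YAML parser is needed. The proportional shares are one plausible way to think about a split, not a statement of how Tightwad or llama.cpp actually assigns layers.

```python
# vram_gb values copied from the cluster.yaml example.
gpus = [
    {"name": "RTX 4070 Ti Super", "vram_gb": 16},
    {"name": "RTX 3060", "vram_gb": 12},
    {"name": "RTX 2070", "vram_gb": 8},
    {"name": "Apple M2 Metal", "vram_gb": 11},
]
MODEL_GB = 40  # approximate size of the 70B Q4_K_M GGUF quoted in the text


def total_vram(devices):
    """Sum the configured VRAM across all devices, in GB."""
    return sum(d["vram_gb"] for d in devices)


def proportional_shares(devices):
    """Fraction of the pool each device contributes (illustrative only)."""
    total = total_vram(devices)
    return {d["name"]: d["vram_gb"] / total for d in devices}


total = total_vram(gpus)
headroom = total - MODEL_GB  # what is left for KV cache and buffers
shares = proportional_shares(gpus)
```

With these values the configured total comes to 47 GB rather than the 52 GB headline, apparently because the M2 is entered at 11 GB (its usable working set) instead of its full 16 GB of unified memory. That leaves about 7 GB beyond the model weights.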

5
Start the proxy
Either
bash
$ tightwad proxy start
 Draft model healthy  (llama3.1:8b @ localhost:11434) - Machine D
 Pool: 3 workers online (52GB VRAM total) - A + B + C
 Target: Llama-3.3-70B distributed across pool
 Proxy listening on http://localhost:8088

Run tightwad doctor first if anything looks off; it checks config, binaries, network, build versions, and model families.

6
Test it
Either
bash
$ curl http://localhost:8088/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "What is 17 * 24?"}], "max_tokens": 100}'

# Check acceptance rate + throughput
$ tightwad proxy status
 Acceptance rate: N% | Rounds: N | Tokens/round: N
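
The same test request can be sent from Python with only the standard library. This is a minimal sketch that mirrors the curl call above. `build_chat_request` and `ask` are hypothetical helper names, and the response parsing assumes the usual OpenAI-style `choices[0].message.content` shape.

```python
import json
import urllib.request


def build_chat_request(base_url: str, prompt: str,
                       max_tokens: int = 100) -> urllib.request.Request:
    """Build the same POST that the curl example sends."""
    body = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(base_url: str, prompt: str) -> str:
    """Send the request and return the first choice's text (response shape assumed)."""
    request = build_chat_request(base_url, prompt)
    with urllib.request.urlopen(request, timeout=300) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]


req = build_chat_request("http://localhost:8088", "What is 17 * 24?")
```

Once the proxy is running, `print(ask("http://localhost:8088", "What is 17 * 24?"))` should print the model's reply. The generous timeout reflects the few-tokens-per-second throughput of this cluster.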
7
Point your chat app at it
Done ✓

In Open WebUI, ChatBot UI, or any OpenAI-compatible app, change the base URL from:

http://192.168.1.10:11434 → http://192.168.1.10:8088

That's it. Four machines, one endpoint. Same app, same model name, same output quality. Machines A, B, and C pool a 70B that fits on no single machine; Machine D drafts and proxies. Under greedy decoding you see 4.1 tok/s instead of 2.2, with output that's identical to running the 70B alone.

What to expect with this setup

Measured on this exact cluster: Llama 3.1 8B drafting for Llama 3.3 70B across a 4-GPU RPC pool (52GB VRAM over WiFi), greedy decoding (temp 0):

Metric | Result
⚡ Output equivalence (greedy) | = (identical to 70B alone)
🚀 Speedup | 1.86×
💬 Tokens per round | 33
⏱️ Speed (pool only) | 2.2 tok/s
⏱️ Speed (pool + speculation) | 4.1 tok/s

519 tokens in 127s vs 512 tokens in 231s. Your numbers will vary with hardware, network, and model pairing. See the full methodology on the benchmarks page.
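
The throughput figures follow directly from those token counts and timings. The short sketch below redoes the arithmetic; the function name is illustrative.

```python
def tok_per_s(tokens: int, seconds: float) -> float:
    """Average decoding throughput in tokens per second."""
    return tokens / seconds


spec = tok_per_s(519, 127)  # pool + speculation
base = tok_per_s(512, 231)  # pool only

speedup_raw = spec / base                           # ratio of unrounded rates
speedup_rounded = round(spec, 1) / round(base, 1)   # ratio of the rounded rates

print(f"{spec:.2f} tok/s vs {base:.2f} tok/s")
print(f"speedup from raw counts:    {speedup_raw:.2f}x")
print(f"speedup from rounded rates: {speedup_rounded:.2f}x")
```

The unrounded ratio comes out near 1.84×, while dividing the rounded rates (4.1 / 2.2) gives 1.86×, so the headline speedup appears to be computed from the rounded figures.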

Want the why behind the pooling, or the raw benchmark logs? Start here:

Get started: pip install tightwad