Build a homelab AI cluster in 30 minutes
Four machines. One 70B model that fits on none of them alone. Start with two boxes, add machines whenever. The dusty 2070 in the closet and the no-GPU server in the basement both pull their weight, and you point your chat app at one URL.
Your junk drawer of compute, unified
Most "run a 70B at home" guides assume two matching 3090s. Real homelabs aren't like that. You have a gaming rig, an old PC you almost sold, a laptop, maybe a headless server with no GPU at all: different vendors, different generations, different OSes. Tightwad pools all of it into a single OpenAI-compatible endpoint, then uses speculative decoding so the pooled model is actually usable instead of painfully slow.
This recipe runs a real four-machine cluster: three boxes pool 52 GB of mixed VRAM (NVIDIA + Apple Metal) to host Llama 3.3 70B, and a fourth laptop runs a Llama 3.1 8B draft model plus the Tightwad proxy. Under greedy decoding the output is mathematically identical to running the 70B alone; you just see it faster. No Docker Compose with 300 env vars, no Kubernetes. Python and one config file.
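The "identical output under greedy decoding" claim follows from how speculative decoding verifies drafts. The sketch below is a toy illustration, not Tightwad's code: `toy_target` and `toy_draft` are hypothetical stand-ins for the 70B and 8B models, and the verification loop checks one position at a time where a real server would verify the whole draft in one batched pass.

```python
from typing import Callable, List

# A "model" here is any function that maps a token sequence to its greedy next token.
NextToken = Callable[[List[int]], int]

def greedy_decode(target: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: the target model generates every token itself."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq[len(prompt):]

def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    """The draft proposes k tokens per round; the target keeps the agreeing
    prefix and substitutes its own token at the first disagreement."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    while len(seq) < goal:
        # 1. Cheap model guesses k tokens ahead.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))
        # 2. Target checks each guess in order.
        accepted: List[int] = []
        for guess in proposal:
            expected = target(seq + accepted)
            accepted.append(expected)   # the target's token always wins
            if expected != guess:
                break                   # first mismatch ends the round
        seq.extend(accepted)
    return seq[len(prompt):goal]

# Hypothetical toy models: the draft agrees with the target except when the
# sequence length is a multiple of 5.
def toy_target(seq: List[int]) -> int:
    return (sum(seq) * 31 + 7) % 50

def toy_draft(seq: List[int]) -> int:
    tok = toy_target(seq)
    return tok if len(seq) % 5 else (tok + 1) % 50

baseline = greedy_decode(toy_target, [1, 2, 3], 20)
speculated = speculative_decode(toy_target, toy_draft, [1, 2, 3], 20)
print(baseline == speculated)
```

Every token that gets appended is the target's own greedy choice, so the two lists match by construction. A bad draft only costs speed, never correctness.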
Homelab setup, step by step
Three machines pool a 70B. A fourth drafts and proxies. Start with two, add the rest anytime; the cluster grows.
    # Machine A: Desktop (4070 Ti Super + 3060, 28GB, CUDA)
    $ rpc-server --host 0.0.0.0 --port 50052

    # Machine B: Old Gaming PC (RTX 2070, 8GB, CUDA)
    $ rpc-server --host 0.0.0.0 --port 50052

    # Machine C: MacBook Air M2 (Metal). Restrict to the GPU only:
    $ ./rpc-server --host 0.0.0.0 --port 50052 --device MTL0

Grab prebuilt rpc-server binaries from the llama.cpp releases, or build from source. All workers and the coordinator must run the same llama.cpp build; version mismatches fail silently. On macOS, --device MTL0 stops llama.cpp from exposing the CPU as a second device and breaking the tensor split. Open port 50052 in your firewall.
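Before moving on, it can help to confirm that each worker's RPC port is reachable from the coordinator. This is a minimal sketch using only the Python standard library; the helper name `port_open` is illustrative and not part of Tightwad or llama.cpp.

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Calling `port_open("192.168.1.20", 50052)` from the coordinator should return `True` once rpc-server is running on that worker and the firewall rule is in place. The check only proves the port accepts connections; it does not confirm that the llama.cpp builds match.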
    # Machine D: MacBook Air M4 (runs the draft + proxy)
    $ ollama run llama3.1:8b

    # Confirm it's loaded:
    $ ollama ps
    ✓ llama3.1:8b running
The draft and target must be the same model family (here, Llama 3.x → Llama 3.3). Cross-family drafting collapses acceptance and ends up slower than no speculation. Tightwad checks the families at startup and in tightwad doctor.
    $ python3 -m venv .venv && source .venv/bin/activate
    $ pip install tightwad
configs/cluster.yaml

    proxy:
      host: 0.0.0.0
      port: 8088
      max_draft_tokens: auto            # auto-tunes from rolling acceptance
      draft:
        url: http://localhost:11434     # Machine D (M4, local draft)
        model_name: llama3.1:8b
        backend: ollama

    coordinator:
      host: 0.0.0.0
      port: 8090
      model: Llama-3.3-70B-Q4_K_M.gguf
      gpus:                             # local GPUs on the coordinator
        - { name: "RTX 4070 Ti Super", vram_gb: 16 }
        - { name: "RTX 3060", vram_gb: 12 }

    workers:
      - host: 192.168.1.20              # Machine B (RTX 2070)
        gpus: [ { name: "RTX 2070", vram_gb: 8, rpc_port: 50052 } ]
      - host: 192.168.1.30              # Machine C (M2 Metal)
        gpus: [ { name: "Apple M2 Metal", vram_gb: 11, rpc_port: 50052 } ]
Find your IPs with ip addr (Linux), ipconfig (Windows), or ipconfig getifaddr en0 (macOS). For Apple Silicon use recommendedMaxWorkingSetSize (printed by rpc-server at startup), not total unified memory. The coordinator needs enough system RAM for the whole GGUF: a 70B Q4_K_M (~40GB) wants ~44GB RAM. Add more workers anytime; the cluster grows. Or skip the hand-editing and run tightwad init to auto-discover LAN servers.
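A quick budget check for the vram_gb values in the example config is sketched below. The proportional split is an assumption about how a pool might be divided, not a statement of what Tightwad or llama.cpp computes. `headroom_gb` and `split_fractions` are illustrative helper names.

```python
# vram_gb values copied from the example cluster.yaml in this recipe.
pool_gb = {
    "RTX 4070 Ti Super": 16,
    "RTX 3060": 12,
    "RTX 2070": 8,
    "Apple M2 Metal": 11,   # recommendedMaxWorkingSetSize, not total unified memory
}
MODEL_GB = 40  # approximate size of a 70B Q4_K_M GGUF, per the text

def headroom_gb(pool: dict, model_gb: float) -> float:
    """Declared VRAM left over after the model weights are placed."""
    return sum(pool.values()) - model_gb

def split_fractions(pool: dict) -> dict:
    """Share of the model each GPU would hold under a VRAM-proportional split (assumption)."""
    total = sum(pool.values())
    return {name: gb / total for name, gb in pool.items()}

print(f"declared pool: {sum(pool_gb.values())} GB, headroom: {headroom_gb(pool_gb, MODEL_GB)} GB")
for name, frac in split_fractions(pool_gb).items():
    print(f"{name}: {frac:.0%}")
```

The declared values sum to 47 GB, which leaves about 7 GB beyond the weights for KV cache and compute buffers. If the headroom goes negative after you edit the config, the model will not fit without a smaller quantization or another worker.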
    $ tightwad proxy start
    ✓ Draft model healthy (llama3.1:8b @ localhost:11434) - Machine D
    ✓ Pool: 3 workers online (52GB VRAM total) - A + B + C
    ✓ Target: Llama-3.3-70B distributed across pool
    ✓ Proxy listening on http://localhost:8088

Run tightwad doctor first if anything looks off; it checks config, binaries, network, build versions, and model families.
    $ curl http://localhost:8088/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{"messages": [{"role": "user", "content": "What is 17 * 24?"}], "max_tokens": 100}'

    # Check acceptance rate + throughput
    $ tightwad proxy status
    → Acceptance rate: N% | Rounds: N | Tokens/round: N
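The same request can be sent from Python with only the standard library. This sketch mirrors the curl call above. It assumes the proxy returns the standard OpenAI chat-completions response shape (`choices[0].message.content`), which the recipe implies but does not show. `build_request` and `ask` are illustrative names.

```python
import json
import urllib.request

PROXY_URL = "http://localhost:8088/v1/chat/completions"  # proxy port from this recipe

def build_request(prompt: str, max_tokens: int = 100,
                  url: str = PROXY_URL) -> urllib.request.Request:
    """Build the same POST request the curl example sends."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(prompt: str, max_tokens: int = 100, url: str = PROXY_URL) -> str:
    """Send the prompt and return the reply text (assumes an OpenAI-style response body)."""
    with urllib.request.urlopen(build_request(prompt, max_tokens, url), timeout=300) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the proxy running, `ask("What is 17 * 24?")` should return the model's reply as a string. The generous timeout reflects that a pooled 70B over WiFi answers at a few tokens per second.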
In Open WebUI, ChatBot UI, or any OpenAI-compatible app, change the base URL from:
http://192.168.1.10:11434
to:
http://192.168.1.10:8088
That's it. Four machines, one endpoint. Same app, same model name, same output quality. Machines A, B, and C pool a 70B that fits on no single machine; Machine D drafts and proxies. Under greedy decoding you see 4.1 tok/s instead of 2.2, for output that's identical to running the 70B alone.
What to expect with this setup
Measured on this exact cluster: Llama 3.1 8B drafting for Llama 3.3 70B across a 4-GPU RPC pool (52GB VRAM over WiFi), greedy decoding (temp 0):
| Metric | Result |
|---|---|
| ⚡ Output equivalence (greedy) | |
| 🚀 Speedup | |
| 💬 Tokens per round | |
| ⏱️ Speed (pool only) | 2.2 tok/s |
| ⏱️ Speed (pool + speculation) | 4.1 tok/s |
519 tokens in 127s vs 512 tokens in 231s. Your numbers will vary with hardware, network, and model pairing. See the full methodology on the benchmarks page.
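The throughput figures follow directly from those two runs. A short worked calculation, using only the token counts and wall-clock times quoted above:

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """Average generation throughput for one run."""
    return tokens / seconds

with_speculation = tokens_per_second(519, 127)   # pool + speculation run
pool_only = tokens_per_second(512, 231)          # pool-only run
speedup = with_speculation / pool_only

print(f"{pool_only:.1f} tok/s -> {with_speculation:.1f} tok/s ({speedup:.2f}x)")
# prints: 2.2 tok/s -> 4.1 tok/s (1.84x)
```

Dividing the two rates rather than the two wall-clock times matters here because the runs produced slightly different token counts (519 vs 512).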
Want the why behind the pooling, or the raw benchmark logs? Start here:
Get started: pip install tightwad