Run a 70B that fits on no single machine — across all of them.
Tightwad pools your junk drawer of compute — CUDA, ROCm, Metal, even a CPU-only box — into one OpenAI-compatible endpoint. A model too big for any one card gets distributed across the whole pool. Then speculative decoding rides on top to make it fast: 1.86× measured on Llama 3.3 70B across 4 consumer GPUs over WiFi.
Your junk drawer of compute, unified
It's not two matching GPUs. It's the 4070 in your main rig, the 2070 in the box you almost sold, an AMD card you bought on sale, and that old Xeon with no GPU at all — pooled into one API. The model doesn't have to fit on any single machine. It just has to fit across all of them.
Mix anything. Get one endpoint.
How pooling works
Built on llama.cpp RPC. Each machine that contributes a GPU runs an rpc-server. A coordinator loads the full model and distributes layers and tensors across every device — local and remote, NVIDIA and AMD and Apple — behind one endpoint.
RPC Cluster — pool GPUs into one model
Got GPUs scattered across machines? Pool them. CUDA on one box, ROCm on another, Metal on a Mac — Tightwad doesn't care. The coordinator distributes model layers across all of them and hands you one clean OpenAI-compatible endpoint. Use this when a model doesn't fit on any single machine.
[OpenAI Client]
|
v
+-------------------+
| Tightwad | <-- One endpoint to rule them all
| Coordinator :8090|
+--------+----------+
| mmaps full GGUF, distributes layers/tensors
+----+----+----------+
v v v
+--------+ +--------+ +--------+
| Worker | | Worker | | Worker |
| NVIDIA | | AMD | | Metal |
| 4070Ti | | 7900XTX| | M2 Mac |
| 16 GB | | 24 GB | | 11 GB |
+--------+ +--------+ +--------+
70B model: covered ✓
- ✓ Mix NVIDIA + AMD + Apple GPUs freely on the same model
- ✓ Run 70B+ models on consumer hardware that fits nowhere else
- ✓ Hot-swap models without restarting workers
- ✓ One OpenAI-compatible API — point any client at it, change nothing
Coordinator RAM caveat: the coordinator machine needs enough system RAM for the full model file — not just its GPU share. llama.cpp mmaps the entire GGUF before distributing tensors to workers, so a 70B Q4_K_M (~40 GB) wants ~44 GB of RAM on the coordinator. The workers only need VRAM for their slice.
One honest catch: RPC tensor-parallelism ships 100–300 MB of tensor data per inference step over the network. Run autoregressively over WiFi and a 70B pool crawls at ~2–3 tok/s. Pooling gets the model loadable. The next step gets it fast.
Then speculate on top
A pooled model is slow because every token is one full network round-trip. Speculative decoding fixes that: a tiny draft model proposes 32 tokens at once, and the pool verifies all 32 in a single batch — one round-trip for 32 tokens instead of 32 round-trips. The RPC overhead amortizes, and the model that fit nowhere becomes usable.
Combined Mode — Speculation Over a Pool
When a model doesn't fit on any single machine, pool your GPUs AND speculate on top. RPC pooling alone is slow (one network round-trip per token). Speculation amortizes that — 32 tokens verified per round-trip instead of 1. Result: models that fit nowhere become usable.
[Junk Hardware — GTX 770 2GB, laptop CPU, P400]
| runs 1.7B–8B draft, fast & local
| sends token IDs (bytes, not megabytes)
v
[Tightwad Proxy :8088]
| sends draft to the pool for BATCH verify
v
[RPC GPU Pool — 4 GPUs, 52GB total, WiFi]
| verifies 32 tokens in ONE forward pass
v
4.1 tok/s instead of 2.2 tok/s — 70B fits nowhere else
- ✓ 1.86× measured wall-clock speedup on Llama 3.3 70B (4 GPUs over WiFi)*
- ✓ Output mathematically identical to the target alone under greedy decoding (Leviathan guarantee)
- ✓ Any junk hardware can be the drafter — 2GB GPU, CPU, laptop
- ✓ Pool CUDA + ROCm + Metal GPUs, speculate on top
* Measured: Llama 3.1 8B draft → Llama 3.3 70B target across RTX 4070 Ti Super + RTX 3060 + RTX 2070 + M2 Metal = 52 GB VRAM over WiFi, greedy decoding (temp=0). 519 tokens in 127s (4.1 tok/s) vs 512 tokens in 231s (2.2 tok/s) direct, 33 tokens/round. Under greedy decoding the output is mathematically identical to running the 70B alone. Your results will vary with hardware and network.
The measured result, plainly
Pool direct
70B across 4 GPUs over WiFi, autoregressive: 2.2 tok/s. Loadable, but painful — every token is a full RPC round-trip.
Draft
Llama 3.1 8B on a Mac's Metal GPU proposes 32 tokens locally. Fast, cheap, no network.
Batch verify
The pool checks all 33 drafted tokens in one forward pass — one round-trip instead of 33.
Pool + spec
4.1 tok/s — 1.86× faster, identical output. The 70B is now actually usable.
One hard rule: the draft and target must be the same model family. Llama 3.1 8B → Llama 3.3 70B works because they share an architecture. A cross-family pairing collapses acceptance and makes things slower, not faster. Tightwad auto-detects families at proxy startup and in tightwad doctor, warning loudly on a mismatch.
Pick your archetype
Tightwad was built by and for people who think a closet full of idle GPUs is a crime against compute.
The Homelab Hoarder
You have a 2070 in the old desktop, a 4070 in the main rig, and a Quadro doing nothing in the server rack. They're all lazy freeloaders. Tightwad makes them a team.
- Pool all your random GPUs into one endpoint
- Run 70B models across consumer hardware
- Zero wasted VRAM, zero cloud spend
The Budget Builder
You want 70B-class quality on a consumer GPU budget. That's not a dream — that's just math. RPC mode distributes layers across whatever you've got collecting dust, then speculation makes it fast.
- Llama 3.3 70B across 4× consumer GPUs over WiFi
- No enterprise hardware required — 1.86× measured
- Benchmark built in to tune your setup
The Mixed Vendor Maverick
You bought AMD because it was on sale and NVIDIA for the CUDA ecosystem. Now they won't cooperate. Tightwad runs CUDA and ROCm on the same model together. Finally.
- CUDA + ROCm + Metal on the same model
- llama.cpp RPC handles the hard parts
- Coordinator distributes layers intelligently
The E-Waste Reviver
That GTX 770 from 2013 with 2GB of VRAM? It can't hold a big model — but it can run a 1–2B draft model for one. The old Xeon with no GPU? CPU drafting. No node left behind.
- 2GB GPUs and CPU-only boxes draft for the pool
- Turn e-waste into productive infrastructure
- Every machine contributes — same model family is all it takes
Example configurations
However your hardware is set up, there's a local config for it. Draft on a GPU, draft on a CPU — the pool verifies. Everything runs on your own machines.
Homelab / Small Teams
Draft on any GPU you have, verify on a pool of bigger ones. Mix any generations, any vendors. RTX 4070 + RTX 2070 + RX 7900 XTX + an M2 Mac — all in one cluster.
Model Too Big for One Machine
A 70B fits on no single card you own — so pool four of them, then draft on a tiny model to overcome RPC's per-token latency. This is Combined Mode, the validated headline.
Zero GPU Required to Participate
Run a tiny 1–2B draft model on any CPU. Verify on a pooled GPU target. Your CPU-only server, your laptop, your NAS — all can contribute to the cluster.
| Config | Draft | Target | Use case | Measured |
|---|---|---|---|---|
| GPU → Pool | Any GPU — old, new, NVIDIA, AMD | Pool of bigger GPUs | Homelab, mixed hardware | 1.27x (8B→32B) |
| Draft → RPC Pool | Same-family small model | 70B across 4 GPUs | Model too big for one machine | 1.86x (8B→70B) |
| CPU → Pool | Any CPU, no GPU needed | GPU pool | Zero-GPU participants | Best at draft=32 |
The 1.27× cross-machine figure (Qwen3-8B → Qwen3-32B over llama-server) was measured under prompt-append verification; the per-task acceptance breakdown that accompanied it was measured under the legacy text-match verifier and is being re-validated under the v0.5.1+ per-position verifier. The 1.86× pool result is greedy wall-clock and is the figure to trust.
Pool your hardware in one command
Install Tightwad on the machine that runs the proxy, point it at your workers, and you've got one endpoint over the whole pool.
# Install pip install tightwad # Auto-discover your LAN and generate a cluster config tightwad init # Verify config, binaries, network, versions, model families tightwad doctor # Start the coordinator over your pooled GPUs tightwad start # Then point any OpenAI client at http://localhost:8088/v1