// pool mismatched GPUs into one endpoint

Run a 70B that fits on no single machine — across all of them.

Tightwad pools your junk drawer of compute — CUDA, ROCm, Metal, even a CPU-only box — into one OpenAI-compatible endpoint. A model too big for any one card gets distributed across the whole pool. Then speculative decoding rides on top to make it fast: 1.86× measured on Llama 3.3 70B across 4 consumer GPUs over WiFi.

Your junk drawer of compute, unified

It's not two matching GPUs. It's the 4070 in your main rig, the 2070 in the box you almost sold, an AMD card you bought on sale, and that old Xeon with no GPU at all — pooled into one API. The model doesn't have to fit on any single machine. It just has to fit across all of them.

Mix anything. Get one endpoint.

YOUR HARDWARE (any mix works)
RTX 4070 Ti Super (16GB)
RTX 3060 (12GB)
RTX 2070 (8GB)
GTX 770 (2GB — why not)
RX 7900 XTX (24GB, AMD!)
Old Xeon (CPU only)
Laptop (M2, CPU draft)
CUDA ✓ ROCm ✓ Metal ✓ CPU ✓ Mixed ✓
TIGHTWAD
One endpoint
Pool layers. Speculate on top.
localhost:8088
OpenAI-compatible API
Without Tightwad: the big model doesn't fit anywhere, so you don't run it at all  •  With Tightwad: the pool holds the model, speculation makes it usable  •  Cost: $0 — runs fully local

How pooling works

Built on llama.cpp RPC. Each machine that contributes a GPU runs an rpc-server. A coordinator loads the full model and distributes layers and tensors across every device — local and remote, NVIDIA and AMD and Apple — behind one endpoint.

01

RPC Cluster — pool GPUs into one model

Got GPUs scattered across machines? Pool them. CUDA on one box, ROCm on another, Metal on a Mac — Tightwad doesn't care. The coordinator distributes model layers across all of them and hands you one clean OpenAI-compatible endpoint. Use this when a model doesn't fit on any single machine.

  [OpenAI Client]
        |
        v
+-------------------+
|  Tightwad         |  <-- One endpoint to rule them all
|  Coordinator :8090|
+--------+----------+
         |  mmaps full GGUF, distributes layers/tensors
    +----+----+----------+
    v         v          v
+--------+ +--------+ +--------+
| Worker | | Worker | | Worker |
| NVIDIA | |  AMD   | | Metal  |
| 4070Ti | | 7900XTX| | M2 Mac |
| 16 GB  | | 24 GB  | | 11 GB  |
+--------+ +--------+ +--------+
  70B model: covered ✓
  • Mix NVIDIA + AMD + Apple GPUs freely on the same model
  • Run 70B+ models on consumer hardware that fits nowhere else
  • Hot-swap models without restarting workers
  • One OpenAI-compatible API — point any client at it, change nothing

Coordinator RAM caveat: the coordinator machine needs enough system RAM for the full model file — not just its GPU share. llama.cpp mmaps the entire GGUF before distributing tensors to workers, so a 70B Q4_K_M (~40 GB) wants ~44 GB of RAM on the coordinator. The workers only need VRAM for their slice.

One honest catch: RPC tensor-parallelism ships 100–300 MB of tensor data per inference step over the network. Run autoregressively over WiFi and a 70B pool crawls at ~2–3 tok/s. Pooling gets the model loadable. The next step gets it fast.

Then speculate on top

A pooled model is slow because every token is one full network round-trip. Speculative decoding fixes that: a tiny draft model proposes 32 tokens at once, and the pool verifies all 32 in a single batch — one round-trip for 32 tokens instead of 32 round-trips. The RPC overhead amortizes, and the model that fit nowhere becomes usable.

👑 THE KILLER FEATURE
02

Combined Mode — Speculation Over a Pool

When a model doesn't fit on any single machine, pool your GPUs AND speculate on top. RPC pooling alone is slow (one network round-trip per token). Speculation amortizes that — 32 tokens verified per round-trip instead of 1. Result: models that fit nowhere become usable.

  [Junk Hardware — GTX 770 2GB, laptop CPU, P400]
        | runs 1.7B–8B draft, fast & local
        | sends token IDs (bytes, not megabytes)
        v
  [Tightwad Proxy :8088]
        | sends draft to the pool for BATCH verify
        v
  [RPC GPU Pool — 4 GPUs, 52GB total, WiFi]
        | verifies 32 tokens in ONE forward pass
        v
  4.1 tok/s instead of 2.2 tok/s — 70B fits nowhere else
  • 1.86× measured wall-clock speedup on Llama 3.3 70B (4 GPUs over WiFi)*
  • Output mathematically identical to the target alone under greedy decoding (Leviathan guarantee)
  • Any junk hardware can be the drafter — 2GB GPU, CPU, laptop
  • Pool CUDA + ROCm + Metal GPUs, speculate on top

* Measured: Llama 3.1 8B draft → Llama 3.3 70B target across RTX 4070 Ti Super + RTX 3060 + RTX 2070 + M2 Metal = 52 GB VRAM over WiFi, greedy decoding (temp=0). 519 tokens in 127s (4.1 tok/s) vs 512 tokens in 231s (2.2 tok/s) direct, 33 tokens/round. Under greedy decoding the output is mathematically identical to running the 70B alone. Your results will vary with hardware and network.

The measured result, plainly

🐌

Pool direct

70B across 4 GPUs over WiFi, autoregressive: 2.2 tok/s. Loadable, but painful — every token is a full RPC round-trip.

🚀

Draft

Llama 3.1 8B on a Mac's Metal GPU proposes 32 tokens locally. Fast, cheap, no network.

🔍

Batch verify

The pool checks all 33 drafted tokens in one forward pass — one round-trip instead of 33.

Pool + spec

4.1 tok/s1.86× faster, identical output. The 70B is now actually usable.

One hard rule: the draft and target must be the same model family. Llama 3.1 8B → Llama 3.3 70B works because they share an architecture. A cross-family pairing collapses acceptance and makes things slower, not faster. Tightwad auto-detects families at proxy startup and in tightwad doctor, warning loudly on a mismatch.

Pick your archetype

Tightwad was built by and for people who think a closet full of idle GPUs is a crime against compute.

🏠

The Homelab Hoarder

You have a 2070 in the old desktop, a 4070 in the main rig, and a Quadro doing nothing in the server rack. They're all lazy freeloaders. Tightwad makes them a team.

RPC Cluster Mode
  • Pool all your random GPUs into one endpoint
  • Run 70B models across consumer hardware
  • Zero wasted VRAM, zero cloud spend

The Mixed Vendor Maverick

You bought AMD because it was on sale and NVIDIA for the CUDA ecosystem. Now they won't cooperate. Tightwad runs CUDA and ROCm on the same model together. Finally.

RPC Cluster Mode
  • CUDA + ROCm + Metal on the same model
  • llama.cpp RPC handles the hard parts
  • Coordinator distributes layers intelligently
💀

The E-Waste Reviver

That GTX 770 from 2013 with 2GB of VRAM? It can't hold a big model — but it can run a 1–2B draft model for one. The old Xeon with no GPU? CPU drafting. No node left behind.

Combined Mode
  • 2GB GPUs and CPU-only boxes draft for the pool
  • Turn e-waste into productive infrastructure
  • Every machine contributes — same model family is all it takes

Example configurations

However your hardware is set up, there's a local config for it. Draft on a GPU, draft on a CPU — the pool verifies. Everything runs on your own machines.

CPU POOL

Zero GPU Required to Participate

Run a tiny 1–2B draft model on any CPU. Verify on a pooled GPU target. Your CPU-only server, your laptop, your NAS — all can contribute to the cluster.

💻
Any machine
CPU only · Qwen3-1.7B draft · even a laptop
🖥️
GPU Pool
Any GPUs · any big target model
Best at max_draft_tokens=32 (HTTP overhead negates gains at 8) · zero GPU to participate · $0, fully local
Config Draft Target Use case Measured
GPU → Pool Any GPU — old, new, NVIDIA, AMD Pool of bigger GPUs Homelab, mixed hardware 1.27x (8B→32B)
Draft → RPC Pool Same-family small model 70B across 4 GPUs Model too big for one machine 1.86x (8B→70B)
CPU → Pool Any CPU, no GPU needed GPU pool Zero-GPU participants Best at draft=32

The 1.27× cross-machine figure (Qwen3-8B → Qwen3-32B over llama-server) was measured under prompt-append verification; the per-task acceptance breakdown that accompanied it was measured under the legacy text-match verifier and is being re-validated under the v0.5.1+ per-position verifier. The 1.86× pool result is greedy wall-clock and is the figure to trust.

Pool your hardware in one command

Install Tightwad on the machine that runs the proxy, point it at your workers, and you've got one endpoint over the whole pool.

# Install
pip install tightwad

# Auto-discover your LAN and generate a cluster config
tightwad init

# Verify config, binaries, network, versions, model families
tightwad doctor

# Start the coordinator over your pooled GPUs
tightwad start

# Then point any OpenAI client at http://localhost:8088/v1