Run Large Models Across Multiple GPUs

// the whole idea

Your junk drawer of compute, unified

It's not two matching GPUs. It's the 4070 in your main rig, the 2070 in the box you almost sold, an AMD card you bought on sale, and that old Xeon with no GPU at all — pooled into one API. The model doesn't have to fit on any single machine. It just has to fit across all of them.

Mix anything. Get one endpoint.

YOUR HARDWARE (any mix works)

RTX 4070 Ti Super (16GB)

RTX 3060 (12GB)

RTX 2070 (8GB)

GTX 770 (2GB — why not)

RX 7900 XTX (24GB, AMD!)

Old Xeon (CPU only)

Laptop (M2, CPU draft)

CUDA ✓ ROCm ✓ Metal ✓ CPU ✓ Mixed ✓

➜

TIGHTWAD

One endpoint

Pool layers. Speculate on top.

localhost:8088

✓ ✓ ✓ ✓ ✓ ✓ ✓

OpenAI-compatible API

Without Tightwad: the big model doesn't fit anywhere, so you don't run it at all • With Tightwad: the pool holds the model, speculation makes it usable • Cost: $0 — runs fully local

// step one

How pooling works

Built on llama.cpp RPC. Each machine that contributes a GPU runs an rpc-server. A coordinator loads the full model and distributes layers and tensors across every device — local and remote, NVIDIA and AMD and Apple — behind one endpoint.

01

RPC Cluster — pool GPUs into one model

Got GPUs scattered across machines? Pool them. CUDA on one box, ROCm on another, Metal on a Mac — Tightwad doesn't care. The coordinator distributes model layers across all of them and hands you one clean OpenAI-compatible endpoint. Use this when a model doesn't fit on any single machine.

  [OpenAI Client]
        |
        v
+-------------------+
|  Tightwad         |  <-- One endpoint to rule them all
|  Coordinator :8090|
+--------+----------+
         |  mmaps full GGUF, distributes layers/tensors
    +----+----+----------+
    v         v          v
+--------+ +--------+ +--------+
| Worker | | Worker | | Worker |
| NVIDIA | |  AMD   | | Metal  |
| 4070Ti | | 7900XTX| | M2 Mac |
| 16 GB  | | 24 GB  | | 11 GB  |
+--------+ +--------+ +--------+
  70B model: covered ✓

✓ Mix NVIDIA + AMD + Apple GPUs freely on the same model
✓ Run 70B+ models on consumer hardware that fits nowhere else
✓ Hot-swap models without restarting workers
✓ One OpenAI-compatible API — point any client at it, change nothing

Coordinator RAM caveat: the coordinator machine needs enough system RAM for the full model file — not just its GPU share. llama.cpp mmaps the entire GGUF before distributing tensors to workers, so a 70B Q4_K_M (~40 GB) wants ~44 GB of RAM on the coordinator. The workers only need VRAM for their slice.

One honest catch: RPC tensor-parallelism ships 100–300 MB of tensor data per inference step over the network. Run autoregressively over WiFi and a 70B pool crawls at ~2–3 tok/s. Pooling gets the model loadable. The next step gets it fast.

// step two — the killer feature

Then speculate on top

A pooled model is slow because every token is one full network round-trip. Speculative decoding fixes that: a tiny draft model proposes 32 tokens at once, and the pool verifies all 32 in a single batch — one round-trip for 32 tokens instead of 32 round-trips. The RPC overhead amortizes, and the model that fit nowhere becomes usable.

👑 THE KILLER FEATURE

02

Combined Mode — Speculation Over a Pool

When a model doesn't fit on any single machine, pool your GPUs AND speculate on top. RPC pooling alone is slow (one network round-trip per token). Speculation amortizes that — 32 tokens verified per round-trip instead of 1. Result: models that fit nowhere become usable.

  [Junk Hardware — GTX 770 2GB, laptop CPU, P400]
        | runs 1.7B–8B draft, fast & local
        | sends token IDs (bytes, not megabytes)
        v
  [Tightwad Proxy :8088]
        | sends draft to the pool for BATCH verify
        v
  [RPC GPU Pool — 4 GPUs, 52GB total, WiFi]
        | verifies 32 tokens in ONE forward pass
        v
  4.1 tok/s instead of 2.2 tok/s — 70B fits nowhere else

✓ 1.86× measured wall-clock speedup on Llama 3.3 70B (4 GPUs over WiFi)*
✓ Output mathematically identical to the target alone under greedy decoding (Leviathan guarantee)
✓ Any junk hardware can be the drafter — 2GB GPU, CPU, laptop
✓ Pool CUDA + ROCm + Metal GPUs, speculate on top

* Measured: Llama 3.1 8B draft → Llama 3.3 70B target across RTX 4070 Ti Super + RTX 3060 + RTX 2070 + M2 Metal = 52 GB VRAM over WiFi, greedy decoding (temp=0). 519 tokens in 127s (4.1 tok/s) vs 512 tokens in 231s (2.2 tok/s) direct, 33 tokens/round. Under greedy decoding the output is mathematically identical to running the 70B alone. Your results will vary with hardware and network.

The measured result, plainly

🐌

Pool direct

70B across 4 GPUs over WiFi, autoregressive: 2.2 tok/s. Loadable, but painful — every token is a full RPC round-trip.

→

🚀

Draft

Llama 3.1 8B on a Mac's Metal GPU proposes 32 tokens locally. Fast, cheap, no network.

→

🔍

Batch verify

The pool checks all 33 drafted tokens in one forward pass — one round-trip instead of 33.

→

⚡

Pool + spec

4.1 tok/s — 1.86× faster, identical output. The 70B is now actually usable.

One hard rule: the draft and target must be the same model family. Llama 3.1 8B → Llama 3.3 70B works because they share an architecture. A cross-family pairing collapses acceptance and makes things slower, not faster. Tightwad auto-detects families at proxy startup and in tightwad doctor, warning loudly on a mismatch.

// who pools

Pick your archetype

Tightwad was built by and for people who think a closet full of idle GPUs is a crime against compute.

🏠

The Homelab Hoarder

You have a 2070 in the old desktop, a 4070 in the main rig, and a Quadro doing nothing in the server rack. They're all lazy freeloaders. Tightwad makes them a team.

RPC Cluster Mode

Pool all your random GPUs into one endpoint
Run 70B models across consumer hardware
Zero wasted VRAM, zero cloud spend

💰 BIGGEST WIN

🏗️

The Budget Builder

You want 70B-class quality on a consumer GPU budget. That's not a dream — that's just math. RPC mode distributes layers across whatever you've got collecting dust, then speculation makes it fast.

Combined Mode

Llama 3.3 70B across 4× consumer GPUs over WiFi
No enterprise hardware required — 1.86× measured
Benchmark built in to tune your setup

⚡

The Mixed Vendor Maverick

You bought AMD because it was on sale and NVIDIA for the CUDA ecosystem. Now they won't cooperate. Tightwad runs CUDA and ROCm on the same model together. Finally.

RPC Cluster Mode

CUDA + ROCm + Metal on the same model
llama.cpp RPC handles the hard parts
Coordinator distributes layers intelligently

💀

The E-Waste Reviver

That GTX 770 from 2013 with 2GB of VRAM? It can't hold a big model — but it can run a 1–2B draft model for one. The old Xeon with no GPU? CPU drafting. No node left behind.

Combined Mode

2GB GPUs and CPU-only boxes draft for the pool
Turn e-waste into productive infrastructure
Every machine contributes — same model family is all it takes

// ways to wire it up

Example configurations

However your hardware is set up, there's a local config for it. Draft on a GPU, draft on a CPU — the pool verifies. Everything runs on your own machines.

MOST COMMON

GPU → POOL

Homelab / Small Teams

Draft on any GPU you have, verify on a pool of bigger ones. Mix any generations, any vendors. RTX 4070 + RTX 2070 + RX 7900 XTX + an M2 Mac — all in one cluster.

🖥️

Draft Machine

RTX 2070 · GTX 770 · M2 Metal · any GPU

→

🖥️

Target Pool

4070 Ti + 3060 + 7900 XTX · one big model

1.27x measured cross-machine speedup (8B→32B) · any mix of hardware · $0, fully local

👑 KILLER

DRAFT → RPC POOL

Model Too Big for One Machine

A 70B fits on no single card you own — so pool four of them, then draft on a tiny model to overcome RPC's per-token latency. This is Combined Mode, the validated headline.

💻

Drafter

Llama 3.1 8B · M2 Metal · same family as target

→

🖥️

RPC Pool (70B)

4070 Ti + 3060 + 2070 + M2 = 52GB over WiFi

2.2 → 4.1 tok/s · 1.86x measured · output identical (greedy) · $0, fully local

CPU → POOL

Zero GPU Required to Participate

Run a tiny 1–2B draft model on any CPU. Verify on a pooled GPU target. Your CPU-only server, your laptop, your NAS — all can contribute to the cluster.

💻

Any machine

CPU only · Qwen3-1.7B draft · even a laptop

→

🖥️

GPU Pool

Any GPUs · any big target model

Best at max_draft_tokens=32 (HTTP overhead negates gains at 8) · zero GPU to participate · $0, fully local

Config	Draft	Target	Use case	Measured
GPU → Pool	Any GPU — old, new, NVIDIA, AMD	Pool of bigger GPUs	Homelab, mixed hardware	1.27x (8B→32B)
Draft → RPC Pool	Same-family small model	70B across 4 GPUs	Model too big for one machine	1.86x (8B→70B)
CPU → Pool	Any CPU, no GPU needed	GPU pool	Zero-GPU participants	Best at draft=32

The 1.27× cross-machine figure (Qwen3-8B → Qwen3-32B over llama-server) was measured under prompt-append verification; the per-task acceptance breakdown that accompanied it was measured under the legacy text-match verifier and is being re-validated under the v0.5.1+ per-position verifier. The 1.86× pool result is greedy wall-clock and is the figure to trust.

// get pooling

Pool your hardware in one command

Install Tightwad on the machine that runs the proxy, point it at your workers, and you've got one endpoint over the whole pool.

# Install
pip install tightwad

# Auto-discover your LAN and generate a cluster config
tightwad init

# Verify config, binaries, network, versions, model families
tightwad doctor

# Start the coordinator over your pooled GPUs
tightwad start

# Then point any OpenAI client at http://localhost:8088/v1

Full homelab walkthrough → View on GitHub

// keep exploring

Run a 70B that fits on no single machine — across all of them.