v0.5.4

Your GPUs are

Tightwad pools your mismatched CUDA + ROCm + Metal cards — even that dusty GTX 770 — into one OpenAI-compatible endpoint, so a model that fits on no single machine runs across all of them.
Then speculative decoding makes the pool fast. 1.86× measured on 70B over WiFi. Same output quality. $0 cloud bill.*

70B runs across 4 GPUs that each fit ~13GB
1.86× measured speedup (70B pooled)*
= output identical to target alone (greedy)
$0 cloud bill — runs fully local

* 1.86× wall-clock measured on Llama 3.1 8B → Llama 3.3 70B across a 4-GPU RPC pool (52GB VRAM over WiFi) with greedy decoding (temperature=0). Under greedy decoding output is mathematically identical to running the target alone — speculative decoding is a pure speed optimization, not a quality tradeoff. Speedup depends on hardware, model pairing, network, and configuration. Your results will vary.

terminal — your junk drawer, unified
$ tightwad start
 Draft:  Llama-3.1-8B  @ localhost:8081   (M2 Metal — drafts 32 tokens/round)
 Pool:   4 GPUs / 52GB VRAM over WiFi    (4070 Ti + 3060 + 2070 + M2 Metal)
 Target: Llama-3.3-70B across pool       (too big for any single machine)
 Proxy listening on http://localhost:8088
 1.86× speedup | 4.1 tok/s (was 2.2) | greedy: output = target alone
or just pip install
$ pip install tightwad

 Successfully installed tightwad-0.5.4

$ tightwad proxy start
 Proxy listening on http://localhost:8088
 Ready. Point your app at localhost:8088.
📦 PyPI · Python 3.10+

Two moves. That's the whole product.

Pool your GPUs so a big model runs at all — then change one URL so it runs fast. Dead simple, both of them.

1 Pool it — so the model runs at all
WON'T FIT
🧠 Llama 3.3 70B · ~40GB at Q4
💥 RTX 3060 · 12GB · CUDA out of memory
✗ Too big for any single card you own.
POOLED
🎮 4070 Ti · 16GB · NVIDIA
🎮 3060 · 12GB · NVIDIA
🎮 2070 · 8GB · NVIDIA
🍎 M2 · 16GB · Metal
🐷 Tightwad :8088 · 52GB pooled · one endpoint
✓ A model that fit on nothing now runs on everything.
2 Speed it up — change one URL
BEFORE
💬 Open WebUI · your chat app
🐢 Ollama :11434 · Llama 3.3 70B (slow)
Base URL: http://192.168.1.10:11434
⏳ Every token generated one at a time. Waiting.
AFTER
💬 Open WebUI · same app, no changes
🐷 Tightwad :8088 · invisible proxy · 70B across the pool
same output quality, 1.86× faster
Base URL: http://192.168.1.10:8088 ← only change
✓ Equivalent output quality. Just faster.
🔗

One URL change

Point your chat app at port 8088 instead of 11434. That's the entire setup from your app's perspective.

🫥

The small model is invisible

You never configure it, select it, or see it. It's like autocomplete on your phone — it suggests tokens, the big model accepts or corrects. You only see the final output.

🔬

Output quality is preserved

In the default speculative-decoding mode with greedy decoding (temperature=0), output is mathematically identical to running the large model alone — the big model validates every token (the Leviathan / Chen guarantee). With sampling, output is statistically equivalent.

🧩

Nothing sits idle

The 4070 in your main rig, the 2070 in the box you almost sold, an AMD card you bought on sale, that old Xeon with no GPU — all of it contributes to one endpoint.

That's it. Pool your hardware. Change one URL. Run bigger models, faster.

Set It Up in 30 Minutes →

How a small model makes a big one fast

Speculative decoding is what Google and DeepMind already use to accelerate frontier models. Tightwad puts it on your pooled hardware.

🚀

Draft

A small model blazes through 32 candidate tokens at 30+ tok/s. Fast and cheap: it runs on any junk GPU or a CPU.

🔍

Verify

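The big model (across your pool) evaluates all 32 tokens in a single forward pass. Checking the whole batch costs about the same as generating one token, so the per-round overhead is paid once instead of 32 times.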

Accept

Keep every token both models agree on. Take the big model's token at the first disagreement.

📡

Stream

Accepted tokens stream to your app instantly. Output quality is equivalent to the target model alone.
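The Accept step above can be sketched in a few lines of Python. This is a minimal illustration of the greedy acceptance rule, not Tightwad's actual code: the function name `verify_greedy` and the `target_argmax` input are hypothetical, and the "bonus" token emitted when every draft token matches is a standard speculative-decoding detail assumed here rather than stated on this page.

```python
from typing import List, Sequence


def verify_greedy(draft: Sequence[int], target_argmax: Sequence[int]) -> List[int]:
    """Return the tokens emitted for one speculation round under greedy decoding.

    draft: token IDs proposed by the small model (length k).
    target_argmax: the big model's greedy choice at each of the k draft
        positions, plus one extra position after the last draft token
        (length k + 1), all taken from a single forward pass.
    """
    if len(target_argmax) != len(draft) + 1:
        raise ValueError("target_argmax must have one more entry than draft")

    emitted: List[int] = []
    for proposed, verified in zip(draft, target_argmax):
        if proposed != verified:
            # First disagreement: take the big model's token and stop.
            emitted.append(verified)
            return emitted
        emitted.append(proposed)

    # Every draft token matched, so the extra target token is emitted too.
    emitted.append(target_argmax[-1])
    return emitted
```

Every emitted token is one the big model would have chosen itself at that position, which is why greedy output matches the target alone. A round always yields at least one token, even when the first draft token is rejected.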

Pick your mode. Stack them.

Six inference modes — pool, speculate, race, cluster, gate, distribute. Run one or run all six. Full details on each →

👑 THE KILLER FEATURE
01

Combined Mode — Speculation Over a Pool

When a model doesn't fit on one machine, pool the GPUs and speculate on top. Batch verification amortizes the RPC overhead — 32 tokens per round instead of 1. 1.86× measured on Llama 3.3 70B across 4 GPUs over WiFi.

  • Run models that fit nowhere else
  • 2.2 → 4.1 tok/s on the 70B pool
How pooling works →
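To see why drafting many tokens per round helps, here is a rough back-of-the-envelope model. It assumes each draft token is accepted independently with the same probability, which is the simplification used in the speculative decoding literature and not a measured property of any Tightwad pairing. The function name and parameters are hypothetical.

```python
def expected_tokens_per_round(acceptance: float, draft_len: int) -> float:
    """Expected tokens emitted per verification round.

    Assumes each draft token is accepted independently with probability
    `acceptance`. A round emits the accepted prefix plus one token from
    the target, so the expectation is the geometric sum
    1 + a + a^2 + ... + a^draft_len.
    """
    if not 0.0 <= acceptance <= 1.0:
        raise ValueError("acceptance must be between 0 and 1")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if acceptance == 1.0:
        # Closed form below would divide by zero; every token is accepted.
        return float(draft_len + 1)
    return (1.0 - acceptance ** (draft_len + 1)) / (1.0 - acceptance)
```

With an acceptance rate of 0 a round still yields one token (the target's own), and with an acceptance rate of 1 it yields `draft_len + 1`. Each round costs one pass through the pool, so the more tokens a round yields, the fewer network round trips each token pays for.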
02

Speculative Decoding Proxy

A fast draft model proposes tokens; a large target verifies them in batch. Output equivalent to the target alone, up to 2× faster. Ships token IDs (bytes), not tensor data. Drop-in OpenAI/SSE compatible.

  • One URL change for your app
  • Live dashboard at /dashboard
Mode details →
03

Multi-Drafter Consensus (approximate mode)

Race several cheap drafters in parallel. When they agree, the expensive target verification is skipped entirely — more drafters, more tokens bypass the bottleneck. Three modes: strict, majority, any_disagree.

  • Skip the GPU when drafters agree
  • Opt-in; consensus tokens may differ from the target
Mode details →
04

RPC Cluster

Pool CUDA + ROCm + Metal GPUs across machines into one OpenAI-compatible endpoint using llama.cpp RPC. The coordinator distributes model layers across local and remote GPUs. Hot-swap models without restarting workers.

  • Mix NVIDIA + AMD + Apple freely
  • 70B+ on consumer hardware
Pool your GPUs →
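One simple way to picture "distributes model layers" is a split in proportion to each card's VRAM. The sketch below is only an illustration of that idea: the function `split_layers` is hypothetical, and the real placement done by Tightwad and llama.cpp may weigh other factors (context size, backend overhead, network speed).

```python
from typing import Dict


def split_layers(vram_gb: Dict[str, float], n_layers: int) -> Dict[str, int]:
    """Assign n_layers to devices in proportion to their VRAM.

    Uses the largest-remainder method so the counts always sum to n_layers.
    """
    total = sum(vram_gb.values())
    if total <= 0:
        raise ValueError("total VRAM must be positive")
    if n_layers < 0:
        raise ValueError("n_layers must be non-negative")

    # Ideal fractional share for each device, then round down.
    shares = {name: n_layers * gb / total for name, gb in vram_gb.items()}
    counts = {name: int(share) for name, share in shares.items()}

    # Hand the leftover layers to the devices with the largest fractions.
    leftover = n_layers - sum(counts.values())
    by_remainder = sorted(
        shares, key=lambda name: shares[name] - counts[name], reverse=True
    )
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts
```

For the four-card pool shown earlier (16, 12, 8, and 16 GB) and an assumed 80-layer model, this gives 25, 18, 12, and 25 layers.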
05

Quality Gate — CPU Fleet Drafts, GPU Reviews

This mode works at the full-response level, not the token level. A fleet of cheap machines generates complete responses with small models; one powerful GPU reviews each one and approves, corrects, or rejects it. ~60–80% pass unchanged, so the GPU only handles the hard ones.

  • Put idle CPUs to work
  • tightwad gate start
Mode details →
06

Swarm Transfer — P2P Model Distribution

Pulling a 40GB+ GGUF to every worker wastes hours and bandwidth. Tightwad splits models into 64 MB SHA256-verified pieces and lets workers pull from any peer that has them — rarest-first selection, resume on interrupt, delta updates, zero central server.

  • Multi-source parallel download
  • Delta updates for new quants
Mode details →
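The two core ideas in Swarm Transfer, hashed fixed-size pieces and rarest-first selection, can be sketched with the Python standard library. This is an illustrative sketch only: the function names and the peer map shape are hypothetical, and a real implementation would stream pieces from disk instead of holding a 40GB file in memory.

```python
import hashlib
from collections import Counter
from typing import Dict, List, Optional, Set

PIECE_SIZE = 64 * 1024 * 1024  # 64 MB pieces, as described above


def piece_hashes(data: bytes, piece_size: int = PIECE_SIZE) -> List[str]:
    """Split data into fixed-size pieces and return each piece's SHA-256 hex digest."""
    return [
        hashlib.sha256(data[offset:offset + piece_size]).hexdigest()
        for offset in range(0, len(data), piece_size)
    ]


def next_piece(have: Set[int], peers: Dict[str, Set[int]]) -> Optional[int]:
    """Pick the missing piece held by the fewest peers (rarest first).

    have: indices of pieces this worker already holds.
    peers: hypothetical map of peer name -> piece indices that peer can serve.
    Returns None when no peer offers a piece we still need.
    """
    availability = Counter(
        index
        for pieces in peers.values()
        for index in pieces
        if index not in have
    )
    if not availability:
        return None
    # Ties break toward the lowest piece index so the choice is deterministic.
    return min(availability, key=lambda index: (availability[index], index))
```

A worker verifies each downloaded piece by hashing it and comparing against the expected digest, so a corrupt or interrupted piece can be fetched again from any other peer without restarting the whole download.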

The headline number, and how we got it

1.86× wall-clock speedup

Llama 3.1 8B → Llama 3.3 70B, across a 4-GPU RPC pool (52GB VRAM, over WiFi), greedy decoding (temperature=0). Throughput went from 2.2 → 4.1 tok/s — and the 70B fits on no single machine in the pool. Under greedy decoding the output is mathematically identical to running the target alone.

Other pairings, acceptance rates, and the cloud-API caveats live on the benchmarks page — including which numbers are being re-validated under the v0.5.1+ per-position verifier.

See full benchmarks →

Quick Start

No Docker Compose files with 300 environment variables. No Kubernetes YAML. Just Python and one config file.

1

Install

bash
$ pip install tightwad
# or from source:
$ git clone https://github.com/youngharold/tightwad.git
$ cd tightwad && pip install .
2

Configure your hardware

configs/cluster.yaml
proxy:
  host: 0.0.0.0
  port: 8088
  max_draft_tokens: auto          # auto-tunes based on acceptance rate
  fallback_on_draft_failure: true
  draft:
    url: http://192.168.1.50:11434  # Your cheap GPU (Ollama)
    model_name: qwen3:8b
    backend: ollama
  target:
    url: http://192.168.1.100:11434   # Your big GPU (Ollama)
    model_name: qwen3:32b
    backend: ollama
3

Start it & test

bash
$ tightwad proxy start
 Draft model healthy
 Target model healthy
 Proxy listening on http://localhost:8088

# Test it (drop-in for any OpenAI SDK call)
$ curl http://localhost:8088/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 50}'

# Check live stats
$ tightwad proxy status
 Rounds: 34 | Draft tokens/round: 32 | Mode: greedy (exact)
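The same test request can be made from Python using only the standard library. This is a minimal sketch that mirrors the curl call above: the helper names are hypothetical, the response parsing assumes the standard OpenAI chat-completions shape, and like the curl example it omits a `model` field.

```python
import json
import urllib.request

PROXY_URL = "http://localhost:8088"  # proxy address from the Quick Start above


def build_chat_request(
    prompt: str, base_url: str = PROXY_URL, max_tokens: int = 50
) -> urllib.request.Request:
    """Build the same POST request as the curl example, without sending it."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, base_url: str = PROXY_URL) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, base_url)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    # Assumes the standard OpenAI chat-completions response shape.
    return body["choices"][0]["message"]["content"]
```

Calling `chat("Hello")` with the proxy running should return the model's reply. Switching between Ollama and Tightwad is then a matter of changing `base_url`.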
1

Build RPC workers (CUDA — Windows/Linux)

bash (worker machine)
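# Run inside a llama.cpp checkout (or use scripts/install-worker.sh)
$ cmake -B build -DGGML_CUDA=ON -DGGML_RPC=ON
$ cmake --build build --config Release
# rpc-server listens on 127.0.0.1 by default; pass -H 0.0.0.0 on a trusted LAN so the coordinator can connect
$ build/bin/rpc-server -p 50052  # GPU 0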
2

Configure cluster topology

configs/cluster.yaml
coordinator:
  host: 0.0.0.0
  port: 8080
  backend: hip  # or cuda
  gpus:
    - name: "7900 XTX #0"
      vram_gb: 24

workers:
  - host: 192.168.1.100  # NVIDIA box
    gpus:
      - name: "RTX 4070 Ti Super"
        vram_gb: 16
    rpc_port: 50052

models:
  llama-3.3-70b:
    path: /models/Llama-3.3-70B-Q4_K_M.gguf
    ctx_size: 8192
    flash_attn: true
    default: true
3

Start the cluster

bash
$ tightwad start
 Coordinator started
 Worker @ 192.168.1.100:50052 online
 Model llama-3.3-70b loaded across 52 GB VRAM

# Hot-swap to a different model anytime
$ tightwad swap deepseek-r1-70b

# Run the benchmark
$ tightwad benchmark

Want the full walkthrough on real mismatched hardware?

Homelab cluster in 30 minutes →

Why not vLLM, Ollama, or plain llama.cpp?

Each is great at something. Tightwad's lane is the awkward middle: mismatched, network-separated, consumer hardware.

vLLM / TGI

Built for homogeneous datacenter GPUs and raw throughput. They won't pool a 3060 + an AMD card + a laptop over WiFi. Use them when you have matching cards in one box.

Ollama

Great single-machine runner. But it doesn't pool GPUs across machines or speculate across a network. Tightwad sits in front of Ollama and uses it as a backend.

llama.cpp RPC

The pooling primitive Tightwad builds on. Tightwad adds auto-discovery, config, speculative decoding on top, MoE placement, and a single OpenAI-compatible endpoint.

Honest, side-by-side — including when not to use Tightwad.

Full comparison →