Your GPUs are
Tightwad pools your mismatched CUDA + ROCm + Metal cards — even that dusty GTX 770 — into one OpenAI-compatible endpoint, so a model that fits on no single machine runs across all of them.
Then speculative decoding makes the pool fast. 1.86× measured on 70B over WiFi. Same output quality. $0 cloud bill.*
* 1.86× wall-clock measured on Llama 3.1 8B → Llama 3.3 70B across a 4-GPU RPC pool (52GB VRAM over WiFi) with greedy decoding (temperature=0). Under greedy decoding output is mathematically identical to running the target alone — speculative decoding is a pure speed optimization, not a quality tradeoff. Speedup depends on hardware, model pairing, network, and configuration. Your results will vary.
$ tightwad start
✓ Draft: Llama-3.1-8B @ localhost:8081 (M2 Metal, drafts 32 tokens/round)
✓ Pool: 4 GPUs / 52GB VRAM over WiFi (4070 Ti + 3060 + 2070 + M2 Metal)
✓ Target: Llama-3.3-70B across pool (too big for any single machine)
✓ Proxy listening on http://localhost:8088
→ 1.86× speedup | 4.1 tok/s (was 2.2) | greedy: output = target alone

$ pip install tightwad
✓ Successfully installed tightwad-0.5.4
$ tightwad proxy start
✓ Proxy listening on http://localhost:8088
→ Ready. Point your app at localhost:8088.
Two moves. That's the whole product.
Pool your GPUs so a big model runs at all — then change one URL so it runs fast. Dead simple, both of them.
One URL change
Point your chat app at port 8088 instead of 11434. That's the entire setup from your app's perspective.
The small model is invisible
You never configure it, select it, or see it. It's like autocomplete on your phone — it suggests tokens, the big model accepts or corrects. You only see the final output.
Output quality is preserved
In the default speculative-decoding mode with greedy decoding (temperature=0), output is mathematically identical to running the large model alone — the big model validates every token (the Leviathan / Chen guarantee). With sampling, output is statistically equivalent.
Nothing sits idle
The 4070 in your main rig, the 2070 in the box you almost sold, an AMD card you bought on sale, that old Xeon with no GPU — all of it contributes to one endpoint.
That's it. Pool your hardware. Change one URL. Run bigger models, faster.
Set It Up in 30 Minutes →
How a small model makes a big one fast
Speculative decoding is what Google and DeepMind already use to accelerate frontier models. Tightwad puts it on your pooled hardware.
Draft
A small model blazes through 32 candidate tokens at ~30+ tok/s. Fast and cheap — runs on any junk GPU or a CPU.
Verify
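The big model (across your pool) evaluates all 32 tokens in a single forward pass. Verifying the batch costs little more than generating one token, because the per-pass overhead is paid once for the whole batch.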
Accept
Keep the drafted tokens up to the first disagreement. At that position, take the big model's token and discard the rest of the draft.
Stream
Accepted tokens stream to your app instantly. Output quality is equivalent to the target model alone.
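The accept step can be sketched in a few lines of Python. This is a minimal illustration of greedy draft verification, not Tightwad's actual code: the function name `accept_greedy` is hypothetical, and it assumes the target returns one more token than the draft (its own choice for the position after the last drafted token), which is how the technique is usually described.

```python
from typing import List, Sequence


def accept_greedy(draft: Sequence[int], target: Sequence[int]) -> List[int]:
    """Return the tokens to emit for one speculation round (greedy decoding).

    draft:  k token IDs proposed by the small model.
    target: k + 1 token IDs, where target[i] is the big model's greedy choice
            after the prompt plus draft[:i]; all come from one batched pass.
    """
    if len(target) != len(draft) + 1:
        raise ValueError("target must hold exactly one more token than draft")
    out: List[int] = []
    for i, tok in enumerate(draft):
        if tok != target[i]:
            # First disagreement: emit the big model's token, drop the rest.
            out.append(target[i])
            return out
        out.append(tok)
    # Every draft token matched, so the extra target token is emitted too.
    out.append(target[len(draft)])
    return out


print(accept_greedy([5, 9, 2, 7], [5, 9, 4, 1, 8]))  # [5, 9, 4]
print(accept_greedy([5, 9], [5, 9, 3]))              # [5, 9, 3]
```

Every emitted token is one the big model would have chosen itself, which is why greedy output matches the target alone.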
Pick your mode. Stack them.
Six inference modes — pool, speculate, race, cluster, gate, distribute. Run one or run all six. Full details on each →
Combined Mode — Speculation Over a Pool
When a model doesn't fit on one machine, pool the GPUs and speculate on top. Batch verification amortizes the RPC overhead — 32 tokens per round instead of 1. 1.86× measured on Llama 3.3 70B across 4 GPUs over WiFi.
- ✓ Run models that fit nowhere else
- ✓ 2.2 → 4.1 tok/s on the 70B pool
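To see why batching amortizes the RPC overhead, here is a rough back-of-the-envelope model. It assumes each drafted token is accepted independently with probability `alpha`, which is a simplification, and the `alpha` values are hypothetical rather than measured. Under that assumption the expected number of tokens emitted per round with `k` drafted tokens is (1 - alpha^(k+1)) / (1 - alpha), so the pool runs fewer expensive target passes per token.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens emitted per round, assuming each draft token is
    accepted independently with probability alpha (a simplification)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if alpha == 1.0:
        # Every draft token accepted, plus the target's extra token.
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


# Hypothetical acceptance rates, with 32 drafted tokens per round.
for alpha in (0.5, 0.8, 0.9):
    tokens = expected_tokens_per_round(alpha, k=32)
    print(f"alpha={alpha}: {tokens:.2f} tokens/round, "
          f"{1 / tokens:.2f} target passes per token")
```

Without speculation the pool pays one full pass, and one set of network round trips, for every token.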
Speculative Decoding Proxy
A fast draft model proposes tokens; a large target verifies them in batch. Output equivalent to the target alone, up to 2× faster. Ships token IDs (bytes), not tensor data. Drop-in OpenAI/SSE compatible.
- ✓ One URL change for your app
- ✓ Live dashboard at /dashboard
Multi-Drafter Consensus (approximate mode)
Race several cheap drafters in parallel. When they agree, the expensive target verification is skipped entirely — more drafters, more tokens bypass the bottleneck. Three modes: strict, majority, any_disagree.
- ✓ Skip the GPU when drafters agree
- ✓ Opt-in; consensus tokens may differ from the target
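As an illustration of the consensus idea, the sketch below decides whether a set of drafter proposals agrees strongly enough to skip the target. The semantics are assumptions made for this example, not taken from Tightwad: `strict` is read as "all drafters agree" and `majority` as "more than half agree". The `any_disagree` mode is left out because its exact meaning is not described here.

```python
from collections import Counter
from typing import Optional, Sequence


def consensus(proposals: Sequence[int], mode: str = "strict") -> Optional[int]:
    """Return the agreed token ID, or None to fall back to target verification.

    Hypothetical semantics (not taken from Tightwad):
      strict   -> every drafter proposed the same token
      majority -> more than half of the drafters proposed the same token
    """
    if not proposals:
        return None
    token, votes = Counter(proposals).most_common(1)[0]
    if mode == "strict":
        return token if votes == len(proposals) else None
    if mode == "majority":
        return token if votes * 2 > len(proposals) else None
    raise ValueError(f"unsupported mode in this sketch: {mode}")


print(consensus([7, 7, 7]))              # 7
print(consensus([7, 7, 3]))              # None
print(consensus([7, 7, 3], "majority"))  # 7
```

A `None` result means the token still goes to the target for verification, so only agreed tokens bypass it.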
RPC Cluster
Pool CUDA + ROCm + Metal GPUs across machines into one OpenAI-compatible endpoint using llama.cpp RPC. The coordinator distributes model layers across local and remote GPUs. Hot-swap models without restarting workers.
- ✓ Mix NVIDIA + AMD + Apple freely
- ✓ 70B+ on consumer hardware
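One simple way to think about layer distribution is a split proportional to each GPU's VRAM. The sketch below is an assumption about the general idea, not a description of how the coordinator actually places layers; a real split would also need to budget for the KV cache and other buffers. The function name `split_layers` is hypothetical, and the two-GPU pool is illustrative.

```python
from typing import Dict


def split_layers(n_layers: int, vram_gb: Dict[str, float]) -> Dict[str, int]:
    """Assign n_layers to devices in proportion to their VRAM, using
    largest-remainder rounding so the counts always sum to n_layers."""
    total = sum(vram_gb.values())
    if total <= 0:
        raise ValueError("total VRAM must be positive")
    exact = {name: n_layers * gb / total for name, gb in vram_gb.items()}
    counts = {name: int(share) for name, share in exact.items()}
    leftover = n_layers - sum(counts.values())
    # Hand the remaining layers to the devices with the largest fractions.
    by_remainder = sorted(exact, key=lambda n: exact[n] - counts[n], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


# Illustrative pool: an 80-layer model over a 24 GB and a 16 GB card.
pool = {"7900 XTX #0": 24, "RTX 4070 Ti Super": 16}
print(split_layers(80, pool))  # {'7900 XTX #0': 48, 'RTX 4070 Ti Super': 32}
```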
Quality Gate — CPU Fleet Drafts, GPU Reviews
Full-response level, not token-level. A fleet of cheap machines generates complete responses with small models; one powerful GPU reviews each — approving, correcting, or rejecting. ~60–80% pass unchanged, so the GPU only handles the hard ones.
- ✓ Put idle CPUs to work
- ✓ tightwad gate start
Swarm Transfer — P2P Model Distribution
Pulling a 40GB+ GGUF to every worker wastes hours and bandwidth. Tightwad splits models into 64 MB SHA256-verified pieces and lets workers pull from any peer that has them — rarest-first selection, resume on interrupt, delta updates, zero central server.
- ✓ Multi-source parallel download
- ✓ Delta updates for new quants
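The two core ideas, hashed pieces and rarest-first selection, can be sketched with the standard library. This is a simplified illustration rather than Tightwad's implementation: the function names are hypothetical, the data is held in memory (a real tool would stream a 40GB file from disk), and peers are modeled as plain sets of piece indices.

```python
import hashlib
from collections import Counter
from typing import Dict, List, Optional, Set

PIECE_SIZE = 64 * 1024 * 1024  # 64 MB pieces, as described above


def piece_hashes(data: bytes, piece_size: int = PIECE_SIZE) -> List[str]:
    """SHA-256 hex digest of each fixed-size piece (the last may be shorter)."""
    return [hashlib.sha256(data[i:i + piece_size]).hexdigest()
            for i in range(0, len(data), piece_size)]


def verify_piece(piece: bytes, expected_hex: str) -> bool:
    """Check a downloaded piece against its expected digest."""
    return hashlib.sha256(piece).hexdigest() == expected_hex


def next_piece(have: Set[int], peers: Dict[str, Set[int]]) -> Optional[int]:
    """Rarest-first: among pieces we lack, pick the one held by the fewest peers."""
    availability = Counter(
        idx for held in peers.values() for idx in held if idx not in have
    )
    if not availability:
        return None
    # Break ties by piece index so the choice is deterministic.
    return min(availability, key=lambda idx: (availability[idx], idx))


# Tiny demo with 4-byte pieces instead of 64 MB.
hashes = piece_hashes(b"abcdefghij", piece_size=4)
peers = {"a": {0, 1, 2}, "b": {0, 1}, "c": {0}}
print(len(hashes), next_piece({0}, peers))  # 3 2
```

Fetching the rarest piece first keeps scarce pieces from disappearing if the only peer holding them goes offline.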
The headline number, and how we got it
Llama 3.1 8B → Llama 3.3 70B, across a 4-GPU RPC pool (52GB VRAM, over WiFi), greedy decoding (temperature=0). Throughput went from 2.2 → 4.1 tok/s — and the 70B fits on no single machine in the pool. Under greedy decoding the output is mathematically identical to running the target alone.
Other pairings, acceptance rates, and the cloud-API caveats live on the benchmarks page — including which numbers are being re-validated under the v0.5.1+ per-position verifier.
See full benchmarks →
This isn’t magic: it’s what Google and DeepMind already use
Speculative decoding powers production inference at the biggest players. Tightwad just puts it on your hardware without the data center.
Fast Inference from Transformers via Speculative Decoding
The foundational paper. Introduces the draft-verify loop and proves output equivalence under greedy decoding.
arxiv.org/abs/2211.17192 →
Accelerating Large Language Model Decoding with Speculative Sampling
An independent, concurrent formulation that extends the technique to stochastic sampling using rejection sampling.
arxiv.org/abs/2302.01318 →
Looking Back at Speculative Decoding
Plain-English retrospective from the original authors covering production deployment, what held up, and what didn’t.
research.google →
Tightwad is independent open-source software (MIT) with no affiliation, endorsement, or commercial relationship with Google, Google DeepMind, or the listed authors. Citations are nominative fair use of public academic publications.
Quick Start
No Docker Compose files with 300 environment variables. No Kubernetes YAML. Just Python and one config file.
Install
$ pip install tightwad
# or from source:
$ git clone https://github.com/youngharold/tightwad.git
$ cd tightwad && pip install .
Configure your hardware
proxy:
  host: 0.0.0.0
  port: 8088
  max_draft_tokens: auto  # auto-tunes based on acceptance rate
  fallback_on_draft_failure: true
  draft:
    url: http://192.168.1.50:11434  # Your cheap GPU (Ollama)
    model_name: qwen3:8b
    backend: ollama
  target:
    url: http://192.168.1.100:11434  # Your big GPU (Ollama)
    model_name: qwen3:32b
    backend: ollama
Start it & test
$ tightwad proxy start
✓ Draft model healthy
✓ Target model healthy
✓ Proxy listening on http://localhost:8088

# Test it (drop-in for any OpenAI SDK call)
$ curl http://localhost:8088/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 50}'

# Check live stats
$ tightwad proxy status
→ Rounds: 34 | Draft tokens/round: 32 | Mode: greedy (exact)
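The same test request can be sent from Python using only the standard library. This sketch mirrors the curl call above; the helper names `build_request` and `chat` are made up for the example, and the response parsing assumes the standard OpenAI chat-completions shape (`choices[0].message.content`), which an OpenAI-compatible endpoint is expected to return.

```python
import json
import urllib.request

PROXY_URL = "http://localhost:8088/v1/chat/completions"


def build_request(prompt: str, max_tokens: int = 50) -> urllib.request.Request:
    """Build the same POST request as the curl example."""
    body = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        PROXY_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the prompt to the proxy and return the reply text."""
    with urllib.request.urlopen(build_request(prompt), timeout=120) as resp:
        reply = json.load(resp)
    # Assumes the standard OpenAI chat-completions response shape.
    return reply["choices"][0]["message"]["content"]
```

With the proxy running, `print(chat("Hello"))` should print the model's reply. Existing OpenAI SDK code only needs its base URL pointed at the proxy.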
Build RPC workers (CUDA — Windows/Linux)
# Or use scripts/install-worker.sh
$ cmake -B build -DGGML_CUDA=ON -DGGML_RPC=ON
$ cmake --build build --config Release
$ build/bin/rpc-server -p 50052  # GPU 0
Configure cluster topology
coordinator:
  host: 0.0.0.0
  port: 8080
  backend: hip  # or cuda
  gpus:
    - name: "7900 XTX #0"
      vram_gb: 24
workers:
  - host: 192.168.1.100  # NVIDIA box
    gpus:
      - name: "RTX 4070 Ti Super"
        vram_gb: 16
        rpc_port: 50052
models:
  llama-3.3-70b:
    path: /models/Llama-3.3-70B-Q4_K_M.gguf
    ctx_size: 8192
    flash_attn: true
    default: true
Start the cluster
$ tightwad start
✓ Coordinator started
✓ Worker @ 192.168.1.100:50052 online
✓ Model llama-3.3-70b loaded across 52 GB VRAM

# Hot-swap to a different model anytime
$ tightwad swap deepseek-r1-70b

# Run the benchmark
$ tightwad benchmark
Want the full walkthrough on real mismatched hardware?
Homelab cluster in 30 minutes →
Why not vLLM, Ollama, or plain llama.cpp?
Each is great at something. Tightwad's lane is the awkward middle: mismatched, network-separated, consumer hardware.
Honest, side-by-side — including when not to use Tightwad.
Full comparison →