Tightwad pools your mixed CUDA + ROCm GPUs into a single OpenAI-compatible endpoint.
Speculative decoding proxy: draft fast, verify smart, stream everything.
Same output quality. 1.86× measured on 70B.* Zero cloud bill (fully local setup).
* 1.86× wall-clock measured on Llama 3.1 8B → Llama 3.3 70B across a 4-GPU RPC pool (52GB VRAM over WiFi) with greedy decoding (temperature=0). Under greedy decoding, output is mathematically identical to running the target alone: speculative decoding is a pure speed optimization, not a quality tradeoff. Speedup depends on hardware, model pairing, network, and configuration. Your results will vary.
$ pip install tightwad
✓ Collecting tightwad
✓ Installing collected packages: tightwad
✓ Successfully installed tightwad-0.5.2
$ tightwad proxy start
✓ Proxy listening on http://localhost:8088
→ Ready. Point your app at localhost:8088.
Most people don't get it at first. So here it is, dead simple. One change. That's it.
Point your chat app at port 8088 instead of 11434. That's the entire setup from your app's perspective.
You never configure it, select it, or see it. It's like autocomplete on your phone: it suggests tokens, the big model accepts or corrects. You only see the final output.
In the default speculative-decoding mode with greedy decoding (temperature=0), output is mathematically identical to running the large model alone, because the big model validates every token (the Leviathan / Chen guarantee). With sampling, output is statistically equivalent. Multi-Drafter Consensus is a separate opt-in mode that trades that exactness for speed by skipping the target when drafters unanimously agree. It's labeled as approximate consensus, not the same guarantee.
With same-family models and greedy decoding (Llama 3.1 8B → Llama 3.3 70B), most draft tokens match the target's argmax and ship straight through. The target validates every position, so output under greedy decoding is mathematically identical to running the target alone. Acceptance benchmarks are being re-measured under v0.5.2's per-position verification.
That's it. Change one URL. Get up to 2-3x faster responses. Same quality.
Set It Up in 20 Minutes →
Pick your poison. Stack them. Run all six. Tightwad doesn't judge; it just saves you cash.
When a model doesn't fit on any single machine, pool your GPUs AND speculate on top. RPC pooling alone is slow (one network round-trip per token). Speculation amortizes that — 32 tokens verified per round-trip instead of 1. Result: models that fit nowhere become usable.
[Junk Hardware: P400 2GB, GTX 770, laptop CPU]
| runs 1.7B draft, ~30 tok/s
| sends token IDs (bytes)
v
[Tightwad Proxy :8088]
| sends draft to pool for BATCH verify
v
[RPC GPU Pool: 4 GPUs, 52GB total, WiFi]
| verifies 32 tokens in ONE forward pass
v
4.1 tok/s instead of 2.2 tok/s (70B fits nowhere else)
* Measured: Llama 3.1 8B draft → Llama 3.3 70B target across RTX 4070 Ti Super + RTX 3060 + RTX 2070 + M2 Metal = 52GB VRAM over WiFi. 519 tokens in 127s vs 512 tokens in 231s direct. Your results will vary with hardware and network conditions.
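The amortization argument above boils down to counting round-trips to the pool. Here is a minimal sketch of that count, assuming every round emits a fixed number of tokens (real runs won't be that tidy). The function name is made up for illustration and is not part of Tightwad.

```python
import math


def target_round_trips(total_tokens: int, tokens_per_round: int) -> int:
    """Round-trips to the target pool needed to emit total_tokens.

    Assumption: each round emits exactly tokens_per_round tokens.
    """
    return math.ceil(total_tokens / tokens_per_round)


# One network round-trip per token (plain RPC pooling).
autoregressive = target_round_trips(512, 1)

# Best case with speculation: all 32 drafted tokens accepted every round.
speculative = target_round_trips(512, 32)

print(autoregressive, speculative)
```

Lower acceptance means fewer tokens per round and more round-trips, which is why the measured speedup is well below the best-case ratio.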
Your cheap GPU isn't slow — it's a draft engine. A fast small model guesses tokens. A big model batch-verifies them. Same output quality as running the big model alone. Ships token IDs (bytes), not 100–300 MB of tensor data over the wire.
[Your App / OpenAI SDK]
|
v
+--------------------------+
| Tightwad Proxy :8088 |
| |
| 1. Draft 32 tokens -----+--> Qwen3-8B
| (~100 tok/s, cheap) | RTX 2070 (the dusty one)
| |
| 2. Verify batch --------+--> Llama 3.3 70B
| (one forward pass) | 4070Ti / Cloud API
| |
| 3. Accept/reject <------+
| 4. Stream to client |
+--------------------------+
Output quality = equivalent to 70B alone ✓
Race multiple cheap machines simultaneously. Each drafter generates candidate tokens in parallel. When they all agree, the expensive GPU verification is skipped entirely. That’s the speed win. Tradeoff: this is an approximate mode, not exact speculative decoding. Skipping the target means consensus-accepted tokens may differ from what the target alone would produce (typically 3–6% divergence). Three sub-modes: strict, majority, any_disagree. Pick your acceptable risk. Off by default.
[Tightwad Proxy :8088]
|
| races all drafters in parallel
|
+----+----+----+
v v v v
[M2] [CPU] [2070] [P400]
8B 8B 8B 1.7B
| | | |
+----+----+----+-+
|
v
Consensus? --yes--> Stream tokens (GPU never touched)
|
no
|
v
[Target 70B GPU] --> Verify only disagreed tokens
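The voting step in the diagram can be sketched in a few lines. This is a hypothetical illustration, not Tightwad's implementation: it assumes "strict" means every drafter proposes the same token and "majority" means more than half do. The page doesn't spell out the any_disagree semantics, so that mode is left out.

```python
from collections import Counter
from typing import Optional, Sequence


def consensus_token(proposals: Sequence[int], mode: str = "strict") -> Optional[int]:
    """Return the agreed token ID, or None if the target must verify.

    Assumed semantics (not confirmed by the docs):
      strict   -> all drafters proposed the same token
      majority -> more than half of the drafters proposed the same token
    """
    if not proposals:
        return None
    token, votes = Counter(proposals).most_common(1)[0]
    if mode == "strict":
        return token if votes == len(proposals) else None
    if mode == "majority":
        return token if votes * 2 > len(proposals) else None
    raise ValueError(f"unsupported mode: {mode}")


print(consensus_token([5, 5, 5, 5]))              # all four drafters agree
print(consensus_token([5, 5, 5, 7]))              # strict: falls back to the target
print(consensus_token([5, 5, 5, 7], "majority"))  # majority: 3 of 4 agree
```

A `None` result is the "no" branch of the diagram: the position goes to the 70B target for verification.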
strict, majority, any_disagree
Got GPUs scattered across machines? Pool them. CUDA on one box, ROCm on another: Tightwad doesn't care. It distributes model layers across all of them and hands you one clean OpenAI-compatible endpoint.
[OpenAI Client]
|
v
+-------------------+
| Tightwad | <-- One endpoint to rule them all
| Coordinator :8080|
+--------+----------+
| distributes layers
+----+----+
v v
+--------+ +--------+
| Worker | | Worker |
| NVIDIA | | AMD |
| 4070Ti | | 7900XTX|
| 16 GB | | 24 GB |
+--------+ +--------+
70B model: covered ✓
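To make "distributes layers" concrete, here is a rough sketch that splits a layer count across devices in proportion to VRAM. The proportional policy and the function name are assumptions for illustration; the page does not describe the coordinator's actual split logic.

```python
def split_layers(total_layers: int, vram_gb: dict[str, float]) -> dict[str, int]:
    """Assign layer counts proportional to VRAM (largest-remainder rounding)."""
    total_vram = sum(vram_gb.values())
    exact = {name: total_layers * gb / total_vram for name, gb in vram_gb.items()}
    counts = {name: int(share) for name, share in exact.items()}
    leftover = total_layers - sum(counts.values())
    # Hand any remaining layers to the devices with the largest fractional share.
    by_remainder = sorted(exact, key=lambda n: exact[n] - counts[n], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


# The two workers from the diagram; 80 layers is an illustrative layer count.
print(split_layers(80, {"4070Ti": 16, "7900XTX": 24}))
```

A real split also has to leave room for KV cache and context buffers, so treat this as the first-order idea only.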
Different from token-level speculation — this operates at the full-response level. A fleet of cheap machines (CPUs, small GPUs) generate complete responses using small models. One powerful GPU reviews each output, approving, correcting, or rejecting. 60–80% pass unchanged — the GPU only sweats the hard 20–40%.
[Client Request]
|
v
[Tightwad Gate :8088]
|
| fan-out to CPU fleet
+----+----+----+
v v v v
[CPU] [CPU] [CPU] [CPU]
each generates full response with 8B model
+----+----+----+
|
v
[GPU Target: 70B]
Reviews each response:
✓ 60-80% approved (pass through)
✎ 15-25% corrected (light edit)
✗ 5-10% rejected (regenerate)
tightwad gate start: one command to run
Models are huge. Downloading 70B from HuggingFace takes hours. Pull from every machine that already has it. Chunked transfer with piece verification, like torrents, but for GGUF files. New machine joins the cluster? It downloads the model from all your existing machines in parallel.
[New Machine Joins Cluster]
|
| "I need Llama-3.3-70B-Q4_K_M.gguf"
v
+---------------------------+
| Tightwad Swarm Discovery |
| |
| Piece 1 <--- Machine A | (4070 Ti — has full model)
| Piece 2 <--- Machine B | (RTX 2070 — has full model)
| Piece 3 <--- Machine C | (M2 Metal — has pieces 1-6)
| Piece 4 <--- Machine A | (parallel, rarest-first)
| ... |
+---------------------------+
|
v
SHA256-verified • ready to serve in minutes, not hours
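Piece verification is the part that makes pulling from several machines safe. A minimal sketch, assuming a manifest of per-piece SHA-256 digests; the piece size, function names, and manifest shape are illustrative and not Tightwad's wire format.

```python
import hashlib

PIECE_SIZE = 4  # bytes; tiny for the demo, real pieces would be far larger


def split_pieces(data: bytes, piece_size: int = PIECE_SIZE) -> list[bytes]:
    """Cut a file's bytes into fixed-size pieces (the last one may be shorter)."""
    return [data[i:i + piece_size] for i in range(0, len(data), piece_size)]


def make_manifest(data: bytes) -> list[str]:
    """SHA-256 hex digest for each piece, in order."""
    return [hashlib.sha256(piece).hexdigest() for piece in split_pieces(data)]


def assemble(pieces: dict[int, bytes], manifest: list[str]) -> bytes:
    """Verify every piece against the manifest, then join them in order."""
    out = []
    for index, expected in enumerate(manifest):
        piece = pieces[index]
        if hashlib.sha256(piece).hexdigest() != expected:
            raise ValueError(f"piece {index} failed verification")
        out.append(piece)
    return b"".join(out)


model = b"pretend-gguf-bytes"
manifest = make_manifest(model)
# Pieces may arrive from different machines in any order.
received = dict(enumerate(split_pieces(model)))
assert assemble(received, manifest) == model
```

A piece that fails its hash is simply re-requested from another peer; nothing unverified ever gets joined into the final file.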
Small model blazes through 32 candidate tokens at ~30+ tok/s. Fast and cheap.
Big model evaluates all 32 tokens in a single forward pass. Batch is basically free.
Keep every token both models agree on. Take the big model's token at the first disagreement.
Accepted tokens stream to your app instantly. Repeat until done. Output quality is equivalent to the target model alone.
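The four steps above can be sketched as a toy loop. This is a minimal illustration under greedy decoding, with the draft and target stood in by plain functions that map a context to the next token ID. All names are hypothetical. The extra token emitted when every draft matches is the standard trick from the speculative decoding papers; a real target gets it from the same batched forward pass instead of a separate call.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # context -> greedy next token ID


def speculative_round(context: List[int], draft: NextToken,
                      target: NextToken, k: int) -> List[int]:
    """One draft/verify round; returns the tokens to stream."""
    # 1. Draft k candidate tokens with the small model.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))
    # 2-3. Compare each position with the target's greedy choice.
    #      (A real target scores all k positions in one forward pass.)
    accepted: List[int] = []
    for token in proposed:
        expected = target(context + accepted)
        if token != expected:
            accepted.append(expected)  # first disagreement: take the target's token
            return accepted
        accepted.append(token)
    # Every draft matched, so the target's next token comes along for free.
    accepted.append(target(context + accepted))
    return accepted


def generate(prompt: List[int], draft: NextToken, target: NextToken,
             k: int, max_new: int) -> List[int]:
    """4. Repeat rounds until max_new tokens have been produced."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_round(out, draft, target, k))
    return out[len(prompt):len(prompt) + max_new]


# Toy models: the draft agrees with the target except at every fifth position.
def toy_target(ctx: List[int]) -> int:
    return (sum(ctx) * 31 + len(ctx)) % 50


def toy_draft(ctx: List[int]) -> int:
    return toy_target(ctx) if len(ctx) % 5 else 0


print(generate([3, 1, 4], toy_draft, toy_target, k=4, max_new=20))
```

Every emitted token is either a draft token the target agreed with or the target's own pick, which is why greedy output matches running the target alone.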
Speculative decoding powers production inference at the biggest players. Tightwad just puts it on your hardware without the data center.
The foundational paper. Introduces the draft-verify loop and proves output equivalence under greedy decoding.
arxiv.org/abs/2211.17192 →
Independent parallel formulation, extends the technique to stochastic sampling with the rejection-sampling trick.
arxiv.org/abs/2302.01318 →
Plain-English retrospective from the original authors covering production deployment, what held up, and what didn’t.
research.google →
Tightwad is independent open-source software (MIT) with no affiliation, endorsement, or commercial relationship with Google, Google DeepMind, or the listed authors. Citations are nominative fair use of public academic publications.
Mixture-of-Experts models are the new normal — MiniMax M2.5 (229B), GPT-OSS 120B, Qwen3-MoE, DeepSeek-MoE. Everyone’s shipping them. Nobody supports them properly on home hardware. Until now.
Most MoE GGUFs ship fused: one giant tensor per layer covering every expert. llama.cpp can’t per-expert-split a fused tensor — so every MoE optimization tool on the planet silently degrades to whole-layer placement. Tightwad is the only one that rewrites fused weights into indexed form so you can actually pin experts.
BEFORE (fused) AFTER (indexed)
blk.0.ffn_gate_exps.weight --> blk.0.ffn_gate.0.weight
blk.0.ffn_gate.1.weight
blk.0.ffn_gate.2.weight
...
blk.0.ffn_gate.127.weight
Same bytes. Same quantization. One pass of disk I/O.
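The renaming in the diagram is mechanical, which a short sketch makes obvious. This only shows the name mapping (fused tensor name to per-expert names), not the byte-level rewrite that the defuse command performs. The helper name and regex are illustrative assumptions.

```python
import re

# Matches fused expert tensors such as "blk.0.ffn_gate_exps.weight".
FUSED = re.compile(r"^(blk\.\d+\.ffn_(?:gate|up|down))_exps\.weight$")


def indexed_names(fused_name: str, n_experts: int) -> list[str]:
    """Per-expert tensor names for one fused tensor, in expert order."""
    match = FUSED.match(fused_name)
    if match is None:
        raise ValueError(f"not a fused expert tensor: {fused_name}")
    prefix = match.group(1)
    return [f"{prefix}.{i}.weight" for i in range(n_experts)]


names = indexed_names("blk.0.ffn_gate_exps.weight", 128)
print(names[0], "...", names[-1])
```

Once each expert has its own tensor name, a placement tool can target it with a regex, which is what the next section relies on.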
tightwad moe defuse fused.gguf indexed.gguf
Once your model is indexed, Tightwad bin-packs entire experts onto individual GPUs proportional to VRAM. No more shipping half an expert across the network every forward pass. Emits llama.cpp --override-tensor flags automatically; you just set moe_placement: balanced and start.
models:
  gpt-oss-120b:
    path: /models/gpt-oss-120b-indexed.gguf
    moe_placement: balanced
Tightwad emits at startup:
--override-tensor "^blk\.0\.ffn_(gate|up|down)\.(0|1|...|67)\.weight$=CUDA0"
--override-tensor "^blk\.0\.ffn_(gate|up|down)\.(68|...|101)\.weight$=RPC[...]"
... (one per layer per device)
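Here is a rough sketch of how flags shaped like the ones above could be generated. It assumes the simplest possible policy, contiguous blocks of experts sized in proportion to VRAM, which is a simplification of whatever bin-packing Tightwad actually does. The device name `DEV1` and both function names are placeholders.

```python
def balanced_ranges(n_experts: int, vram_gb: dict[str, float]) -> dict[str, range]:
    """Give each device a contiguous block of experts proportional to its VRAM."""
    total = sum(vram_gb.values())
    ranges: dict[str, range] = {}
    start, cumulative = 0, 0.0
    for name, gb in vram_gb.items():
        cumulative += gb
        end = round(n_experts * cumulative / total)
        ranges[name] = range(start, end)
        start = end
    return ranges


def override_flag(layer: int, experts: range, device: str) -> str:
    """Build one --override-tensor flag for a (layer, device) pair."""
    ids = "|".join(str(i) for i in experts)
    return (f'--override-tensor "^blk\\.{layer}\\.ffn_(gate|up|down)'
            f'\\.({ids})\\.weight$={device}"')


# 102 experts split 2:1 between a local card and a placeholder second device.
plan = balanced_ranges(102, {"CUDA0": 24, "DEV1": 12})
for device, experts in plan.items():
    print(device, experts.start, experts.stop - 1)
print(override_flag(0, range(0, 3), "CUDA0"))
```

The key property is that each expert's gate, up, and down tensors land on the same device, so no expert is split across the wire.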
-ot regex per (layer, device), narrowest first
tightwad moe plan --emit-ot to preview before launch
defuse)
MoE routing isn’t uniform. A handful of experts fire 10× more than the tail. Capture a profile from your real traffic, and Tightwad pins the hot ones to your fastest GPU. Cold experts land on slow nodes. Same total VRAM, dramatically different throughput.
$ tightwad moe profile --follow-coord --duration 300
$ tightwad moe summary ~/.tightwad/profile.json
Top 20 hot experts
Layer Expert Hits
----- ------ ----
12 7 4,231 ◀ 10× more than average
12 88 3,109
0 19 2,874
...
$ # update moe_placement: profile-guided & restart
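The idea behind profile-guided placement fits in a few lines: rank experts by hit count and give the top ones to the fast card. A hypothetical sketch follows. The first three hit counts come from the summary above, the fourth entry is made up for the demo, and the function name is not a Tightwad API.

```python
def pin_hot_experts(hits: dict[tuple[int, int], int], fast_slots: int):
    """Split (layer, expert) pairs into hot (fast GPU) and cold (everything else)."""
    ranked = sorted(hits, key=hits.get, reverse=True)
    return ranked[:fast_slots], ranked[fast_slots:]


hits = {
    (12, 7): 4231,
    (12, 88): 3109,
    (0, 19): 2874,
    (3, 5): 410,  # illustrative tail expert, not from the profile above
}
hot, cold = pin_hot_experts(hits, fast_slots=2)
print("fast GPU:", hot)
print("slow nodes:", cold)
```

In practice `fast_slots` would be derived from how many experts fit in the fast card's VRAM.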
Pair expert-aware placement with speculative decoding and you get sparse weights + sparse compute. The draft model (small, same-family) predicts 32 tokens. The pooled MoE target verifies in one batch. Only the activated experts fire. Only the disagreed tokens get regenerated. It’s the whole stack of optimizations working together.
[Your App]
|
v
[Tightwad Proxy :8088]
|
+--> Draft: Qwen3-1.7B (local, ~100 tok/s)
|
+--> Target: MiniMax M2.5 (229B MoE, LM Studio)
|
| expert-aware placement keeps hot experts
| on Mac Studio M3 Ultra unified memory
|
v
Batch verify 32 tokens → activate 8 of 256 experts
tightwad moe bench streams live TTFT + acceptance table
Real hardware. Real numbers. No cherry-picking. Logprobs-based batch verification is live: these acceptance rates translate directly to wall-clock speedup.
| Mode | Tokens | Time | Speed |
|---|---|---|---|
| RPC pool direct (autoregressive) | 512 | 231s | 2.2 tok/s |
| RPC pool + speculation | 519 | 127s | 4.1 tok/s |
| ⚡ Speedup | Greedy-equivalent output · 33 tokens/round | | 1.86× |
The 70B model doesn't fit on any single machine. It's distributed across 4 GPUs (RTX 4070 Ti Super + RTX 3060 + RTX 2070 + M2 Metal) over WiFi. Without speculation: painfully slow. With speculation: usable.
| Mode | Speed | Notes |
|---|---|---|
| Desktop local only (4070+3060, 32B) | 17.0 tok/s | Best case: fits on one machine |
| 4-GPU RPC pool (autoregressive) | 3.0 tok/s | Each token = full RPC round-trip |
| RPC pool + speculation | 5.4 tok/s | 32 tokens verified per batch (greedy decoding, output equivalent to target alone) |
| ⚡ Pool speedup | 1.8× over pool-only (3.0 → 5.4 tok/s) | |
RPC pooling alone is slow over WiFi (one network round-trip per token). Speculation amortizes that: 32 tokens per round-trip instead of 1. Don't pool when the model fits locally (17 tok/s local vs 5.4 tok/s pooled).
| Prompt Type | Acceptance Rate | Rounds | Verdict |
|---|---|---|---|
| Reasoning | | 32 | Math is deterministic. Love it. |
| Code | | 34 | Syntax is law. Both models agree. |
| Factual | | 18 | Strong agreement on facts. |
| List | | 40 | Phrasing varies. Still worthwhile. |
| Creative | | 6 | Many valid outputs. Expected. |
| ⚡ Average | | 26 | 64% of tokens = free. |
| Prompt Type | Acceptance Rate | Notes |
|---|---|---|
| Reasoning | | Highest: deterministic math |
| ⚡ Average (normalized) | | Key result: 4 in 5 tokens local. |
With whitespace normalization, a consumer GPU running an 8B model drafts 4 out of every 5 tokens for a 397B model. That means up to 80% fewer output tokens billed to the cloud API for the same quality output. The bigger the gap between draft and target quality, the more you save.
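The billing claim is simple arithmetic, sketched below. It assumes, as the paragraph above does, that every accepted draft token is one the cloud API does not bill as output. The function name is made up for illustration.

```python
def billed_output_tokens(total_tokens: int, acceptance_rate: float) -> int:
    """Output tokens still billed by the API after local drafting.

    Assumption: each accepted draft token is one fewer billed output token.
    """
    return round(total_tokens * (1.0 - acceptance_rate))


for rate in (0.58, 0.64, 0.80):
    print(f"{rate:.0%} acceptance -> {billed_output_tokens(1000, rate)} of 1000 tokens billed")
```

This ignores input-token charges for verification calls, so treat it as an upper bound on savings.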
| Prompt | Baseline | Speculative | Speedup |
|---|---|---|---|
| Capital of France | 1.17s | 0.90s | 1.30x |
| Thermodynamics | 12.73s | 9.09s | 1.40x |
| Prime checker | 12.76s | 10.15s | 1.28x |
| Average speed | 13.24s | 10.95s | 1.21x |
| TCP vs UDP | 5.58s | 4.88s | 1.14x |
| Total | 45.43s | 35.96s | 1.27x |
Set max_draft_tokens: auto and Tightwad finds the sweet spot for you. Or pin it at 32 for manual control.
Slide to see your monthly cloud inference waste. Then stop doing that.
* Estimated savings assume the selected acceptance rate is sustained. Default uses 60% acceptance rate typical of local GPU setups. Rates vary by model pair and prompt type (58-64% local, up to 80% same-family API). Savings do not account for local electricity, hardware costs, or maintenance. Your results will vary.
Tightwad was built by and for people who think paying $200/month for inference is genuinely offensive.
You have a 2070 in the old desktop, a 4070 in the main rig, and a Quadro doing nothing in the server rack. They're all lazy freeloaders. Tightwad makes them a team.
You're still paying OpenAI/Anthropic for some tasks. Fine. But why let them do the easy parts? Draft locally, verify via API. 58% fewer API calls. Same answers.
You want 70B model quality on a $600 GPU budget. That's not a dream — that's just math. RPC mode distributes layers across whatever you've got collecting dust.
You bought AMD because it was on sale and NVIDIA for the CUDA ecosystem. Now they won't cooperate. Tightwad makes CUDA and ROCm work the same model together. Finally.
Tightwad works however your hardware is set up. Consumer GPU, no GPU, cloud API: there's a config for you.
Draft on any GPU you have, verify on a bigger one. Mix any generations, any vendors. RTX 4070 + GTX 770 + RX 7900 XTX, all in one cluster.
Even a GTX 1060 can draft for GPT-4. Any GPU you have (old, cheap, low VRAM) reduces your API bill. Up to 80% fewer output tokens billed to the cloud API.
Run a tiny draft model on any CPU. Verify on a remote GPU server. Your CPU-only server, your laptop, your NAS: all can contribute to the cluster.
Data centers often run at 10–30% average utilization (industry estimates). Idle CPUs, stranded servers, that old Xeon doing nothing: put them to work drafting tokens. No GPU required, ever.
Legacy GPU revival: that GTX 770 from 2013 can run Qwen3-1.7B as a drafter for a 70B target, turning e-waste into productive infrastructure.
| Config | Draft | Target | Use Case | Acceptance |
|---|---|---|---|---|
| GPU → GPU | Any GPU (old, new, NVIDIA, AMD) | Any bigger GPU | Homelab, mixed hardware | ~64% |
| GPU → API | Any GPU (even GTX 1060) | Cloud API | Slash API bills | ~80% |
| CPU → GPU | Any CPU, no GPU needed | GPU server | Zero-GPU participants | ~68% |
| CPU → API | Literally any computer | Cloud API | Data centers, enterprise | ~68% |
Four machines. One 70B model. Start with two, add machines anytime. The cluster grows.
# Machine A: Desktop (4070 Ti + 3060, 28GB)
$ rpc-server -p 50052
# Machine B: Old Gaming PC (RTX 2070, 8GB)
$ rpc-server -p 50052
# Machine C: MacBook Air M2 (Metal, 16GB)
$ rpc-server -p 50052
# Machine D: MacBook Air M4 (runs draft + proxy)
$ ollama run llama3.1:8b
# Confirm:
$ ollama ps
✓ llama3.1:8b running
$ python3 -m venv .venv && source .venv/bin/activate
$ pip install tightwad
configs/cluster.yaml
proxy:
  host: 0.0.0.0
  port: 8088
  max_draft_tokens: auto          # auto-tunes based on acceptance rate
  mode: combined                  # speculation over pooled GPUs
  draft:
    url: http://localhost:11434   # Machine D (M4, local draft)
    model_name: llama3.1:8b
    backend: ollama
coordinator:
  host: 0.0.0.0
  port: 8080
  model: Llama-3.3-70B-Q4_K_M.gguf
workers:
  - host: 192.168.1.10            # Machine A (4070 Ti + 3060)
    rpc_port: 50052
  - host: 192.168.1.20            # Machine B (RTX 2070)
    rpc_port: 50052
  - host: 192.168.1.30            # Machine C (M2 Metal)
    rpc_port: 50052
Find your IPs: ip addr on Linux, ipconfig on Windows, ifconfig on macOS. Add more workers anytime; the cluster grows.
$ tightwad proxy start
✓ Draft model healthy (llama3.1:8b @ localhost:11434) - Machine D
✓ Pool: 3 workers online (52GB VRAM total) - A + B + C
✓ Target: Llama-3.3-70B distributed across pool
✓ Proxy listening on http://localhost:8088
$ curl http://localhost:8088/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "What is 17 * 24?"}], "max_tokens": 100}'
# Check acceptance rate
$ tightwad proxy status
→ Acceptance rate: ~58% | Rounds: N | Tokens saved: N
In Open WebUI, ChatBot UI, or any OpenAI-compatible app, change the base URL from:
http://192.168.1.10:11434
→
http://192.168.1.10:8088
That's it. Four machines, one endpoint. Same app. Same model name. Same output quality. Machines A, B, and C pool a 70B model that fits on no single machine. Machine D drafts and proxies. You just see 4.1 tok/s instead of 2.2.
Same-family models (Llama 3.1 8B → Llama 3.3 70B) with greedy decoding:
| Metric | Result |
|---|---|
| ⚡ Output equivalence (greedy) | |
| 🚀 Speedup | |
| 💬 Tokens per round | |
| ⏱️ Speed (pool only) | |
| ⏱️ Speed (pool + speculation) | |
No Docker Compose files with 300 environment variables. No Kubernetes YAML. Just Python and one config file.
$ pip install tightwad
# or from source:
$ git clone https://github.com/youngharold/tightwad.git
$ cd tightwad && pip install .
proxy:
  host: 0.0.0.0
  port: 8088
  max_draft_tokens: auto              # auto-tunes based on acceptance rate
  fallback_on_draft_failure: true
  draft:
    url: http://192.168.1.50:11434    # Your cheap GPU (Ollama)
    model_name: qwen3:8b
    backend: ollama
  target:
    url: http://192.168.1.100:11434   # Your big GPU (Ollama)
    model_name: qwen3:32b
    backend: ollama
$ tightwad proxy start
✓ Draft model healthy
✓ Target model healthy
✓ Proxy listening on http://localhost:8088
# Test it (drop-in for any OpenAI SDK call)
$ curl http://localhost:8088/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 50}'
# Check acceptance rate stats
$ tightwad proxy status
→ Acceptance rate: 73.2% | Rounds: 34 | Tokens saved: 14,891
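The same request can be made from Python without any SDK. This sketch builds the POST with the standard library and mirrors the curl call above. The helper name is made up, and the actual send is left commented out because it needs a running proxy.

```python
import json
import urllib.request


def build_chat_request(base_url: str, prompt: str,
                       max_tokens: int = 50) -> urllib.request.Request:
    """Build a POST to the proxy's OpenAI-compatible chat endpoint."""
    body = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request("http://localhost:8088", "Hello")
print(request.get_method(), request.full_url)

# To send it (requires a running proxy on localhost:8088):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Switching between the proxy and a direct backend is just a different `base_url`, which is the whole point of the drop-in design.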
# Or use scripts/install-worker.sh
$ cmake -B build -DGGML_CUDA=ON -DGGML_RPC=ON
$ cmake --build build --config Release
$ build/bin/rpc-server -p 50052  # GPU 0
coordinator:
  host: 0.0.0.0
  port: 8080
  backend: hip  # or cuda
  gpus:
    - name: "7900 XTX #0"
      vram_gb: 24
workers:
  - host: 192.168.1.100  # NVIDIA box
    gpus:
      - name: "RTX 4070 Ti Super"
        vram_gb: 16
        rpc_port: 50052
models:
  llama-3.3-70b:
    path: /models/Llama-3.3-70B-Q4_K_M.gguf
    ctx_size: 8192
    flash_attn: true
    default: true
$ tightwad start
✓ Coordinator started
✓ Worker @ 192.168.1.100:50052 online
✓ Model llama-3.3-70b loaded across 52 GB VRAM
# Hot-swap to a different model anytime
$ tightwad swap deepseek-r1-70b
# Run the benchmark
$ tightwad benchmark
Built for terminal people who hate bloat as much as they hate cloud bills.
Drop-in replacement for any OpenAI SDK. Change one URL. That's it. No code changes required.
tightwad swap model-name: swap the model while workers keep running. Zero downtime.
Full streaming support on all endpoints. Tokens flow as they're accepted. No buffering.
tightwad start, tightwad proxy start, tightwad status. Simple commands for complex infrastructure.
One file describes your entire hardware topology. Version control it. Share it. Ship it.
tightwad bench: proxy vs direct target comparison. See your exact speedup, tok/s, and per-prompt breakdown.
Ollama for quick setup. llama.cpp for maximum performance. Switch per-model in the config.
Draft server down? fallback_on_draft_failure: true routes straight to target. Never breaks.
NVIDIA CUDA + AMD ROCm on the same model, same cluster, same endpoint. No compromises.
Auto-detects model architecture families. Warns you before a mismatched draft/target pair wastes hours at 1.6% acceptance.
Detects Llama 3, Mistral, Gemma, Phi, and more. No more hardcoded Qwen3 template breaking other model families.
max_draft_tokens: auto (adjusts at runtime based on acceptance rate). Zero-config optimization.
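For a feel of what runtime tuning could look like, here is a purely hypothetical heuristic: draft more when most drafts are accepted, draft less when most are rejected. The thresholds, bounds, and function name are all invented for illustration. The page does not describe how Tightwad's auto mode actually decides.

```python
def next_draft_length(current: int, acceptance_rate: float,
                      low: float = 0.4, high: float = 0.8,
                      minimum: int = 4, maximum: int = 64) -> int:
    """Pick the draft length for the next round (illustrative heuristic only)."""
    if acceptance_rate >= high:
        return min(current * 2, maximum)   # drafts are landing: speculate further
    if acceptance_rate <= low:
        return max(current // 2, minimum)  # drafts are wasted: back off
    return current                         # in the comfortable middle: hold steady


length = 32
for observed in (0.9, 0.6, 0.2, 0.2):
    length = next_draft_length(length, observed)
    print(f"acceptance {observed:.0%} -> draft {length} tokens")
```

Pinning the value, as the config comment suggests, simply skips this kind of feedback loop.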
Multiple drafters vote on tokens. When they agree, skip the target entirely. Three modes: strict, majority, any-disagree.
Cross-platform cluster management without SSH. REST API on every node for version checks, GPU info, and remote control.
Version enforcement, MoE VRAM warnings, SSRF protection, bearer token auth. Production-ready out of the box.
tightwad moe defuse rewrites fused expert tensors to indexed form so per-expert placement actually works. No quantization change, identical weights.
moe_placement: balanced pins whole experts to individual GPUs via llama.cpp --override-tensor flags. No more half-experts on the wire.
Capture per-expert hit counts, pin the hot ones to your fastest card. tightwad moe profile + moe_placement: profile-guided.
tightwad moe bench — streams live TTFT, rolling acceptance, speedup. Works with LM Studio, vLLM, any OpenAI-compatible target.
Fair question. Here's the honest answer. The other tools are good. Tightwad is for a different problem.
Comparison accurate as of March 2026. These tools evolve quickly; check their docs for the latest capabilities.
Excellent production inference engine. CUDA-only. Built for ML teams.
The reason most people have local models. One model, one machine, beautifully simple.
The low-level primitive Tightwad is built on. Powerful. Requires a lot of scripting.
Production inference for the HuggingFace ecosystem. Great if you're already there.