Six ways to run Tightwad
One install, six inference strategies for a junk drawer of mismatched compute. Pool GPUs that fit nothing on their own. Draft cheap and verify smart. Race drafters for consensus. Gate full responses. Sling 40 GB models peer-to-peer. Stack them, or run just the one you need — they all hang off the same OpenAI-compatible endpoint.
Pick your poison. Stack them. Run all six.
Speculative decoding, GPU pooling, consensus drafting, quality gating, and P2P model transfer — every mode is a `pip install tightwad` away. Tightwad doesn't judge what hardware you bring; it just puts it to work behind one endpoint.
Combined Mode — Speculation Over a Pool
When a model doesn't fit on any single machine, pool your GPUs AND speculate on top. RPC pooling alone is slow (one network round-trip per token). Speculation amortizes that — 32 tokens verified per round-trip instead of 1. Result: models that fit nowhere become usable.
[Junk Hardware β P400 2GB, GTX 770, laptop CPU]
| runs 1.7B draft, ~30 tok/s
| sends token IDs (bytes)
v
[Tightwad Proxy :8088]
| sends draft to pool for BATCH verify
v
[RPC GPU Pool β 4 GPUs, 52GB total, WiFi]
| verifies 32 tokens in ONE forward pass
v
4.1 tok/s instead of 2.2 tok/s β 70B fits nowhere else
- ✓ 1.86× measured speedup on Llama 3.3 70B (4 GPUs over WiFi)*
- ✓ Output mathematically identical to target alone under greedy decoding (Leviathan guarantee)
- ✓ Any junk hardware can be the drafter — 2GB GPU, CPU, laptop
- ✓ Pool CUDA + ROCm + Metal GPUs, speculate on top
* Measured: Llama 3.1 8B draft → Llama 3.3 70B target across RTX 4070 Ti Super + RTX 3060 + RTX 2070 + M2 Metal = 52GB VRAM over WiFi. 519 tokens in 127s vs 512 tokens in 231s direct. Your results will vary with hardware and network conditions. Full deep dive →
Speculative Decoding Proxy
Your cheap GPU isn't slow — it's a draft engine. A fast small model guesses tokens. A big model batch-verifies them. Same output quality as running the big model alone. Ships token IDs (bytes), not 100–300 MB of tensor data over the wire.
[Your App / OpenAI SDK]
|
v
+--------------------------+
| Tightwad Proxy :8088 |
| |
| 1. Draft 32 tokens -----+--> Qwen3-8B
| (~100 tok/s, cheap) | RTX 2070 (the dusty one)
| |
| 2. Verify batch --------+--> Qwen3-32B
| (one forward pass) | 4070Ti (same machine or LAN)
| |
| 3. Accept/reject <------+
| 4. Stream to client |
+--------------------------+
Output quality = equivalent to 32B alone ✓
- ✓ Output quality equivalent to target model alone*
- ✓ Best on local / low-latency targets — the win is wall-clock, not the bill
- ✓ Supports Ollama + llama.cpp backends
- ✓ SSE streaming, full OpenAI compatibility
* Identical to the target under greedy decoding; statistically equivalent under sampling. Speculation shines when both models are local or very low-latency — over a remote cloud API, per-round network latency makes it slower than baseline.
Multi-Drafter Consensus approximate mode
Race multiple cheap machines simultaneously. Each drafter generates candidate tokens in parallel. When they all agree, the expensive GPU verification is skipped entirely — that’s the speed win. Tradeoff: this is an approximate mode, not exact speculative decoding. Skipping the target means consensus-accepted tokens may differ from what the target alone would produce. Three sub-modes: strict, majority, any_disagree — pick your acceptable risk. Off by default.
[Tightwad Proxy :8088]
|
| races all drafters in parallel
|
+----+----+----+
v v v v
[M2] [CPU] [2070] [P400]
8B 8B 8B 1.7B
| | | |
+----+----+----+-+
|
v
Consensus? ββyesββ> Stream tokens (GPU never touched)
|
no
|
v
[Target 70B GPU] ββ> Verify only disagreed tokens
- ✓ Race unlimited drafters — CPUs, old GPUs, laptops, anything
- ✓ Unanimous tokens skip the target GPU entirely
- ✓ Three modes:
strict,majority,any_disagree - ✓ Tree-based speculation for branching draft paths
- ✓ Prometheus metrics for consensus accept/fallback rates
RPC Cluster Mode
Got GPUs scattered across machines? Pool them. CUDA on one box, ROCm on another — Tightwad doesn't care. It distributes model layers across all of them and hands you one clean OpenAI-compatible endpoint.
[OpenAI Client]
|
v
+-------------------+
| Tightwad | <-- One endpoint to rule them all
| Coordinator :8080|
+--------+----------+
| distributes layers
+----+----+
v v
+--------+ +--------+
| Worker | | Worker |
| NVIDIA | | AMD |
| 4070Ti | | 7900XTX|
| 16 GB | | 24 GB |
+--------+ +--------+
70B model: covered ✓
- ✓ Mix NVIDIA + AMD GPUs freely
- ✓ Run 70B+ models on consumer hardware
- ✓ Hot-swap models without restarting workers
- ✓ Built-in benchmarking CLI
Pooling pays off only when the model fits nowhere on its own — pair it with speculation (Mode 01) to claw back the per-token network cost. See the deep dive →
Quality Gate — CPU Fleet Drafts, GPU Reviews
Different from token-level speculation — this operates at the full-response level. A fleet of cheap machines (CPUs, small GPUs) generate complete responses using small models. One powerful GPU reviews each output, approving, correcting, or rejecting. 60–80% pass unchanged — so the GPU only sweats the hard remainder.
[Client Request]
|
v
[Tightwad Gate :8088]
|
| fan-out to CPU fleet
+----+----+----+
v v v v
[CPU] [CPU] [CPU] [CPU]
each generates full response with 8B model
+----+----+----+
|
v
[GPU Target β 70B]
Reviews each response:
✓ approve (pass through)
✎ correct (light edit)
✗ reject (regenerate)
60-80% pass unchanged
- ✓ GPU only processes the hard responses the fleet got wrong
- ✓ Any CPU or cheap GPU can be a drafter
- ✓ Full-response verification, not token-by-token
- ✓ Automatic approve/correct/reject pipeline
- ✓
tightwad gate start— one command to run
Swarm Transfer — P2P Model Distribution
Models are huge. Pulling a 40 GB GGUF from HuggingFace to every worker takes hours and wastes bandwidth. Pull from every machine that already has it. Chunked 64 MB transfer with SHA256 piece verification — like torrents, but for GGUF files. New machine joins the cluster? It downloads from all your existing machines in parallel.
[New Machine Joins Cluster]
|
| "I need Llama-3.3-70B-Q4_K_M.gguf"
v
+---------------------------+
| Tightwad Swarm Discovery |
| |
| Piece 1 <--- Machine A | (4070 Ti — has full model)
| Piece 2 <--- Machine B | (RTX 2070 — has full model)
| Piece 3 <--- Machine C | (M2 Metal — has pieces 1-6)
| Piece 4 <--- Machine A | (parallel, rarest-first)
| ... |
+---------------------------+
|
v
SHA256-verified • ready to serve in minutes, not hours
- ✓ Multi-source parallel download — pull from every peer simultaneously
- ✓ SHA256 piece verification — every 64 MB chunk validated before use
- ✓ Rarest-first selection — keeps the model available across the cluster
- ✓ Delta updates — new quantization? Only transfer the changed pieces
- ✓ Zero central server — resume on interrupt, no single bottleneck
Stack them, or pick the one that fits
The honest short version: if a model fits on one box, just speculate (Mode 02). If it doesn't, pool and speculate (Mode 01). Everything else is a knob for a specific junk-drawer shape.
- ✓ Model fits on one machine → Speculative Decoding Proxy (02). Cheap local draft, exact output.
- ✓ Model fits nowhere → Combined Mode (01). Pool the GPUs, then speculate to make the pool usable.
- ✓ A pile of idle CPUs / old GPUs → Multi-Drafter Consensus (03, approximate) or Quality Gate (05, full-response).
- ✓ One big GPU rig, just need the endpoint → RPC Cluster (04).
- ✓ Getting a 40 GB model onto every worker → Swarm Transfer (06). Skip the per-worker HuggingFace pull.