Benchmarks: 1.86× on a 70B model over WiFi
Real hardware. Real numbers. No cherry-picking. The headline: a 70B model that fits on no single machine goes from painfully slow to usable across four mismatched consumer GPUs — pooled over WiFi, made fast by speculative decoding. Everything below is measured on a homelab junk drawer, with the methodology and honest caveats laid out so you can reproduce it.
What the pool actually does
Under greedy decoding (temperature 0), speculative output is mathematically identical to running the big model alone — you only ever see the target's tokens. The draft model just makes them arrive faster.
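Why the output can't drift: a minimal sketch of greedy draft-then-verify, using toy next-token functions in place of real models. `toy_target`, `toy_draft`, and the draft length `k` are illustrative assumptions, not Tightwad's code. A real verifier checks the whole draft in one batched forward pass; this sketch checks position by position to keep the logic visible.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # greedy next-token function (toy stand-in for a model)


def greedy_decode(target: NextToken, prompt: List[int], n: int) -> List[int]:
    """Plain autoregressive decoding: the target picks every token."""
    out = list(prompt)
    for _ in range(n):
        out.append(target(out))
    return out[len(prompt):]


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], n: int, k: int = 4) -> List[int]:
    """Draft k tokens, then keep the target's token at every position."""
    out = list(prompt)
    goal = len(prompt) + n
    while len(out) < goal:
        # The draft model proposes k tokens from the current context.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # Verify: always emit the target's own token. Stop the round at the
        # first disagreement, because the rest of the draft is now off-context.
        for tok in proposal:
            expected = target(out)
            out.append(expected)
            if expected != tok or len(out) == goal:
                break
    return out[len(prompt):]


def toy_target(ctx: List[int]) -> int:
    return (31 * sum(ctx) + len(ctx)) % 50


def toy_draft(ctx: List[int]) -> int:
    # Agrees with the target except at every fifth position.
    tok = toy_target(ctx)
    return (tok + 1) % 50 if len(ctx) % 5 == 0 else tok


prompt = [1, 2, 3]
assert speculative_decode(toy_target, toy_draft, prompt, 40) == greedy_decode(toy_target, prompt, 40)
```

The draft only decides how many target tokens land per round. It never decides what they are.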
| Mode | Tokens | Time | Speed |
|---|---|---|---|
| RPC pool direct (autoregressive) | 512 | 231s | 2.2 tok/s |
| RPC pool + speculation | 519 | 127s | 4.1 tok/s |
| ⚡ Speedup | | | 1.86× (greedy-equivalent output · 33 tokens/round) |
The 70B model doesn't fit on any single machine here. It's distributed across 4 GPUs (RTX 4070 Ti Super + RTX 3060 + RTX 2070 + M2 Metal = 52GB VRAM) over WiFi. Without speculation: painfully slow. With speculation: usable. This is the fully validated wall-clock result.
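The arithmetic behind the table, as a quick check you can rerun on your own numbers. The inputs are the token counts and wall-clock times from the two rows above; the function name is just for illustration.

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """Throughput from a token count and a wall-clock time."""
    return tokens / seconds


baseline = tokens_per_second(512, 231)     # RPC pool direct
speculative = tokens_per_second(519, 127)  # RPC pool + speculation

rounded_ratio = round(speculative, 1) / round(baseline, 1)
raw_ratio = speculative / baseline

print(f"{baseline:.1f} tok/s -> {speculative:.1f} tok/s")
print(f"ratio of rounded speeds:   {rounded_ratio:.2f}x")
print(f"ratio of unrounded speeds: {raw_ratio:.2f}x")
```

The 1.86× figure is the ratio of the rounded speeds (4.1 / 2.2). Dividing the unrounded throughputs gives roughly 1.84×, so treat the last digit as rounding noise.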
| Mode | Speed | Notes |
|---|---|---|
| Desktop local only (4070+3060, 32B) | 17.0 tok/s | Best case — fits on one machine |
| 4-GPU RPC pool (autoregressive) | 3.0 tok/s | Each token = full RPC round-trip |
| RPC pool + speculation | 5.4 tok/s | 32 tokens verified per batch (greedy, output equivalent to target alone) |
| ⚡ Pool speedup | 1.8× over pool-only (3.0 → 5.4 tok/s) | |
RPC pooling alone is slow over WiFi — one network round-trip per token. Speculation amortizes that: 32 tokens per round-trip instead of 1. The honest caveat: don't pool when the model fits locally (17 tok/s local beats 5.4 tok/s pooled). Pooling is for models too big for one box.
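A back-of-the-envelope model of that amortization. Every latency below is a made-up round number for illustration, not a measurement from this pool, and the function names are hypothetical.

```python
def autoregressive_tok_s(rtt_s: float, compute_s: float) -> float:
    """One network round-trip plus one forward pass for every token."""
    return 1.0 / (rtt_s + compute_s)


def speculative_tok_s(rtt_s: float, verify_s: float, draft_s_per_token: float,
                      drafted: int, accepted: int) -> float:
    """One round-trip per batch: draft locally, verify once, keep `accepted` tokens."""
    round_s = drafted * draft_s_per_token + rtt_s + verify_s
    return accepted / round_s


# Illustrative inputs only (assumptions, not benchmark data).
slow = autoregressive_tok_s(rtt_s=0.2, compute_s=0.05)
fast = speculative_tok_s(rtt_s=0.2, verify_s=0.5, draft_s_per_token=0.01,
                         drafted=32, accepted=8)
lossy = speculative_tok_s(rtt_s=0.2, verify_s=0.5, draft_s_per_token=0.01,
                          drafted=32, accepted=1)

print(f"autoregressive: {slow:.2f} tok/s")
print(f"speculative, 8 accepted per round: {fast:.2f} tok/s")
print(f"speculative, 1 accepted per round: {lossy:.2f} tok/s")
```

The round-trip is paid once per batch, so the win scales with how many tokens survive verification. When only one token survives per round, the drafting and verification overhead makes speculation slower than going direct.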
| Prompt | Baseline | Speculative | Speedup |
|---|---|---|---|
| Capital of France | 1.17s | 0.90s | 1.30x |
| Thermodynamics | 12.73s | 9.09s | 1.40x |
| Prime checker | 12.76s | 10.15s | 1.28x |
| Average speed | 13.24s | 10.95s | 1.21x |
| TCP vs UDP | 5.58s | 4.88s | 1.14x |
| Total | 45.43s | 35.96s | 1.27x |
Cross-machine, both models local. 1.27× overall at max_draft_tokens: 32 (50 rounds, 31.7 tokens/round). Tuning matters: max_draft_tokens: 8 ran slower than baseline (0.63×) because of too many HTTP round-trips, and 64 added draft latency. Set max_draft_tokens: auto and Tightwad finds the sweet spot for you.
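If you'd rather tune by hand, a sweep is a few lines. This is a generic sketch, not how Tightwad's auto mode works internally. `measure_tok_s` is a hypothetical callback you would write yourself, for example by timing a fixed prompt set against your endpoint at each draft length.

```python
from typing import Callable, Dict, Iterable, Tuple


def sweep_draft_lengths(measure_tok_s: Callable[[int], float],
                        candidates: Iterable[int]) -> Tuple[int, Dict[int, float]]:
    """Measure throughput at each draft length and return the best one plus all results."""
    results = {k: measure_tok_s(k) for k in candidates}
    best = max(results, key=results.get)
    return best, results


# Stand-in measurements (made-up values) so the sketch runs without a server.
fake_measurements = {8: 1.0, 32: 2.0, 64: 1.5}
best, results = sweep_draft_lengths(fake_measurements.__getitem__, [8, 32, 64])
print(best, results)
```

Use the same prompts and the same temperature for every candidate, otherwise you're measuring the prompts and not the setting.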
Legacy text-match verifier — being re-validated
These per-task acceptance results were measured under the legacy text-match verifier that earlier versions used for acceptance comparison. They are being re-validated under the v0.5.1+ per-position verifier. Treat them as a directional picture of which task types agree, not as a wall-clock speedup figure. The validated wall-clock numbers are in the tables above.
| Prompt Type | Acceptance Rate | Rounds | Why |
|---|---|---|---|
| Reasoning | | 32 | Math is deterministic. Both models agree. |
| Code | | 34 | Structured syntax overlaps. |
| Factual | | 18 | Strong agreement on facts. |
| List | | 40 | Phrasing varies. |
| Creative | | 6 | Many valid outputs. Expected. |
| ⚡ Average | | 26 | Legacy text-match average. |
Cloud-API speculation is a wall-clock loss
Pointing the target at a cloud API is not a speedup and not a cost win. Per-round network latency (~3–8s per API call) makes speculation slower than baseline in every case we measured, and verification still bills the full drafted batch. Speculative decoding shines only when both models are local or very low-latency. These acceptance rates are published as a warning, not a recommendation.
| Draft | Target | Size Gap | Acceptance |
|---|---|---|---|
| Llama 3.1 8B | Llama 3.1 405B | 50× | 18.9% |
| Qwen3 1.7B | Qwen3.5 397B | 233× | 10.8% |
| Llama 3.1 8B | Llama 3.1 70B | 9× | 9.9% |
| Qwen3 1.7B | Qwen3 235B | 138× | 6.6% |
| Qwen3 8B | Llama 3.3 70B | cross-family | ~3% |
The much-quoted "tiny model drafts a 397B model" pairing (Qwen3-1.7B → Qwen3.5-397B, a 233× size gap) lands at 10.8% acceptance: a large size gap comes with low acceptance, and no pairing in the table clears 19%. Over the network it's still a wall-clock loss. The only honest cost claim for Tightwad is $0: it runs fully local.
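To see why those acceptance rates lose, here is the standard textbook estimate of tokens gained per round. It assumes each drafted token is accepted independently with the same probability and that the round stops at the first rejection, which may not match exactly how the rates above were measured. The function name is illustrative.

```python
def expected_tokens_per_round(acceptance: float, drafted: int) -> float:
    """Expected tokens emitted per verify round under an i.i.d. acceptance assumption.

    The target always contributes one token, plus one more for each leading
    draft token that it accepts.
    """
    if acceptance >= 1.0:
        return float(drafted + 1)
    return (1.0 - acceptance ** (drafted + 1)) / (1.0 - acceptance)


per_round = expected_tokens_per_round(0.108, 32)  # Qwen3 1.7B -> Qwen3.5 397B row
best_case_tok_s = per_round / 3.0                 # 3s is the low end of the ~3-8s per call

print(f"{per_round:.2f} tokens per round")
print(f"at most {best_case_tok_s:.2f} tok/s before counting draft time")
```

At 10.8% acceptance a round yields barely more than one token, so each multi-second API call buys almost nothing over calling the target directly.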
Methodology
Run it yourself. The benchmark scripts live in the benchmarks/ directory of the repo; results save as JSON.
| Machine | Hardware | Role |
|---|---|---|
| Desktop | RTX 4070 Ti Super (16GB) + RTX 3060 (12GB), NVIDIA / CUDA | Coordinator + largest pool share |
| Old gaming PC | RTX 2070 (8GB), NVIDIA / CUDA | RPC worker |
| Laptop | Apple M2, Metal | RPC worker + draft model |
| Pool total | CUDA + Metal mixed · ~52GB VRAM · standard home WiFi | |
No two GPUs match. No data-center interconnect — just WiFi. The whole point is that mismatched consumer hardware pools into one OpenAI-compatible endpoint, and speculative decoding makes the pooled target usable. Coordinator note: the machine loading the model needs enough system RAM for the full GGUF (~44GB for a 70B Q4_K_M), not just its GPU share.
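A rough fit check for this pool. The NVIDIA figures come from the hardware table; the M2 share is an assumption, inferred as the ~52GB total minus the three listed cards. It counts weights only and ignores KV cache and runtime overhead, so real headroom is tighter.

```python
from typing import Iterable

# VRAM per device in GB.
POOL_VRAM_GB = {
    "RTX 4070 Ti Super": 16,
    "RTX 3060": 12,
    "RTX 2070": 8,
    "Apple M2 (Metal)": 16,  # assumption: 52 - (16 + 12 + 8), not stated in the table
}
MODEL_GB = 44  # ~44GB GGUF for a 70B Q4_K_M, per the coordinator note


def fits(model_gb: float, vram_gb: Iterable[float]) -> bool:
    """Weights-only check: does the model fit in the combined memory?"""
    return sum(vram_gb) >= model_gb


def coordinator_has_enough_ram(system_ram_gb: float, model_gb: float = MODEL_GB) -> bool:
    """The coordinator loads the full GGUF, so it needs system RAM for all of it."""
    return system_ram_gb >= model_gb


largest_single = max(POOL_VRAM_GB.values())
desktop = POOL_VRAM_GB["RTX 4070 Ti Super"] + POOL_VRAM_GB["RTX 3060"]
pool = sum(POOL_VRAM_GB.values())

print("largest single GPU:", fits(MODEL_GB, [largest_single]))
print("desktop (both cards):", fits(MODEL_GB, [desktop]))
print("full pool:", fits(MODEL_GB, POOL_VRAM_GB.values()))
```

No single card and no single machine clears 44GB here. Only the full pool does, which is the case pooling exists for.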
The threshold rule
Speculation helps when the target is slow. Below ~8–10 tok/s baseline (a 70B pool, a dense model over RPC), the draft→verify amortization is a clear win. Above ~15 tok/s — when the model already fits locally and runs fast — speculation adds overhead without benefit. Run direct mode there.
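The rule as a function, if you want it in a script. The cutoffs are the approximate ones stated above; the source gives no guidance for the band in between, so this sketch (with a hypothetical function name) just tells you to measure both modes there.

```python
def recommend_mode(baseline_tok_s: float) -> str:
    """Map a measured baseline speed to a suggested mode using the threshold rule."""
    if baseline_tok_s < 8.0:
        return "speculative"   # slow target: draft-then-verify amortization is a clear win
    if baseline_tok_s > 15.0:
        return "direct"        # fast local target: speculation only adds overhead
    return "measure both"      # in-between band: no clear rule, so benchmark it


for speed in (2.2, 3.0, 12.0, 17.0):
    print(f"{speed:>5.1f} tok/s -> {recommend_mode(speed)}")
```

Feed it the baseline numbers from the tables above: the 2.2 and 3.0 tok/s pools come back as speculative, the 17.0 tok/s local setup comes back as direct.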