// the money shot

Benchmarks: 1.86× on a 70B model over WiFi

Real hardware. Real numbers. No cherry-picking. The headline: a 70B model that fits on no single machine goes from painfully slow to usable across four mismatched consumer GPUs — pooled over WiFi, made fast by speculative decoding. Everything below is measured on a homelab junk drawer, with the methodology and honest caveats laid out so you can reproduce it.

What the pool actually does

Under greedy decoding (temperature 0), speculative output is mathematically identical to running the big model alone — you only ever see the target's tokens. The draft model just makes them arrive faster.

Llama 3.3 70B · 4-GPU RPC pool (52GB VRAM over WiFi) · Llama 3.1 8B draft on M2 Metal · greedy (temp=0)
Mode | Tokens | Time | Speed
RPC pool direct (autoregressive) | 512 | 231s | 2.2 tok/s
RPC pool + speculation | 519 | 127s | 4.1 tok/s
⚡ Speedup | Greedy-equivalent output · 33 tokens/round | 1.86×

The 70B model doesn't fit on any single machine here. It's distributed across 4 GPUs (RTX 4070 Ti Super + RTX 3060 + RTX 2070 + M2 Metal = 52GB VRAM) over WiFi. Without speculation: painfully slow. With speculation: usable. This is the fully validated wall-clock result.

👑

The killer result

A 70B model across 4 consumer GPUs over WiFi — 2.2 → 4.1 tok/s. No single machine could run this model. Speculation is what makes the pool usable.

Greedy-equivalent output

Same-family drafting (Llama 3.1 8B → Llama 3.3 70B) under per-position verification: with greedy decoding the output is mathematically identical to running the 70B alone (the Leviathan greedy guarantee).
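The per-position check can be sketched as follows. This is a minimal illustration of greedy verification in general, not Tightwad's actual code: `verify_greedy` and its arguments are hypothetical names, and the sketch assumes the target returns its greedy pick for every draft position plus one extra.

```python
from typing import List, Sequence


def verify_greedy(draft_tokens: Sequence[int],
                  target_argmax: Sequence[int]) -> List[int]:
    """Return the tokens to emit for one speculation round.

    draft_tokens:  k tokens proposed by the draft model.
    target_argmax: k + 1 greedy picks from the target, where
                   target_argmax[i] is the target's choice given the
                   context plus draft_tokens[:i].
    """
    if len(target_argmax) != len(draft_tokens) + 1:
        raise ValueError("target must score every draft position plus one")

    emitted: List[int] = []
    for i, drafted in enumerate(draft_tokens):
        if drafted != target_argmax[i]:
            # First disagreement: emit the target's own token and stop.
            emitted.append(target_argmax[i])
            return emitted
        emitted.append(drafted)

    # Every draft token matched, so the target's extra pick comes for free.
    emitted.append(target_argmax[-1])
    return emitted
```

Every emitted token equals the target's greedy pick at that position, which is why the output cannot differ from running the target alone. The draft only decides how many tokens one round yields.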

⚠️

Family matters

Llama 3.2 3B → Llama 3.3 70B got only 1.6% acceptance despite sharing a tokenizer. Architecture match is critical — Llama 3.1 8B is the correct drafter.

Qwen3-32B · 4-GPU RPC pool · Qwen3-1.7B draft on M4 CPU · greedy (temp=0)
Mode | Speed | Notes
Desktop local only (4070+3060, 32B) | 17.0 tok/s | Best case: fits on one machine
4-GPU RPC pool (autoregressive) | 3.0 tok/s | Each token = full RPC round-trip
RPC pool + speculation | 5.4 tok/s | 32 tokens verified per batch (greedy, output equivalent to target alone)
⚡ Pool speedup | 1.8× | over pool-only (3.0 → 5.4 tok/s)

RPC pooling alone is slow over WiFi — one network round-trip per token. Speculation amortizes that: 32 tokens per round-trip instead of 1. The honest caveat: don't pool when the model fits locally (17 tok/s local beats 5.4 tok/s pooled). Pooling is for models too big for one box.
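The amortization can be worked backwards from the measured throughputs. The sketch below is back-of-envelope arithmetic, not a measurement: it assumes all 32 verified tokens in a round are kept, and `seconds_per_round` is a hypothetical helper name.

```python
def seconds_per_round(tokens_per_round: float, tok_per_s: float) -> float:
    """Implied wall-clock cost of one round-trip, from measured throughput."""
    return tokens_per_round / tok_per_s


# Measured throughputs from the Qwen3-32B pool table.
autoregressive = seconds_per_round(1, 3.0)   # one token per round-trip
# Assumption: every one of the 32 verified tokens in a round is kept.
speculative = seconds_per_round(32, 5.4)

cost_ratio = speculative / autoregressive    # how much pricier one round gets
speedup = 32 / cost_ratio                    # tokens gained vs. time spent

print(f"autoregressive round: {autoregressive:.2f}s")
print(f"speculative round:    {speculative:.2f}s")
print(f"round cost ratio:     {cost_ratio:.1f}x")
print(f"net speedup:          {speedup:.2f}x")
```

Under that assumption a speculative round costs roughly 18 times as much as a single-token round-trip (drafting plus batch verification) but delivers 32 tokens, which nets the 1.8× in the table.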

When to use combined mode

Only when the model doesn't fit on one machine. If it fits locally (17 tok/s), don't pool — just speculate with a remote drafter.

💡

Why it works

Pool autoregressive: 1 token per network round-trip = slow. Pool + speculation: 32 tokens per round-trip = 1.8× faster. The draft model amortizes network overhead.

Wall-clock · Qwen3-8B (RTX 2070) → Qwen3-32B (RTX 4070 Ti Super + RTX 3060) · llama-server · max_draft_tokens=32 · greedy (temp=0)
Prompt | Baseline | Speculative | Speedup
Capital of France | 1.17s | 0.90s | 1.30×
Thermodynamics | 12.73s | 9.09s | 1.40×
Prime checker | 12.76s | 10.15s | 1.28×
Average speed | 13.24s | 10.95s | 1.21×
TCP vs UDP | 5.58s | 4.88s | 1.14×
Total | 45.43s | 35.96s | 1.27×

Cross-machine, with both models on the local network. 1.27× overall at max_draft_tokens: 32 (50 rounds, 31.7 tokens/round). Tuning matters: at max_draft_tokens: 8 the run was slower than baseline (0.63×) because of the extra HTTP round-trips, and at 64 the added draft latency ate into the gain. Set max_draft_tokens: auto and Tightwad finds the sweet spot for you.
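If you would rather pin the value by hand, a sweep along these lines can find it. This is only a sketch: `measure_seconds` is a hypothetical callable you supply (for example, one that times a fixed prompt set against your endpoint at a given setting), and nothing here describes how Tightwad's auto setting works internally.

```python
from typing import Callable, Dict, Iterable, Tuple


def sweep_max_draft_tokens(
    measure_seconds: Callable[[int], float],
    candidates: Iterable[int],
    baseline_seconds: float,
) -> Tuple[int, Dict[int, float]]:
    """Time each candidate and return (best setting, speedup per setting).

    measure_seconds(k) is a hypothetical, user-supplied function that runs
    the same prompts with max_draft_tokens=k and returns total seconds.
    """
    speedups: Dict[int, float] = {}
    for k in candidates:
        speedups[k] = baseline_seconds / measure_seconds(k)
    if not speedups:
        raise ValueError("need at least one candidate")
    best = max(speedups, key=speedups.get)
    return best, speedups
```

A speedup below 1.0 in the returned mapping means that setting is slower than running without speculation, as with the 0.63× measured at 8.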

Real wall-clock time

1.27× measured end-to-end. Not theoretical — actual seconds off the clock per response, both models local on the LAN.

🎛️

Tune for your setup

Cross-machine HTTP overhead is the enemy. Set max_draft_tokens: auto to let Tightwad optimize round trips, or pin at 32 for manual control.

⚠️

Legacy text-match verifier — being re-validated

These per-task acceptance rates were measured under the legacy text-match verifier that earlier versions used for acceptance comparison. They are being re-validated under the v0.5.1+ per-position verifier. Treat them as a directional picture of which task types agree, not as a wall-clock speedup figure. The validated wall-clock numbers are in the wall-clock benchmark tables.

Qwen3-8B (RTX 2070) → Qwen3-32B (RTX 4070 Ti Super) · 130 prompts · text-match (legacy)
Prompt Type | Acceptance Rate | Rounds | Why
🧮 Reasoning | 89% | 32 | Math is deterministic. Both models agree.
💻 Code | 76% | 34 | Structured syntax overlaps.
📚 Factual | 73% | 18 | Strong agreement on facts.
📋 List | 42% | 40 | Phrasing varies.
🎨 Creative | 39% | 6 | Many valid outputs. Expected.
⚡ Average | 63.8% | 26 | Legacy text-match average.
⚠️

Why the caveat

The legacy fast path accepted draft tokens loosely, which made acceptance numbers tautological. v0.5.1+ does real per-position argmax comparison. These per-task rates are being re-measured under the corrected verifier.

💡

What it tells you

The shape is the takeaway: deterministic tasks (math, code, facts) draft well; open-ended tasks (lists, creative) diverge. That pattern holds — the exact percentages are under review.

🛑

Cloud-API speculation is a wall-clock loss

Pointing the target at a cloud API is not a speedup and not a cost win. Per-round network latency (~3–8s per API call) makes speculation slower than baseline in every case we measured, and verification still bills the full drafted batch. Speculative decoding shines only when both models are local or very low-latency. These acceptance rates are published as a warning, not a recommendation.

Cloud API acceptance · OpenRouter targets · same-family unless noted
Draft | Target | Size Gap | Acceptance
Llama 3.1 8B | Llama 3.1 405B | 50× | 18.9%
Qwen3 1.7B | Qwen3.5 397B | 233× | 10.8%
Llama 3.1 8B | Llama 3.1 70B | | 9.9%
Qwen3 1.7B | Qwen3 235B | 138× | 6.6%
Qwen3 8B | Llama 3.3 70B | cross-family | ~3%

The much-quoted "tiny model drafts a 397B model" pairing (Qwen3-1.7B → Qwen3.5-397B, a 233× size gap) lands at 10.8% acceptance. No pairing in the table clears 19%, and over the network even the best of them is a wall-clock loss. The only honest cost claim for Tightwad is $0: it runs fully local.

🌐

Latency kills it

Even at decent acceptance, ~3–8s per round-trip to a cloud target erases any gain. Keep both models local.
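A rough break-even estimate shows how lopsided this is. The sketch below is illustrative arithmetic under stated assumptions, not a measurement: it borrows 32 drafted tokens per round from the local benchmarks, ignores draft time entirely (which flatters speculation), and uses the hypothetical helper name `break_even_baseline_tok_s`.

```python
def break_even_baseline_tok_s(acceptance: float,
                              drafted_per_round: int,
                              round_latency_s: float) -> float:
    """Fastest baseline (tok/s) that speculation could still match.

    Each round keeps about acceptance * drafted_per_round tokens and costs
    at least round_latency_s of network time. If the target already streams
    faster than the returned rate, speculation is a loss.
    """
    kept_per_round = acceptance * drafted_per_round
    return kept_per_round / round_latency_s


# Best acceptance in the table (18.9%), 32 drafted tokens per round (assumed).
best_case = break_even_baseline_tok_s(0.189, 32, 3.0)
worst_case = break_even_baseline_tok_s(0.189, 32, 8.0)
print(f"{best_case:.2f} tok/s at 3s, {worst_case:.2f} tok/s at 8s")
# prints "2.02 tok/s at 3s, 0.76 tok/s at 8s"
```

Even the best pairing only breaks even against a target that streams at about 2 tok/s or slower, and that is before counting the time spent drafting.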

🔍

Same-family or bust

Cross-family drafting (Qwen3 8B → Llama 3.3 70B) drops to ~3%: different training data, different phrasing. Match the architecture.

Methodology

Run it yourself. The benchmark scripts live in the benchmarks/ directory of the repo; results save as JSON.

The pool — 4 mismatched GPUs, ~52GB VRAM, over WiFi
Machine | Hardware | Role
Desktop | RTX 4070 Ti Super (16GB) + RTX 3060 (12GB), NVIDIA / CUDA | Coordinator + largest pool share
Old gaming PC | RTX 2070 (8GB), NVIDIA / CUDA | RPC worker
Laptop | Apple M2, Metal | RPC worker + draft model
Pool total | CUDA + Metal mixed · ~52GB VRAM · standard home WiFi

No two GPUs match. No data-center interconnect — just WiFi. The whole point is that mismatched consumer hardware pools into one OpenAI-compatible endpoint, and speculative decoding makes the pooled target usable. Coordinator note: the machine loading the model needs enough system RAM for the full GGUF (~44GB for a 70B Q4_K_M), not just its GPU share.

📐

Greedy decoding (temp=0)

All benchmarks run at temperature 0 for reproducibility. Under greedy decoding, speculative output is mathematically identical to running the target alone — speculation never changes what you read, only how fast it arrives.

🎯

Acceptance vs wall-clock

Acceptance rate = fraction of drafted tokens the target keeps. Wall-clock speedup = actual seconds saved end-to-end. High acceptance only converts to wall-clock when the round-trip is cheap — which is why local pools win and cloud APIs don't.
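The two metrics can be written down directly. A minimal sketch of the definitions above; the function names are hypothetical and are not taken from the benchmark scripts.

```python
def acceptance_rate(accepted_tokens: int, drafted_tokens: int) -> float:
    """Fraction of drafted tokens the target keeps."""
    if drafted_tokens <= 0:
        raise ValueError("drafted_tokens must be positive")
    return accepted_tokens / drafted_tokens


def wall_clock_speedup(baseline_seconds: float,
                       speculative_seconds: float) -> float:
    """End-to-end speedup: values above 1.0 mean speculation was faster."""
    if speculative_seconds <= 0:
        raise ValueError("speculative_seconds must be positive")
    return baseline_seconds / speculative_seconds


# "Thermodynamics" row from the Qwen3-8B → Qwen3-32B wall-clock table.
print(f"{wall_clock_speedup(12.73, 9.09):.2f}x")  # prints "1.40x"
```

Acceptance is computed from token counts alone and never sees the clock, so it cannot tell you whether a round was worth its latency. Only the second number answers that.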

🔁

Re-validation in progress

v0.5.1 replaced a loose same-family fast path with real per-position argmax verification. Legacy text-match acceptance tables are being re-measured under it; the wall-clock 1.86× on 70B-pooled is unaffected and remains real.

💡

The threshold rule

Speculation helps when the target is slow. Below ~8–10 tok/s baseline (a 70B pool, a dense model over RPC), the draft→verify amortization is a clear win. Above ~15 tok/s — when the model already fits locally and runs fast — speculation adds overhead without benefit. Run direct mode there.
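The rule of thumb reduces to a small decision helper. This is a sketch under assumptions: `recommend_mode` is a hypothetical name, the cutoffs take the conservative end of the approximate ranges above, and the "benchmark both" answer for the gap between them is an added assumption, since the rule does not say what happens there.

```python
def recommend_mode(baseline_tok_s: float,
                   clear_win_below: float = 8.0,
                   direct_above: float = 15.0) -> str:
    """Map a measured baseline speed to a suggested mode.

    The defaults mirror the approximate thresholds in the rule above;
    both are rough guides, so tune them against your own hardware.
    """
    if baseline_tok_s < clear_win_below:
        return "speculate"
    if baseline_tok_s > direct_above:
        return "direct"
    # The rule gives no guidance for the range in between (assumption).
    return "benchmark both"


print(recommend_mode(2.2))   # 70B pool baseline: prints "speculate"
print(recommend_mode(17.0))  # 32B fits locally: prints "direct"
```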