// the money shot

Benchmarks: 1.86× on a 70B model over WiFi

Real hardware. Real numbers. No cherry-picking. The headline: a 70B model that fits on no single machine goes from 2.2 to 4.1 tok/s across four mismatched consumer GPUs, pooled over WiFi and made fast by speculative decoding. Everything below was measured on a homelab junk drawer, with the methodology and honest caveats laid out so you can reproduce it.


// measured results

What the pool actually does

Under greedy decoding (temperature 0), speculative output is mathematically identical to running the big model alone — you only ever see the target's tokens. The draft model just makes them arrive faster.

Llama 3.3 70B · 4-GPU RPC pool (52GB VRAM over WiFi) · Llama 3.1 8B draft on M2 Metal · greedy (temp=0)

Mode	Tokens	Time	Speed
RPC pool direct (autoregressive)	512	231s	2.2 tok/s
RPC pool + speculation	519	127s	4.1 tok/s
⚡ Speedup	Greedy-equivalent output · 33 tokens/round		1.86×

The 70B model doesn't fit on any single machine here. It is split across 4 GPUs (RTX 4070 Ti Super + RTX 3060 + RTX 2070 + M2 Metal = 52GB VRAM) over WiFi. Without speculation the pool manages 2.2 tok/s, which is painfully slow. With speculation it reaches 4.1 tok/s, which is usable. This is the fully validated wall-clock result.
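The Speed column and the headline ratio follow directly from the Tokens and Time columns. A quick check in Python, using only the figures from the table above (the function name is ours, for illustration):

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """Throughput as shown in the Speed column, rounded to one decimal."""
    return round(tokens / seconds, 1)


# Figures copied from the Llama 3.3 70B table.
direct = tokens_per_second(512, 231)       # RPC pool direct (autoregressive)
speculative = tokens_per_second(519, 127)  # RPC pool + speculation

# The headline speedup is the ratio of the two reported rates.
speedup = round(speculative / direct, 2)

print(f"{direct} tok/s -> {speculative} tok/s = {speedup}x")
```

Swap in your own token counts and timings to compare your pool against these numbers.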

👑

The killer result

A 70B model across 4 consumer GPUs over WiFi — 2.2 → 4.1 tok/s. No single machine could run this model. Speculation is what makes the pool usable.

✅

Greedy-equivalent output

Same-family drafting (Llama 3.1 8B → Llama 3.3 70B) under per-position verification: with greedy decoding the output is mathematically identical to running the 70B alone (the Leviathan greedy guarantee).
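To see why per-position verification preserves the output under greedy decoding, here is a minimal sketch with toy stand-in models. It is an illustration of the general technique, not Tightwad's code: `toy_target`, `toy_draft`, and both function names are hypothetical, and a real engine would get all verification tokens from one batched forward pass instead of calling the target once per position.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # context -> greedy next token


def greedy(target: NextToken, prompt: List[int], n: int) -> List[int]:
    """Plain autoregressive decoding with the target alone."""
    out = list(prompt)
    for _ in range(n):
        out.append(target(out))
    return out[len(prompt):]


def speculative_greedy(target: NextToken, draft: NextToken,
                       prompt: List[int], n: int, k: int = 4) -> List[int]:
    """Draft k tokens, keep the prefix the target agrees with, repeat."""
    out = list(prompt)
    while len(out) - len(prompt) < n:
        # The small model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # Per-position check against the target's own greedy choice.
        for i in range(k):
            expected = target(out + proposal[:i])
            if proposal[i] != expected:
                # First disagreement: keep the matching prefix, then the
                # target's token, and discard the rest of the draft.
                out.extend(proposal[:i])
                out.append(expected)
                break
        else:
            # Every draft token matched, so the target adds one more.
            out.extend(proposal)
            out.append(target(out))
    return out[len(prompt):len(prompt) + n]


def toy_target(ctx: List[int]) -> int:
    """Deterministic stand-in for the big model."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def toy_draft(ctx: List[int]) -> int:
    """Agrees with the target except at every fifth position."""
    guess = toy_target(ctx)
    return guess if len(ctx) % 5 else (guess + 1) % 50


prompt = [1, 2, 3]
assert speculative_greedy(toy_target, toy_draft, prompt, 40) == greedy(toy_target, prompt, 40)
```

Every token that reaches the output is either a draft token the target would have picked anyway or the target's own token, so a bad drafter can only cost speed, never change the text.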

⚠️

Family matters

Llama 3.2 3B → Llama 3.3 70B got only 1.6% acceptance despite sharing a tokenizer. Architecture match is critical — Llama 3.1 8B is the correct drafter.

Qwen3-32B · 4-GPU RPC pool · Qwen3-1.7B draft on M4 CPU · greedy (temp=0)

Mode	Speed	Notes
Desktop local only (4070+3060, 32B)	17.0 tok/s	Best case — fits on one machine
4-GPU RPC pool (autoregressive)	3.0 tok/s	Each token = full RPC round-trip
RPC pool + speculation	5.4 tok/s	32 tokens verified per batch (greedy, output equivalent to target alone)
⚡ Pool speedup	1.8× over pool-only (3.0 → 5.4 tok/s)

RPC pooling alone is slow over WiFi — one network round-trip per token. Speculation amortizes that: 32 tokens per round-trip instead of 1. The honest caveat: don't pool when the model fits locally (17 tok/s local beats 5.4 tok/s pooled). Pooling is for models too big for one box.
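The amortization argument can be sketched as a back-of-the-envelope cost model. Everything here is an assumption for illustration: the function, its parameters, and the timing values for verification and drafting are hypothetical, not measured. Only the 3.0 tok/s baseline comes from the table above.

```python
def speculative_speedup(step_s: float, verify_s: float,
                        draft_s_per_token: float, draft_len: int,
                        tokens_per_round: float) -> float:
    """Estimated speedup of speculation over one-token-at-a-time decoding.

    step_s:            seconds per autoregressive target token (incl. network)
    verify_s:          seconds for one batched verification round-trip
    draft_s_per_token: seconds for the draft model to propose one token
    draft_len:         tokens drafted per round
    tokens_per_round:  tokens actually emitted per round
    """
    baseline_s = tokens_per_round * step_s                 # same tokens, one trip each
    round_s = draft_len * draft_s_per_token + verify_s     # one trip for the batch
    return baseline_s / round_s


step_s = 1 / 3.0  # pool-only baseline from the table: 3.0 tok/s

# Hypothetical round costs: 2 s to verify a batch, 50 ms per drafted token.
for emitted in (32, 8, 4):
    estimate = speculative_speedup(step_s, verify_s=2.0,
                                   draft_s_per_token=0.05,
                                   draft_len=32, tokens_per_round=emitted)
    print(f"{emitted:>2} tokens emitted per round -> {estimate:.2f}x")
```

The model makes the trade-off visible: the round cost is paid whether or not the draft is accepted, so the win depends on how many tokens each round actually emits.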

Wall-clock · Qwen3-8B (RTX 2070) → Qwen3-32B (RTX 4070 Ti Super + RTX 3060) · llama-server · max_draft_tokens=32 · greedy (temp=0)

Prompt	Baseline	Speculative	Speedup
Capital of France	1.17s	0.90s	1.30×
Thermodynamics	12.73s	9.09s	1.40×
Prime checker	12.76s	10.15s	1.28×
Average speed	13.24s	10.95s	1.21×
TCP vs UDP	5.58s	4.88s	1.14×
Total	45.43s	35.96s	1.27×

Cross-machine, both models local. 1.27× overall at max_draft_tokens: 32 (50 rounds, 31.7 tokens/round). Tuning matters: max_draft_tokens: 8 ran slower than baseline (0.63×) because of too many HTTP round-trips, while 64 added draft latency. Set max_draft_tokens: auto and Tightwad finds the sweet spot for you.
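If you would rather tune by hand, a sweep is easy to script. The sketch below is hypothetical and says nothing about how the auto setting works internally: `sweep_draft_length` is our own helper, and `measure_speedup` stands in for whatever benchmark run you use. The demo data is limited to the two speedups published above.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple


def sweep_draft_length(
    candidates: Iterable[int],
    measure_speedup: Callable[[int], float],
) -> Tuple[Optional[int], Dict[int, float]]:
    """Measure each draft length and return the best one.

    Returns None as the best length when nothing beats the baseline (1.0),
    in which case direct mode is the better choice.
    """
    results = {k: measure_speedup(k) for k in candidates}
    best = max(results, key=lambda k: results[k])
    if results[best] <= 1.0:
        return None, results
    return best, results


# Stand-in for real benchmark runs: the speedups reported for 8 and 32.
published = {8: 0.63, 32: 1.27}
best, results = sweep_draft_length(published, published.__getitem__)
print(best, results)
```

Run each candidate on the same prompts at temperature 0 so the comparison stays apples to apples.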

⚠️

Legacy text-match verifier — being re-validated

These per-task acceptance rates were measured under the legacy text-match verifier that earlier versions used for acceptance comparison. They are being re-validated under the v0.5.1+ per-position verifier. Treat them as a directional picture of which task types agree, not as a wall-clock speedup figure — the validated wall-clock numbers live on the other tabs.

Qwen3-8B (RTX 2070) → Qwen3-32B (RTX 4070 Ti Super) · 130 prompts · text-match (legacy)

Prompt Type	Acceptance Rate	Rounds	Why
🧮 Reasoning	89%	32	Math is deterministic. Both models agree.
💻 Code	76%	34	Structured syntax overlaps.
📚 Factual	73%	18	Strong agreement on facts.
📋 List	42%	40	Phrasing varies.
🎨 Creative	39%	6	Many valid outputs. Expected.
⚡ Average	63.8%	26	Unweighted mean of the five task types (legacy text-match).
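The 63.8% figure is the plain mean of the five per-type rates. As a cross-check, the snippet below recomputes it from the table and also shows a mean weighted by the Rounds column. The weighted version is our own assumption about how one might aggregate, not a number the benchmark reports.

```python
# (acceptance %, rounds) copied from the legacy text-match table.
rows = {
    "Reasoning": (89, 32),
    "Code": (76, 34),
    "Factual": (73, 18),
    "List": (42, 40),
    "Creative": (39, 6),
}

# Plain mean across task types, as reported in the Average row.
unweighted = sum(rate for rate, _ in rows.values()) / len(rows)

# Alternative: weight each task type by its Rounds value.
total_rounds = sum(rounds for _, rounds in rows.values())
weighted = sum(rate * rounds for rate, rounds in rows.values()) / total_rounds

print(f"unweighted: {unweighted:.1f}%")   # unweighted: 63.8%
print(f"total rounds: {total_rounds}")    # total rounds: 130
print(f"weighted: {weighted:.1f}%")       # weighted: 66.6%
```

Either way the ranking of task types is unchanged, which is the part of this table worth keeping while the exact percentages are under review.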

⚠️

Why the caveat

The legacy fast path accepted draft tokens loosely, which made acceptance numbers tautological. v0.5.1+ does real per-position argmax comparison. These per-task rates are being re-measured under the corrected verifier.

💡

What it tells you

The shape is the takeaway: deterministic tasks (math, code, facts) draft well; open-ended tasks (lists, creative) diverge. That pattern holds — the exact percentages are under review.

🛑

Cloud-API speculation is a wall-clock loss

Pointing the target at a cloud API is not a speedup and not a cost win. Per-round network latency (~3–8s per API call) makes speculation slower than baseline in every case we measured, and verification still bills the full drafted batch. Speculative decoding shines only when both models are local or very low-latency. These acceptance rates are published as a warning, not a recommendation.

Cloud API acceptance · OpenRouter targets · same-family unless noted

Draft	Target	Size Gap	Acceptance
Llama 3.1 8B	Llama 3.1 405B	50×	18.9%
Qwen3 1.7B	Qwen3.5 397B	233×	10.8%
Llama 3.1 8B	Llama 3.1 70B	9×	9.9%
Qwen3 1.7B	Qwen3 235B	138×	6.6%
Qwen3 8B	Llama 3.3 70B	cross-family	~3%

The much-quoted "tiny model drafts a 397B model" pairing (Qwen3-1.7B → Qwen3.5-397B, a 233× size gap) lands at 10.8% acceptance — a high model-size ratio yields low acceptance, and over the network it's still a wall-clock loss. The only honest cost claim for Tightwad is $0: it runs fully local.
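To put these acceptance rates in perspective, a common simplification from the speculative decoding literature treats each draft token as accepted independently with probability alpha, which gives an expected (1 - alpha^(k+1)) / (1 - alpha) tokens per round for draft length k. The sketch below applies that formula to the table. Two loud assumptions: the table's acceptance rate is used as alpha, and k = 32 is borrowed from the local benchmarks. Neither was measured this way.

```python
def expected_tokens_per_round(alpha: float, draft_len: int) -> float:
    """Expected tokens emitted per round under independent acceptance.

    Counts the accepted draft prefix plus the one token the target
    always contributes at the end of a round.
    """
    if alpha >= 1.0:
        return draft_len + 1.0
    return (1.0 - alpha ** (draft_len + 1)) / (1.0 - alpha)


# Acceptance rates copied from the cloud API table.
cloud_pairs = {
    "Llama 3.1 8B -> Llama 3.1 405B": 0.189,
    "Qwen3 1.7B -> Qwen3.5 397B": 0.108,
    "Llama 3.1 8B -> Llama 3.1 70B": 0.099,
    "Qwen3 1.7B -> Qwen3 235B": 0.066,
}

for pair, alpha in cloud_pairs.items():
    tokens = expected_tokens_per_round(alpha, draft_len=32)
    print(f"{pair}: about {tokens:.2f} tokens per round")
```

Under this model, every pairing in the table yields only slightly more than one token per round, so each multi-second API call buys very little.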

// how we measured

Methodology

Run it yourself. The benchmark scripts live in the benchmarks/ directory of the repo; results save as JSON.

The pool — 4 mismatched GPUs, ~52GB VRAM, over WiFi

Machine	Hardware	Role
Desktop	RTX 4070 Ti Super (16GB) + RTX 3060 (12GB), NVIDIA / CUDA	Coordinator + largest pool share
Old gaming PC	RTX 2070 (8GB), NVIDIA / CUDA	RPC worker
Laptop	Apple M2, Metal	RPC worker + draft model
Pool total	CUDA + Metal mixed · ~52GB VRAM · standard home WiFi

No two GPUs match. No data-center interconnect — just WiFi. The whole point is that mismatched consumer hardware pools into one OpenAI-compatible endpoint, and speculative decoding makes the pooled target usable. Coordinator note: the machine loading the model needs enough system RAM for the full GGUF (~44GB for a 70B Q4_K_M), not just its GPU share.

📐

Greedy decoding (temp=0)

All benchmarks run at temperature 0 for reproducibility. Under greedy decoding, speculative output is mathematically identical to running the target alone — speculation never changes what you read, only how fast it arrives.

🎯

Acceptance vs wall-clock

Acceptance rate = fraction of drafted tokens the target keeps. Wall-clock speedup = actual seconds saved end-to-end. High acceptance only converts to wall-clock when the round-trip is cheap — which is why local pools win and cloud APIs don't.

🔁

Re-validation in progress

v0.5.1 replaced a loose same-family fast path with real per-position argmax verification. Legacy text-match acceptance tables are being re-measured under it. The 1.86× wall-clock result on the pooled 70B is unaffected and remains real.

💡

The threshold rule

Speculation helps when the target is slow. Below ~8–10 tok/s baseline (a 70B pool, a dense model over RPC), the draft→verify amortization is a clear win. Above ~15 tok/s — when the model already fits locally and runs fast — speculation adds overhead without benefit. Run direct mode there.
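The rule of thumb above fits in a few lines. This is a sketch, not a Tightwad feature: `recommend_mode` is a hypothetical helper, and the cutoffs are the approximate figures from the paragraph, with the lower end of the "~8–10" range used as the default. The text gives no guidance for the band in between, so the sketch says to measure.

```python
def recommend_mode(baseline_tok_s: float,
                   slow_below: float = 8.0,
                   fast_above: float = 15.0) -> str:
    """Suggest a mode from the target's measured baseline throughput."""
    if baseline_tok_s < slow_below:
        return "speculate"       # slow target: draft and verify is a clear win
    if baseline_tok_s > fast_above:
        return "direct"          # fast target: speculation only adds overhead
    return "benchmark both"      # in between: measure on your own hardware


print(recommend_mode(2.2))   # 70B pool baseline from this page
print(recommend_mode(17.0))  # Qwen3-32B running locally on the desktop
```

Treat the cutoffs as starting points and adjust them once you have numbers from your own pool.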
