// the only thing that matters

What do you actually do?

Most people don't get it at first. So here it is, dead simple. One change. That's it.

BEFORE

💬

Open WebUI

your chat app

→

🐢

Ollama :11434

Llama 3.3 70B — slow

Base URL: http://192.168.1.10:11434

⏳ Every token generated one at a time. Waiting.

→

AFTER

💬

Open WebUI

same app, no changes

→

🐷

Tightwad :8088

invisible proxy

→

⚡

Llama 3.3 70B

same output quality, 1.86× faster

Base URL: http://192.168.1.10:8088 ← only change

✓ Equivalent output quality. Just faster.

🔗

One URL change

Point your chat app at port 8088 instead of 11434. That's the entire setup from your app's perspective.

🫥

The small model is invisible

You never configure it, select it, or see it. It's like autocomplete on your phone — it suggests tokens, the big model accepts or corrects. You only see the final output.

🔬

Output quality is preserved

In the default speculative-decoding mode with greedy decoding (temperature=0), output is mathematically identical to running the large model alone — the big model validates every token (the Leviathan / Chen guarantee). With sampling, output is statistically equivalent. Multi-Drafter Consensus is a separate opt-in mode that trades that exactness for speed by skipping the target when drafters unanimously agree — it's labeled as approximate consensus, not the same guarantee.

🚀

Most tokens come from your cheap GPU, every token validated by the big one

With same-family models (Llama 3.1 8B → Llama 3.3 70B) and greedy decoding, most draft tokens match the target's argmax and ship straight through. The target still validates every position, so output under greedy decoding is mathematically identical to running the target alone. Acceptance benchmarks are being re-measured under v0.5.2's per-position verification.

That's it. Change one URL. Get up to 2–3× faster responses. Same quality.

Set It Up in 30 Minutes →

// the goods

Six ways to stop wasting money

Pick your poison. Stack them. Run all six. Tightwad doesn't judge — it just saves you cash.

👑 THE KILLER FEATURE

01

Combined Mode — Speculation Over a Pool

When a model doesn't fit on any single machine, pool your GPUs AND speculate on top. RPC pooling alone is slow (one network round-trip per token). Speculation amortizes that — 32 tokens verified per round-trip instead of 1. Result: models that fit nowhere become usable.

  [Junk Hardware — P400 2GB, GTX 770, laptop CPU]
        | runs 1.7B draft, ~30 tok/s
        | sends token IDs (bytes)
        v
  [Tightwad Proxy :8088]
        | sends draft to pool for BATCH verify
        v
  [RPC GPU Pool — 4 GPUs, 52GB total, WiFi]
        | verifies 32 tokens in ONE forward pass
        v
  4.1 tok/s instead of 2.2 tok/s — 70B fits nowhere else

✓ 1.86× measured speedup on Llama 3.3 70B (4 GPUs over WiFi)*
✓ Output mathematically identical to target alone under greedy decoding (Leviathan guarantee)
✓ Any junk hardware can be the drafter — 2GB GPU, CPU, laptop
✓ Pool CUDA + ROCm + Metal GPUs, speculate on top

* Measured: Llama 3.1 8B draft → Llama 3.3 70B target across RTX 4070 Ti Super + RTX 3060 + RTX 2070 + M2 Metal = 52GB VRAM over WiFi. 519 tokens in 127s vs 512 tokens in 231s direct. Your results will vary with hardware and network conditions.

⚡ 1.86× MEASURED ON 70B

02

Speculative Decoding Proxy

Your cheap GPU isn't slow — it's a draft engine. A fast small model guesses tokens. A big model batch-verifies them. Same output quality as running the big model alone. Ships token IDs (bytes), not 100–300 MB of tensor data over the wire.

  [Your App / OpenAI SDK]
          |
          v
+--------------------------+
|  Tightwad Proxy :8088    |
|                          |
|  1. Draft 32 tokens -----+--> Llama 3.1 8B
|     (~100 tok/s, cheap)  |    RTX 2070 (the dusty one)
|                          |
|  2. Verify batch --------+--> Llama 3.3 70B
|     (one forward pass)   |    4070Ti / Cloud API
|                          |
|  3. Accept/reject <------+
|  4. Stream to client     |
+--------------------------+
  Output quality = equivalent to 70B alone ✓

✓ Output quality equivalent to target model alone*
✓ Draft locally, verify via cloud API
✓ Supports Ollama + llama.cpp backends
✓ SSE streaming, full OpenAI compatibility

🧠 APPROXIMATE · SKIP THE GPU

03

Multi-Drafter Consensus approximate mode

Race multiple cheap machines simultaneously. Each drafter generates candidate tokens in parallel. When they all agree, the expensive GPU verification is skipped entirely — that’s the speed win. Tradeoff: this is an approximate mode, not exact speculative decoding. Skipping the target means consensus-accepted tokens may differ from what the target alone would produce (typically 3–6% divergence). Three sub-modes: strict, majority, any_disagree — pick your acceptable risk. Off by default.

  [Tightwad Proxy :8088]
        |
        | races all drafters in parallel
        |
   +----+----+----+
   v    v    v    v
 [M2] [CPU] [2070] [P400]
  8B    8B    8B     1.7B
   |    |    |      |
   +----+----+----+-+
        |
        v
  Consensus? ──yes──> Stream tokens (GPU never touched)
        |
       no
        |
        v
  [Target 70B GPU] ──> Verify only disagreed tokens

✓ Race unlimited drafters — CPUs, old GPUs, laptops, anything
✓ Unanimous tokens skip the target GPU entirely
✓ Three modes: strict, majority, any_disagree
✓ Tree-based speculation for branching draft paths
✓ Prometheus metrics for consensus accept/fallback rates
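
To make the agreement check concrete, here is a minimal Python sketch of the vote at a single token position. The function name `consensus_token` and the exact rules are assumptions for illustration: `strict` is taken to mean every drafter proposed the same token, and `majority` to mean more than half did. `any_disagree` is left out because its semantics are not spelled out above. This is not Tightwad's actual code.

```python
from collections import Counter
from typing import Optional, Sequence


def consensus_token(votes: Sequence[int], mode: str = "strict") -> Optional[int]:
    """Return the agreed token ID, or None to fall back to the target model.

    `votes` holds one proposed token ID per drafter for the same position.
    The mode semantics here are assumptions, not the project's definitions.
    """
    if not votes:
        return None
    token, count = Counter(votes).most_common(1)[0]
    if mode == "strict":
        # Unanimous agreement: every drafter proposed the same token.
        return token if count == len(votes) else None
    if mode == "majority":
        # More than half of the drafters proposed the same token.
        return token if count * 2 > len(votes) else None
    raise ValueError(f"mode not covered by this sketch: {mode}")
```

A `None` result is the "no" branch in the diagram above: that position goes to the target GPU for verification.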

04

RPC Cluster Mode

Got GPUs scattered across machines? Pool them. CUDA on one box, ROCm on another — Tightwad doesn't care. It distributes model layers across all of them and hands you one clean OpenAI-compatible endpoint.

  [OpenAI Client]
        |
        v
+-------------------+
|  Tightwad         |  <-- One endpoint to rule them all
|  Coordinator :8080|
+--------+----------+
         |  distributes layers
    +----+----+
    v         v
+--------+ +--------+
| Worker | | Worker |
| NVIDIA | |  AMD   |
| 4070Ti | | 7900XTX|
| 16 GB  | | 24 GB  |
+--------+ +--------+
  70B model: covered ✓

✓ Mix NVIDIA + AMD GPUs freely
✓ Run 70B+ models on consumer hardware
✓ Hot-swap models without restarting workers
✓ Built-in benchmarking CLI

🛡 BATCH VERIFICATION

05

Quality Gate — CPU Fleet Drafts, GPU Reviews

Different from token-level speculation — this operates at the full-response level. A fleet of cheap machines (CPUs, small GPUs) generate complete responses using small models. One powerful GPU reviews each output, approving, correcting, or rejecting. 60–80% pass unchanged — the GPU only sweats the hard 20–40%.

  [Client Request]
        |
        v
  [Tightwad Gate :8088]
        |
        | fan-out to CPU fleet
   +----+----+----+
   v    v    v    v
 [CPU] [CPU] [CPU] [CPU]
 each generates full response with 8B model
   +----+----+----+
        |
        v
  [GPU Target — 70B]
  Reviews each response:
    ✓ 60-80% approved (pass through)
    ✎ 15-25% corrected (light edit)
    ✗ 5-10% rejected (regenerate)

✓ GPU reviews every response but only rewrites or regenerates the hard 20–40%
✓ Any CPU or cheap GPU can be a drafter
✓ Full-response verification, not token-by-token
✓ Automatic approve/correct/reject pipeline
✓ tightwad gate start — one command to run
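
The approve/correct/reject flow can be sketched in a few lines of Python. Everything named here (`Review`, `gate`, and the three callables) is hypothetical and only illustrates the routing described above. How the real reviewer decides on a verdict is not covered.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Review:
    verdict: str                     # "approve", "correct", or "reject"
    corrected: Optional[str] = None  # set when the reviewer edits the draft


def gate(prompt: str,
         draft_fn: Callable[[str], str],
         review_fn: Callable[[str, str], Review],
         regenerate_fn: Callable[[str], str]) -> str:
    """Route one request through a full-response quality gate."""
    draft = draft_fn(prompt)           # cheap machine writes a complete response
    review = review_fn(prompt, draft)  # big model reviews it
    if review.verdict == "approve":
        return draft                   # passes through unchanged
    if review.verdict == "correct" and review.corrected is not None:
        return review.corrected        # light edit by the reviewer
    return regenerate_fn(prompt)       # rejected: the big model writes it itself
```

Only the last branch costs a full generation on the big GPU. The first two reuse the draft.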

🌐 P2P DISTRIBUTION

06

Swarm Transfer — P2P Model Distribution

Models are huge. Downloading 70B from HuggingFace takes hours. Pull from every machine that already has it. Chunked transfer with piece verification — like torrents, but for GGUF files. New machine joins the cluster? It downloads the model from all your existing machines in parallel.

  [New Machine Joins Cluster]
        |
        | "I need Llama-3.3-70B-Q4_K_M.gguf"
        v
  +---------------------------+
  | Tightwad Swarm Discovery  |
  |                           |
  |  Piece 1 <--- Machine A  |  (4070 Ti — has full model)
  |  Piece 2 <--- Machine B  |  (RTX 2070 — has full model)
  |  Piece 3 <--- Machine C  |  (M2 Metal — has pieces 1-6)
  |  Piece 4 <--- Machine A  |  (parallel, rarest-first)
  |  ...                      |
  +---------------------------+
        |
        v
  SHA256-verified • ready to serve in minutes, not hours

✓ Multi-source parallel download — pull from every peer simultaneously
✓ SHA256 piece verification — every chunk validated before use
✓ Rarest-first selection — ensures model availability across the cluster
✓ Delta updates — new quantization? Only transfer the changed pieces
✓ Zero central server — machines discover each other automatically
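
Piece verification and rarest-first ordering are both small ideas, sketched below in Python with only the standard library. The function names and the peer map are illustrative assumptions, and the sketch says nothing about Tightwad's wire protocol or real piece size.

```python
import hashlib
from collections import Counter
from typing import Dict, List, Set


def piece_hashes(data: bytes, piece_size: int) -> List[str]:
    """SHA256 hex digest of every fixed-size piece of a file's bytes."""
    return [
        hashlib.sha256(data[offset:offset + piece_size]).hexdigest()
        for offset in range(0, len(data), piece_size)
    ]


def verify_piece(piece: bytes, expected_sha256: str) -> bool:
    """Accept a downloaded piece only if its digest matches the manifest."""
    return hashlib.sha256(piece).hexdigest() == expected_sha256


def rarest_first(needed: Set[int], peers: Dict[str, Set[int]]) -> List[int]:
    """Order needed pieces so those held by the fewest peers come first."""
    availability = Counter(p for have in peers.values() for p in have)
    obtainable = [p for p in needed if availability[p] > 0]
    return sorted(obtainable, key=lambda p: (availability[p], p))
```

Fetching the rarest pieces first spreads scarce pieces around the cluster early, so a peer dropping off is less likely to leave a piece unavailable.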

Your junk drawer of compute, unified

YOUR HARDWARE (any mix works)

RTX 4070 Ti Super (16GB)

RTX 3060 (12GB)

RTX 2070 (8GB)

GTX 770 (2GB — why not)

RX 7900 XTX (24GB, AMD!)

Old Xeon (CPU only)

Laptop (M2, CPU draft)

CUDA ✓ ROCm ✓ CPU ✓ Mixed ✓

➜

TIGHTWAD

One endpoint

Draft fast. Verify smart.

localhost:8088


OpenAI-compatible API

Without Tightwad: big model generates every token, one at a time • With Tightwad: all your hardware works together, big model only handles the hard tokens • Output quality: equivalent* • Speed: up to 2–3× faster*

The math behind the magic

🚀

Draft

Small model blazes through 32 candidate tokens at ~30+ tok/s. Fast and cheap.

→

🔍

Verify

Big model evaluates all 32 tokens in a single forward pass. Batch is basically free.

→

✅

Accept

Keep every token both models agree on. Take the big model's token at the first disagreement.

→

📡

Stream

Accepted tokens stream to your app instantly. Repeat until done. Output quality is equivalent to the target model alone.
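
The four steps above can be sketched in Python for the greedy case. This is a toy under stated assumptions, not Tightwad's implementation: a "model" is any function that maps a token prefix to its greedy next token, and the names `speculative_round`, `generate`, `toy_target` and `toy_draft` are made up for illustration. A real server gets the target's choices from one batched forward pass, while the loop here computes the same values one position at a time.

```python
from typing import Callable, List, Sequence

# A "model" is any function mapping a token prefix to its greedy next token.
NextToken = Callable[[Sequence[int]], int]


def speculative_round(prefix: List[int], draft: NextToken, target: NextToken,
                      max_draft_tokens: int = 32) -> List[int]:
    """Run one draft/verify round and return the tokens to emit."""
    if max_draft_tokens < 1:
        raise ValueError("max_draft_tokens must be at least 1")
    # Draft: the small model proposes a run of candidate tokens.
    proposed: List[int] = []
    for _ in range(max_draft_tokens):
        proposed.append(draft(prefix + proposed))
    # Verify and accept: keep candidates while they equal the target's choice.
    emitted: List[int] = []
    for candidate in proposed:
        expected = target(prefix + emitted)
        if candidate != expected:
            emitted.append(expected)  # first disagreement: take the target's token
            return emitted
        emitted.append(candidate)
    return emitted                    # every drafted token was accepted


def generate(prompt: List[int], draft: NextToken, target: NextToken,
             max_new_tokens: int, max_draft_tokens: int = 32) -> List[int]:
    """Repeat rounds until enough tokens exist, then trim to the limit."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_round(out, draft, target, max_draft_tokens))
    return out[len(prompt):len(prompt) + max_new_tokens]


# Toy stand-ins: the draft agrees with the target except at every 4th position.
def toy_target(prefix: Sequence[int]) -> int:
    return (sum(prefix) * 31 + len(prefix)) % 11


def toy_draft(prefix: Sequence[int]) -> int:
    return toy_target(prefix) if len(prefix) % 4 else 0
```

Because every emitted token is either a draft token that equals the target's choice or the target's own token, `generate` returns exactly what the target would produce alone under greedy decoding. The draft only changes how many rounds it takes.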

// the research

This isn’t magic — it’s what Google and DeepMind already use

Speculative decoding powers production inference at the biggest players. Tightwad just puts it on your hardware without the data center.

📄 GOOGLE · ICML 2023

Fast Inference from Transformers via Speculative Decoding

Leviathan, Kalman, Matias (2022)

The foundational paper. Introduces the draft-verify loop and proves that the output distribution matches the target model's, which under greedy decoding means identical output.

arxiv.org/abs/2211.17192 →

📄 DEEPMIND

Accelerating Large Language Model Decoding with Speculative Sampling

Chen, Borgeaud, Irving, et al. (2023)

Independent, concurrent formulation of the same idea for stochastic sampling, built on a modified rejection-sampling scheme.

arxiv.org/abs/2302.01318 →

📝 GOOGLE RESEARCH BLOG

Looking Back at Speculative Decoding

Google Research (2024)

Plain-English retrospective from the original authors covering production deployment, what held up, and what didn’t.

research.google →

Tightwad is independent open-source software (MIT) with no affiliation, endorsement, or commercial relationship with Google, Google DeepMind, or the listed authors. Citations are nominative fair use of public academic publications.

// v0.5 — MoE first-class

MoE models, finally first-class

Mixture-of-Experts models are the new normal — MiniMax M2.5 (229B), GPT-OSS 120B, Qwen3-MoE, DeepSeek-MoE. Everyone’s shipping them. Nobody supports them properly on home hardware. Until now.

🧩 THE UNLOCK

01

GGUF Defusion — the feature that makes the rest possible

Most MoE GGUFs ship fused: one giant tensor per layer covering every expert. llama.cpp can’t per-expert-split a fused tensor — so every MoE optimization tool on the planet silently degrades to whole-layer placement. Tightwad is the only one that rewrites fused weights into indexed form so you can actually pin experts.

  BEFORE (fused)                          AFTER (indexed)
  blk.0.ffn_gate_exps.weight  -->  blk.0.ffn_gate.0.weight
                                    blk.0.ffn_gate.1.weight
                                    blk.0.ffn_gate.2.weight
                                    ...
                                    blk.0.ffn_gate.127.weight

  Same bytes. Same quantization. One pass of disk I/O.

✓ tightwad moe defuse fused.gguf indexed.gguf
✓ Identical weights, identical output, identical quality
✓ Works on GPT-OSS 120B, Qwen3-MoE, DeepSeek-MoE, Mixtral
✓ Round-trip verified in the test suite — same tokens out
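
The renaming and slicing idea can be shown on raw bytes. This Python sketch assumes each expert occupies one equal-sized, contiguous slice of the fused tensor's data, and it ignores everything else a real GGUF rewrite has to handle (headers, metadata, alignment). The function name `defuse` is illustrative and is not the CLI's internal API.

```python
from typing import Dict

FUSED_SUFFIX = "_exps.weight"


def defuse(fused_name: str, fused_bytes: bytes, num_experts: int) -> Dict[str, bytes]:
    """Split one fused expert tensor into per-expert tensors with indexed names."""
    if not fused_name.endswith(FUSED_SUFFIX):
        raise ValueError(f"not a fused expert tensor: {fused_name}")
    if num_experts < 1 or len(fused_bytes) % num_experts != 0:
        raise ValueError("tensor data does not divide evenly across experts")
    stem = fused_name[:-len(FUSED_SUFFIX)]      # e.g. "blk.0.ffn_gate"
    size = len(fused_bytes) // num_experts      # bytes per expert
    return {
        f"{stem}.{index}.weight": fused_bytes[index * size:(index + 1) * size]
        for index in range(num_experts)
    }
```

Concatenating the slices in index order gives back the original buffer, which is the "same bytes" property: nothing is requantized, only re-labelled.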

🎯 EXPERT PINNING

02

Balanced Placement — whole experts on one card

Once your model is indexed, Tightwad bin-packs entire experts onto individual GPUs proportional to VRAM. No more shipping half an expert across the network every forward pass. Emits llama.cpp --override-tensor flags automatically — you just set moe_placement: balanced and start.

  models:
    gpt-oss-120b:
      path: /models/gpt-oss-120b-indexed.gguf
      moe_placement: balanced

  Tightwad emits at startup:
    --override-tensor "^blk\.0\.ffn_(gate|up|down)\.(0|1|...|67)\.weight$=CUDA0"
    --override-tensor "^blk\.0\.ffn_(gate|up|down)\.(68|...|101)\.weight$=RPC[...]"
    ... (one per layer per device)

✓ Automatic VRAM-proportional bin-packing
✓ One -ot regex per (layer, device) — narrowest first
✓ tightwad moe plan --emit-ot to preview before launch
✓ Falls back gracefully for fused models (with a pointer to defuse)
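
Here is a rough Python sketch of how flags shaped like the ones above could be generated. It uses a plain proportional split of contiguous expert indices, which is simpler than a real bin-packer, and the helper names and device labels are assumptions for illustration. Only the `--override-tensor "<regex>=<device>"` shape is taken from the example output above.

```python
import re
from typing import Dict, List


def split_experts(num_experts: int, vram_gb: Dict[str, float]) -> Dict[str, List[int]]:
    """Give each device a contiguous run of expert indices sized by its VRAM share."""
    total = sum(vram_gb.values())
    if num_experts < 1 or total <= 0:
        raise ValueError("need at least one expert and some VRAM")
    plan: Dict[str, List[int]] = {}
    devices = list(vram_gb)
    start, cumulative = 0, 0.0
    for position, device in enumerate(devices):
        cumulative += vram_gb[device]
        is_last = position == len(devices) - 1
        # The last device takes whatever remains so no expert is dropped.
        end = num_experts if is_last else round(num_experts * cumulative / total)
        plan[device] = list(range(start, end))
        start = end
    return plan


def expert_pattern(layer: int, experts: List[int]) -> str:
    """Regex matching the gate/up/down tensors of the given experts in one layer."""
    ids = "|".join(str(e) for e in experts)
    pattern = rf"^blk\.{layer}\.ffn_(gate|up|down)\.({ids})\.weight$"
    re.compile(pattern)  # fail fast if the pattern is malformed
    return pattern


def override_tensor_flags(layer: int, plan: Dict[str, List[int]]) -> List[str]:
    """One flag per (layer, device), skipping devices that received no experts."""
    return [
        f'--override-tensor "{expert_pattern(layer, experts)}={device}"'
        for device, experts in plan.items()
        if experts
    ]
```

Whole experts never straddle two devices in this scheme, which is the property the placement is after.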

🔥 PROFILE-GUIDED

03

Hot Experts On Your Fastest Card

MoE routing isn’t uniform. A handful of experts fire 10× more than the tail. Capture a profile from your real traffic, and Tightwad pins the hot ones to your fastest GPU. Cold experts land on slow nodes. Same total VRAM, dramatically different throughput.

  $ tightwad moe profile --follow-coord --duration 300
  $ tightwad moe summary ~/.tightwad/profile.json
     Top 20 hot experts
     Layer  Expert  Hits
     ─────  ──────  ────
        12       7  4,231   ◀ 10× more than average
        12      88  3,109
         0      19  2,874
        ...
  $ # update moe_placement: profile-guided & restart

✓ Weight = bytes × (1 + 3× hit-frequency)
✓ Top-K hot experts pinned to highest-scoring device
✓ Auto-measured device scores (TCP-RTT, 24h cache)
✓ Experimental — needs instrumented llama.cpp build
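
The weight formula is easy to play with in Python. One assumption is made loudly here: hit-frequency is taken to be an expert's share of all recorded hits (a value between 0 and 1), which the text above does not confirm. The function names are illustrative.

```python
from typing import Dict, List, Tuple

ExpertKey = Tuple[int, int]  # (layer, expert)


def placement_weights(size_bytes: Dict[ExpertKey, int],
                      hits: Dict[ExpertKey, int]) -> Dict[ExpertKey, float]:
    """Weight = bytes x (1 + 3 x hit-frequency), per expert."""
    total_hits = sum(hits.values())
    weights: Dict[ExpertKey, float] = {}
    for key, nbytes in size_bytes.items():
        # Assumed definition: this expert's share of all recorded hits.
        frequency = hits.get(key, 0) / total_hits if total_hits else 0.0
        weights[key] = nbytes * (1 + 3 * frequency)
    return weights


def hottest(hits: Dict[ExpertKey, int], k: int) -> List[ExpertKey]:
    """The k most frequently hit experts, ties broken by (layer, expert)."""
    return sorted(hits, key=lambda key: (-hits[key], key))[:k]
```

Under this reading, an expert that is never hit keeps a weight equal to its size, and a hot expert counts for more than its bytes when the planner decides what deserves the fastest card.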

04

MoE + Speculation = The Whole Stack

Pair expert-aware placement with speculative decoding and you get sparse weights + sparse compute. The draft model (small, same-family) predicts 32 tokens. The pooled MoE target verifies in one batch. Only the activated experts fire. Only the disagreed tokens get regenerated. It’s the whole stack of optimizations working together.

  [Your App]
      |
      v
  [Tightwad Proxy :8088]
      |
      +--> Draft: Qwen3-1.7B (local, ~100 tok/s)
      |
      +--> Target: MiniMax M2.5 (229B MoE, LM Studio)
                      |
                      | expert-aware placement keeps hot experts
                      | on Mac Studio M3 Ultra unified memory
                      |
                      v
                Batch verify 32 tokens → activate 8 of 256 experts

✓ Works with Combined Mode (RPC pool) out of the box
✓ tightwad moe bench streams live TTFT + acceptance table
✓ OpenAI-compatible targets (LM Studio, vLLM, llama-server)
✓ MiniMax M2.5 baseline: 26–48 tok/s direct on Mac Studio M3 Ultra

Full guide: docs/moe.md · Wiki: MoE-Support · Reference config: cluster-moe-youngharold.yaml

// the money shot

Benchmarks that hit different

Real hardware. Real numbers. No cherry-picking. Logprobs-based batch verification is live — these acceptance rates translate directly to wall-clock speedup.

Llama 3.3 70B · 4-GPU RPC pool (52GB VRAM over WiFi) · Llama 3.1 8B draft on M4 Metal

Mode	Tokens	Time	Speed
RPC pool direct (autoregressive)	512	231s	2.2 tok/s
RPC pool + speculation	519	127s	4.1 tok/s
⚡ Speedup	Greedy-equivalent output · 33 tokens/round		1.86×

The 70B model doesn't fit on any single machine. It's distributed across 4 GPUs (RTX 4070 Ti Super + RTX 3060 + RTX 2070 + M2 Metal) over WiFi. Without speculation: painfully slow. With speculation: usable.

👑

The killer result

A 70B model across 4 consumer GPUs over WiFi — from 2.2 to 4.1 tok/s. No single machine could run this model. Speculation makes it usable.

✅

Greedy-equivalent output

Same-family drafting (Llama 3.1 8B → Llama 3.3 70B) under v0.5.2's per-position verification: output is mathematically identical to running the 70B alone (Leviathan greedy guarantee). Acceptance numbers are being re-measured under the corrected verifier.

⚠️

Family matters

Llama 3.2 3B → Llama 3.3 70B got only 1.6% acceptance despite sharing a tokenizer. Architecture match is critical — Llama 3.1 8B is the correct drafter.

Qwen3-32B · 4-GPU RPC pool · Qwen3-1.7B draft on M4 CPU

Mode	Speed	Notes
Desktop local only (4070+3060, 32B)	17.0 tok/s	Best case — fits on one machine
4-GPU RPC pool (autoregressive)	3.0 tok/s	Each token = full RPC round-trip
RPC pool + speculation	5.4 tok/s	32 tokens verified per batch (greedy decoding, output equivalent to target alone)
⚡ Pool speedup	1.8× over pool-only (3.0 → 5.4 tok/s)

RPC pooling alone is slow over WiFi (one network round-trip per token). Speculation amortizes that — 32 tokens per round-trip instead of 1. Don't pool when the model fits locally (17 tok/s local vs 5.4 tok/s pooled).

Qwen3-8B (RTX 2070) → Qwen3-32B (RTX 4070 Ti Super) · 130 prompts

Prompt Type	Acceptance Rate	Rounds	Verdict
🧮 Reasoning	89%	32	Math is deterministic. Love it.
💻 Code	76%	34	Syntax is law. Both models agree.
📚 Factual	73%	18	Strong agreement on facts.
📋 List	42%	40	Phrasing varies. Still worthwhile.
🎨 Creative	39%	6	Many valid outputs. Expected.
⚡ Average	63.8%	26	64% of tokens = free.

💸

What 64% means

Nearly two-thirds of your tokens come from the cheap GPU. The expensive model only works on the hard parts.

🎯

Output quality

Equivalent to running the big model alone. With greedy decoding, mathematically identical; with other sampling, statistically equivalent.

⚡

Logprobs: live

Logprobs-based batch verification is implemented. These acceptance rates are real wall-clock speedup, not just acceptance stats.

Qwen3-8B (local GPU) → Qwen3.5-397B (API) · logprobs + whitespace normalization

Prompt Type	Acceptance Rate	Notes
🧮 Reasoning	88%	Highest — deterministic math
⚡ Average (normalized)	80%	Key result: 4 in 5 tokens local.

🏆

80% acceptance: Qwen3-8B → Qwen3.5-397B

With whitespace normalization, a consumer GPU running an 8B model drafts 4 out of every 5 tokens for a 397B model. That means up to 80% fewer output tokens billed to the cloud API for the same quality output. The bigger the gap between draft and target quality, the more you save.

🏆

Notable result

Up to 80% acceptance on a 397B model (same-family models). Your gaming PC is doing up to 80% of the work that would otherwise cost API money.

💰

API cost math

At $0.60/M output tokens (Qwen3.5-397B), 80% acceptance means you pay for roughly 20% of output tokens via the API — up to 5× reduction in output token costs. Input/prompt tokens are still processed by the API. Local GPU electricity and hardware costs not included.

📝

Same-family is key

Qwen3-8B + Qwen3.5-397B are from the same model family. Cross-family (e.g. Llama → Qwen) drops to ~3%. Same family = high acceptance.

Wall-clock speedup · Qwen3-8B (RTX 2070) → Qwen3-32B (RTX 4070 Ti + RTX 3060) · llama-server · max_draft_tokens=32

Prompt	Baseline	Speculative	Speedup
Capital of France	1.17s	0.90s	1.30x
Thermodynamics	12.73s	9.09s	1.40x
Prime checker	12.76s	10.15s	1.28x
Average speed	13.24s	10.95s	1.21x
TCP vs UDP	5.58s	4.88s	1.14x
Total	45.43s	35.96s	1.27x

Set max_draft_tokens: auto and Tightwad finds the sweet spot for you. Or pin it at 32 for manual control.

// savings calculator

How much are you leaving on the table?

Slide to see your monthly cloud inference waste. Then stop doing that.

Monthly tokens generated

10M

Cost per 1M output tokens ($)

$15

😭 Without Tightwad $150/mo

→

🐷 With Tightwad $63/mo

You save $87/mo

* Estimated savings assume the selected acceptance rate is sustained. The default uses a 58% acceptance rate, typical of local GPU setups ($150 × 0.42 = $63). Rates vary by model pair and prompt type (58-64% local, up to 80% same-family API). Savings do not account for local electricity, hardware costs, or maintenance. Your results will vary.
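
The calculator's arithmetic fits in one small Python function. It follows this page's simplification that accepted draft tokens are not billed as API output tokens, and it ignores input tokens, electricity and hardware. The function name `monthly_cost` is made up for the sketch.

```python
def monthly_cost(tokens_per_month: int, price_per_million: float,
                 acceptance_rate: float = 0.0) -> float:
    """Estimated monthly output-token bill in dollars.

    acceptance_rate is the fraction of output tokens drafted locally and
    accepted, which this simplified model treats as not billed by the API.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    billed_tokens = tokens_per_month * (1.0 - acceptance_rate)
    return round(billed_tokens / 1_000_000 * price_per_million, 2)


# The slider example: 10M tokens a month at $15 per 1M output tokens.
without_tightwad = monthly_cost(10_000_000, 15.0)
with_tightwad = monthly_cost(10_000_000, 15.0, acceptance_rate=0.58)
savings = without_tightwad - with_tightwad
```

The $150 → $63 example corresponds to a 58% acceptance rate. Plugging in 0.80 at $0.60 per 1M tokens reproduces the "pay for roughly 20%" case, a 5× reduction in output-token cost.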

// who's this for

Pick your archetype

Tightwad was built by and for people who think paying $200/month for inference is genuinely offensive.

🏠

The Homelab Hoarder

You have a 2070 in the old desktop, a 4070 in the main rig, and a Quadro doing nothing in the server rack. They're all lazy freeloaders. Tightwad makes them a team.

RPC Cluster Mode

Pool all your random GPUs into one endpoint
Run 70B models across consumer hardware
Zero wasted VRAM, zero cloud spend

💰 MOST POPULAR

☁️

The Cloud Escapee

You're still paying OpenAI/Anthropic for some tasks. Fine. But why let them do the easy parts? Draft locally, verify via API. 58% fewer API calls. Same answers.

Speculative Proxy Mode

Local draft GPU does the heavy lifting
Cloud only handles the hard tokens
Drop-in OpenAI SDK replacement — zero code changes

🏗️

The Budget Builder

You want 70B model quality on a $600 GPU budget. That's not a dream — that's just math. RPC mode distributes layers across whatever you've got collecting dust.

RPC Cluster Mode

Llama 3.3 70B on 4× consumer GPUs
No enterprise hardware required
Benchmark built-in to tune your setup

⚡

The Mixed Vendor Maverick

You bought AMD because it was on sale and NVIDIA for the CUDA ecosystem. Now they won't cooperate. Tightwad makes CUDA and ROCm work the same model together. Finally.

RPC Cluster Mode

CUDA + ROCm on the same model
llama.cpp RPC handles the hard parts
Coordinator distributes layers intelligently

// four ways to run it

Pick your compute configuration

Tightwad works however your hardware is set up. Consumer GPU, no GPU, cloud API — there's a config for you.

MOST COMMON

GPU → GPU

Homelab / Small Teams

Draft on any GPU you have, verify on a bigger one. Mix any generations, any vendors. RTX 4070 + GTX 770 + RX 7900 XTX — all in one cluster.

🖥️

Draft Machines

RTX 2070 · GTX 770 · RX 6700 · any GPU

→

🖥️

Target Machine

RTX 4070 Ti · RX 7900 XTX · any big GPU

~64% acceptance · 1.3x speedup · any mix of hardware

GPU → API

Slash Your API Bills

Even a GTX 1060 can draft for a cloud-hosted model from the same family. Any GPU you have (old, cheap, low VRAM) reduces your API bill. Up to 80% fewer output tokens billed to the cloud API.

🖥️

Your PC

GTX 1060 / RTX 2070 / any GPU · 8B draft

→

☁️

Cloud API

Qwen3.5-397B · pay per token

80% acceptance · up to 5x output token cost reduction · any CUDA/ROCm GPU

CPU → GPU

Zero GPU Required to Participate

Run a tiny draft model on any CPU. Verify on a remote GPU server. Your CPU-only server, your laptop, your NAS — all can contribute to the cluster.

💻

Any machine

CPU only · Qwen3-1.7B draft · even a laptop

→

🖥️

GPU Server

Any GPU · any big target model

~68% acceptance · zero GPU required to participate

ENTERPRISE PLAY

CPU → API

Literally Any Computer

Data centers often run at 10–30% average utilization (industry estimates). Idle CPUs, stranded servers, that old Xeon doing nothing — put them to work drafting tokens. No GPU required, ever.

🏢

Any idle machine

Old Xeon · 32-core server · spare laptop · Qwen3-1.7B

→

☁️

Cloud API

397B model · only for hard tokens

Stranded compute → inference revenue

Legacy GPU revival: that GTX 770 from 2013 can run Qwen3-1.7B as a drafter for a 70B target — turning e-waste into productive infrastructure.

Config	Draft	Target	Use Case	Acceptance
GPU → GPU	Any GPU — old, new, NVIDIA, AMD	Any bigger GPU	Homelab, mixed hardware	~64%
GPU → API	Any GPU (even GTX 1060)	Cloud API	Slash API bills	~80%
CPU → GPU	Any CPU, no GPU needed	GPU server	Zero-GPU participants	~68%
CPU → API	Literally any computer	Cloud API	Data centers, enterprise	~68%

// concrete recipe

Homelab Setup in 30 Minutes

Four machines. One 70B model. Start with two, add machines anytime. The cluster grows.

⚡ Draft Brain

💻

MacBook Air M4

Llama 3.1 8B · Apple Silicon

Tightwad proxy :8088

Proposes 32 tokens/batch
at ~60 tok/s locally

propose →

WiFi

← verify

🖥️ GPU Pool — Target Model

Llama 3.3 70B · 52GB VRAM distributed

🖥️

Desktop

RTX 4070 Ti Super + RTX 3060

28 GB VRAM

🖥️

Gaming PC

RTX 2070

8 GB VRAM

💻

MacBook Air M2

Apple Metal

16 GB unified

3 machines · 52 GB total · rpc-server :50052

1.86× speedup

4.1 tok/s was 2.2 tok/s

= output identical to 70B alone

$0 cloud spend

1

On Machines A, B, C: Start RPC workers

Pool Workers

bash (on each pool machine)

# Machine A — Desktop (4070 Ti + 3060, 28GB)
$ rpc-server -p 50052
# Machine B — Old Gaming PC (RTX 2070, 8GB)
$ rpc-server -p 50052
# Machine C — MacBook Air M2 (Metal, 16GB)
$ rpc-server -p 50052

2

On Machine D: Start the draft model

Machine D

bash

# Machine D — MacBook Air M4 (runs draft + proxy)
$ ollama run llama3.1:8b
# Confirm:
$ ollama ps
✓ llama3.1:8b  running

3

Install Tightwad (on the machine that will run the proxy)

Either

bash

$ python3 -m venv .venv && source .venv/bin/activate
$ pip install tightwad

4

Edit configs/cluster.yaml

Either

configs/cluster.yaml — combined mode (pool + speculation)

proxy:
  host: 0.0.0.0
  port: 8088
  max_draft_tokens: auto            # auto-tunes based on acceptance rate
  mode: combined              # speculation over pooled GPUs
  draft:
    url: http://localhost:11434   # Machine D (M4, local draft)
    model_name: llama3.1:8b
    backend: ollama

coordinator:
  host: 0.0.0.0
  port: 8080
  model: Llama-3.3-70B-Q4_K_M.gguf

workers:
  - host: 192.168.1.10        # Machine A (4070 Ti + 3060)
    rpc_port: 50052
  - host: 192.168.1.20        # Machine B (RTX 2070)
    rpc_port: 50052
  - host: 192.168.1.30        # Machine C (M2 Metal)
    rpc_port: 50052

Find your IPs: ip addr on Linux, ipconfig on Windows, ifconfig on macOS. Add more workers anytime — the cluster grows.

5

Start the proxy

Either

bash

$ tightwad proxy start
✓ Draft model healthy  (llama3.1:8b @ localhost:11434) — Machine D
✓ Pool: 3 workers online (52GB VRAM total) — A + B + C
✓ Target: Llama-3.3-70B distributed across pool
✓ Proxy listening on http://localhost:8088

6

Test it

Either

bash

$ curl http://localhost:8088/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "What is 17 * 24?"}], "max_tokens": 100}'

# Check acceptance rate
$ tightwad proxy status
→ Acceptance rate: ~58% | Rounds: N | Tokens saved: N

7

Point your chat app at it

Done ✓

In Open WebUI, ChatBot UI, or any OpenAI-compatible app, change the base URL from:

http://192.168.1.10:11434 → http://192.168.1.10:8088

That's it. Four machines, one endpoint. Same app. Same model name. Same output quality. Machines A, B, and C pool a 70B model that fits on no single machine. Machine D drafts and proxies. You just see 4.1 tok/s instead of 2.2.

What to expect with this setup

Same-family models (Llama 3.1 8B → Llama 3.3 70B) with greedy decoding:

Metric	Result
⚡ Output equivalence (greedy)	Identical to 70B alone
🚀 Speedup	1.86×
💬 Tokens per round	33
⏱️ Speed (pool only)	2.2 tok/s
⏱️ Speed (pool + speculation)	4.1 tok/s

// get running in 5 minutes

Quick Start

No Docker Compose files with 300 environment variables. No Kubernetes YAML. Just Python and one config file.

1

Install

bash

$ pip install tightwad
# or from source:
$ git clone https://github.com/youngharold/tightwad.git
$ cd tightwad && pip install .

2

Configure your hardware

configs/cluster.yaml

proxy:
  host: 0.0.0.0
  port: 8088
  max_draft_tokens: auto          # auto-tunes based on acceptance rate
  fallback_on_draft_failure: true
  draft:
    url: http://192.168.1.50:11434  # Your cheap GPU (Ollama)
    model_name: qwen3:8b
    backend: ollama
  target:
    url: http://192.168.1.100:11434   # Your big GPU (Ollama)
    model_name: qwen3:32b
    backend: ollama

3

Start it & test

bash

$ tightwad proxy start
✓ Draft model healthy
✓ Target model healthy
✓ Proxy listening on http://localhost:8088

# Test it (drop-in for any OpenAI SDK call)
$ curl http://localhost:8088/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 50}'

# Check acceptance rate stats
$ tightwad proxy status
→ Acceptance rate: 73.2% | Rounds: 34 | Tokens saved: 14,891
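
The same request from Python, using only the standard library. This sketch mirrors the curl call above and assumes the proxy answers in the usual OpenAI chat-completions shape (`choices[0].message.content`). The helper names `chat_request` and `ask` are made up for illustration.

```python
import json
import urllib.request


def chat_request(base_url: str, prompt: str, max_tokens: int = 50) -> urllib.request.Request:
    """Build the same POST the curl example sends, without sending it."""
    body = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(base_url: str, prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(chat_request(base_url, prompt), timeout=120) as resp:
        payload = json.load(resp)
    # Assumed response shape: OpenAI-style chat completion.
    return payload["choices"][0]["message"]["content"]
```

Calling `ask("http://localhost:8088", "Hello")` talks to the proxy. Pointing the same code at a different base URL is the only change needed to bypass it.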

RPC Cluster Mode

1

Build RPC workers (CUDA — Windows/Linux)

bash (worker machine)

# Run from a llama.cpp source checkout, or use scripts/install-worker.sh
$ cmake -B build -DGGML_CUDA=ON -DGGML_RPC=ON
$ cmake --build build --config Release
$ build/bin/rpc-server -p 50052  # GPU 0

2

Configure cluster topology

configs/cluster.yaml

coordinator:
  host: 0.0.0.0
  port: 8080
  backend: hip  # or cuda
  gpus:
    - name: "7900 XTX #0"
      vram_gb: 24

workers:
  - host: 192.168.1.100  # NVIDIA box
    gpus:
      - name: "RTX 4070 Ti Super"
        vram_gb: 16
    rpc_port: 50052

models:
  llama-3.3-70b:
    path: /models/Llama-3.3-70B-Q4_K_M.gguf
    ctx_size: 8192
    flash_attn: true
    default: true

3

Start the cluster

bash

$ tightwad start
✓ Coordinator started
✓ Worker @ 192.168.1.100:50052 online
✓ Model llama-3.3-70b loaded across 40 GB VRAM

# Hot-swap to a different model anytime
$ tightwad swap deepseek-r1-70b

# Run the benchmark
$ tightwad benchmark

// what's in the box

Everything you need, nothing you don't

Built for terminal people who hate bloat as much as they hate cloud bills.

🔁

OpenAI Compatible

Drop-in replacement for any OpenAI SDK. Change one URL. That's it. No code changes required.

🔄

Hot-Swap Models

tightwad swap model-name — swap the model while workers keep running. Zero downtime.

📡

SSE Streaming

Full streaming support on all endpoints. Tokens flow as they're accepted. No buffering.

⌨️

CLI-First

tightwad start, tightwad proxy start, tightwad status. Simple commands for complex infrastructure.

📄

YAML Config

One file describes your entire hardware topology. Version control it. Share it. Ship it.

📊

A/B Benchmark

tightwad bench — proxy vs direct target comparison. See your exact speedup, tok/s, and per-prompt breakdown.

🧪

Dual Backends

Ollama for quick setup. llama.cpp for maximum performance. Switch per-model in the config.

🔒

Fallback Safety

Draft server down? fallback_on_draft_failure: true routes straight to target. Never breaks.

💻

Mixed Vendor

NVIDIA CUDA + AMD ROCm on the same model, same cluster, same endpoint. No compromises.

🧬

Family Validation

Auto-detects model architecture families. Warns you before a mismatched draft/target pair wastes hours at 1.6% acceptance.

💬

Auto Chat Templates

Detects Llama 3, Mistral, Gemma, Phi, and more. No more hardcoded Qwen3 template breaking other model families.

🎯

Auto-Tune

max_draft_tokens: auto — adjusts at runtime based on acceptance rate. Zero-config optimization.

🤝

Consensus Verification

Multiple drafters vote on tokens. When they agree, skip the target entirely. Three modes: strict, majority, any_disagree.

🌐

Peer Agent

Cross-platform cluster management without SSH. REST API on every node for version checks, GPU info, and remote control.

🛡️

Safety Checks

Version enforcement, MoE VRAM warnings, SSRF protection, bearer token auth. Production-ready out of the box.

🧩

GGUF Defusion

tightwad moe defuse rewrites fused expert tensors to indexed form so per-expert placement actually works. No quantization change, identical weights.

🎯

Expert-Aware Placement

moe_placement: balanced pins whole experts to individual GPUs via llama.cpp --override-tensor flags. No more half-experts on the wire.

🔥

Profile-Guided Placement

Capture per-expert hit counts, pin the hot ones to your fastest card. tightwad moe profile + moe_placement: profile-guided.

📈

MoE Benchmark

tightwad moe bench — streams live TTFT, rolling acceptance, speedup. Works with LM Studio, vLLM, any OpenAI-compatible target.

// honest comparison

Why not just use vLLM?

Fair question. Here's the honest answer. The other tools are good. Tightwad is for a different problem.

Comparison accurate as of March 2026. These tools evolve quickly — check their docs for the latest capabilities.

vs

vLLM

Excellent production inference engine. CUDA-first. Built for ML teams.

⚠ Primarily CUDA-focused. ROCm support is experimental/limited. Tightwad treats CUDA and ROCm as first-class citizens in the same cluster.
⚠ Can't mix GPU generations. vLLM can't pool a GTX 770 with a 4070 Ti. Tightwad doesn't care what generation or vendor your hardware is from.
⚠ Speculative decoding, but single-machine only. Tightwad does it across your network — draft on one box, verify on another.
⚠ No CPU nodes. Can't add a CPU-only machine to a vLLM cluster. Tightwad: CPU drafting is fully supported.
✓ Use vLLM if: you have a single powerful CUDA machine and need production-grade throughput.

vs

Ollama

The reason most people have local models. One model, one machine, beautifully simple.

⚠ One model, one machine. Ollama has no concept of cross-machine inference, so when you outgrow a single GPU it can't pool across machines. Your RTX 2070 and RTX 4070 stay completely isolated from each other.
✓ Tightwad works with Ollama. Keep Ollama on each machine — Tightwad just coordinates between them.
✓ Use Ollama for getting started. Use Tightwad when you have a second machine and want them to work together.

vs

llama.cpp RPC

The low-level primitive Tightwad is built on. Powerful. Requires a lot of scripting.

✓ Tightwad is built on llama.cpp RPC. We add the orchestration, YAML config, CLI, and speculative proxy on top.
⚠ RPC ships 100–300 MB of tensor data per network step. Tightwad's speculative proxy ships token IDs — bytes, not megabytes.
✓ Use raw RPC if you want maximum control. Use Tightwad if you want it to just work.
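For a sense of scale, here is a back-of-the-envelope calculation of the payload difference. The assumptions are mine, not measured figures: a 1 Gbit/s link, 4-byte token IDs, a 32-token draft, and no latency or protocol overhead (real requests carry more than the raw IDs).

```python
def transfer_seconds(payload_bytes: int, link_bits_per_second: float) -> float:
    """Time to push a payload through a link, ignoring latency and overhead."""
    return payload_bytes * 8 / link_bits_per_second


GIGABIT = 1e9  # assumed link speed: 1 Gbit/s

# Tensor payloads at the low and high end of the 100-300 MB range.
rpc_low = transfer_seconds(100 * 10**6, GIGABIT)
rpc_high = transfer_seconds(300 * 10**6, GIGABIT)

# A 32-token draft sent as 4-byte token IDs (both numbers are assumptions).
draft = transfer_seconds(32 * 4, GIGABIT)

print(f"RPC step: {rpc_low:.1f}-{rpc_high:.1f} s, draft IDs: {draft * 1e6:.2f} microseconds")
```

Under these assumptions a tensor step spends between roughly 0.8 and 2.4 seconds on the wire, while the token IDs take about a microsecond, so for the proxy the network stops being the bottleneck.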

vs

TGI (HuggingFace)

Production inference for the HuggingFace ecosystem. Great if you're already there.

⚠ Optimized for the HuggingFace ecosystem. Designed to work best with HuggingFace's model hub and services.
✓ Tightwad is vendor-neutral. Works with your existing Ollama or llama.cpp setup. No accounts required.
✓ Use TGI if you're in the HuggingFace ecosystem. Use Tightwad if you want backend-agnostic, no-strings-attached inference.

The honest summary

Single powerful CUDA machine, production workloads → Use vLLM

One machine, just want to run models → Use Ollama

Two or more machines — mixed GPUs, old & new, NVIDIA & AMD, or CPU-only — want them all working together → 🐷 Use Tightwad
