v0.5.2

Your GPUs are

Tightwad pools your mixed CUDA + ROCm GPUs into a single OpenAI-compatible endpoint.
Speculative decoding proxy: draft fast, verify smart, stream everything.
Same output quality. 1.86× measured on 70B.* Zero cloud bill (fully local setup).

1.86× measured speedup (70B pooled)*
= output identical to target alone (greedy)
70B across 4 consumer GPUs over WiFi
$0 cloud bill (fully local setup)

* 1.86× wall-clock measured on Llama 3.1 8B → Llama 3.3 70B across a 4-GPU RPC pool (52GB VRAM over WiFi) with greedy decoding (temperature=0). Under greedy decoding, output is mathematically identical to running the target alone: speculative decoding is a pure speed optimization, not a quality tradeoff. Speedup depends on hardware, model pairing, network, and configuration. Your results will vary.

or just pip install
$ pip install tightwad

 Collecting tightwad
 Installing collected packages: tightwad
 Successfully installed tightwad-0.5.2

$ tightwad proxy start
 Proxy listening on http://localhost:8088
 Ready. Point your app at localhost:8088.
📦 PyPI · Python 3.10+

What do you actually do?

Most people don't get it at first. So here it is, dead simple. One change. That's it.

BEFORE
💬 Open WebUI (your chat app) → 🐢 Ollama :11434 (Llama 3.3 70B, slow)
Base URL: http://192.168.1.10:11434
⏳ Every token generated one at a time. Waiting.

AFTER
💬 Open WebUI (same app, no changes) → 🐷 Tightwad :8088 (invisible proxy) → ⚡ Llama 3.3 70B (same output quality, 1.86× faster)
Base URL: http://192.168.1.10:8088 ← only change
✓ Equivalent output quality. Just faster.

🔗

One URL change

Point your chat app at port 8088 instead of 11434. That's the entire setup from your app's perspective.

🫥

The small model is invisible

You never configure it, select it, or see it. It's like autocomplete on your phone: it suggests tokens, the big model accepts or corrects. You only see the final output.

🔬

Output quality is preserved

In the default speculative-decoding mode with greedy decoding (temperature=0), output is mathematically identical to running the large model alone, because the big model validates every token (the Leviathan / Chen guarantee). With sampling, output is statistically equivalent. Multi-Drafter Consensus is a separate opt-in mode that trades that exactness for speed by skipping the target when drafters unanimously agree; it's labeled as approximate consensus, not the same guarantee.

🚀

Most tokens come from your cheap GPU, every token validated by the big one

With same-family models and greedy decoding (Llama 3.1 8B → Llama 3.3 70B), most draft tokens match the target's argmax and ship straight through. The target validates every position, so output under greedy decoding is mathematically identical to running the target alone. Acceptance benchmarks are being re-measured under v0.5.2's per-position verification.

That's it. Change one URL. Get up to 2-3x faster responses. Same quality.

Set It Up in 20 Minutes →

Six ways to stop wasting money

Pick your poison. Stack them. Run all six. Tightwad doesn't judge — it just saves you cash.

👑 THE KILLER FEATURE
01

Combined Mode — Speculation Over a Pool

When a model doesn't fit on any single machine, pool your GPUs AND speculate on top. RPC pooling alone is slow (one network round-trip per token). Speculation amortizes that — 32 tokens verified per round-trip instead of 1. Result: models that fit nowhere become usable.

  [Junk Hardware - P400 2GB, GTX 770, laptop CPU]
        | runs 1.7B draft, ~30 tok/s
        | sends token IDs (bytes)
        v
  [Tightwad Proxy :8088]
        | sends draft to pool for BATCH verify
        v
  [RPC GPU Pool - 4 GPUs, 52GB total, WiFi]
        | verifies 32 tokens in ONE forward pass
        v
  4.1 tok/s instead of 2.2 tok/s - 70B fits nowhere else
  • 1.86× measured speedup on Llama 3.3 70B (4 GPUs over WiFi)*
  • Output mathematically identical to target alone under greedy decoding (Leviathan guarantee)
  • Any junk hardware can be the drafter — 2GB GPU, CPU, laptop
  • Pool CUDA + ROCm + Metal GPUs, speculate on top

* Measured: Llama 3.1 8B draft → Llama 3.3 70B target across RTX 4070 Ti Super + RTX 3060 + RTX 2070 + M2 Metal = 52GB VRAM over WiFi. 519 tokens in 127s vs 512 tokens in 231s direct. Your results will vary with hardware and network conditions.

🧠 APPROXIMATE · SKIP THE GPU
03

Multi-Drafter Consensus approximate mode

Race multiple cheap machines simultaneously. Each drafter generates candidate tokens in parallel. When they all agree, the expensive GPU verification is skipped entirely, and that’s the speed win. Tradeoff: this is an approximate mode, not exact speculative decoding. Skipping the target means consensus-accepted tokens may differ from what the target alone would produce (typically 3–6% divergence). Three sub-modes: strict, majority, any_disagree. Pick your acceptable risk. Off by default.

  [Tightwad Proxy :8088]
        |
        | races all drafters in parallel
        |
   +----+----+----+
   v    v    v    v
 [M2] [CPU] [2070] [P400]
  8B    8B    8B     1.7B
   |    |    |      |
   +----+----+----+-+
        |
        v
  Consensus? ──yes──> Stream tokens (GPU never touched)
        |
       no
        |
        v
  [Target 70B GPU] ──> Verify only disagreed tokens
  • Race unlimited drafters — CPUs, old GPUs, laptops, anything
  • Unanimous tokens skip the target GPU entirely
  • Three modes: strict, majority, any_disagree
  • Tree-based speculation for branching draft paths
  • Prometheus metrics for consensus accept/fallback rates
04

RPC Cluster Mode

Got GPUs scattered across machines? Pool them. CUDA on one box, ROCm on another — Tightwad doesn't care. It distributes model layers across all of them and hands you one clean OpenAI-compatible endpoint.

  [OpenAI Client]
        |
        v
+-------------------+
|  Tightwad         |  <-- One endpoint to rule them all
|  Coordinator :8080|
+--------+----------+
         |  distributes layers
    +----+----+
    v         v
+--------+ +--------+
| Worker | | Worker |
| NVIDIA | |  AMD   |
| 4070Ti | | 7900XTX|
| 16 GB  | | 24 GB  |
+--------+ +--------+
  70B model: covered ✓
  • Mix NVIDIA + AMD GPUs freely
  • Run 70B+ models on consumer hardware
  • Hot-swap models without restarting workers
  • Built-in benchmarking CLI
🛡 BATCH VERIFICATION
05

Quality Gate — CPU Fleet Drafts, GPU Reviews

Different from token-level speculation — this operates at the full-response level. A fleet of cheap machines (CPUs, small GPUs) generate complete responses using small models. One powerful GPU reviews each output, approving, correcting, or rejecting. 60–80% pass unchanged — the GPU only sweats the hard 20–40%.

  [Client Request]
        |
        v
  [Tightwad Gate :8088]
        |
        | fan-out to CPU fleet
   +----+----+----+
   v    v    v    v
 [CPU] [CPU] [CPU] [CPU]
 each generates full response with 8B model
   +----+----+----+
        |
        v
  [GPU Target - 70B]
  Reviews each response:
    ✓ 60-80% approved (pass through)
    ✎ 15-25% corrected (light edit)
    ✗ 5-10% rejected (regenerate)
  • GPU only processes the hard 20–40% of requests
  • Any CPU or cheap GPU can be a drafter
  • Full-response verification, not token-by-token
  • Automatic approve/correct/reject pipeline
  • tightwad gate start — one command to run
🌐 P2P DISTRIBUTION
06

Swarm Transfer — P2P Model Distribution

Models are huge. Downloading 70B from HuggingFace takes hours. Pull from every machine that already has it. Chunked transfer with piece verification — like torrents, but for GGUF files. New machine joins the cluster? It downloads the model from all your existing machines in parallel.

  [New Machine Joins Cluster]
        |
        | "I need Llama-3.3-70B-Q4_K_M.gguf"
        v
  +---------------------------+
  | Tightwad Swarm Discovery  |
  |                           |
  |  Piece 1 <--- Machine A  |  (4070 Ti — has full model)
  |  Piece 2 <--- Machine B  |  (RTX 2070 — has full model)
  |  Piece 3 <--- Machine C  |  (M2 Metal — has pieces 1-6)
  |  Piece 4 <--- Machine A  |  (parallel, rarest-first)
  |  ...                      |
  +---------------------------+
        |
        v
  SHA256-verified • ready to serve in minutes, not hours
  • Multi-source parallel download — pull from every peer simultaneously
  • SHA256 piece verification — every chunk validated before use
  • Rarest-first selection — ensures model availability across the cluster
  • Delta updates — new quantization? Only transfer the changed pieces
  • Zero central server — machines discover each other automatically
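
The piece verification and rarest-first ideas in the list above can be sketched in a few lines. This is an illustration only, not Tightwad's transfer code: the piece size, the manifest format (a list of hex digests), and all function names are assumptions made for the example.

```python
import hashlib
from typing import Dict, Iterable, Iterator, List, Set

PIECE_SIZE = 4 * 1024 * 1024  # hypothetical 4 MiB piece size


def split_pieces(data: bytes, piece_size: int = PIECE_SIZE) -> Iterator[bytes]:
    """Yield fixed-size pieces of a file's bytes (the last one may be shorter)."""
    for offset in range(0, len(data), piece_size):
        yield data[offset:offset + piece_size]


def build_manifest(data: bytes, piece_size: int = PIECE_SIZE) -> List[str]:
    """SHA256 hex digest for every piece, in order."""
    return [hashlib.sha256(p).hexdigest() for p in split_pieces(data, piece_size)]


def verify_piece(index: int, piece: bytes, manifest: List[str]) -> bool:
    """Accept a downloaded piece only if its digest matches the manifest."""
    return hashlib.sha256(piece).hexdigest() == manifest[index]


def rarest_first(missing: Iterable[int],
                 peer_pieces: Dict[str, Set[int]]) -> List[int]:
    """Order missing piece indices so the least-replicated pieces come first."""
    def availability(index: int) -> int:
        return sum(1 for pieces in peer_pieces.values() if index in pieces)

    return sorted(missing, key=lambda i: (availability(i), i))
```

A downloader built this way would request pieces in `rarest_first` order and discard any piece for which `verify_piece` returns False before asking another peer.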

Your junk drawer of compute, unified

YOUR HARDWARE (any mix works)
RTX 4070 Ti Super (16GB)
RTX 3060 (12GB)
RTX 2070 (8GB)
GTX 770 (2GB — why not)
RX 7900 XTX (24GB, AMD!)
Old Xeon (CPU only)
Laptop (M2, CPU draft)
CUDA ✓ ROCm ✓ CPU ✓ Mixed ✓
TIGHTWAD
One endpoint
Draft fast. Verify smart.
localhost:8088
OpenAI-compatible API
Without Tightwad: big model generates every token, one at a time  •  With Tightwad: all your hardware works together, big model only handles the hard tokens  •  Output quality: equivalent* • Speed: up to 2–3× faster*

The math behind the magic

🚀

Draft

Small model blazes through 32 candidate tokens at ~30+ tok/s. Fast and cheap.

🔍

Verify

Big model evaluates all 32 tokens in a single forward pass. Batch is basically free.

Accept

Keep every token both models agree on. Take the big model's token at the first disagreement.

📡

Stream

Accepted tokens stream to your app instantly. Repeat until done. Output quality is equivalent to the target model alone.
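
The four steps above can be sketched as one greedy verification round. This is a minimal illustration, not Tightwad's actual code: `draft_next` and `target_next` are hypothetical stand-ins for the two models, each mapping a token sequence to its greedy next token. A real proxy gets all target predictions from one batched forward pass; here the target is called once per position to keep the logic readable.

```python
from typing import Callable, List, Sequence

# Hypothetical model interface: token prefix -> greedy (argmax) next token.
NextToken = Callable[[Sequence[int]], int]


def speculative_round(prefix: List[int], draft_next: NextToken,
                      target_next: NextToken,
                      max_draft_tokens: int = 32) -> List[int]:
    """Run one draft/verify/accept round and return the tokens to stream."""
    # Draft: the small model proposes max_draft_tokens candidate tokens.
    draft: List[int] = []
    for _ in range(max_draft_tokens):
        draft.append(draft_next(prefix + draft))

    # Verify + accept: keep draft tokens for as long as the target agrees.
    accepted: List[int] = []
    for token in draft:
        expected = target_next(prefix + accepted)
        if token != expected:
            # First disagreement: take the target's token and end the round.
            accepted.append(expected)
            return accepted
        accepted.append(token)

    # Every draft token matched, so the target's next prediction comes along too.
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate(prompt: List[int], draft_next: NextToken, target_next: NextToken,
             max_new_tokens: int, max_draft_tokens: int = 32) -> List[int]:
    """Repeat rounds until max_new_tokens have been produced."""
    out: List[int] = []
    while len(out) < max_new_tokens:
        out.extend(speculative_round(prompt + out, draft_next, target_next,
                                     max_draft_tokens))
    return out[:max_new_tokens]
```

Every token the round returns is one the target would have chosen at that position, which is why greedy output matches the target alone. The extra token after a fully accepted draft is one way a 32-token draft can yield 33 tokens in a round.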

MoE models, finally first-class

Mixture-of-Experts models are the new normal — MiniMax M2.5 (229B), GPT-OSS 120B, Qwen3-MoE, DeepSeek-MoE. Everyone’s shipping them. Nobody supports them properly on home hardware. Until now.

🧩 THE UNLOCK
01

GGUF Defusion — the feature that makes the rest possible

Most MoE GGUFs ship fused: one giant tensor per layer covering every expert. llama.cpp can’t per-expert-split a fused tensor — so every MoE optimization tool on the planet silently degrades to whole-layer placement. Tightwad is the only one that rewrites fused weights into indexed form so you can actually pin experts.

  BEFORE (fused)                          AFTER (indexed)
  blk.0.ffn_gate_exps.weight  -->  blk.0.ffn_gate.0.weight
                                    blk.0.ffn_gate.1.weight
                                    blk.0.ffn_gate.2.weight
                                    ...
                                    blk.0.ffn_gate.127.weight

  Same bytes. Same quantization. One pass of disk I/O.
  • tightwad moe defuse fused.gguf indexed.gguf
  • Identical weights, identical output, identical quality
  • Works on GPT-OSS 120B, Qwen3-MoE, DeepSeek-MoE, Mixtral
  • Round-trip verified in the test suite — same tokens out
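
The renaming shown in the diagram above can be expressed as a small name-mapping helper. This sketch covers tensor names only; it does not read or rewrite a GGUF file. The function name is hypothetical, and applying the same pattern to `ffn_up_exps` and `ffn_down_exps` is an assumption by analogy with the `ffn_gate_exps` example.

```python
import re
from typing import List

# Fused name pattern from the diagram: blk.<layer>.ffn_gate_exps.weight
# (up/down variants are assumed to follow the same scheme).
_FUSED = re.compile(r"^(blk\.\d+\.ffn_(?:gate|up|down))_exps\.weight$")


def indexed_names(fused_name: str, n_experts: int) -> List[str]:
    """Map one fused expert tensor name to its per-expert indexed names."""
    match = _FUSED.match(fused_name)
    if match is None:
        raise ValueError(f"not a fused expert tensor: {fused_name!r}")
    stem = match.group(1)
    return [f"{stem}.{i}.weight" for i in range(n_experts)]
```

For a 128-expert layer, `indexed_names("blk.0.ffn_gate_exps.weight", 128)` runs from `blk.0.ffn_gate.0.weight` to `blk.0.ffn_gate.127.weight`, matching the AFTER column.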
🔥 PROFILE-GUIDED
03

Hot Experts On Your Fastest Card

MoE routing isn’t uniform. A handful of experts fire 10× more than the tail. Capture a profile from your real traffic, and Tightwad pins the hot ones to your fastest GPU. Cold experts land on slow nodes. Same total VRAM, dramatically different throughput.

  $ tightwad moe profile --follow-coord --duration 300
  $ tightwad moe summary ~/.tightwad/profile.json
     Top 20 hot experts
     Layer  Expert  Hits
     ─────  ──────  ────
        12       7  4,231   ◀ 10× more than average
        12      88  3,109
         0      19  2,874
        ...
  $ # update moe_placement: profile-guided & restart
  • Weight = bytes × (1 + 3× hit-frequency)
  • Top-K hot experts pinned to highest-scoring device
  • Auto-measured device scores (TCP-RTT, 24h cache)
  • Experimental — needs instrumented llama.cpp build
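
The weight formula in the list above is easy to try out by hand. The sketch below assumes that hit-frequency means an expert's share of all recorded hits (a value between 0 and 1); the page does not define it, so treat that normalization and the function names as assumptions.

```python
from typing import Dict, List, Tuple

ExpertKey = Tuple[int, int]  # (layer, expert)


def hit_frequencies(hits: Dict[ExpertKey, int]) -> Dict[ExpertKey, float]:
    """Assumed normalization: each expert's hits divided by total hits."""
    total = sum(hits.values())
    return {key: (count / total if total else 0.0) for key, count in hits.items()}


def placement_weight(size_bytes: int, hit_frequency: float) -> float:
    """Weight = bytes x (1 + 3 x hit-frequency), as stated in the list above."""
    return size_bytes * (1.0 + 3.0 * hit_frequency)


def top_k_hot(hits: Dict[ExpertKey, int], k: int) -> List[ExpertKey]:
    """The k most-hit experts, i.e. the candidates for the fastest device."""
    return sorted(hits, key=lambda key: (-hits[key], key))[:k]


# Hit counts copied from the profile summary above.
profile = {(12, 7): 4231, (12, 88): 3109, (0, 19): 2874}
hot = top_k_hot(profile, k=2)
```

With these three experts, `hot` holds layer 12 expert 7 and layer 12 expert 88, and an expert that is never hit keeps a weight equal to its plain byte size.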
04

MoE + Speculation = The Whole Stack

Pair expert-aware placement with speculative decoding and you get sparse weights + sparse compute. The draft model (small, same-family) predicts 32 tokens. The pooled MoE target verifies in one batch. Only the activated experts fire. Only the disagreed tokens get regenerated. It’s the whole stack of optimizations working together.

  [Your App]
      |
      v
  [Tightwad Proxy :8088]
      |
      +--> Draft: Qwen3-1.7B (local, ~100 tok/s)
      |
      +--> Target: MiniMax M2.5 (229B MoE, LM Studio)
                      |
                      | expert-aware placement keeps hot experts
                      | on Mac Studio M3 Ultra unified memory
                      |
                      v
                Batch verify 32 tokens → activate 8 of 256 experts
  • Works with Combined Mode (RPC pool) out of the box
  • tightwad moe bench streams live TTFT + acceptance table
  • OpenAI-compatible targets (LM Studio, vLLM, llama-server)
  • MiniMax M2.5 baseline: 26–48 tok/s direct on Mac Studio M3 Ultra

Benchmarks that hit different

Real hardware. Real numbers. No cherry-picking. Logprobs-based batch verification is live, so these acceptance rates translate directly to wall-clock speedup.

Llama 3.3 70B · 4-GPU RPC pool (52GB VRAM over WiFi) · Llama 3.1 8B draft on M4 Metal
Mode | Tokens | Time | Speed
RPC pool direct (autoregressive) | 512 | 231s | 2.2 tok/s
RPC pool + speculation | 519 | 127s | 4.1 tok/s
⚡ Speedup (greedy-equivalent output · 33 tokens/round) | | | 1.86×

The 70B model doesn't fit on any single machine. It's distributed across 4 GPUs (RTX 4070 Ti Super + RTX 3060 + RTX 2070 + M2 Metal) over WiFi. Without speculation: painfully slow. With speculation: usable.
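
For anyone checking the arithmetic, the throughput and speedup figures follow from the token counts and times in the table. The helper names below are made up for this example.

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """Average generation rate over a run."""
    return tokens / seconds


def speedup(baseline_tps: float, speculative_tps: float) -> float:
    """How many times faster the speculative run is than the baseline."""
    return speculative_tps / baseline_tps


direct = tokens_per_second(512, 231)       # RPC pool, autoregressive
speculative = tokens_per_second(519, 127)  # RPC pool + speculation

print(f"direct:      {direct:.1f} tok/s")
print(f"speculative: {speculative:.1f} tok/s")
print(f"speedup from rounded rates: {speedup(2.2, 4.1):.2f}x")
print(f"speedup from raw counts:    {speedup(direct, speculative):.2f}x")
```

The rates round to 2.2 and 4.1 tok/s. Dividing the rounded rates gives the quoted 1.86×, while dividing the unrounded rates gives roughly 1.84×, so the two agree to within rounding.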

👑

The killer result

A 70B model across 4 consumer GPUs over WiFi: from 2.2 to 4.1 tok/s. No single machine could run this model. Speculation makes it usable.

Greedy-equivalent output

Same-family drafting (Llama 3.1 8B → Llama 3.3 70B) under v0.5.2's per-position verification: output is mathematically identical to running the 70B alone (Leviathan greedy guarantee). Acceptance numbers are being re-measured under the corrected verifier.

⚠️

Family matters

Llama 3.2 3B → Llama 3.3 70B got only 1.6% acceptance despite sharing a tokenizer. Architecture match is critical: Llama 3.1 8B is the correct drafter.

Qwen3-32B · 4-GPU RPC pool · Qwen3-1.7B draft on M4 CPU
Mode | Speed | Notes
Desktop local only (4070+3060, 32B) | 17.0 tok/s | Best case: fits on one machine
4-GPU RPC pool (autoregressive) | 3.0 tok/s | Each token = full RPC round-trip
RPC pool + speculation | 5.4 tok/s | 32 tokens verified per batch (greedy decoding, output equivalent to target alone)
⚡ Pool speedup | 1.8× | over pool-only (3.0 → 5.4 tok/s)

RPC pooling alone is slow over WiFi (one network round-trip per token). Speculation amortizes that: 32 tokens per round-trip instead of 1. Don't pool when the model fits locally (17 tok/s local vs 5.4 tok/s pooled).

When to use combined mode

Only when the model doesn't fit on one machine. If it fits locally (17 tok/s), don't pool. Just use speculation with a remote drafter.

💡

Why it works

Pool autoregressive: 1 token per network round-trip = slow. Pool + speculation: 32 tokens per round-trip = 1.8× faster. The draft model amortizes network overhead.

Qwen3-8B (RTX 2070) → Qwen3-32B (RTX 4070 Ti Super) · 130 prompts
Prompt Type | Acceptance Rate | Rounds | Verdict
🧮 Reasoning | 89% | 32 | Math is deterministic. Love it.
💻 Code | 76% | 34 | Syntax is law. Both models agree.
📚 Factual | 73% | 18 | Strong agreement on facts.
📋 List | 42% | 40 | Phrasing varies. Still worthwhile.
🎨 Creative | 39% | 6 | Many valid outputs. Expected.
⚡ Average | 63.8% | 26 | 64% of tokens = free.
💸

What 64% means

Nearly two-thirds of your tokens come from the cheap GPU. The expensive model only works on the hard parts.

🎯

Output quality

Equivalent to running the big model alone. With greedy decoding, mathematically identical; with other sampling, statistically equivalent.

Logprobs: live

Logprobs-based batch verification is implemented. These acceptance rates are real wall-clock speedup, not just acceptance stats.

Qwen3-8B (local GPU) → Qwen3.5-397B (API) · logprobs + whitespace normalization
Prompt Type | Acceptance Rate | Notes
🧮 Reasoning | 88% | Highest: deterministic math
⚡ Average (normalized) | 80% | Key result: 4 in 5 tokens local.
🏆

80% acceptance: Qwen3-8B → Qwen3.5-397B

With whitespace normalization, a consumer GPU running an 8B model drafts 4 out of every 5 tokens for a 397B model. That means up to 80% fewer output tokens billed to the cloud API for the same quality output. The bigger the gap between draft and target quality, the more you save.

🏆

Notable result

Up to 80% acceptance on a 397B model (same-family models). Your gaming PC is doing up to 80% of the work that would otherwise cost API money.

💰

API cost math

At $0.60/M output tokens (Qwen3.5-397B), 80% acceptance means you pay for roughly 20% of output tokens via the API, up to a 5× reduction in output token costs. Input/prompt tokens are still processed by the API. Local GPU electricity and hardware costs not included.
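
The cost claim above works out as follows. This sketch uses the same simplification as the paragraph, namely that accepted draft tokens are never billed as API output; the 10M-token monthly volume and the function name are placeholders for illustration.

```python
def api_output_cost(output_tokens: int, price_per_million: float,
                    acceptance_rate: float = 0.0) -> float:
    """Dollar cost of the output tokens billed by the API.

    Assumes accepted draft tokens are not billed (simplification from the text).
    """
    billed_tokens = output_tokens * (1.0 - acceptance_rate)
    return billed_tokens / 1_000_000 * price_per_million


tokens = 10_000_000  # hypothetical monthly output volume
baseline = api_output_cost(tokens, 0.60)
with_draft = api_output_cost(tokens, 0.60, acceptance_rate=0.80)

print(f"baseline: ${baseline:.2f}  with drafting: ${with_draft:.2f}  "
      f"reduction: {baseline / with_draft:.1f}x")
```

At 80% acceptance the billed output drops from $6.00 to about $1.20 for this volume, which is the 5× figure. Lower acceptance rates shrink the saving proportionally.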

📝

Same-family is key

Qwen3-8B + Qwen3.5-397B are from the same model family. Cross-family (e.g. Llama → Qwen) drops to ~3%. Same family = high acceptance.

Wall-clock speedup · Qwen3-8B (RTX 2070) → Qwen3-32B (RTX 4070 Ti + RTX 3060) · llama-server · max_draft_tokens=32
Prompt | Baseline | Speculative | Speedup
Capital of France | 1.17s | 0.90s | 1.30x
Thermodynamics | 12.73s | 9.09s | 1.40x
Prime checker | 12.76s | 10.15s | 1.28x
Average speed | 13.24s | 10.95s | 1.21x
TCP vs UDP | 5.58s | 4.88s | 1.14x
Total | 45.43s | 35.96s | 1.27x

Set max_draft_tokens: auto and Tightwad finds the sweet spot for you. Or pin it at 32 for manual control.

⚡

Real wall-clock time

1.27x overall speedup measured end-to-end. Not theoretical: actual seconds off the clock per response.

🎛️

Tune for your setup

Cross-machine HTTP overhead is the enemy. Set max_draft_tokens: auto to let Tightwad optimize round trips for you, or pin at 32 for manual control.

How much are you leaving on the table?

See your monthly cloud inference waste. Then stop doing that.

Example: 10M tokens per month at $15 per million tokens
😭 Without Tightwad: $150/mo
🐷 With Tightwad: $63/mo
You save $87/mo

* Estimated savings assume the selected acceptance rate is sustained. Default uses 60% acceptance rate typical of local GPU setups. Rates vary by model pair and prompt type (58-64% local, up to 80% same-family API). Savings do not account for local electricity, hardware costs, or maintenance. Your results will vary.

Pick your archetype

Tightwad was built by and for people who think paying $200/month for inference is genuinely offensive.

🏠

The Homelab Hoarder

You have a 2070 in the old desktop, a 4070 in the main rig, and a Quadro doing nothing in the server rack. They're all lazy freeloaders. Tightwad makes them a team.

RPC Cluster Mode
  • Pool all your random GPUs into one endpoint
  • Run 70B models across consumer hardware
  • Zero wasted VRAM, zero cloud spend
🏗️

The Budget Builder

You want 70B model quality on a $600 GPU budget. That's not a dream — that's just math. RPC mode distributes layers across whatever you've got collecting dust.

RPC Cluster Mode
  • Llama 3.3 70B on 4× consumer GPUs
  • No enterprise hardware required
  • Benchmark built-in to tune your setup

The Mixed Vendor Maverick

You bought AMD because it was on sale and NVIDIA for the CUDA ecosystem. Now they won't cooperate. Tightwad makes CUDA and ROCm work the same model together. Finally.

RPC Cluster Mode
  • CUDA + ROCm on the same model
  • llama.cpp RPC handles the hard parts
  • Coordinator distributes layers intelligently

Pick your compute configuration

Tightwad works however your hardware is set up. Consumer GPU, no GPU, cloud API: there's a config for you.

GPU → API

Slash Your API Bills

Even a GTX 1060 can draft for GPT-4. Any GPU you have (old, cheap, low VRAM) reduces your API bill. Up to 80% fewer output tokens billed to the cloud API.

🖥️ Your PC (GTX 1060 / RTX 2070 / any GPU · 8B draft) → ☁️ Cloud API (Qwen3.5-397B · pay per token)
80% acceptance · up to 5x output token cost reduction · any CUDA/ROCm GPU

CPU → GPU

Zero GPU Required to Participate

Run a tiny draft model on any CPU. Verify on a remote GPU server. Your CPU-only server, your laptop, your NAS: all can contribute to the cluster.

💻 Any machine (CPU only · Qwen3-1.7B draft · even a laptop) → 🖥️ GPU Server (any GPU · any big target model)
~68% acceptance · zero GPU required to participate

ENTERPRISE PLAY
CPU → API

Literally Any Computer

Data centers often run at 10–30% average utilization (industry estimates). Idle CPUs, stranded servers, that old Xeon doing nothing: put them to work drafting tokens. No GPU required, ever.

🏢 Any idle machine (Old Xeon · 32-core server · spare laptop · Qwen3-1.7B) → ☁️ Cloud API (397B model · only for hard tokens)
Stranded compute → inference revenue

Legacy GPU revival: that GTX 770 from 2013 can run Qwen3-1.7B as a drafter for a 70B target, turning e-waste into productive infrastructure.

Config | Draft | Target | Use Case | Acceptance
GPU → GPU | Any GPU (old, new, NVIDIA, AMD) | Any bigger GPU | Homelab, mixed hardware | ~64%
GPU → API | Any GPU (even GTX 1060) | Cloud API | Slash API bills | ~80%
CPU → GPU | Any CPU, no GPU needed | GPU server | Zero-GPU participants | ~68%
CPU → API | Literally any computer | Cloud API | Data centers, enterprise | ~68%

Homelab Setup in 30 Minutes

Four machines. One 70B model. Start with two, add machines anytime. The cluster grows.

⚡ Draft Brain
💻 MacBook Air M4
Llama 3.1 8B · Apple Silicon
Tightwad proxy :8088
Proposes 32 tokens/batch at ~60 tok/s locally

propose → (WiFi) ← verify

🖥️ GPU Pool: Target Model
Llama 3.3 70B · 52GB VRAM distributed
🖥️ Desktop: RTX 4070 Ti Super + RTX 3060 · 28 GB VRAM
🖥️ Gaming PC: RTX 2070 · 8 GB VRAM
💻 MacBook Air M2: Apple Metal · 16 GB unified
3 machines · 52 GB total · rpc-server :50052

1.86× speedup
4.1 tok/s (was 2.2 tok/s)
= output identical to 70B alone
$0 cloud spend
1
On Machines A, B, C: Start RPC workers
Pool Workers
bash (on each pool machine)
# Machine A: Desktop (4070 Ti + 3060, 28GB)
$ rpc-server -p 50052
# Machine B: Old Gaming PC (RTX 2070, 8GB)
$ rpc-server -p 50052
# Machine C: MacBook Air M2 (Metal, 16GB)
$ rpc-server -p 50052
2
On Machine D: Start the draft model
Machine D
bash
# Machine D: MacBook Air M4 (runs draft + proxy)
$ ollama run llama3.1:8b
# Confirm:
$ ollama ps
 llama3.1:8b  running
3
Install Tightwad (either machine)
Either
bash
$ python3 -m venv .venv && source .venv/bin/activate
$ pip install tightwad
4
Edit configs/cluster.yaml
Either
configs/cluster.yaml: combined mode (pool + speculation)
proxy:
  host: 0.0.0.0
  port: 8088
  max_draft_tokens: auto            # auto-tunes based on acceptance rate
  mode: combined              # speculation over pooled GPUs
  draft:
    url: http://localhost:11434   # Machine D (M4, local draft)
    model_name: llama3.1:8b
    backend: ollama

coordinator:
  host: 0.0.0.0
  port: 8080
  model: Llama-3.3-70B-Q4_K_M.gguf

workers:
  - host: 192.168.1.10        # Machine A (4070 Ti + 3060)
    rpc_port: 50052
  - host: 192.168.1.20        # Machine B (RTX 2070)
    rpc_port: 50052
  - host: 192.168.1.30        # Machine C (M2 Metal)
    rpc_port: 50052

Find your IPs: ip addr on Linux, ipconfig on Windows, ifconfig on macOS. Add more workers anytime and the cluster grows.

5
Start the proxy
Either
bash
$ tightwad proxy start
 Draft model healthy  (llama3.1:8b @ localhost:11434) - Machine D
 Pool: 3 workers online (52GB VRAM total) - A + B + C
 Target: Llama-3.3-70B distributed across pool
 Proxy listening on http://localhost:8088
6
Test it
Either
bash
$ curl http://localhost:8088/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "What is 17 * 24?"}], "max_tokens": 100}'

# Check acceptance rate
$ tightwad proxy status
 Acceptance rate: ~58% | Rounds: N | Tokens saved: N
8
Point your chat app at it
Done ✓

In Open WebUI, ChatBot UI, or any OpenAI-compatible app, change the base URL from:

http://192.168.1.10:11434 → http://192.168.1.10:8088

That's it. Four machines, one endpoint. Same app. Same model name. Same output quality. Machines A, B, and C pool a 70B model that fits on no single machine. Machine D drafts and proxies. You just see 4.1 tok/s instead of 2.2.
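
If you would rather call the proxy from a script than from a chat app, the curl test from step 6 translates directly to Python's standard library. This is a sketch: it assumes the proxy is reachable at `http://localhost:8088` and returns the usual OpenAI chat-completions response shape; `build_chat_request` and `ask` are names made up for this example.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8088"  # proxy address used in the steps above


def build_chat_request(prompt: str, max_tokens: int = 100,
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the same POST request as the curl example in step 6."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, max_tokens: int = 100, base_url: str = BASE_URL) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, max_tokens, base_url)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    # Assumed: standard OpenAI chat-completions response layout.
    return body["choices"][0]["message"]["content"]
```

With the proxy running, `print(ask("What is 17 * 24?"))` should print the model's answer. Pointing `base_url` at another machine's address works the same way.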

What to expect with this setup

Same-family models (Llama 3.1 8B → Llama 3.3 70B) with greedy decoding:

Metric | Result
⚡ Output equivalence (greedy) | =
🚀 Speedup | 1.86×
💬 Tokens per round | 33
⏱️ Speed (pool only) | 2.2 tok/s
⏱️ Speed (pool + speculation) | 4.1 tok/s

Quick Start

No Docker Compose files with 300 environment variables. No Kubernetes YAML. Just Python and one config file.

1

Install

bash
$ pip install tightwad
# or from source:
$ git clone https://github.com/youngharold/tightwad.git
$ cd tightwad && pip install .
2

Configure your hardware

configs/cluster.yaml
proxy:
  host: 0.0.0.0
  port: 8088
  max_draft_tokens: auto          # auto-tunes based on acceptance rate
  fallback_on_draft_failure: true
  draft:
    url: http://192.168.1.50:11434  # Your cheap GPU (Ollama)
    model_name: qwen3:8b
    backend: ollama
  target:
    url: http://192.168.1.100:11434   # Your big GPU (Ollama)
    model_name: qwen3:32b
    backend: ollama
3

Start it & test

bash
$ tightwad proxy start
 Draft model healthy
 Target model healthy
 Proxy listening on http://localhost:8088

# Test it (drop-in for any OpenAI SDK call)
$ curl http://localhost:8088/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 50}'

# Check acceptance rate stats
$ tightwad proxy status
 Acceptance rate: 73.2% | Rounds: 34 | Tokens saved: 14,891
1

Build RPC workers (CUDA — Windows/Linux)

bash (worker machine)
# Or use scripts/install-worker.sh
$ cmake -B build -DGGML_CUDA=ON -DGGML_RPC=ON
$ cmake --build build --config Release
$ build/bin/rpc-server -p 50052  # GPU 0
2

Configure cluster topology

configs/cluster.yaml
coordinator:
  host: 0.0.0.0
  port: 8080
  backend: hip  # or cuda
  gpus:
    - name: "7900 XTX #0"
      vram_gb: 24

workers:
  - host: 192.168.1.100  # NVIDIA box
    gpus:
      - name: "RTX 4070 Ti Super"
        vram_gb: 16
    rpc_port: 50052

models:
  llama-3.3-70b:
    path: /models/Llama-3.3-70B-Q4_K_M.gguf
    ctx_size: 8192
    flash_attn: true
    default: true
3

Start the cluster

bash
$ tightwad start
 Coordinator started
 Worker @ 192.168.1.100:50052 online
 Model llama-3.3-70b loaded across 52 GB VRAM

# Hot-swap to a different model anytime
$ tightwad swap deepseek-r1-70b

# Run the benchmark
$ tightwad benchmark

Everything you need, nothing you don't

Built for terminal people who hate bloat as much as they hate cloud bills.

🔁

OpenAI Compatible

Drop-in replacement for any OpenAI SDK. Change one URL. That's it. No code changes required.

🔄

Hot-Swap Models

tightwad swap model-name: swap the model while workers keep running. Zero downtime.

📡

SSE Streaming

Full streaming support on all endpoints. Tokens flow as they're accepted. No buffering.

⌨️

CLI-First

tightwad start, tightwad proxy start, tightwad status. Simple commands for complex infrastructure.

📄

YAML Config

One file describes your entire hardware topology. Version control it. Share it. Ship it.

📊

A/B Benchmark

tightwad bench: proxy vs direct target comparison. See your exact speedup, tok/s, and per-prompt breakdown.

🧪

Dual Backends

Ollama for quick setup. llama.cpp for maximum performance. Switch per-model in the config.

🔒

Fallback Safety

Draft server down? fallback_on_draft_failure: true routes straight to target. Never breaks.

💻

Mixed Vendor

NVIDIA CUDA + AMD ROCm on the same model, same cluster, same endpoint. No compromises.

🧬

Family Validation

Auto-detects model architecture families. Warns you before a mismatched draft/target pair wastes hours at 1.6% acceptance.

💬

Auto Chat Templates

Detects Llama 3, Mistral, Gemma, Phi, and more. No more hardcoded Qwen3 template breaking other model families.

🎯

Auto-Tune

max_draft_tokens: auto adjusts the draft length at runtime based on acceptance rate. Zero-config optimization.

🤝

Consensus Verification

Multiple drafters vote on tokens. When they agree, skip the target entirely. Three modes: strict, majority, any_disagree.

🌐

Peer Agent

Cross-platform cluster management without SSH. REST API on every node for version checks, GPU info, and remote control.

🛡️

Safety Checks

Version enforcement, MoE VRAM warnings, SSRF protection, bearer token auth. Production-ready out of the box.

🧩

GGUF Defusion

tightwad moe defuse rewrites fused expert tensors to indexed form so per-expert placement actually works. No quantization change, identical weights.

🎯

Expert-Aware Placement

moe_placement: balanced pins whole experts to individual GPUs via llama.cpp --override-tensor flags. No more half-experts on the wire.

🔥

Profile-Guided Placement

Capture per-expert hit counts, pin the hot ones to your fastest card. tightwad moe profile + moe_placement: profile-guided.

📈

MoE Benchmark

tightwad moe bench — streams live TTFT, rolling acceptance, speedup. Works with LM Studio, vLLM, any OpenAI-compatible target.

Why not just use vLLM?

Fair question. Here's the honest answer. The other tools are good. Tightwad is for a different problem.

Comparison accurate as of March 2026. These tools evolve quickly, so check their docs for the latest capabilities.

vs

vLLM

Excellent production inference engine. Primarily CUDA-focused. Built for ML teams.

  • Primarily CUDA-focused. ROCm support is experimental/limited. Tightwad treats CUDA and ROCm as first-class citizens in the same cluster.
  • Can't mix GPU generations. vLLM can't pool a GTX 770 with a 4070 Ti. Tightwad doesn't care what generation or vendor your hardware is from.
  • Speculative decoding, but single-machine only. Tightwad does it across your network: draft on one box, verify on another.
  • No CPU nodes. Can't add a CPU-only machine to a vLLM cluster. Tightwad: CPU drafting is fully supported.
  • Use vLLM if: you have a single powerful CUDA machine and need production-grade throughput.
vs

Ollama

The reason most people have local models. One model, one machine, beautifully simple.

  • One model, one machine. When you outgrow a single GPU, Ollama can't pool across machines. Your RTX 2070 and RTX 4070 are completely isolated from each other.
  • Can't combine machines at all. Ollama has no concept of cross-machine inference. Your hardware can't cooperate.
  • Tightwad works with Ollama. Keep Ollama on each machine. Tightwad just coordinates between them.
  • Use Ollama for getting started. Use Tightwad when you have a second machine and want them to work together.
vs

llama.cpp RPC

The low-level primitive Tightwad is built on. Powerful. Requires a lot of scripting.

  • Tightwad is built on llama.cpp RPC. We add the orchestration, YAML config, CLI, and speculative proxy on top.
  • RPC ships 100–300 MB of tensor data per network step. Tightwad's speculative proxy ships token IDs: bytes, not megabytes.
  • Use raw RPC if you want maximum control. Use Tightwad if you want it to just work.
vs

TGI (HuggingFace)

Production inference for the HuggingFace ecosystem. Great if you're already there.

  • Optimized for the HuggingFace ecosystem. Designed to work best with HuggingFace's model hub and services.
  • Tightwad is vendor-neutral. Works with your existing Ollama or llama.cpp setup. No accounts required.
  • Use TGI if you're in the HuggingFace ecosystem. Use Tightwad if you want backend-agnostic, no-strings-attached inference.

The honest summary

Single powerful CUDA machine, production workloads → Use vLLM
One machine, just want to run models → Use Ollama
Two or more machines (mixed GPUs, old & new, NVIDIA & AMD, or CPU-only) that you want working together → 🐷 Use Tightwad