v0.1.0 — open source — MIT license

Tightwad pools your mixed CUDA + ROCm GPUs into a single OpenAI-compatible endpoint.
Speculative decoding proxy: draft fast, verify smart, stream everything.
Same output quality. 2‑3× faster. Zero cloud bill.

2-3× faster inference
58% avg token acceptance
$0 cloud bill
72B models on consumer hardware
terminal
$ pip install tightwad
$ tightwad proxy start
 Draft:  Qwen3-8B  @ 192.168.1.50:11434 (RTX 2070 — the dusty one)
 Target: Qwen3-32B @ 192.168.1.100:11434 (RTX 4070 Ti Super — the good one)
 Proxy listening on http://localhost:8088
 Acceptance rate: 73.2% | Tokens saved this session: 14,891

Two ways to stop wasting money

Pick your poison. Or run both. Tightwad doesn't judge — it just saves you cash.

01

RPC Cluster Mode

Got GPUs scattered across machines? Pool them. CUDA on one box, ROCm on another — Tightwad doesn't care. It distributes model layers across all of them and hands you one clean OpenAI-compatible endpoint.

  [OpenAI Client]
        |
        v
+-------------------+
|  Tightwad         |  <-- One endpoint to rule them all
|  Coordinator :8080|
+--------+----------+
         |  distributes layers
    +----+----+
    v         v
+--------+ +--------+
| Worker | | Worker |
| NVIDIA | |  AMD   |
| 4070Ti | | 7900XTX|
| 16 GB  | | 24 GB  |
+--------+ +--------+
  72B model: covered ✓
  • Mix NVIDIA + AMD GPUs freely
  • Run 72B+ models on consumer hardware
  • Hot-swap models without restarting workers
  • Built-in benchmarking CLI
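
From the client's side, all of that hardware is one HTTP endpoint. Here's a minimal sketch using Python's requests library; the /v1/chat/completions route and the "model" field are assumptions carried over from the proxy examples, and qwen3-72b is the model name from the cluster config in Quick Start:

python
import requests

# Hypothetical call to the coordinator on :8080. The route, model name, and
# response shape are assumed to mirror the OpenAI-compatible proxy examples.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "qwen3-72b",
        "messages": [{"role": "user", "content": "Hello from one endpoint"}],
        "max_tokens": 50,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])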

The math behind the magic

🚀

Draft

Small model blazes through 8 candidate tokens at ~100+ tok/s. Fast and cheap.

🔍

Verify

Big model evaluates all 8 tokens in a single forward pass. Batch is basically free.

✅

Accept

Keep every token both models agree on. Take the big model's token at the first disagreement.

📡

Stream

Accepted tokens stream to your app instantly. Repeat until done. Output is mathematically identical.
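
In code, the loop is roughly this. It's an illustrative sketch only: draft_generate and target_verify are hypothetical stand-ins for the draft and target backends, not Tightwad's internals.

python
# Illustrative greedy speculative decoding loop (hypothetical helpers, not Tightwad's API).
def speculative_decode(prompt, draft_generate, target_verify, k=8, max_tokens=256):
    out = []
    while len(out) < max_tokens:
        # 1. Draft: the small model proposes k candidate tokens.
        candidates = draft_generate(prompt, out, k)

        # 2. Verify: the big model scores all k positions in one forward pass
        #    and reports the token it would have picked at each one.
        target_choices = target_verify(prompt, out, candidates)

        # 3. Accept: keep tokens while both models agree; at the first
        #    disagreement, take the big model's token and stop this round.
        accepted = []
        for drafted, chosen in zip(candidates, target_choices):
            if drafted == chosen:
                accepted.append(drafted)
            else:
                accepted.append(chosen)
                break

        # 4. Stream: accepted tokens go to the client immediately, then repeat.
        out.extend(accepted)
        yield accepted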

Benchmarks that hit different

Tested with Qwen3-8B (RTX 2070) drafting for Qwen3-32B (RTX 4070 Ti Super) across 130 prompts. Real hardware. Real numbers. No cherry-picking.

Prompt Type    Acceptance Rate   Rounds   Verdict
🧮 Reasoning   88%               32       Math is deterministic. Love it.
💻 Code        73%               34       Syntax is law. Both models agree.
📚 Factual     52%               18       Decent. Facts don't lie.
📋 List        44%               40       Phrasing varies. Still worthwhile.
🎨 Creative    34%               6        Many valid outputs. Expected.
⚡ Average     58.3%             26       58% of tokens = free.
💸

What 58% means

More than half your tokens come from the cheap GPU. The expensive model only works on the hard parts. That's not a bug — that's the whole point.

🎯

Output quality

Provably identical to running the big model alone. The math guarantees it. You're not trading quality for speed — you're just being smarter about it.

🔜

Coming soon

Logprobs-based batch verification will convert these acceptance rates into real wall-clock 2-3× speedup. The acceptance rates are already there — the turbo button is being installed.
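
Where does 2-3× come from once that lands? The standard speculative-decoding estimate converts an acceptance rate into expected tokens per big-model forward pass. A back-of-envelope sketch, assuming each drafted token is accepted independently with the same probability (real prompts only approximate that):

python
def expected_tokens_per_verify(alpha: float, k: int = 8) -> float:
    """Expected tokens emitted per big-model forward pass, assuming each of
    the k drafted tokens is accepted independently with probability alpha:
    (1 - alpha**(k + 1)) / (1 - alpha)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

print(expected_tokens_per_verify(0.583))  # ~2.4 tokens/pass at the 58.3% average
print(expected_tokens_per_verify(0.88))   # ~5.7 tokens/pass at the 88% reasoning rate

Overhead (the draft model isn't free, and neither is the HTTP hop) pulls that number down, while high-acceptance workloads like reasoning and code pull it up, which is why the headline is a 2-3× range rather than a single figure.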

How much are you leaving on the table?

Run the numbers on your monthly cloud inference waste. Then stop doing that.

Example: 10M tokens/month at $15 per 1M tokens

😭 Without Tightwad: $150/mo
🐷 With Tightwad: $63/mo
You save $87/mo

* Based on 58.3% avg acceptance rate. Draft model runs on your local GPU (electricity cost = rounding error). Cloud API calls reduced by ~58%. Your mileage may vary, but it'll vary in your favor.
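
The footnote's arithmetic, spelled out. All inputs are the example's own numbers; the only assumption is that your cloud bill scales linearly with the tokens you stop sending to it:

python
tokens_per_month = 10_000_000    # 10M tokens/month (the example above)
price_per_million = 15.00        # $15 per 1M tokens
acceptance = 0.583               # average acceptance rate from the benchmarks

without = tokens_per_month / 1_000_000 * price_per_million
with_tightwad = without * (1 - acceptance)   # ~58% of tokens now come from the local draft GPU

print(f"${without:.0f}/mo -> ${with_tightwad:.0f}/mo, saving ${without - with_tightwad:.0f}/mo")
# $150/mo -> $63/mo, saving $87/mo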

Pick your archetype

Tightwad was built by and for people who think paying $200/month for inference is genuinely offensive.

🏠

The Homelab Hoarder

You have a 2070 in the old desktop, a 4070 in the main rig, and a Quadro doing nothing in the server rack. They're all lazy freeloaders. Tightwad makes them a team.

RPC Cluster Mode
  • Pool all your random GPUs into one endpoint
  • Run 72B models across consumer hardware
  • Zero wasted VRAM, zero cloud spend
🏗️

The Budget Builder

You want 72B model quality on a $600 GPU budget. That's not a dream — that's just math. RPC mode distributes layers across whatever you've got collecting dust.

RPC Cluster Mode
  • Qwen3-72B on 2× consumer GPUs
  • No enterprise hardware required
  • Benchmark built-in to tune your setup

🔀

The Mixed Vendor Maverick

You bought AMD because it was on sale and NVIDIA for the CUDA ecosystem. Now they won't cooperate. Tightwad makes CUDA and ROCm work on the same model together. Finally.

RPC Cluster Mode
  • CUDA + ROCm on the same model
  • llama.cpp RPC handles the hard parts
  • Coordinator distributes layers intelligently

Quick Start

No Docker Compose files with 300 environment variables. No Kubernetes YAML. Just Python and one config file.

1

Install

bash
$ git clone https://github.com/akivasolutions/tightwad.git
$ cd tightwad
$ python3 -m venv .venv && source .venv/bin/activate
$ pip install -e .
2

Configure your hardware

configs/cluster.yaml
proxy:
  host: 0.0.0.0
  port: 8088
  max_draft_tokens: 8
  fallback_on_draft_failure: true
  draft:
    url: http://192.168.1.50:11434  # Your cheap GPU (Ollama)
    model_name: qwen3:8b
    backend: ollama
  target:
    url: http://192.168.1.100:11434   # Your big GPU (Ollama)
    model_name: qwen3:32b
    backend: ollama
3

Start it & test

bash
$ tightwad proxy start
 Draft model healthy
 Target model healthy
 Proxy listening on http://localhost:8088

# Test it (drop-in for any OpenAI SDK call)
$ curl http://localhost:8088/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 50}'

# Check acceptance rate stats
$ tightwad proxy status
 Acceptance rate: 73.2% | Rounds: 34 | Tokens saved: 14,891
1

Build RPC workers (CUDA — Windows/Linux)

bash (worker machine)
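# Assumes a llama.cpp source tree: the RPC worker is llama.cpp's rpc-server.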
# Or use scripts/install-worker.sh
$ cmake -B build -DGGML_CUDA=ON -DGGML_RPC=ON
$ cmake --build build --config Release
$ build/bin/rpc-server -p 50052  # GPU 0
2

Configure cluster topology

configs/cluster.yaml
coordinator:
  host: 0.0.0.0
  port: 8080
  backend: hip  # or cuda
  gpus:
    - name: "7900 XTX #0"
      vram_gb: 24

workers:
  - host: 192.168.1.100  # NVIDIA box
    gpus:
      - name: "RTX 4070 Ti Super"
        vram_gb: 16
    rpc_port: 50052

models:
  qwen3-72b:
    path: /models/Qwen3-72B-Q4_K_M.gguf
    ctx_size: 8192
    flash_attn: true
    default: true
3

Start the cluster

bash
$ tightwad start
 Coordinator started
 Worker @ 192.168.1.100:50052 online
 Model qwen3-72b loaded across 40 GB VRAM

# Hot-swap to a different model anytime
$ tightwad swap deepseek-r1-70b

# Run the benchmark
$ tightwad benchmark

Everything you need, nothing you don't

Built for terminal people who hate bloat as much as they hate cloud bills.

🔁

OpenAI Compatible

Drop-in replacement for any OpenAI SDK. Change one URL. That's it. No code changes required.
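
For example, with the official openai Python package, the base_url is the only change. The api_key value is a placeholder the proxy is assumed to ignore, and the model name is the target from the proxy config (whether the proxy requires it is an assumption):

python
from openai import OpenAI

# Point the stock OpenAI client at the Tightwad proxy instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8088/v1", api_key="tightwad")  # dummy key (assumption)

resp = client.chat.completions.create(
    model="qwen3:32b",   # target model name from configs/cluster.yaml
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=50,
)
print(resp.choices[0].message.content)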

🔄

Hot-Swap Models

tightwad swap model-name — swap the model while workers keep running. Zero downtime.

📡

SSE Streaming

Full streaming support on all endpoints. Tokens flow as they're accepted. No buffering.
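
A sketch of the streaming path with the same client as above; stream=True is standard OpenAI SDK usage, and the model field is the same assumption as before:

python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8088/v1", api_key="tightwad")  # dummy key (assumption)

stream = client.chat.completions.create(
    model="qwen3:32b",
    messages=[{"role": "user", "content": "Explain speculative decoding in one paragraph."}],
    stream=True,
)
for chunk in stream:
    # Chunks arrive as tokens are accepted, no buffering.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()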

⌨️

CLI-First

tightwad start, tightwad proxy start, tightwad status. Simple commands for complex infrastructure.

📄

YAML Config

One file describes your entire hardware topology. Version control it. Share it. Ship it.

📊

Built-in Benchmark

tightwad benchmark — test your cluster's throughput and acceptance rates on real prompts.

🧪

Dual Backends

Ollama for quick setup. llama.cpp for maximum performance. Switch per-model in the config.

🔒

Fallback Safety

Draft server down? fallback_on_draft_failure: true routes straight to target. Never breaks.

💻

Mixed Vendor

NVIDIA CUDA + AMD ROCm on the same model, same cluster, same endpoint. No compromises.