v0.1.0 — open source — MIT license

Your GPUs are
|

Tightwad pools your mixed CUDA + ROCm GPUs into a single OpenAI-compatible endpoint.
Speculative decoding proxy: draft fast, verify smart, stream everything.
Same output quality. 2‑3× faster. Zero cloud bill.

Get Started Free How It Works → Join the Discussion →

2-3× faster inference

58% avg token acceptance

$0 cloud bill

72B models on consumer hardware

terminal

$ pip install tightwad
$ tightwad proxy start
✓ Draft:  Qwen3-8B  @ 192.168.1.50:11434 (RTX 2070 — the dusty one)
✓ Target: Qwen3-32B @ 192.168.1.100:11434  (RTX 4070 Ti — the good one)
✓ Proxy listening on http://localhost:8088
→ Acceptance rate: 73.2% | Tokens saved this session: 14,891

// the goods

Two ways to stop wasting money

Pick your poison. Or run both. Tightwad doesn't judge — it just saves you cash.

RPC Cluster Mode

Got GPUs scattered across machines? Pool them. CUDA on one box, ROCm on another — Tightwad doesn't care. It distributes model layers across all of them and hands you one clean OpenAI-compatible endpoint.

  [OpenAI Client]
        |
        v
+-------------------+
|  Tightwad         |  <-- One endpoint to rule them all
|  Coordinator :8080|
+--------+----------+
         |  distributes layers
    +----+----+
    v         v
+--------+ +--------+
| Worker | | Worker |
| NVIDIA | |  AMD   |
| 4070Ti | | 7900XTX|
| 16 GB  | | 24 GB  |
+--------+ +--------+
  72B model: covered ✓

✓ Mix NVIDIA + AMD GPUs freely
✓ Run 72B+ models on consumer hardware
✓ Hot-swap models without restarting workers
✓ Built-in benchmarking CLI

⚡ 2-3× FASTER

Speculative Decoding Proxy

Your cheap GPU isn't slow — it's a draft engine. A fast small model guesses tokens. A big model batch-verifies them. Same output quality as running the big model alone. Ships token IDs (bytes), not 100–300 MB of tensor data over the wire.

  [Your App / OpenAI SDK]
          |
          v
+--------------------------+
|  Tightwad Proxy :8088    |
|                          |
|  1. Draft 8 tokens ------+--> Qwen3-8B
|     (~100 tok/s, cheap)  |    RTX 2070 (the dusty one)
|                          |
|  2. Verify batch --------+--> Qwen3-72B
|     (one forward pass)   |    4070Ti / Cloud API
|                          |
|  3. Accept/reject <------+
|  4. Stream to client     |
+--------------------------+
  Output = identical to 72B alone ✓

✓ Provably identical output quality
✓ Draft locally, verify via cloud API
✓ Supports Ollama + llama.cpp backends
✓ SSE streaming, full OpenAI compatibility

The math behind the magic

🚀

Draft

Small model blazes through 8 candidate tokens at ~100+ tok/s. Fast and cheap.

→

🔍

Verify

Big model evaluates all 8 tokens in a single forward pass. Batch is basically free.

→

✅

Accept

Keep every token both models agree on. Take the big model's token at the first disagreement.

→

📡

Stream

Accepted tokens stream to your app instantly. Repeat until done. Output is mathematically identical.

// the money shot

Benchmarks that hit different

Tested with Qwen3-8B (RTX 2070) drafting for Qwen3-32B (RTX 4070 Ti Super) across 130 prompts. Real hardware. Real numbers. No cherry-picking.

Prompt Type	Acceptance Rate	Rounds	Verdict
🧮 Reasoning	88%	32	Math is deterministic. Love it.
💻 Code	73%	34	Syntax is law. Both models agree.
📚 Factual	52%	18	Decent. Facts don't lie.
📋 List	44%	40	Phrasing varies. Still worthwhile.
🎨 Creative	34%	6	Many valid outputs. Expected.
⚡ Average	58.3%	26	58% of tokens = free.

💸

What 58% means

More than half your tokens come from the cheap GPU. The expensive model only works on the hard parts. That's not a bug — that's the whole point.

🎯

Output quality

Provably identical to running the big model alone. The math guarantees it. You're not trading quality for speed — you're just being smarter about it.

🔜

Coming soon

Logprobs-based batch verification will convert these acceptance rates into real wall-clock 2-3× speedup. The acceptance rates are already there — the turbo button is being installed.

// savings calculator

How much are you leaving on the table?

Slide to see your monthly cloud inference waste. Then stop doing that.

Monthly tokens generated

10M

Cost per 1M output tokens ($)

$15

😭 Without Tightwad $150/mo

→

🐷 With Tightwad $63/mo

You save $87/mo

* Based on 58.3% avg acceptance rate. Draft model runs on your local GPU (electricity cost = rounding error). Cloud API calls reduced by ~58%. Your mileage may vary, but it'll vary in your favor.

// who's this for

Pick your archetype

Tightwad was built by and for people who think paying $200/month for inference is genuinely offensive.

🏠

The Homelab Hoarder

You have a 2070 in the old desktop, a 4070 in the main rig, and a Quadro doing nothing in the server rack. They're all lazy freeloaders. Tightwad makes them a team.

RPC Cluster Mode

Pool all your random GPUs into one endpoint
Run 72B models across consumer hardware
Zero wasted VRAM, zero cloud spend

💰 MOST POPULAR

☁️

The Cloud Escapee

You're still paying OpenAI/Anthropic for some tasks. Fine. But why let them do the easy parts? Draft locally, verify via API. 58% fewer API calls. Same answers.

Speculative Proxy Mode

Local draft GPU does the heavy lifting
Cloud only handles the hard tokens
Drop-in OpenAI SDK replacement — zero code changes

🏗️

The Budget Builder

You want 72B model quality on a $600 GPU budget. That's not a dream — that's just math. RPC mode distributes layers across whatever you've got collecting dust.

RPC Cluster Mode

Qwen3-72B on 2× consumer GPUs
No enterprise hardware required
Benchmark built-in to tune your setup

⚡

The Mixed Vendor Maverick

You bought AMD because it was on sale and NVIDIA for the CUDA ecosystem. Now they won't cooperate. Tightwad makes CUDA and ROCm work the same model together. Finally.

RPC Cluster Mode

CUDA + ROCm on the same model
llama.cpp RPC handles the hard parts
Coordinator distributes layers intelligently

// get running in 5 minutes

Quick Start

No Docker Compose files with 300 environment variables. No Kubernetes YAML. Just Python and one config file.

Install

bash

$ git clone https://github.com/akivasolutions/tightwad.git
$ cd tightwad
$ python3 -m venv .venv && source .venv/bin/activate
$ pip install -e .

Configure your hardware

configs/cluster.yaml

proxy:
  host: 0.0.0.0
  port: 8088
  max_draft_tokens: 8
  fallback_on_draft_failure: true
  draft:
    url: http://192.168.1.50:11434  # Your cheap GPU (Ollama)
    model_name: qwen3:8b
    backend: ollama
  target:
    url: http://192.168.1.100:11434   # Your big GPU (Ollama)
    model_name: qwen3:32b
    backend: ollama

Start it & test

bash

$ tightwad proxy start
✓ Draft model healthy
✓ Target model healthy
✓ Proxy listening on http://localhost:8088

# Test it (drop-in for any OpenAI SDK call)
$ curl http://localhost:8088/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 50}'

# Check acceptance rate stats
$ tightwad proxy status
→ Acceptance rate: 73.2% | Rounds: 34 | Tokens saved: 14,891

Build RPC workers (CUDA — Windows/Linux)

bash (worker machine)

# Or use scripts/install-worker.sh
$ cmake -B build -DGGML_CUDA=ON -DGGML_RPC=ON
$ cmake --build build --config Release
$ build/bin/rpc-server -p 50052  # GPU 0

Configure cluster topology

configs/cluster.yaml

coordinator:
  host: 0.0.0.0
  port: 8080
  backend: hip  # or cuda
  gpus:
    - name: "7900 XTX #0"
      vram_gb: 24

workers:
  - host: 192.168.1.100  # NVIDIA box
    gpus:
      - name: "RTX 4070 Ti Super"
        vram_gb: 16
    rpc_port: 50052

models:
  qwen3-72b:
    path: /models/Qwen3-72B-Q4_K_M.gguf
    ctx_size: 8192
    flash_attn: true
    default: true

Start the cluster

bash

$ tightwad start
✓ Coordinator started
✓ Worker @ 192.168.1.100:50052 online
✓ Model qwen3-72b loaded across 40 GB VRAM

# Hot-swap to a different model anytime
$ tightwad swap deepseek-r1-70b

# Run the benchmark
$ tightwad benchmark

// what's in the box

Everything you need, nothing you don't

Built for terminal people who hate bloat as much as they hate cloud bills.

🔁

OpenAI Compatible

Drop-in replacement for any OpenAI SDK. Change one URL. That's it. No code changes required.

🔄

Hot-Swap Models

tightwad swap model-name — swap the model while workers keep running. Zero downtime.

📡

SSE Streaming

Full streaming support on all endpoints. Tokens flow as they're accepted. No buffering.

⌨️

CLI-First

tightwad start, tightwad proxy start, tightwad status. Simple commands for complex infrastructure.

📄

YAML Config

One file describes your entire hardware topology. Version control it. Share it. Ship it.

📊

Built-in Benchmark

tightwad benchmark — test your cluster's throughput and acceptance rates on real prompts.

🧪

Dual Backends

Ollama for quick setup. llama.cpp for maximum performance. Switch per-model in the config.

🔒

Fallback Safety

Draft server down? fallback_on_draft_failure: true routes straight to target. Never breaks.

💻

Mixed Vendor

NVIDIA CUDA + AMD ROCm on the same model, same cluster, same endpoint. No compromises.

Your GPUs are |

Two ways to stop wasting money

RPC Cluster Mode

Speculative Decoding Proxy

The math behind the magic

Draft

Verify

Accept

Stream

Benchmarks that hit different

What 58% means

Output quality

Coming soon

How much are you leaving on the table?

Pick your archetype

The Homelab Hoarder

The Cloud Escapee

The Budget Builder

The Mixed Vendor Maverick

Quick Start

Install

Configure your hardware

Start it & test

Build RPC workers (CUDA — Windows/Linux)

Configure cluster topology

Start the cluster

Everything you need, nothing you don't

OpenAI Compatible

Hot-Swap Models

SSE Streaming

CLI-First

YAML Config

Built-in Benchmark

Dual Backends

Fallback Safety

Mixed Vendor

Your GPUs are
|