// six modes, one binary

Six ways to run Tightwad

One install, six inference strategies for a junk drawer of mismatched compute. Pool GPUs that fit nothing on their own. Draft cheap and verify smart. Race drafters for consensus. Gate full responses. Sling 40 GB models peer-to-peer. Stack them, or run just the one you need — they all hang off the same OpenAI-compatible endpoint.

// the goods

Pick your poison. Stack them. Run all six.

Speculative decoding, GPU pooling, consensus drafting, quality gating, and P2P model transfer — every mode is a `pip install tightwad` away. Tightwad doesn't judge what hardware you bring; it just puts it to work behind one endpoint.

👑 THE KILLER FEATURE

Combined Mode — Speculation Over a Pool

When a model doesn't fit on any single machine, pool your GPUs AND speculate on top. RPC pooling alone is slow (one network round-trip per token). Speculation amortizes that cost: up to 32 drafted tokens are verified per round-trip instead of 1. Result: models that fit nowhere become usable.

  [Junk Hardware — P400 2GB, GTX 770, laptop CPU]
        | runs 1.7B draft, ~30 tok/s
        | sends token IDs (bytes)
        v
  [Tightwad Proxy :8088]
        | sends draft to pool for BATCH verify
        v
  [RPC GPU Pool — 4 GPUs, 52GB total, WiFi]
        | verifies 32 tokens in ONE forward pass
        v
  4.1 tok/s instead of 2.2 tok/s — 70B fits nowhere else

✓ 1.86× measured speedup on Llama 3.3 70B (4 GPUs over WiFi)*
✓ Output mathematically identical to target alone under greedy decoding (Leviathan guarantee)
✓ Any junk hardware can be the drafter — 2GB GPU, CPU, laptop
✓ Pool CUDA + ROCm + Metal GPUs, speculate on top

* Measured: Llama 3.1 8B draft → Llama 3.3 70B target, pooled across RTX 4070 Ti Super + RTX 3060 + RTX 2070 + M2 Metal (52 GB VRAM total) over WiFi: 519 tokens in 127 s with speculation vs. 512 tokens in 231 s direct. Your results will vary with hardware and network conditions.
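The footnote's numbers can be sanity-checked with a few lines of arithmetic. This is a minimal sketch that recomputes throughput from the token counts and wall-clock times quoted above; the helper name `tokens_per_second` is made up for illustration and is not part of Tightwad.

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """Average throughput over a whole run."""
    return tokens / seconds


# Figures quoted in the measurement footnote above.
speculative = tokens_per_second(519, 127)  # draft + pooled 70B target
direct = tokens_per_second(512, 231)       # pooled 70B target alone

print(f"speculative: {speculative:.1f} tok/s")           # 4.1 tok/s
print(f"direct:      {direct:.1f} tok/s")                # 2.2 tok/s
print(f"speedup:     {speculative / direct:.2f}x")       # 1.84x
```

The unrounded ratio lands near 1.84×; dividing the rounded 4.1 and 2.2 tok/s figures gives the 1.86× quoted above.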

⚡ 1.86× MEASURED ON 70B

Speculative Decoding Proxy

Your cheap GPU isn't slow — it's a draft engine. A fast small model guesses tokens. A big model batch-verifies them. Same output quality as running the big model alone. Ships token IDs (bytes), not 100–300 MB of tensor data over the wire.

  [Your App / OpenAI SDK]
          |
          v
+--------------------------+
|  Tightwad Proxy :8088    |
|                          |
|  1. Draft 32 tokens -----+--> Qwen3-8B
|     (~100 tok/s, cheap)  |    RTX 2070 (the dusty one)
|                          |
|  2. Verify batch --------+--> Qwen3-32B
|     (one forward pass)   |    4070Ti (same machine or LAN)
|                          |
|  3. Accept/reject <------+
|  4. Stream to client     |
+--------------------------+
  Output quality = equivalent to 32B alone ✓

✓ Output quality equivalent to target model alone*
✓ Best on local / low-latency targets — the win is wall-clock, not the bill
✓ Supports Ollama + llama.cpp backends
✓ SSE streaming, full OpenAI compatibility

* Identical to the target under greedy decoding; statistically equivalent under sampling. Speculation shines when both models are local or very low-latency — over a remote cloud API, per-round network latency makes it slower than baseline.
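The draft, verify, accept/reject loop in the diagram can be sketched with toy models. This is an illustration of greedy speculative decoding in general, not Tightwad's code: `target_next` and `draft_next` are hypothetical stand-ins that map a token context to the next token, and the toy verifies position by position where a real engine would score the whole batch in one forward pass.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # context -> greedy next token


def target_next(ctx: List[int]) -> int:
    """Toy stand-in for the big model: deterministic in the context."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: List[int]) -> int:
    """Toy stand-in for the small model: wrong whenever len(ctx) % 5 == 0."""
    tok = target_next(ctx)
    return (tok + 1) % 50 if len(ctx) % 5 == 0 else tok


def greedy_generate(prompt: List[int], model: Model, n_tokens: int) -> List[int]:
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(model(out))
    return out


def speculative_generate(
    prompt: List[int], draft: Model, target: Model, n_tokens: int, k: int = 32
) -> Tuple[List[int], int]:
    """Return the generated sequence and the number of verify rounds used."""
    out = list(prompt)
    goal = len(prompt) + n_tokens
    rounds = 0
    while len(out) < goal:
        rounds += 1
        # 1. Draft k tokens autoregressively with the cheap model.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. Verify: compare each drafted token with the target's greedy choice.
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if tok != expected:
                # 3. Keep the accepted prefix, then the target's own token.
                out.extend(proposal[:i])
                out.append(expected)
                break
        else:
            out.extend(proposal)  # every drafted token was accepted
    return out[:goal], rounds


prompt = [1, 2, 3]
exact = greedy_generate(prompt, target_next, 40)
fast, rounds = speculative_generate(prompt, draft_next, target_next, 40)
print(fast == exact, rounds)
```

Every emitted token is either a drafted token the target agreed with or the target's own correction, which is why the greedy output matches the target alone while needing far fewer than 40 verify rounds.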

🧠 APPROXIMATE · SKIP THE GPU

Multi-Drafter Consensus (approximate mode)

Race multiple cheap machines simultaneously. Each drafter generates candidate tokens in parallel. When they all agree, the expensive GPU verification is skipped entirely — that’s the speed win. Tradeoff: this is an approximate mode, not exact speculative decoding. Skipping the target means consensus-accepted tokens may differ from what the target alone would produce. Three sub-modes: strict, majority, any_disagree — pick your acceptable risk. Off by default.

  [Tightwad Proxy :8088]
        |
        | races all drafters in parallel
        |
   +----+----+----+
   v    v    v    v
 [M2] [CPU] [2070] [P400]
  8B    8B    8B     1.7B
   |    |    |      |
   +----+----+----+-+
        |
        v
  Consensus? ──yes──> Stream tokens (GPU never touched)
        |
       no
        |
        v
  [Target 70B GPU] ──> Verify only disagreed tokens

✓ Race any number of drafters: CPUs, old GPUs, laptops, anything
✓ Unanimous tokens skip the target GPU entirely
✓ Three modes: strict, majority, any_disagree
✓ Tree-based speculation for branching draft paths
✓ Prometheus metrics for consensus accept/fallback rates
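The consensus decision for a single token position might look like the sketch below. The semantics here are assumptions: `strict` is read as "all drafters agree" and `majority` as "more than half agree". The text does not define `any_disagree`, so it is left out rather than guessed, and `consensus_token` is a hypothetical name, not a Tightwad API.

```python
from collections import Counter
from typing import List, Optional


def consensus_token(votes: List[int], mode: str = "strict") -> Optional[int]:
    """Return the agreed token, or None to fall back to the target model.

    votes holds one proposed token per drafter for the same position.
    """
    if not votes:
        return None
    token, count = Counter(votes).most_common(1)[0]
    if mode == "strict":
        # Assumed meaning: unanimous agreement required.
        return token if count == len(votes) else None
    if mode == "majority":
        # Assumed meaning: strictly more than half of the drafters agree.
        return token if count * 2 > len(votes) else None
    raise ValueError(f"unsupported mode: {mode}")


print(consensus_token([7, 7, 7, 7]))              # unanimous: stream it
print(consensus_token([7, 7, 7, 9]))              # strict: fall back
print(consensus_token([7, 7, 7, 9], "majority"))  # majority: accept 7
```

A `None` result is the "no" branch in the diagram: that position goes to the target for verification.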

RPC Cluster Mode

Got GPUs scattered across machines? Pool them. CUDA on one box, ROCm on another — Tightwad doesn't care. It distributes model layers across all of them and hands you one clean OpenAI-compatible endpoint.

  [OpenAI Client]
        |
        v
+-------------------+
|  Tightwad         |  <-- One endpoint to rule them all
|  Coordinator :8080|
+--------+----------+
         |  distributes layers
    +----+----+
    v         v
+--------+ +--------+
| Worker | | Worker |
| NVIDIA | |  AMD   |
| 4070Ti | | 7900XTX|
| 16 GB  | | 24 GB  |
+--------+ +--------+
  70B model: covered ✓

✓ Mix NVIDIA + AMD GPUs freely
✓ Run 70B+ models on consumer hardware
✓ Hot-swap models without restarting workers
✓ Built-in benchmarking CLI

Pooling pays off only when the model fits nowhere on its own. Pair it with speculation (Combined Mode, 01) to claw back the per-token network cost.
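One plausible way to picture "distributes layers" is a split proportional to each worker's VRAM. That policy is an assumption made for this sketch; the text does not say how Tightwad decides the split, and `split_layers` is a hypothetical helper. The 80-layer count is the transformer depth of Llama-family 70B models.

```python
from typing import Dict


def split_layers(n_layers: int, vram_gb: Dict[str, float]) -> Dict[str, int]:
    """Assign layers to workers in proportion to their VRAM (assumed policy)."""
    total = sum(vram_gb.values())
    # Start with the floor of each worker's proportional share.
    shares = {name: int(n_layers * gb / total) for name, gb in vram_gb.items()}
    # Flooring can leave a few layers unassigned; give them to the largest workers.
    leftover = n_layers - sum(shares.values())
    for name in sorted(vram_gb, key=vram_gb.get, reverse=True)[:leftover]:
        shares[name] += 1
    return shares


# The two workers from the diagram above.
print(split_layers(80, {"4070Ti": 16, "7900XTX": 24}))
# {'4070Ti': 32, '7900XTX': 48}
```

Whether the assigned layers actually fit still depends on quantization and context size, which this sketch ignores.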

🛡 FULL-RESPONSE REVIEW

Quality Gate — CPU Fleet Drafts, GPU Reviews

Unlike token-level speculation, this operates at the full-response level. A fleet of cheap machines (CPUs, small GPUs) generates complete responses using small models. One powerful GPU reviews each output, approving, correcting, or rejecting it. 60–80% of drafts pass unchanged, so the GPU only sweats the hard remainder.

  [Client Request]
        |
        v
  [Tightwad Gate :8088]
        |
        | fan-out to CPU fleet
   +----+----+----+
   v    v    v    v
 [CPU] [CPU] [CPU] [CPU]
 each generates full response with 8B model
   +----+----+----+
        |
        v
  [GPU Target — 70B]
  Reviews each response:
    ✓ approve  (pass through)
    ✎ correct  (light edit)
    ✗ reject   (regenerate)
  60-80% pass unchanged

✓ GPU reviews every response but only rewrites the ones the fleet got wrong
✓ Any CPU or cheap GPU can be a drafter
✓ Full-response verification, not token-by-token
✓ Automatic approve/correct/reject pipeline
✓ tightwad gate start — one command to run
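The approve/correct/reject flow can be sketched as a small dispatcher. Everything here is hypothetical scaffolding: `Verdict`, `gate`, and the three stub functions stand in for the CPU drafter, the GPU reviewer, and the big-model regeneration, and the rules inside `review` are toy placeholders for whatever the real reviewer decides.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Verdict:
    action: str                 # "approve", "correct", or "reject"
    text: Optional[str] = None  # replacement text when action == "correct"


def gate(
    prompt: str,
    draft: Callable[[str], str],
    review: Callable[[str, str], Verdict],
    regenerate: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (final response, action taken) for one request."""
    candidate = draft(prompt)            # cheap machine writes a full response
    verdict = review(prompt, candidate)  # big GPU reviews it
    if verdict.action == "approve":
        return candidate, "approve"      # pass through unchanged
    if verdict.action == "correct" and verdict.text is not None:
        return verdict.text, "correct"   # light edit by the reviewer
    return regenerate(prompt), "reject"  # reviewer writes it from scratch


# Toy stubs so the sketch runs end to end.
def draft(prompt: str) -> str:
    return f"draft answer to {prompt}"


def review(prompt: str, candidate: str) -> Verdict:
    if "hard" in prompt:
        return Verdict("reject")
    if "typo" in prompt:
        return Verdict("correct", candidate.replace("draft", "edited"))
    return Verdict("approve")


def regenerate(prompt: str) -> str:
    return f"big-model answer to {prompt}"


for p in ["easy q", "typo q", "hard q"]:
    print(gate(p, draft, review, regenerate))
```

Only the reject path pays for a full generation on the big model; approve and correct reuse the fleet's draft.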

🌐 P2P DISTRIBUTION

Swarm Transfer — P2P Model Distribution

Models are huge. Pulling a 40 GB GGUF from Hugging Face to every worker takes hours and wastes bandwidth. Instead, pull from every machine that already has it: chunked 64 MB transfers with SHA256 piece verification, like torrents but for GGUF files. When a new machine joins the cluster, it downloads from all your existing machines in parallel.

  [New Machine Joins Cluster]
        |
        | "I need Llama-3.3-70B-Q4_K_M.gguf"
        v
  +---------------------------+
  | Tightwad Swarm Discovery  |
  |                           |
  |  Piece 1 <--- Machine A  |  (4070 Ti — has full model)
  |  Piece 2 <--- Machine B  |  (RTX 2070 — has full model)
  |  Piece 3 <--- Machine C  |  (M2 Metal — has pieces 1-6)
  |  Piece 4 <--- Machine A  |  (parallel, rarest-first)
  |  ...                      |
  +---------------------------+
        |
        v
  SHA256-verified • ready to serve in minutes, not hours

✓ Multi-source parallel download — pull from every peer simultaneously
✓ SHA256 piece verification — every 64 MB chunk validated before use
✓ Rarest-first selection — keeps the model available across the cluster
✓ Delta updates — new quantization? Only transfer the changed pieces
✓ Zero central server — resume on interrupt, no single bottleneck
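Piece verification and rarest-first ordering can both be sketched with the standard library. This is an illustration of the general technique, not Tightwad's wire format: the function names are hypothetical, and the 64 MB piece size is interpreted here as 64 × 1024 × 1024 bytes, which is an assumption.

```python
import hashlib
from collections import Counter
from typing import Dict, List, Set

PIECE_SIZE = 64 * 1024 * 1024  # assumed reading of "64 MB" pieces


def piece_hashes(data: bytes, piece_size: int = PIECE_SIZE) -> List[str]:
    """SHA-256 hex digest of each fixed-size piece, in order."""
    return [
        hashlib.sha256(data[i:i + piece_size]).hexdigest()
        for i in range(0, len(data), piece_size)
    ]


def verify_piece(piece: bytes, expected_hex: str) -> bool:
    """Accept a downloaded piece only if its digest matches the manifest."""
    return hashlib.sha256(piece).hexdigest() == expected_hex


def rarest_first(needed: Set[int], peers: Dict[str, Set[int]]) -> List[int]:
    """Order needed piece indices by how few peers hold them.

    Pieces that no peer holds are left out of the result.
    """
    availability = Counter(
        idx for have in peers.values() for idx in have if idx in needed
    )
    return sorted(availability, key=lambda idx: (availability[idx], idx))


# Tiny demo with 10-byte pieces instead of 64 MB ones.
blob = b"a" * 10 + b"b" * 10 + b"c" * 5
manifest = piece_hashes(blob, piece_size=10)
print(len(manifest), verify_piece(blob[:10], manifest[0]))

peers = {"A": {0, 1, 2}, "B": {0, 1}, "C": {0}}
print(rarest_first({0, 1, 2}, peers))  # [2, 1, 0]
```

Fetching the rarest pieces first spreads scarce pieces across more machines early, which matches the availability point in the list above.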

// which one do I run?

Stack them, or pick the one that fits

The honest short version: if a model fits on one box, just speculate (Mode 02). If it doesn't, pool and speculate (Mode 01). Everything else is a knob for a specific junk-drawer shape.

✓ Model fits on one machine → Speculative Decoding Proxy (02). Cheap local draft, exact output.
✓ Model fits nowhere → Combined Mode (01). Pool the GPUs, then speculate to make the pool usable.
✓ A pile of idle CPUs / old GPUs → Multi-Drafter Consensus (03, approximate) or Quality Gate (05, full-response).
✓ One big GPU rig, just need the endpoint → RPC Cluster (04).
✓ Getting a 40 GB model onto every worker → Swarm Transfer (06). Skip the per-worker Hugging Face pull.
