// six modes, one binary

Six ways to run Tightwad

One install, six inference strategies for a junk drawer of mismatched compute. Pool GPUs that fit nothing on their own. Draft cheap and verify smart. Race drafters for consensus. Gate full responses. Sling 40 GB models peer-to-peer. Stack them, or run just the one you need — they all hang off the same OpenAI-compatible endpoint.

Pick your poison. Stack them. Run all six.

Speculative decoding, GPU pooling, consensus drafting, quality gating, and P2P model transfer — every mode is a `pip install tightwad` away. Tightwad doesn't judge what hardware you bring; it just puts it to work behind one endpoint.

👑 THE KILLER FEATURE
01

Combined Mode — Speculation Over a Pool

When a model doesn't fit on any single machine, pool your GPUs AND speculate on top. RPC pooling alone is slow (one network round-trip per token). Speculation amortizes that — 32 tokens verified per round-trip instead of 1. Result: models that fit nowhere become usable.

  [Junk Hardware β€” P400 2GB, GTX 770, laptop CPU]
        | runs 1.7B draft, ~30 tok/s
        | sends token IDs (bytes)
        v
  [Tightwad Proxy :8088]
        | sends draft to pool for BATCH verify
        v
  [RPC GPU Pool β€” 4 GPUs, 52GB total, WiFi]
        | verifies 32 tokens in ONE forward pass
        v
  4.1 tok/s instead of 2.2 tok/s β€” 70B fits nowhere else
  • 1.86× measured speedup on Llama 3.3 70B (4 GPUs over WiFi)*
  • Output mathematically identical to target alone under greedy decoding (Leviathan guarantee)
  • Any junk hardware can be the drafter — 2GB GPU, CPU, laptop
  • Pool CUDA + ROCm + Metal GPUs, speculate on top

* Measured: Llama 3.1 8B draft → Llama 3.3 70B target across RTX 4070 Ti Super + RTX 3060 + RTX 2070 + M2 Metal = 52GB VRAM over WiFi. 519 tokens in 127s vs 512 tokens in 231s direct. Your results will vary with hardware and network conditions. Full deep dive →

🧠 APPROXIMATE · SKIP THE GPU
03

Multi-Drafter Consensus approximate mode

Race multiple cheap machines simultaneously. Each drafter generates candidate tokens in parallel. When they all agree, the expensive GPU verification is skipped entirely — that’s the speed win. Tradeoff: this is an approximate mode, not exact speculative decoding. Skipping the target means consensus-accepted tokens may differ from what the target alone would produce. Three sub-modes: strict, majority, any_disagree — pick your acceptable risk. Off by default.

  [Tightwad Proxy :8088]
        |
        | races all drafters in parallel
        |
   +----+----+----+
   v    v    v    v
 [M2] [CPU] [2070] [P400]
  8B    8B    8B     1.7B
   |    |    |      |
   +----+----+----+-+
        |
        v
  Consensus? ──yes──> Stream tokens (GPU never touched)
        |
       no
        |
        v
  [Target 70B GPU] ──> Verify only disagreed tokens
  • Race unlimited drafters — CPUs, old GPUs, laptops, anything
  • Unanimous tokens skip the target GPU entirely
  • Three modes: strict, majority, any_disagree
  • Tree-based speculation for branching draft paths
  • Prometheus metrics for consensus accept/fallback rates
04

RPC Cluster Mode

Got GPUs scattered across machines? Pool them. CUDA on one box, ROCm on another — Tightwad doesn't care. It distributes model layers across all of them and hands you one clean OpenAI-compatible endpoint.

  [OpenAI Client]
        |
        v
+-------------------+
|  Tightwad         |  <-- One endpoint to rule them all
|  Coordinator :8080|
+--------+----------+
         |  distributes layers
    +----+----+
    v         v
+--------+ +--------+
| Worker | | Worker |
| NVIDIA | |  AMD   |
| 4070Ti | | 7900XTX|
| 16 GB  | | 24 GB  |
+--------+ +--------+
  70B model: covered ✓
  • Mix NVIDIA + AMD GPUs freely
  • Run 70B+ models on consumer hardware
  • Hot-swap models without restarting workers
  • Built-in benchmarking CLI

Pooling pays off only when the model fits nowhere on its own — pair it with speculation (Mode 01) to claw back the per-token network cost. See the deep dive →

🛡 FULL-RESPONSE REVIEW
05

Quality Gate — CPU Fleet Drafts, GPU Reviews

Different from token-level speculation — this operates at the full-response level. A fleet of cheap machines (CPUs, small GPUs) generate complete responses using small models. One powerful GPU reviews each output, approving, correcting, or rejecting. 60–80% pass unchanged — so the GPU only sweats the hard remainder.

  [Client Request]
        |
        v
  [Tightwad Gate :8088]
        |
        | fan-out to CPU fleet
   +----+----+----+
   v    v    v    v
 [CPU] [CPU] [CPU] [CPU]
 each generates full response with 8B model
   +----+----+----+
        |
        v
  [GPU Target β€” 70B]
  Reviews each response:
    ✓ approve  (pass through)
    ✎ correct  (light edit)
    ✗ reject   (regenerate)
  60-80% pass unchanged
  • GPU only processes the hard responses the fleet got wrong
  • Any CPU or cheap GPU can be a drafter
  • Full-response verification, not token-by-token
  • Automatic approve/correct/reject pipeline
  • tightwad gate start — one command to run
🌐 P2P DISTRIBUTION
06

Swarm Transfer — P2P Model Distribution

Models are huge. Pulling a 40 GB GGUF from HuggingFace to every worker takes hours and wastes bandwidth. Pull from every machine that already has it. Chunked 64 MB transfer with SHA256 piece verification — like torrents, but for GGUF files. New machine joins the cluster? It downloads from all your existing machines in parallel.

  [New Machine Joins Cluster]
        |
        | "I need Llama-3.3-70B-Q4_K_M.gguf"
        v
  +---------------------------+
  | Tightwad Swarm Discovery  |
  |                           |
  |  Piece 1 <--- Machine A  |  (4070 Ti — has full model)
  |  Piece 2 <--- Machine B  |  (RTX 2070 — has full model)
  |  Piece 3 <--- Machine C  |  (M2 Metal — has pieces 1-6)
  |  Piece 4 <--- Machine A  |  (parallel, rarest-first)
  |  ...                      |
  +---------------------------+
        |
        v
  SHA256-verified • ready to serve in minutes, not hours
  • Multi-source parallel download — pull from every peer simultaneously
  • SHA256 piece verification — every 64 MB chunk validated before use
  • Rarest-first selection — keeps the model available across the cluster
  • Delta updates — new quantization? Only transfer the changed pieces
  • Zero central server — resume on interrupt, no single bottleneck

Stack them, or pick the one that fits

The honest short version: if a model fits on one box, just speculate (Mode 02). If it doesn't, pool and speculate (Mode 01). Everything else is a knob for a specific junk-drawer shape.

  • Model fits on one machine → Speculative Decoding Proxy (02). Cheap local draft, exact output.
  • Model fits nowhere → Combined Mode (01). Pool the GPUs, then speculate to make the pool usable.
  • A pile of idle CPUs / old GPUs → Multi-Drafter Consensus (03, approximate) or Quality Gate (05, full-response).
  • One big GPU rig, just need the endpoint → RPC Cluster (04).
  • Getting a 40 GB model onto every worker → Swarm Transfer (06). Skip the per-worker HuggingFace pull.