Can vLLM pool mixed AMD and NVIDIA GPUs?

No. vLLM assumes uniform hardware across the cluster and is primarily CUDA-focused, with experimental ROCm support. It cannot pool a CUDA machine together with an AMD/ROCm machine, mix GPU generations, or add a CPU-only node. Tightwad pools CUDA, ROCm, Metal, and CPU together into one OpenAI-compatible endpoint via llama.cpp RPC.

Does vLLM support speculative decoding across machines?

vLLM supports speculative decoding within a single machine. Tightwad runs it at the application layer across network-separated machines, so the draft model can live on a completely different box than the target. It ships token IDs (bytes) between them instead of tensor data.

When should I not use Tightwad?

If you have a single powerful, homogeneous CUDA rig and need maximum raw throughput and concurrency for production serving, use vLLM or TGI instead. Tightwad's design target is heterogeneous, network-separated, consumer hardware, not a uniform datacenter node.

// honest comparison

Tightwad vs vLLM, Ollama, llama.cpp RPC & TGI

The other tools are good. We use some of them. The honest question isn't "which is best" — it's "which fits your hardware." vLLM and TGI win on uniform datacenter GPUs and raw throughput. Tightwad wins when your compute is a junk drawer: mixed vendors, mixed generations, scattered across machines, some with no GPU at all.

How pooling works See the benchmarks

// the short version

One question decides it: is your hardware uniform?

vLLM, Ollama, llama.cpp RPC, and TGI are all genuinely good at what they do. Tightwad isn't trying to beat them at their job — it's built for a job they don't do: pooling mismatched compute (CUDA + ROCm + Metal + CPU, even a 2GB GTX 770) across network-separated machines into one OpenAI-compatible endpoint, and running application-layer speculative decoding over that pool so a 70B that fits on no single box becomes usable.

Comparison reflects each tool's positioning as of early 2026. These projects evolve quickly — check their docs for the latest capabilities.

vLLM

Excellent production inference engine. Primarily CUDA. Built for ML teams running at scale.

⚠ Primarily CUDA-focused. ROCm support is experimental. Tightwad pools CUDA and ROCm on the same model, same endpoint.
⚠ Assumes uniform hardware. vLLM can't pool a GTX 770 with a 4070 Ti, can't combine a CUDA box with an AMD box. Tightwad doesn't care what generation or vendor your hardware is.
⚠ Speculative decoding, but single-machine. Tightwad does it across your network — draft on one box, verify on another.
⚠ No CPU-only nodes. You can't add a GPU-less machine to a vLLM cluster. Tightwad supports CPU drafting.
✓ Use vLLM if you have a single powerful CUDA machine and need production-grade throughput and concurrency. It will out-serve Tightwad on uniform datacenter GPUs.

Ollama

The reason most people run local models at all. One model, one machine, beautifully simple.

⚠ One model, one machine. When you outgrow a single GPU, Ollama can't pool across machines — your RTX 2070 and RTX 4070 are isolated from each other.
⚠ No cross-machine inference. Ollama has no concept of combining hardware. Two boxes never cooperate on one request.
✓ Tightwad works with Ollama. Keep Ollama as the backend on each machine — Tightwad just coordinates between them.
✓ Use Ollama if you have one machine and just want to run models. Reach for Tightwad once a second box shows up and you want them working together.

llama.cpp RPC

The low-level primitive Tightwad is built on. Powerful — and a lot of manual scripting.

✓ Tightwad is built on llama.cpp RPC. We didn't replace it — we added the orchestration, YAML config, CLI, version enforcement, and speculative proxy you'd otherwise script by hand.
⚠ Raw RPC ships 100–300 MB of tensor data per step over the network. For models that fit on a single machine, Tightwad's speculative proxy ships only token IDs (bytes) — far faster over a home network.
✓ Use raw RPC if you want maximum control and don't mind the scripting. Use Tightwad if you want pooling and speculation to just work.

TGI (HuggingFace)

Production inference for the HuggingFace ecosystem. Strong if you already live there.

⚠ Tuned for the HuggingFace ecosystem. Designed to work best with their model hub and services, on capable GPUs.
✓ Tightwad is vendor-neutral and MIT-licensed. Works with your existing Ollama or llama.cpp setup. No accounts, no services to sign up for.
✓ Use TGI if you're already in the HuggingFace ecosystem and want its throughput. Use Tightwad for backend-agnostic, no-strings inference on whatever hardware you happen to own.

// the actual edge

Two things none of them do together

Tightwad's whole reason to exist is the intersection of these two — pool the junk drawer, then speculate over the pool.

Mixed-vendor pooling

CUDA + ROCm + Metal + CPU in one cluster, one model, one endpoint.

✓ The coordinator distributes model layers across local and remote GPUs of any vendor or generation. Run a 70B that fits on no single machine.
✓ That 12-year-old 2GB card isn't dead weight — it drafts tokens. The GPU-less Xeon drafts on CPU. No node left out.

Speculation across the wire

Application-layer speculative decoding between network-separated machines.

✓ Batch verification amortizes RPC overhead — 33 tokens per round instead of one token per round-trip. 1.86x measured wall-clock speedup on Llama 3.1 8B → Llama 3.3 70B across a 4-GPU pool (52GB VRAM) over WiFi.
✓ Under greedy decoding the output is mathematically identical to running the 70B target alone — 2.2 → 4.1 tok/s, same tokens.

// staying honest

When not to use Tightwad

A comparison page that only says "we win" isn't worth reading. Here's where the other tools are the right call.

Reach for vLLM / TGI instead when…

⚠ Your hardware is uniform and datacenter-grade. A rack of identical CUDA GPUs will hit higher raw throughput and concurrency on a purpose-built engine than on a llama.cpp-RPC pool.
⚠ You're serving high-concurrency production traffic. Tightwad targets homelab and small-team setups, not large-scale request fan-out.
⚠ Everything already fits on one machine and you don't need a draft model. If the target runs comfortably on a single GPU, raw local inference is simplest — pooling only pays off when a model spills past one box.
⚠ You only have a cloud API target. Over cloud APIs, per-round network latency makes speculative decoding slower than baseline. Speculation shines when both draft and target are local or very low-latency.

The honest summary

Single powerful CUDA machine, production-scale throughput → Use vLLM or TGI

One machine, just want to run local models → Use Ollama

Maximum low-level control, happy to script the pool yourself → Use llama.cpp RPC

Two or more machines — mixed vendors, old & new GPUs, NVIDIA & AMD, Metal, even CPU-only — and you want them all working together on one model → 🐷 Use Tightwad

// keep exploring