Run MoE models on home hardware
Mixture-of-Experts is the new normal — GPT-OSS 120B, Qwen3-MoE, DeepSeek-MoE, Mixtral. Everyone ships them; almost nobody places them properly across a junk-drawer GPU pool. Tightwad treats MoE as a first-class citizen: defuse the GGUF, bin-pack whole experts per card, and pin the hot ones to your fastest device.
Four capabilities. All opt-in. All backward-compatible.
MoE support is experimental and largely untested on home hardware — we ship the plumbing, you bring the model. Leave moe_placement unset and nothing changes about your existing setup. Turn it on and Tightwad rewrites, packs, and profiles experts so they actually land where they should.
No invented numbers below — MoE placement is opt-in/experimental and we have no published home-hardware benchmarks for it yet. The capabilities are real; the throughput is yours to measure.
GGUF Defusion — the feature that makes the rest possible
Most MoE GGUFs ship fused: one giant tensor per layer covering every expert. llama.cpp can’t per-expert-split a fused tensor — so every MoE placement tool silently degrades to whole-layer splits. Tightwad rewrites the fused weights into indexed form by slicing along the expert dimension. Same quantization. Same size. Same tokens out.
BEFORE (fused)                      AFTER (indexed)
blk.0.ffn_gate_exps.weight   -->    blk.0.ffn_gate.0.weight
                                    blk.0.ffn_gate.1.weight
                                    blk.0.ffn_gate.2.weight
                                    ...
                                    blk.0.ffn_gate.N.weight
Same bytes. Same quantization. One pass of disk I/O.
- ✓ tightwad moe defuse fused.gguf indexed.gguf
- ✓ Identical weights, identical quantization, identical output
- ✓ Output size equals input size — load it like the original
- ✓ Works on every known MoE layout today (GPT-OSS, Qwen3-MoE, DeepSeek-MoE, Mixtral)
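To make the rewrite concrete, here is a minimal Python sketch of the renaming and slicing step. It is not Tightwad's implementation. The name pattern comes from the before/after listing above; the helper names (`indexed_names`, `slice_experts`, `defuse_tensor`) are hypothetical, and the code assumes experts are the outermost dimension, so each expert occupies one contiguous, equally sized byte range.

```python
import re

# Assumed naming, taken from the before/after listing:
#   blk.<layer>.ffn_<kind>_exps.weight -> blk.<layer>.ffn_<kind>.<expert>.weight
FUSED_NAME = re.compile(r"^blk\.(\d+)\.ffn_(gate|up|down)_exps\.weight$")


def indexed_names(fused_name, n_experts):
    """Return the per-expert tensor names for one fused tensor name."""
    match = FUSED_NAME.match(fused_name)
    if match is None:
        raise ValueError(f"not a fused expert tensor: {fused_name}")
    layer, kind = match.groups()
    return [f"blk.{layer}.ffn_{kind}.{e}.weight" for e in range(n_experts)]


def slice_experts(fused_bytes, n_experts):
    """Split a fused payload into equal per-expert slices.

    Assumption: experts are the outermost dimension, so every expert is one
    contiguous byte range and no quantized block is cut in half.
    """
    if n_experts <= 0 or len(fused_bytes) % n_experts != 0:
        raise ValueError("payload does not divide evenly across experts")
    step = len(fused_bytes) // n_experts
    return [fused_bytes[i * step:(i + 1) * step] for i in range(n_experts)]


def defuse_tensor(fused_name, fused_bytes, n_experts):
    """Map one fused tensor to a dict of {indexed name: expert bytes}."""
    names = indexed_names(fused_name, n_experts)
    return dict(zip(names, slice_experts(fused_bytes, n_experts)))
```

Because the slices are copied byte for byte, joining them back together reproduces the fused payload, which is the sense in which size and quantization are preserved.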
Balanced Placement — whole experts on one card
Once your model is indexed, Tightwad bin-packs entire experts onto individual GPUs proportional to VRAM — no more shipping half an expert across the network every forward pass. It emits the llama.cpp --override-tensor flags for you. Just set moe_placement: balanced and start.
models:
  gpt-oss-120b:
    path: /models/gpt-oss-120b-indexed.gguf
    moe_placement: balanced # off | balanced | profile-guided
Tightwad emits at startup (one -ot regex per layer per device):
--override-tensor "^blk\.0\.ffn_(gate|up|down)\.(0|1|...)\.weight$=CUDA0"
--override-tensor "^blk\.0\.ffn_(gate|up|down)\.(...)\.weight$=RPC[...]"
...
- ✓ VRAM-proportional bin-packing of whole experts
- ✓ One -ot regex per (layer, device), narrowest first
- ✓ tightwad moe plan --emit-ot to preview flags before launch
- ✓ Falls back silently for fused models (with a pointer to defuse)
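The idea of VRAM-proportional placement can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Tightwad's packer: it uses simple largest-remainder rounding rather than a real bin-packing pass, the function names are hypothetical, and the device name `CUDA1` in the usage note is an example only. The regex shape mirrors the flags shown above.

```python
def split_experts(n_experts, vram_by_device):
    """Assign whole experts to devices in proportion to VRAM.

    Largest-remainder rounding keeps the counts summing to n_experts.
    vram_by_device is an ordered mapping of device name -> VRAM (any unit).
    """
    total = sum(vram_by_device.values())
    if total <= 0:
        raise ValueError("total VRAM must be positive")
    shares = {d: n_experts * v / total for d, v in vram_by_device.items()}
    counts = {d: int(s) for d, s in shares.items()}
    leftover = n_experts - sum(counts.values())
    # Hand the remaining experts to the devices with the largest fractions.
    by_fraction = sorted(shares, key=lambda d: shares[d] - counts[d], reverse=True)
    for d in by_fraction[:leftover]:
        counts[d] += 1
    # Deal out contiguous expert ids in device order.
    plan, next_id = {}, 0
    for d in vram_by_device:
        plan[d] = list(range(next_id, next_id + counts[d]))
        next_id += counts[d]
    return plan


def override_flags(layer, plan):
    """Build one --override-tensor argument per device for a single layer."""
    flags = []
    for device, experts in plan.items():
        if not experts:
            continue  # a device with no experts needs no flag
        ids = "|".join(str(e) for e in experts)
        pattern = rf"^blk\.{layer}\.ffn_(gate|up|down)\.({ids})\.weight$"
        flags.append(f'--override-tensor "{pattern}={device}"')
    return flags
```

For example, `split_experts(8, {"CUDA0": 24, "CUDA1": 8})` gives six experts to the larger card and two to the smaller one, and `override_flags(0, plan)` turns that plan into one flag per device for layer 0.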
Hot Experts On Your Fastest Card
MoE routing isn’t uniform — a handful of experts fire far more than the tail. Capture a profile from your real traffic and Tightwad pins the hot ones to your fastest device. Cold experts land on slow nodes. Same total VRAM, different routing topology. Needs an instrumented llama.cpp build — llama.cpp’s public API exposes no hook for expert-routing decisions.
$ tightwad moe profile --follow-coord --duration 300 \
-o ~/.tightwad/profile.json
$ tightwad moe summary ~/.tightwad/profile.json
Top hot experts
Layer  Expert  Hits
─────  ──────  ────
12     7       ...
12     88      ...
0      19      ...
$ # set moe_placement: profile-guided + moe_hot_profile, then restart
- ✓ Per-expert hit counts captured from real proxy traffic
- ✓ Hot experts pinned to the highest-scoring device
- ✓ Device scores auto-measured (TCP-RTT, 24h cache)
- ✓ Requires the instrumented llama.cpp build in scripts/patches/; without it, degrades to balanced
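A rough Python sketch of profile-guided pinning follows. It is a greedy illustration, not Tightwad's placement logic: the input shapes (`hits`, `device_scores`, `capacity`) and the function name are assumptions made for this example, and the on-disk profile format is not specified here.

```python
def place_hot_experts(hits, device_scores, capacity):
    """Greedy sketch: the hottest experts fill the highest-scoring device first.

    hits          -- {(layer, expert): hit count}
    device_scores -- {device: score}, higher means faster
    capacity      -- {device: number of experts that device can hold}
    """
    devices = sorted(device_scores, key=device_scores.get, reverse=True)
    if sum(capacity[d] for d in devices) < len(hits):
        raise ValueError("not enough capacity for every expert")
    ranked = sorted(hits, key=hits.get, reverse=True)
    placement, free = {}, dict(capacity)
    slot = 0
    for key in ranked:
        # Move on to the next-fastest device once the current one is full.
        while free[devices[slot]] == 0:
            slot += 1
        placement[key] = devices[slot]
        free[devices[slot]] -= 1
    return placement
```

Total capacity is unchanged by this step; only which experts sit on which device differs, which matches the "same total VRAM, different routing topology" framing above.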
MoE Benchmark — against any OpenAI-compatible target
Don’t guess — measure. tightwad moe bench streams per-prompt TTFT, direct vs proxy tok/s, rolling acceptance, and speedup against any OpenAI-compatible target. Point it at LM Studio, vLLM, or llama-server and dump the run to JSON. Pair expert-aware placement with the speculative proxy and you get sparse weights and sparse compute — the whole stack working together.
tightwad moe defuse gpt-oss-120b.gguf gpt-oss-120b-indexed.gguf
tightwad moe plan gpt-oss-120b-indexed.gguf --emit-ot # preview
tightwad moe profile --follow-coord --duration 300 # hot experts
tightwad moe summary ~/.tightwad/moe-profile.json # show top-K
tightwad moe bench --target-url http://<lmstudio>:1234 \
--target-model <model> --json bench.json
- ✓ Streams live TTFT + rolling acceptance + speedup
- ✓ Targets LM Studio, vLLM, or llama-server (OpenAI-compatible)
- ✓ Composes with Combined Mode (RPC pool) out of the box
- ✓ tightwad doctor catches MoE-on-dense-model and fused-without-defuse
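For readers who want to sanity-check the reported numbers, the metrics can be derived from token arrival timestamps roughly as below. This is a sketch under assumptions: the function names are hypothetical, and Tightwad's exact definitions (for instance, whether throughput includes the time to first token) may differ.

```python
def stream_metrics(request_sent, token_times):
    """Compute TTFT and throughput for one streamed response.

    request_sent -- timestamp in seconds when the request was sent
    token_times  -- timestamps in seconds at which each token arrived
    """
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_sent
    elapsed = token_times[-1] - request_sent
    # Assumption: throughput is measured over the whole request, TTFT included.
    tok_per_s = len(token_times) / elapsed if elapsed > 0 else float("inf")
    return {"ttft_s": ttft, "tok_per_s": tok_per_s}


def speedup(proxy_tok_per_s, direct_tok_per_s):
    """Ratio of proxy throughput to direct throughput (above 1 means faster)."""
    return proxy_tok_per_s / direct_tok_per_s
```

Running the same prompt set once directly and once through the proxy, then passing both throughputs to `speedup`, yields the ratio to compare across placement modes.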
One block of config
Index the GGUF once, point your model entry at it, pick a placement mode. off preserves the pre-0.5 behavior (layer-level tensor split only).
models:
  gpt-oss-120b:
    path: /models/gpt-oss-120b-indexed.gguf
    moe_placement: balanced # off | balanced | profile-guided
    # moe_hot_profile: ~/.tightwad/moe-profile.json # required for profile-guided
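The constraints in that block can be checked before launch with a few lines of Python. This is a hypothetical validator, not part of Tightwad: it assumes the entry has already been parsed into a dict, and that an unset moe_placement behaves like off.

```python
VALID_MODES = ("off", "balanced", "profile-guided")


def check_model_entry(name, entry):
    """Return a list of problems found in one model entry (empty if none)."""
    problems = []
    # Assumption: leaving moe_placement unset is equivalent to "off".
    mode = entry.get("moe_placement", "off")
    if mode not in VALID_MODES:
        problems.append(f"{name}: unknown moe_placement {mode!r}")
    if mode == "profile-guided" and not entry.get("moe_hot_profile"):
        problems.append(f"{name}: profile-guided needs moe_hot_profile")
    if "path" not in entry:
        problems.append(f"{name}: missing path")
    return problems
```

An entry that selects profile-guided without a moe_hot_profile path is reported here rather than discovered at startup.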