// v0.5 — MoE first-class · NEW

Run MoE models on home hardware

Mixture-of-Experts is the new normal — GPT-OSS 120B, Qwen3-MoE, DeepSeek-MoE, Mixtral. Everyone ships them; almost nobody places them properly across a junk-drawer GPU pool. Tightwad treats MoE as a first-class citizen: defuse the GGUF, bin-pack whole experts per card, and pin the hot ones to your fastest device.

See the four capabilities · MoE Support wiki

// first-class, not bolted-on

Four capabilities. All opt-in. All backward-compatible.

MoE support is experimental and largely untested on home hardware — we ship the plumbing, you bring the model. Leave moe_placement unset and nothing changes about your existing setup. Turn it on and Tightwad rewrites, packs, and profiles experts so they actually land where they should.

No invented numbers below — MoE placement is opt-in/experimental and we have no published home-hardware benchmarks for it yet. The capabilities are real; the throughput is yours to measure.

🧩 THE UNLOCK

GGUF Defusion — the feature that makes the rest possible

Most MoE GGUFs ship fused: one giant tensor per projection (gate, up, down) in each layer, covering every expert. llama.cpp can’t split a fused tensor per expert, so every MoE placement tool silently degrades to whole-layer splits. Tightwad rewrites the fused weights into indexed form by slicing along the expert dimension. Same quantization. Same size. Same tokens out.

  BEFORE (fused)                          AFTER (indexed)
  blk.0.ffn_gate_exps.weight  -->  blk.0.ffn_gate.0.weight
                                    blk.0.ffn_gate.1.weight
                                    blk.0.ffn_gate.2.weight
                                    ...
                                    blk.0.ffn_gate.N.weight

  Same bytes. Same quantization. One pass of disk I/O.

✓ tightwad moe defuse fused.gguf indexed.gguf
✓ Identical weights, identical quantization, identical output
✓ Output size equals input size — load it like the original
✓ Works on every known MoE layout today (GPT-OSS, Qwen3-MoE, DeepSeek-MoE, Mixtral)
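
To make the slicing idea concrete, here is a minimal sketch of splitting one fused tensor into per-expert tensors. It is not Tightwad's implementation: it assumes the expert axis is the outermost one, so each expert occupies one contiguous byte range, and it works on a raw buffer rather than a real GGUF file. The function name defuse_tensor and the toy 12-byte buffer are hypothetical.

```python
def defuse_tensor(name: str, data: bytes, n_experts: int) -> dict[str, bytes]:
    """Split a fused expert buffer into one buffer per expert.

    Assumption for this sketch: experts are laid out back to back in `data`,
    so slicing along the expert dimension is a plain byte-range slice and the
    quantized blocks inside each expert are left untouched.
    """
    if len(data) % n_experts:
        raise ValueError("buffer does not divide evenly across experts")
    stride = len(data) // n_experts
    # "blk.0.ffn_gate_exps.weight" -> ("blk.0.ffn_gate", "weight")
    prefix, suffix = name.rsplit("_exps.", 1)
    return {
        f"{prefix}.{e}.{suffix}": data[e * stride:(e + 1) * stride]
        for e in range(n_experts)
    }


# Toy buffer standing in for a fused tensor with 4 experts of 3 bytes each.
fused = bytes(range(12))
indexed = defuse_tensor("blk.0.ffn_gate_exps.weight", fused, 4)
```

Joining the slices back in order reproduces the original buffer byte for byte, which is the sense in which the indexed file carries the same weights at the same size.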

🎯 EXPERT PLACEMENT

Balanced Placement — whole experts on one card

Once your model is indexed, Tightwad bin-packs entire experts onto individual GPUs proportional to VRAM — no more shipping half an expert across the network every forward pass. It emits the llama.cpp --override-tensor flags for you. Just set moe_placement: balanced and start.

  models:
    gpt-oss-120b:
      path: /models/gpt-oss-120b-indexed.gguf
      moe_placement: balanced       # off | balanced | profile-guided

  Tightwad emits at startup (one -ot regex per layer per device):
    --override-tensor "^blk\.0\.ffn_(gate|up|down)\.(0|1|...)\.weight$=CUDA0"
    --override-tensor "^blk\.0\.ffn_(gate|up|down)\.(...)\.weight$=RPC[...]"
    ...

✓ VRAM-proportional bin-packing of whole experts
✓ One -ot regex per (layer, device) — narrowest first
✓ tightwad moe plan --emit-ot to preview flags before launch
✓ Falls back to layer-level splits for fused models, with a pointer to defuse
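
The placement step can be sketched roughly as follows. This is an illustration under stated assumptions, not Tightwad's planner: it assumes every expert in a layer is the same size, so packing reduces to handing each device a share of experts in proportion to its VRAM, and it formats one flag per (layer, device) in the shape shown above. The helper names pack_experts and override_flags, the device labels, and the VRAM figures are hypothetical.

```python
def pack_experts(n_experts: int, vram_gib: dict[str, float]) -> dict[str, list[int]]:
    """Assign whole experts to devices in proportion to VRAM.

    Uses largest-remainder rounding so every expert lands on exactly one device.
    """
    total = sum(vram_gib.values())
    quotas = {d: n_experts * v / total for d, v in vram_gib.items()}
    counts = {d: int(q) for d, q in quotas.items()}
    leftover = n_experts - sum(counts.values())
    # Give the remaining experts to the devices with the largest fractional share.
    by_remainder = sorted(quotas, key=lambda d: quotas[d] - counts[d], reverse=True)
    for d in by_remainder[:leftover]:
        counts[d] += 1
    plan, next_id = {}, 0
    for d, c in counts.items():
        plan[d] = list(range(next_id, next_id + c))
        next_id += c
    return plan


def override_flags(layer: int, plan: dict[str, list[int]]) -> list[str]:
    """Build one --override-tensor argument pair per device for a single layer."""
    flags = []
    for device, experts in plan.items():
        if not experts:
            continue
        ids = "|".join(str(e) for e in experts)
        pattern = rf"^blk\.{layer}\.ffn_(gate|up|down)\.({ids})\.weight$"
        flags += ["--override-tensor", f"{pattern}={device}"]
    return flags


# Hypothetical pool: one 24 GiB card and one 8 GiB card, 8 experts per layer.
plan = pack_experts(8, {"CUDA0": 24, "CUDA1": 8})
flags = override_flags(0, plan)
```

With these made-up sizes the larger card takes experts 0 through 5 and the smaller one takes 6 and 7. A real planner would also have to budget for attention weights, KV cache, and per-device overhead, which this sketch ignores.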

🔥 PROFILE-GUIDED

Hot Experts On Your Fastest Card · EXPERIMENTAL

MoE routing isn’t uniform: a handful of experts fire far more than the tail. Capture a profile from your real traffic and Tightwad pins the hot ones to your fastest device. Cold experts land on slow nodes. Same total VRAM, different expert placement. This needs an instrumented llama.cpp build, because llama.cpp’s public API exposes no hook for expert-routing decisions.

  $ tightwad moe profile --follow-coord --duration 300 \
        -o ~/.tightwad/moe-profile.json
  $ tightwad moe summary ~/.tightwad/moe-profile.json
       Top hot experts
       Layer  Expert  Hits
       ─────  ──────  ────
          12       7  ...
          12      88  ...
           0      19  ...
  $ # set moe_placement: profile-guided + moe_hot_profile, then restart

✓ Per-expert hit counts captured from real proxy traffic
✓ Hot experts pinned to the highest-scoring device
✓ Device scores auto-measured (TCP-RTT, 24h cache)
✓ Requires the instrumented llama.cpp build in scripts/patches/; without it, degrades to balanced
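
As a rough picture of what profile-guided placement does with those hit counts, the sketch below ranks experts by hits and fills the best-scoring device first. It is not Tightwad's algorithm or its profile format: the function pin_hot_experts, the "fast" and "slow" device labels, the scores, the per-device capacities, and the hit counts are all made up for illustration, and capacity is counted globally rather than per layer.

```python
from collections import Counter


def pin_hot_experts(hits, device_scores, capacity):
    """Greedily place the most-hit experts on the best-scoring device.

    hits:          {(layer, expert): hit_count}
    device_scores: {device: score}, higher means faster
    capacity:      {device: how many experts fit} (hypothetical unit)
    """
    ranked_devices = sorted(device_scores, key=device_scores.get, reverse=True)
    free = dict(capacity)
    placement = {}
    # most_common() yields experts from hottest to coldest.
    for key, _count in Counter(hits).most_common():
        device = next((d for d in ranked_devices if free.get(d, 0) > 0), None)
        if device is None:
            raise ValueError("not enough capacity for every expert")
        placement[key] = device
        free[device] -= 1
    return placement


# Synthetic hit counts, not measurements.
hits = {(12, 7): 900, (12, 88): 700, (0, 19): 650, (3, 4): 12}
placement = pin_hot_experts(
    hits,
    device_scores={"fast": 1.0, "slow": 0.2},
    capacity={"fast": 2, "slow": 2},
)
```

Here the two hottest experts land on the fast device and the remaining two spill to the slow one, which matches the hot-versus-cold split described above.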

📊 MEASURE IT

MoE Benchmark — against any OpenAI-compatible target

Don’t guess — measure. tightwad moe bench streams per-prompt TTFT, direct vs proxy tok/s, rolling acceptance, and speedup against any OpenAI-compatible target. Point it at LM Studio, vLLM, or llama-server and dump the run to JSON. Pair expert-aware placement with the speculative proxy and you get sparse weights and sparse compute — the whole stack working together.

  tightwad moe defuse  gpt-oss-120b.gguf gpt-oss-120b-indexed.gguf
  tightwad moe plan    gpt-oss-120b-indexed.gguf --emit-ot   # preview
  tightwad moe profile --follow-coord --duration 300         # hot experts
  tightwad moe summary ~/.tightwad/moe-profile.json          # show top-K
  tightwad moe bench   --target-url http://<lmstudio>:1234 \
                       --target-model <model> --json bench.json

✓ Streams live TTFT + rolling acceptance + speedup
✓ Targets LM Studio, vLLM, or llama-server (OpenAI-compatible)
✓ Composes with Combined Mode (RPC pool) out of the box
✓ tightwad doctor catches MoE-on-dense-model and fused-without-defuse
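
For readers who want to sanity-check the reported numbers, here is a small sketch of how TTFT, tok/s, and speedup can be derived from token arrival times. The definitions are assumptions for this example (tok/s is counted over the whole request, including time to first token) and may not match what tightwad moe bench computes. The function names and the timestamps are synthetic.

```python
def stream_metrics(request_sent: float, token_times: list[float]) -> dict[str, float]:
    """Derive TTFT and throughput from the arrival time of each streamed token.

    All times are in seconds on the same clock (e.g. time.monotonic()).
    """
    if not token_times:
        raise ValueError("no tokens were received")
    ttft = token_times[0] - request_sent
    elapsed = token_times[-1] - request_sent
    return {"ttft_s": ttft, "tok_per_s": len(token_times) / elapsed}


def speedup(direct: dict[str, float], proxy: dict[str, float]) -> float:
    """Ratio of proxy throughput to direct throughput; above 1.0 means the proxy is faster."""
    return proxy["tok_per_s"] / direct["tok_per_s"]


# Synthetic timestamps, not benchmark results.
direct = stream_metrics(0.0, [0.5, 1.0, 1.5, 2.0])
proxy = stream_metrics(0.0, [0.25, 0.5, 0.75, 1.0])
ratio = speedup(direct, proxy)
```

Whichever definition you settle on, apply the same one to the direct and proxy runs so the ratio stays meaningful.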

Reference config in the YAML below · tightwad doctor validates your placement before you start · Full guide on the wiki.

// drop it in cluster.yaml

One block of config

Index the GGUF once, point your model entry at it, pick a placement mode. off preserves the pre-0.5 behavior (layer-level tensor split only).

  models:
    gpt-oss-120b:
      path: /models/gpt-oss-120b-indexed.gguf
      moe_placement: balanced          # off | balanced | profile-guided
      # moe_hot_profile: ~/.tightwad/moe-profile.json   # required for profile-guided
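
The rules implied by that block can be written down as a short check. This is only a sketch of the constraints stated on this page (three valid modes, unset behaves like off, profile-guided needs moe_hot_profile), not what tightwad doctor actually runs. The function check_model_entry is hypothetical, and the entry is passed as an already-parsed dict to keep the example free of YAML dependencies.

```python
VALID_MODES = {"off", "balanced", "profile-guided"}


def check_model_entry(name: str, entry: dict) -> list[str]:
    """Return a list of human-readable problems for one model entry."""
    problems = []
    # An unset moe_placement is treated like "off": existing setups are unchanged.
    mode = entry.get("moe_placement", "off")
    if mode not in VALID_MODES:
        problems.append(f"{name}: unknown moe_placement {mode!r}")
    if mode == "profile-guided" and not entry.get("moe_hot_profile"):
        problems.append(f"{name}: profile-guided requires moe_hot_profile")
    return problems


ok = check_model_entry(
    "gpt-oss-120b",
    {"path": "/models/gpt-oss-120b-indexed.gguf", "moe_placement": "balanced"},
)
missing_profile = check_model_entry(
    "gpt-oss-120b",
    {"path": "/models/gpt-oss-120b-indexed.gguf", "moe_placement": "profile-guided"},
)
```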

Full guide: MoE-Support wiki · Source on GitHub · pip install tightwad