This document summarizes focused performance testing of Qwen3-Coder-Next-80B (Sparse MoE, MXFP4 GGUF) on Intel Core Ultra UMA systems using llama.cpp SYCL.
Testing was performed to identify batch size ceilings, throughput plateaus, and failure modes specific to large sparse-MoE models on Intel UMA — where memory topology and kernel launch constraints matter more than raw FLOPs.
- CPU: Intel Core Ultra 7 155H
- Memory: 96 GB DDR5 (UMA)
- Backend: llama.cpp SYCL (ARC iGPU / XPU)
- Threads: 20 (empirically optimal; leaves LP cores idle)
- KV Cache: q8_0 / q8_0
- Flash Attention: Enabled
- GPU Layers: 0 (--n-gpu-layers 0)
- Parallelism: 1 (--parallel 1)
- Cache RAM: -1 (adaptive)
Unless stated otherwise, all tests used:
--n-gpu-layers 0
--threads 20
--cache-type-k q8_0
--cache-type-v q8_0
--parallel 1
--cache-ram -1
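Combined into a single invocation, the baseline looks roughly like the sketch below. The choice of llama-server as the harness and the model filename are assumptions for illustration; the flags themselves are the ones listed above.

```bash
# Baseline sketch: the llama-server binary and model filename are illustrative.
# Flash attention was enabled for all runs; the exact -fa/--flash-attn syntax
# depends on the llama.cpp build.
./llama-server \
  -m qwen3-coder-next-80b-mxfp4.gguf \
  --n-gpu-layers 0 \
  --threads 20 \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --parallel 1 \
  --cache-ram -1
```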
- Prompt length: 1614 tokens (fixed)
- Measurement focus:
- Prompt ingestion latency
- Steady-state generation throughput
- Stability at increasing batch sizes
Batch (--batch-size) and micro-batch (--ubatch-size) were swept together to isolate upper-bound behavior.
Both CPU-first (ngl=0) and GPU-assisted runs were tested.
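A sweep of this shape can be reproduced with llama-bench, holding the 1614-token prompt fixed and raising batch and micro-batch together. The model filename and the 128-token generation length below are assumptions, not the exact values used in the original runs.

```bash
# Hypothetical batch/ubatch sweep with llama-bench; the model path and -n 128
# are illustrative. -p 1614 matches the fixed prompt length used in testing.
for B in 1024 1536 1550 2048; do
  ./llama-bench \
    -m qwen3-coder-next-80b-mxfp4.gguf \
    -t 20 -ngl 0 -fa 1 \
    -ctk q8_0 -ctv q8_0 \
    -p 1614 -n 128 \
    -b "$B" -ub "$B"
done
```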
| Batch / uBatch | Prompt TPS | Eval TPS | Outcome |
|---|---|---|---|
| 1024 / 1024 | ~20.3 t/s | ~4.11 t/s | Stable |
| 1536 / 1536 | ~25.9 t/s | ~4.07 t/s | Peak efficiency |
| 1550 / 1550 | ~26 t/s | ~4 t/s | Last stable point |
| 2048 / 1024 | — | — | Unstable / degraded |
| 2048 / 2048 | — | — | Failure / regression |
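In wall-clock terms, the fixed 1614-token prompt ingests in roughly 1614 / 20.3 ≈ 80 s at 1024/1024 versus 1614 / 25.9 ≈ 62 s at 1536/1536, while generation stays essentially flat at ~4 t/s; the larger batch buys prompt-ingestion time, not decode speed.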
- Performance improves cleanly up to ~1536–1550
- Beyond ~1550 tokens per batch, scaling collapses
- Larger batches do not increase throughput and frequently destabilize execution
This behavior was repeatable across runs.
The ~1550 batch ceiling behaves like a hard transport or scheduling limit, not a soft performance knee.
This behaves like an MTU-style constraint: an internal limit tied to some combination of:
- Vocabulary size
- Sparse MoE routing tables
- Quantized expert weight blocks
- KV cache addressing under flash-attention
Conceptually:
effective_batch_limit ≈ f(vocab, experts, quant_stride, routing)
Once this boundary is crossed:
- Kernel launches fragment
- Memory transactions spill
- Scheduler overhead dominates
- Throughput decreases instead of saturating
This is not a RAM exhaustion issue — the system still has headroom.
It is a structural execution limit for this model class on Intel UMA.
GPU-assisted tests (partial or heavy offload; an example invocation is sketched after this list):
- Did not exceed CPU-first throughput
- Introduced additional scheduling overhead
- Offered no benefit past ~1024 batch
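For reference, a partial-offload run of the kind compared above might look like the sketch below. The offload depth (-ngl 20), the model filename, and the use of ONEAPI_DEVICE_SELECTOR to pin the Arc iGPU are illustrative assumptions, not the exact configuration tested.

```bash
# Hypothetical partial-offload comparison run; -ngl 20, the model path, and
# the device selector value are illustrative.
ONEAPI_DEVICE_SELECTOR=level_zero:0 ./llama-bench \
  -m qwen3-coder-next-80b-mxfp4.gguf \
  -t 20 -ngl 20 -fa 1 \
  -ctk q8_0 -ctv q8_0 \
  -p 1614 -n 128 \
  -b 1024 -ub 1024
```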
For Qwen3-Coder-Next-80B:
- CPU-first execution with flash attention is the optimal mode
This is consistent with sparse-MoE behavior where:
- Experts thrash cache locality
- UMA coherence costs outweigh compute gains
Recommended settings:
--batch-size 1536
--ubatch-size 1536
--n-gpu-layers 0
This hits the maximum stable ingestion throughput without crossing the execution ceiling.
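As a copy-paste starting point, these settings merge with the baseline flags as follows (same caveats as the earlier sketch: the binary choice and model filename are assumptions):

```bash
# Recommended operating point from this testing: 1536/1536, CPU-first, FA enabled.
./llama-server \
  -m qwen3-coder-next-80b-mxfp4.gguf \
  --n-gpu-layers 0 --threads 20 \
  --batch-size 1536 --ubatch-size 1536 \
  --flash-attn on \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --parallel 1 --cache-ram -1
```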
Avoid:
- Batch sizes ≥1600
- Forcing GPU-heavy offload
- Oversized micro-batches assuming "more is better"
- 30B dense models scale smoothly to 2K+ batches
- 80B sparse-MoE models do not
- The ceiling is model-architecture-driven, not hardware-driven
This is why the 80B results must not reuse 30B tuning assumptions.
Qwen3-Coder-Next-80B has a real, measurable batch ceiling at ~1550 tokens on Intel UMA.
Pushing past it:
- Does not improve throughput
- Actively harms stability
- Masks the model's true efficiency window
Respecting this limit yields:
- Predictable performance
- Lower thermals
- Better long-run stability
- Cleaner system behavior under sustained coding workloads
For the same reason, global defaults of batch 2048 / micro-batch 2048 should not be carried over to this model: the 2048/2048 configuration regressed outright in testing, and defaults at or below 1536/1536 keep it inside its stable window.