Tuning llama.cpp Batch Sizes on Intel Arc Pro B60
llama.cpp ships with -b 2048 -ub 512 as the default. On a single-user Intel Arc Pro B60 (Battlemage G21, 24 GB) running the SYCL backend, that default leaves 25-30% of prompt processing throughput on the table for MoE models, and about 8% for dense ones.
The fix is one line: -b 4096 -ub 2048. The interesting part is why the knee sits there, and why pushing one notch further (-ub 4096) makes long-prompt MoE inference fall off a cliff.
TL;DR
For single-user, prompt-heavy workloads (coding agents, long-context Q&A) on an Intel Arc Pro B60 using SYCL:
- Use -b 4096 -ub 2048 for both Qwen3-4B (dense) and Qwen3.5-MoE 35B-A3B (with -ncmoe 4).
- vs. the default -b 2048 -ub 512: roughly +8% PP on the dense 4B, +25-30% PP on the MoE.
- Do not use -ub 4096 on the MoE: PP collapses relative to -ub 2048 (pp8192 at d=0: 377 -> 195 t/s; pp2048 at d=8k: 348 -> 57 t/s) with large variance, consistent with FA-scratch-buffer pressure.
What -b and -ub Actually Do
-b is the logical batch: the maximum number of prompt tokens llama.cpp will buffer before handing work to the backend. -ub is the physical chunk size: the number of tokens actually submitted to the GPU in one forward pass (one compute-graph evaluation, which consists of many kernel launches rather than one).
For a single inference stream, -b only matters insofar as -b >= -ub. There’s no benefit beyond that. -ub is what the GPU sees, and it’s what governs prompt-processing throughput. Larger chunks mean better GPU utilization — until the kernel saturates, or until Flash-Attention’s scratch buffer starts thrashing the allocator.
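The relationship between the two flags can be sketched in a few lines of Python. This is a simplified model rather than llama.cpp's actual scheduler: the function name split_prompt is hypothetical, and the sketch assumes a single stream whose prompt is cut into logical batches of at most -b tokens, each then cut into physical chunks of at most -ub tokens.

```python
def split_prompt(n_tokens: int, n_batch: int, n_ubatch: int) -> list[list[int]]:
    """Return, for each logical batch, the sizes of its physical chunks.

    Simplified single-stream model (an assumption, not llama.cpp source code).
    """
    if min(n_tokens, n_batch, n_ubatch) <= 0:
        raise ValueError("all sizes must be positive")
    # A physical chunk cannot be larger than the logical batch that holds it.
    n_ubatch = min(n_ubatch, n_batch)
    plan = []
    remaining = n_tokens
    while remaining > 0:
        batch = min(remaining, n_batch)
        remaining -= batch
        chunks = []
        while batch > 0:
            chunk = min(batch, n_ubatch)
            chunks.append(chunk)
            batch -= chunk
        plan.append(chunks)
    return plan


def count_chunks(plan: list[list[int]]) -> int:
    """Total number of physical chunks the backend would see."""
    return sum(len(chunks) for chunks in plan)


if __name__ == "__main__":
    # Default flags: an 8192-token prompt reaches the GPU as 16 chunks of 512.
    print(count_chunks(split_prompt(8192, 2048, 512)))   # 16
    # Recommended flags: the same prompt reaches the GPU as 4 chunks of 2048.
    print(count_chunks(split_prompt(8192, 4096, 2048)))  # 4
    # Raising -b alone does not change what the GPU sees.
    print(count_chunks(split_prompt(8192, 8192, 2048)))  # 4
```

In this model only -ub changes the chunk sizes; raising -b past -ub regroups the same chunks into fewer logical batches.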
For llama-server running K parallel slots, the picture changes: set -b = K * ub so each decode step can pack a full chunk per slot.
System Under Test
| Component | Spec |
|---|---|
| GPU | Intel Arc Pro B60 (Battlemage G21), 24 GB VRAM |
| GPU driver | Level Zero V2 1.14.37020, Compute Runtime (NEO) 26.05.037020 |
| CPU | AMD Ryzen 9 3950X (16 cores / 32 threads, boost 4.76 GHz) |
| RAM | 64 GB system memory |
| Kernel | Linux 6.19.13 |
| Toolchain | Intel oneAPI DPC++/C++ Compiler 2026.0.0 |
| llama.cpp | master at fa8feaed3 |
| Models | Qwen3-4B-Instruct-2507 UD-Q4_K_XL (2.37 GiB); Qwen3.5-MoE 35B-A3B UD-Q4_K_M (20.6 GiB) |
Hypotheses Going In
1. -ub governs PP throughput. Larger -ub improves GPU utilization until the kernel saturates or memory pressure kicks in.
2. -b only matters insofar as -b >= -ub. With one stream, no benefit beyond that.
3. Token generation is unaffected by either.
4. On fa-overhead-sycl, Flash-Attention scratch scales with -ub. There should be an upper edge of -ub past which allocator overhead and memory pressure dominate, and throughput regresses.
5. MoE with -ncmoe N should be more sensitive to -ub than dense, because every chunk pays expert-routing and CPU-offload overhead. Small chunks amortize that overhead across fewer tokens.
Methodology
Tool: llama-bench from master (fa8feaed3).
Per-model sweep matrix:
- -p {512, 2048, 8192}: short / medium / long prompts
- -ub {256, 512, 1024, 2048, 4096} with -b 4096 pinned (so -b never constrains -ub)
- -d {0, 8192}: cold context vs. mid-depth
- -n 0: pure PP, no TG
- -r 2
Common flags: -fa 1 -ctk q8_0 -ctv q8_0 -ngl 999. MoE adds -ncmoe 4 (expert weights of the first 4 layers kept on CPU; the full 35B model does not fit in 24 GB VRAM otherwise).
3 prompt sizes x 5 ubatch sizes x 2 depths = 30 cells per model, 60 cells total. llama-bench caches the depth-prefill across reps, so reported t/s is chunk throughput at that KV depth, not amortized over the prefill.
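As a rough illustration, the sweep matrix and the matching llama-bench invocation can be assembled like this. The flags are the ones listed above; the model path and the helper names (sweep_cells, bench_argv) are hypothetical, and the sketch assumes llama-bench expands comma-separated values into the full cross product.

```python
from itertools import product

PROMPTS = [512, 2048, 8192]
UBATCHES = [256, 512, 1024, 2048, 4096]
DEPTHS = [0, 8192]
COMMON = ["-fa", "1", "-ctk", "q8_0", "-ctv", "q8_0", "-ngl", "999"]


def sweep_cells() -> list[tuple[int, int, int]]:
    """Every (prompt, ubatch, depth) combination measured for one model."""
    return list(product(PROMPTS, UBATCHES, DEPTHS))


def bench_argv(model_path: str, extra: tuple[str, ...] = ()) -> list[str]:
    """Build one llama-bench command line covering the whole per-model sweep."""
    def csv(values):
        return ",".join(str(v) for v in values)

    return [
        "llama-bench", "-m", model_path,
        "-p", csv(PROMPTS),
        "-ub", csv(UBATCHES),
        "-b", "4096",          # pinned so -b never constrains -ub
        "-d", csv(DEPTHS),
        "-n", "0",             # pure PP, no TG
        "-r", "2",
        *COMMON, *extra,
    ]


if __name__ == "__main__":
    print(len(sweep_cells()))  # 30 cells per model
    # "models/moe.gguf" is a placeholder path, not the file used in the post.
    print(" ".join(bench_argv("models/moe.gguf", extra=("-ncmoe", "4"))))
```

Running the dense model is the same call without the extra -ncmoe 4 pair.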
Findings
Qwen3-4B UD-Q4_K_XL, dense
PP throughput (t/s):
| -ub | pp512 | pp2048 | pp8192 | pp512@d8k | pp2048@d8k | pp8192@d8k |
|---|---|---|---|---|---|---|
| 256 | 1039 | 847 | 484 | 302 | 284 | 227 |
| 512* | 1058 | 876 | 497 | 297 | 289 | 232 |
| 1024 | 1051 | 938 | 524 | 312 | 299 | 229 |
| 2048 | 1083 | 953 | 529 | 311 | 286 | 238 |
| 4096 | 1084 | 952 | 528 | 306 | 299 | 238 |
* default
- Knee at -ub 1024-2048. -ub 4096 adds nothing.
- -ub 512 -> 2048: +8.8% at pp2048 and +6.4% at pp8192 (d=0); near tie at d=8k (within noise).
- Hypothesis 4 does not fire here: the dense 4B appears to have comfortable FA-scratch headroom.
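The gains over the default can be recomputed from the dense table with a short script. The numbers are copied from the table above; the names DENSE_PP and gain_pct are only illustrative.

```python
# PP throughput (t/s) for the dense model, keyed by column and then by -ub.
DENSE_PP = {
    "pp512":      {256: 1039, 512: 1058, 1024: 1051, 2048: 1083, 4096: 1084},
    "pp2048":     {256: 847,  512: 876,  1024: 938,  2048: 953,  4096: 952},
    "pp8192":     {256: 484,  512: 497,  1024: 524,  2048: 529,  4096: 528},
    "pp512@d8k":  {256: 302,  512: 297,  1024: 312,  2048: 311,  4096: 306},
    "pp2048@d8k": {256: 284,  512: 289,  1024: 299,  2048: 286,  4096: 299},
    "pp8192@d8k": {256: 227,  512: 232,  1024: 229,  2048: 238,  4096: 238},
}


def gain_pct(column: str, base_ub: int = 512, new_ub: int = 2048) -> float:
    """Percent change in throughput when moving from base_ub to new_ub."""
    base = DENSE_PP[column][base_ub]
    return 100.0 * (DENSE_PP[column][new_ub] - base) / base


if __name__ == "__main__":
    for column in DENSE_PP:
        print(f"{column:>11}: {gain_pct(column):+.1f}%")
```

The d=0 columns at pp2048 and pp8192 carry the gain; the d=8k columns move by a few percent in either direction.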
Qwen3.5-MoE 35B-A3B Q4_K_M, -ncmoe 4
| -ub | pp512 | pp2048 | pp8192 | pp512@d8k | pp2048@d8k | pp8192@d8k |
|---|---|---|---|---|---|---|
| 256 | 166** | 253 | 249 | 239 | 229 | 223 |
| 512* | 315 | 304 | 296 | 269 | 269 | 276 |
| 1024 | 315 | 366 | 348 | 283 | 328 | 315 |
| 2048 | 317 | 396 | 377 | 277 | 348 | 342 |
| 4096 | 318 | 400 | 195*** | 279 | 57*** | 338 |
* default
** first-cell warmup variance (sigma=68); subsequent cells stable
*** unstable: collapse with very high variance
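- Knee at -ub 2048. The default -ub 512 leaves ~25-30% PP on the table for prompts of 2048 tokens or more; pp512 is flat from -ub 512 upward, since a 512-token prompt never fills a larger chunk.
- -ub 4096 is consistent with hypothesis 4: pp8192 at d=0 and pp2048 at d=8k collapse and variance blows up, although pp8192 at d=8k (338 t/s) does not. This is the buffer-pressure regime the FA-overhead work targets; the upper edge of -ub is where it bites.
- Hypothesis 5 is supported: MoE is roughly 3-4x more sensitive to -ub than dense (+30% vs +8% gain over default).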
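One way to make the word "knee" concrete is to take, per column, the smallest -ub whose throughput is within a tolerance of the best stable cell. The sketch below applies that rule to the MoE table; the 3% tolerance, the exclusion of the two *** cells, and the names MOE_PP, UNSTABLE and knee are assumptions made for illustration, not part of the original analysis.

```python
# PP throughput (t/s) for the MoE model, keyed by column and then by -ub.
MOE_PP = {
    "pp512":      {256: 166, 512: 315, 1024: 315, 2048: 317, 4096: 318},
    "pp2048":     {256: 253, 512: 304, 1024: 366, 2048: 396, 4096: 400},
    "pp8192":     {256: 249, 512: 296, 1024: 348, 2048: 377, 4096: 195},
    "pp512@d8k":  {256: 239, 512: 269, 1024: 283, 2048: 277, 4096: 279},
    "pp2048@d8k": {256: 229, 512: 269, 1024: 328, 2048: 348, 4096: 57},
    "pp8192@d8k": {256: 223, 512: 276, 1024: 315, 2048: 342, 4096: 338},
}

# Cells marked *** in the table: too noisy to use as a reference point.
UNSTABLE = {("pp8192", 4096), ("pp2048@d8k", 4096)}


def knee(column: str, tolerance: float = 0.03) -> int:
    """Smallest -ub within `tolerance` of the best stable throughput in a column."""
    cells = {
        ub: ts for ub, ts in MOE_PP[column].items()
        if (column, ub) not in UNSTABLE
    }
    best = max(cells.values())
    return min(ub for ub, ts in cells.items() if ts >= (1.0 - tolerance) * best)


if __name__ == "__main__":
    for column in MOE_PP:
        print(f"{column:>11}: knee at -ub {knee(column)}")
    # A single setting has to cover every column, so take the largest knee.
    print("overall:", max(knee(column) for column in MOE_PP))
```

With this rule, no column needs more than -ub 2048, and the short-prompt columns settle earlier.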
Cross-Cutting
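- -b 4096 was never the binding constraint at any tested -ub.
- TG was not measured (-n 0), so hypothesis 3 is untested here; TG is assumed to be independent of -b/-ub for one stream.
- d=0 vs d=8k roughly preserves the relative -ub ranking. Absolute throughput drops with KV depth, as expected: roughly 53-72% on the dense 4B and roughly 7-15% on the MoE (stable cells).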
Recommendation
- -b 4096 -ub 2048 for both models on the Arc Pro B60.
- For llama-server with K parallel slots, set -b = K * ub so each decode step can pack a full chunk per slot.
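The multi-slot rule is simple arithmetic, sketched here as a helper that emits the batch-related llama-server flags. The function name server_batch_flags is hypothetical, the default of 2048 is the -ub recommended above, and the K-slot sizing itself was not benchmarked in this post.

```python
def server_batch_flags(n_slots: int, n_ubatch: int = 2048) -> list[str]:
    """Batch-related llama-server flags for K parallel slots, using -b = K * ub."""
    if n_slots < 1 or n_ubatch < 1:
        raise ValueError("n_slots and n_ubatch must be at least 1")
    return [
        "-np", str(n_slots),             # number of parallel slots (K)
        "-b", str(n_slots * n_ubatch),   # logical batch: one full chunk per slot
        "-ub", str(n_ubatch),            # physical chunk size
    ]


if __name__ == "__main__":
    print(" ".join(server_batch_flags(4)))  # -np 4 -b 8192 -ub 2048
```

For K = 1 the rule gives -b 2048, which still satisfies -b >= -ub; the single-user recommendation of -b 4096 simply leaves extra headroom.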
Caveats
- 2 reps per cell. Most cells have <2% variance; the two triple-starred (***) MoE cells have >5x larger variance, so treat them as qualitative (“don’t go to 4096”), not precise.
- Only -d 0 and -d 8192 measured. Real agent contexts often run 16-64k+. Sweep -d separately for depth curves.
- All results are SYCL backend on Battlemage. CUDA, Vulkan, and Alchemist Arc cards may have different knees.
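With only two reps per cell, it helps to flag noisy cells mechanically before reading anything into them. The sketch below assumes per-cell records carrying avg_ts and stddev_ts, which is how llama-bench's JSON output is expected to name mean and standard deviation of tokens per second; check the field names against the build in use. The 5% threshold and the helper name unstable_cells are arbitrary choices for illustration.

```python
def unstable_cells(records: list[dict], threshold: float = 0.05) -> list[dict]:
    """Return records whose relative spread (stddev / mean) exceeds the threshold."""
    flagged = []
    for record in records:
        mean = record["avg_ts"]
        if mean <= 0:
            continue  # skip empty or failed cells rather than divide by zero
        if record["stddev_ts"] / mean > threshold:
            flagged.append(record)
    return flagged


if __name__ == "__main__":
    # The one cell whose spread is stated in the post: MoE, pp512, -ub 256, sigma=68.
    warmup_cell = {"n_prompt": 512, "n_ubatch": 256, "avg_ts": 166.0, "stddev_ts": 68.0}
    for cell in unstable_cells([warmup_cell]):
        spread = cell["stddev_ts"] / cell["avg_ts"]
        print(f"pp{cell['n_prompt']} -ub {cell['n_ubatch']}: spread {spread:.0%}")
```

Cells flagged this way are candidates for a rerun with a higher -r before drawing conclusions.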