Tuning llama.cpp Batch Sizes on Intel Arc Pro B60

2026-05-08

llama.cpp ships with -b 2048 -ub 512 as the default. On a single-user Intel Arc Pro B60 (Battlemage G21, 24 GB) running the SYCL backend, that default leaves 25-30% of prompt processing throughput on the table for MoE models, and about 8% for dense ones.

The fix is one line: -b 4096 -ub 2048. The interesting part is why the knee sits there, and why pushing one notch further (-ub 4096) makes long-prompt MoE inference fall off a cliff.

TL;DR

For single-user, prompt-heavy workloads (coding agents, long-context Q&A) on an Intel Arc Pro B60 using SYCL:

Use -b 4096 -ub 2048 for both Qwen3-4B (dense) and Qwen3.5-MoE 35B-A3B (with -ncmoe 4).
vs. the default -b 2048 -ub 512: roughly +8% PP on the dense 4B, +25-30% PP on the MoE.
Do not use -ub 4096 on the MoE — long-prompt PP collapses (377 -> 195 t/s, and 348 -> 57 t/s at d=8k) with large variance, consistent with FA-scratch-buffer pressure.

What `-b` and `-ub` Actually Do

-b is the logical batch: the maximum number of prompt tokens llama.cpp will buffer before handing work to the backend. -ub is the physical chunk size: the number of tokens actually submitted to the GPU in one kernel launch.

For a single inference stream, -b only matters insofar as -b >= -ub. There’s no benefit beyond that. -ub is what the GPU sees, and it’s what governs prompt-processing throughput. Larger chunks mean better GPU utilization — until the kernel saturates, or until Flash-Attention’s scratch buffer starts thrashing the allocator.

For llama-server running K parallel slots, the picture changes: set -b = K * ub so each decode step can pack a full chunk per slot.

System Under Test

Component	Spec
GPU	Intel Arc Pro B60 (Battlemage G21), 24 GB VRAM
GPU driver	Level Zero V2 1.14.37020, Compute Runtime (NEO) 26.05.037020
CPU	AMD Ryzen 9 3950X (16 cores / 32 threads, boost 4.76 GHz)
RAM	64 GB system memory
Kernel	Linux 6.19.13
Toolchain	Intel oneAPI DPC++/C++ Compiler 2026.0.0
llama.cpp	`master` at `fa8feaed3`
Models	Qwen3-4B-Instruct-2507 UD-Q4_K_XL (2.37 GiB); Qwen3.5-MoE 35B-A3B UD-Q4_K_M (20.6 GiB)

Hypotheses Going In

-ub governs PP throughput. Larger -ub improves GPU utilization until the kernel saturates or memory pressure kicks in.
-b only matters insofar as -b >= -ub. With one stream, no benefit beyond that.
Token generation is unaffected by either.
On fa-overhead-sycl, Flash-Attention scratch scales with -ub. There should be an upper edge of -ub past which allocator overhead and memory pressure dominate, and throughput regresses.
MoE with -ncmoe N should be more sensitive to -ub than dense, because every chunk pays expert-routing and CPU-offload overhead. Small chunks amortize that overhead across fewer tokens.

Methodology

Tool: llama-bench from master (fa8feaed3).

Per-model sweep matrix:

-p {512, 2048, 8192} — short / medium / long prompts
-ub {256, 512, 1024, 2048, 4096} with -b 4096 pinned (so -b never constrains -ub)
-d {0, 8192} — cold context vs. mid-depth
-n 0 — pure PP, no TG
-r 2

Common flags: -fa 1 -ctk q8_0 -ctv q8_0 -ngl 999. MoE adds -ncmoe 4 (top 4 layers of expert weights kept on CPU; 24 GB VRAM does not fit the full 35B model otherwise).

3 prompt sizes x 5 ubatch sizes x 2 depths = 30 cells per model, 60 cells total. llama-bench caches the depth-prefill across reps, so reported t/s is chunk throughput at that KV depth, not amortized over the prefill.

Findings

Qwen3-4B Q4_K_M, dense

PP throughput (t/s):

-ub	pp512	pp2048	pp8192	pp512@d8k	pp2048@d8k	pp8192@d8k
256	1039	847	484	302	284	227
512*	1058	876	497	297	289	232
1024	1051	938	524	312	299	229
2048	1083	953	529	311	286	238
4096	1084	952	528	306	299	238

* default

Knee at -ub 1024-2048. -ub 4096 adds nothing.
-ub 512 -> 2048: +6-9% on long PP at d=0; near tie at d=8k (within noise).
Hypothesis 4 does not fire here — dense 4B has comfortable FA-scratch headroom.

Qwen3.5-MoE 35B-A3B Q4_K_M, `-ncmoe 4`

-ub	pp512	pp2048	pp8192	pp512@d8k	pp2048@d8k	pp8192@d8k
256	166**	253	249	239	229	223
512*	315	304	296	269	269	276
1024	315	366	348	283	328	315
2048	317	396	377	277	348	342
4096	318	400	195***	279	57***	338

* default ** first-cell warmup variance (sigma=68); subsequent cells stable *** unstable: collapse with very high variance

Knee at -ub 2048. Default -ub 512 leaves ~25-30% PP on the table.
-ub 4096 confirms hypothesis 4: long-prompt PP collapses, variance blows up. This is the buffer-pressure regime the FA-overhead work targets — the upper edge of -ub is where it bites.
Hypothesis 5 confirmed: MoE is ~3x more sensitive to -ub than dense (+30% vs +8% gain over default).

Cross-Cutting

-b 4096 was never the binding constraint at any tested -ub.
TG not measured; independent of -b/-ub for one stream.
d=0 vs d=8k preserves the relative -ub ranking; absolute throughput drops 30-50% with KV depth, as expected.

Recommendation

-b 4096 -ub 2048 for both models on the Arc Pro B60.
For llama-server with K parallel slots, set -b = K * ub so each decode step can pack a full chunk per slot.

Caveats

2 reps per cell. Most cells have <2% variance; the two starred MoE cells have >5x larger variance — treat them as qualitative (“don’t go to 4096”), not precise.
Only -d 0 and -d 8192 measured. Real agent contexts often run 16-64k+. Sweep -d separately for depth curves.
All results are SYCL backend on Battlemage. CUDA, Vulkan, and Alchemist Arc cards will have different knees.