Tuning llama.cpp Batch Sizes on Intel Arc Pro B60

llama.cpp ships with -b 2048 -ub 512 as the default. On a single-user Intel Arc Pro B60 (Battlemage G21, 24 GB) running the SYCL backend, that default leaves 25-30% of prompt processing throughput on the table for MoE models, and about 8% for dense ones.

The fix is one line: -b 4096 -ub 2048. The interesting part is why the knee sits there, and why pushing one notch further (-ub 4096) makes long-prompt MoE inference fall off a cliff.

TL;DR

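For single-user, prompt-heavy workloads (coding agents, long-context Q&A) on an Intel Arc Pro B60 using SYCL, run with -b 4096 -ub 2048 instead of the defaults, and do not go up to -ub 4096: it collapses long-prompt MoE throughput.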

What -b and -ub Actually Do

-b is the logical batch: the maximum number of prompt tokens llama.cpp will buffer before handing work to the backend. -ub is the physical chunk (micro-batch) size: the number of tokens actually submitted to the GPU in one compute-graph evaluation, which in turn issues many kernel launches.

For a single inference stream, -b only matters insofar as -b >= -ub. There’s no benefit beyond that. -ub is what the GPU sees, and it’s what governs prompt-processing throughput. Larger chunks mean better GPU utilization — until the kernel saturates, or until Flash-Attention’s scratch buffer starts thrashing the allocator.
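The relationship between the two flags can be sketched in a few lines of Python. This is a simplified model of the splitting described above, not llama.cpp's actual scheduler; the function names are made up for illustration, and clamping -ub to -b is an assumption about how an oversized -ub is handled.

```python
def split_prompt(n_prompt: int, n_batch: int, n_ubatch: int) -> list[list[int]]:
    """For each logical batch, list the sizes of the physical chunks it becomes."""
    # Assumption: a physical chunk can never be larger than the logical batch.
    n_ubatch = min(n_ubatch, n_batch)
    plan = []
    for start in range(0, n_prompt, n_batch):
        batch = min(n_batch, n_prompt - start)
        chunks = [min(n_ubatch, batch - off) for off in range(0, batch, n_ubatch)]
        plan.append(chunks)
    return plan


def n_gpu_submissions(n_prompt: int, n_batch: int, n_ubatch: int) -> int:
    """Total number of physical chunks the GPU sees for one prompt."""
    return sum(len(chunks) for chunks in split_prompt(n_prompt, n_batch, n_ubatch))


if __name__ == "__main__":
    for b, ub in [(2048, 512), (4096, 2048), (8192, 2048)]:
        n = n_gpu_submissions(8192, b, ub)
        print(f"-b {b} -ub {ub}: {n} chunks for an 8192-token prompt")
```

In this model, raising -b from 4096 to 8192 at -ub 2048 leaves the chunk count unchanged, which is the single-stream argument for treating -ub as the knob that matters.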

For llama-server running K parallel slots, the picture changes: set -b = K * ub so each decode step can pack a full chunk per slot.

System Under Test

Component    Spec
GPU          Intel Arc Pro B60 (Battlemage G21), 24 GB VRAM
GPU driver   Level Zero V2 1.14.37020, Compute Runtime (NEO) 26.05.037020
CPU          AMD Ryzen 9 3950X (16 cores / 32 threads, boost 4.76 GHz)
RAM          64 GB system memory
Kernel       Linux 6.19.13
Toolchain    Intel oneAPI DPC++/C++ Compiler 2026.0.0
llama.cpp    master at fa8feaed3
Models       Qwen3-4B-Instruct-2507 UD-Q4_K_XL (2.37 GiB); Qwen3.5-MoE 35B-A3B UD-Q4_K_M (20.6 GiB)

Hypotheses Going In

  1. -ub governs PP throughput. Larger -ub improves GPU utilization until the kernel saturates or memory pressure kicks in.
  2. -b only matters insofar as -b >= -ub. With one stream, no benefit beyond that.
  3. Token generation is unaffected by either.
  4. On fa-overhead-sycl, Flash-Attention scratch scales with -ub. There should be an upper edge of -ub past which allocator overhead and memory pressure dominate, and throughput regresses.
  5. MoE with -ncmoe N should be more sensitive to -ub than dense, because every chunk pays expert-routing and CPU-offload overhead. Small chunks amortize that overhead across fewer tokens.
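
Hypothesis 5 can be made concrete with a toy cost model: each chunk pays a fixed overhead (routing, CPU-offload transfers) plus a per-token cost. The sketch below is only an illustration of the amortization argument. The function name and the constants are invented, not measured, and the model deliberately ignores the regression predicted by hypothesis 4.

```python
import math


def pp_throughput(n_prompt: int, n_ubatch: int,
                  per_chunk_overhead_s: float, per_token_s: float) -> float:
    """Tokens per second under a fixed-overhead-per-chunk cost model."""
    n_chunks = math.ceil(n_prompt / n_ubatch)
    total_s = n_chunks * per_chunk_overhead_s + n_prompt * per_token_s
    return n_prompt / total_s


if __name__ == "__main__":
    # Illustrative constants only; they are not fitted to any measurement.
    overhead_s, token_s = 0.5, 0.002
    for ub in (256, 512, 1024, 2048, 4096):
        tps = pp_throughput(8192, ub, overhead_s, token_s)
        print(f"-ub {ub:5d}: {tps:7.1f} t/s (toy model)")
```

With zero per-chunk overhead the model is flat in -ub; the larger the overhead, the more a small -ub costs. That is the sense in which MoE with offloaded experts is expected to be more sensitive than dense.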

Methodology

Tool: llama-bench from master (fa8feaed3).

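Per-model sweep matrix:

Prompt size (-p)    512, 2048, 8192
Micro-batch (-ub)   256, 512, 1024, 2048, 4096
KV depth (-d)       0, 8k (the @d8k columns)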

Common flags: -fa 1 -ctk q8_0 -ctv q8_0 -ngl 999. MoE adds -ncmoe 4 (expert weights of the first 4 layers kept on CPU; the full 35B model does not fit in 24 GB VRAM otherwise).

3 prompt sizes x 5 ubatch sizes x 2 depths = 30 cells per model, 60 cells total. llama-bench caches the depth-prefill across reps, so reported t/s is chunk throughput at that KV depth, not amortized over the prefill.
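
As a sanity check on the cell count, the matrix can be enumerated in Python and turned into a llama-bench argument list. This is a sketch under assumptions: the model paths are hypothetical, the depth of 8192 is inferred from the @d8k column labels, and the -b value used in the sweep is not stated above, so it is left out.

```python
from itertools import product

PROMPT_SIZES = [512, 2048, 8192]
UBATCH_SIZES = [256, 512, 1024, 2048, 4096]
DEPTHS = [0, 8192]  # assumption: "d8k" means a KV depth of 8192 tokens

COMMON_FLAGS = ["-fa", "1", "-ctk", "q8_0", "-ctv", "q8_0", "-ngl", "999"]

# Hypothetical file names; substitute the real GGUF paths.
MODELS = {
    "dense": ("models/qwen3-4b.gguf", []),
    "moe": ("models/qwen3.5-moe-35b-a3b.gguf", ["-ncmoe", "4"]),
}


def cells() -> list[tuple[int, int, int]]:
    """Every (prompt size, ubatch, depth) combination for one model."""
    return list(product(PROMPT_SIZES, UBATCH_SIZES, DEPTHS))


def bench_argv(model_path: str, extra_flags: list[str]) -> list[str]:
    """Build one llama-bench invocation that sweeps the whole matrix."""
    def csv(values):
        return ",".join(str(v) for v in values)

    return ["llama-bench", "-m", model_path, *COMMON_FLAGS, *extra_flags,
            "-p", csv(PROMPT_SIZES), "-ub", csv(UBATCH_SIZES), "-d", csv(DEPTHS)]


if __name__ == "__main__":
    print(f"{len(cells())} cells per model, {len(cells()) * len(MODELS)} total")
    for name, (path, extra) in MODELS.items():
        print(name, " ".join(bench_argv(path, extra)))
```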

Findings

Qwen3-4B Q4_K_M, dense

PP throughput (t/s):

-ub    pp512  pp2048  pp8192  pp512@d8k  pp2048@d8k  pp8192@d8k
256     1039     847     484        302         284         227
512*    1058     876     497        297         289         232
1024    1051     938     524        312         299         229
2048    1083     953     529        311         286         238
4096    1084     952     528        306         299         238

* default
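
To read the dense table as relative gains rather than raw t/s, a small helper can compare any -ub row against the default row. The numbers are copied from the table above; the helper name is made up for illustration.

```python
COLUMNS = ["pp512", "pp2048", "pp8192", "pp512@d8k", "pp2048@d8k", "pp8192@d8k"]

# Dense PP throughput in t/s, keyed by -ub, copied from the table above.
DENSE_PP = {
    256:  [1039, 847, 484, 302, 284, 227],
    512:  [1058, 876, 497, 297, 289, 232],
    1024: [1051, 938, 524, 312, 299, 229],
    2048: [1083, 953, 529, 311, 286, 238],
    4096: [1084, 952, 528, 306, 299, 238],
}


def gain_over_default(table: dict, ub: int, default_ub: int = 512) -> dict:
    """Percentage change of each column at `ub` relative to the default -ub."""
    return {
        col: round(100.0 * (new / base - 1.0), 1)
        for col, new, base in zip(COLUMNS, table[ub], table[default_ub])
    }


if __name__ == "__main__":
    for col, pct in gain_over_default(DENSE_PP, 2048).items():
        print(f"{col:>11}: {pct:+.1f}%")
```

The pp2048 column is where the roughly 8% dense figure comes from; the other columns gain less, and pp2048@d8k is slightly negative.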

Qwen3.5-MoE 35B-A3B Q4_K_M, -ncmoe 4

PP throughput (t/s):

-ub    pp512  pp2048  pp8192  pp512@d8k  pp2048@d8k  pp8192@d8k
256    166**     253     249        239         229         223
512*     315     304     296        269         269         276
1024     315     366     348        283         328         315
2048     317     396     377        277         348         342
4096     318     400  195***        279       57***         338

* default
** first-cell warmup variance (sigma=68); subsequent cells stable
*** unstable: collapse with very high variance
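
One way to locate the cliff mechanically is to flag any cell where stepping up one -ub notch loses more than some fraction of throughput. The sketch below uses the MoE numbers from the table above; the 20% threshold and the function names are arbitrary choices for illustration, not part of the original analysis.

```python
UBATCHES = [256, 512, 1024, 2048, 4096]
COLUMNS = ["pp512", "pp2048", "pp8192", "pp512@d8k", "pp2048@d8k", "pp8192@d8k"]

# MoE PP throughput in t/s, keyed by -ub, copied from the table above.
MOE_PP = {
    256:  [166, 253, 249, 239, 229, 223],
    512:  [315, 304, 296, 269, 269, 276],
    1024: [315, 366, 348, 283, 328, 315],
    2048: [317, 396, 377, 277, 348, 342],
    4096: [318, 400, 195, 279, 57, 338],
}


def find_collapses(table: dict, max_drop: float = 0.20) -> list[tuple]:
    """Cells where the next -ub step loses more than `max_drop` of throughput."""
    collapses = []
    for prev_ub, next_ub in zip(UBATCHES, UBATCHES[1:]):
        for i, col in enumerate(COLUMNS):
            before, after = table[prev_ub][i], table[next_ub][i]
            if after < before * (1.0 - max_drop):
                collapses.append((col, prev_ub, next_ub, before, after))
    return collapses


def largest_stable_ub(table: dict, max_drop: float = 0.20) -> int:
    """Largest -ub reachable from the smallest one without hitting a collapse."""
    bad = {next_ub for _, _, next_ub, _, _ in find_collapses(table, max_drop)}
    stable = UBATCHES[0]
    for ub in UBATCHES[1:]:
        if ub in bad:
            break
        stable = ub
    return stable


if __name__ == "__main__":
    for col, prev_ub, next_ub, before, after in find_collapses(MOE_PP):
        print(f"{col}: -ub {prev_ub} -> {next_ub} drops {before} -> {after} t/s")
    print("largest stable -ub:", largest_stable_ub(MOE_PP))
```

A fixed threshold is a blunt tool given that the collapsed cells are also the high-variance ones, but it is enough to separate the two *** cells from ordinary run-to-run noise in this table.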

Cross-Cutting

Recommendation

Caveats