A SYCL iGPU in a CUDA Box: Great for Building and Testing, Useless for Inference

This post documents a single llama.cpp binary that registers both an RTX 5090 (CUDA) and an Intel UHD 770 iGPU (SYCL) as ggml device backends, exposes CUDA0 and SYCL0 side by side, and runs inference on either.

The point of this is not to make the two GPUs compute together. It is that the iGPU is already in the box, costs nothing, and gives you a real SYCL device to build and smoke-test the SYCL backend against without owning a discrete Intel card. That is where it earns its place. For actual inference it contributes nothing—the rest of this post is the benchmarks that show why.


Bottom line

The combined build works. Split the toolchains—icpx compiles everything and does the final link, nvcc compiles only the .cu files with g++ forced as its host compiler—and one binary enumerates both devices and runs inference on each standalone. ggml has no mutual-exclusion guard between backends; the only real risk, whether icpx could link nvcc-produced objects, did not bite.

Combining them for compute, though, almost never pays:

The one place it could earn its keep at inference time is load-enabling a model too large for 32 GB, via a minimal layer-split. That is the open thread at the end. Its real value, though, is upstream of all this: a free, always-present SYCL device to build and test the backend on.

One caveat colors every number below: the 5090 ran power-capped at 400 W (stock 575 W, max 600 W). That throttles prefill more than decode, so the CUDA/SYCL ratios here are conservative—at full power the 5090’s lead only widens, and every conclusion holds a fortiori.


Hypothesis

ggml registers each backend independently - the CUDA and SYCL backends are separate subdirectories keyed only on their own GGML_* option, with no mutual-exclusion guard. So a single binary should be able to carry both and expose CUDA0 and SYCL0 side by side.

The only real obstacle is the compiler, not ggml:

Resolution: split the toolchains. icpx compiles everything (ggml core, llama, the SYCL backend) and does the final link; nvcc compiles only the .cu files, with g++ forced as its host compiler via -DCMAKE_CUDA_HOST_COMPILER=g++ (threaded through as -ccbin at ggml-cuda/CMakeLists.txt:219). icpx defaults to GCC’s libstdc++, so the objects are ABI-compatible at link time.
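As a rough sketch of that split (not the actual build-both.sh, which is not reproduced here), the configure step might look like the following. The icx/icpx pair and an already-sourced oneAPI environment are assumptions; GGML_SYCL and GGML_CUDA are the usual llama.cpp backend switches, and CMAKE_CUDA_HOST_COMPILER is the option named above. The script only prints the command unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch only: compose the configure command for a combined CUDA + SYCL build.
# Assumes icx/icpx (oneAPI) and the CUDA toolkit are already on PATH.
configure_cmd() {
    printf '%s ' cmake -B build-both \
        -DGGML_SYCL=ON -DGGML_CUDA=ON \
        -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx \
        -DCMAKE_CUDA_HOST_COMPILER=g++
    printf '\n'
}

configure_cmd                       # dry run: show what would be executed
if [ "${RUN:-0}" = "1" ]; then
    eval "$(configure_cmd)"         # actually configure only when RUN=1
fi
```

Keeping the host compiler pinned to g++ is what stops nvcc from trying to use icpx as its host compiler.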

The untested seam was whether icpx could link nvcc-produced CUDA objects.

Process

Built from upstream llama.cpp, tag b9660. build-both.sh is build-sycl.sh reduced to a single combined target:

Findings

Build and link succeeded (~48 min cold, ccache-incremental after). The icpx-links-nvcc-objects seam did not bite. Both devices enumerate:

Available devices:
  CUDA0: NVIDIA GeForce RTX 5090 (32109 MiB, 31585 MiB free)
  SYCL0: Intel(R) UHD Graphics 770 (59281 MiB, 16205 MiB free)

Each device runs inference standalone in the same binary.

iGPU gotchas (none are build defects)

  1. SYCL_CACHE_PERSISTENT (env defaults it on) trips a known UHD 770 IGC crash. Must be 0 for any SYCL run.
  2. GGML_SYCL_DISABLE_DNN=1 needed (oneDNN path).
  3. The UHD 770 cannot JIT FLASH_ATTN_EXT, so any SYCL run needs -fa off with f16 KV (quantized KV needs FA).
  4. With both GPUs visible and no -dev pin, auto device-fit spills layers onto the iGPU; combined with -fa on, the process crashes with a core dump. Always pin -dev for combined runs.

iGPU drive recipe:

SYCL_CACHE_PERSISTENT=0 GGML_SYCL_DISABLE_DNN=1 \
  llama-... -dev SYCL0 -fa off -ctk f16 -ctv f16 -ngl 99 ...
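
One way to make the "always pin -dev" rule hard to forget is a small launcher guard. This is a minimal sketch, not part of the build: has_dev_pin and run_pinned are hypothetical helper names, and the guard only checks that -dev (or its long form --device) appears somewhere in the argument list.

```shell
#!/bin/sh
# Sketch: refuse to launch unless the caller pinned a device with -dev/--device.
has_dev_pin() {
    for arg in "$@"; do
        case "$arg" in
            -dev|--device) return 0 ;;
        esac
    done
    return 1
}

# Hypothetical wrapper: run_pinned <llama binary> <args...>
# Sets the two SYCL environment variables from the recipe for every run.
run_pinned() {
    if ! has_dev_pin "$@"; then
        echo "refusing to run: pass -dev CUDA0 or -dev SYCL0" >&2
        return 2
    fi
    SYCL_CACHE_PERSISTENT=0 GGML_SYCL_DISABLE_DNN=1 "$@"
}
```

The wrapper does not add -fa off or the f16 KV flags itself; those still belong on the command line for any SYCL0 run.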

Results: split-mode bench

Model: Qwen3-4B Q4_K_M (2.37 GiB, small-model.gguf), -fa 0, 3 repetitions (-r 3). Device indices: CUDA0=0, SYCL0=1.

Caveat: the 5090 was power-capped at 400 W (default 575 W, max 600 W). That throttles prefill (compute-bound) more than decode (bandwidth-bound; VRAM clocks are barely power-gated), so the CUDA0 pp512 here sits below the chip’s full-power ceiling while tg128 is near-true. This makes the iGPU look less bad than it is: at stock power the 5090’s prefill lead widens, so every conclusion below holds and the CUDA/SYCL ratios are conservative.

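| split-mode | tensor-split (CUDA/SYCL)       | pp512 t/s        | tg128 t/s |
|------------|--------------------------------|------------------|-----------|
| none       | CUDA0 only (-mg 0)             | 16574            | 324       |
| none       | SYCL0 only (-mg 1)             | 82               | 5.0       |
| layer      | 99/1 (rounds to 0 iGPU layers) | 16591            | 324       |
| layer      | 9/1 (~10% on iGPU)             | 1774             | 32.8      |
| layer      | 5/5                            | 179              | 10.5      |
| row        | 9/1, 5/5                       | GGML_ABORT       | -         |
| tensor     | 9/1, 5/5                       | ctx create fails | -         |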

Reading it

Commands

# baselines (both single-device)
SYCL_CACHE_PERSISTENT=0 GGML_SYCL_DISABLE_DNN=1 \
  ./build-both/bin/llama-bench -m small-model.gguf -fa 0 -sm none -mg 0,1 -r 3

# layer split sweep
SYCL_CACHE_PERSISTENT=0 GGML_SYCL_DISABLE_DNN=1 \
  ./build-both/bin/llama-bench -m small-model.gguf -fa 0 -sm layer -ts 99/1,9/1,5/5 -r 3
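
The slowdown factors implied by the results table can be recomputed from its raw t/s figures. The sketch below simply divides the CUDA0-only baseline by each measured value; slowdown is a hypothetical helper name.

```shell
#!/bin/sh
# Slowdown of a configuration relative to the CUDA0-only baseline.
slowdown() {  # slowdown <baseline t/s> <measured t/s>
    awk -v b="$1" -v m="$2" 'BEGIN { printf "%.1f\n", b / m }'
}

echo "SYCL0 only: pp512 $(slowdown 16574 82)x, tg128 $(slowdown 324 5.0)x"
echo "layer 9/1:  pp512 $(slowdown 16574 1774)x, tg128 $(slowdown 324 32.8)x"
echo "layer 5/5:  pp512 $(slowdown 16574 179)x, tg128 $(slowdown 324 10.5)x"
```

Even the 9/1 split, with only about a tenth of the layers on the iGPU, costs roughly 9x on prefill and 10x on decode.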

Power efficiency: is the slow iGPU at least greener?

A fair question, since the UHD 770 is slow but sips power. To rank the three compute paths the model can land on, the whole model was run on each device and power sampled during a sustained decode. CPU/iGPU draw is Intel RAPL (package-0); the 5090 is nvidia-smi board power. On this Raptor Lake desktop the RAPL uncore domain is the iGPU power plane (PP1), so the iGPU is measured directly, not by differential. Idle package = 14.6 W.

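| device        | tok/s | power       | tok/J | energy / 1k tok |
|---------------|-------|-------------|-------|-----------------|
| 5090 (CUDA0)  | 301   | 400 W board | 0.75  | 0.37 Wh         |
| CPU (28 core) | 16.6  | 153 W pkg   | 0.108 | 2.6 Wh          |
| iGPU (770)    | 4.67  | 64 W pkg    | 0.073 | 3.8 Wh          |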

The intuition is half right, and the wrong half is the finding. At the silicon level the iGPU is frugal: its own power plane drew only 10.3 W for those 4.67 tok/s (0.45 tok/J, ~4x the CPU). But during that same run the RAPL core domain sat at 42.9 W - ~37 W over idle, ~1.5 cores pinned - because the SYCL/Level-Zero host driver busy-waits on queue completion instead of blocking. So the system spends 64 W to produce 4.67 tok/s, and the honest tok/J (0.073) is worse than the CPU’s (0.108). The iGPU is efficient; the way it is driven is not, and the CPU tax inverts the result.

The 5090 wins efficiency outright, not just speed: it pins the 400 W cap (even decoding this 4B - it never throttles down) but finishes ~65x faster than the iGPU, so race-to-idle gives it ~0.75 tok/J, ~7x the CPU and ~10x the iGPU-as-deployed. Fast is green.
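
The tok/J and energy columns follow directly from the measured throughput and power: tok/J is tok/s divided by watts, and Wh per 1000 tokens is 1000 x W / (tok/s) / 3600. The sketch below recomputes both from the table's inputs; the helper names are illustrative.

```shell
#!/bin/sh
# tok/J = (tok/s) / W ; Wh per 1000 tokens = 1000 * W / (tok/s) / 3600
tok_per_joule() { awk -v t="$1" -v w="$2" 'BEGIN { printf "%.3f\n", t / w }'; }
wh_per_ktok()   { awk -v t="$1" -v w="$2" 'BEGIN { printf "%.2f\n", 1000 * w / t / 3600 }'; }

# Rows: <device> <tok/s> <watts>, taken from the power table.
for row in "5090 301 400" "CPU 16.6 153" "iGPU 4.67 64"; do
    set -- $row   # intentional word splitting into $1 $2 $3
    echo "$1: $(tok_per_joule "$2" "$3") tok/J, $(wh_per_ktok "$2" "$3") Wh per 1k tok"
done
```

Note that the iGPU row uses package power (64 W), not the 10.3 W drawn by its own power plane, which is why it lands below the CPU.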

Consequence for overflow: the split-mode plan already showed CPU spill beats iGPU layer-split on throughput; energy agrees. Spill overflow to CPU, never to the iGPU.

The one open lever: if the busy-wait were a blocking wait, the ~37 W of core overhead would disappear and iGPU system power would fall from 64 W to about 27 W (roughly idle plus the iGPU’s own ~10 W), giving ~0.17 tok/J. It would then edge the CPU by ~1.6x and vindicate the “greener” hypothesis for a low-power, always-on side workload. The penalty is a driver-config artifact, not a hardware ceiling. Even so, it would not come within 4x of the 5090. Unverified.
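
The what-if arithmetic can be checked the same way. This sketch assumes throughput stays at 4.67 tok/s while system power drops to 27 W, which is exactly the unverified part; ratio is a hypothetical helper.

```shell
#!/bin/sh
# What-if: same 4.67 tok/s, but at ~27 W instead of 64 W.
ratio() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'; }

blocking=$(ratio 4.67 27)          # projected tok/J with a blocking wait
echo "projected iGPU: $blocking tok/J"
echo "vs CPU (0.108 tok/J):   $(ratio "$blocking" 0.108)x"
echo "5090 (0.75 tok/J) lead: $(ratio 0.75 "$blocking")x"
```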

# whole-model on each device (-fa 0, f16 KV)
SYCL_CACHE_PERSISTENT=0 GGML_SYCL_DISABLE_DNN=1 \
  ./build-both/bin/llama-bench -m small-model.gguf -fa 0 -ngl 0          -r 3  # CPU
SYCL_CACHE_PERSISTENT=0 GGML_SYCL_DISABLE_DNN=1 \
  ./build-both/bin/llama-bench -m small-model.gguf -fa 0 -dev SYCL0 -ngl 99 -r 3  # iGPU

# power during a sustained decode (run alongside a long -n bench)
sudo chmod -R a+r /sys/class/powercap/intel-rapl   # RAPL energy_uj is root-only
./rapl-power.sh 30   # pkg / core(CPU) / uncore(iGPU) avg watts

The RAPL sampler is a small shell script, used because perf and turbostat were not installed. It reads the energy_uj counters before and after a window and divides by elapsed time; the uncore domain is the iGPU on this client part.

#!/bin/bash
# Sample average Intel RAPL power over a window.
# Domains on Raptor Lake client: package-0, core (PP0=CPU cores),
# uncore (PP1=integrated GPU power plane).
# Usage: rapl-power.sh <seconds>
set -e

dur=${1:-5}
base=/sys/class/powercap/intel-rapl/intel-rapl:0

uj()  { cat "$1/energy_uj"; }
max=$(cat "$base/max_energy_range_uj")

t0=$(date +%s.%N)
p0=$(uj "$base"); c0=$(uj "$base/intel-rapl:0:0"); u0=$(uj "$base/intel-rapl:0:1")
sleep "$dur"
t1=$(date +%s.%N)
p1=$(uj "$base"); c1=$(uj "$base/intel-rapl:0:0"); u1=$(uj "$base/intel-rapl:0:1")

awk -v p0="$p0" -v p1="$p1" -v c0="$c0" -v c1="$c1" -v u0="$u0" -v u1="$u1" \
    -v t0="$t0" -v t1="$t1" -v mx="$max" '
function w(a, b) { d = b - a; if (d < 0) d += mx; return d / 1e6 / dt }
BEGIN {
    dt = t1 - t0;
    printf "dt=%.1fs  pkg=%.1fW  core(CPU)=%.1fW  uncore(iGPU)=%.1fW\n", dt, w(p0,p1), w(c0,c1), w(u0,u1)
}'
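
The sampler hardcodes intel-rapl:0:0 as core and intel-rapl:0:1 as uncore. On another machine the sub-zone order may differ, so it is worth confirming the mapping first. A minimal sketch, assuming the standard powercap sysfs layout where each zone directory carries a name file; list_rapl_zones is a hypothetical helper.

```shell
#!/bin/sh
# List RAPL sub-zones with the domain name the kernel reports for each.
# Usage: list_rapl_zones [package dir]
list_rapl_zones() {
    base=${1:-/sys/class/powercap/intel-rapl/intel-rapl:0}
    for zone in "$base"/intel-rapl:*; do
        [ -r "$zone/name" ] || continue   # skip if the zone or file is absent
        echo "$(basename "$zone") $(cat "$zone/name")"
    done
}

list_rapl_zones "$@"
```

If the output does not pair intel-rapl:0:0 with core and intel-rapl:0:1 with uncore, the paths in the sampler need adjusting before its column labels can be trusted.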

MTP offload: not possible for the head

Separate question investigated: can the MTP draft head of an MTP model (--spec-type draft-mtp) be offloaded to the iGPU? No. The head is welded to the target’s KV cache, so it cannot live on a different backend than the target. This was confirmed against ground-truth runs, and the behaviour depends on a build-time property of the GGUF: how the publisher packaged the MTP head.

There are two packagings, and they are visible directly in the HuggingFace repo:

That one difference picks the server’s code path, and the path decides whether -devd SYCL0 does anything:

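| MTP packaging              | server branch  | -devd SYCL0 effect                                  |
|----------------------------|----------------|-----------------------------------------------------|
| embedded (qwen35.nextn_*)  | against-target | inert - draft stays on the target’s device          |
| separate file (mtp-*.gguf) | separate-model | honored -> draft on iGPU -> aborts on shared KV     |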

Which branch is taken is decided by has_dft() (common.h:360), true iff a draft model path is set. For an MTP -hf repo, arg.cpp:473-477 auto-populates that path only if a separate MTP sibling file is discovered (res.found_mtp) – which is precisely the embedded-vs-separate distinction above. The server log disambiguates: [spec] estimated memory usage of draft model (separate) vs ... of MTP context (embedded), server-context.cpp:968-969.
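
A quick way to tell which branch a given run took is to look for those two log fragments. The sketch below matches only the substrings quoted above; the exact log wording may differ between llama.cpp versions, and mtp_branch is a hypothetical helper.

```shell
#!/bin/sh
# Classify the MTP branch from server log text on stdin.
# The matched substrings are the log fragments quoted in the text.
mtp_branch() {
    log=$(cat)
    case "$log" in
        *"of MTP context"*)                        echo "embedded (against-target)" ;;
        *"estimated memory usage of draft model"*) echo "separate (separate-model)" ;;
        *)                                         echo "unknown" ;;
    esac
}
```

Usage, assuming the server output was captured to a file: mtp_branch < server.log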

Embedded MTP (against-target branch)

ctx_dft is built from the same model_tgt object (server-context.cpp:1055) with cparams_mtp derived from params_base (the target’s devices), and shares the target’s KV (cparams_mtp.ctx_other = ctx_tgt). -devd / -ngld are read only in the separate-model branch (server-context.cpp:1002-1004); here they are merely logged (speculative.cpp:862), never applied - so devices=[SYCL0] in the log is the requested value, not where anything ran. Target VRAM is unchanged whatever you pass, and any SYCL-vs-CUDA decode delta is the power-capped 5090 throttling, not the iGPU.

The MTP context’s memory (e.g. 2240 MiB on Qwen3.6 at ~205k ctx) is one draft layer’s KV over the full context at the draft KV type. It is bound to the target device and cannot be moved; the only lever is to shrink it. Dropping LLAMA_ARG_SPEC_DRAFT_CACHE_TYPE_{K,V} from q8_0 to q4_0 roughly halves it. Draft mispredictions are caught by the target’s verify step, so a coarser draft KV costs at most a sliver of accept-rate, never correctness.
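
The "roughly halves" figure follows from the ggml block layouts: q8_0 stores 32 values in 34 bytes and q4_0 stores 32 values in 18 bytes. A minimal sketch of the estimate, assuming the 2240 MiB figure was measured at q8_0 and that the buffer scales linearly with bytes per value; kv_after_requant is a hypothetical helper.

```shell
#!/bin/sh
# ggml block formats: q8_0 = 34 bytes per 32 values, q4_0 = 18 bytes per 32 values.
kv_after_requant() {  # kv_after_requant <MiB at q8_0>
    awk -v mib="$1" 'BEGIN { printf "%.0f\n", mib * 18 / 34 }'
}

kv_after_requant 2240   # estimated MiB after switching the draft KV to q4_0
```

The ratio is 18/34, about 0.53, so the saving is slightly less than half.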

Separate MTP file (separate-model branch)

Here -devd SYCL0 is honored and the draft graph is genuinely placed on the iGPU - and that is exactly why it crashes. The separate MTP head still shares the target’s KV cache (log: layer 3: sharing with layer 59, cache_k_l58), and that KV is pre-allocated on CUDA0 with the target. A SYCL op against a CUDA0-pinned tensor cannot be scheduled, so it aborts:

ggml-backend.cpp:898: pre-allocated tensor (cache_k_l58) in a buffer (CUDA0)
                      that cannot run the operation (NONE)
  ggml_backend_sched_backend_id_from_cur -> ggml_abort

Same heterogeneous-backend wall as row-split. The fix is to not cross backends: drop -devd SYCL0 and let the draft default onto the target’s device (-dev CUDA0). That is the only working config for an MTP draft.

Could the draft keep its own KV copy instead?

The shared KV is the only thing welding the draft to the target’s backend, so the structural question is whether the draft can carry an independent copy. For some MTP architectures it already can - llama.cpp has a non-shared path - but the gemma arch is specifically excluded from it.

The switch is is_mem_shared = llama_get_ctx_other(ctx_dft) == ctx_tgt (speculative.cpp:904). When it is false, process() runs a catch-up decode (speculative.cpp:990-1009): it reads the target’s per-position hidden states via llama_get_embeddings_nextn(ctx_tgt), memcpys them into the draft’s own batch, and decodes them through the draft to populate ctx_dft’s own KV cache. That transfer is host memory, not a shared device tensor, so a draft on this path is cross-backend-clean and could live on SYCL0. This is exactly the “keep two KV copies” idea, already implemented.

Two reasons it does not rescue the gemma case:

  1. Gemma4 forbids it. LLM_ARCH_GEMMA4_ASSISTANT hard-requires ctx_other and throws without it (llama-context.cpp:94-101); LLM_ARCH_EAGLE3 without its own tok_embd/output is the same (llama-context.cpp:103-110). Their MTP heads read the target’s K/V directly (the layer 3: sharing with layer 59 log) and have no k/v projections of their own, so there is no independent cache to fill. The weld is in the model, not the plumbing - is_mem_shared is forced true. The server also sets cparams.ctx_other = ctx_tgt unconditionally in both branches (server-context.cpp:1036, 1053).

  2. The iGPU loses even where it is allowed. The catch-up decode re-runs the MTP layer over every position - the whole prompt at prefill (~205k tokens here) and one decode per step. Shared mode exists precisely to skip that. Pushing it onto the UHD 770 pays ~200x prefill and a per-step MTP decode that must still beat the 27B verify, or speculation goes net-negative. That spends the draft’s “nearly free” budget to reclaim ~2.2 GiB - a VRAM trade, never a speed one, and only worth it if 5090 VRAM is the hard binding constraint on a non-gemma MTP arch.

The architecturally-supported iGPU draft is a separate, independent small draft GGUF (-md ... -devd SYCL0 -ngld all, not draft-mtp): it carries its own KV, so speculative decoding only passes tokens across the backend boundary. Whether it beats MTP-on-5090 is unbenchmarked.

Next: real split on a too-large model

The split-mode results above are on a model that fits the 5090, where any iGPU participation is pure drag. The genuine test is a model that does not fit in 32 GB, where -sm layer with a minimal iGPU fraction is load-enabling rather than optional.

Plan: