A SYCL iGPU in a CUDA Box: Great for Building and Testing, Useless for Inference
The goal: a single llama.cpp binary that registers both an RTX 5090 (CUDA) and
an Intel UHD 770 iGPU (SYCL) as ggml device backends, exposes CUDA0 and SYCL0
side by side, and runs inference on either.
The point of this is not to make the two GPUs compute together. It is that the iGPU is already in the box, costs nothing, and gives you a real SYCL device to build and smoke-test the SYCL backend against without owning a discrete Intel card. That is where it earns its place. For actual inference it contributes nothing—the rest of this post is the benchmarks that show why.
Bottom line
The combined build works. Split the toolchains—icpx compiles everything and
does the final link, nvcc compiles only the .cu files with g++ forced as
its host compiler—and one binary enumerates both devices and runs inference on
each standalone. ggml has no mutual-exclusion guard between backends; the only
real risk, whether icpx could link nvcc-produced objects, did not bite.
Combining them for compute, though, almost never pays:
- The 5090 beats the UHD 770 ~200x at prefill and ~65x at decode.
- Layer-split is strictly negative on a model that fits. It is sequential, so throughput is gated by the slowest device, not additive—10% of layers on the iGPU already costs 10x.
- Cross-vendor tensor and row split do not work at all: both assume a homogeneous, peer-capable backend set, which CUDA0 + SYCL0 is not.
- The iGPU is not even greener. The silicon sips power, but the SYCL host driver busy-waits a CPU core, and that tax makes its real tok/J worse than the CPU’s.
- An MTP draft head cannot be offloaded to the iGPU—it is welded to the target’s KV cache, on the target’s device.
The one place it could earn its keep at inference time is load-enabling a model too large for 32 GB, via a minimal layer-split. That is the open thread at the end. Its real value, though, is upstream of all this: a free, always-present SYCL device to build and test the backend on.
One caveat colors every number below: the 5090 ran power-capped at 400 W (stock 575 W, max 600 W). That throttles prefill more than decode, so the CUDA/SYCL ratios here are conservative—at full power the 5090’s lead only widens, and every conclusion holds a fortiori.
Hypothesis
ggml registers each backend independently - the CUDA and SYCL backends are
separate subdirectories keyed only on their own GGML_* option, with no
mutual-exclusion guard. So a single binary should be able to carry both and
expose CUDA0 and SYCL0 side by side.
The only real obstacle is the compiler, not ggml:
- SYCL TUs only compile with Intel’s DPC++ driver (icpx), so CMAKE_CXX_COMPILER=icpx is mandatory.
- nvcc needs a host compiler it supports, and icpx (clang-based oneAPI) is not one.
Resolution: split the toolchains. icpx compiles everything (ggml core, llama,
the SYCL backend) and does the final link; nvcc compiles only the .cu files
with gcc forced as its host compiler via -DCMAKE_CUDA_HOST_COMPILER=g++
(threaded through as -ccbin at ggml-cuda/CMakeLists.txt:219). icpx defaults
to gcc’s libstdc++, so the objects are ABI-compatible at link time.
The untested seam was whether icpx could link nvcc-produced CUDA objects.
Process
Built from upstream llama.cpp, tag b9660. build-both.sh is build-sycl.sh
reduced to a single combined target:
- sources oneAPI, guards on both icpx and nvcc being present
- -DGGML_CUDA=ON -DGGML_SYCL=ON -DGGML_VULKAN=OFF
- -DCMAKE_CUDA_HOST_COMPILER=g++, -DCMAKE_CUDA_ARCHITECTURES=120a-real
- -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
- post-build check asserts both CUDA and SYCL appear in llama-bench --list-devices (a one-backend build is a silent failure)
- smoke test pinned to -dev CUDA0 (see iGPU gotchas below)
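As a rough sketch, the configure step could look like the following. The flags are the ones listed above and build-both is the build directory used in the commands later in this post. The function names and the setvars.sh path (oneAPI's default install location) are assumptions, not lifted from the actual build-both.sh.

```shell
#!/bin/bash
# Sketch only: flags come from the list above; function names are hypothetical.

# Print the CMake flags for a combined CUDA + SYCL build, one per line.
both_flags() {
  printf '%s\n' \
    -DGGML_CUDA=ON -DGGML_SYCL=ON -DGGML_VULKAN=OFF \
    -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx \
    -DCMAKE_CUDA_HOST_COMPILER=g++ \
    -DCMAKE_CUDA_ARCHITECTURES=120a-real
}

# Configure and build; refuses to start unless both compilers are on PATH.
configure_both() {
  # Assumed default oneAPI location; source it first and adjust to the local install:
  # . /opt/intel/oneapi/setvars.sh
  for tool in icpx nvcc; do
    command -v "$tool" >/dev/null 2>&1 || { echo "missing: $tool" >&2; return 1; }
  done
  # The flags contain no spaces, so plain word splitting is safe here.
  cmake -S . -B build-both $(both_flags) && cmake --build build-both -j
}
```

Calling configure_both from the llama.cpp source root would run the configure and build. Keeping the flags in one function makes it easy to diff them against a single-backend script.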
Findings
Build and link succeeded (~48 min cold, ccache-incremental after). The
icpx-links-nvcc-objects seam did not bite. Both devices enumerate:
Available devices:
CUDA0: NVIDIA GeForce RTX 5090 (32109 MiB, 31585 MiB free)
SYCL0: Intel(R) UHD Graphics 770 (59281 MiB, 16205 MiB free)
Each device runs inference standalone in the same binary.
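The post-build check from the Process list can be approximated with a small filter over that listing. This is a sketch that assumes the output format shown above (a backend name, an index, then a colon); has_both_backends is a hypothetical helper name.

```shell
#!/bin/bash
# Read a device listing on stdin; succeed only if it names at least one
# CUDA device and at least one SYCL device.
has_both_backends() {
  listing=$(cat)
  printf '%s\n' "$listing" | grep -Eq '^[[:space:]]*CUDA[0-9]+:' &&
    printf '%s\n' "$listing" | grep -Eq '^[[:space:]]*SYCL[0-9]+:'
}

# Intended use after a build:
#   ./build-both/bin/llama-bench --list-devices 2>&1 | has_both_backends \
#     || { echo "combined build exposes only one backend" >&2; exit 1; }
```

Failing loudly here matters because a build that silently dropped one backend still produces a working binary.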
iGPU gotchas (none are build defects)
- SYCL_CACHE_PERSISTENT (env defaults it on) trips a known UHD 770 IGC crash. Must be 0 for any SYCL run.
- GGML_SYCL_DISABLE_DNN=1 needed (oneDNN path).
- The UHD 770 cannot JIT FLASH_ATTN_EXT, so any SYCL run needs -fa off with f16 KV (quantized KV needs FA).
- With both GPUs visible and no -dev pin, auto device-fit spills layers onto the iGPU; combined with -fa on, it dumps core. Always pin -dev for combined runs.
iGPU drive recipe:
SYCL_CACHE_PERSISTENT=0 GGML_SYCL_DISABLE_DNN=1 \
llama-... -dev SYCL0 -fa off -ctk f16 -ctv f16 -ngl 99 ...
Results: split-mode bench
Model: Qwen3-4B Q4_K_M (2.37 GiB, small-model.gguf), -fa 0, r=3.
Device indices: CUDA0=0, SYCL0=1.
Caveat: the 5090 was power-capped at 400 W (default 575 W, max 600 W). That
throttles prefill (compute-bound) more than decode (bandwidth-bound; VRAM
clocks are barely power-gated), so the CUDA0 pp512 here sits below the chip’s
full-power ceiling while tg128 is near-true. This makes the iGPU look less
bad than it is: at stock power the 5090’s prefill lead widens, so every
conclusion below holds and the CUDA/SYCL ratios are conservative.
| split-mode | tensor-split (CUDA/SYCL) | pp512 t/s | tg128 t/s |
|---|---|---|---|
| none | CUDA0 only (-mg 0) | 16574 | 324 |
| none | SYCL0 only (-mg 1) | 82 | 5.0 |
| layer | 99/1 (rounds to 0 iGPU layers) | 16591 | 324 |
| layer | 9/1 (~10% on iGPU) | 1774 | 32.8 |
| layer | 5/5 | 179 | 10.5 |
| row | 9/1, 5/5 | GGML_ABORT | - |
| tensor | 9/1, 5/5 | ctx create fails | - |
Reading it
- The 5090 alone wins everything: ~200x the iGPU at prefill, ~65x at decode.
- Layer split is strictly negative on a model that fits. It is sequential (each
token flows through all layers in order), so throughput is gated by whichever
device holds a layer, not additive. 10% of layers on the iGPU already costs
10x; 50/50 collapses to roughly iGPU-only.
99/1 ties the baseline only because 1% of 36 layers rounds to zero iGPU layers.
- Tensor-parallel modes are unavailable across vendors. row aborts in
ggml_backend_sched_backend_id_from_cur; tensor fails context creation. Both
assume a peer-capable homogeneous backend set, which CUDA0 + SYCL0 is not.
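The "gated, not additive" point can be checked with a back-of-envelope model. If each token passes through every layer in turn, per-token time is the sum of each device's share, so the rates combine harmonically. The sketch below is my simplification, not anything llama.cpp computes: it assumes equal-cost layers and ignores cross-device transfers and layer rounding. serial_tps is a hypothetical helper.

```shell
#!/bin/bash
# serial_tps <fraction_on_slow_device> <fast_tps> <slow_tps>
# Per-token time is the sum of each device's share of the layers,
# so the combined rate is a weighted harmonic mean, not a sum.
serial_tps() {
  awk -v f="$1" -v fast="$2" -v slow="$3" \
    'BEGIN { printf "%.1f\n", 1 / ((1 - f) / fast + f / slow) }'
}

serial_tps 0.1 324 5.0   # decode, ~10% of layers on the iGPU
serial_tps 0.5 324 5.0   # decode, 50/50
```

With the standalone decode rates from the table, the model gives roughly 44 and 10 t/s, against the measured 32.8 and 10.5. It lands in the right order of magnitude and makes the mechanism visible: the slow device's term dominates the denominator as soon as it holds any layers at all.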
Commands
# baselines (both single-device)
SYCL_CACHE_PERSISTENT=0 GGML_SYCL_DISABLE_DNN=1 \
./build-both/bin/llama-bench -m small-model.gguf -fa 0 -sm none -mg 0,1 -r 3
# layer split sweep
SYCL_CACHE_PERSISTENT=0 GGML_SYCL_DISABLE_DNN=1 \
./build-both/bin/llama-bench -m small-model.gguf -fa 0 -sm layer -ts 99/1,9/1,5/5 -r 3
Power efficiency: is the slow iGPU at least greener?
A fair question, since the UHD 770 is slow but sips power. To rank the three
compute paths the model can land on, the whole model was run on each device and
power sampled during a sustained decode. CPU/iGPU draw is Intel RAPL
(package-0); the 5090 is nvidia-smi board power. On this Raptor Lake desktop
the RAPL uncore domain is the iGPU power plane (PP1), so the iGPU is measured
directly, not by differential. Idle package = 14.6 W.
| device | tok/s | power | tok/J | energy / 1k tok |
|---|---|---|---|---|
| 5090 (CUDA0) | 301 | 400 W board | 0.75 | 0.37 Wh |
| CPU (28 core) | 16.6 | 153 W pkg | 0.108 | 2.6 Wh |
| iGPU (770) | 4.67 | 64 W pkg | 0.073 | 3.8 Wh |
The intuition is half right, and the wrong half is the finding. At the silicon
level the iGPU is frugal: its own power plane drew only 10.3 W for those
4.67 tok/s (0.45 tok/J, ~4x the CPU). But during that same run the RAPL core
domain sat at 42.9 W - ~37 W over idle, ~1.5 cores pinned - because the
SYCL/Level-Zero host driver busy-waits on queue completion instead of
blocking. So the system spends 64 W to produce 4.67 tokens, and the honest
tok/J (0.073) is worse than the CPU’s (0.108). The iGPU is efficient; the way
it is driven is not, and the CPU tax inverts the result.
The 5090 wins efficiency outright, not just speed: it pins the 400 W cap (even decoding this 4B - it never throttles down) but finishes ~65x faster, so race-to-idle gives it ~0.75 tok/J, ~7x the CPU and ~10x the iGPU-as-deployed. Fast is green.
Consequence for overflow: the split-mode plan already showed CPU spill beats iGPU layer-split on throughput; energy agrees. Spill overflow to CPU, never to the iGPU.
The one open lever: if the busy-wait were a blocking wait, iGPU system power would fall toward idle + 10 W ~= 27 W, giving ~0.17 tok/J - then it would edge the CPU ~1.6x and vindicate the “greener” hypothesis for a low-power, always-on side workload. It is a driver-config artifact, not a hardware ceiling. It still would not come within 4x of the 5090. Unverified.
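The efficiency figures in this section are plain division, sketched here so they can be re-derived or rerun with different power readings. The helper names are hypothetical; the inputs are the tok/s and watt values quoted above, and the 27 W case is the unverified blocking-wait scenario.

```shell
#!/bin/bash
# tok_per_joule <tok/s> <watts>  -> tokens per joule (1 W = 1 J/s)
tok_per_joule() {
  awk -v tps="$1" -v w="$2" 'BEGIN { printf "%.3f\n", tps / w }'
}

# wh_per_1k <tok/s> <watts>  -> watt-hours needed to generate 1000 tokens
wh_per_1k() {
  awk -v tps="$1" -v w="$2" 'BEGIN { printf "%.2f\n", 1000 / tps * w / 3600 }'
}

tok_per_joule 16.6 153    # CPU, package power
tok_per_joule 4.67 64     # iGPU as deployed, package power
tok_per_joule 4.67 10.3   # iGPU silicon only (uncore plane)
tok_per_joule 4.67 27     # hypothetical blocking wait (unverified)
wh_per_1k 4.67 64         # iGPU energy per 1k tokens
```

The whole argument turns on the gap between the second and third calls: same throughput, but the denominator is package power in one and the iGPU plane alone in the other.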
# whole-model on each device (-fa 0, f16 KV)
SYCL_CACHE_PERSISTENT=0 GGML_SYCL_DISABLE_DNN=1 \
./build-both/bin/llama-bench -m small-model.gguf -fa 0 -ngl 0 -r 3 # CPU
SYCL_CACHE_PERSISTENT=0 GGML_SYCL_DISABLE_DNN=1 \
./build-both/bin/llama-bench -m small-model.gguf -fa 0 -dev SYCL0 -ngl 99 -r 3 # iGPU
# power during a sustained decode (run alongside a long -n bench)
sudo chmod -R a+r /sys/class/powercap/intel-rapl # RAPL energy_uj is root-only
./rapl-power.sh 30 # pkg / core(CPU) / uncore(iGPU) avg watts
The RAPL sampler follows (perf/turbostat were not installed). It reads the
energy_uj counters before and after a window and divides by elapsed time;
the uncore domain is the iGPU on this client part.
#!/bin/bash
# Sample average Intel RAPL power over a window.
# Domains on Raptor Lake client: package-0, core (PP0=CPU cores),
# uncore (PP1=integrated GPU power plane).
# Usage: rapl-power.sh <seconds>
set -e
dur=${1:-5}
base=/sys/class/powercap/intel-rapl/intel-rapl:0
uj() { cat "$1/energy_uj"; }
max=$(cat "$base/max_energy_range_uj")
t0=$(date +%s.%N)
p0=$(uj "$base"); c0=$(uj "$base/intel-rapl:0:0"); u0=$(uj "$base/intel-rapl:0:1")
sleep "$dur"
t1=$(date +%s.%N)
p1=$(uj "$base"); c1=$(uj "$base/intel-rapl:0:0"); u1=$(uj "$base/intel-rapl:0:1")
awk -v p0="$p0" -v p1="$p1" -v c0="$c0" -v c1="$c1" -v u0="$u0" -v u1="$u1" \
-v t0="$t0" -v t1="$t1" -v mx="$max" '
function w(a, b) { d = b - a; if (d < 0) d += mx; return d / 1e6 / dt }
BEGIN {
dt = t1 - t0;
printf "dt=%.1fs pkg=%.1fW core(CPU)=%.1fW uncore(iGPU)=%.1fW\n", dt, w(p0,p1), w(c0,c1), w(u0,u1)
}'
MTP offload: not possible for the head
Separate question investigated: can the MTP draft head of an MTP model
(--spec-type draft-mtp) be offloaded to the iGPU? No. The head is welded to
the target’s KV cache, so it cannot live on a different backend than the target.
This was confirmed against ground-truth runs, and the behaviour depends on a
build-time property of the GGUF: how the publisher packaged the MTP head.
There are two packagings, and they are visible directly in the HuggingFace repo:
- Embedded (e.g. unsloth/Qwen3.6-27B-MTP-GGUF): the MTP head lives inside the main model GGUF. The repo ships one model file and no MTP sibling; the header carries a *.nextn_predict_layers metadata key (here qwen35.nextn_predict_layers). llama.cpp loads the draft from the same file as the target.
- Separate file (e.g. unsloth/gemma-4-31B-it-GGUF): the repo ships the model GGUF plus a distinct sibling, mtp-gemma-4-31B-it.gguf. The draft is a second file that llama.cpp loads alongside the target.
That one difference picks the server’s code path, and the path decides whether
-devd SYCL0 does anything:
| MTP packaging | server branch | -devd SYCL0 effect |
|---|---|---|
| embedded (qwen35.nextn_*) | against-target | inert - draft stays on the target’s device |
| separate file (mtp-*.gguf) | separate-model | honored -> draft on iGPU -> aborts on shared KV |
Which branch is taken is decided by has_dft() (common.h:360), true iff a draft
model path is set. For an MTP -hf repo, arg.cpp:473-477 auto-populates that
path only if a separate MTP sibling file is discovered (res.found_mtp) –
which is precisely the embedded-vs-separate distinction above.
The server log disambiguates: [spec] estimated memory usage of draft model
(separate) vs ... of MTP context (embedded), server-context.cpp:968-969.
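To tell the branches apart without reading the source, a saved server log can be classified on those two phrases. This is a minimal sketch: it matches only the fragments quoted above ("of draft model", "of MTP context") on [spec] lines, and mtp_branch_from_log is a hypothetical helper name.

```shell
#!/bin/bash
# Classify the MTP code path from a llama-server log on stdin.
# Prints "separate", "embedded", or "unknown".
mtp_branch_from_log() {
  log=$(cat)
  if printf '%s\n' "$log" | grep -q '\[spec\].*of draft model'; then
    echo separate     # separate-model branch: -devd is honored
  elif printf '%s\n' "$log" | grep -q '\[spec\].*of MTP context'; then
    echo embedded     # against-target branch: -devd is inert
  else
    echo unknown
  fi
}

# Intended use:
#   mtp_branch_from_log < server.log
```

If this prints "embedded", any devices=[SYCL0] line in the same log reflects the request only, as described in the next section.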
Embedded MTP (against-target branch)
ctx_dft is built from the same model_tgt object (server-context.cpp:1055)
with cparams_mtp derived from params_base (the target’s devices), and shares
the target’s KV (cparams_mtp.ctx_other = ctx_tgt). -devd / -ngld are read
only in the separate-model branch (server-context.cpp:1002-1004); here they
are merely logged (speculative.cpp:862), never applied - so devices=[SYCL0]
in the log is the requested value, not where anything ran. Target VRAM is
unchanged whatever you pass, and any SYCL-vs-CUDA decode delta is the
power-capped 5090 throttling, not the iGPU.
The MTP context’s memory (e.g. 2240 MiB on Qwen3.6 at ~205k ctx) is one draft
layer’s KV over the full context at the draft KV type. It is bound to the target
device and cannot be moved; the only lever is to shrink it. Dropping
LLAMA_ARG_SPEC_DRAFT_CACHE_TYPE_{K,V} from q8_0 to q4_0 roughly halves it.
Draft mispredictions are caught by the target’s verify step, so a coarser draft
KV costs at most a sliver of accept-rate, never correctness.
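"Roughly halves" follows from ggml's block sizes: per 32 elements, q8_0 stores 34 bytes (a 2-byte scale plus 32 int8 values) and q4_0 stores 18 (a 2-byte scale plus 16 packed bytes). A small sketch of the rescaling, assuming the 2240 MiB figure was measured with q8_0 draft KV; kv_rescale and block_bytes are hypothetical helper names.

```shell
#!/bin/bash
# Bytes per 32-element block for the KV types discussed here.
block_bytes() {
  case "$1" in
    f16)  echo 64 ;;
    q8_0) echo 34 ;;
    q4_0) echo 18 ;;
    *)    echo "unsupported KV type: $1" >&2; return 1 ;;
  esac
}

# kv_rescale <size_mib> <from_type> <to_type> -> estimated size in MiB
kv_rescale() {
  from=$(block_bytes "$2") || return 1
  to=$(block_bytes "$3") || return 1
  awk -v mib="$1" -v from="$from" -v to="$to" \
    'BEGIN { printf "%.0f\n", mib * to / from }'
}

kv_rescale 2240 q8_0 q4_0   # estimated MTP context size after the switch
```

The ratio is 18/34, about 0.53, so the estimate lands a little above half the original size.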
Separate MTP file (separate-model branch)
Here -devd SYCL0 is honored and the draft graph is genuinely placed on the
iGPU - and that is exactly why it crashes. The separate MTP head still shares
the target’s KV cache (log: layer 3: sharing with layer 59, cache_k_l58),
and that KV is pre-allocated on CUDA0 with the target. A SYCL op against a
CUDA0-pinned tensor cannot be scheduled, so it aborts:
ggml-backend.cpp:898: pre-allocated tensor (cache_k_l58) in a buffer (CUDA0)
that cannot run the operation (NONE)
ggml_backend_sched_backend_id_from_cur -> ggml_abort
Same heterogeneous-backend wall as row-split. The fix is to not cross
backends: drop -devd SYCL0 and let the draft default onto the target’s device
(-dev CUDA0). That is the only working config for an MTP draft.
Could the draft keep its own KV copy instead?
The shared KV is the only thing welding the draft to the target’s backend, so the structural question is whether the draft can carry an independent copy. For some MTP architectures it already can - llama.cpp has a non-shared path - but the gemma arch is specifically excluded from it.
The switch is is_mem_shared = llama_get_ctx_other(ctx_dft) == ctx_tgt
(speculative.cpp:904). When it is false, process() runs a catch-up
decode (speculative.cpp:990-1009): it reads the target’s per-position hidden
states via llama_get_embeddings_nextn(ctx_tgt), memcpys them into the
draft’s own batch, and decodes them through the draft to populate ctx_dft’s
own KV cache. That transfer is host memory, not a shared device tensor, so a
draft on this path is cross-backend-clean and could live on SYCL0. This is
exactly the “keep two KV copies” idea, already implemented.
Two reasons it does not rescue the gemma case:
- Gemma4 forbids it. LLM_ARCH_GEMMA4_ASSISTANT hard-requires ctx_other and throws without it (llama-context.cpp:94-101); LLM_ARCH_EAGLE3 without its own tok_embd/output is the same (llama-context.cpp:103-110). Their MTP heads read the target’s K/V directly (the layer 3: sharing with layer 59 log) and have no k/v projections of their own, so there is no independent cache to fill. The weld is in the model, not the plumbing - is_mem_shared is forced true. The server also sets cparams.ctx_other = ctx_tgt unconditionally in both branches (server-context.cpp:1036, 1053).
- The iGPU loses even where it is allowed. The catch-up decode re-runs the MTP layer over every position - the whole prompt at prefill (~205k tokens here) and one decode per step. Shared mode exists precisely to skip that. Pushing it onto the UHD 770 pays ~200x prefill and a per-step MTP decode that must still beat the 27B verify, or speculation goes net-negative. That spends the draft’s “nearly free” budget to reclaim ~2.2 GiB - a VRAM trade, never a speed one, and only worth it if 5090 VRAM is the hard binding constraint on a non-gemma MTP arch.
The architecturally-supported iGPU draft is a separate, independent small
draft GGUF (-md ... -devd SYCL0 -ngld all, not draft-mtp): it carries its
own KV, so speculative decoding only passes tokens across the backend boundary.
Whether it beats MTP-on-5090 is unbenchmarked.
Next: real split on a too-large model
The split-mode results above are on a model that fits the 5090, where any iGPU
participation is pure drag. The genuine test is a model that does not fit in
32 GB, where -sm layer with a minimal iGPU fraction is load-enabling rather
than optional.
Plan:
- Pick a model whose weights + KV exceed ~31.5 GB free on CUDA0.
- -sm layer only (row/tensor are out across vendors). Sweep -ts to put the smallest viable fraction on SYCL0 - just enough layers to fit, since every iGPU layer is a sequential tax.
- Hold -fa 0, f16 KV, SYCL_CACHE_PERSISTENT=0, GGML_SYCL_DISABLE_DNN=1.
- Compare against the all-CPU-offload fallback (--n-cpu-moe / partial -ngl on CUDA0 alone) - the iGPU only earns its place if layer-split beats spilling the overflow to host RAM.
- Expectation: tg stays iGPU-bound for whatever fraction lands there, so the win (if any) is “runs at all / runs faster than CPU spill”, not throughput parity with a fitting model.
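A starting point for the -ts sweep can be estimated before any run. The sketch below assumes equal-size layers and ignores KV cache, compute buffers, and non-repeating tensors, so treat its answer as a lower bound to sweep upward from. min_igpu_layers and the 40000 MiB, 64-layer model are hypothetical; 31585 MiB is the CUDA0 free figure from the device listing above.

```shell
#!/bin/bash
# min_igpu_layers <model_mib> <n_layers> <cuda_free_mib>
# Smallest number of layers that must leave CUDA0, assuming equal-size layers.
min_igpu_layers() {
  awk -v total="$1" -v n="$2" -v free="$3" 'BEGIN {
    overflow = total - free
    if (overflow <= 0) { print 0; exit }   # model fits: keep the iGPU out of it
    x = overflow / (total / n)             # overflow expressed in layers
    c = int(x); if (x > c) c++             # round up: a layer cannot be split
    print c
  }'
}

k=$(min_igpu_layers 40000 64 31585)        # hypothetical too-large model
echo "-sm layer -ts $((64 - k))/$k"
```

How llama.cpp rounds a -ts ratio into whole layers may differ from this estimate, so the emitted ratio is a first guess for the sweep, not a guaranteed fit.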