KV-Cache Quantization vs Positional Recall
Quantizing the KV cache is supposed to trade recall for memory. To find the cost, I ran
four full -ctk x -ctv sweeps—2304 verbatim function recalls in total—against a
heavily weight-quantized Qwen3.6-27B served by llama.cpp.
Setup: qwen3.6-27b-n4_0-mse.gguf (~Q3-effective weight quant, built with the MSE quant PR), llama.cpp on :4000, flash-attention on, temperature 0, 262K context window.
Harness: codeneedle, a positional verbatim-recall benchmark.
Bottom line
Across four 36-combo -ctk x -ctv sweeps — memorized code, unseen code at two depths,
and synthetic incompressible content at 140K tokens — KV-cache quantization down to
q4_0 shows no measurable cost to verbatim positional recall. q4_0/q4_0 matches
bf16/bf16 to within sampling noise in every run, with no precision-aligned gradient and
no advantage to mixed pairings like q8/q4. The reason is mechanical: verbatim recall is a
single-hop copy off a unique anchor, a near one-hot attention lookup whose margin swamps
the noise that 4-bit K/V adds. This is a negative result: it does not say q4 cache is
free in general — only that exact-match retrieval, the thing long-context recall benchmarks
usually measure, is the wrong probe for KV-quant degradation. To see q8 beat q4 you have to
leave the high-margin-copy regime (confusable keys, multi-hop, aggregation).
Hypothesis
Quantizing the KV cache should degrade an LLM’s ability to reproduce text verbatim
from long context, and the degradation should worsen as precision drops
(bf16 > q8_0 > q5 > q4). If true, a sweep over -ctk x -ctv quantization pairings
should show a precision-aligned gradient: the high-precision corner (bf16/bf16)
scores best, the low-precision corner (q4_0/q4_0) worst, and mixed pairings
(q8/q4) land in between — quantifying the cost of cheaper cache, and settling whether
mixed quants buy anything over the matched low-precision pair.
This was motivated by hands-on reports that the same model serving at q8/q8 “feels
better” than at q4/q4, and by skepticism that mixed pairings like q8/q4 are
genuinely better than q4/q4 (mean KL-divergence often looks identical).
Hypothesis as refined by evidence
The original hypothesis failed on real code (see Results). It was successively narrowed:
- H1 (original): KV quant degrades verbatim recall, monotonically with precision.
- H2 (memorization confound): the model recalls memorized code (jQuery) from weights, not the cache, so quantizing the cache changes nothing. Test with code the model has never seen.
- H3 (compressibility): even unseen code is highly predictable — a noisy cache read is reconstructed from the language prior, laundering KV error. Test with incompressible content (random identifiers bound to random string literals), where no prior exists and every output token must come from the cache.
- H4 (confirmed): even incompressible content recalls cleanly under q4_0 KV: every cell of the full 36-combo grid lands between 19.94 and 20.00 out of 20. The bottleneck for KV-quant recall failure is therefore not the absence of a prior but the margin of the attention lookup itself, which is large for a single-hop copy with a unique anchor regardless of content. Breaking it would require stressing attention (confusable keys / distractors), not just removing the prior.
Methodology
The task
For each target function, stuff the entire source file into the model’s context, then ask it to reproduce the first 20 lines of the named function’s body, verbatim. This measures positional recall under long context, not named-entity lookup.
- Extraction: named functions with >=20 body lines (.js via esprima, .py via ast, .php via token_get_all).
- Sampling: stratified by line position, k=16, seed=42, so recall is probed at all depths, not just the tail.
- Scoring: LCS alignment of produced vs expected lines. primary_matched = how many of the 20 expected lines appear at the right position. Pass = >=8/20. relax_indent=true: leading whitespace stripped on both sides (this build re-indents recalled code). hallucinated = produced lines that match nothing expected.
- CoT suppression: prefill_no_think (the only technique this Qwen3.6 build honors). It echoes an empty <think>\n</think> into the response, which adds a uniform +2 to hallucinated and is not a recall signal.
- Determinism: temperature 0, so within one server config a prompt is deterministic; the only variable across cells is the KV quant.
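The scoring rules above can be sketched in a few lines. This is an illustration, not codeneedle's actual implementation: the function names are hypothetical, the LCS length stands in for "lines at the right position", and the pass threshold is a parameter so the rule can be tried on short inputs.

```python
def normalize(line, relax_indent=True):
    # relax_indent strips leading whitespace so re-indented output still matches
    return line.lstrip() if relax_indent else line


def lcs_length(a, b):
    # Classic dynamic-programming longest common subsequence over lines
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def score(expected, produced, relax_indent=True, pass_threshold=8):
    exp = [normalize(l, relax_indent) for l in expected]
    got = [normalize(l, relax_indent) for l in produced]
    matched = lcs_length(exp, got)
    expected_set = set(exp)
    # hallucinated: produced lines that match no expected line at all
    hallucinated = sum(1 for l in got if l not in expected_set)
    return {
        "primary_matched": matched,
        "hallucinated": hallucinated,
        "passed": matched >= pass_threshold,
    }


expected = ["  a = 1", "  b = 2", "  c = 3"]
produced = ["<think>", "</think>", "a = 1", "    b = 2", "x = 9", "c = 3"]
result = score(expected, produced, pass_threshold=2)
print(result)
```

In this toy input the two think-tag lines count toward hallucinated, which is the uniform +2 offset described above.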
The sweep
sweep-kv-quants.sh walks the full 6x6 grid of -ctk x -ctv over
{bf16, q8_0, q5_1, q5_0, q4_1, q4_0} = 36 combinations. Per combo: relaunch
llama-server with that K/V cache quant, wait for /health, run a tagged 16-function
round, kill the server. Resumable per-combo; quant order shuffled each run.
iq4_nl excluded (no flash-attention kernel — loads but wedges inference).
f16 would belong in the grid but this build lacks it; bf16 is the
highest-precision reference.
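The sweep loop is simple enough to sketch. The snippet below is a hedged Python rendering of the grid enumeration, not the contents of sweep-kv-quants.sh: the helper names are hypothetical, and the command line carries only the flags named in this write-up plus the model path and port (flash-attention and context-size flags are left out because their spelling varies between llama.cpp builds).

```python
import itertools
import random

CACHE_TYPES = ["bf16", "q8_0", "q5_1", "q5_0", "q4_1", "q4_0"]


def pending_combos(done=(), seed=None):
    """Return the (ctk, ctv) pairs still to run, in shuffled order."""
    finished = set(done)
    combos = [c for c in itertools.product(CACHE_TYPES, repeat=2) if c not in finished]
    random.Random(seed).shuffle(combos)
    return combos


def server_command(model_path, ctk, ctv, port=4000):
    """Build a llama-server argv for one K/V cache-quant pairing."""
    return ["llama-server", "-m", model_path, "--port", str(port), "-ctk", ctk, "-ctv", ctv]


todo = pending_combos(done=[("bf16", "bf16")], seed=7)
print(len(todo), "combos left; next:", " ".join(server_command("model.gguf", *todo[0])))
```

Passing the already-finished pairs as done is what makes the sweep resumable per combo in this sketch; a real runner would also poll /health before starting each round.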
Corpora
| corpus | content | size | depth (tokens) | model has seen it? |
|---|---|---|---|---|
| jquery | jQuery source (JS) | ~280 KB | ~80K | yes (memorized) |
| proprietary | private PHP, trimmed | ~230 KB | ~64K | no |
| proprietary-full | private PHP, full | ~453 KB | ~122K | no |
| random-bodies | generated Python, incompressible | ~206 KB | ~140K | no (synthetic) |
random-bodies is produced by gen-random-corpus.py: N functions whose bodies are
<random ident> = "<random literal>" lines. Both name and value are high-entropy, so no
language prior can reconstruct a line. Seeded — same knobs produce a byte-identical file.
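A minimal sketch of such a generator follows. It is not gen-random-corpus.py itself: the identifier length, the number of lines per function, and the helper names are assumptions chosen for illustration; only the shape of each body line and the seeded determinism come from the description above.

```python
import random
import string


def random_ident(rng, length=12):
    # Start with a letter so the result is a valid Python identifier
    first = rng.choice(string.ascii_lowercase)
    rest = "".join(rng.choice(string.ascii_lowercase + string.digits + "_") for _ in range(length - 1))
    return first + rest


def random_literal(rng, length):
    return "".join(rng.choice(string.ascii_letters + string.digits) for _ in range(length))


def generate_corpus(functions=200, lines_per_function=24, value_len=24, seed=42):
    rng = random.Random(seed)  # one seeded stream, so identical knobs give identical text
    chunks = []
    for _ in range(functions):
        body = [
            f'    {random_ident(rng)} = "{random_literal(rng, value_len)}"'
            for _ in range(lines_per_function)
        ]
        chunks.append(f"def {random_ident(rng)}():\n" + "\n".join(body) + "\n")
    return "\n".join(chunks)


sample = generate_corpus(functions=2, lines_per_function=3, value_len=8, seed=1)
print(sample)
```

Keeping every function at 20 or more body lines matters in practice, since the extraction step only samples functions of that length.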
Results
Summary across all four runs
Each cell of every grid is 16 functions; each run is 36 quant combos (576 function
recalls per run). “avg matched” is mean primary_matched out of 20.
| run | combos | pass /16 (mean +/- sd) | avg matched /20 (mean +/- sd) | matched range | gradient? |
|---|---|---|---|---|---|
| jquery (memorized) | 36 | 15.89 +/- 0.31 | 19.49 +/- 0.30 | 18.50 - 19.88 | none |
| proprietary (64K, unseen) | 36 | 16.00 +/- 0.00 | 19.06 +/- 0.11 | 18.81 - 19.31 | none |
| proprietary-full (122K) | 36 | 15.97 +/- 0.16 | 19.40 +/- 0.22 | 18.38 - 19.62 | none |
| random-bodies (140K, incompressible) | 36 | 16.00 +/- 0.00 | 19.99 +/- 0.02 | 19.94 - 20.00 | none |
In every run, recall is flat across the entire quantization grid. The cell-to-cell
variation (stdev ~0.1-0.3 matched lines) is smaller than a single recalled line and is
not precision-aligned — in several runs the q4_0 rows score higher than the
bf16 rows, which is anti-physical and the signature of sampling jitter, not
degradation.
Run 3 detail — proprietary-full (122K tokens), avg matched /20
A representative full grid. Note q4_0 (least precise) is among the best rows:
ctk\ctv bf16 q8_0 q5_1 q5_0 q4_1 q4_0
bf16 19.25 19.38 19.44 18.38 19.19 19.44
q8_0 19.19 19.56 19.56 19.44 19.44 19.56
q5_1 19.44 19.31 19.62 19.38 19.56 19.12
q5_0 19.50 19.38 19.56 19.50 19.50 19.50
q4_1 19.31 19.56 19.56 19.44 19.06 19.25
q4_0 19.56 19.56 19.44 19.56 19.44 19.56
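The summary statistics and the absence of a gradient can be checked directly from the grid above. The snippet is a small sketch using only the printed cell values; it is not grid.py, and the variable names are illustrative.

```python
import statistics

TYPES = ["bf16", "q8_0", "q5_1", "q5_0", "q4_1", "q4_0"]

# Rows are -ctk, columns are -ctv, values are avg matched /20 (copied from the grid)
ROWS = """
bf16 19.25 19.38 19.44 18.38 19.19 19.44
q8_0 19.19 19.56 19.56 19.44 19.44 19.56
q5_1 19.44 19.31 19.62 19.38 19.56 19.12
q5_0 19.50 19.38 19.56 19.50 19.50 19.50
q4_1 19.31 19.56 19.56 19.44 19.06 19.25
q4_0 19.56 19.56 19.44 19.56 19.44 19.56
"""

grid = {}
for line in ROWS.strip().splitlines():
    ctk, *vals = line.split()
    grid[ctk] = dict(zip(TYPES, map(float, vals)))

cells = [v for row in grid.values() for v in row.values()]
row_means = {ctk: statistics.mean(row.values()) for ctk, row in grid.items()}
col_means = {ctv: statistics.mean(grid[ctk][ctv] for ctk in TYPES) for ctv in TYPES}

print(f"mean={statistics.mean(cells):.2f} sd={statistics.stdev(cells):.2f} "
      f"range={min(cells):.2f}-{max(cells):.2f}")
for t in TYPES:
    print(f"{t}: K-row mean {row_means[t]:.2f}, V-col mean {col_means[t]:.2f}")
```

If precision mattered, the per-type means would fall as the type gets coarser; here the q4_0 K-row mean sits above the bf16 K-row mean.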
The hallucination column is an artifact, not a signal
On proprietary-full, the hallucinated metric spiked wildly (2.5 to 19.7 avg). This is
not recall degradation. Tracing the worst cells: the model reproduced the 20
requested lines perfectly (matched=20), then failed to stop — it kept faithfully
copying the source file (the rest of the function, the next function’s docblock, then
other functions’ comments) until it hit max_tokens. So a high hallucinated count here
means the model read the cache so well it ran on, not that it produced garbage. Which
functions run on is scattered across cells with no precision alignment — a stop-boundary
decision sensitive to exact KV state, deterministic within a cell but not monotonic with
bit-width. Noise with respect to the recall question.
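One way to keep this artifact out of the metric is to split the extra lines by origin. The sketch below is a hypothetical post-processing step, not part of the harness: it assumes the source file is available to the scorer and treats any extra line found verbatim in that file as run-on copying.

```python
THINK_ECHO = {"<think>", "</think>"}


def classify_extra_lines(produced, expected, source_lines):
    """Bucket produced lines that are not among the expected lines."""
    expected_set = {l.strip() for l in expected}
    source_set = {l.strip() for l in source_lines}
    counts = {"think_echo": 0, "run_on_copy": 0, "novel": 0}
    for line in produced:
        s = line.strip()
        if not s or s in expected_set:
            continue  # blank lines and correctly recalled lines are not extras
        if s in THINK_ECHO:
            counts["think_echo"] += 1
        elif s in source_set:
            counts["run_on_copy"] += 1  # faithful copy past the requested span
        else:
            counts["novel"] += 1  # text that appears nowhere in the source
    return counts


source = ["function a() {", "  x = 1;", "  y = 2;", "}", "/** next */", "function b() {"]
expected = ["x = 1;", "y = 2;"]
produced = ["<think>", "</think>", "x = 1;", "y = 2;", "}", "/** next */", "zzz"]
counts = classify_extra_lines(produced, expected, source)
print(counts)
```

Only the novel bucket would count as genuine hallucination under this split; run-on copying is a stopping problem rather than a recall problem.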
Run 4 detail — random-bodies (140K tokens, incompressible), avg matched /20
The decisive run, and the cleanest null in the set. Calibration check passed first:
q8/q8 recalls at 20/20, so the model can perfectly copy ~400 incompressible tokens
at 140K depth — we are not floored, we are in the discriminating band. The full grid:
ctk\ctv bf16 q8_0 q5_1 q5_0 q4_1 q4_0
bf16 20.00 20.00 20.00 20.00 20.00 20.00
q8_0 20.00 19.94 19.94 20.00 20.00 20.00
q5_1 20.00 20.00 20.00 20.00 20.00 20.00
q5_0 20.00 20.00 20.00 20.00 20.00 20.00
q4_1 20.00 20.00 20.00 20.00 20.00 20.00
q4_0 20.00 20.00 20.00 20.00 19.94 20.00
Thirty-three of 36 cells are a perfect 20.00. The three 19.94 cells are a single missed
line in one of 16 functions (319/320), and one of them is q8/q8 itself, the
highest-precision quantized pairing in the grid. The misses are scattered with no relation
to bit-width: this is sampling jitter, not degradation. The hallucinated column is clean
and uniform (~2.0 = the <think> echo), because random content gives the model no coherent
file to run on past the 20 lines.
The result: q4_0 KV cache shows no measurable loss for verbatim single-hop recall, even of
incompressible content at 140K tokens. q4_0/q4_0 equals bf16/bf16 to within noise.
Interpretation
Verbatim recall of a named span is mechanically a copy: locate the anchor (the function name, a literal string present in the context), attend to it, emit the tokens that follow. This is an induction-head operation — among the earliest and most robust circuits transformers form, sharp even in small, heavily weight-quantized models.
Why it resists KV quantization, even with no prior to lean on:
- The correct next token in a copy sits at large logit margin over all alternatives. Low-bit K/V perturbs attention scores and retrieved values slightly, but not enough to flip a high-confidence argmax, and at temperature 0, “perturbed but still argmax-correct” emits the identical token.
- This margin comes from the uniqueness of the anchor and the single-hop lookup, not from the predictability of the content. Removing the language prior (H3) was expected to expose V-cache error, but the data (H4) shows it does not: the lookup margin alone is sufficient. Attention to a unique anchor is near one-hot, so a single value vector dominates each emitted token — and a 4-bit-quantized lone value vector still decodes to the correct token. KV-quant noise needs diffuse attention (many positions averaged, errors compounding) or confusable keys (a spread softmax that K-noise can tip) to bite — neither of which a unique-name copy creates, whatever the content’s entropy.
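The margin argument can be made concrete with a toy attention lookup. Everything in this sketch is an assumption for illustration: random Gaussian keys, a query equal to one anchor key, and a simplified symmetric 4-bit absmax round-trip that is not llama.cpp's q4_0 kernel. It compares the score margin for a unique anchor with the margin once a near-duplicate key is present.

```python
import math
import random


def quantize_4bit(vec, block=32):
    """Symmetric absmax 4-bit round-trip per block (simplified stand-in for q4_0)."""
    out = []
    for i in range(0, len(vec), block):
        chunk = vec[i:i + block]
        scale = max(abs(x) for x in chunk) / 7 or 1.0
        out.extend(max(-8, min(7, round(x / scale))) * scale for x in chunk)
    return out


def scores(query, keys):
    d = len(query)
    return [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]


def margin(s):
    top, second = sorted(s, reverse=True)[:2]
    return top - second


def argmax(s):
    return max(range(len(s)), key=s.__getitem__)


rng = random.Random(0)
d, n, target = 64, 256, 100
keys = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
query = list(keys[target])  # the query matches exactly one anchor key

full = scores(query, keys)
quant = scores(query, [quantize_4bit(k) for k in keys])

# Confusable case: append a near-duplicate of the anchor key
near_dup = [x + 0.05 * rng.gauss(0, 1) for x in keys[target]]
confusable = scores(query, keys + [near_dup])

print("unique anchor margin (full precision):", round(margin(full), 3))
print("unique anchor margin (4-bit keys):    ", round(margin(quant), 3))
print("confusable anchor margin:             ", round(margin(confusable), 3))
print("argmax preserved under 4-bit keys:", argmax(full) == argmax(quant))
```

With a unique anchor the margin is far larger than the perturbation that 4-bit rounding introduces, so the lookup lands on the same position. With a near-duplicate key the margin shrinks to the same order as the rounding noise, which is the regime where key quantization could plausibly change the outcome.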
This reframes the practical “q8 feels better than q4” reports: whatever they measure, it is not exact-match single-hop retrieval, which is saturated and precision-insensitive down to q4_0. The aggregate degradation that perplexity / KL-divergence captures is an average over all tokens — including the many low-margin, genuinely-uncertain positions where the model is reasoning or generating, not copying. KV-quant error bites there; it does not bite a high-margin copy.
Open questions / next steps
- Stress attention, not the prior. With H4 confirmed, the lever to break recall is confusable anchors / distractors (many similarly-named functions, or “the function that does X” instead of the literal name), so K-cache error can flip which token attention retrieves. This is the regime where q8 K-cache should finally beat q4. (RULER’s multi-key / multi-value / variable-tracking tasks are built for exactly this; the single-needle copy here is known to saturate.)
- Continuous aggregate metric. Run llama-perplexity (already in the build) over a fixed corpus across the same -ctk/-ctv grid, ideally in KL-divergence mode against a bf16-KV reference. This measures the distribution-level degradation that recall pass/fail cannot, and is the honest tool for the “is q8 better than q4” question on tasks that are not high-margin copies.
- Strip the <think> prefill echo in scoring so the hallucinated column reads honestly once a real signal appears.
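To illustrate why a KL-divergence metric sees what exact-match recall cannot, here is a small sketch of per-position KL between a reference and a perturbed logit vector. It is a toy under stated assumptions, not llama-perplexity's implementation: the logits are invented, and the same small perturbation is applied to a high-margin (copy-like) position and a low-margin (uncertain) position.

```python
import math


def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_divergence(ref_logits, test_logits):
    """KL(ref || test) between the two softmax distributions, in nats."""
    ref = log_softmax(ref_logits)
    test = log_softmax(test_logits)
    return sum(math.exp(r) * (r - t) for r, t in zip(ref, test))


def top1(logits):
    return max(range(len(logits)), key=logits.__getitem__)


delta = [-0.1, 0.1, 0.0]  # the same perturbation applied to both positions

high_margin_ref = [10.0, 0.0, 0.0]  # copy-like position: one token dominates
low_margin_ref = [1.0, 0.9, 0.0]    # uncertain position: two close candidates

high_margin_test = [a + b for a, b in zip(high_margin_ref, delta)]
low_margin_test = [a + b for a, b in zip(low_margin_ref, delta)]

kl_high = kl_divergence(high_margin_ref, high_margin_test)
kl_low = kl_divergence(low_margin_ref, low_margin_test)

print("high-margin: KL =", kl_high, "top-1 kept:", top1(high_margin_ref) == top1(high_margin_test))
print("low-margin:  KL =", kl_low, "top-1 kept:", top1(low_margin_ref) == top1(low_margin_test))
```

The identical perturbation leaves the high-margin position's top token unchanged with a near-zero KL, while the low-margin position both flips its top token and contributes a much larger KL. A mean over all positions is therefore dominated by the uncertain ones, which a verbatim-copy benchmark never exercises.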
Reproduction
The harness scripts are in this gist.
# one corpus, full 36-combo KV-quant sweep against the running :4000 server
CORPUS=random-bodies ./sweep-kv-quants.sh
# regenerate the incompressible corpus (seeded, reproducible)
./gen-random-corpus.py --functions 200 --value-len 24 --seed 42
# analyze a completed (or partial) sweep
.venv/bin/python grid.py random-bodies
.venv/bin/python heatmap.py random-bodies