KV-Cache Quantization vs Positional Recall

2026-06-10

Quantizing the KV cache is supposed to trade recall for memory. To find the cost, I ran four full -ctk x -ctv sweeps—2304 verbatim function recalls in total—against a heavily weight-quantized Qwen3.6-27B served by llama.cpp.

Setup: qwen3.6-27b-n4_0-mse.gguf (~Q3-effective weight quant, built with the MSE quant PR), llama.cpp on :4000, flash-attention on, temperature 0, 262K context window. Harness: codeneedle, a positional verbatim-recall benchmark.

Bottom line

Across four 36-combo -ctk x -ctv sweeps — memorized code, unseen code at two depths, and synthetic incompressible content at 140K tokens — KV-cache quantization down to q4_0 shows no measurable cost to verbatim positional recall. q4_0/q4_0 matches bf16/bf16 to within sampling noise in every run, with no precision-aligned gradient and no advantage to mixed pairings like q8/q4. The reason is mechanical: verbatim recall is a single-hop copy off a unique anchor, a near one-hot attention lookup whose margin swamps the noise that 4-bit K/V adds. This is a negative result: it does not say q4 cache is free in general — only that exact-match retrieval, the thing long-context recall benchmarks usually measure, is the wrong probe for KV-quant degradation. To see q8 beat q4 you have to leave the high-margin-copy regime (confusable keys, multi-hop, aggregation).

Hypothesis

Quantizing the KV cache should degrade an LLM’s ability to reproduce text verbatim from long context, and the degradation should worsen as precision drops (bf16 > q8_0 > q5 > q4). If true, a sweep over -ctk x -ctv quantization pairings should show a precision-aligned gradient: the high-precision corner (bf16/bf16) scores best, the low-precision corner (q4_0/q4_0) worst, and mixed pairings (q8/q4) land in between — quantifying the cost of cheaper cache, and settling whether mixed quants buy anything over the matched low-precision pair.

This was motivated by hands-on reports that the same model serving at q8/q8 “feels better” than at q4/q4, and by skepticism that mixed pairings like q8/q4 are genuinely better than q4/q4 (mean KL-divergence often looks identical).

Hypothesis as refined by evidence

The original hypothesis failed on real code (see Results). It was successively narrowed:

H1 (original): KV quant degrades verbatim recall, monotonically with precision.
H2 (memorization confound): the model recalls memorized code (jQuery) from weights, not the cache, so quantizing the cache changes nothing. Test with code the model has never seen.
H3 (compressibility): even unseen code is highly predictable — a noisy cache read is reconstructed from the language prior, laundering KV error. Test with incompressible content (random identifiers bound to random string literals), where no prior exists and every output token must come from the cache.
H4 (confirmed): even incompressible content recalls cleanly under q4_0 KV. The full 36-combo grid is flat at ~20/20: 33 cells score 20.00 and the other three score 19.94. The bottleneck for KV-quant recall failure is therefore not the absence of a prior but the margin of the attention lookup itself, which is large for a single-hop copy with a unique anchor regardless of content. Breaking it would require stressing attention (confusable keys / distractors), not just removing the prior.

Methodology

The task

For each target function, stuff the entire source file into the model’s context, then ask it to reproduce the first 20 lines of the named function’s body, verbatim. This measures positional recall under long context, not named-entity lookup.

Extraction: named functions with >=20 body lines (.js via esprima, .py via ast, .php via token_get_all).
Sampling: stratified by line position, k=16, seed=42 — so recall is probed at all depths, not just the tail.
Scoring: LCS alignment of produced vs expected lines. primary_matched = how many of the 20 expected lines appear at the right position. Pass = >=8/20. relax_indent=true: leading whitespace stripped on both sides (this build re-indents recalled code). hallucinated = produced lines that match nothing expected.
CoT suppression: prefill_no_think (the only technique this Qwen3.6 build honors). It echoes an empty <think>\n</think> into the response — a uniform +2 to hallucinated, not a recall signal.
Determinism: temperature 0, so within one server config a prompt is deterministic; the only variable across cells is the KV quant.
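The scoring rule above can be sketched in a few lines. This is a minimal illustration, not codeneedle's actual code: the function names are hypothetical, blank lines are assumed to be ignored, and hallucinated is assumed to mean "produced lines left unaligned by the LCS".

```python
def normalize(lines, relax_indent=True):
    """Strip trailing whitespace (and leading, if relax_indent); drop blank lines."""
    out = []
    for line in lines:
        line = line.rstrip()
        if relax_indent:
            line = line.lstrip()
        if line:
            out.append(line)
    return out


def lcs_length(a, b):
    """Length of the longest common subsequence of two line lists (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def score(expected, produced, relax_indent=True, pass_threshold=8):
    """Score one recall: in-order matched lines, unaligned lines, pass/fail."""
    exp = normalize(expected, relax_indent)
    got = normalize(produced, relax_indent)
    matched = lcs_length(exp, got)
    return {
        "primary_matched": matched,
        "hallucinated": len(got) - matched,  # assumption: unaligned produced lines
        "passed": matched >= pass_threshold,
    }
```

Under these assumptions, a perfect but re-indented recall preceded by the empty <think> echo scores primary_matched = 20 and hallucinated = 2, which is the uniform offset described above.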

The sweep

sweep-kv-quants.sh walks the full 6x6 grid of -ctk x -ctv over {bf16, q8_0, q5_1, q5_0, q4_1, q4_0} = 36 combinations. Per combo: relaunch llama-server with that K/V cache quant, wait for /health, run a tagged 16-function round, kill the server. Resumable per-combo; quant order shuffled each run. iq4_nl excluded (no flash-attention kernel — loads but wedges inference).

f16 would belong in the grid but this build lacks it; bf16 is the highest-precision reference.
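As a rough sketch of how the grid can be enumerated (the real sweep-kv-quants.sh is a shell script and may be organized differently), the 36 combos are the Cartesian product of the six cache types with themselves. The helper names below are hypothetical. The command line uses only -m, --port, -ctk and -ctv; the flash-attention flag is left out because its spelling varies between llama.cpp builds.

```python
import itertools
import random

CACHE_TYPES = ["bf16", "q8_0", "q5_1", "q5_0", "q4_1", "q4_0"]


def sweep_plan(seed=None):
    """All (ctk, ctv) pairs of the 6x6 grid, in shuffled order."""
    combos = list(itertools.product(CACHE_TYPES, repeat=2))
    random.Random(seed).shuffle(combos)
    return combos


def server_command(model_path, ctk, ctv, port=4000):
    """Argument list for one llama-server launch with the given K/V cache types."""
    return ["llama-server", "-m", model_path, "--port", str(port),
            "-ctk", ctk, "-ctv", ctv]


if __name__ == "__main__":
    for ctk, ctv in sweep_plan(seed=0):
        print(" ".join(server_command("qwen3.6-27b-n4_0-mse.gguf", ctk, ctv)))
```

Shuffling the order means any slow drift on the host (thermal state, background load) is not confounded with bit-width.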

Corpora

corpus	content	size	depth (tokens)	model has seen it?
`jquery`	jQuery source (JS)	~280 KB	~80K	yes (memorized)
`proprietary`	private PHP, trimmed	~230 KB	~64K	no
`proprietary-full`	private PHP, full	~453 KB	~122K	no
`random-bodies`	generated Python, incompressible	~206 KB	~140K	no (synthetic)

random-bodies is produced by gen-random-corpus.py: N functions whose bodies are <random ident> = "<random literal>" lines. Both name and value are high-entropy, so no language prior can reconstruct a line. Seeded — same knobs produce a byte-identical file.
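A minimal sketch of such a generator follows. It is an illustration, not gen-random-corpus.py itself: the function name, the fn_ and v_ prefixes, the identifier length, and the 20-line bodies are assumptions chosen so the output parses as Python and clears the >=20-line extraction threshold.

```python
import random
import string


def random_token(rng, length, alphabet):
    """A high-entropy string drawn uniformly from alphabet."""
    return "".join(rng.choice(alphabet) for _ in range(length))


def gen_corpus(functions=200, body_lines=20, ident_len=12, value_len=24, seed=42):
    """Python source whose function bodies are random-ident = "random-literal" lines."""
    rng = random.Random(seed)  # a private, seeded RNG makes the file reproducible
    letters = string.ascii_lowercase
    literal_chars = string.ascii_letters + string.digits
    out = []
    for _ in range(functions):
        out.append(f"def fn_{random_token(rng, ident_len, letters)}():")
        for _ in range(body_lines):
            ident = "v_" + random_token(rng, ident_len, letters)
            value = random_token(rng, value_len, literal_chars)
            out.append(f'    {ident} = "{value}"')
        out.append("")
    return "\n".join(out)
```

Because every character comes from one seeded random.Random instance, the same arguments always yield the same text, and no line can be predicted from its neighbours.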

Results

Summary across all four runs

Each cell of every grid is 16 functions; each run is 36 quant combos (576 function recalls per run). “avg matched” is mean primary_matched out of 20.

run	combos	pass /16 (mean +/- sd)	avg matched /20 (mean +/- sd)	matched range	gradient?
`jquery` (memorized)	36	15.89 +/- 0.31	19.49 +/- 0.30	18.50 - 19.88	none
`proprietary` (64K, unseen)	36	16.00 +/- 0.00	19.06 +/- 0.11	18.81 - 19.31	none
`proprietary-full` (122K)	36	15.97 +/- 0.16	19.40 +/- 0.22	18.38 - 19.62	none
`random-bodies` (140K, incompressible)	36	16.00 +/- 0.00	19.99 +/- 0.02	19.94 - 20.00	none

In every run, recall is flat across the entire quantization grid. The cell-to-cell variation (sd 0.02-0.30 matched lines, depending on the run) is smaller than a single recalled line and is not precision-aligned: in several runs the q4_0 rows score higher than the bf16 rows, which is anti-physical and the signature of sampling jitter, not degradation.

Run 3 detail — `proprietary-full` (122K tokens), avg matched /20

A representative full grid. Note q4_0 (least precise) is among the best rows:

ctk\ctv    bf16    q8_0    q5_1    q5_0    q4_1    q4_0
   bf16   19.25   19.38   19.44   18.38   19.19   19.44
   q8_0   19.19   19.56   19.56   19.44   19.44   19.56
   q5_1   19.44   19.31   19.62   19.38   19.56   19.12
   q5_0   19.50   19.38   19.56   19.50   19.50   19.50
   q4_1   19.31   19.56   19.56   19.44   19.06   19.25
   q4_0   19.56   19.56   19.44   19.56   19.44   19.56
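The summary statistics and the absence of a gradient can be checked directly from this grid. The sketch below uses only the values printed above. Whether the summary table reports a population or a sample standard deviation is not stated; population is assumed here, and both round to the same value for this grid.

```python
import statistics

QUANTS = ["bf16", "q8_0", "q5_1", "q5_0", "q4_1", "q4_0"]

# Rows are -ctk, columns are -ctv, in QUANTS order (avg matched /20).
GRID = [
    [19.25, 19.38, 19.44, 18.38, 19.19, 19.44],
    [19.19, 19.56, 19.56, 19.44, 19.44, 19.56],
    [19.44, 19.31, 19.62, 19.38, 19.56, 19.12],
    [19.50, 19.38, 19.56, 19.50, 19.50, 19.50],
    [19.31, 19.56, 19.56, 19.44, 19.06, 19.25],
    [19.56, 19.56, 19.44, 19.56, 19.44, 19.56],
]

cells = [value for row in GRID for value in row]
summary = {
    "mean": statistics.mean(cells),
    "sd": statistics.pstdev(cells),  # assumption: population sd
    "min": min(cells),
    "max": max(cells),
}

# Mean per K-cache type (rows) and per V-cache type (columns).
row_means = {q: statistics.mean(row) for q, row in zip(QUANTS, GRID)}
col_means = {q: statistics.mean(col) for q, col in zip(QUANTS, zip(*GRID))}

if __name__ == "__main__":
    print({k: round(v, 2) for k, v in summary.items()})
    print({k: round(v, 2) for k, v in row_means.items()})
    print({k: round(v, 2) for k, v in col_means.items()})
```

The grid mean and sd come out at 19.40 and 0.22, matching the summary table. By K-cache row, bf16 has the lowest mean (19.18) and q4_0 the highest (19.52), the opposite of what a precision-aligned gradient would predict.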

The hallucination column is an artifact, not a signal

On proprietary-full, the hallucinated metric spiked wildly (2.5 to 19.7 avg). This is not recall degradation. Tracing the worst cells: the model reproduced the 20 requested lines perfectly (matched=20), then failed to stop — it kept faithfully copying the source file (the rest of the function, the next function’s docblock, then other functions’ comments) until it hit max_tokens. So a high hallucinated count here means the model read the cache so well it ran on, not that it produced garbage. Which functions run on is scattered across cells with no precision alignment — a stop-boundary decision sensitive to exact KV state, deterministic within a cell but not monotonic with bit-width. Noise with respect to the recall question.

Run 4 detail — `random-bodies` (140K tokens, incompressible), avg matched /20

The decisive run, and the cleanest null in the set. A calibration check ran first and passed: q8/q8 recalled 20/20, so the model can perfectly copy ~400 incompressible tokens at 140K depth. The task is therefore achievable at this depth, and any degradation at lower precision would show up as a drop from 20/20. The full grid:

ctk\ctv    bf16    q8_0    q5_1    q5_0    q4_1    q4_0
   bf16   20.00   20.00   20.00   20.00   20.00   20.00
   q8_0   20.00   19.94   19.94   20.00   20.00   20.00
   q5_1   20.00   20.00   20.00   20.00   20.00   20.00
   q5_0   20.00   20.00   20.00   20.00   20.00   20.00
   q4_1   20.00   20.00   20.00   20.00   20.00   20.00
   q4_0   20.00   20.00   20.00   20.00   19.94   20.00

Thirty-three of 36 cells are a perfect 20.00. Each of the three 19.94 cells is a single missed line in one of 16 functions (319/320 lines), and one of them is q8/q8 itself, the calibration config and one of the highest-precision cells in the grid. The misses are scattered with no relation to bit-width: this is sampling jitter, not degradation. The hallucinated column is clean and uniform (~2.0 = the <think> echo), because random content gives the model no coherent file to run on past the 20 lines.
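The arithmetic behind the 19.94 cells and the 19.99 grid mean is small enough to spell out. The sketch assumes only the figures already given: 16 functions per cell, 20 scored lines per function, and three cells with one missed line each.

```python
FUNCTIONS_PER_CELL = 16
LINES_PER_FUNCTION = 20


def cell_average(missed_lines):
    """Average matched lines per function for a cell with missed_lines misses."""
    total_lines = FUNCTIONS_PER_CELL * LINES_PER_FUNCTION  # 320
    return (total_lines - missed_lines) / FUNCTIONS_PER_CELL


perfect = cell_average(0)   # 320 / 16
one_miss = cell_average(1)  # 319 / 16 = 19.9375, reported as 19.94

# 33 perfect cells plus 3 cells with a single missed line each.
grid_mean = (33 * perfect + 3 * one_miss) / 36

if __name__ == "__main__":
    print(perfect, one_miss, round(grid_mean, 2))
```

Three missed lines out of 11,520 scored lines (36 cells x 320) is the entire deviation from a perfect grid.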

The result: q4_0 KV cache shows no measurable loss for verbatim single-hop recall, even of incompressible content at 140K tokens. q4_0/q4_0 equals bf16/bf16 to within noise.

Interpretation

Verbatim recall of a named span is mechanically a copy: locate the anchor (the function name, a literal string present in the context), attend to it, emit the tokens that follow. This is an induction-head operation — among the earliest and most robust circuits transformers form, sharp even in small, heavily weight-quantized models.

Why it resists KV quantization, even with no prior to lean on:

The correct next token in a copy sits at a large logit margin over all alternatives. Low-bit K/V perturbs attention scores and retrieved values slightly, but not enough to flip a high-confidence argmax, and at temperature 0 "perturbed but still argmax-correct" emits the identical token.
This margin comes from the uniqueness of the anchor and the single-hop lookup, not from the predictability of the content. Removing the language prior (H3) was expected to expose V-cache error, but the data (H4) shows it does not: the lookup margin alone is sufficient. Attention to a unique anchor is near one-hot, so a single value vector dominates each emitted token — and a 4-bit-quantized lone value vector still decodes to the correct token. KV-quant noise needs diffuse attention (many positions averaged, errors compounding) or confusable keys (a spread softmax that K-noise can tip) to bite — neither of which a unique-name copy creates, whatever the content’s entropy.
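A toy example makes the margin argument concrete. It is a hand-built illustration, not a model of llama.cpp's kernels: quantize4 is a simplified symmetric 4-bit quantizer with one absmax scale per vector (real q4_0 uses per-block scales), and the two-dimensional keys and query are invented values.

```python
def quantize4(vec):
    """Simplified 4-bit quantizer: snap each value to a multiple of absmax / 7."""
    scale = max(abs(x) for x in vec) / 7
    return [round(x / scale) * scale for x in vec]


def winner(query, keys):
    """Index of the key with the highest dot-product score against query."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    return scores.index(max(scores))


QUERY = [0.0, 1.0]

# Unique anchor: key 0 scores 1.0, the distractor 0.1 (large margin).
UNIQUE = [[0.2, 1.0], [1.0, 0.1]]

# Confusable keys: key 0 scores 0.64, key 1 scores 0.60 (small margin).
CONFUSABLE = [[1.30, 0.64], [1.00, 0.60]]

if __name__ == "__main__":
    for name, keys in [("unique", UNIQUE), ("confusable", CONFUSABLE)]:
        before = winner(QUERY, keys)
        after = winner(QUERY, [quantize4(k) for k in keys])
        print(f"{name}: winner before={before}, after 4-bit keys={after}")
```

In the unique-anchor case the winning key is unchanged after quantization. In the confusable case the two keys get different scales, rounding moves 0.64 down to about 0.557 and 0.60 to about 0.571, and the winner flips. That is the regime the single-needle copy never enters.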

This reframes the practical “q8 feels better than q4” reports: whatever they measure, it is not exact-match single-hop retrieval, which is saturated and precision-insensitive down to q4_0. The aggregate degradation that perplexity / KL-divergence captures is an average over all tokens — including the many low-margin, genuinely-uncertain positions where the model is reasoning or generating, not copying. KV-quant error bites there; it does not bite a high-margin copy.

Open questions / next steps

Stress attention, not the prior. With H4 confirmed, the lever to break recall is confusable anchors / distractors — many similarly-named functions, or “the function that does X” instead of the literal name — so K-cache error can flip which token attention retrieves. This is the regime where q8 K-cache should finally beat q4. (RULER’s multi-key / multi-value / variable-tracking tasks are built for exactly this; the single-needle copy here is known to saturate.)
Continuous aggregate metric. Run llama-perplexity (already in the build) over a fixed corpus across the same -ctk/-ctv grid, ideally in KL-divergence mode against a bf16-KV reference. This measures the distribution-level degradation that recall pass/fail cannot, and is the honest tool for the “is q8 better than q4” question on tasks that are not high-margin copies.
Strip the <think> prefill echo in scoring so the hallucinated column reads honestly once a real signal appears.
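To illustrate what the continuous aggregate metric proposed above captures that exact-match scoring cannot, here is a minimal sketch of per-position KL divergence between a reference and a test next-token distribution. It is independent of llama-perplexity's implementation; the logit values in the usage note are invented for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(ref_logits, test_logits):
    """KL(ref || test) in nats for one token position."""
    p = softmax(ref_logits)
    q = softmax(test_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mean_kl(ref_positions, test_positions):
    """Average KL over aligned positions of two runs on the same text."""
    pairs = list(zip(ref_positions, test_positions))
    return sum(kl_divergence(r, t) for r, t in pairs) / len(pairs)
```

For example, a reference position with logits [2.0, 1.9, 0.0] and a perturbed one with [2.0, 1.95, 0.0] share the same argmax, so an exact-match metric at temperature 0 sees no difference, yet their KL divergence is strictly positive. Averaged over many low-margin positions, that is the kind of shift a recall pass/fail score cannot register.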

Reproduction

The harness scripts are in this gist.

# one corpus, full 36-combo KV-quant sweep (relaunches llama-server on :4000 for each combo)
CORPUS=random-bodies ./sweep-kv-quants.sh

# regenerate the incompressible corpus (seeded, reproducible)
./gen-random-corpus.py --functions 200 --value-len 24 --seed 42

# analyze a completed (or partial) sweep
.venv/bin/python grid.py    random-bodies
.venv/bin/python heatmap.py random-bodies