R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration

1University of Wisconsin–Madison 2Microsoft 3Carnegie Mellon University 4California Institute of Technology 5University of California–San Diego 6University of Surrey 7Adobe 8University of California–Berkeley

🔥 News

  • 🚀 2025-05-29: We are pleased to announce the initial release of R-KV, a highly efficient decoding-time KV-cache compression method for serving reasoning LLMs.

Overview

Large language models that rely on chain-of-thought (CoT) or self-reflection can crack tough reasoning tasks, but at the cost of very long outputs that bloat the key–value (KV) cache during inference. Traditional cache-compression schemes, tuned for long prompts, stumble on these generated traces, retaining only ≈60 % of the original accuracy when restricted to 10 % of the cache.

R-KV (Redundancy-aware KV-cache compression for Reasoning models) solves this by ranking tokens on the fly for both importance and non-redundancy, retaining only the informative, diverse ones.

🌟 Highlights

| Metric | Full KV | R-KV (ours) |
| --- | --- | --- |
| KV cache kept | 100 % | 10 % |
| Accuracy | 100 % | ≈100 % |
| Throughput ↑ | 1× | 6.6× |
| Memory saved ↓ | 0 % | 90 % |

At a 16 % cache budget, the noise removal even nudges accuracy to 105 % of the full-cache baseline.

✨ Why R-KV?

  • Up to 90 % KV-cache memory savings with zero (sometimes even negative) accuracy loss.
  • Plug-and-play: a lightweight wrapper for any autoregressive LLM.
  • Training-free: drop straight into inference or RL roll-outs; no finetuning required.

Chain-of-thought (CoT) and self-reflection unlock impressive reasoning, but they explode the KV cache. A single DeepSeek-R1-Distill-8B run on a tough math problem can:

  • Generate 32 000 tokens
  • Load 15.5 GB of weights
  • Allocate 4.1 GB of KV cache just to remember its own musings (see the quick check below)
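
That 4.1 GB figure is easy to sanity-check. Below is a back-of-the-envelope calculation assuming Llama-3-8B-style dimensions for the distilled model (32 layers, 8 KV heads, head dim 128, fp16 cache); these architecture constants are our assumptions, not values taken from the paper.

```python
# Rough KV-cache size for a 32K-token trace (architecture constants assumed).
layers, kv_heads, head_dim = 32, 8, 128   # assumed Llama-3-8B-style config
bytes_per_elem = 2                        # fp16
tokens = 32_000

kv_bytes = 2 * tokens * layers * kv_heads * head_dim * bytes_per_elem  # 2 = K and V
print(f"{kv_bytes / 1e9:.1f} GB")         # -> 4.2 GB, in line with the ~4.1 GB above
```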

Existing compression tools focus on long prompts and falter on long generations, often pruning the wrong tokens because redundant self-checks still attend heavily to one another.

🔍 Method — Redundancy-aware KV Cache Compression (R-KV)

R-KV tackles redundant key/value (KV) tokens by compressing the KV cache on the fly while the model decodes, keeping only tokens that are both important and non-redundant.

R-KV method diagram
| Part | What happens | Key idea |
| --- | --- | --- |
| 1. Decoding-time KV staging | Newly generated tokens are written to a buffer of size B_buffer. | A separate buffer lets us decide what to keep after seeing a chunk of text. |
| 2. Importance scoring | Attention weights from the last α observation tokens score each candidate token. | High attention ⇒ the token is critical for future predictions. |
| 3. Redundancy estimation | Cosine similarity is computed between key vectors; older near-duplicates are marked redundant. | Keeps the semantics while pruning repetition. |
| 4. Joint selection | Final score Z = λ·Importance − (1 − λ)·Redundancy; the top-k tokens plus the observation tokens are retained in the budgeted cache. | One knob (λ) trades memory against quality. |
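
To make step 4 concrete, here is a minimal PyTorch sketch of the joint selection. The function name `rkv_select`, the mean-pooled importance score, and the "max cosine similarity to a later token" redundancy score are illustrative assumptions about one reasonable instantiation, not the exact implementation.

```python
import torch
import torch.nn.functional as F

def rkv_select(attn_weights, keys, budget, lam=0.5):
    """Sketch of joint selection: Z = lam * Importance - (1 - lam) * Redundancy.

    attn_weights: (alpha, n) attention from the last alpha observation
                  tokens to each of the n candidate tokens.
    keys:         (n, d) key vectors of the candidate tokens.
    budget:       number of candidates to keep (B_budget).
    lam:          the lambda knob trading memory vs. quality.
    """
    # Importance: how much attention each candidate receives on average.
    importance = attn_weights.mean(dim=0)                     # (n,)

    # Redundancy: max cosine similarity to any LATER token, so the newest
    # copy of a repeated idea survives and older duplicates are penalized.
    k = F.normalize(keys, dim=-1)
    sim = k @ k.T                                             # (n, n)
    later = torch.triu(torch.ones_like(sim, dtype=torch.bool), diagonal=1)
    redundancy = torch.where(later, sim, torch.full_like(sim, -1.0)).max(dim=1).values
    redundancy[-1] = 0.0   # the newest token has no later duplicate

    # Joint score and top-k selection (indices returned in positional order).
    z = lam * importance - (1.0 - lam) * redundancy
    return torch.topk(z, k=min(budget, z.numel())).indices.sort().values
```

Setting λ → 1 recovers a pure-attention policy in the spirit of SnapKV, while λ → 0 keeps only maximally diverse tokens.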

📈 Performance Results

Our method surpasses the baselines by a large margin on challenging math benchmarks and, surprisingly, even outperforms the full KV cache.

Accuracy curves

⚡️ Efficiency Results

Memory Saving

R-KV keeps two small, fixed-size buffers, so memory usage stays constant during decoding, unlike FullKV, whose cache grows linearly with the number of generated tokens:

| Buffer | Purpose | Shape |
| --- | --- | --- |
| KV budget | Retained tokens | b × B_budget × L × H × d |
| KV buffer | Fresh tokens | b × B_buffer × L × H × d |
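
A minimal sketch of what these two buffers might look like in code (the class and attribute names are ours); the point is that both tensors are allocated once and never grow:

```python
import torch

class RKVBuffers:
    """Two fixed-size KV buffers: memory is constant regardless of output length."""

    def __init__(self, b, B_budget, B_buffer, L, H, d, dtype=torch.float16):
        # Budgeted cache of retained tokens: b x B_budget x L x H x d (K and V).
        self.k_budget = torch.zeros(b, B_budget, L, H, d, dtype=dtype)
        self.v_budget = torch.zeros(b, B_budget, L, H, d, dtype=dtype)
        # Staging buffer for fresh tokens: b x B_buffer x L x H x d.
        self.k_buffer = torch.zeros(b, B_buffer, L, H, d, dtype=dtype)
        self.v_buffer = torch.zeros(b, B_buffer, L, H, d, dtype=dtype)
        self.staged = 0  # how many fresh tokens currently sit in the buffer
```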

Computation Overhead

| Part | Complexity |
| --- | --- |
| Importance scoring | O(α × B_budget) |
| Redundancy scoring | O(B_budget²) |
| Attention (compressed) | O((B_budget + B_buffer) × B_buffer) |
| Attention (FullKV) | O(B_full × B_buffer) |

For long sequences (B_full ≫ B_budget), attending over the tiny compressed cache more than offsets the scoring overhead.
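
Plugging in some illustrative numbers (all assumed, not from the paper) shows the break-even: per decoding chunk, scoring plus attention over the compressed cache touches far fewer key–query pairs than attention over the full cache.

```python
# Illustrative cost comparison per compression interval (numbers assumed).
alpha, B_budget, B_buffer, B_full = 32, 1024, 128, 32_000

scoring   = alpha * B_budget + B_budget ** 2        # importance + redundancy
attn_rkv  = (B_budget + B_buffer) * B_buffer        # attention, compressed cache
attn_full = B_full * B_buffer                       # attention, full cache

print(f"R-KV: {scoring + attn_rkv:,}  vs  FullKV: {attn_full:,}")
# -> R-KV: 1,228,800  vs  FullKV: 4,096,000 key-query pairs
```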

🔍 R-KV vs. SnapKV: Token-Selection Comparison

The figure below shows which tokens are picked by R-KV and the pure-attention baseline SnapKV at the same decoding step.

Grey = not selected  |  Light orange → Dark red = selected tokens (deeper red = chosen by more attention heads)

Token selection comparison

Key Findings

  • Broader Coverage: the hybrid score (attention strength × redundancy suppression) selects tokens spread across the whole output, not just the most recent text.
  • Higher Information Diversity: the kept tokens capture distinct pieces of information at diverse positions.
  • Significantly Less Redundancy: the redundancy check filters out repetitive, low-value segments.

By combining attention strength with redundancy filtering, R-KV retains the important context and removes noise, successfully completing the task.

📚 Citation

If you find R-KV useful in your research, please cite us:

```bibtex
@misc{cai2025rkvredundancyawarekvcache,
  title         = {R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration},
  author        = {Zefan Cai and Wen Xiao and Hanshi Sun and Cheng Luo and Yikai Zhang and Ke Wan
                   and Yucheng Li and Yeyang Zhou and Li-Wen Chang and Jiuxiang Gu
                   and Zhen Dong and Anima Anandkumar and Abedelkadir Asi and Junjie Hu},
  year          = {2025},
  eprint        = {2505.24133},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2505.24133}
}
```