Large language models that rely on chain-of-thought (CoT) or self-reflection can crack tough reasoning tasks, but at the cost of very long outputs that bloat the key–value (KV) cache during inference. Traditional cache-compression schemes, tuned for long prompts, stumble on these generated traces, retaining only ≈60 % of the original accuracy when restricted to 10 % of the cache.
R-KV (Redundancy-aware KV-cache compression for Reasoning models) solves this by ranking tokens on the fly for both importance and non-redundancy, retaining only the informative, diverse ones.
| Metric | Full KV | R-KV (ours) |
|---|---|---|
| KV cache kept | 100 % | 10 % |
| Accuracy | 100 % | ≈100 % |
| Throughput ↑ | 1× | 6.6× |
| Memory saved ↓ | – | 90 % |
At 16 % cache, noise removal even nudges accuracy to 105 % of the full baseline.
- Up to 90 % KV-cache memory savings with zero (sometimes even negative) accuracy loss.
- Plug-and-play: a lightweight wrapper for any autoregressive LLM (see the decode-loop sketch below).
- Training-free: drop straight into inference or RL roll-outs; no finetuning required.
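To make the plug-and-play claim concrete, here is a minimal sketch of where an R-KV-style compressor could sit in a plain autoregressive decode loop. The callables `model_step` and `compress_kv` and the default sizes are hypothetical placeholders, not the project's actual API.

```python
# Hypothetical decode loop showing where on-the-fly KV compression slots in.
# `model_step` and `compress_kv` are illustrative callables, NOT R-KV's real API.
from typing import Callable, List

def generate_with_compression(
    model_step: Callable,        # (token_ids, kv_cache) -> (next_token, kv_cache)
    compress_kv: Callable,       # (kv_cache, budget)    -> compressed kv_cache
    prompt_ids: List[int],
    max_new_tokens: int = 4096,
    buffer_size: int = 128,      # B_buffer: fresh tokens staged before compression
    budget: int = 1024,          # B_budget: tokens retained after compression
    eos_id: int = 2,             # assumed end-of-sequence token id
) -> List[int]:
    generated: List[int] = []
    next_token, kv_cache = model_step(prompt_ids, None)      # prefill
    staged = 0

    while len(generated) < max_new_tokens:
        generated.append(next_token)
        if next_token == eos_id:
            break
        next_token, kv_cache = model_step([next_token], kv_cache)
        staged += 1
        # After a chunk of B_buffer fresh tokens, rank retained + fresh tokens
        # jointly and keep only the top `budget` entries.
        if staged == buffer_size:
            kv_cache = compress_kv(kv_cache, budget)
            staged = 0
    return generated
```

Because the model is never finetuned and the compressor only selects which cached tensors to keep, the same loop works unchanged for inference or RL roll-outs.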
Chain-of-thought (CoT) and self-reflection unlock impressive reasoning, but they explode the KV cache: a single DeepSeek-R1-Distill-8B run on a tough math problem can produce an extremely long reasoning trace, and the cache grows with every generated token.
Existing compression tools focus on long prompts and falter on long generations—often pruning the wrong tokens because redundant self-checks still attend heavily to themselves.
R-KV tackles redundant key/value (KV) tokens by compressing the KV cache on the fly while the model is decoding, keeping only tokens that are important and non-redundant. The four stages are summarized in the table below, followed by a code sketch of the selection step.
| Part | What happens | Key idea |
|---|---|---|
| 1. Decoding-time KV staging | Newly generated tokens are written to a staging buffer of size `B_buffer`. | A separate buffer lets us decide what to keep after seeing a chunk of text. |
| 2. Importance scoring | Attention weights from the last α observation tokens score each candidate token. | High attention ⇒ the token is critical for future predictions. |
| 3. Redundancy estimation | Cosine similarity is computed between key vectors; older near-duplicates are marked redundant. | Keeps the semantics while pruning repetition. |
| 4. Joint selection | Final score `Z = λ·Importance − (1−λ)·Redundancy`. The top-k tokens plus the observation tokens are retained in the budgeted cache. | One knob (`λ`) trades memory vs. quality. |
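The selection step can be sketched in a few lines of PyTorch for a single layer and head. This is a minimal illustration, assuming importance is the mean attention a candidate receives from the last α observation tokens and redundancy is its maximum cosine similarity to any other candidate key; the paper's exact scoring details may differ.

```python
import torch
import torch.nn.functional as F

def joint_select(keys: torch.Tensor,   # [T, d] key vectors of candidate tokens
                 attn: torch.Tensor,   # [alpha, T] attention weights from the
                                       #   last alpha observation tokens
                 lam: float = 0.5,     # λ: importance vs. redundancy trade-off
                 budget: int = 1024) -> torch.Tensor:
    # Importance: how strongly the recent observation window attends to each token.
    importance = attn.mean(dim=0)                      # [T]

    # Redundancy: maximum cosine similarity to any other candidate key;
    # near-duplicate self-checks score high and get pruned.
    k = F.normalize(keys, dim=-1)
    sim = k @ k.t()                                    # [T, T]
    sim.fill_diagonal_(0.0)
    redundancy = sim.max(dim=-1).values                # [T]

    # Joint score Z = λ·Importance − (1−λ)·Redundancy; keep the top-k tokens.
    z = lam * importance - (1.0 - lam) * redundancy
    keep = torch.topk(z, k=min(budget, keys.shape[0])).indices
    return keep.sort().values   # indices (in original order) to retain
```

The single knob `lam` moves smoothly between pure attention-based eviction (`lam = 1`) and pure de-duplication (`lam = 0`).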
Our method surpasses the baselines by a large margin on challenging math benchmarks and, surprisingly, even outperforms full KV.
R-KV keeps two small, fixed-size buffers, so memory usage remains constant, unlike FullKV, whose memory grows linearly with the generated sequence:
| Buffer | Purpose | Shape |
|---|---|---|
| KV budget | Retained tokens | `b × B_budget × L × H × d` |
| KV buffer | Fresh tokens | `b × B_buffer × L × H × d` |
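As a back-of-the-envelope illustration of the savings, the snippet below assumes fp16 and Llama-8B-style dimensions (32 layers, 8 KV heads, head dimension 128) with a 32K-token trace and a 10 % budget; the exact numbers depend on the model and sequence length.

```python
# KV-cache memory for batch size 1 (illustrative, not measured figures).
def kv_bytes(tokens: int, layers: int = 32, kv_heads: int = 8,
             head_dim: int = 128, bytes_per_value: int = 2) -> int:
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_value  # K and V

full_kv = kv_bytes(32_000)        # FullKV over a 32K-token reasoning trace
r_kv    = kv_bytes(3_200 + 128)   # fixed 10 % budget plus a small staging buffer

print(f"FullKV: {full_kv / 2**30:.2f} GiB   R-KV: {r_kv / 2**30:.2f} GiB")
# ≈3.9 GiB vs ≈0.4 GiB: FullKV keeps growing, the two R-KV buffers stay constant.
```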
| Part | Complexity |
|---|---|
| Importance scoring | `O(α × B_budget)` |
| Redundancy scoring | `O(B_budget²)` |
| Attention (compressed) | `O((B_budget + B_buffer) × B_buffer)` |
| Attention (FullKV) | `O(B_full × B_buffer)` |
For long sequences (`B_full ≫ B_budget`), the small retained cache more than offsets the scoring overhead.
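A quick worked example with hypothetical sizes (`B_budget = 1024`, `B_buffer = 128`, `α = 8`, `B_full = 32 000`) shows why: attention plus scoring over the compressed cache costs a few times fewer key comparisons per chunk than attention over the full cache, and the gap widens as the trace grows.

```python
# Rough count of key comparisons per decoded chunk, using hypothetical sizes.
B_budget, B_buffer, B_full, alpha = 1024, 128, 32_000, 8

importance      = alpha * B_budget                  # O(α × B_budget)
redundancy      = B_budget ** 2                     # O(B_budget²)
attn_compressed = (B_budget + B_buffer) * B_buffer  # O((B_budget + B_buffer) × B_buffer)
attn_full       = B_full * B_buffer                 # O(B_full × B_buffer)

print(importance + redundancy + attn_compressed)    # ≈1.2M comparisons
print(attn_full)                                    # ≈4.1M comparisons, still growing
```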
The figure below shows which tokens are picked by R-KV and the pure-attention baseline SnapKV at the same decoding step.
Grey = not selected | Light orange → Dark red = selected tokens (deeper red = chosen by more attention heads)
By combining attention strength with redundancy filtering, R-KV retains the important context and removes noise, successfully completing the task.
If you find R-KV useful in your research, please cite us:
@misc{cai2025rkvredundancyawarekvcache,
title = {R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration},
author = {Zefan Cai and Wen Xiao and Hanshi Sun and Cheng Luo and Yikai Zhang and Ke Wan
and Yucheng Li and Yeyang Zhou and Li-Wen Chang and Jiuxiang Gu
and Zhen Dong and Anima Anandkumar and Abedelkadir Asi and Junjie Hu},
year = {2025},
eprint = {2505.24133},
archivePrefix= {arXiv},
primaryClass = {cs.CL},
url = {https://arxiv.org/abs/2505.24133}
}