InfoFlow KV: Information-Flow-Aware KV Recomputation for Long Context

Preprint

Under review at ICML 2026

Jun. 2025 – Feb. 2026

*Co-first author. With Prof. Shengjie Wang, Prof. Tianyi Zhou, and Prof. Danyang Zhuo.

Overview

Retrieval-augmented generation (RAG) for long-context question answering is bottlenecked by inference-time prefilling over large retrieved contexts. We propose InfoFlow KV, which casts selective KV recomputation as an information-flow problem.

Contributions

Built a chunk-based KV prefilling and recomputation pipeline that enables reusable KV caches and recovers cross-chunk attention under long-context inference across multiple LLM families (Qwen, LLaMA, ChatGLM).
Proposed two information-flow-aware recomputation strategies: a query-conditioned attention-norm token selection method aligned with inference-consistent RoPE geometry, and a chunk reordering scheme that improves prompt–context interaction, achieving up to +6% F1 score on HotpotQA over prior methods.
Analyzed how RoPE positional geometry affects token selection in chunk-wise long-context inference, identifying frequency range and prompt–context proximity as key factors.
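
To make the token-selection idea in the contributions above concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the project's implementation: the function names `rope` and `select_tokens` are hypothetical, the score is assumed to be the L2 norm of the attention each context token receives from the prompt tokens, and RoPE is applied in the interleaved-pair form with base 10000. "Inference-consistent" is read here as rotating each cached key with the position it would occupy in the full concatenated sequence rather than its chunk-local position.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles (RoPE)."""
    d = len(vec)
    out = list(vec)
    for i in range(0, d - 1, 2):
        theta = pos * base ** (-i / d)  # lower pairs rotate faster
        c, s = math.cos(theta), math.sin(theta)
        out[i] = vec[i] * c - vec[i + 1] * s
        out[i + 1] = vec[i] * s + vec[i + 1] * c
    return out


def select_tokens(queries, keys, query_positions, key_positions, k):
    """Pick the k context tokens that receive the most attention from the prompt.

    Assumption: the score of a context token is the L2 norm, over prompt
    tokens, of the softmax attention weight it receives.
    Returns (sorted indices of selected tokens, per-token scores).
    """
    d = len(keys[0])
    # Rotate with global positions so the scores match what full inference sees.
    rq = [rope(q, p) for q, p in zip(queries, query_positions)]
    rk = [rope(kv, p) for kv, p in zip(keys, key_positions)]

    sq_sums = [0.0] * len(keys)
    for q in rq:
        logits = [sum(a * b for a, b in zip(q, kv)) / math.sqrt(d) for kv in rk]
        m = max(logits)  # subtract the max for a numerically stable softmax
        exps = [math.exp(x - m) for x in logits]
        z = sum(exps)
        for j, e in enumerate(exps):
            sq_sums[j] += (e / z) ** 2

    scores = [math.sqrt(s) for s in sq_sums]
    top = sorted(range(len(keys)), key=lambda j: scores[j], reverse=True)[:k]
    return sorted(top), scores


# Toy usage: three context tokens at positions 0..2, one prompt token at position 3.
keys = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 5.0, 0.0], [0.0, 0.0, -5.0, 0.0]]
queries = [[0.0, 0.0, 5.0, 0.0]]
selected, scores = select_tokens(queries, keys, [3], [0, 1, 2], k=1)
```

In this toy case the signal sits in the low-frequency pair, so the small positional offsets barely rotate it and the aligned key wins. Moving the same signal into the high-frequency pair would make the selection depend much more strongly on position, which is the kind of frequency-range and proximity effect the last contribution refers to.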