IntentKV: Cross-Turn Intent-Aware KV Cache Pruning for Agent Inference
Researchers have developed IntentKV, a method for pruning the key-value (KV) cache of large language model (LLM) agents to improve inference efficiency. The technique maintains a session-level memory of cross-turn intent and uses it to score cached tokens, selectively dropping low-scoring ones without significant accuracy loss. IntentKV has demonstrated substantial reductions in peak request tokens and KV reads, particularly on long-horizon agent tasks, while leaving the base LLM unchanged.
IMPACT: Reduces KV cache size for LLM agents, potentially lowering inference costs and allowing longer contexts to fit within the same memory budget.
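
The summary above does not say how IntentKV represents intent or scores tokens, so the following is only a toy sketch of the general idea (keep a running intent state across turns, score cached entries against it, keep the best within a budget). Everything in it is an assumption made for illustration: the names `IntentMemory` and `prune_cache`, the exponential moving average over per-turn vectors, the cosine-similarity score, and the rule that the most recent entries are always kept. It is not the authors' algorithm.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class IntentMemory:
    """Hypothetical session-level intent state: an EMA over per-turn vectors."""
    decay: float = 0.5
    vector: list = field(default_factory=list)

    def update(self, turn_vector):
        # First turn initialises the state; later turns are blended in.
        if not self.vector:
            self.vector = list(turn_vector)
        else:
            self.vector = [
                self.decay * old + (1.0 - self.decay) * new
                for old, new in zip(self.vector, turn_vector)
            ]


def prune_cache(token_keys, intent, budget, keep_recent=1):
    """Return the sorted indices of cache entries to keep.

    Assumed policy: always keep the last `keep_recent` entries (capped at
    `budget`), then fill the remaining slots with the entries whose key
    vectors are most similar to the session intent vector.
    """
    n = len(token_keys)
    keep_recent = min(keep_recent, budget, n)
    recent = set(range(n - keep_recent, n))
    candidates = [i for i in range(n) if i not in recent]
    candidates.sort(key=lambda i: cosine(token_keys[i], intent.vector), reverse=True)
    slots = budget - len(recent)
    return sorted(recent | set(candidates[:slots]))


# Two turns with the same intent direction, then prune a 4-entry cache to 3.
memory = IntentMemory()
memory.update([1.0, 0.0])
memory.update([1.0, 0.0])
keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.0, 1.0]]
kept = prune_cache(keys, memory, budget=3, keep_recent=1)
print(kept)  # [0, 2, 3]
```

Here entry 1 is dropped because it is orthogonal to the accumulated intent, while entry 3 survives only because it is the most recent. A real system would operate on per-layer, per-head key tensors and would need to validate accuracy after pruning; this sketch only illustrates the score-then-drop pattern without touching model weights.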