Re-feeding Is Not Replaying: Measuring Replay Noise in Counterfactual Token-Credit Estimation
A new arXiv paper examines the reliability of counterfactual token-credit estimation in language models. It finds that re-feeding the transcript prefix as a fresh prompt, a common method, can introduce significant noise compared with resuming from the verified decode-time KV state. This noise can alter credit estimates, particularly at low-margin decision tokens, and can change which tokens are selected as critical. The study suggests that using batch-invariant kernels or resuming from the decoder state is crucial for more accurate credit estimation, and recommends reporting a replica floor to account for the inherent noise in single-sample measurements.
IMPACT Highlights potential unreliability in current methods for attributing model outputs to specific tokens, with consequences for model interpretability research.
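
The replica-floor idea mentioned in the summary can be illustrated with a minimal sketch. The summary does not say how the paper computes the floor, so this example assumes it is the sample standard deviation of one token's credit across repeated measurements taken under identical conditions. The function names, the threshold `k`, and all numbers are hypothetical and invented for illustration; they are not results from the paper.

```python
from statistics import mean, stdev


def replica_floor(replicas):
    """Noise floor for one token, assumed here to be the sample standard
    deviation of its credit across repeated identical measurements."""
    return stdev(replicas)


def exceeds_floor(refeed_credit, resume_replicas, k=2.0):
    """Return True if a re-fed credit estimate differs from the mean of the
    resumed-state replicas by more than k noise floors (k is an assumption)."""
    floor = replica_floor(resume_replicas)
    return abs(refeed_credit - mean(resume_replicas)) > k * floor


# Toy credit values for a single token, invented for illustration only.
resume_replicas = [0.10, 0.12, 0.11, 0.09]

print(exceeds_floor(0.11, resume_replicas))  # within the replica spread
print(exceeds_floor(0.30, resume_replicas))  # well outside the replica spread
```

Under this assumption, a difference between re-fed and resumed estimates only counts as meaningful when it is larger than what repeated identical runs already produce on their own.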