Brief · PulseAugur

TOOL · arXiv cs.CV · 4d

MLLMs Know When Before Speaking: Revealing and Recovering Temporal Grounding via Attention Cues

Researchers have identified a gap between perception and generation in multimodal large language models (MLLMs) on video temporal grounding: the models localize when an event occurs during the prefill phase but lose that signal during answer generation. They trace the signal to specific attention heads, termed Temporal Grounding Heads (TG-Heads), whose attention during prefill concentrates on the correct time interval of the video. Building on this, they propose an inference-time framework that uses the TG-Heads to extract the relevant interval and then re-invokes the model with the visual context restricted to that interval, improving performance on video temporal grounding benchmarks without retraining.
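
The two-step idea (read an interval off attention, then restrict the visual context) can be illustrated with a small sketch. This is not the paper's procedure: the brief does not say how TG-Heads are selected or how their attention is turned into an interval. The sketch assumes that per-frame attention mass for a few already-chosen heads is available as plain lists, averages them, and grows an interval around the peak frame while scores stay above a relative threshold. The function names, the `rel_threshold` parameter, and the input layout are all hypothetical.

```python
from typing import List, Sequence, Tuple


def aggregate_heads(head_scores: Sequence[Sequence[float]]) -> List[float]:
    """Average per-frame attention mass over the selected heads.

    head_scores[h][i] is the (assumed) attention mass that head h places
    on frame i during prefill.
    """
    n_heads = len(head_scores)
    n_frames = len(head_scores[0])
    return [sum(h[i] for h in head_scores) / n_heads for i in range(n_frames)]


def interval_from_scores(
    scores: Sequence[float],
    timestamps: Sequence[float],
    rel_threshold: float = 0.5,  # hypothetical knob, not from the paper
) -> Tuple[float, float]:
    """Grow a contiguous interval around the peak frame.

    Neighbouring frames are included while their score is at least
    rel_threshold times the peak score.
    """
    peak = max(range(len(scores)), key=scores.__getitem__)
    cutoff = rel_threshold * scores[peak]
    lo = hi = peak
    while lo > 0 and scores[lo - 1] >= cutoff:
        lo -= 1
    while hi < len(scores) - 1 and scores[hi + 1] >= cutoff:
        hi += 1
    return timestamps[lo], timestamps[hi]


def frames_in_interval(
    timestamps: Sequence[float], interval: Tuple[float, float]
) -> List[int]:
    """Indices of the frames to keep when re-invoking the model."""
    start, end = interval
    return [i for i, t in enumerate(timestamps) if start <= t <= end]


# Toy input: two heads, five frames sampled every 2 seconds.
head_scores = [
    [0.05, 0.10, 0.40, 0.35, 0.10],
    [0.10, 0.10, 0.30, 0.45, 0.05],
]
timestamps = [0.0, 2.0, 4.0, 6.0, 8.0]

scores = aggregate_heads(head_scores)
interval = interval_from_scores(scores, timestamps)
kept = frames_in_interval(timestamps, interval)
print(interval, kept)
```

In a real pipeline the kept frame indices would determine which visual tokens are passed to the second model call; how that call is made depends on the model's own interface and is left out here.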

IMPACT Improves multimodal LLM accuracy on video temporal grounding tasks by addressing a key perception-generation gap without retraining.

MLLMs
Qwen3-VL-8B
TG-Heads
TimeLens-8B
MiMo-VL-7B