Distilling Counterfactual Reasoning from Language to Vision: Causal Graph Guided Post-Training for Video Understanding
Researchers have introduced CounterVQA, a new benchmark for evaluating the counterfactual reasoning capabilities of Vision Language Models (VLMs). Current state-of-the-art models reach reasonable accuracy on simpler questions but show a significant performance gap on questions involving complex causal chains. To address this, the researchers developed CFGPT, a post-training method that enhances visual counterfactual reasoning by distilling knowledge from the language modality.
AI IMPACT: Highlights a critical gap in VLM reasoning, potentially guiding future model development towards more robust causal understanding.