New VLM evaluation method reveals poor evidence use in large models

By PulseAugur Editorial · [2 sources] · 2026-06-23 09:20

A new research paper, "Ill-Posed by Design," introduces a method for evaluating how Vision-Language Models (VLMs) use evidence. The study proposes monocular metric object-size estimation as a deliberately ill-posed task, one that forces models to rely on imperfect cues such as category priors, appearance, and context. The researchers assembled a dataset called Metric VQA and tested 12 open-weight VLMs, finding that even the largest models performed worse than a text-only LLM on real-world scenes. The analysis found that target identity is crucial, while global scene geometry is largely ignored by current VLMs, even after LoRA fine-tuning.
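Why monocular metric size estimation is ill-posed can be illustrated with the standard pinhole camera model, under which an object's metric extent equals its pixel extent times its depth divided by the focal length in pixels. The sketch below is not taken from the paper; the function names and the numbers are hypothetical and serve only to show that one image measurement is compatible with many real-world sizes when depth is unknown.

```python
def metric_size(pixel_extent: float, depth_m: float, focal_px: float) -> float:
    """Pinhole model: metric extent = pixel extent * depth / focal length (in pixels)."""
    return pixel_extent * depth_m / focal_px


def consistent_sizes(pixel_extent: float, focal_px: float, depths_m: list[float]):
    """List the (depth, size) pairs that all explain the same pixel extent.

    Hypothetical helper: with a single image, depth is unobserved, so every
    pair returned here is an equally valid explanation of the measurement.
    """
    return [(d, metric_size(pixel_extent, d, focal_px)) for d in depths_m]


if __name__ == "__main__":
    # Assumed values: an object spanning 200 px, seen with a 1000 px focal length.
    for depth, size in consistent_sizes(200.0, 1000.0, [0.5, 2.0, 10.0]):
        print(f"depth {depth:>4} m -> size {size:.2f} m")
```

Because the image alone cannot separate size from depth, a model has to fall back on other evidence, such as what the object is and what surrounds it, which is the kind of cue reliance the study sets out to probe.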

IMPACT This research highlights limitations in current VLM reasoning and evidence utilization, suggesting a need for improved architectures and training strategies for complex scene understanding.

RANK_REASON Research paper detailing a new evaluation methodology for VLMs.

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

arXiv cs.CV TIER_1 English(EN) · Boaz Meivar, Shaked Perek, Shani Shvartzman, Eli Schwartz, Shai Avidan · 2026-06-24 04:00

Ill-Posed by Design: Probing Evidence Use in VLMs

arXiv:2606.24335v1. Abstract: Counterfactual analysis is widely used to study evidence use in vision-language models, but its diagnostic value is limited on well-posed tasks: when several cues independently support the same answer, removing one may not change th…

arXiv cs.CV TIER_1 English(EN) · Shai Avidan · 2026-06-23 09:20

Ill-Posed by Design: Probing Evidence Use in VLMs

Counterfactual analysis is widely used to study evidence use in vision-language models, but its diagnostic value is limited on well-posed tasks: when several cues independently support the same answer, removing one may not change the prediction. We propose monocular metric object…
