Brief · PulseAugur

RESEARCH · arXiv cs.CL English(EN) · 2w · [2 sources]

A Visually Impaired Assistance Benchmark for VLM-as-a-Judge Evaluation

Researchers have developed VIABLE, a new benchmark designed to evaluate how reliable Vision-Language Models (VLMs) are when used as judges for Visually Impaired Assistance (VIA) tasks. Their study, which tested seven VLM judges, found that current models are largely unreliable for this purpose: even the strongest performer, GPT-5.4, showed limited diagnostic accuracy. To address this, they propose VIA-Judge-Agent, a harness that augments judges with visual evidence extraction and a structured workflow, yielding better judging accuracy and responses that users prefer.

IMPACT Highlights the unreliability of current VLMs as judges for visually impaired assistance tasks, motivating new evaluation methods and tools.

GPT-5.4
VLM
VIA-Judge-Agent