A new study published on arXiv reveals significant grounding failures in multimodal large language models (MLLMs) when generating feedback on student science drawings. Researchers found that 41.3% of feedback instances from GPT-5.1 contained errors such as object mismatches or false absence, a phenomenon called modal decoupling in which the model's claims contradict the visual evidence. While an inventory-list-first workflow reduced some errors, a substantial portion of the feedback remained flawed, suggesting that current prompting strategies are insufficient for generating valid, diagnostically useful feedback.
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Highlights critical limitations of current MLLMs for educational feedback, indicating that new grounding mechanisms are needed before reliable deployment.
RANK_REASON Academic paper detailing limitations in MLLM feedback generation.