A new study published on arXiv reports significant grounding failures in multimodal large language models (MLLMs) that generate feedback on student science drawings. The researchers found that 41.3% of feedback instances from GPT-5.1 contained errors such as object mismatch or false absence, indicating a phenomenon called modal decoupling, in which the model's claims contradict the visual evidence. An inventory-list-first workflow reduced some errors, but a substantial portion of the feedback remained flawed, suggesting that current prompting strategies are insufficient for generating valid and diagnostically useful feedback.
AI IMPACT Highlights critical limitations of current MLLMs for educational feedback and the need for new grounding mechanisms before they can be applied reliably.
RANK_REASON Academic paper detailing limitations in MLLM feedback generation.
AI-generated summary · Google Gemini · from 1 source.