A new benchmark study evaluated five Vision-Language Models (VLMs) on their ability to recognize classroom engagement in zero-shot settings. The models, including GPT-4o and LLaVA-1.5-7B, performed poorly at recognizing the engagement of individual students, exhibiting chance-level performance and class collapse. Scene-level classification showed more promise, with CLIP and GPT-4o achieving moderate accuracy when prompted with specific rubrics. The study also highlighted practical deployment challenges, such as GPT-4o's safety filters rejecting a significant portion of requests involving student faces.
AI IMPACT Highlights critical limitations of current VLMs for educational applications, suggesting a need for improved robustness and careful prompt engineering.
RANK_REASON The cluster contains an academic paper detailing a benchmark study of existing models.
Read on Hugging Face Daily Papers →
AI-generated summary · Google Gemini · from 1 source.