PulseAugur

AI Tutor Benchmarks Are Mismatched with Real Student Behavior

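Two new research papers submitted to arXiv highlight a serious mismatch between how AI tutors are evaluated in benchmarks and how students actually interact with them in real educational settings. The first paper introduces metrics for "chatbot scaffolding" and "student uptake", showing that students often bypass pedagogical guidance to pursue their own learning goals. The second paper proposes a diagnostic for distinguishing LLM tutors that merely solve problems from those that genuinely teach, and finds that in current benchmarks, problem-solving ability does not always align with teaching effectiveness. Both studies suggest that future evaluations of AI tutors need to account for student agency and diverse learning contexts rather than assuming that scaffolding will be passively adopted.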

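Impact: Highlights the need for more realistic evaluation of AI education tools, so that they effectively support learning rather than merely solving problems.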

Ranking rationale: Two academic papers published on arXiv that discuss evaluation methods for AI tutors.

Read on arXiv cs.CL →

AI-generated summary · Google Gemini · from 3 sources.

Sources [3]

  1. arXiv cs.AI · TIER_1 · English (EN) · Alexandra Neagu, Jeffrey T. H. Wong, Marcus Messer, Rhodri Nelson, Peter B. Johnson

    Rethinking Scaffolding in LLM Tutors: The Interactional Mismatch Between Benchmarks and Real-World Deployments

    arXiv:2606.15766v1 Announce Type: new Abstract: A central pedagogical value evaluated in AI tutor benchmarks is scaffolding: guiding students through graduated steps toward a solution. Alignment and evaluation methods for embedding scaffolding behaviour into chatbots, however, re…

  2. arXiv cs.AI · TIER_1 · English (EN) · Junyi Yao, Zihao Zheng, Baichuan Li

    Measuring Whether LLM Tutors Teach or Solve: A Diagnostic for Educational Impact

    arXiv:2606.16206v1 Announce Type: new Abstract: Large language models are increasingly proposed as educational tutors, yet stronger task-solving ability does not necessarily imply stronger learning support. Motivated by recent calls to measure the social impact of NLP systems in …

  3. arXiv cs.CL · TIER_1 · English (EN) · Baichuan Li

    Measuring Whether LLM Tutors Teach or Solve: A Diagnostic for Educational Impact

    Large language models are increasingly proposed as educational tutors, yet stronger task-solving ability does not necessarily imply stronger learning support. Motivated by recent calls to measure the social impact of NLP systems in practice, we study whether public LLM tutoring b…