A new dataset called FOXGLOVE has been released, containing feedback on argumentative essays from both human experts and large language models. The dataset includes over 2,300 feedback comments, with LLMs generating more complex and longer feedback than human instructors. While both human and AI feedback align on general goals and essay positions, they differ in the specific sentences they target for improvement. Interestingly, human instructors rated LLM feedback higher on quality, though this was largely attributed to the LLMs' tendency to provide lengthier comments.
IMPACT Provides a benchmark for evaluating LLM writing assistance capabilities against human experts.
RANK_REASON The cluster contains an academic paper detailing a new dataset and research findings.
AI-generated summary · Google Gemini · from 2 sources.