A new benchmark for creative writing, focusing on short stories, has been released. The benchmark evaluates models through head-to-head comparisons of stories generated in response to specific creative prompts. Early results show Baidu's Ernie 5.1 performing best among the tested models, with Qwen 3.7 Max, Mistral Medium 3.5, and Grok 4.3 scoring significantly lower.
AI IMPACT This benchmark could drive improvements in AI's creative writing capabilities and highlight areas for future model development.
RANK_REASON The cluster describes a new benchmark for evaluating AI models on a specific creative task.
AI-generated summary · Google Gemini · from 1 source.