Short Story Creative Writing Benchmark. Baidu Ernie 5.1: -0.35, Qwen 3.7 Max: -2.01, Mistral Medium 3.5: -2.13, Grok 4.3: -3.81.
A new creative writing benchmark focused on short stories has been released. It evaluates models through head-to-head comparisons of stories generated in response to specific creative prompts. Early results show Baidu's Ernie 5.1 performing best among the tested models with a score of -0.35, while Qwen 3.7 Max (-2.01), Mistral Medium 3.5 (-2.13), and Grok 4.3 (-3.81) score lower.
AI IMPACT: This benchmark could drive improvements in AI's creative writing capabilities and highlight areas for future model development.