When Reasoning Supervision Hurts: TTCW-Based Long-Form Literary Review Generation
Researchers have developed a new dataset containing over 260,000 long-form stories, each annotated with creativity scores and review comments based on the Torrance Test of Creative Writing (TTCW). They fine-tuned Qwen3 models on this data to generate literary reviews, finding that models trained without explicit reasoning supervision performed better. The study suggests that for structured, rubric-based review generation, reasoning supervision may not be beneficial and can even lead to irrelevant or repetitive outputs. AI
IMPACT Introduces a novel dataset and methodology for AI-driven literary review generation, potentially improving automated evaluation of creative writing.