A new paper introduces "prompt-variant output-mode collapse," a failure mode where large language models fail to maintain their output format when a prompt is rephrased, even with temperature set to zero. Researchers developed the PARACONSIST benchmark with 900 prompts to measure this phenomenon across five LLMs. Their findings indicate that approximately 78% of responses from prompt variants deviate from the expected format, highlighting the need to track response-mode preservation as a key reliability metric.
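The deviation metric described above could be measured with a simple heuristic: classify each response's output mode, then count how many paraphrase-variant responses differ from the mode produced by the canonical prompt. This is a minimal sketch, not the paper's actual methodology; the `output_mode` classifier and its three mode labels are hypothetical assumptions.

```python
import json

def output_mode(text: str) -> str:
    """Crude, hypothetical classifier for a response's output mode."""
    stripped = text.strip()
    try:
        json.loads(stripped)  # valid JSON => treat as structured output
        return "json"
    except ValueError:
        pass
    # Lines starting with bullet/number markers => treat as a list
    if any(line.lstrip().startswith(("-", "*", "1.")) for line in stripped.splitlines()):
        return "list"
    return "prose"

def deviation_rate(canonical_output: str, variant_outputs: list[str]) -> float:
    """Fraction of paraphrase-variant responses whose mode differs
    from the mode of the canonical prompt's response."""
    expected = output_mode(canonical_output)
    deviations = sum(1 for o in variant_outputs if output_mode(o) != expected)
    return deviations / len(variant_outputs)

# Example: the canonical prompt yielded JSON, but two of three
# rephrased prompts drifted to prose or a bullet list.
rate = deviation_rate(
    '{"answer": 42}',
    ['{"answer": 42}', 'The answer is 42.', '- answer: 42'],
)
print(round(rate, 2))  # 0.67
```

A real benchmark would need a far more robust mode classifier (e.g. handling markdown tables, code fences, or mixed outputs), but the aggregate statistic would be computed the same way.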
Summary written by gemini-2.5-flash-lite from 2 sources.
IMPACT Highlights a critical LLM reliability issue that could impact evaluation pipelines and downstream applications.
RANK_REASON Academic paper detailing a new failure mode in LLMs and introducing a benchmark to measure it.