PulseAugur

LLMs fail to maintain output format when prompts are paraphrased

A new paper introduces "paraphrase-induced output-mode collapse," a failure mode in which large language models fail to keep the requested output format when a prompt is rephrased, even at temperature zero. The researchers developed the PARACONSIST benchmark of 900 prompts to measure the phenomenon across five LLMs. Their findings indicate that approximately 78% of responses to prompt variants deviate from the expected format, underscoring the need to track response-mode preservation as a key reliability metric.

Summary written by gemini-2.5-flash-lite from 2 sources.
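
As a rough illustration of the metric the summary points to, the sketch below checks whether responses to paraphrased prompts keep the format of the response to the original prompt. This is not the paper's PARACONSIST harness: the mode taxonomy (JSON, numbered list, bulleted list, prose), the heuristics, and the names classify_output_mode and mode_preservation_rate are assumptions made for this example.

    import json
    import re
    from typing import Iterable

    def classify_output_mode(text: str) -> str:
        # Heuristically label the surface format of a response.
        # These categories are illustrative assumptions, not the paper's taxonomy.
        stripped = text.strip()
        try:
            json.loads(stripped)
            return "json"
        except ValueError:
            pass
        lines = [ln.strip() for ln in stripped.splitlines() if ln.strip()]
        if lines and all(re.match(r"\d+[.)]\s", ln) for ln in lines):
            return "numbered_list"
        if lines and all(re.match(r"[-*•]\s", ln) for ln in lines):
            return "bulleted_list"
        return "prose"

    def mode_preservation_rate(base_response: str, variant_responses: Iterable[str]) -> float:
        # Fraction of paraphrase-variant responses that keep the format
        # of the response to the original prompt.
        base_mode = classify_output_mode(base_response)
        variants = list(variant_responses)
        if not variants:
            return 1.0
        kept = sum(classify_output_mode(v) == base_mode for v in variants)
        return kept / len(variants)

    if __name__ == "__main__":
        base = '{"answer": 42, "unit": "items"}'
        variants = [
            '{"answer": 42, "unit": "items"}',  # format preserved
            "The answer is 42 items.",          # collapsed into prose
            "1. 42 items",                      # collapsed into a numbered list
        ]
        print(f"preservation rate: {mode_preservation_rate(base, variants):.2f}")  # -> 0.33

Averaged over a benchmark's base prompts, one minus this preservation rate would yield a deviation figure of the kind cited above (roughly 78%), though the paper's own scoring protocol may differ.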

IMPACT Highlights a critical LLM reliability issue that could affect evaluation pipelines and downstream applications.

RANK_REASON Academic paper detailing a new failure mode in LLMs and introducing a benchmark to measure it.

Read on arXiv cs.CL →

COVERAGE [2]

  1. arXiv cs.CL TIER_1 · Aofan Liu, Jingxiang Meng ·

    Paraphrase-Induced Output-Mode Collapse: When LLMs Break Character Under Semantically Equivalent Inputs

    arXiv:2605.04665v1 Announce Type: new Abstract: When the substantive content of a request is rewritten, do large language models still answer in the format the original task asked for? We find that they often do not, even at temperature zero. On a 150-query evaluation over five c…

  2. arXiv cs.CL TIER_1 · Jingxiang Meng ·

    Paraphrase-Induced Output-Mode Collapse: When LLMs Break Character Under Semantically Equivalent Inputs

    When the substantive content of a request is rewritten, do large language models still answer in the format the original task asked for? We find that they often do not, even at temperature zero. On a 150-query evaluation over five compact 2025-era LLMs and four task types, we obs…