PulseAugur
EN
LIVE 21:01:54

Anthropic's Claude Fable 5 hindered by excessive refusals in agentic tests

An independent evaluator spent over $11,000 testing Anthropic's Claude Fable 5 model, expecting it to outperform GPT-5.5. However, the model exhibited a high rate of refusals, leading to timeouts and failures on 13 specific tasks within the WolfBench benchmark. This excessive refusal behavior, while intended for safety, hindered the model's performance in agentic workflows, causing it to burn tokens and fail tasks that other models like Claude Opus and GPT-5.5 could solve. AI

IMPACT Excessive safety refusals in LLM agents can lead to token waste and task failure, hindering practical application despite strong underlying capabilities.

RANK_REASON Independent evaluation of a specific model's performance on a benchmark, detailing its strengths and weaknesses. [lever_c_demoted from research: ic=1 ai=1.0]

Read on r/ClaudeAI →

AI-generated summary · Google Gemini · from 1 sources. How we write summaries →

COVERAGE [1]

  1. r/ClaudeAI TIER_2 English(EN) · /u/WolframRavenwolf ·

    Spent $11k evaluating Fable: capability looked SOTA, refusals killed it (before Anthropic did)

    <!-- SC_OFF --><div class="md"><p>Before its suspension, I spent $11,081.12 evaluating Claude Fable 5 on WolfBench, an agentic benchmark based on Terminal-Bench 2.0. It was by far my most expensive benchmark run ever, and I fully expected Fable to become the new top model and det…