Brief · PulseAugur

TOOL · r/ClaudeAI English(EN) · 6h

Spent $11k evaluating Fable: capability looked SOTA, refusals killed it (before Anthropic did)

An independent evaluator spent more than $11,000 testing Anthropic's Claude Fable 5, expecting it to outperform GPT-5.5. Instead, the model's high refusal rate led to timeouts and failures on 13 tasks in the WolfBench benchmark. The refusals, though intended as a safety measure, hurt the model's performance in agentic workflows: it burned tokens and failed tasks that other models, such as Claude Opus and GPT-5.5, could solve.

IMPACT Excessive safety refusals in LLM agents can waste tokens and cause task failures, limiting practical use despite strong underlying capabilities.

Anthropic
GPT-5.5
Claude Opus 4.7
Claude Opus 4.6
Claude Fable 5
WolfBench