Developer shares $4,200 lesson on Promptfoo's limits in LLM evaluation

By PulseAugur Editorial · [1 source] · 2026-05-26 18:12

A developer recounts a costly mistake: treating Promptfoo as a comprehensive evaluation framework led to a $4,200 bill and production bugs. In the author's account, Promptfoo behaves as a regression test runner rather than a true evaluator, because its automated judge reached a Cohen's kappa of only 0.47 against human labels. The fix was to keep Promptfoo for CI gating and to build a separate pipeline that validates the judge against human-scored production traces, which raised kappa to 0.68.
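
For readers unfamiliar with the metric, Cohen's kappa measures how often two raters agree beyond what chance alone would produce. The following is a minimal sketch of that calculation, not the author's pipeline: the function name, the `MIN_KAPPA` cutoff, and the toy labels are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical cutoff for trusting the judge; not a value from the article.
MIN_KAPPA = 0.6


def cohens_kappa(judge_labels, human_labels):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge_labels)

    # Observed agreement: share of items where both raters gave the same label.
    p_o = sum(j == h for j, h in zip(judge_labels, human_labels)) / n

    # Chance agreement: for each label, multiply the two raters' marginal
    # rates, then sum over labels.
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    p_e = sum(judge_counts[lab] * human_counts[lab] for lab in judge_counts) / (n * n)

    if p_e == 1.0:
        # Both raters used a single identical label for every item.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Toy data (invented): an automated judge vs. human reviewers on 10 traces.
judge = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
human = ["pass", "fail", "fail", "pass", "pass", "pass", "pass", "fail", "fail", "pass"]

kappa = cohens_kappa(judge, human)
judge_is_trusted = kappa >= MIN_KAPPA
print(f"kappa={kappa:.2f} trusted={judge_is_trusted}")
```

On this toy data the raters agree on 7 of 10 items, yet kappa comes out at roughly 0.35 because much of that agreement is expected by chance. That gap between raw agreement and kappa is why a high regression pass rate alone says little about whether an automated judge matches human judgment.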

IMPACT Highlights the critical need for robust evaluation beyond simple regression testing in LLM development to avoid costly production issues.

RANK_REASON Developer's personal account of a mistake and its resolution, not a new product release or industry-wide benchmark.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

dev.to — LLM tag TIER_1 English(EN) · Ethan Walker · 2026-05-26 18:12

Promptfoo is a CI gate, not an eval framework. Treating it like one cost us $4,200

Last Monday I logged into our billing dashboard and saw a $4,200 LangSmith spike from the weekend. Our auto-eval pipeline had been running overnight against a fresh prompt change. The Promptfoo regression suite passed 91% of its 300 questions. The release went out Monday at 9a…
