A cheap specialist judge gets used by agents but fails to reduce alignment audit costs
A researcher explored using a lightweight, specialized judge model (Gemma 2-2B) to help AI agents identify misalignment during audits. Although the agents used the judge consistently, it helped only in specific scenarios where its training data directly matched the misalignment type and the primary auditor (Sonnet) was already struggling. The experiment did not reduce overall evaluation costs, because the primary driver model accounted for the vast majority of expenses, and mandating tool use even increased costs.
AI IMPACT: Specialized, low-cost AI judges may offer limited benefits in reducing audit costs and improving misalignment detection, suggesting current approaches need further refinement.