JADE: Expert-Grounded Dynamic Evaluation for Open-Ended Professional Tasks
Researchers have introduced JADE, a two-layer evaluation framework for assessing AI agents on open-ended professional tasks. The first layer encodes expert knowledge into evaluation skills that provide stable criteria. The second layer performs dynamic, claim-level assessment with evidence-dependency gating. In experiments on BizBench, JADE improved evaluation stability and identified critical agent failures that standard LLM-based evaluators missed. The experiments also showed alignment with expert rubrics and effective transfer to other domains such as HealthBench.
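As a rough illustration of what claim-level assessment with evidence-dependency gating might look like, the sketch below counts a claim's score only when every piece of evidence it depends on has been verified. This is an assumption about the general idea, not JADE's actual implementation: the `Claim` class, the `quality` field, and the `gated_score` function are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    """One claim extracted from an agent's response (hypothetical structure)."""
    text: str
    quality: float  # assumed judge score in [0, 1]
    depends_on: list[str] = field(default_factory=list)  # ids of required evidence


def gated_score(claims: list[Claim], verified_evidence: set[str]) -> float:
    """Average claim quality, where unsupported claims contribute zero.

    A claim passes the gate only if all evidence ids it depends on
    appear in `verified_evidence`.
    """
    if not claims:
        return 0.0
    total = 0.0
    for claim in claims:
        supported = all(e in verified_evidence for e in claim.depends_on)
        total += claim.quality if supported else 0.0
    return total / len(claims)


claims = [
    Claim("Revenue grew 12% year over year", quality=0.8, depends_on=["e1"]),
    Claim("Growth was driven by the new product line", quality=1.0, depends_on=["e2"]),
]
# Only e1 was verified, so the second claim is gated out.
score = gated_score(claims, verified_evidence={"e1"})
```

Under these assumptions, a fluent but unsupported claim cannot raise the overall score, which is one plausible way a gated evaluator could surface failures that a holistic LLM judge might overlook.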
IMPACT: JADE offers a more robust method for evaluating AI agents, which could lead to more reliable and trustworthy AI systems in professional applications.