PulseAugur

New DeFAb benchmark reveals foundation models struggle with defeasible abduction

Researchers have introduced DeFAb, a benchmark for evaluating defeasible abduction in foundation models. It converts extensive knowledge bases into formally grounded instances in which a model must construct a hypothesis that explains an anomaly by overriding a default while preserving other expectations. Unlike previous evaluations, DeFAb enforces logical rigor by requiring that each hypothesis be correct, conservative, and minimal. Frontier models tested on DeFAb showed significant limitations, with accuracy falling as low as 7.8% on certain levels, which points to difficulty with complex theoretical reasoning and theory revision.
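To make the three criteria concrete, here is a minimal Python sketch of defeasible abduction on a toy theory. It is an illustration only and is not DeFAb's format, solver, or scoring code: the facts, defaults, candidate exceptions, and all function names (`conclusions`, `is_valid`, `minimal_explanations`) are assumptions invented for this example. A hypothesis counts as correct if it overrides the default behind the anomaly, conservative if every other default conclusion still holds, and minimal if no strict subset of it is also valid.

```python
from itertools import combinations

# Toy defeasible theory. Every name here is hypothetical, not taken from DeFAb.
FACTS = {"bird"}

# Each default is (name, premise, conclusion): "birds normally fly", etc.
DEFAULTS = [
    ("d_fly", "bird", "flies"),
    ("d_feathers", "bird", "has_feathers"),
]

# Each candidate exception blocks (overrides) a set of defaults by name.
BLOCKS = {
    "penguin": {"d_fly"},
    "injured_wing": {"d_fly"},
    "robot": {"d_fly", "d_feathers"},
}


def conclusions(facts, hypothesis):
    """Return conclusions of all defaults that fire and are not blocked."""
    blocked = set().union(*(BLOCKS[h] for h in hypothesis))
    return {
        concl
        for name, premise, concl in DEFAULTS
        if premise in facts and name not in blocked
    }


def is_valid(hypothesis, anomaly, preserved):
    """Check correctness and conservativity of one hypothesis."""
    derived = conclusions(FACTS, hypothesis)
    correct = anomaly not in derived       # the violated default is overridden
    conservative = preserved <= derived    # other expectations still hold
    return correct and conservative


def minimal_explanations(anomaly):
    """Enumerate valid hypotheses and keep only the subset-minimal ones."""
    baseline = conclusions(FACTS, frozenset())
    preserved = baseline - {anomaly}
    valid = [
        frozenset(combo)
        for size in range(1, len(BLOCKS) + 1)
        for combo in combinations(sorted(BLOCKS), size)
        if is_valid(frozenset(combo), anomaly, preserved)
    ]
    return [h for h in valid if not any(other < h for other in valid)]


if __name__ == "__main__":
    # Observation: this bird does not fly, contradicting the default.
    for hypothesis in minimal_explanations("flies"):
        print(sorted(hypothesis))
```

In this toy setup `robot` explains the anomaly but also removes the feathers expectation, so it fails the conservativity check, and the combined hypothesis `penguin` plus `injured_wing` is valid but not minimal. Brute-force enumeration is only workable at this scale; it is used here for clarity, not as a claim about how the benchmark's instances are generated or solved.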

IMPACT Highlights a critical gap in current foundation models' ability to perform complex theoretical reasoning, potentially guiding future research and development.

RANK_REASON The cluster describes a new benchmark and dataset for evaluating AI models, which falls under research.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. arXiv cs.AI TIER_1 English(EN) · Patrick Cooper, Alvaro Velasquez

    DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models

    arXiv:2606.18557v1 Announce Type: new Abstract: A rule-based logic solver resolves every instance in our benchmark in under 50 microseconds with 100% accuracy; the best frontier language model reaches 65% at best and drops to 23.5% under rendering-robust evaluation (worst case ov…