Researchers have identified significant inaccuracies in popular Natural Language to First-Order Logic (NL-to-FOL) datasets: approximately 39% of the formalizations in FOLIO and 36% of those in MALLS are incorrect. These errors distort model evaluations, with measured accuracy rising by up to 22 percentage points when models such as Gemma 4, Qwen3, and GPT-4o-mini are scored against corrected ground truths. To address this, the authors propose an LLM-assisted framework that can reach 90% dataset accuracy while reviewing fewer than 24% of instances, a substantial improvement over unguided review.
AI IMPACT Improves the reliability of benchmarks for neurosymbolic AI and natural language inference (NLI), which supports more accurate model evaluation and development.
RANK_REASON The cluster contains an academic paper detailing a new methodology and findings related to dataset verification and LLM-assisted annotation.
AI-generated summary · Google Gemini · from 1 source.