PulseAugur

EpiBench benchmark reveals AI agents struggle with epigenomics analysis

A new benchmark called EpiBench has been developed to evaluate AI agents on short-horizon epigenomics analysis tasks. The benchmark, which includes 106 evaluations across various genomic assay workflows, found that no AI system passed a majority of attempts. GPT-5.5 / Pi performed best, passing 45.0% of tasks, followed closely by GPT-5.5 / OpenAI Codex and Claude Opus 4.8 Max / Pi. While agents could often identify correct files and compute intermediate results, they struggled with tasks requiring deep, assay-specific scientific judgment.

IMPACT Highlights current limitations of AI agents in complex scientific domains, indicating a need for improved reasoning and domain-specific judgment.



AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

  1. arXiv cs.AI TIER_1 English(EN) · Harihara Muralidharan, Reema Baskar, Soo Hee Lee, Tim Proctor, Kenny Workman

    EpiBench: Verifiable Evaluation of AI Agents on Epigenomics Analysis

    arXiv:2606.13602v1. Abstract: We introduce EpiBench, a verifiable benchmark for short-horizon epigenomics analysis. EpiBench evaluates whether agents can make well-defined analysis decisions from realistic workflow states and return deterministically gradable an…

  2. arXiv cs.AI TIER_1 English(EN) · Kenny Workman

    EpiBench: Verifiable Evaluation of AI Agents on Epigenomics Analysis

    We introduce EpiBench, a verifiable benchmark for short-horizon epigenomics analysis. EpiBench evaluates whether agents can make well-defined analysis decisions from realistic workflow states and return deterministically gradable answers. The benchmark includes 106 evaluations ac…