EpiBench: Verifiable Evaluation of AI Agents on Epigenomics Analysis
EpiBench is a new benchmark that evaluates AI agents on short-horizon epigenomics analysis tasks. It comprises 106 evaluations spanning a range of genomic assay workflows, and no AI system passed a majority of attempts. GPT-5.5 / Pi performed best, passing 45.0% of tasks, followed closely by GPT-5.5 / OpenAI Codex and Claude Opus 4.8 Max / Pi. Agents could often identify the correct files and compute intermediate results, but they struggled with tasks requiring deep, assay-specific scientific judgment.
IMPACT: Highlights current limitations of AI agents in complex scientific domains, indicating a need for improved reasoning and domain-specific judgment.