Brief · PulseAugur

RESEARCH · arXiv cs.AI · 15h · [2 sources]

EpiBench: Verifiable Evaluation of AI Agents on Epigenomics Analysis

A new benchmark, EpiBench, evaluates AI agents on short-horizon epigenomics analysis tasks. It comprises 106 evaluations across various genomic assay workflows, and no AI system tested passed a majority of attempts. GPT-5.5 / Pi performed best, passing 45.0% of tasks, followed closely by GPT-5.5 / OpenAI Codex and Claude Opus 4.8 Max / Pi. Agents could often identify the correct files and compute intermediate results, but they struggled with tasks requiring deep, assay-specific scientific judgment.

IMPACT: Highlights current limitations of AI agents in complex scientific domains, indicating a need for improved reasoning and domain-specific judgment.

GPT-5.5
GPT-5.4
Pi
OpenAI Codex
Claude Opus 4.8 Max
EpiBench