Verifiable Benchmarking of Long-Horizon Spatial Biology
A new benchmark, SpatialBench-Long, has been developed to evaluate AI agents' capabilities in long-horizon scientific reasoning within spatial biology. This benchmark assesses agents' ability to derive biological conclusions from complex, raw data across various experimental modalities and biological systems. Initial results show that current leading models like Gemini 3.5 Flash and GPT-5.5, when paired with specific coding harnesses, achieve a modest success rate of 11.1% on the benchmark.
AI IMPACT This benchmark will drive the development of AI agents capable of complex scientific discovery in biology.