METR has released preliminary findings on OpenAI's o1-mini and o1-preview models, evaluating their autonomous capabilities and AI R&D potential. In initial tests without model-specific scaffolding, the models performed below Claude 3.5 Sonnet on general autonomy tasks, yet demonstrated strong reasoning and planning. When integrated into tailored agent frameworks, their performance became comparable to Claude 3.5 Sonnet, and they showed progress on AI R&D tasks, suggesting the limited evaluation period may not have captured their full capabilities.
Summary written by gemini-2.5-flash-lite from 1 source.