Anthropic's Claude Mythos Preview model has demonstrated capabilities that push the limits of current evaluation methodologies, according to METR. The model reached a 50% time horizon of over 16 hours and an 80% time horizon of 3 hours, surpassing previous results. In METR's methodology, a time horizon is the length of task, measured by how long it takes a human expert, that the model completes at the stated success rate; it is not the time the model itself takes. The result highlights the rapid progress in AI capabilities and raises questions about whether existing assessment tools can still measure it.
IMPACT Demonstrates AI models are outpacing current evaluation benchmarks, signaling a need for new assessment tools.
AI-generated summary · Google Gemini · from 3 sources.