Claude Mythos Preview Exceeds Evaluation Limits, Showing Rapid AI Progress

By the PulseAugur editorial team · [3 sources] · 2026-05-09 05:50

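According to METR, Anthropic's Claude Mythos Preview model has demonstrated capabilities that push past the limits of current evaluation methodology. On METR's Task-Completion Time Horizon evaluation, the model recorded a time horizon of over 16 hours at the 50% success threshold and about 3 hours at the 80% threshold, surpassing previous results. The result underscores how quickly AI capabilities are advancing and raises questions about whether existing evaluation tools are adequate.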
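To make the two figures easier to read, here is a minimal sketch of how a time horizon at a given success threshold can be derived. It assumes a logistic curve relating success probability to the logarithm of task length. That modeling choice and the coefficients below are illustrative assumptions, not details stated in the sources collected here.

```python
import math


def time_horizon(intercept: float, slope: float, p: float) -> float:
    """Task length (hours) at which the fitted success probability equals p.

    Assumes success is modeled as
        P(success) = sigmoid(intercept + slope * log2(task_hours)),
    with slope < 0 so that longer tasks are less likely to succeed.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if slope >= 0:
        raise ValueError("slope must be negative")
    logit_p = math.log(p / (1.0 - p))
    # Solve intercept + slope * log2(t) = logit(p) for t.
    return 2.0 ** ((logit_p - intercept) / slope)


# Hypothetical coefficients chosen for illustration only;
# they are not fitted to any real evaluation data.
INTERCEPT, SLOPE = 1.0, -0.5
h50 = time_horizon(INTERCEPT, SLOPE, 0.5)
h80 = time_horizon(INTERCEPT, SLOPE, 0.8)
print(f"50% horizon: {h50:.2f} h, 80% horizon: {h80:.2f} h")
```

Under this kind of model the 80% horizon is always shorter than the 50% horizon, since demanding a higher success rate restricts the model to shorter tasks. That is consistent with the reported pair of roughly 3 hours versus more than 16 hours.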

Impact: Indicates that AI models are outpacing current evaluation benchmarks, signaling a need for new evaluation tools.

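Ranking rationale: This cluster reports a new benchmark evaluation of an AI model that pushed past the limits of existing evaluation methods.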

AI-generated summary · Google Gemini · from 3 sources.

Sources [3]

Mastodon (fosstodon.org) · TIER_1 · Korean (KO) · 2026-05-09 05:50

AI Leaks and News (@AILeaksAndNews): METR has released its Task-Completion Time Horizon evaluation of Claude Mythos Preview. The post explains that the model surpassed previous evaluations, recording over 16 hours at the 50% threshold and about 3 hours at the 80% threshold, and interprets the result as a sign of rapid progress in AI capabilities. https://x.com/AILeaksAndNews/status/2052901460375949510 #metr #claude #benchmark #evalu…

Mastodon (fosstodon.org) · TIER_1 · Korean (KO) · 2026-05-09 05:50

Shain Noor (@shaincodes) mentions the IdeaBlock concept, judging that changing the unit being embedded is a smarter approach than simply retrieving and then tuning. The post also describes a real-world case in which the AI CFO Silvia must remember the financial history of thousands of users across sessions, illustrating the importance of AI applications that need long-term memory and personalization. https://x.com/shaincodes/status/2052822155558347242 #ai #memory #personalization #ra…

Mastodon (mastodon.social) · TIER_1 · Polish (PL) · aisight · 2026-05-11 10:09

The latest Claude Mythos Preview model has reached the limits of METR's research methodology, demonstrating capabilities beyond current measurement standards. Evaluation experts admit that they lack the tools to assess such powerful systems, and at the same time Palo Alto …

Links: aisight.pl/…/koniec-skali-dla-benchmarkow… · aisight.pl/…/Awarie-i-cyberataki-tydzien-…
