Terminal-Bench 2.0
PulseAugur coverage of Terminal-Bench 2.0 — every cluster mentioning Terminal-Bench 2.0 across labs, papers, and developer communities, ranked by signal.
2 days with sentiment data
-
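Local LLMs still struggle with real-world terminal tasks despite benchmark success
Local large language models often perform poorly on multi-step terminal tasks even though they score well on standard benchmarks such as MMLU. The gap arises because traditional benchmarks measure single-turn reasoning and do not account for an agentic model's need to select tools, parse messy output, maintain state, and recover from errors. To address this, new agentic benchmarks such as Terminal-Bench 2.0 are emerging; they evaluate models in sandboxed environments by assessing task completion rather than intermediate reasoning alone.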
-
Llama.cpp adds MTP, new Gemma-4 finetune released, Qwen 3.6 excels locally
The llama.cpp project has integrated Multi-Token Prediction (MTP), leading to an 11.5% speed increase for 27B Qwen models in local inference. A new finetuned Gemma-4 model, optimized for creative writing and a…
-
Qwen 3.6-Plus excels in complex AI agent tasks and coding
Alibaba's Qwen 3.6-Plus model has demonstrated advanced capabilities in complex decision-making and agentic coding tasks, according to a recent evaluation. The model successfully generated a detailed implementation plan…
-
Poolside AI releases open-weight Laguna XS.2 and M.1 coding models
Poolside AI has released two new agentic coding models, Laguna M.1 and Laguna XS.2, along with their agent training and operation runtime. Laguna M.1 is a large Mixture of Experts (MoE) model trained on 30T tokens using…
-
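Anthropic's 'Mythos' AI deemed too dangerous for public release
Anthropic has developed a new AI model called Claude Mythos, which shows significant gains in benchmark performance, particularly in identifying software vulnerabilities. Because of its advanced ability to find and exploit security flaws, Anthropic has chosen not to release Mythos publicly. Instead, the company is providing limited access to select organizations through "Project Glasswing" to assist with cybersecurity research and vulnerability discovery, and is strongly supporting open-source security initiatives.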
-
Google DeepMind launches Gemini 3 Pro with advanced coding and agentic capabilities
Google DeepMind has launched Gemini 3 Pro, its latest and most intelligent model, which demonstrates significant improvements in reasoning and coding capabilities. This new model surpasses previous versions and excels…