New "WOOFER" Metric Reveals Surprising LLM Performance Scores

By the PulseAugur editorial team · [1 source] · 2026-06-29 09:06

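A user has developed a "WOOFER" metric that uses a "Probe_prompt" to evaluate the performance of large language models (LLMs). The metric produced some surprising results. Several models scored unexpectedly low: #bigpickle scored 25, while a group of small models (LFM2, Gemma3 (2B), Llama32, and Quen25) reached a consensus score of only 18. Notably, #Claude #Opus rated its own response as genius, while the newer #Nvidia model nematron performed well by WOOFER score.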

Impact: Introduces a novel but informal way to evaluate LLM capabilities, which may influence how users assess and compare models.

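Ranking rationale: This item describes one user's personal development and testing of a new LLM evaluation metric, not a formal release or research paper.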


AI-generated summary · Google Gemini · based on 1 source.

Sources [1]

Mastodon (fosstodon.org) · TIER_1 · English (EN) · [email protected] · 2026-06-29 09:06

Before I shut down for the night... I now have an effective WOOFER, My unit or arbitrary LLM performance based on a Probe_prompt. The probe_prompt can be generic or run vs the mission.md for a realistic assessment of a engine efficacy. (There is a column called WOOFER with values…

