New 'WOOFER' Metric Reveals Surprising LLM Performance Scores

By PulseAugur Editorial · [1 source] · 2026-06-29 09:06

A user has developed a "WOOFER" metric that scores Large Language Model (LLM) performance against a "Probe_prompt." The metric has produced some surprising results: bigpickle scored 25, and a consensus panel of smaller models (LFM2, Gemma3 (2B), Llama32, and Qwen25) scored only 18. Notably, Claude Opus assessed its own response as genius, while the newer Nvidia model, Nemotron, performed well according to the WOOFER score.

IMPACT Introduces a novel, albeit informal, method for evaluating LLM capabilities, potentially influencing how users assess and compare models.

RANK_REASON The item describes a user's personal development and testing of a new evaluation metric for LLMs, rather than a formal release or research paper.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

Mastodon · fosstodon.org · TIER_1 · English (EN) · 2026-06-29 09:06

Before I shut down for the night... I now have an effective WOOFER, my unit of arbitrary LLM performance based on a Probe_prompt. The probe_prompt can be generic or run vs the mission.md for a realistic assessment of an engine's efficacy. (There is a column called WOOFER with values…