Geometric Metrics and LLMs: What They Measure and When They Work
Researchers stress-tested geometric metrics used to evaluate Large Language Models (LLMs). Their analysis found that some metrics, such as Schatten Norm and MOM, primarily reflect output length rather than genuine quality. For generator identification, geometric metrics offer only a modest improvement over text statistics alone, and they show only a weak association with lexical diversity. The study recommends specific use cases and identifies failure detection as a promising application for these metrics.
IMPACT: Identifies limitations of current LLM evaluation methods and suggests new applications for geometric metrics in failure detection.
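
To make the length confound concrete, here is a minimal sketch of how a norm-based metric can track output length alone. It is an illustration under stated assumptions, not the study's method. The summary does not say which Schatten order the researchers used or how their embeddings were produced, so the sketch uses the Schatten 2-norm (which equals the Frobenius norm and needs no SVD) and random unit-length rows as a hypothetical stand-in for per-token embeddings. The helper names `schatten2_norm`, `random_unit_rows`, and `pearson` are invented for this example.

```python
import math
import random


def schatten2_norm(matrix):
    """Schatten 2-norm, equal to the Frobenius norm: sqrt of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


def random_unit_rows(n_tokens, dim, rng):
    """Hypothetical stand-in for per-token embeddings: random rows scaled to unit length."""
    rows = []
    for _ in range(n_tokens):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        scale = math.sqrt(sum(x * x for x in v))
        rows.append([x / scale for x in v])
    return rows


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)           # fixed seed so the run is repeatable
lengths = [8, 16, 32, 64, 128]   # assumed output lengths in tokens
norms = [schatten2_norm(random_unit_rows(n, 16, rng)) for n in lengths]

# The rows carry no quality signal at all, yet the metric rises with length.
length_corr = pearson(lengths, norms)
print(list(zip(lengths, norms)), length_corr)
```

Because every row has unit length, the norm here works out to the square root of the token count whatever the rows contain, so two outputs of different lengths get different scores for no content-related reason. A practical check along these lines is to compare a metric against output length before reading it as a quality signal.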