JudgeBench: A Benchmark for Evaluating LLM-based Judges

PulseAugur coverage of JudgeBench: A Benchmark for Evaluating LLM-based Judges. Every cluster mentioning the benchmark across labs, papers, and developer communities, ranked by signal.

Total: 3 over 90d
Releases: 0 over 90d
Papers: 3 over 90d

RECENT · PAGE 1/1 · 3 TOTAL

RESEARCH · CL_99671 · Jun 17 · 19:37

LLM-as-a-Judge models show significant reliability and bias issues, study finds

A new study evaluating LLM-as-a-Judge models reveals significant issues with their reliability and validity. The research, which analyzed 21 judges across multiple benchmarks and over 541,000 judgments, found that commo…

TOOL · CL_77334 · Jun 8 · 04:00

AdaJudge framework improves LLM reward modeling with adaptive pooling

Researchers have introduced AdaJudge, a novel framework designed to enhance the accuracy of reward modeling in large language models. This approach tackles limitations in current static pooling strategies by adapting bo…

RESEARCH · CL_36948 · May 13 · 15:48

RTLC prompting boosts LLM judge accuracy by 14 percentage points

Researchers have developed a new three-stage prompting technique called RTLC (Research, Teach-to-Learn, Critique) that significantly improves the accuracy of large language models when used as judges. This method, inspi…