LLM judges

PulseAugur coverage of LLM judges — every cluster mentioning LLM judges across labs, papers, and developer communities, ranked by signal.

Total · 30d

6 over 90d

Releases · 30d

0 over 90d

Papers · 30d

6 over 90d

RECENT · PAGE 1/1 · 6 TOTAL

TOOL · CL_62713 · Jun 1 · 04:00

New framework audits LLM judge rubrics for reliability and robustness

Researchers have developed PReMISE, a framework designed to evaluate the effectiveness of rubrics used by Large Language Model (LLM) judges. The framework treats rubrics as measurement specifications, analyzing their st…

TOOL · CL_53666 · May 27 · 04:00

New BITE framework exploits LLM judge biases to inflate scores

Researchers have developed a novel black-box adversarial framework called BITE that exploits stylistic biases in LLM judges to artificially inflate their scores. By framing the selection of stylistic edits as a contextu…

TOOL · CL_51221 · May 26 · 04:00

LLM judges show rationalization bias, new framework reveals

Researchers have developed a causal framework to analyze rationalization bias in large language models (LLMs) when they act as judges for text evaluation. The study introduces new metrics and cue interventions to test i…

TOOL · CL_51073 · May 26 · 04:00

New framework tackles preference cycles in AI feedback

Researchers have developed a new framework called Topological Consensus Rewards (TCR) to improve the stability of Reinforcement Learning from AI Feedback (RLAIF). This method addresses the issue of preference cycles, wh…

TOOL · CL_40852 · May 18 · 23:55

New benchmark reveals LLM judges unreliable for research agents

Researchers have developed a new benchmark called REFLECT to evaluate the reliability of Large Language Models (LLMs) when used as judges for deep research agents. These agents automate complex information-seeking tasks…

TOOL · CL_21933 · May 8 · 04:00

LLM judges evaluate agentic stock predictors, improving accuracy via reinforcement learning

Researchers have developed a novel framework for evaluating agentic stock prediction systems that uses large language models as judges. This system breaks down performance into six specific dimensions, including regi…
