ENTITY LLM-as-a-Judge

LLM-as-a-Judge

PulseAugur coverage of LLM-as-a-Judge — every cluster mentioning LLM-as-a-Judge across labs, papers, and developer communities, ranked by signal.

Total · 30d

9 over 90d

Releases · 30d

0 over 90d

Papers · 30d

9 over 90d


RELATIONSHIPS

instance of arXiv 90%

TIMELINE

2026-05-13 research_milestone A paper was published detailing the limitations of AI evaluation tools in assessing creativity for literary translations.

SENTIMENT · 30D

4 day(s) with sentiment data

RECENT · PAGE 1/1 · 10 TOTAL

TOOL · CL_30770 · May 13 · 14:30

AI evaluation tools fail to recognize creativity in literary translations

A new research paper reveals that current automatic evaluation metrics and LLM-as-a-judge systems struggle to accurately assess creativity in literary translations. These tools exhibit a bias favoring machine-translated…
TOOL · CL_29410 · May 12 · 17:16

AI predicts human rater disagreement in LLM-generated difficulty scores

Researchers have developed a new method to predict when AI-generated difficulty ratings for educational materials might disagree with human assessments. This approach uses a separate embedding space, like ModernBERT, to…
TOOL · CL_27695 · May 11 · 16:30

New routing method optimizes LLM judges for cost and accuracy

A new research paper introduces a method called RACER (Robust Adaptive Cost-Efficient Routing) to optimize the use of large language models (LLMs) as judges. The study found that while explicit reasoning in LLMs signifi…
TOOL · CL_25635 · May 8 · 09:28

New framework efficiently selects data for multimodal models

Researchers have developed a new framework called One-Step-Train (OST) to efficiently select high-quality synthetic data for training large multimodal models (LMMs). OST reframes data selection as an incremental optimiz…
TOOL · CL_22500 · May 8 · 04:00

AI researchers introduce Joint Consistency for improved test-time reasoning aggregation

Researchers have introduced Joint Consistency (JC), a novel framework for test-time aggregation that improves reasoning trace aggregation by considering comparative interactions between candidate answers. Unlike previou…
RESEARCH · CL_21818 · May 7 · 12:30

Pest-Thinker uses RL to help MLLMs reason like entomologists

Researchers have developed Pest-Thinker, a novel reinforcement learning framework designed to enhance the reasoning capabilities of multimodal large language models (MLLMs) for agricultural pest identification. This sys…
RESEARCH · CL_10999 · Apr 30 · 20:09

Amazon Nova models use LLM-as-a-judge for reinforcement fine-tuning

Amazon's AWS ML blog details Reinforcement Learning from AI Feedback (RLAIF), a method for fine-tuning large language models. This technique uses an LLM as a judge to provide feedback, guiding the model's learning proce…
RESEARCH · CL_10085 · Apr 30 · 04:00

LLM-as-a-Judge in Healthcare Faces Safety and Bias Concerns

A scoping review of Large Language Model-as-a-Judge (LaaJ) applications in healthcare identified significant gaps in validation rigor and safety assessments. The review, which screened over 11,000 studies, found that wh…
COMMENTARY · CL_04666 · Apr 20 · 00:00

Eugene Yan: LLM-as-judge won't fix AI product evals; focus on process

Eugene Yan argues that relying solely on tools like LLM-as-judge will not fix product evaluation issues. Instead, he emphasizes that a robust evaluation process, akin to the scientific method, is crucial for improving A…
RESEARCH · CL_00195 · Mar 21 · 21:34

AI code review bots show limits in automated evaluation, GitHub COO discusses ambient AI

A new paper explores the limitations of automated evaluation for AI code review bots, finding that current automated methods like G-Eval and LLM-as-a-Judge show only moderate alignment with human developer labels. The s…
