PulseAugur

RTLC prompting boosts LLM judge accuracy by 14 points

Researchers have developed a three-stage prompting technique called RTLC (Research, Teach-to-Learn, Critique) that significantly improves the accuracy of large language models used as judges for evaluating generated content. Inspired by the Feynman Learning Technique, RTLC prompts a single LLM to act as an ensemble judge without requiring fine-tuning or external tools. The method boosted Claude 3.7 Sonnet's accuracy on the JudgeBench-GPT benchmark by 14 percentage points, outperforming standard self-consistency methods.
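The three-stage flow described above can be sketched as a simple prompt pipeline. This is a minimal illustration under stated assumptions: the `call_llm` helper is a stand-in for any chat-completion client, and the prompt wording is invented here, not taken from the paper.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion API call (e.g. any hosted LLM
    client). Stubbed so the sketch runs without network access."""
    return f"[model output for: {prompt[:40]}...]"

def rtlc_judge(question: str, answer_a: str, answer_b: str) -> str:
    # Stage 1 -- Research: gather the facts and criteria needed to judge,
    # before the model sees its own verdict.
    research = call_llm(
        "Research the key facts and evaluation criteria needed to judge "
        f"answers to this question:\n{question}"
    )
    # Stage 2 -- Teach-to-Learn: Feynman-style, explain each candidate
    # answer as if teaching it, which tends to surface reasoning gaps.
    lesson = call_llm(
        f"Using these notes:\n{research}\n"
        "Explain, as if teaching a student, how each answer addresses the "
        f"question.\nAnswer A: {answer_a}\nAnswer B: {answer_b}"
    )
    # Stage 3 -- Critique: critique the explanation for errors or
    # unsupported claims, then commit to a pairwise verdict.
    verdict = call_llm(
        "Critique this explanation for errors or unsupported claims, then "
        f"state which answer is better, 'A' or 'B':\n{lesson}"
    )
    return verdict

print(rtlc_judge("What is 2 + 2?", "4", "5"))
```

Because all three stages run inside a single model, this behaves like a self-ensemble: each stage conditions on the previous stage's output rather than on independent samples, which is how the card distinguishes RTLC from self-consistency.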

Summary written by gemini-2.5-flash-lite from 1 source.

IMPACT Improves LLM evaluation accuracy, potentially accelerating research and development by providing more reliable automated judging.

RANK_REASON The cluster describes a new academic paper detailing a novel prompting technique for LLMs.

Read on arXiv cs.AI →

COVERAGE [1]

  1. arXiv cs.AI TIER_1 · Andrea Morandi

    RTLC -- Research, Teach-to-Learn, Critique: A three-stage prompting paradigm inspired by the Feynman Learning Technique that lifts LLM-as-judge accuracy on JudgeBench with no fine-tuning

    LLM-as-a-judge is now the default measurement instrument for open-ended generation, but on the public JudgeBench benchmark even strong instruction-tuned judges barely scrape past random on objective-correctness pairwise items. We introduce RTLC, a three-stage prompting recipe -- …