Researchers have introduced Prosa, a new benchmark designed to evaluate Large Language Models (LLMs) on real user conversations in Brazilian Portuguese. The benchmark uses a rubric-based scoring system with multi-judge filtering to mitigate the bias often found in holistic LLM-as-a-judge evaluations. Prosa comprises 1,000 WildChat conversations and aims to improve the discriminative power of LLM evaluations by widening the score gaps between models.
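To make the mechanism concrete, the sketch below shows one way rubric-based scoring with multi-judge filtering might be combined: each judge scores a conversation per rubric criterion, per-criterion outlier judgments are dropped, and the rest are averaged. The criterion names, score scale, and median-deviation filter here are illustrative assumptions, not Prosa's actual procedure.

```python
from statistics import mean, median

def filter_and_score(judge_scores, max_dev=2.0):
    """Combine per-criterion rubric scores from several LLM judges.

    judge_scores: list of dicts mapping criterion name -> score (e.g. 1-5).
    A judge's score for a criterion is kept only if it lies within
    `max_dev` of the median across judges; surviving scores are averaged.
    Illustrative sketch only -- not Prosa's published method.
    """
    criteria = judge_scores[0].keys()
    result = {}
    for criterion in criteria:
        scores = [judgment[criterion] for judgment in judge_scores]
        med = median(scores)
        kept = [s for s in scores if abs(s - med) <= max_dev]
        result[criterion] = mean(kept)
    return result

# Hypothetical rubric scores from three judges on one conversation.
judges = [
    {"fluency": 4, "helpfulness": 5},
    {"fluency": 4, "helpfulness": 4},
    {"fluency": 1, "helpfulness": 5},  # fluency outlier, filtered out
]
print(filter_and_score(judges))
```

Filtering per criterion rather than per judge keeps a judge's reasonable scores even when one of its ratings is anomalous, which is one plausible way to reduce single-judge bias.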
Summary written by gemini-2.5-flash-lite from 2 sources.
IMPACT: Introduces a new evaluation benchmark for LLMs in Brazilian Portuguese, potentially improving model assessment and comparison.
RANK_REASON: The cluster contains a new academic paper introducing a novel benchmark for LLM evaluation.