PulseAugur
research · [2 sources]

New benchmark 'Prosa' evaluates LLMs on Brazilian Portuguese chats

Researchers have introduced Prosa, a new benchmark designed to evaluate Large Language Models (LLMs) on real user conversations in Brazilian Portuguese. The benchmark uses a rubric-based scoring system with multi-judge filtering to mitigate the bias often found in holistic LLM-as-a-judge evaluations. Prosa comprises 1,000 WildChat conversations and aims to improve the discriminative power of LLM evaluations by increasing score gaps between models.
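The scoring scheme described above can be sketched roughly as follows. This is a hypothetical illustration, not the paper's actual implementation: the judge-verdict format, the agreement threshold, and the majority-vote rule are all assumptions. The idea is that each response is checked against binary rubric criteria by several judge models, criteria where the judges disagree are filtered out, and the score is the fraction of surviving criteria that pass.

```python
from statistics import mean

def rubric_score(rubric_verdicts, agreement=1.0):
    """Binary rubric scoring with multi-judge filtering (illustrative sketch).

    rubric_verdicts: one list per rubric criterion, each holding the binary
    pass/fail verdicts (True/False) from several judge models.
    A criterion is kept only if the judges' agreement on it meets the
    threshold; the response score is the fraction of kept criteria that
    the (majority) verdict marks as passed. Returns None if every
    criterion was filtered out.
    """
    kept = []
    for verdicts in rubric_verdicts:
        pass_rate = mean(verdicts)  # fraction of judges saying "pass"
        # Keep the criterion only when judges are sufficiently unanimous.
        if pass_rate >= agreement or pass_rate <= 1 - agreement:
            kept.append(pass_rate >= 0.5)  # consensus verdict
    if not kept:
        return None
    return sum(kept) / len(kept)

# Three criteria, three judges each. The third criterion (2-of-3 split)
# is dropped by the filter; of the two kept criteria, one passes.
score = rubric_score([[True, True, True],
                      [False, False, False],
                      [True, False, True]])  # → 0.5
```

Filtering disputed criteria before aggregation is what makes the final ranking less sensitive to the choice of judge model, since only verdicts the judges agree on contribute to the score.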

Summary written by gemini-2.5-flash-lite from 2 sources.

IMPACT Introduces a new evaluation benchmark for LLMs in Brazilian Portuguese, potentially improving model assessment and comparison.

RANK_REASON The cluster contains a new academic paper introducing a novel benchmark for LLM evaluation.

Read on arXiv cs.CL →

COVERAGE [2]

  1. arXiv cs.CL TIER_1 · Roseval Malaquias Junior, Giovana Kerche Bonás, Thales Sales Almeida, Hugo Abonizio, Thiago Laitz, Ramon Pires, Marcos Piau, Celio Larcher, Rodrigo Nogueira

    Prosa: Rubric-Based Evaluation of LLMs on Real User Chats in Brazilian Portuguese

    arXiv:2605.01630v1 Announce Type: new Abstract: Rankings produced by holistic LLM-as-a-judge scoring are sensitive to the bias of the chosen judge model. We show that switching to binary rubric scoring with multi-judge filtering removes this sensitivity: decomposing the judgement…

  2. Mastodon — mastodon.social TIER_1 · [email protected]

    How LLMs Work - A complete walkthrough of how large language models like ChatGPT are built: from raw internet text to a conversational assistant. Based on Andrej Karpathy's technical deep dive. #AI #LLM https://ynarwal.github.io/how-llms-work/