P3B3: A Multi-Turn Conversational Benchmark for Measuring European and Brazilian Portuguese Variety Bias in LLMs
P3B3 is a new multi-turn conversational benchmark that assesses how large language models (LLMs) handle two varieties of Portuguese: European Portuguese (pt-PT) and Brazilian Portuguese (pt-BR). It targets a known imbalance: pt-BR data is more prevalent, which leads LLMs to favor that variety. Experiments with P3B3 found that most tested LLMs show a strong preference for pt-BR, and that the degree to which this preference can be controlled varies across models. The results underscore the need for more balanced representation of language varieties in LLMs.
AI IMPACT: Highlights the need for improved representation of linguistic diversity in LLMs to ensure equitable and reliable performance across language varieties.