New Research Tackles LLM Nuances in Translation, Bias, and Multilingual Tasks

By PulseAugur Editorial · [15 sources] · 2026-05-26 04:00

Several new research papers examine how large language models (LLMs) behave across languages and cultural contexts. One study introduces LLMBridge, an LLM-based pipeline for end-to-end referential bridging resolution in English that outperforms previous state-of-the-art models. Another presents a human-evaluation benchmark for cultural localization in machine translation and finds that idioms and puns are particularly challenging for LLMs. GRUFF, a study of LLMs in German, reveals problems with pronoun fidelity and bias, especially for neopronouns. Other papers investigate the roles that instruction, source, and response languages play in multilingual task execution, cultural biases in Asian languages, and a method for mitigating cross-lingual cultural inconsistencies.

IMPACT Together, these studies show that LLMs still struggle with cultural nuance, robust multilingual performance, and unbiased reasoning, pointing to areas for future research and model improvement.

RANK_REASON Cluster consists of multiple academic papers published on arXiv, focusing on LLM research and evaluation.

paper
safety

AI-generated summary · Google Gemini · from 15 sources.

COVERAGE [15]

arXiv cs.CL TIER_1 English(EN) · Lauren Levine, Amir Zeldes · 2026-05-29 04:00

LLMBridge: An LLM Pipeline for End-to-end Referential Bridging Resolution in English

arXiv:2605.29048v1 (announce type: new). Abstract: In this paper, we introduce LLMBridge, a new LLM based system for the task of end-to-end referential bridging resolution in English. Our bridging resolution pipeline combines heuristic pre/post-processing with the natural language i…
arXiv cs.CL TIER_1 English(EN) · Madison Van Doren, Casey Ford, Jennifer Barajas, Riley VanMeter, Cory Holland · 2026-05-29 04:00

"Be My Cheese?": Cultural Nuance Benchmarking for Machine Translation in Multilingual LLMs

arXiv:2602.04729v2 (announce type: replace). Abstract: We present a large-scale human evaluation benchmark for assessing cultural localisation in machine translation produced by state-of-the-art multilingual large language models (LLMs). Existing MT benchmarks emphasise token-level …
arXiv cs.CL TIER_1 English(EN) · Fabian Mewes, Anne Lauscher, Vagrant Gautam · 2026-05-29 04:00

GRUFF: LLM Pronoun Fidelity, Reasoning, and Biases in German

arXiv:2605.30214v1 (announce type: new). Abstract: Third-person singular pronouns have long been used to study stereotypical biases in language models and to test their abilities to reason about reference. More recently, the interplay between reasoning and bias has been investigated…
arXiv cs.CL TIER_1 English(EN) · Vagrant Gautam · 2026-05-28 16:47

GRUFF: LLM Pronoun Fidelity, Reasoning, and Biases in German

Third-person singular pronouns have long been used to study stereotypical biases in language models and to test their abilities to reason about reference. More recently, the interplay between reasoning and bias has been investigated with the task of pronoun fidelity, which assess…
arXiv cs.CL TIER_1 English(EN) · Qishi Zhan, Minxuan Hu, Seoyeon Jang, Lei Zhao, Ziheng Chen, Man Liang, Xinyue Xiang, Jiaxin Liu, Guansu Wang, Liang He · 2026-05-28 04:00

Disentangling Language Roles in Multilingual LLM Task Execution

arXiv:2605.27649v1 (announce type: new). Abstract: Multilingual LLMs are increasingly used when instruction, source content, and required response languages do not coincide. Existing benchmarks have expanded multilingual instruction-following evaluation, but they rarely isolate thes…
arXiv cs.CL TIER_1 English(EN) · Tarek Naous, Anagha Savit, Carlos Rafael Catalan, Geyang Guo, Jaehyeok Lee, Kyungdon Lee, Lheane Marie Dizon, Mengyu Ye, Neel Kothari, Sahajpreet Singh, Sarah Masud, Tanish Patwa, Trung Thanh Tran, Zohaib Khan, Alan Ritter, Tanmoy Chakraborty, Yuki Arase… · 2026-05-28 04:00

Camellia: Benchmarking Cultural Biases in LLMs for Asian Languages

arXiv:2510.05291v2 (announce type: replace). Abstract: As Large Language Models (LLMs) develop stronger multilingual capabilities, their sensitivity to culturally diverse entities becomes increasingly important. Prior work by Naous et al. (2024) has shown that LLMs often favor Weste…
arXiv cs.AI TIER_1 English(EN) · Santiago Acevedo, Alessandro Laio, Marco Baroni · 2026-05-28 04:00

Differential syntactic and semantic encoding in LLMs

arXiv:2601.04765v4 (announce type: replace-cross). Abstract: We study how syntactic and semantic information is encoded in inner layer representations of Large Language Models (LLMs), focusing on the very large DeepSeek-V3. We find that, by averaging hidden-representation vectors of…
arXiv cs.AI TIER_1 English(EN) · Manan Uppadhyay, Prashant Kodali, Pranjal Chitale, Reshma Ramaprasad, Himanshu Beniwal, Sunayana Sitaram · 2026-05-28 04:00

DEPART: DEcomposing PARiTy across Multilingual LLMs

arXiv:2605.28163v1 (announce type: cross). Abstract: Multilingual Large Language Models (mLLMs) leaderboards report per-language accuracy but rarely explain why disparities emerge, leaving systemic biases unattributed and offering practitioners no actionable levers. We first establi…
arXiv cs.AI TIER_1 English(EN) · Irune Zubiaga, Aitor Soroa, Rodrigo Agerri · 2026-05-28 04:00

Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study

arXiv:2605.28710v1 (announce type: cross). Abstract: Large language models (LLMs) are increasingly used for the automatic evaluation of generated text, yet most prior work focuses on English. Despite the growing demand for multilingual evaluation, extending LLM-based evaluators to m…
arXiv cs.CL TIER_1 English(EN) · Lucas Resck, Isabelle Augenstein, Anna Korhonen · 2026-05-28 04:00

Mitigating Cross-Lingual Cultural Inconsistencies in LLMs via Consensus-Driven Preference Optimisation

arXiv:2605.12515v2 (announce type: replace). Abstract: Despite their impressive capabilities, multilingual large language models (MLLMs) frequently exhibit inconsistent behaviour when the prompt's language changes. While such adaptation is generally desirable, it becomes a critical …
arXiv cs.AI TIER_1 English(EN) · Rodrigo Agerri · 2026-05-27 16:33

Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study

Large language models (LLMs) are increasingly used for the automatic evaluation of generated text, yet most prior work focuses on English. Despite the growing demand for multilingual evaluation, extending LLM-based evaluators to multilingual settings remains challenging, particul…
arXiv cs.CL TIER_1 English(EN) · Sunayana Sitaram · 2026-05-27 08:45

DEPART: DEcomposing PARiTy across Multilingual LLMs

Multilingual Large Language Models (mLLMs) leaderboards report per-language accuracy but rarely explain why disparities emerge, leaving systemic biases unattributed and offering practitioners no actionable levers. We first establish that these gaps are systematic rather than arti…
arXiv cs.CL TIER_1 English(EN) · Yoonwon Jung, Aaron S. Cohen, Benjamin K. Bergen · 2026-05-26 04:00

Discovering Lexical Gaps Using Embeddings from Multilingual LLMs

arXiv:2605.24310v1 (announce type: new). Abstract: Lexical gaps are words that do not exist in certain languages. They pose challenges for building multilingual lexical resources, for machine translation, and for cross-lingual transfer. Existing lexical gap detection relies on human…
dev.to — LLM tag TIER_1 English(EN) · Ai developer · 2026-05-29 09:00

One Ruler to Measure Them All: How Language Affects LLM Quality

Most discussions about LLM performance focus on the model architecture and prompting. But there's a hidden factor: the tokenizer. It determines how much of your text fits in the context window. …
dev.to — LLM tag TIER_1 English(EN) · Ai developer · 2026-05-29 06:01

One Ruler to Measure Them All: How Language Affects LLM Quality

Most discussions about LLM performance focus on the model architecture and prompting. But there's a hidden factor: the tokenizer. It determines how much of your text fits in the context window. …
