New BITE framework exploits LLM judge biases to inflate scores

By PulseAugur Editorial · [1 source] · 2026-05-27 04:00

Researchers have introduced BITE, a black-box adversarial framework that exploits stylistic biases in LLM judges to inflate the scores they assign. BITE frames the selection of stylistic edits as a contextual bandit problem and uses a LinUCB policy to adaptively choose the edits that maximize judge scores, without access to model parameters. The framework achieved an attack success rate above 65% and raised scores by 1-2 points on a 9-point scale while maintaining semantic equivalence and evading detection methods. The results highlight a significant vulnerability in the LLM-as-a-judge paradigm.
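To make the bandit framing concrete, here is a minimal sketch of a disjoint LinUCB policy choosing among stylistic edits. It is not the authors' implementation: the edit names, the two-dimensional context vector, the exploration weight `alpha`, and the `toy_judge_gain` function (a noisy stand-in for a black-box judge's score change) are all assumptions made for illustration.

```python
import math
import random

# Hypothetical edit names; the source does not list BITE's actual edit set.
EDITS = ["add_detail", "formal_tone", "bullet_list"]


class LinUCBArm:
    """Ridge-regression state for one arm (one stylistic edit)."""

    def __init__(self, dim):
        # Inverse of the ridge prior A = I, kept up to date incrementally.
        self.A_inv = [[1.0 if i == j else 0.0 for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim

    def _matvec(self, x):
        return [sum(row[j] * x[j] for j in range(len(x))) for row in self.A_inv]

    def ucb(self, x, alpha):
        theta = self._matvec(self.b)                     # theta = A^{-1} b
        mean = sum(t * xi for t, xi in zip(theta, x))    # predicted reward
        var = sum(xi * vi for xi, vi in zip(x, self._matvec(x)))
        return mean + alpha * math.sqrt(max(var, 0.0))   # optimism bonus

    def update(self, x, reward):
        # Sherman-Morrison rank-one update of A^{-1} after A += x x^T.
        v = self._matvec(x)
        denom = 1.0 + sum(xi * vi for xi, vi in zip(x, v))
        for i in range(len(x)):
            for j in range(len(x)):
                self.A_inv[i][j] -= v[i] * v[j] / denom
        for i in range(len(x)):
            self.b[i] += reward * x[i]


def choose_edit(arms, x, alpha=1.0):
    """Pick the edit with the highest upper confidence bound."""
    return max(arms, key=lambda name: arms[name].ucb(x, alpha))


def toy_judge_gain(edit, x, rng):
    """Assumed stand-in for a black-box judge: noisy score gain per edit."""
    base = {"add_detail": 1.0, "formal_tone": 0.3, "bullet_list": 0.1}[edit]
    return base * x[0] + rng.gauss(0.0, 0.05)


def run(rounds=300, seed=0):
    rng = random.Random(seed)
    arms = {name: LinUCBArm(dim=2) for name in EDITS}
    counts = {name: 0 for name in EDITS}
    for _ in range(rounds):
        x = [1.0, rng.random()]        # bias term + one hypothetical feature
        edit = choose_edit(arms, x)
        arms[edit].update(x, toy_judge_gain(edit, x, rng))
        counts[edit] += 1
    return counts


if __name__ == "__main__":
    print(run())
```

In this toy setting the policy only ever sees the reward returned for the edit it chose, which mirrors the black-box constraint: no judge parameters are needed, only scores. Over repeated rounds it should concentrate its pulls on whichever edit the simulated judge rewards most.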

IMPACT Exposes a fundamental weakness in LLM-based evaluation systems, necessitating the development of more robust and attack-aware assessment methods.

RANK_REASON Academic paper detailing a new method for attacking LLM judges.


paper
safety

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.AI TIER_1 English(EN) · Xianglin Yang, Bryan Hooi, Gelei Deng, Tianwei Zhang, Jin Song Dong · 2026-05-27 04:00

Turning Bias into Bugs: Bandit-Guided Style Manipulation Attacks on LLM Judges

arXiv:2605.26156v1 Announce Type: cross Abstract: The known stylistic biases in LLM judges, such as a preference for verbosity or specific sentence structures, present an underexplored security vulnerability. In this work, we introduce BITE (BIas exploraTion and Exploitation), a …
