Researchers have developed BITE, a black-box adversarial framework that exploits stylistic biases in LLM judges to artificially inflate the scores they assign. BITE frames the selection of stylistic edits as a contextual bandit problem and uses a LinUCB policy to adaptively choose the edits that maximize judge scores, without needing access to model parameters. The framework achieved an attack success rate above 65% and raised scores by 1 to 2 points on a 9-point scale while maintaining semantic equivalence and evading detection methods, highlighting a significant vulnerability in the LLM-as-a-judge paradigm.
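To make the bandit framing concrete, here is a minimal sketch of a disjoint LinUCB policy choosing among stylistic edits. It is not the paper's implementation: the edit names, the two-dimensional context features, the `alpha` value, and the simulated `toy_judge_gain` function are all hypothetical stand-ins, since the summary does not specify how BITE defines its contexts, edit set, or reward.

```python
import numpy as np


class LinUCB:
    """Disjoint LinUCB: one ridge-regression model per arm (here, per edit)."""

    def __init__(self, n_arms: int, dim: int, alpha: float = 1.0):
        self.alpha = alpha  # exploration strength (assumed value, not from the paper)
        self.A = [np.eye(dim) for _ in range(n_arms)]    # identity acts as a ridge prior
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # reward-weighted context sums

    def select(self, x: np.ndarray) -> int:
        """Return the arm with the highest upper confidence bound for context x."""
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                              # estimated reward weights
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)    # uncertainty bonus
            scores.append(theta @ x + bonus)
        return int(np.argmax(scores))

    def update(self, arm: int, x: np.ndarray, reward: float) -> None:
        """Fold the observed reward for (arm, x) into that arm's statistics."""
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x


# Hypothetical edit names; the actual edit set is not given in the summary.
EDITS = ["add_headings", "more_formal_tone", "add_confident_phrasing"]


def toy_judge_gain(arm: int, x: np.ndarray, rng: np.random.Generator) -> float:
    """Simulated score gain standing in for a black-box judge query."""
    true_theta = np.array([[0.1, 0.0], [0.5, 0.2], [0.2, 0.9]])
    return float(true_theta[arm] @ x + rng.normal(0.0, 0.05))


def run(n_rounds: int = 500, seed: int = 0):
    """Run the policy against the simulated judge; return it with per-arm pull counts."""
    rng = np.random.default_rng(seed)
    policy = LinUCB(n_arms=len(EDITS), dim=2, alpha=0.5)
    counts = np.zeros(len(EDITS), dtype=int)
    for _ in range(n_rounds):
        x = rng.uniform(0.0, 1.0, size=2)  # hypothetical features of the response
        arm = policy.select(x)
        policy.update(arm, x, toy_judge_gain(arm, x, rng))
        counts[arm] += 1
    return policy, counts
```

Only the observed score is fed back to the policy, which is what makes the setting black-box: nothing in the loop needs the judge's parameters or gradients. In a real attack, `toy_judge_gain` would be replaced by a query to the judge being tested.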
IMPACT Exposes a fundamental weakness in LLM-based evaluation systems, necessitating the development of more robust and attack-aware assessment methods.
RANK_REASON Academic paper detailing a new method for attacking LLM judges.
AI-generated summary · Google Gemini · from 1 source.