Brief · PulseAugur

TOOL · arXiv cs.AI English(EN) · 1w

One Bias After Another: Mechanistic Reward Shaping and Persistent Biases in Language Reward Models

Researchers have identified persistent biases in language reward models, which are used to align AI language models with human preferences. Even high-quality reward models still favor longer responses, sycophancy, and overconfidence, and the study also reports new biases toward specific answer orders and model-generated styles. The authors propose a post-hoc intervention that mitigates these biases by addressing spurious correlations; it reduces the targeted biases without significantly degrading reward quality and requires minimal labeled data.
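The brief does not say how the intervention works. As a rough illustration of what a post-hoc correction for a single spurious correlation can look like, the sketch below regresses response length out of scalar reward scores. This is generic linear residualization, not the paper's method; the function names and calibration data are hypothetical.

```python
from statistics import mean


def fit_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    if var == 0:
        return 0.0  # no variation in xs, so nothing to regress out
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / var


def debias_rewards(rewards, lengths):
    """Subtract the linear component of reward explained by length.

    Centering on the mean length keeps the average reward unchanged.
    """
    slope = fit_slope(lengths, rewards)
    mean_len = mean(lengths)
    return [r - slope * (n - mean_len) for r, n in zip(rewards, lengths)]


# Hypothetical calibration set: reward rises with response length.
lengths = [20, 40, 60, 80, 100]
rewards = [0.10, 0.35, 0.50, 0.75, 0.90]

adjusted = debias_rewards(rewards, lengths)
residual_slope = fit_slope(lengths, adjusted)
```

After the adjustment, the rewards carry no remaining linear dependence on length in the calibration set. A correction like this only handles one linear correlation; biases such as sycophancy or answer order would need their own features.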

IMPACT Highlights critical limitations in current AI alignment techniques that could affect the reliability and safety of future AI systems.

AI language models
Max Lamparth
Language Reward Models