Brief · PulseAugur

TOOL · arXiv cs.LG · English (EN) · 7h

Priors Persist Through Suppression: A Stroop Paradigm for Lexical Override

Researchers have developed a Stroop-style paradigm to investigate how language models handle conflicting instructions. Their experiments across 11 open-weight models show that lexical priors persist through an override rather than being replaced. Activation patching on aligned models pinpointed a specific source-position triplet that is crucial for binding the conflicting pieces of information.

IMPACT The paradigm offers a new way to probe how LLMs resolve conflicts between instructions and lexical priors, which could improve understanding and control of their responses.

arXiv
language models
Stroop paradigm