Priors Persist Through Suppression: A Stroop Paradigm for Lexical Override
Researchers have developed a Stroop-style paradigm to investigate how language models handle conflicting instructions. Experiments across 11 open-weight models show that lexical priors persist through an override rather than being replaced. Activation patching on aligned models pinpointed a specific source-position triplet that is crucial for binding the conflicting pieces of information.
AI IMPACT: This research offers a new method for probing LLM behavior, which could lead to a better understanding of model responses and more control over them.
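
For readers unfamiliar with activation patching by source position, the general idea can be sketched with a toy example. This is not the authors' code or model: the functions `hidden_states`, `readout`, `patch_positions`, and `recovery`, along with the weights and inputs, are hypothetical stand-ins chosen so that three positions carry all of the signal. The sketch runs a "corrupted" input, overwrites the activations at chosen positions with those cached from a "clean" run, and reports how much of the output gap the patch restores.

```python
# Toy illustration of activation patching by source position.
# All names, weights, and inputs are hypothetical; no real LLM is involved.

def hidden_states(tokens):
    """Per-position 'activations': here simply a fixed function of each token."""
    return [float(t) * 0.5 for t in tokens]


def readout(hidden, weights):
    """Scalar output (think: a logit difference) as a weighted sum over positions."""
    return sum(w * h for w, h in zip(weights, hidden))


def patch_positions(clean_tokens, corrupt_tokens, weights, positions):
    """Run the corrupted input, but overwrite the activations at `positions`
    with the ones cached from the clean run, then return the output."""
    clean_h = hidden_states(clean_tokens)
    patched = hidden_states(corrupt_tokens)
    for p in positions:
        patched[p] = clean_h[p]
    return readout(patched, weights)


def recovery(clean_tokens, corrupt_tokens, weights, positions):
    """Fraction of the clean-vs-corrupt output gap restored by the patch."""
    clean_out = readout(hidden_states(clean_tokens), weights)
    corrupt_out = readout(hidden_states(corrupt_tokens), weights)
    patched_out = patch_positions(clean_tokens, corrupt_tokens, weights, positions)
    return (patched_out - corrupt_out) / (clean_out - corrupt_out)


# Assumed toy setup: only positions 1, 2, and 3 influence the output.
weights = [0.0, 2.0, 1.0, 1.0, 0.0]
clean = [4, 6, 2, 8, 4]
corrupt = [0, 0, 0, 0, 0]

if __name__ == "__main__":
    for positions in ([1, 2, 3], [0, 4], [1]):
        print(positions, recovery(clean, corrupt, weights, positions))
```

In this toy setup, patching the three weighted positions together restores the whole gap, while patching the other positions restores none of it. That is the kind of signature a position-level patching sweep looks for when localizing where information is bound.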