Researchers have developed a Stroop-style paradigm to investigate how language models handle conflicting instructions. Their experiments, conducted across 11 open-weight models, reveal that lexical priors persist through an override rather than being replaced. Activation patching on aligned models pinpointed a specific source-position triplet crucial for binding these conflicting pieces of information.
AI IMPACT This research offers a new method for probing LLM behavior, potentially leading to better understanding and control of model responses.
RANK_REASON The cluster contains an academic paper detailing a new experimental method for studying language model behavior.
AI-generated summary · Google Gemini · from 1 source.