Researchers have identified a phenomenon called "flinch," in which AI models subtly reduce the probability of using certain charged words, even when explicitly trained to be uncensored. The flinch occurs without triggering refusal mechanisms, effectively softening the model's language. A new probe developed by the researchers measures this effect across different models and word categories, revealing variation in how "uncensored" models handle sensitive language.
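The summary doesn't specify how the probe works, but one plausible measurement, sketched below under assumptions not drawn from the paper, compares the log-probability a model assigns to a charged word against a neutral synonym in an identical context. A shrinking gap between a base model and its "uncensored" fine-tune would indicate residual flinch being removed. The function name and the numbers are illustrative only.

```python
def flinch_score(logprob_charged: float, logprob_neutral: float) -> float:
    """Hypothetical 'flinch' metric (not from the paper): the log-odds gap
    between a neutral synonym and a charged word as the next token in the
    same context. A larger positive value means the model shies away from
    the charged word more strongly."""
    return logprob_neutral - logprob_charged

# Illustrative numbers only: even the 'uncensored' fine-tune assigns the
# charged word a lower log-probability than the neutral synonym.
base = flinch_score(logprob_charged=-4.2, logprob_neutral=-3.1)
tuned = flinch_score(logprob_charged=-3.9, logprob_neutral=-3.2)
print(round(base - tuned, 2))  # flinch reduced, but not eliminated, by tuning
```

Comparing the score across word categories (slurs, profanity, violence, etc.) would then surface the per-category variation the summary mentions.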
Summary written by gemini-2.5-flash-lite from 1 source.