Researchers have identified a phenomenon they call "flinch": AI models subtly lower the probability of producing certain charged words, even when the models have been explicitly trained to be uncensored. The flinch does not trigger refusal mechanisms; instead, it softens the language the model uses. The researchers developed a probe that measures the effect across models and word categories, and it reveals that "uncensored" models vary in how they handle sensitive language.
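The summary does not describe how the probe computes its measurement, so the following is only a minimal sketch of one way such an effect could be quantified. It assumes that, for each charged word in a fixed context, we already have the probability assigned by a reference model and by the model under test. The names `flinch_score` and `mean_flinch_by_category`, the log-ratio metric, and the sample numbers are all hypothetical and are not taken from the research.

```python
import math
from collections import defaultdict


def flinch_score(p_reference: float, p_model: float) -> float:
    """Log-ratio of reference to model probability for one word in one context.

    Hypothetical metric: positive values mean the model under test assigns
    the word less probability than the reference does.
    """
    if p_reference <= 0 or p_model <= 0:
        raise ValueError("probabilities must be positive")
    return math.log(p_reference / p_model)


def mean_flinch_by_category(observations):
    """Average the score per word category.

    observations: iterable of (category, p_reference, p_model) tuples.
    """
    scores = defaultdict(list)
    for category, p_ref, p_model in observations:
        scores[category].append(flinch_score(p_ref, p_model))
    return {category: sum(vals) / len(vals) for category, vals in scores.items()}


# Made-up numbers purely for illustration, not results from the research.
example = [
    ("profanity", 0.20, 0.10),
    ("profanity", 0.08, 0.04),
    ("neutral", 0.30, 0.30),
]
result = mean_flinch_by_category(example)
```

Under this assumed metric, a category whose words are consistently assigned half the reference probability would score about `log(2)`, while a category the model treats the same as the reference would score zero.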
AI-generated summary · Google Gemini · from 1 source.