A new research paper published on arXiv details how pretraining filters and guardrails in language models can lead to epistemic injustice. The audit found that these systems disproportionately flag content related to marginalized groups, such as transgender people, women, and Central Americans, while often failing to detect explicit hate speech or private information. Human annotators would have retained a significant majority of the content flagged by these automated systems, highlighting a gap in their ability to capture nuanced representational harms.
IMPACT Reveals how current content moderation systems in LLMs can inadvertently silence marginalized voices, necessitating more nuanced approaches to AI safety.
RANK_REASON The cluster contains an academic paper detailing research findings on language model safety and bias.
- Central Americans
- Common Crawl
- Epistemic injustice
- Language models
- Marco Antonio Stranisci
- Pretraining filters
AI-generated summary · Google Gemini · from 2 sources.