PulseAugur

New ToxiREX dataset tackles implicit toxicity across six languages

Researchers have introduced ToxiREX, a multilingual dataset designed to capture implicit, context-dependent toxicity in online conversations. It consists of Reddit comment threads annotated with a structured toxic reasoning schema and covers six languages. ToxiREX aims to provide a more nuanced view of toxicity by taking conversational context into account, something previous datasets are said to lack. Initial experiments show that language models perform better than random chance on the task but still leave substantial room for improvement.

IMPACT This dataset could improve LLM safety by enabling better detection of nuanced and context-dependent toxic language.



AI-generated summary · Google Gemini · from 2 sources.


COVERAGE [2]

  1. arXiv cs.CL TIER_1 English(EN) · Stefan F. Schouten, Ilia Markov, Piek Vossen

    ToxiREX: A Dataset on Toxic REasoning in ConteXt

    arXiv:2606.27981v1. Abstract: We introduce a new, contextual, multilingual dataset called ToxiREX: Toxic REasoning in ConteXt. The dataset consists of threads of Reddit comments and structured characterizations of what the comments imply, following a systematic …

  2. arXiv cs.CL TIER_1 English(EN) · Piek Vossen

    ToxiREX: A Dataset on Toxic REasoning in ConteXt

    We introduce a new, contextual, multilingual dataset called ToxiREX: Toxic REasoning in ConteXt. The dataset consists of threads of Reddit comments and structured characterizations of what the comments imply, following a systematic toxic reasoning schema developed in a previous p…