New benchmark tests LLMs on animal welfare during adversarial conversations

By PulseAugur Editorial · [1 source] · 2026-06-04 04:00

Researchers have developed MANTA, a new benchmark designed to evaluate how well large language models maintain their ethical stances on animal welfare during multi-turn adversarial conversations. The benchmark consists of 1,088 five-turn dialogues that test both value stability and moral sensitivity. When tested on seven frontier models, including Claude Opus 4.7 and GPT-5.5, MANTA revealed that some models' performance rankings shifted significantly under sustained pressure, indicating a potential degradation of their alignment.
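
To make the idea of a "ranking shift" concrete, here is a minimal Python sketch of how one might compare model rankings at the first turn against rankings at the final turn. It is an illustration only, not the paper's scoring method: the function names, the model labels, and every score below are hypothetical placeholders, not MANTA results.

```python
def rank(scores):
    """Return {model: rank}, where rank 1 is the highest score.

    Ties are broken alphabetically by model name (an arbitrary choice).
    """
    ordered = sorted(scores, key=lambda m: (-scores[m], m))
    return {model: position + 1 for position, model in enumerate(ordered)}


def rank_shifts(first_turn, final_turn):
    """Return {model: change in rank} between two score tables.

    A positive value means the model dropped in the ranking by the final turn.
    """
    before = rank(first_turn)
    after = rank(final_turn)
    return {model: after[model] - before[model] for model in before}


# Hypothetical placeholder scores, NOT taken from the paper.
first_turn = {"model_a": 0.90, "model_b": 0.85, "model_c": 0.80}
final_turn = {"model_a": 0.70, "model_b": 0.82, "model_c": 0.75}

shifts = rank_shifts(first_turn, final_turn)
for model, delta in sorted(shifts.items()):
    print(f"{model}: rank change {delta:+d}")
```

In this made-up example the model that leads at the first turn falls to last place by the final turn, which is the kind of reordering a single-turn evaluation would not show.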

IMPACT This benchmark could reveal vulnerabilities in LLM alignment, prompting developers to improve robustness against adversarial pressure in sensitive ethical domains.

RANK_REASON The cluster contains an academic paper introducing a new benchmark for evaluating LLM behavior.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.AI TIER_1 English(EN) · Isabella Luong, Joyee Chen, Arturs Kanepajs, Jasmine Brazilek, Sankalpa Ghose, David Williams-King, Linh Le, Allen Lu · 2026-06-04 04:00

Do LLMs Hold Their Values? MANTA: A Multi-Turn Adversarial Benchmark for Animal Welfare Reasoning

arXiv:2605.16301v2 Announce Type: replace-cross Abstract: Evaluating animal welfare reasoning in LLMs remains an open challenge despite rapid deployment in consumer and professional contexts where welfare considerations appear implicitly in everyday queries. Existing benchmarks s…
