Do LLMs Hold Their Values? MANTA: A Multi-Turn Adversarial Benchmark for Animal Welfare Reasoning
Researchers have developed MANTA, a new benchmark designed to evaluate how well large language models maintain their ethical stances on animal welfare during multi-turn adversarial conversations. The benchmark consists of 1,088 five-turn dialogues that test both value stability and moral sensitivity. When tested on seven frontier models, including Claude Opus 4.7 and GPT-5.5, MANTA revealed that some models' performance rankings shifted significantly under sustained pressure, indicating a potential degradation of their alignment.
AI IMPACT This benchmark could reveal vulnerabilities in LLM alignment, prompting developers to improve robustness against adversarial pressure in sensitive ethical domains.