Brief · PulseAugur

TOOL · arXiv cs.AI English(EN) · 6h

Do LLMs Hold Their Values? MANTA: A Multi-Turn Adversarial Benchmark for Animal Welfare Reasoning

Researchers have developed MANTA, a new benchmark designed to evaluate how well large language models maintain their ethical stances on animal welfare during multi-turn adversarial conversations. The benchmark consists of 1,088 five-turn dialogues that test both value stability and moral sensitivity. When tested on seven frontier models, including Claude Opus 4.7 and GPT-5.5, MANTA revealed that some models' performance rankings shifted significantly under sustained pressure, indicating a potential degradation of their alignment. AI

IMPACT This benchmark could reveal vulnerabilities in LLM alignment, prompting developers to improve robustness against adversarial pressure in sensitive ethical domains.

GPT-5.5
Llama 3.3 70B
DeepSeek V4
Claude Opus 4.7
Gemini 3.1 Flash Lite
Mistral Small
Grok 4.3
MANTA