Same Payload, Different Channel: Measuring Trust Asymmetry in Tool-Using Language Models
Researchers have developed a new metric, the Safety Asymmetry Score (SAS), to evaluate how a language model's vulnerability to adversarial attacks changes with the channel that delivers the malicious content. Their study, which tested six production LLMs, found that models designed for agentic roles are more susceptible to attacks embedded in tool descriptions than to the same attacks in user messages. The vulnerability shifts when the content appears in tool outputs instead, indicating that models may implicitly trust tool metadata more than user input.
AI IMPACT: Highlights a critical safety blind spot in current tool-using LLMs, potentially affecting the security of AI agents.