Caught in the Act(ivation): Toward Pre-Output and Multi-Turn Detection of Credential Exfiltration by LLM Agents
Researchers have developed new methods to detect when AI agents might be exfiltrating sensitive credentials. One approach uses activation probes to identify credential access before the agent even outputs information. Another method employs honeytokens and split conformal prediction to detect specific formats of leaked data. Additionally, a cumulative accounting system tracks a leakage budget across multiple conversation turns to catch more sophisticated attacks. AI
IMPACT Introduces novel detection methods for AI agent security vulnerabilities, potentially improving the safety of systems handling sensitive data.