Brief · PulseAugur

TOOL · arXiv cs.AI · 7h

Can Global XAI Methods Reveal Injected Behaviours in LLMs? SHAP vs Rule Extraction vs RuleSHAP

Researchers have proposed RuleSHAP, a method for detecting and explaining behaviors injected into large language models (LLMs). It combines global SHAP aggregates with rule induction, and it identifies complex, non-univariate triggers more reliably than existing methods such as RuleFit or global SHAP alone. The study reports that RuleSHAP surfaces belief-driven heuristics that can lead to misinformation, with an 82% improvement in MRR@1 over RuleFit.
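The brief quotes MRR@1 without defining it. As a minimal sketch, assuming MRR@k means the usual mean reciprocal rank with hits counted only inside the top k positions, the metric could be computed as below. The function name `mrr_at_k`, the rule labels, and the toy rankings are hypothetical illustrations, not data or code from the paper.

```python
from typing import Hashable, Sequence


def mrr_at_k(rankings: Sequence[Sequence[Hashable]],
             targets: Sequence[Hashable],
             k: int) -> float:
    """Mean reciprocal rank, counting only hits within the top k positions.

    Assumption: this is the standard truncated MRR; the brief does not
    state the exact definition used in the study.
    """
    if len(rankings) != len(targets):
        raise ValueError("need exactly one target per ranking")
    if not rankings:
        return 0.0
    total = 0.0
    for ranked, target in zip(rankings, targets):
        # Ranks are 1-based; a target outside the top k contributes 0.
        for position, item in enumerate(ranked[:k], start=1):
            if item == target:
                total += 1.0 / position
                break
    return total / len(rankings)


# Hypothetical data: each inner list is one method's ranked candidate
# triggers for a single injected behavior; targets holds the trigger
# that was actually injected.
rankings = [
    ["rule_a", "rule_b", "rule_c"],
    ["rule_b", "rule_a", "rule_c"],
    ["rule_c", "rule_a", "rule_b"],
]
targets = ["rule_a", "rule_a", "rule_b"]

score_at_1 = mrr_at_k(rankings, targets, k=1)
score_at_3 = mrr_at_k(rankings, targets, k=3)
```

With k=1 the score reduces to the fraction of cases where the injected trigger is ranked first, so under this assumed definition a gain in MRR@1 means the true trigger lands at the top of the list more often.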

IMPACT Provides a novel method for detecting and understanding potential biases or misinformation triggers within LLMs.

LLMs
GPT
Llama
SHAP
Francesco Sovrano
RuleFit
RuleSHAP