PulseAugur / Brief


Last 24h · 224 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.

  1. Refusal Beyond a Single Direction: A Preliminary Comparison of Diff-in-Means and INLP

    Researchers compared two methods for controlling refusal behavior in AI chat models: Diff-in-Means (DiM) and Iterative Nullspace Projection (INLP). INLP's counterfactual flipping intervention suppressed refusal as effectively as DiM's directional ablation, while INLP's nullspace projection was less effective. Restricting INLP to its key directions retained most of the suppression with minimal impact on model perplexity, which makes the strength of the intervention tunable.

    IMPACT Offers tunable methods for controlling AI refusal, potentially improving safety and reliability in chat models.
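
For readers unfamiliar with the terms, here is a minimal sketch of what a difference-in-means direction and directional ablation usually look like. It is an illustration only, not the paper's implementation: the toy activation vectors and the function names (`diff_in_means_direction`, `ablate_direction`) are assumptions made for this example, and INLP's learned classifier directions and the counterfactual flipping intervention are not shown.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def diff_in_means_direction(harmful, harmless):
    """Unit vector pointing from the harmless mean to the harmful mean."""
    mu_harmful = mean_vector(harmful)
    mu_harmless = mean_vector(harmless)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("the two class means coincide; no direction to extract")
    return [x / norm for x in diff]

def ablate_direction(h, r_hat):
    """Remove the component of h along the unit vector r_hat: h - (h . r_hat) r_hat."""
    coeff = sum(a * b for a, b in zip(h, r_hat))
    return [a - coeff * b for a, b in zip(h, r_hat)]

# Toy activations (hypothetical): the classes differ only along the first axis.
harmful = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
harmless = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

r_hat = diff_in_means_direction(harmful, harmless)
ablated = ablate_direction([3.0, 1.0, 0.5], r_hat)
```

Ablating one unit direction is the same operation as projecting onto the nullspace of that single direction. INLP, as generally described, repeats this kind of projection over several directions obtained from successively trained linear classifiers, which is why the number of directions kept can act as a tuning knob.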