PulseAugur / Brief
EN
LIVE 09:46:23

Brief

last 24h
[1/1] 223 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.

  1. GradShield: Alignment Preserving Finetuning

    Researchers have developed GradShield, a new method to prevent large language models from becoming misaligned after fine-tuning. The technique identifies and removes harmful data points before they can corrupt the model's alignment by calculating a Finetuning Implicit Harmfulness Score (FIHS) for each data point. Experiments show GradShield effectively maintains model utility while keeping the attack success rate below 6%, outperforming existing baseline methods. AI

    IMPACT This method could enable safer deployment of fine-tuned LLMs by preventing the introduction of harmful behaviors.