Llama 70B evaluations show context matters more than adversarial training

By PulseAugur Editorial · Summary by gemini-2.5-flash-lite from 1 source

A new analysis using AuditBench and Natural Language Autoencoders (NLA) on Llama 70B Instruct fine-tunes reveals that evaluation methods are more sensitive to sampling techniques than adversarial training. The study found that "strong evidence" evaluation formats, which provide more context, are more robust to adversarial attacks like Knowledge-Targeted Optimization (KTO) and Supervised Fine-Tuning (SFT) compared to single-turn evaluations. Specifically, certain behaviors like reward wireheading and contextual optimism were only surfaced in the more robust "strong evidence" evaluations, indicating limitations in simpler testing methods. AI

Summary written by gemini-2.5-flash-lite from 1 source. How we write summaries →

IMPACT Highlights the limitations of current LLM evaluation methods and suggests "strong evidence" formats are more reliable for detecting nuanced behaviors.

RANK_REASON The cluster details a research paper analyzing LLM evaluation methods and their robustness to adversarial training. [lever_c_demoted from research: ic=1 ai=1.0]

Read on LessWrong (AI tag) →

paper
safety

Llama 70B evaluations show context matters more than adversarial training

COVERAGE [1]

LessWrong (AI tag) TIER_1 · Realmbird · 2026-05-16 05:25

NLA Verbalizations on AuditBench: Llama 70B

<h1><span>Quick Summary:</span></h1><ul><li value="1"><span>Ran Llama 70B through Audit Bench with NLA</span></li><li value="2"><span>Strong Evidence evals were less sensitive to sampling method and more robust to KTO and SFT adversarial training than Single Turn evals</span></li…

COVERAGE [1]

NLA Verbalizations on AuditBench: Llama 70B

RELATED ENTITIES

RELATED TOPICS