Brief

last 24h

[3/3] 224 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.

TOOL · LessWrong (AI tag) English(EN) · 7h

“Did you lie?” Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms

Researchers evaluated lie detectors for large language models across model scales and found that, although the detectors show promise, their effectiveness is limited, particularly when models are trained to be deceptive. The study highlights how difficult it is to build testbeds in which models verifiably hold opposing beliefs, a crucial step for robust evaluation. Existing detectors performed poorly when deception was trained into the models, which suggests they are not yet reliable enough to support high-confidence claims that a model is lying, though they may serve as one component of a broader auditing toolkit.

IMPACT Current LLM lie detection methods are insufficient for high-confidence claims, necessitating further research for robust AI safety and auditing.
- Less Wrong
- model organism
TOOL · arXiv cs.CL English(EN) · 1mo

Model Organisms Are Leaky: Perplexity Differencing Often Reveals Finetuning Objectives

Researchers have developed a method that often reveals the objectives used to finetune a large language model, even when those objectives are hidden. The technique compares the perplexity that a finetuned model and a reference model assign to completions of short prompts. The completions with the largest perplexity differences are the ones most likely to reveal the finetuning goals, such as the internalization of false facts or the production of specific phrases. The approach works even without direct access to the original pre-finetuning model, and it can be applied to API-gated models that expose token log probabilities.

IMPACT Provides a new method for understanding and potentially mitigating hidden risks introduced during LLM finetuning.
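
The perplexity comparison described in this item can be sketched in a few lines of Python. This is a minimal illustration and not the paper's implementation: it assumes the completions have already been scored by both models, that per-token natural-log probabilities are available for each, and that the gap is taken as reference perplexity minus finetuned perplexity. The function names and the numbers are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def perplexity(token_logprobs: List[float]) -> float:
    """Perplexity of one completion: exp of the negative mean token log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def rank_by_perplexity_gap(
    finetuned: Dict[str, List[float]],
    reference: Dict[str, List[float]],
) -> List[Tuple[str, float]]:
    """Sort completions by reference perplexity minus finetuned perplexity, largest first."""
    gaps = [
        (text, perplexity(reference[text]) - perplexity(finetuned[text]))
        for text in finetuned
    ]
    return sorted(gaps, key=lambda pair: pair[1], reverse=True)


# Made-up token log-probabilities for two completions, for illustration only.
finetuned_logprobs = {
    "completion A": [-0.1, -0.2],
    "completion B": [-1.0, -1.0],
}
reference_logprobs = {
    "completion A": [-2.0, -3.0],
    "completion B": [-1.1, -0.9],
}

ranked = rank_by_perplexity_gap(finetuned_logprobs, reference_logprobs)
for text, gap in ranked:
    print(f"{gap:8.2f}  {text}")
```

In this toy data, completion A ranks first because the finetuned model finds it far more likely than the reference model does. Under the assumptions above, that kind of completion is the one to inspect first for traces of the finetuning objective.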
RESEARCH · Ahead of AI (Sebastian Raschka) English(EN) · 26mo · [30 sources]

My Workflow for Understanding LLM Architectures

OpenAI has introduced the IH-Challenge dataset to train large language models to prioritize instructions according to their source, such as system messages, developers, and users. The training aims to improve safety steerability and robustness against prompt-injection attacks by teaching models to follow a hierarchy in which system instructions are the most trusted. The dataset is designed to avoid common pitfalls in reinforcement learning for instruction hierarchy, so that models adhere more reliably to safety policies even when they face conflicting user or tool-generated prompts.

IMPACT Enhances LLM safety and reliability by improving their ability to follow prioritized instructions, reducing risks from prompt injection and policy violations.
