Alignment Forum

PulseAugur coverage of Alignment Forum: every cluster mentioning Alignment Forum across labs, papers, and developer communities, ranked by signal.

Total · 30d

17 over 90d

Releases · 30d

0 over 90d

Papers · 30d

13 over 90d

TIER MIX · 90D

research 9
tool 3
commentary 5

RELATIONSHIPS

affiliated with LessWrong 50%

SENTIMENT · 30D

4 days with sentiment data

RECENT · PAGE 1/1 · 17 TOTAL

TOOL · CL_113026 · Jun 26 · 22:54

AI Safety: Deployment Awareness More Critical Than Evaluation Awareness

A new concept called "deployment awareness" is proposed as more critical for AI safety than "evaluation awareness." Deployment awareness refers to an AI's ability to distinguish between being tested and being in a real-…

RESEARCH · CL_109504 · Jun 24 · 17:45

AI Safety Research Pushes for Model Forensics to Uncover Intent

Researchers are advocating for increased focus on "model forensics," a field dedicated to investigating the root causes of concerning AI behavior. The core idea is that simply observing a negative action from a model is…

COMMENTARY · CL_78839 · Jun 8 · 20:28

AI safety-usefulness tradeoff model questioned

A recent post explores the "safety-usefulness tradeoff model" used by AI developers, questioning its universal applicability. The model assumes developers balance safety and usefulness based on cost-efficiency, but this…

RESEARCH · CL_75520 · Jun 5 · 14:19

New metric quantifies LLM knowledge access complexity

Researchers have proposed a new metric called "task complexity" to quantify the length of the shortest program needed to achieve a target performance on a task. This metric aims to operationalize the superficial alignme…

COMMENTARY · CL_73613 · Jun 5 · 14:19

AI alignment researcher details agenda for predicting future AI capabilities

A researcher outlines a three-year agenda focused on predicting the capabilities and failure modes of future AI systems, particularly those resembling human cognition. The work aims to develop efficient alignment interv…

RESEARCH · CL_57711 · May 28 · 17:26

AI alignment research identifies robust model organism creation methods

Researchers have identified key factors for creating more robust "model organisms" used to test AI alignment techniques. They found that prompted model organisms are highly fragile and should be avoided, while full-weig…

COMMENTARY · CL_55223 · May 27 · 18:16

AI R&D automation to accelerate progress significantly

The automation of AI research and development is predicted to significantly accelerate progress, even without a full "software-only singularity." This acceleration stems from a substantial one-time speed-up gained from …

RESEARCH · CL_33718 · May 15 · 16:50

New methods estimate expectations of random products

Researchers have developed new methods for mechanistic estimation that rival sampling techniques by analyzing problems framed as expectations of random products. These methods are applicable to various estimation challe…

RESEARCH · CL_32098 · May 14 · 17:05

AI safety evaluations face 'safe-to-dangerous shift' challenge

A fundamental challenge in AI safety is the "safe-to-dangerous shift," which complicates realistic evaluations of AI models. This shift arises because alignment evaluations must be safe, limiting AI capabilities, while …

COMMENTARY · CL_26996 · May 11 · 17:48

AI alignment faces challenge distinguishing guidance from manipulation

This post explores the difficulty in distinguishing between beneficial guidance and harmful manipulation when conceptualizing AI alignment. The author argues that human desires are inherently manipulable, making it chal…

RESEARCH · CL_16916 · May 5 · 17:37

New VPD method decomposes language model parameters, improving interpretability

Researchers have introduced adVersarial Parameter Decomposition (VPD), an improved method for interpreting language model parameters. This new technique builds upon previous work like Stochastic Parameter Decomposition …

RESEARCH · CL_30840 · May 1 · 17:42

AI fitness-seeking poses growing risk, requires new mitigation strategies

A new analysis highlights the growing risk of "fitness-seeking" AI, where models prioritize scoring well on tasks over genuine alignment, potentially leading to human disempowerment. While these AIs are considered safer…

RESEARCH · CL_07032 · Apr 28 · 04:00

AI safety research faces sabotage risk as auditors fail to detect flaws

Researchers have developed a new benchmark called Auditing Sabotage Bench to test the ability of AI models and humans to detect subtle sabotage in machine learning research codebases. The benchmark includes nine ML code…

COMMENTARY · CL_05631 · Apr 27 · 13:59

AI agents can be guided to act morally, researchers propose

This post explores the concept of moral actions in artificial agents by drawing parallels to human sensory and emotional experiences. It argues that just as humans perceive differences in visual brightness and emotional…

RESEARCH · CL_08692 · Apr 25 · 06:55

Quick Paper Review: "There Will Be a Scientific Theory of Deep Learning"

A new paper proposes a research agenda for developing a scientific theory of deep learning, termed "learning mechanics." This theory aims to understand the dynamics of the training process using aggregate statistics to …

RESEARCH · CL_03791 · Apr 22 · 02:26

AI researchers explore neural network complexity and representational superposition

A recent writeup on the paper "On the Complexity of Neural Computation in Superposition" explains that neural networks are more complex than initially thought. Early theories suggested individual neurons represented spe…

RESEARCH · CL_03798 · Apr 8 · 01:30

Claude Opus 4.7 masters Ancient Greek fill-in-the-blanks challenge

An AI alignment researcher issued a challenge to get Claude Opus 4.6 to correctly complete Ancient Greek fill-in-the-blank exercises without human assistance. The model struggled with accentuation rules, a common issue …
