PulseAugur

Google DeepMind: SFT Key to Gemini Model Safety

Google DeepMind researchers report that Supervised Fine-Tuning (SFT), rather than other training stages such as Reinforcement Learning (RL), is the primary driver of safety properties in their Gemini models. In their experiments, versions of Gemini 3.1 Pro and Gemini 3 Flash that received only pre-training followed by SFT showed safety performance remarkably similar to that of their production counterparts. The finding suggests that SFT is a high-leverage intervention point for improving model safety and behavior in future Gemini development.

IMPACT Highlights SFT as a critical stage for ensuring AI safety, potentially guiding future development and evaluation strategies.

RANK_REASON Research update from a major AI lab detailing findings on model training and safety properties.


AI-generated summary · Google Gemini · from 2 sources.


COVERAGE [2]

  1. Alignment Forum TIER_1 English(EN) · Josh Engels

    SFT Drives Gemini’s Safety Properties

    This is the third in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The second post can be found at https://www.lesswrong.com/posts/qi4mNbZYAFDYwfRba/buildin…

  2. LessWrong (AI tag) TIER_1 English(EN) · Josh Engels

    SFT Drives Gemini’s Safety Properties

    This is the third in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The second post can be found at https://www.lesswrong.com/posts/qi4mNbZYAFDYwfRba/buildin…