research

OpenAI estimates worst-case risks of open-weight LLMs via malicious fine-tuning

OpenAI researchers probed the potential risks of releasing open-weight large language models by introducing a method called malicious fine-tuning (MFT): fine-tuning the open-weight model gpt-oss to be as capable as possible in two domains, biology and cybersecurity, in order to estimate its worst-case capabilities. The study found that while the MFT version of gpt-oss showed marginal improvements in biological capabilities over other open-weight models, it did not significantly advance the frontier and underperformed closed-weight models on the relevant risk evaluations. These findings informed OpenAI's decision to release the model and are intended to guide risk assessments for future open-weight releases.
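
The paper describes MFT at the methodology level; the actual fine-tuning pipeline is not published. As a rough sketch of what domain-specific supervised fine-tuning of an open-weight checkpoint can look like with the Hugging Face stack: the model ID (openai/gpt-oss-20b), the domain_corpus.jsonl file, and all hyperparameters below are illustrative assumptions, not OpenAI's actual MFT setup.

```python
# Illustrative sketch: supervised fine-tuning of an open-weight causal LM
# on a domain corpus. Model ID, data file, and hyperparameters are assumed.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL_ID = "openai/gpt-oss-20b"  # assumed open-weight checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# Some tokenizers ship without a pad token; reuse EOS so batching works.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Hypothetical domain corpus: a JSONL file with a "text" field per record.
dataset = load_dataset("json", data_files="domain_corpus.jsonl")["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# mlm=False yields standard next-token (causal LM) labels from the inputs.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="mft-domain-sketch",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=1,
        learning_rate=1e-5,
        bf16=True,
    ),
    data_collator=collator,
    train_dataset=tokenized,
)
trainer.train()
```

The paper's contribution is the evaluation framing, comparing the fine-tuned model against open- and closed-weight baselines on risk benchmarks, rather than any particular training recipe.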

Summary written by gemini-2.5-flash-lite from 1 source.

Read on OpenAI News →

COVERAGE [1]

  1. OpenAI News (Tier 1)

    Estimating worst-case frontier risks of open-weight LLMs

    In this paper, we study the worst-case frontier risks of releasing gpt-oss. We introduce malicious fine-tuning (MFT), where we attempt to elicit maximum capabilities by fine-tuning gpt-oss to be as capable as possible in two domains: biology and cybersecurity.