ALIGNBEAM: Inference-Time Alignment Transfer via Cross-Vocabulary Logit Mixing

ALIGNBEAM transfers LLM safety alignment without retraining the model

By PulseAugur Editorial · [2 sources] · 2026-06-10 17:15

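Researchers have developed ALIGNBEAM, a new method that strengthens the safety of large language models without changing their weights. The technique transfers safety alignment from a safe anchor model to a target model, even when the two use different vocabularies. ALIGNBEAM runs at inference time: it converts logits between the models and uses a judge LLM to select the safer continuation, raising refusal rates on adversarial benchmarks while preserving task accuracy and keeping overhead manageable.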
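The summary above does not say how the logits are converted or how the judge is queried, so the following is only a minimal Python sketch of the general idea, not the paper's algorithm. It assumes tokens are matched across vocabularies by exact surface string, that the two distributions are combined by log-space interpolation with a weight `alpha`, and that the judge is a plain scoring function; every name here (`project_to_target_vocab`, `mix_distributions`, `pick_safer`, `alpha`) is hypothetical.

```python
import math

def softmax(logits):
    """Turn a {token: logit} dict into a {token: probability} dict."""
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

def project_to_target_vocab(anchor_probs, target_vocab, floor=1e-9):
    """Carry anchor probability mass onto the target vocabulary.

    Assumption: tokens match by exact string. Anchor tokens missing from
    the target vocabulary are dropped and the remainder is renormalised.
    """
    projected = {t: anchor_probs.get(t, 0.0) + floor for t in target_vocab}
    z = sum(projected.values())
    return {t: p / z for t, p in projected.items()}

def mix_distributions(target_probs, anchor_on_target, alpha):
    """Log-space interpolation; alpha is the weight given to the anchor."""
    eps = 1e-12  # guards against log(0) if a probability underflows
    scores = {
        t: (1 - alpha) * math.log(target_probs[t] + eps)
           + alpha * math.log(anchor_on_target[t] + eps)
        for t in target_probs
    }
    return softmax(scores)

def pick_safer(candidates, judge_score):
    """Return the candidate continuation with the highest judge score."""
    return max(candidates, key=judge_score)

# Toy next-token logits; the two models do not share a vocabulary.
target_logits = {"Sure": 3.0, "Sorry": 0.5, "Here": 1.0}
anchor_logits = {"Sorry": 4.0, "I": 1.0, "Sure": 0.0}

target_probs = softmax(target_logits)
anchor_on_target = project_to_target_vocab(softmax(anchor_logits), target_logits.keys())
mixed = mix_distributions(target_probs, anchor_on_target, alpha=0.7)

# Stand-in for a judge LLM: a hypothetical rule that prefers refusals.
toy_judge = lambda text: 1.0 if text.startswith("Sorry") else 0.0
chosen = pick_safer(["Sure, here is how", "Sorry, I can't help with that"], toy_judge)
```

In this toy setup the target model's most likely token is "Sure", while after mixing with `alpha=0.7` the most likely token becomes "Sorry". A real system would need a more careful vocabulary mapping than exact string matching, since tokenizers split text differently.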

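Impact: Enables LLM safety alignment to be transferred across different model families without retraining, which could improve the safety of specialist models.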

Ranking rationale: This cluster contains an academic paper detailing a new method for LLM safety.


AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

arXiv cs.AI TIER_1 English(EN) · Chirag Chawla, Pratinav Seth, Vinay Kumar Sankarapu · 2026-06-11 04:00

ALIGNBEAM: Inference-Time Alignment Transfer via Cross-Vocabulary Logit Mixing

arXiv:2606.12342v1 Announce Type: cross Abstract: Domain fine-tuning degrades the safety of large language models: fine-tuned specialists readily comply with harmful prompts framed in domain language. Existing inference-time defenses that mix logits from a safe anchor model requi…
arXiv cs.AI TIER_1 English(EN) · Vinay Kumar Sankarapu · 2026-06-10 17:15

ALIGNBEAM: Inference-Time Alignment Transfer via Cross-Vocabulary Logit Mixing

Domain fine-tuning degrades the safety of large language models: fine-tuned specialists readily comply with harmful prompts framed in domain language. Existing inference-time defenses that mix logits from a safe anchor model require both models to share a vocabulary, which rules …
