PulseAugur
Behavioral Canaries: Auditing Private Retrieved Context Usage in RL Fine-Tuning

New "behavioral canaries" audit training-data usage in LLM RL fine-tuning

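Researchers have developed a new auditing method, called behavioral canaries, to detect whether a large language model (LLM) improperly used legally protected retrieved context during reinforcement learning from human feedback (RLHF) fine-tuning. Traditional auditing techniques, such as verbatim-memorization checks, fall short for RLHF because the process shifts the model's behavior rather than making it memorize specific facts. The behavioral canary framework pairs document triggers with feedback that elicits stylized responses, which lets auditors identify unauthorized data incorporation with a 67% detection rate at a 10% false-positive rate.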
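To make the reported figures concrete, here is a minimal sketch of how a "detection rate at a fixed false-positive rate" can be computed. It is not the paper's procedure: the audit score (for example, the fraction of trigger prompts that elicit the stylized response) and the function names `threshold_at_fpr` and `detection_rate` are assumptions introduced only for illustration.

```python
import math
from typing import Sequence


def threshold_at_fpr(null_scores: Sequence[float], max_fpr: float) -> float:
    """Pick a threshold so that at most max_fpr of clean models score above it.

    null_scores are hypothetical audit scores from models known NOT to have
    trained on the protected context (an assumption of this sketch).
    """
    ranked = sorted(null_scores, reverse=True)
    # Number of clean models we are allowed to flag by mistake.
    allowed = math.floor(max_fpr * len(ranked) + 1e-9)
    if allowed >= len(ranked):
        return float("-inf")
    # Scores strictly above this value number at most `allowed`.
    return ranked[allowed]


def detection_rate(positive_scores: Sequence[float], threshold: float) -> float:
    """Fraction of violating models whose score exceeds the threshold."""
    flagged = sum(1 for s in positive_scores if s > threshold)
    return flagged / len(positive_scores)


# Hypothetical scores, invented purely to exercise the functions.
clean = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
violating = [10, 11, 5]

t = threshold_at_fpr(clean, max_fpr=0.10)
rate = detection_rate(violating, t)
```

The threshold is calibrated on clean models only, so the false-positive budget is fixed before any suspected model is scored; the detection rate is then whatever fraction of violating models clears that bar.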

Impact: Gives auditors a new way to verify that LLMs comply with data-usage policies during fine-tuning.

Ranking rationale: Academic paper introducing a novel auditing mechanism for LLM fine-tuning.

Read on arXiv cs.CL →

AI-generated summary · Google Gemini · from 2 sources.


Sources [2]

  1. arXiv cs.CL · TIER_1 · English (EN) · Chaoran Chen, Dayu Yuan, Peter Kairouz

    Behavioral Canaries: Auditing Private Retrieved Context Usage in RL Fine-Tuning

    arXiv:2604.22191v1 Announce Type: cross Abstract: In agentic workflows, LLMs frequently process retrieved contexts that are legally protected from further training. However, auditors currently lack a reliable way to verify if a provider has violated the terms of service by incorp…

  2. arXiv cs.CL · TIER_1 · English (EN) · Peter Kairouz

    Behavioral Canaries: Auditing Private Retrieved Context Usage in RL Fine-Tuning

    In agentic workflows, LLMs frequently process retrieved contexts that are legally protected from further training. However, auditors currently lack a reliable way to verify if a provider has violated the terms of service by incorporating these data into post-training, especially …