PulseAugur
LLM training costs reverse-engineered; finetuning unlocks latent copyright recall

A recent preprint reports that fine-tuning frontier large language models on a single author's novels can unlock verbatim output from dozens of other copyrighted books that were absent from the fine-tuning data. The recalled text therefore appears to come from material memorized during pretraining rather than from the fine-tuning dataset. As a control, fine-tuning on synthetic text produced near-zero extraction. The finding could shift copyright liability toward the model developers.

Impact: This research could change how copyright liability is assigned to AI labs by highlighting that LLMs can recall latent pretraining text verbatim.

Ranking rationale: The cluster discusses findings from a new preprint on LLM behavior and its copyright implications.

AI-generated summary · Google Gemini · from 3 sources.

Sources [3]

  1. Mastodon — fosstodon.org TIER_1 English(EN) · [email protected] ·

    Dwarkesh's two-hour blackboard discussion with Reiner Pope deduces how frontier LLMs are actually trained and served, using just API price lists, public benchmark numbers, and memory-bandwidth arithmetic. The thesis: you can reverse-engineer frontier architecture from the per-tok…

  2. Mastodon — fosstodon.org TIER_1 English(EN) · [email protected] ·

    A new preprint finds that finetuning frontier LLMs on one author's novels unlocks verbatim output from dozens of other copyrighted books the model never saw at finetune time. The real finding is the control: synthetic-text finetuning produces near-zero extraction. So the copies a…

  3. Mastodon — mastodon.social TIER_1 English(EN) · [email protected] ·

    Dwarkesh's two-hour blackboard discussion with Reiner Pope deduces how frontier LLMs are actually trained and served, using just API price lists, public benchmark numbers, and memory-bandwidth arithmetic. The thesis: you can reverse-engineer frontier architecture from the per-tok…