A U.S. judge has allowed a class-action lawsuit against Databricks to proceed. The suit alleges that the company's DBRX large language model was trained on pirated copyrighted books. The authors claim that MosaicML, which Databricks acquired, used the RedPajama dataset, which contains approximately 196,000 titles, including their works. Databricks argues that the authors cannot prove DBRX was trained on this specific data, but the judge requires further information to determine whether copyright infringement occurred.
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT The potential for significant damages in copyright infringement cases could change how companies acquire LLM training data.
RANK_REASON A class-action lawsuit over alleged copyright infringement in LLM training data is proceeding.