Researchers have developed a novel method to address copyright challenges in distributing annotated corpora for natural language processing. The technique involves sharing annotations separately from the copyrighted source material, which is provided as a non-reversible hash. Users who legally possess the source text can then hash their own version to match it with the shared annotations, achieving high alignment rates between 98.7% and 99.79% across different novels. A Python implementation called novelshare has been released to facilitate this process. AI
Summary written by gemini-2.5-flash-lite from 1 source. How we write summaries →
IMPACT Enables broader legal sharing of annotated corpora, potentially accelerating NLP research.
RANK_REASON Academic paper proposing a novel method for corpus distribution.