Brief · PulseAugur

TOOL · arXiv cs.CL · English (EN) · 8h

A large-scale pipeline for LLM-assisted corpus annotation: variation and change in the English consider construction

Researchers have developed a novel four-phase pipeline for automating grammatical annotation of large natural-language corpora with large language models (LLMs). The pipeline, which comprises prompt engineering, pre-hoc evaluation, batch processing, and post-hoc validation, achieved over 98% accuracy in annotating 143,933 'consider' concordance lines from the Corpus of Historical American English via the OpenAI API. A subsequent analysis revealed previously undocumented genre-specific changes in the evaluative consider construction, suggesting that LLMs can substantially accelerate corpus-linguistic research by making tractable questions that were previously out of practical reach.
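The four phases named above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: every function name, the two-label scheme, the 0.95 threshold, and the batch size are assumptions made for the example, and the LLM call is replaced by a pluggable `annotate` callable (here a toy keyword heuristic) so the sketch runs without network access.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical label set; the brief does not list the study's actual categories.
LABELS = ("evaluative", "other")

Annotator = Callable[[str], str]  # takes a prompt, returns one label


def build_prompt(line: str) -> str:
    """Phase 1 (prompt engineering): wrap one concordance line in instructions."""
    return (
        "Classify the use of 'consider' in the sentence below as one of: "
        + ", ".join(LABELS)
        + ".\nSentence: " + line + "\nLabel:"
    )


def pre_hoc_accuracy(annotate: Annotator, gold: List[Tuple[str, str]]) -> float:
    """Phase 2 (pre-hoc evaluation): agreement with hand-labelled pilot lines."""
    if not gold:
        raise ValueError("pilot set must not be empty")
    hits = sum(1 for line, label in gold if annotate(build_prompt(line)) == label)
    return hits / len(gold)


def annotate_in_batches(annotate: Annotator, lines: List[str],
                        batch_size: int = 100) -> List[str]:
    """Phase 3 (batch processing): label the corpus one fixed-size chunk at a time."""
    labels: List[str] = []
    for start in range(0, len(lines), batch_size):
        batch = lines[start:start + batch_size]
        labels.extend(annotate(build_prompt(line)) for line in batch)
    return labels


def post_hoc_accuracy(labels: List[str], checked: Dict[int, str]) -> float:
    """Phase 4 (post-hoc validation): compare a human-checked sample of the output."""
    if not checked:
        raise ValueError("validation sample must not be empty")
    hits = sum(1 for index, label in checked.items() if labels[index] == label)
    return hits / len(checked)


def run_pipeline(annotate: Annotator, lines: List[str],
                 pilot_gold: List[Tuple[str, str]], checked: Dict[int, str],
                 threshold: float = 0.95, batch_size: int = 100) -> dict:
    """Run phases 2 to 4; refuse to annotate at scale if the pilot score is too low."""
    pre = pre_hoc_accuracy(annotate, pilot_gold)
    if pre < threshold:
        raise RuntimeError("pre-hoc accuracy below threshold; revise the prompt")
    labels = annotate_in_batches(annotate, lines, batch_size)
    return {"pre_hoc": pre, "labels": labels,
            "post_hoc": post_hoc_accuracy(labels, checked)}


def toy_annotator(prompt: str) -> str:
    """Stand-in for an LLM call: a crude keyword heuristic, for demonstration only."""
    sentence = prompt.split("Sentence: ", 1)[1].rsplit("\nLabel:", 1)[0]
    return "evaluative" if " a " in sentence or " to be " in sentence else "other"
```

In a real run, `toy_annotator` would be swapped for a function that sends the prompt to a model and parses the returned label; gating the batch phase on the pilot score keeps a weak prompt from being applied to the full corpus.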

IMPACT Enables large-scale linguistic research previously impractical due to manual annotation bottlenecks.

Hugging Face
arXiv
OpenAI API
Corpus of Historical American English
Corpus of Contemporary American English
Cameron Morin