A large-scale pipeline for LLM-assisted corpus annotation: variation and change in the English consider construction
Researchers have developed a four-phase pipeline for automating grammatical annotation in large natural language corpora using large language models (LLMs). The pipeline comprises prompt engineering, pre-hoc evaluation, batch processing, and post-hoc validation, and it achieved over 98% accuracy in annotating 143,933 'consider' concordance lines from the Corpus of Historical American English via the OpenAI API. A subsequent analysis revealed previously undocumented genre-specific changes in the evaluative consider construction, suggesting that LLMs can significantly accelerate corpus linguistic research by enabling the exploration of questions previously out of practical reach.
IMPACT Enables large-scale linguistic research previously impractical due to manual annotation bottlenecks.
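
The four phases named above (prompt engineering, pre-hoc evaluation, batch processing, post-hoc validation) can be sketched roughly as follows. This is only an illustrative skeleton, not the authors' implementation: the label set, the prompt wording, the `min_accuracy` threshold, and all function names are assumptions, and `stub_annotator` is a toy stand-in for the LLM call so that the example runs without network access.

```python
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical label set; the summary does not list the real annotation categories.
LABELS = ("evaluative", "other")


def build_prompt(line: str) -> str:
    """Phase 1 (prompt engineering): wrap one concordance line in an instruction."""
    return (
        "Classify the use of 'consider' in this concordance line as one of "
        f"{', '.join(LABELS)}.\nLine: {line}\nLabel:"
    )


def stub_annotator(prompt: str) -> str:
    """Toy stand-in for an LLM call; a real pipeline would query an API here."""
    text = prompt.split("Line: ", 1)[1].rsplit("\nLabel:", 1)[0].lower()
    # Crude heuristic used only to keep the sketch self-contained.
    return "evaluative" if (" to be " in text or " as " in text) else "other"


def accuracy(annotate: Callable[[str], str], gold: List[Tuple[str, str]]) -> float:
    """Phase 2 (pre-hoc evaluation): score the prompt on hand-labelled lines."""
    correct = sum(annotate(build_prompt(line)) == label for line, label in gold)
    return correct / len(gold)


def annotate_in_batches(
    lines: List[str], annotate: Callable[[str], str], batch_size: int
) -> List[str]:
    """Phase 3 (batch processing): annotate the corpus in fixed-size chunks."""
    labels: List[str] = []
    for start in range(0, len(lines), batch_size):
        batch = lines[start:start + batch_size]
        labels.extend(annotate(build_prompt(line)) for line in batch)
    return labels


def review_sample(
    lines: List[str], labels: List[str], k: int, seed: int = 0
) -> List[Tuple[str, str]]:
    """Phase 4 (post-hoc validation): draw a random sample for manual checking."""
    rng = random.Random(seed)
    return rng.sample(list(zip(lines, labels)), k)


def run_pipeline(
    lines: List[str],
    gold: List[Tuple[str, str]],
    annotate: Callable[[str], str],
    min_accuracy: float = 0.9,  # assumed threshold, not taken from the source
    batch_size: int = 2,
) -> Dict[str, object]:
    """Chain the four phases; stop early if the prompt fails pre-hoc evaluation."""
    pre_hoc = accuracy(annotate, gold)
    if pre_hoc < min_accuracy:
        raise ValueError(f"pre-hoc accuracy {pre_hoc:.2f} is below {min_accuracy}")
    labels = annotate_in_batches(lines, annotate, batch_size)
    sample = review_sample(lines, labels, k=min(2, len(lines)))
    return {"pre_hoc_accuracy": pre_hoc, "labels": labels, "review_sample": sample}


GOLD = [
    ("They consider him to be a genius", "evaluative"),
    ("We will consider the proposal tomorrow", "other"),
    ("She is considered as an expert", "evaluative"),
]
CORPUS = ["I consider it as settled", "Consider the lilies", "He considered leaving"]

result = run_pipeline(CORPUS, GOLD, stub_annotator)
```

In a real setting, `stub_annotator` would be replaced by a function that sends each prompt to an LLM service and parses the returned label, and the review sample would be compared against human judgements to estimate final accuracy.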