LLMs as annotators of credibility assessment in Danish asylum decisions: evaluating classification performance and errors beyond aggregated metrics
Researchers explored using large language models (LLMs) to annotate credibility assessments in Danish asylum decisions, a novel legal NLP task. They introduced the RAB-Cred dataset, which includes expert annotations and metadata, and used it to evaluate 21 open-weight models across various prompt combinations in zero-shot and few-shot settings. The study found that LLMs show potential for cost-effective labeling, but their annotations are imperfect and inconsistent, which calls for looking beyond the predictions of any single model.
AI IMPACT: Demonstrates the utility of LLMs in specialized legal domains, but highlights the need to validate their outputs carefully.
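To make the idea of looking beyond aggregated metrics and single-model predictions concrete, here is a minimal sketch in Python. It is not the study's evaluation code. The label names (`credible`, `not_credible`), the three model outputs, and the helper functions `per_class_f1` and `majority_vote` are all hypothetical, chosen only for illustration; the actual RAB-Cred label scheme may differ.

```python
from collections import Counter

def per_class_f1(gold, pred):
    """Return {label: F1}, computed separately for every label seen in gold or pred."""
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

def majority_vote(predictions):
    """Combine several models' label lists into (winning label, share of votes) per item.

    Ties are broken in favor of the label that appears first among the votes.
    """
    results = []
    for votes in zip(*predictions):
        label, count = Counter(votes).most_common(1)[0]
        results.append((label, count / len(votes)))
    return results

# Toy data (hypothetical): four decisions, expert labels, and three model outputs.
gold    = ["credible", "not_credible", "not_credible", "credible"]
model_a = ["credible", "not_credible", "credible", "credible"]
model_b = ["credible", "not_credible", "not_credible", "not_credible"]
model_c = ["not_credible", "not_credible", "not_credible", "credible"]

f1_a = per_class_f1(gold, model_a)
voted = majority_vote([model_a, model_b, model_c])

print("Per-class F1 for model_a:", f1_a)
for i, (label, agreement) in enumerate(voted):
    print(f"decision {i}: {label} (agreement {agreement:.2f})")
```

Reporting F1 per label shows whether a model is weaker on one class than another, which a single averaged score can hide. The agreement share per decision flags items where the models disagree and an expert check may be worthwhile.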