A new study evaluated 12 large language models (LLMs) from OpenAI, Google Gemini, and Anthropic, alongside four classical machine learning models, on their effectiveness at screening research papers for systematic literature reviews. The research found significant variability and non-determinism among the LLMs, even with decoding temperature set to zero. Abstract availability was crucial for performance, but adding titles and keywords did not consistently improve results. The study concluded that LLMs did not consistently outperform the traditional models, and that their adoption should be weighed against operational factors such as reproducibility and cost.
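The non-determinism finding is straightforward to probe: repeat an identical screening request at temperature 0 and count distinct outputs. Below is a minimal sketch of that idea, assuming the OpenAI Python SDK and an API key in the environment; the model name, prompt, and INCLUDE/EXCLUDE labels are illustrative placeholders, not the study's actual protocol.

```python
# Sketch (not from the paper): probe LLM non-determinism at temperature 0
# by repeating the same screening prompt and tallying distinct answers.
from collections import Counter

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "You are screening papers for a systematic literature review. "
    "Answer INCLUDE or EXCLUDE for this abstract:\n\n"
    "<abstract text here>"  # placeholder abstract
)

def screen_once() -> str:
    """Run the screening prompt once at temperature 0 and return the label."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0,        # nominally deterministic decoding
    )
    return response.choices[0].message.content.strip()

# Repeat the identical request; any spread across the counts is
# non-determinism despite temperature 0.
labels = Counter(screen_once() for _ in range(10))
print(labels)  # e.g. Counter({'INCLUDE': 8, 'EXCLUDE': 2})
```

If temperature-zero decoding were fully deterministic, the counter would hold a single key; the study reports that in practice it often does not.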
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT: Highlights LLM variability and lack of consistent superiority over traditional models in systematic literature reviews, cautioning against uncritical adoption.
RANK_REASON: Academic paper evaluating LLM performance on a specific task.