Small LLMs for Biomedical Claim Verification: Cost-Effective Fine-Tuning, Structural Dataset Shortcuts, and Cross-Domain Generalization
A new study reports that smaller language models such as Mistral-7B, fine-tuned with QLoRA, can match or exceed larger models such as GPT-4o and GPT-5 on biomedical claim verification. Mistral-7B, at a fraction of the cost and with a fraction of the training data, surpassed GPT-4o by up to 12% in F1 score. The study also identifies a structural artifact in the SciFact dataset that artificially inflates scores, which underscores the importance of structurally sound data for robust cross-domain generalization.
IMPACT: Demonstrates that cost-effective fine-tuning of smaller LLMs can rival frontier models on specialized tasks, potentially lowering barriers to AI adoption in research.