Benchmarking three large language model tasks (GSM8K, HellaSwag, and TruthfulQA) on a single T4 GPU costs approximately $0.12. The analysis finds that generative tasks are the primary cost driver, while log-likelihood tasks can be processed in parallel. Capping tokens at 256, evaluating on a 25% stratified sample, and using MC2 scoring can significantly reduce runtime and cost.
IMPACT Provides a cost breakdown for LLM evaluation, suggesting methods to reduce expenses for researchers and developers.
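As a rough illustration of the 25% stratified sample mentioned in the summary, the sketch below draws the same fraction from each group of examples so the subset keeps the original mix. The source does not say which field the strata are built on or how the sample was drawn; the function name `stratified_sample`, the `category` field, and the toy data are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(examples, key, fraction=0.25, seed=0):
    """Return roughly `fraction` of the examples from every stratum.

    `key` maps an example to its stratum label (hypothetical here:
    a "category" field). At least one example is kept per stratum.
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible

    # Group examples by stratum label, preserving first-seen order.
    strata = defaultdict(list)
    for example in examples:
        strata[key(example)].append(example)

    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


# Toy data (illustrative only): 8 "math" items and 4 "commonsense" items.
toy_examples = (
    [{"id": i, "category": "math"} for i in range(8)]
    + [{"id": 100 + i, "category": "commonsense"} for i in range(4)]
)

subset = stratified_sample(toy_examples, key=lambda ex: ex["category"])
```

With a fixed seed the same subset is drawn on every run, which keeps scores comparable across repeated evaluations of the reduced benchmark.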
RANK_REASON Analysis of computational costs for LLM evaluation benchmarks.
AI-generated summary · Google Gemini · from 1 source.