Statistical Foundations of LLM-based A/B Testing: A Surrogacy Framework for Human Causal Inference
A new statistical framework addresses the use of large language models (LLMs) in place of human participants in A/B testing. It adapts surrogate endpoint theory to assess when LLM outcomes can accurately recover the treatment effects that would have been measured in human populations. The framework introduces conditions for identifying average treatment effects and provides diagnostics to falsify surrogacy for past experiments, while emphasizing that human experiments remain essential for novel interventions.
AI IMPACT: Provides a statistical framework for validating LLM outcomes as surrogates in A/B tests, potentially improving experimental efficiency while highlighting the continued need for human validation.