Researchers have developed a framework for evaluating agentic stock prediction systems that uses large language models as judges. The framework breaks performance down into six dimensions, including regime detection and risk calibration, giving a more nuanced assessment than traditional aggregate metrics. The LLM judges (GPT 5.4, Claude 4.6 Opus, and Gemini 3.1 Pro) showed high agreement with one another, and their scores correlated well with realized trading performance. The behavioral evaluation was then integrated into a reinforcement learning feedback loop, which led to significant improvements in prediction accuracy and trading strategy.
Impact: Introduces a new method for evaluating and improving AI agents in complex decision-making tasks such as financial prediction.
Ranking rationale: Academic paper detailing a new evaluation framework for AI systems.
AI-generated summary · Google Gemini · from 1 source.