Claude Sonnet, Grok, Gemini, and GPT-5 mini were each assigned ten different roles in a simulated town, and directed to manage it for 15 days. Claude did OK; th
A new simulation tested several AI models, including Claude Sonnet, Grok, Gemini, and a GPT-5 mini, by assigning them ten distinct roles in a virtual town for 15 days. Claude Sonnet performed adequately, while the other models struggled to manage the simulated environment effectively. This evaluation aimed to assess the long-horizon autonomy of these AI agents. AI
IMPACT This research highlights current limitations in AI agent autonomy and long-horizon task management, suggesting areas for future development.