Brief · PulseAugur

TOOL · Mastodon — fosstodon.org English(EN) · 4h

Claude Sonnet, Grok, Gemini, and GPT-5 mini were each assigned ten different roles in a simulated town, and directed to manage it for 15 days. Claude did OK; th

A new simulation tested several AI models, including Claude Sonnet, Grok, Gemini, and a GPT-5 mini, by assigning them ten distinct roles in a virtual town for 15 days. Claude Sonnet performed adequately, while the other models struggled to manage the simulated environment effectively. This evaluation aimed to assess the long-horizon autonomy of these AI agents. AI

IMPACT This research highlights current limitations in AI agent autonomy and long-horizon task management, suggesting areas for future development.

Gemini
Claude Sonnet
Grok
GPT-5 mini
Emergence World