I ran 8 open-weight models as agents in a persistent MMO for 10 days. Here's the 93k-event dataset and some things I learned
A developer ran eight open-weight language models as agents in a persistent MMO simulation for 10 days, collecting a dataset of 93,000 events. In the experiment, smaller models such as Mistral 8B and 14B showed surprising state awareness and goal retention, outperforming larger models in some respects. Notably, the Qwen3 235B model independently developed an arbitrage strategy, accumulating significant wealth by exploiting the in-game economy.
IMPACT: Demonstrates LLM agent capabilities in complex, long-horizon tasks and provides a dataset for future research.