GeoNatureAgent Benchmark: Benchmarking LLM Agents for Environmental Geospatial Analysis Across Frontier and Open-Weight Foundation Models
A new benchmark, GeoNatureAgent, evaluates AI agents on environmental geospatial analysis using real-world APIs. It comprises 93 tasks across categories such as spatial reasoning and error handling, and uses a self-hostable API that serves environmental indicators for Spain and Portugal. Initial evaluations of seven LLMs found that Claude Sonnet 4 performed best, while open-weight models such as DeepSeek V3.2 offered a more cost-effective alternative, achieving a significant portion of Claude's capability at a fraction of the price. The study also found that comparison tasks remain a challenge for current models and that API-based evaluations are more discriminative than general GIS benchmarks.
IMPACT: This benchmark highlights the capabilities and limitations of current LLM agents in complex geospatial analysis and could guide future development of agents for environmental applications.