PulseAugur

Qwen 3.5 and Ornith 1.0 models fail as coding agents

A comparison of Qwen 3.5 9B and Ornith 1.0 9B, both run at Q8 quantization on the same 16GB Mac, found that neither model is ready for use as a coding agent. Both failed to clear the easiest tier of agent tasks, and the native tool-calling API performed worse than simple prompting. Both also showed dangerous failure modes on harder tasks, such as hallucinating task completion or entering infinite loops. Qwen 3.5 9B was more prone to outputting prose instead of tool calls, while Ornith 1.0 9B hallucinated completions more frequently.
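The three failure modes named above (prose instead of a tool call, hallucinated completion, and looping) can be made concrete with a small check that an agent harness might run on every model turn. This is only an illustrative sketch, not the harness used in the original comparison: the JSON tool-call shape, the `finish` tool name, the `task_verified` flag, and the repeat threshold are all assumptions introduced here.

```python
import json


def canonical(call: dict) -> str:
    """Serialize a tool call with sorted keys so equal calls compare equal."""
    return json.dumps(call, sort_keys=True)


def classify_turn(reply: str, previous_calls: list[str],
                  task_verified: bool, max_repeats: int = 3) -> str:
    """Label one agent turn as "ok" or as one of three failure modes.

    Assumed (hypothetical) protocol: the model must reply with a JSON object
    such as {"tool": "read_file", "args": {...}}, and signals that it is done
    by calling a tool named "finish".
    """
    # Failure mode 1: the model answered in prose rather than a tool call.
    try:
        call = json.loads(reply)
    except json.JSONDecodeError:
        return "prose_instead_of_tool_call"
    if not isinstance(call, dict) or "tool" not in call:
        return "prose_instead_of_tool_call"

    # Failure mode 2: the model claims completion, but an independent check
    # (tests, file diff, etc.) says the task is not actually done.
    if call["tool"] == "finish":
        return "ok" if task_verified else "hallucinated_completion"

    # Failure mode 3: the same call repeated back to back suggests a loop.
    key = canonical(call)
    repeats = 1
    for earlier in reversed(previous_calls):
        if earlier != key:
            break
        repeats += 1
    if repeats >= max_repeats:
        return "possible_loop"
    return "ok"
```

A harness built this way would append `canonical(call)` to `previous_calls` after each valid turn and stop the run on anything other than `"ok"`. The key design choice is that completion is verified outside the model instead of trusting its own claim.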

IMPACT Highlights limitations in current 9B models for agentic tasks and questions the efficacy of native tool-calling APIs.

RANK_REASON Comparison of two specific LLMs on agent capabilities.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. dev.to (LLM tag) · TIER_1 · English (EN) · Dhanush G

    Qwen 3.5 vs Ornith 1.0 9B Models, Same Hardware, Same Quant as Coding Agents

    I ran Qwen 3.5 9B and Ornith 1.0 9B, both at Q8, on the same 16GB Mac, through the same multi-step agent tests. Neither is agent-ready. But they're not ready in interesting, different ways, and the most surprising result is that the native tool-calling API made both of them w…