A developer demonstrated that a smaller coding model, GPT-5.1-Codex-Mini, can achieve competitive performance on the Terminal-Bench 2.0 benchmark by utilizing an improved "harness" or wrapper. This setup, named Hookele, achieved a score of 61.6% ± 1.9, placing it among larger models like GPT-5.2 and Claude Opus 4.6. The key improvements included a classifier to select relevant skill files for the system prompt and robust handling of tool outputs and context. AI
Summary written by gemini-2.5-flash-lite from 1 source. How we write summaries →
IMPACT Demonstrates that improved system design can significantly boost smaller models, potentially reducing reliance on larger, more expensive ones for specific tasks.
RANK_REASON The article details an experiment and benchmark results for a coding model, focusing on the impact of the surrounding system rather than a new model release. [lever_c_demoted from research: ic=1 ai=1.0]