Terminal Bench 2.0

PulseAugur coverage of Terminal Bench 2.0 — every cluster mentioning Terminal Bench 2.0 across labs, papers, and developer communities, ranked by signal.

Total · 30d

19 over 90d

Releases · 30d

0 over 90d

Papers · 30d

10 over 90d

TIER MIX · 90D

frontier release 1
significant 2
research 6
tool 10

SENTIMENT · 30D

7 days with sentiment data

RECENT · PAGE 1/1 · 19 TOTAL

FRONTIER RELEASE · CL_108496 · Jun 24 · 05:31

Alibaba Qwen unveils AgentWorld language model for environment simulation

Alibaba's Qwen team has introduced Qwen-AgentWorld, a new language world model designed to simulate various agent environments. This model focuses on training LLMs to understand and predict environments, rather than jus…

TOOL · CL_107959 · Jun 24 · 04:00

New LemonHarness framework boosts LLM agent performance on long tasks

Researchers have developed LemonHarness, a new execution framework designed to improve the stability and performance of large language model (LLM) agents working on extended tasks. The framework establishes explicit exe…

TOOL · CL_107146 · Jun 23 · 19:05

Tmax-27B terminal agent released, optimized for consumer GPUs

A new terminal agent model named Tmax-27B has been released, built upon Qwen3.6-27B and trained using DPPO for reinforcement learning. This model achieves competitive scores on agentic benchmarks like Terminal Bench 2.0…

TOOL · CL_105288 · Jun 23 · 07:00

Xiaomi launches MiMo Code with persistent memory, claims Claude Code advantage

Xiaomi has released MiMo Code, an open-source fork of the OpenCode terminal coding agent. This new version introduces a persistent memory system designed to handle long tasks, along with subagent orchestration and intel…

TOOL · CL_93131 · Jun 16 · 04:00

New APEX Framework Enhances AI Agent Self-Improvement

Researchers have introduced APEX, a novel three-layer framework designed to enhance AI agent self-improvement. Unlike previous methods that focused solely on prompt optimization, APEX simultaneously evolves the agent's …

TOOL · CL_106548 · Jun 16 · 00:00

GeneralVLA-2 enhances robot planning with improved 3D reconstruction and memory

Researchers have introduced GeneralVLA-2, an advancement in vision-language-action systems designed for robotic planning. The system incorporates GeoFuse-MV3D to enhance 3D reconstruction accuracy by leveraging geometry…

RESEARCH · CL_96078 · Jun 16 · 00:00

GeneralVLA-2 advances robot planning with improved 3D reconstruction and memory

Researchers have introduced GeneralVLA-2, an advancement in vision-language-action systems designed for robot planning. This system incorporates GeoFuse-MV3D for enhanced 3D reconstruction and an improved KnowledgeBank …

SIGNIFICANT · CL_99036 · Jun 15 · 09:17

Poolside releases Laguna M.1, a 225B MoE model for agentic coding

Poolside has released Laguna M.1, a 225 billion parameter Mixture-of-Experts model optimized for agentic coding tasks. The model features a large sparse MoE architecture with 256 experts and global attention, enabling i…

TOOL · CL_79558 · Jun 8 · 13:50

Self-Harness enables LLM agents to improve their own operational harnesses

Researchers have developed a novel method called Self-Harness, enabling LLM-based agents to autonomously improve their own operational harnesses. This iterative process involves identifying model-specific failure patter…

TOOL · CL_68283 · Jun 3 · 04:00

Research: Interaction trajectories boost AI agent generalization

A new research paper explores the effectiveness of interaction trajectories for training AI agents, finding that standalone performance doesn't dictate teaching efficacy. Surprisingly, agents fine-tuned on trajectories …

TOOL · CL_60204 · May 29 · 19:01

AI coding agents: GPT-5.5, Claude Sonnet 4.6, Gemini 3.5 Flash compared

A recent comparison evaluated three AI coding agents: OpenAI's Codex (powered by GPT-5.5), Anthropic's Claude Code (using Claude Sonnet 4.6), and Google's Antigravity (with Gemini 3.5 Flash). The experiment focused on r…

SIGNIFICANT · CL_56706 · May 28 · 08:20

Alibaba's Qwen3.7-Max debuts with 1M context, autonomous coding

Alibaba has released Qwen3.7-Max, an agent-first LLM with a 1 million token context window, capable of autonomous coding tasks. The model demonstrated a 35-hour coding session without human intervention, optimizing code…

TOOL · CL_35928 · May 17 · 21:00

Local LLMs struggle with real-world terminal tasks despite benchmark success

Local large language models often perform poorly on multi-step terminal tasks despite excelling at standard benchmarks like MMLU. This discrepancy arises because traditional benchmarks measure single-turn reasoning, fai…

TOOL · CL_34986 · May 16 · 21:33

Llama.cpp adds MTP, new Gemma-4 finetune released, Qwen 3.6 excels locally

The llama.cpp project has integrated Multi-Token Prediction (MTP), leading to an 11.5% speed increase for 27B Qwen models in local inference. A new finetuned Gemma-4 model, optimized for creative writing and a…

SIGNIFICANT · CL_26039 · May 11 · 03:44

Qwen 3.6-Plus excels in complex AI agent tasks and coding

Alibaba's Qwen 3.6-Plus model has demonstrated advanced capabilities in complex decision-making and agentic coding tasks, according to a recent evaluation. The model successfully generated a detailed implementation plan…

RESEARCH · CL_07734 · Apr 28 · 16:17

Poolside AI releases open-weight Laguna XS.2 and M.1 coding models

Poolside AI has released two new agentic coding models, Laguna M.1 and Laguna XS.2, along with their agent training and operation runtime. Laguna M.1 is a large Mixture of Experts (MoE) model trained on 30T tokens using…

RESEARCH · CL_47566 · Apr 9 · 13:05

Anthropic's 'Mythos' AI too risky for public release

Anthropic has developed a new AI model named Claude Mythos, which demonstrates significant advancements in benchmark performance, particularly in identifying software vulnerabilities. Due to its advanced capabilities in…

FRONTIER RELEASE · CL_01718 · Nov 18 · 17:49

Google DeepMind launches Gemini 3 Pro with advanced coding and agentic capabilities

Google DeepMind has launched Gemini 3 Pro, its latest and most intelligent model, which demonstrates significant improvements in reasoning and coding capabilities. This new model surpasses previous versions and excels…

RESEARCH · CL_99526 · Apr 15 · 22:38

New research explores LLM agent evaluation and improvement techniques

Researchers are exploring new methods for evaluating and improving Large Language Model (LLM) agents. One paper introduces semantic early-stopping for iterative LLM loops, aiming to reduce token usage by halting when me…
