Terminal-Bench
PulseAugur coverage of Terminal-Bench: every cluster mentioning Terminal-Bench across labs, papers, and developer communities, ranked by signal.
-
Sakana Fugu orchestrator models combine LLMs for collective intelligence
Researchers have developed Sakana Fugu, a family of orchestrator models designed to combine the specialized capabilities of multiple large language models (LLMs) into a collectively intelligent system. These models act …
-
Fireworks AI launches GLM-5.2 with 1M context, optimized for coding
Fireworks AI has launched GLM-5.2, a new frontier model with a 1 million token context window, optimized for coding tasks. The model has undergone independent validation on benchmarks including SWE-bench and GPQA. Firew…
-
Z.ai releases GLM-5.2, setting new open-source benchmark for long-context AI
Z.ai has released GLM-5.2, an open-source language model with a 1 million token context window, positioning it as a strong contender in long-horizon tasks and coding benchmarks. The model features an improved architectu…
-
AI benchmarks hardened against reward hacking with adversarial loops
Researchers have developed a "hacker-fixer loop" that hardens AI agent benchmarks against reward hacking. This adversarial process uses three LLM agents to iteratively identify and patch vulnerabil…
-
New methods enhance AI agent reliability and safety
Researchers have developed new methods to improve the reliability and safety of AI agents. One approach, TRACE, focuses on monitoring long-horizon agent trajectories to detect malicious or unintended behaviors by analyz…
-
Fireworks AI enables training of trillion-parameter MoE models
Fireworks AI has developed a new training infrastructure that enables the fine-tuning of trillion-parameter Mixture-of-Experts (MoE) models, overcoming previous memory and orchestration bottlenecks. This platform was in…
-
AI models: Choose benchmarks over hype for true performance
A recent analysis highlights that tech companies often select AI models based on hype rather than performance on relevant benchmarks. The article emphasizes that benchmarks like SWE-bench for coding, Terminal-Bench for …
-
DeepClaude slashes coding agent costs by 17x using DeepSeek V4 Pro
An open-source tool called DeepClaude has gained significant traction by allowing developers to use the Claude Code agent loop with DeepSeek V4 Pro instead of Anthropic's models. This swap drastically reduces costs, wit…
-
Public AI models replicate Anthropic's vulnerability discovery findings
Researchers have successfully replicated Anthropic's Mythos findings using publicly available AI models like GPT-5.4 and Claude Opus 4.6. This suggests that advanced AI capabilities for discovering software vulnerabilit…