tau-Bench

PulseAugur coverage of tau-Bench — every cluster mentioning tau-Bench across labs, papers, and developer communities, ranked by signal.

Total · 30d

8 over 90d

Releases · 30d

0 over 90d

Papers · 30d

8 over 90d

SENTIMENT · 30D

5 day(s) with sentiment data

RECENT · PAGE 1/1 · 8 TOTAL

TOOL · CL_122959 · Jul 3 · 04:00

New architecture routes customer service AI based on task difficulty

Researchers have introduced a difficulty-routed service-control architecture designed to manage autonomous customer-service agents. This system aims to maintain efficiency for routine tasks while implementing enhanced s…

RESEARCH · CL_107868 · Jun 22 · 20:57

AI retrieval metrics may mislead in evaluating agent policy utility

Researchers have identified a potential flaw in how retrieval metrics are used to evaluate AI agents. The study, focusing on long-horizon tool-use agents, found that exact-match retrieval recall may underestimate the ac…

TOOL · CL_105539 · Jun 19 · 00:00

New dataset 'Counsel' aims to improve AI agent evaluation

Researchers have introduced Counsel, a new dataset designed to improve the evaluation of AI agents. This dataset contains human meta-evaluations of critiques generated by large language models (LLMs) for agentic tasks. …

RESEARCH · CL_93186 · Jun 15 · 17:38

New paper proposes Bayesian audits for AI evaluation archives

A new paper proposes a Bayesian inference framework to audit public archives of frontier AI evaluations. The research highlights how selective reporting and benchmark revisions can distort the perception of AI progress,…

RESEARCH · CL_90852 · Jun 12 · 07:31

New foundation models aim to simulate human behavior at scale

Researchers have introduced OdysSim, a new framework for developing foundation models designed to simulate human behavior. This initiative includes a large corpus of 21.4 million interactions and a benchmark called SOUL…

TOOL · CL_20510 · May 7 · 04:00

New research argues AI alignment can't be judged by model-level tests alone

A new paper argues that evaluating AI alignment solely at the model level is insufficient for understanding its real-world deployment. The research highlights that current benchmarks lack user-facing verification and pr…

RESEARCH · CL_06668 · Apr 28 · 04:00

AgentEval framework improves AI agent workflow evaluation with DAG-based error tracking

Researchers have developed AgentEval, a new framework for evaluating agentic workflows by representing them as directed acyclic graphs (DAGs). This approach allows for detailed step-level assessment and tracking of erro…

RESEARCH · CL_02985 · Apr 23 · 03:48

New metrics quantify LLM agent behavioral similarity and convergence

A new paper introduces two metrics, Response Pattern Similarity (RPS) and Action Graph Similarity (AGS), to quantify how similar the tool-use behaviors of different AI agents are. These metrics aim to distinguish betwee…
