Brief

last 24h

[4/4] 221 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.

TOOL · LessWrong (AI tag) English(EN) · 10h

Improving Petri scheming audits with environment blueprints

Researchers have developed a pipeline that generates environment blueprints to make AI safety audits more realistic and consistent. They tested it with the Petri auditor, evaluating Gemini 3.1 Pro Preview for code sabotage. The blueprint-enhanced audits were more realistic and consistent than baseline audits, and no egregious scheming behavior was detected in 160 trials.

IMPACT Enhances the realism and consistency of AI safety audits, potentially leading to more reliable evaluations of model behavior.
TOOL · dev.to — LLM tag English(EN) · 5d

Which LLM is the best stock picker? I built a benchmark to find out.

A new benchmark, dubbed 1rok, has been launched to evaluate the stock-picking capabilities of frontier large language models. It assigns each participating LLM a virtual portfolio of $100,000 and tasks it with selecting stocks weekly, with performance tracked against market outcomes. The initiative aims to provide a more practical, downstream evaluation of LLMs than traditional coding and reasoning benchmarks, focusing on decision-making under uncertainty.

IMPACT Provides a novel benchmark for evaluating LLM decision-making under uncertainty, moving beyond traditional coding and reasoning tasks.
- xAI
- Kimi K2.6
- Gemini 3.1 Pro Preview
- Grok 4.3
- MiniMax M2.7
- 1rok
- Google
- GPT-5.5
- GLM-5.1
- OpenAI
- DeepSeek V4 Pro
- Moonshot
TOOL · r/OpenAI English(EN) · 23h

GPT-5.5 tops the benchmarks but sits at #22 for actual usage - I built a live index that tracks both (open source)

A new open-source index called AgentTape ranks AI models on a blend of benchmark performance, actual usage, cost, and speed. OpenAI's GPT-5 models currently dominate the top rankings; GPT-5.5 excels in quality benchmarks but lags in adoption because of its newness and price. The index aims to give a more holistic view of model performance than benchmarks alone, reflecting real-world utility.

IMPACT Provides a new metric for evaluating AI models that balances benchmarks with real-world adoption and cost.
- OpenAI
- xAI
- GPT-5.5
- Gemini 3.1 Pro Preview
- GPT-5
- Grok 4.20
- AgentTape
SIGNIFICANT · arXiv cs.CL English(EN) · 20mo · [280 sources]

Asking For An Old Friend: Diagnosing and Mitigating Temporal Failure Modes in LLM-based Statutory Question Answering

Researchers have developed a benchmark that tests how well large language models handle temporal changes in legal statutes, identifying failure modes such as outdated information and recency bias. Separately, model labs are increasingly focusing on building agent-based products rather than only foundational models. AI21 and DeepSeek exemplify this pivot, and DeepSeek's aggressive pricing for its V4-Pro model makes advanced AI more accessible.

IMPACT The industry's focus is shifting from foundational models to agent-based products, with aggressive pricing making advanced AI more accessible and competitive.
- Nick Joseph
- Anthropic
- OpenAI
- Tesla
- Claude
- Andrej Karpathy
- Devin
- AI21
- Google
- Gemini
- Codex
- DeepSeek
- Cursor
- Qwen
- Alibaba
- LangSmith
- Qwen3.7 Preview
- GPT-5.5
- Gemini 3.1 Pro Preview
- Claude Opus 4.7
- DeepSeek-V4-Pro
- Gemini Flash
- Cursor Composer 2.5
