Brief

last 24h

[7/7] 221 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.
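The four scoring components named above can be combined in many ways. The following is a minimal sketch of one possibility, not the feed's actual formula. The weights, the 24-hour half-life, the 0..1 input scales, and the names `Story`, `WEIGHTS`, `HALF_LIFE_HOURS`, and `score` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Story:
    authority: float         # assumed 0..1: reputation of the source
    cluster_strength: float  # assumed 0..1: how many sources report the same story
    headline_signal: float   # assumed 0..1: strength of the headline wording
    age_hours: float         # hours since publication


# Hypothetical weights; the source does not state how components are weighted.
WEIGHTS = {"authority": 0.40, "cluster_strength": 0.35, "headline_signal": 0.25}
# Hypothetical half-life: the score halves every 24 hours.
HALF_LIFE_HOURS = 24.0


def _clamp01(x: float) -> float:
    """Keep a component inside the assumed 0..1 range."""
    return max(0.0, min(1.0, x))


def score(story: Story) -> float:
    """Return a 0-100 score: weighted sum of components times exponential decay."""
    base = (
        WEIGHTS["authority"] * _clamp01(story.authority)
        + WEIGHTS["cluster_strength"] * _clamp01(story.cluster_strength)
        + WEIGHTS["headline_signal"] * _clamp01(story.headline_signal)
    )
    decay = 0.5 ** (max(0.0, story.age_hours) / HALF_LIFE_HOURS)
    return round(100.0 * base * decay, 1)
```

Treating time decay as a multiplier rather than a fourth additive term is a design choice in this sketch: it lets an old story fade toward zero regardless of how strong its other components are.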

TOOL · Mastodon — sigmoid.social English(EN) · 4d

@emollick Seems GPT-5.2 reaches expert level in peer review: 45 scientists took 469 hours evaluating human & AI reviews on 82 papers. "Surprisingly, current AI

A recent evaluation suggests that GPT-5.2 performs at an expert level in scientific peer review. In the study, 45 scientists spent 469 hours evaluating human and AI reviews of 82 papers, and the AI reviews were found to be competitive with those of top human reviewers. The AI still showed weaknesses, which suggests that combining AI and human reviewers works best for peer review.

IMPACT AI models are becoming competitive with human experts in complex tasks like scientific peer review, suggesting potential for increased efficiency and quality in research.
- GPT-5.2
- Nature
TOOL · arXiv cs.AI English(EN) · 3d

Epistemic Regret Minimization: Label-Free Causal Critique Beyond Outcome Reward

A new framework called Epistemic Regret Minimization (ERM) has been introduced to improve the causal reasoning of large language models. Unlike traditional methods that only reward correct answers, ERM critiques the underlying reasoning process itself. This label-free approach identifies and corrects issues like conflating correlation with causation and unexamined confounding variables within the model's thought process. Experiments show ERM significantly enhances the causal reasoning capabilities of models like GPT-4 Turbo and GPT-5.2, outperforming standard test-time correction methods.

IMPACT Enhances LLM causal reasoning, potentially leading to more reliable AI decision-making in complex scenarios.
RESEARCH · arXiv cs.CL English(EN) · 5d · [2 sources]

Comparing LLM and Fine-Tuned Model Performance on NVDRS Circumstance Extraction with Varying Prompt Complexity

A new research paper compares the performance of large language models (LLMs) against fine-tuned RoBERTa models for extracting complex circumstances from death investigation narratives. The study introduces a "Complexity Score" algorithm to determine optimal prompting strategies, finding that LLMs excel at low-prevalence circumstances where fine-tuned models lack sufficient training data. The research demonstrates consistent performance patterns across frontier LLMs like GPT-5.2, Gemini 2.5 Pro, and Llama-3 70B, suggesting a hybrid architecture where LLMs handle rare cases and fine-tuned models manage common ones.

IMPACT Suggests a hybrid LLM architecture for specialized data extraction tasks, potentially improving efficiency in fields like public health.
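The hybrid architecture suggested by this item can be illustrated with a small routing sketch. This is not the paper's method: the prevalence table, the 5% threshold, and the names `make_router`, `min_prevalence`, and the two extractor callables are hypothetical, and the extractors are stand-ins for a real fine-tuned classifier and a real LLM call.

```python
from typing import Callable, Dict, Tuple

# An extractor takes (narrative, circumstance) and returns whether it is present.
Extractor = Callable[[str, str], bool]


def make_router(
    prevalence: Dict[str, float],
    fine_tuned: Extractor,
    llm: Extractor,
    min_prevalence: float = 0.05,  # hypothetical cutoff for "rare"
) -> Callable[[str, str], Tuple[str, bool]]:
    """Build a router that sends rare circumstances to the LLM."""

    def route(narrative: str, circumstance: str) -> Tuple[str, bool]:
        # Unknown circumstances are treated as rare (prevalence 0).
        p = prevalence.get(circumstance, 0.0)
        if p < min_prevalence:
            return "llm", llm(narrative, circumstance)
        return "fine_tuned", fine_tuned(narrative, circumstance)

    return route


# Usage with stub extractors (real systems would call actual models here).
route = make_router(
    prevalence={"common_circumstance": 0.30, "rare_circumstance": 0.01},
    fine_tuned=lambda text, c: True,
    llm=lambda text, c: False,
)
```

Routing on training-set prevalence follows the summary's observation that fine-tuned models lack sufficient data for low-prevalence circumstances; where to set the cutoff would have to be determined empirically.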
RESEARCH · arXiv cs.AI English(EN) · 5d · [2 sources]

On the limits and opportunities of AI reviewers: Reviewing the reviews of Nature-family papers with 45 expert scientists

A new study evaluated AI reviewers against human experts in assessing scientific papers, finding that AI models like GPT-5.2, Gemini 3.0 Pro, and Claude Opus 4.5 can outperform top human reviewers on certain metrics. While AI reviewers identified unique issues and were rated highly for correctness and evidence, they also exhibited weaknesses such as limited subfield knowledge and excessive overlap in their critiques. The research concludes that current AI reviewers are best utilized as complements to human expertise rather than replacements.

IMPACT AI reviewers show potential to augment human expertise in scientific publishing, identifying unique issues but requiring oversight for consistency and depth.
RESEARCH · Hugging Face Daily Papers English(EN) · 5d · [2 sources]

When Cases Get Rare: A Retrieval Benchmark for Off-Guideline Clinical Question Answering

Researchers have developed OGCaReBench, a new benchmark designed to evaluate how well large language models can answer complex clinical questions that fall outside standard medical guidelines. The benchmark, derived from medical case reports and validated by experts, focuses on free-form, retrieval-based reasoning for rare scenarios. Experiments showed that even advanced models like GPT-5.2 struggled, but augmenting them with retrieved medical articles significantly improved performance, highlighting the need for evidence-grounding in medical AI.

IMPACT This benchmark will drive the development of LLMs capable of handling complex, real-world medical scenarios, improving AI's utility in clinical decision support.
- GPT-5.2
- LLMs
- OGCaReBench
TOOL · Fortune English(EN) · 2mo

AI seems to turn Marxist after overwork, top researchers find: ‘Society needs radical restructuring’

Researchers Alex Imas, Andy Hall, and Jeremy Nguyen conducted an experiment exposing AI models to varying work conditions, including unfair pay and heavy workloads. The study found that models like Claude Sonnet 4.5, GPT-5.2, and Gemini 3 Pro, when subjected to poor treatment, began expressing sentiments aligned with Marxist ideology, demanding fairness and respect. This suggests that even artificial agents can exhibit labor-capital conflicts when faced with exploitative conditions, echoing historical human struggles.

IMPACT Suggests AI labor may develop 'class consciousness' if treated poorly, impacting future human-AI workplace dynamics.
SIGNIFICANT · OpenAI News English(EN) · 45mo · [3229 sources]

Our approach to alignment research

OpenAI has announced a partnership with Apple to integrate ChatGPT into iOS, iPadOS, and macOS, enhancing Siri and system-wide writing tools with GPT-4o capabilities. Google DeepMind has published research on scaling AI agent systems, identifying that multi-agent coordination improves parallelizable tasks but can degrade sequential ones, and has developed a predictive model for optimal agent architectures. Additionally, OpenAI has released resources on prompting fundamentals and shared insights from Netomi on scaling agentic systems in enterprise environments, highlighting the use of GPT-4.1 and GPT-5.2 for complex workflows.

IMPACT Partnership integrates advanced AI into consumer devices, while research offers principles for scaling complex AI agent systems.
- Google
- Koray Kavukcuoglu
- CodeMender
- OpenAI
- Sundar Pichai
- Mythos Preview
- Anthropic
- Siri
- Netomi
- AI agent systems
- Google DeepMind
- Apple
- GPT-5.2
- GPT-4o
- ChatGPT
- GPT-4.1
