Brief

last 24h

[3/3] 221 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.

TOOL · arXiv cs.CL · English (EN) · 4d

Intelligence per Watt: Measuring Intelligence Efficiency of Local AI

A new research paper introduces "intelligence per watt" (IPW) as a metric for evaluating the efficiency of local AI models. The study found that local models can accurately answer 88.7% of real-world queries and that IPW improved 5.3x from 2023 to 2025. Local accelerators showed at least 1.4x lower IPW than cloud-based solutions. Overall, the results suggest that local inference can offload significant demand from centralized infrastructure.
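As a rough illustration of how a ratio such as the 5.3x IPW improvement could be computed, the sketch below assumes IPW is task accuracy divided by average power draw in watts. The summary does not spell out the formula, so that definition, the function names, and all input numbers are assumptions made for illustration, not figures from the paper.

```python
def intelligence_per_watt(accuracy: float, avg_power_watts: float) -> float:
    """Accuracy delivered per watt of average power draw (assumed definition)."""
    if avg_power_watts <= 0:
        raise ValueError("avg_power_watts must be positive")
    return accuracy / avg_power_watts


def improvement_factor(ipw_old: float, ipw_new: float) -> float:
    """How many times higher the newer IPW is than the older one."""
    if ipw_old <= 0:
        raise ValueError("ipw_old must be positive")
    return ipw_new / ipw_old


# Hypothetical measurements, chosen only to show the arithmetic.
ipw_2023 = intelligence_per_watt(accuracy=0.50, avg_power_watts=100.0)
ipw_2025 = intelligence_per_watt(accuracy=0.80, avg_power_watts=40.0)
print(f"improvement: {improvement_factor(ipw_2023, ipw_2025):.1f}x")
```

Under this assumed definition, IPW rises when accuracy goes up, when power draw goes down, or both, so a reported multi-year improvement can reflect gains in models and in hardware together.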

IMPACT Introduces a new metric to track the viability and efficiency of local AI inference, potentially shifting demand from cloud infrastructure.

RESEARCH · Hugging Face Daily Papers · English (EN) · 5d · [2 sources]

Forecasting Scientific Progress with Artificial Intelligence

A new benchmark called CUSP has been developed to evaluate AI's ability to forecast scientific progress. The study found that current frontier AI models can identify plausible research directions but struggle to predict whether and when scientific advances will be realized. Performance varies significantly across scientific domains: progress in AI is more predictable than advances in biology, chemistry, and physics. Models also exhibit overconfidence in their predictions.

IMPACT Current AI systems are not yet reliable for predicting scientific breakthroughs or their timelines, indicating a need for further development in forecasting capabilities.

RESEARCH · arXiv cs.AI · English (EN) · 6d · [11 sources]

Measuring Security Without Fooling Ourselves: Why Benchmarking Agents Is Hard

Researchers are developing new benchmarks to address the safety risks of AI agents, particularly in multi-agent and interactive environments. GT-HarmBench evaluates frontier models in game-theoretic scenarios, revealing significant failures in high-stakes situations. Boiling the Frog and AgentThreatBench focus on incremental attacks and indirect prompt injections that traditional benchmarks miss, assessing both task utility and security. These efforts aim to create more robust evaluations for AI systems operating beyond simple text generation.

IMPACT These new benchmarks are crucial for ensuring the safe deployment of increasingly capable AI agents in real-world, multi-agent scenarios.
