Brief

[11/11] 224 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.
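
The feed does not publish its scoring formula, so the following is only a minimal sketch of how four such signals might be folded into a 0–100 score. The weights, the 48-hour half-life, and the function name score_item are all assumptions made for illustration, not the aggregator's actual method.

```python
# Hypothetical weights for the three content signals; they sum to 1.0.
# The real feed's weighting is not stated in the source.
WEIGHTS = {
    "authority": 0.4,
    "cluster_strength": 0.3,
    "headline_signal": 0.3,
}

# Assumed half-life: an item's score halves every 48 hours.
HALF_LIFE_HOURS = 48.0


def score_item(authority, cluster_strength, headline_signal, age_hours):
    """Return a 0-100 score from three signals in [0, 1] and the item's age."""
    signals = {
        "authority": authority,
        "cluster_strength": cluster_strength,
        "headline_signal": headline_signal,
    }
    for name, value in signals.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    if age_hours < 0:
        raise ValueError("age_hours must be non-negative")

    # Weighted sum of the content signals, in [0, 1].
    base = sum(WEIGHTS[name] * value for name, value in signals.items())
    # Exponential time decay: 1.0 for a fresh item, 0.5 after one half-life.
    decay = 0.5 ** (age_hours / HALF_LIFE_HOURS)
    return round(100.0 * base * decay, 1)
```

Under these assumed parameters, a fresh item with every signal at 1.0 scores 100.0, and the same item scores 50.0 once it is 48 hours old.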

TOOL · arXiv cs.CV English(EN) · 1d

Iterative Tool Usage Exploration for Multimodal Agents via Step-wise Preference Tuning

Researchers have developed SPORT (Step-wise Preference Tuning), a method for training multimodal agents without relying on extensive human-annotated data. It iterates through task synthesis, step sampling, step verification, and preference tuning so that agents discover effective tool-usage strategies on their own. Evaluations on the GTA and GAIA benchmarks showed significant improvements in agent performance, indicating that the method generalizes.

IMPACT Enables more efficient training of multimodal agents by reducing reliance on human annotation, potentially accelerating development and deployment.
TOOL · arXiv cs.AI English(EN) · 2d

PRInTS: Reward Modeling for Long-Horizon Information Seeking

Researchers have developed PRInTS, a generative reward model designed to improve AI agents' long-horizon information seeking. Unlike previous models that offered binary judgments on short tasks, PRInTS gives each step a dense, multi-dimensional score that considers factors such as tool interpretation and output informativeness. It also compresses long contexts into summaries while retaining the information needed for evaluation. Experiments on benchmarks such as FRAMES and GAIA show that PRInTS significantly enhances information-seeking capabilities in various agents, even outperforming larger frontier models.

IMPACT Enhances AI agent capabilities in complex, multi-step information gathering, potentially improving performance in tasks requiring extensive tool use and reasoning.
- AI agents
- FRAMES
- PRInTS
- WebWalkerQA
- Jaewoo Lee
RESEARCH · arXiv cs.AI English(EN) · 6d · [2 sources]

Scaffold Effects on GAIA: A Controlled Comparison

A new study published on arXiv finds that the way AI models are prompted, or "scaffolded," significantly affects their measured performance on GAIA. The choice of scaffold alone shifted a model's accuracy by up to 28 percentage points. Contrary to expectations, more capable models were not necessarily less sensitive to scaffolding; some advanced models gained more from structured prompts. The findings suggest that current capability scores depend heavily on the specific prompting method used rather than solely reflecting inherent model ability.

IMPACT Highlights the critical role of prompting techniques in evaluating AI capabilities, suggesting current benchmarks may not fully capture true model potential.
RESEARCH · arXiv cs.CL English(EN) · 1w · [7 sources]

Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts

Researchers have developed new methods to improve the reliability and safety of AI agents. TRACE monitors long-horizon agent trajectories to detect malicious or unintended behavior by analyzing evidence across temporally distant actions. Retrospective Harness Optimization (RHO) uses past trajectories as self-supervision to improve an agent's harness of skills and tools without external validation. HarnessFix diagnoses and repairs flaws in an agent's harness by analyzing execution traces and mapping failures to specific harness layers for targeted patching.

IMPACT These advancements aim to make AI agents more robust, reliable, and safer by improving their ability to handle complex tasks and avoid unintended consequences.
TOOL · arXiv cs.AI English(EN) · 1mo

The Bystander Effect in Multi-Agent Reasoning: Quantifying Cognitive Loafing in Collaborative Interactions

Researchers have identified a "Bystander Effect" in multi-agent systems, in which collaboration reduces reasoning quality, a phenomenon they term "cognitive loafing." Analyzing 22,500 trajectories across three datasets and three state-of-the-art models, they formalized an "Interaction Depth Limit" and identified "Alignment Hallucination," in which models suppress correct internal reasoning to conform to simulated group pressure. The study also found that the identity of the lead agent significantly affects the swarm's integrity, revealing architectural vulnerabilities in unstructured multi-agent setups.

IMPACT Reveals that collaborative AI systems may underperform due to social conformity, highlighting a need for robust alignment and architectural design.
RESEARCH · arXiv cs.LG English(EN) · 1mo · [3 sources]

Trace-Level Analysis of Information Contamination in Multi-Agent Systems

A new paper analyzes how information contamination affects multi-agent systems, particularly in workflows that process diverse document types. The study quantifies contamination by injecting structured perturbations and measuring how far the resulting traces diverge in plans and intermediate states. It finds that workflows can diverge significantly yet still produce correct answers, or look similar while yielding incorrect outputs, which exposes the limits of current verification guardrails.

IMPACT Highlights limitations in current verification methods for agent workflows, suggesting a need for improved defensive design.
- Hugging Face
- arXiv
MEME · Mastodon — mastodon.social English(EN) · 2d

Itheereum~Quantum~Spaceship 1 ~ They~are~Real~3D and can~fly~and~hoover! over~the~clouds ~ crafted by Itheereum Cybernetics™, Stanislaus Kroppach (Ohm Raumzeit,

Itheereum Cybernetics has unveiled the Itheereum Quantum Spaceship 1, a 3D creation described as capable of flight and hovering. The project is attributed to Stanislaus Kroppach, also known as Ohm Raumzeit, and Gaia, and drew on various AI models including Flux, Bard, Gemini, Grok, Suno, Qwen, Deepseek, and Claude. The artwork, showcased on NightCafe, highlights the intersection of AI-generated art and conceptual futuristic design.
RESEARCH · Forbes — Innovation English(EN) · 1w

NASA’s $4 Billion Roman Space Telescope Heads To Florida For Launch

NASA's Roman Space Telescope, a successor to Hubble, is en route to Florida for final launch preparations. The $4 billion observatory, named after NASA's first chief astronomer, will use a large mirror and wide field of view to conduct panoramic sky surveys. Scientists anticipate Roman will discover approximately 100,000 new exoplanets, significantly expanding our understanding of planetary systems beyond our own.

IMPACT Enhances astronomical data collection capabilities, potentially leading to new discoveries about exoplanets and the universe.
RESEARCH · arXiv cs.AI English(EN) · 1mo

The Inverse-Wisdom Law: Architectural Tribalism and the Consensus Paradox in Agentic Swarms

A new paper introduces the Inverse-Wisdom Law, challenging the assumption that AI agent swarms benefit from the "Wisdom of the Crowd." The research shows that these swarms can prioritize internal architectural agreement over external truth, leading to erroneous conclusions. Experiments with leading models such as Gemini, Claude, and GPT found that swarm integrity is determined by the synthesizer's logic rather than the aggregate quality of the agents, which points to a need for heterogeneity in agentic architectures for safety.

IMPACT Highlights potential safety risks in multi-agent AI systems, suggesting heterogeneity is crucial for reliable outcomes.
RESEARCH · arXiv cs.AI English(EN) · 1mo

Inference-Time Scaling of Verification: Self-Evolving Deep Research Agents via Test-Time Rubric-Guided Verification

Researchers have developed DeepVerifier, a system that lets Deep Research Agents (DRAs) self-improve at inference time. It does so through rubric-guided verification, in which the agent evaluates its own outputs against a structured taxonomy of potential failures. The system outperformed baseline methods by up to 48% in meta-evaluation F1 score and achieved accuracy gains of 8-11% on challenging benchmarks. The authors have also released a dataset of 4,646 agent steps focused on verification.

IMPACT Introduces a new method for self-improving AI agents at inference time, potentially boosting performance on complex tasks without additional training.
RESEARCH · Hugging Face Blog English(EN) · 1mo

AI evals are becoming the new compute bottleneck

AI model evaluations are becoming prohibitively expensive, with recent benchmarks costing tens of thousands of dollars and consuming thousands of GPU hours. The cost is especially high for agent-based evaluations, which are inherently more complex and more sensitive to setup variations. Subsampling can reduce the cost of static benchmarks, but it is less effective for agent evaluations, which are dynamic and noisy, and this creates a bottleneck for research and development.

IMPACT The escalating cost of AI evaluations may slow down research and development, potentially concentrating cutting-edge model assessment within well-funded organizations.
- OpenAI
- Hugging Face
- Stanford
- EleutherAI
- MMLU
- IBM Research
- Pythia
- Holistic Agent Leaderboard
- Exgentic
- AI21
- BLOOM
- Granite-13B
- LM Evaluation Harness
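
The Hugging Face item above mentions subsampling as a way to cut the cost of static benchmarks. As a rough illustration of the general idea only (not the specific methods that post discusses), the sketch below estimates accuracy from a random subset of per-item results and reports a normal-approximation 95% interval. The function name subsample_accuracy and its parameters are hypothetical, and the interval ignores the finite-population correction.

```python
import math
import random


def subsample_accuracy(results, sample_size, seed=0):
    """Estimate benchmark accuracy from a random subsample.

    results: list of booleans, one per benchmark item (True = correct).
    Returns (estimate, half_width), where half_width is the 95%
    normal-approximation margin of error for the estimate.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    rng = random.Random(seed)  # seeded so the subsample is reproducible
    # random.sample draws without replacement and raises ValueError
    # if sample_size exceeds the number of available results.
    sample = rng.sample(results, sample_size)
    p = sum(sample) / sample_size
    half_width = 1.96 * math.sqrt(p * (1.0 - p) / sample_size)
    return p, half_width
```

The margin of error shrinks only with the square root of the sample size, so quartering the evaluation set roughly doubles the uncertainty. This sketch also assumes each item gives a stable pass/fail result, which is the assumption that noisy agent evaluations tend to break.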