Brief

last 24h

[4/4] 221 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.
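As a rough illustration of how four such signals could combine into a single 0–100 score, here is a minimal sketch. The weights, the 24-hour half-life, the multiplicative decay, and every name in it (`Story`, `WEIGHTS`, `HALF_LIFE_HOURS`, `score`) are assumptions made for the example; the feed does not publish its actual formula.

```python
from dataclasses import dataclass


@dataclass
class Story:
    authority: float         # 0..1, assumed source-reputation signal
    cluster_strength: float  # 0..1, assumed measure of how many sources cover the story
    headline_signal: float   # 0..1, assumed headline-quality signal
    age_hours: float         # hours since the story first appeared


# Hypothetical weights; they sum to 1 so the undecayed score tops out at 100.
WEIGHTS = {"authority": 0.4, "cluster_strength": 0.35, "headline_signal": 0.25}
HALF_LIFE_HOURS = 24.0  # assumed: a story loses half its score every 24 hours


def score(story: Story) -> float:
    """Weighted sum of the three quality signals, scaled by exponential time decay."""
    base = (
        WEIGHTS["authority"] * story.authority
        + WEIGHTS["cluster_strength"] * story.cluster_strength
        + WEIGHTS["headline_signal"] * story.headline_signal
    )
    decay = 0.5 ** (story.age_hours / HALF_LIFE_HOURS)
    return round(100 * base * decay, 1)
```

Under these assumptions a story with perfect signals scores 100 when fresh and 50 after one half-life.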

TOOL · arXiv cs.AI English(EN) · 3h

SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills

Researchers have introduced SkillEvolBench, a benchmark that evaluates how well large language model agents turn episodic experiences into reusable procedural skills. It comprises 180 tasks across six environments, grouped into task families that share underlying procedures. Initial tests across several agent configurations found that current agents struggle to form robust, reusable skills. They often perform better when reusing raw trajectories than when using distilled skills, which suggests that current abstraction methods may discard useful contextual information.

IMPACT This benchmark could drive progress in developing LLM agents that can generalize knowledge and form reusable skills, moving beyond task-specific memory.
- large language model agents
- SkillEvolBench

RESEARCH · arXiv cs.AI English(EN) · 3d · [2 sources]

MemAudit: Post-hoc Auditing of Poisoned Agent Memory via Causal Attribution and Structural Anomaly Detection

Researchers have developed MemAudit, a framework for identifying and auditing malicious records in the memory of large language model agents. The post-hoc auditing system targets a security vulnerability in which adversarial users inject harmful records into an agent's memory and thereby steer its actions. MemAudit uses causal attribution and structural anomaly detection to pinpoint the specific memories responsible for undesirable outputs, and it is reported to significantly reduce attack success rates in testing scenarios.

IMPACT Provides a method to detect and mitigate security risks in LLM agents by auditing their memory stores.

TOOL · arXiv cs.LG English(EN) · 4d

Harnesses for Inference-Time Alignment over Execution Trajectories

Researchers have developed a framework called "harnesses" to improve the performance of large language model agents at inference time. The approach aligns execution trajectories by separating harness functions into task decomposition and guided execution. The study shows how factors such as workflow granularity and retry budgets affect success rates, and it identifies failure modes such as over-decomposition and hallucinated execution. The findings suggest that partial harnesses, which specify only the initial steps, can outperform fully structured workflows.

IMPACT Introduces a novel method for enhancing LLM agent reliability and performance through structured execution guidance.

RESEARCH · arXiv cs.AI English(EN) · 4d · [2 sources]

DeferMem: Query-Time Evidence Distillation via Reinforcement Learning for Long-Term Memory QA

Researchers have developed DeferMem, a framework for improving question answering by large language model agents over long-term conversational memory. The system splits the process into two stages: broad candidate retrieval, followed by query-conditioned evidence distillation. DeferMem uses a reinforcement learning algorithm called DistillPO to refine the retrieved information into concise, relevant evidence, and it is reported to outperform existing methods in accuracy and efficiency.

IMPACT Improves LLM agent performance in complex, long-context question answering tasks.
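
The two-stage shape described in the DeferMem summary (retrieve broadly, then distill with the query in hand) can be sketched roughly as follows. This is not DeferMem's implementation: the token-overlap ranking and the sentence filter are simple stand-ins for its retriever and its learned DistillPO policy, and all function names, parameters, and sample data are hypothetical.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_candidates(query: str, memory: list[str], k: int = 3) -> list[str]:
    """Stage 1 (broad recall): rank memory entries by token overlap with the query."""
    q = tokens(query)
    ranked = sorted(memory, key=lambda entry: len(q & tokens(entry)), reverse=True)
    return ranked[:k]


def distill_evidence(query: str, candidates: list[str], min_overlap: int = 2) -> list[str]:
    """Stage 2 (query-conditioned): keep only sentences that overlap the query enough.

    A heuristic placeholder for a learned distillation policy.
    """
    q = tokens(query)
    evidence = []
    for entry in candidates:
        for sentence in re.split(r"(?<=[.!?])\s+", entry):
            if len(q & tokens(sentence)) >= min_overlap:
                evidence.append(sentence.strip())
    return evidence


# Hypothetical conversational memory, for illustration only.
memory = [
    "Alice moved to Berlin in March. She likes rainy weather.",
    "Bob adopted a cat named Miso.",
    "Alice started a new job at a bakery. The bakery is in Berlin.",
]
query = "When did Alice move to Berlin?"
candidates = retrieve_candidates(query, memory, k=2)
evidence = distill_evidence(query, candidates)
```

The design point the sketch tries to show is that the first stage can afford to over-retrieve, because the second stage trims the candidates down to the few sentences relevant to the specific question.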
