AI Agents Advance with New Models, Memory, and Training Techniques
By PulseAugur Editorial · [405 sources]
Multiple research papers released on arXiv explore advancements in AI agents, focusing on improving their reasoning, memory, and training efficiency. Qwen3.6-35B-A3B, an open-source sparse MoE model, demonstrates strong agentic coding capabilities. Other studies introduce methods for better skill presentation, long-context reasoning through RL, skill reuse as compression, and adaptive context management for agents tackling complex, long-horizon tasks. Additionally, research presents AutoSci, a system for automating the scientific research lifecycle, and PithTrain, a compact training framework for MoE models designed for agent-native development.
AI
Impact: Advances in agent capabilities, memory management, and training efficiency could accelerate the development of more sophisticated AI systems.
Rank reason: Multiple arXiv papers released on diverse AI agent research topics.
Following the launch of Qwen3.6-Plus, we are excited to open-source Qwen3.6-35B-A3B — a sparse yet remarkably capable mixture-of-experts (MoE) model with 35 billion total parameters and only 3 billion active parameters. Despite its efficiency, Qwen3.6-35B-A3B delivers outstanding…
arXiv:2606.13115v1 Announce Type: cross Abstract: While Large Language Models (LLMs) have advanced open-domain dialogue systems, maintaining long-term consistency remains a challenge due to inherent limitations in long-context reasoning and the inefficiency of processing extensiv…
arXiv:2606.12563v1 Announce Type: new Abstract: Arbor is a multi-agent framework that introduces structured tree search as a cognition layer for autonomous agents operating in large, stateful action spaces. Prior autonomous optimization systems operate on isolated targets with st…
arXiv:2606.12945v1 Announce Type: new Abstract: Long-running LLM agents accumulate interaction histories far larger than any context window, forcing a standing decision: what to encode deeply, what to forget, and what to retrieve under a fixed memory budget. Production systems an…
arXiv cs.CL
TIER_1English(EN)·Jundong Xu, Qingchuan Li, Jiaying Wu, Yihuai Lan, Shuyue Stella Li, Huichi Zhou, Bowen Jiang, Lei Wang, Jun Wang, Anh Tuan Luu, Caiming Xiong, Hae Won Park, Bryan Hooi, Zhiyuan Hu·
arXiv:2606.13681v1 Announce Type: new Abstract: Large language model (LLM) agents have achieved strong performance on a wide range of benchmarks, yet most evaluations assume static environments. In contrast, real-world deployment is inherently dynamic, requiring agents to continu…
arXiv:2606.13120v1 Announce Type: new Abstract: Search Agents -- large language models augmented with search tools -- have intensified the need for future-proof evaluation benchmarks. Existing benchmarks such as BrowseComp rely on static knowledge, making them vulnerable to test-…
arXiv:2606.12837v1 Announce Type: new Abstract: Search agent benchmarks exemplified by BrowseComp have rapidly saturated over the past year, with the strongest models surpassing 90% accuracy. Since these benchmarks are predominantly human-authored, annotators lack a global perspe…
arXiv cs.AI
TIER_1English(EN)·Minjae Kim, Jinheon Baek, Soyeong Jeong, Sung Ju Hwang·
arXiv:2606.13177v1 Announce Type: cross Abstract: Large language model (LLM) agents are increasingly expected to operate over long-term interactions, where information from past dialogues must be preserved and recalled to support future tasks. However, as interactions accumulate,…
arXiv:2604.16548v2 Announce Type: replace-cross Abstract: The emergence of writable, cross-session persistent memory in LLM agents introduces a qualitatively different threat landscape from conventional input-centric security concerns, characterized by three properties: persisten…
Large language model (LLM) agents have achieved strong performance on a wide range of benchmarks, yet most evaluations assume static environments. In contrast, real-world deployment is inherently dynamic, requiring agents to continually align their knowledge, skills, and behavior…
Large language model (LLM) agents are increasingly expected to operate over long-term interactions, where information from past dialogues must be preserved and recalled to support future tasks. However, as interactions accumulate, the memory store grows without bound and fills wi…
Search Agents -- large language models augmented with search tools -- have intensified the need for future-proof evaluation benchmarks. Existing benchmarks such as BrowseComp rely on static knowledge, making them vulnerable to test-set contamination and parametric memorization. C…
While Large Language Models (LLMs) have advanced open-domain dialogue systems, maintaining long-term consistency remains a challenge due to inherent limitations in long-context reasoning and the inefficiency of processing extensive raw text. Existing approaches typically rely on …
arXiv:2606.11680v1 Announce Type: new Abstract: Large language model (LLM) agents struggle with long-horizon tasks due to their inherent statelessness, requiring all task-relevant information to be encoded in growing input contexts. The resulting degraded reasoning quality, incre…
arXiv:2606.12329v1 Announce Type: new Abstract: AI coding assistants now support a growing share of software work, from quick scripts to production applications. Yet these agents remain largely stateless: each new session re-reads project files, re-derives prior decisions, and - …
arXiv:2606.12087v1 Announce Type: new Abstract: Training deep search agents requires verifiable questions whose answers remain unavailable until sufficient evidence has been acquired through search. Existing synthesis methods often increase apparent difficulty by enriching graph …
Search agent benchmarks exemplified by BrowseComp have rapidly saturated over the past year, with the strongest models surpassing 90% accuracy. Since these benchmarks are predominantly human-authored, annotators lack a global perspective on entity statistics and cannot systematic…
EvoBrowseComp is an evolving benchmark with 800 contamination-free questions synthesized through a three-agent framework that ensures temporal freshness and prevents parametric memorization in search agent evaluation.
EvoArena benchmark and EvoMem memory paradigm address the challenge of dynamic environments in LLM agents by modeling progressive updates and structured memory evolution, showing improved performance on evolving tasks.
AI coding assistants now support a growing share of software work, from quick scripts to production applications. Yet these agents remain largely stateless: each new session re-reads project files, re-derives prior decisions, and - most costly - may repeat debugging attempts that…
Training deep search agents requires verifiable questions whose answers remain unavailable until sufficient evidence has been acquired through search. Existing synthesis methods often increase apparent difficulty by enriching graph structures, but structural complexity alone does…
Large language model (LLM) agents struggle with long-horizon tasks due to their inherent statelessness, requiring all task-relevant information to be encoded in growing input contexts. The resulting degraded reasoning quality, increased inference cost, and higher latency necessit…
arXiv cs.AI
TIER_1English(EN)·Lei (Rachel) Chen, Guilin Zhang, Kai Zhao, Dalmo Cirne, Andy Olsen, Xu Chu, Zeke Miller, Alet Blanken, Amine Anoun, Jerry Ting·
arXiv:2606.10062v1 Announce Type: new Abstract: Foundation-model agents are increasingly long-lived systems that remember users across interactions, making memorization an explicit deployment-time function rather than solely a property of model weights. Existing work addresses pa…
arXiv cs.LG
TIER_1English(EN)·Yv Zhang, Hao Sun, Hao Fang, Kuofeng Gao, Fan Mo, Bin Chen, Shu-Tao Xia, Yaowei Wang·
arXiv:2606.10742v1 Announce Type: cross Abstract: External memory has become a core component of modern web agents, enabling long-horizon reasoning through the retrieval of past experiences. However, this paradigm introduces a critical vulnerability: malicious content injected in…
arXiv:2510.04195v2 Announce Type: replace Abstract: Given a map description through global traversal navigation instructions, an LLM can often infer the implicit spatial layout and answer user queries by providing shortest paths. However, such context-dependent querying becomes i…
arXiv:2606.11182v1 Announce Type: cross Abstract: In this paper, we propose EEVEE, the first multi-dataset test-time prompt learning framework for LLM agents, enabling test-time prompt learning under real-world task streams. Existing methods are largely designed for single-datase…
arXiv:2606.10388v1 Announce Type: cross Abstract: Agent skill libraries are becoming routable software assets: a retrieved skill can contribute instructions, scripts, resource bindings, and execution assumptions to an agent. This makes skill retrieval more than broad relevance ma…
arXiv:2606.09900v1 Announce Type: cross Abstract: Long-term memory is the missing layer for LLM agents: across sessions they forget, and the common workaround -- replaying the whole history into the prompt -- is expensive, slow, and, as distractors accumulate, less accurate. Most…
arXiv:2606.10677v1 Announce Type: new Abstract: Long-term LLM agents need persistent memory that can track changing facts and provide relevant evidence across sessions. Existing memory systems often store observations as isolated records, summaries, or indexed fragments, which ma…
arXiv cs.AI
TIER_1English(EN)·Qingcan Kang, Liu Mingyang, Shixiong Kai, Kaichao Liang, Tao Zhong, Mingxuan Yuan·
arXiv:2606.10616v1 Announce Type: new Abstract: Long-horizon language agents accumulate observations, reasoning traces, and retrieved facts that exceed their finite context windows, making memory retention a fundamental resource-allocation problem. Existing memory systems improve…
arXiv:2606.10507v1 Announce Type: new Abstract: While Large Language Models (LLMs) have demonstrated strong capabilities as autonomous agents across a wide range of tasks, their performance often degrades in multi-turn long-horizon agentic tasks. Existing methods have made progre…
A framework for creating shortcut-resistant training data for deep search agents by identifying and mitigating four shortcut risks in data synthesis processes.
In this paper, we propose EEVEE, the first multi-dataset test-time prompt learning framework for LLM agents, enabling test-time prompt learning under real-world task streams. Existing methods are largely designed for single-dataset settings, while real-world applications require …
EEVEE is a novel test-time prompt learning framework for LLM agents that handles heterogeneous data streams through task clustering and co-evolving router-prompt optimization.
External memory has become a core component of modern web agents, enabling long-horizon reasoning through the retrieval of past experiences. However, this paradigm introduces a critical vulnerability: malicious content injected into memory can be persistently recalled and repeate…
Long-term LLM agents need persistent memory that can track changing facts and provide relevant evidence across sessions. Existing memory systems often store observations as isolated records, summaries, or indexed fragments, which makes evidence aggregation, fact revision, and mem…
Long-horizon language agents accumulate observations, reasoning traces, and retrieved facts that exceed their finite context windows, making memory retention a fundamental resource-allocation problem. Existing memory systems improve management through heuristic scoring, retrieval…
While Large Language Models (LLMs) have demonstrated strong capabilities as autonomous agents across a wide range of tasks, their performance often degrades in multi-turn long-horizon agentic tasks. Existing methods have made progress through fine-grained credit assignment to all…
arXiv cs.AI
TIER_1English(EN)·Tianxiang Fei, Mingyang Song, Mao Zheng, Xiang Yu·
arXiv:2606.09483v1 Announce Type: cross Abstract: Long-term memory for an LLM agent is more than retrieving the right passage at the right time. Current memory systems collapse belief revision, causal coupling, and cross-domain abstraction into a single retrieval surface tuned fo…
arXiv cs.AI
TIER_1English(EN)·Jiazhou Liang, Armin Toroghi, Yifan Simon Liu, Faeze Moradi Kalarde, Liam Gallagher, Scott Sanner·
arXiv:2605.12213v2 Announce Type: replace Abstract: LLM-based conversational AI agents struggle to maintain coherent behavior over long horizons due to limited context. While RAG-based approaches are increasingly adopted to overcome this limitation by storing interactions in exte…
arXiv:2602.08222v2 Announce Type: replace Abstract: As post-training optimization becomes central to improving large language models, we observe a persistent saturation bottleneck: once models grow highly confident, further training yields diminishing returns. While existing meth…
arXiv:2602.03224v2 Announce Type: replace Abstract: Test-time evolution of agent memory represents a pivotal paradigm for advancing AGI, as it strengthens complex reasoning through experience accumulation without requiring parameter updates. However, even during benign task evolu…
arXiv:2606.07711v1 Announce Type: cross Abstract: Memory is the key component for transforming a stateless LLM into a persistent, evolving agent through experience accumulation, long-horizon planning, and continual self-improvement. Existing memory systems typically take the LLM …
arXiv:2606.09365v1 Announce Type: new Abstract: Medical agent systems are increasingly expected to support interactive clinical decision making rather than only static question answering. In such settings, effective agents must reuse prior experience across evolving cases, yet ex…
arXiv cs.AI
TIER_1English(EN)·Zhixun Tan, Qiang Chen, Tairan Huang, Xiu Su, Yi Chen·
arXiv:2606.08702v1 Announce Type: new Abstract: Recent advances have improved the adaptive capabilities of LLM-based multi-agent systems (MAS) through memory-, skill-, and learning-based approaches, yet these approaches remain challenged by noisy trajectories, insufficient modeli…
arXiv:2606.08151v1 Announce Type: new Abstract: Tool-using LLM agents often fail not because relevant text is absent, but because decisive evidence is not selected, compressed, or surfaced at action time. We present CICL, a decision-aware context layer that turns instance evidenc…
Agent skill libraries are becoming routable software assets: a retrieved skill can contribute instructions, scripts, resource bindings, and execution assumptions to an agent. This makes skill retrieval more than broad relevance matching. A retriever can find the right capability …
Foundation-model agents are increasingly long-lived systems that remember users across interactions, making memorization an explicit deployment-time function rather than solely a property of model weights. Existing work addresses parametric memorization or audits fixed memory con…
Long-term memory for an LLM agent is more than retrieving the right passage at the right time. Current memory systems collapse belief revision, causal coupling, and cross-domain abstraction into a single retrieval surface tuned for surface recall, and consequently struggle on imp…
Large language model agents are increasingly deployed in human-human interaction settings, such as meeting assistants and clinical documentation systems, where they must observe conversations and retain information for downstream queries. Unlike traditional human-assistant settin…
Medical agent systems are increasingly expected to support interactive clinical decision making rather than only static question answering. In such settings, effective agents must reuse prior experience across evolving cases, yet existing memory mechanisms often retain raw histor…
SkeMex is a self-evolving framework that enhances medical agents through structured skill memory, improving long-term clinical reasoning by distinguishing useful experiences and governing memory retention based on contextual utility.
arXiv cs.AI
TIER_1English(EN)·Runzhe Wang, Huilin Lu, Shengjie Liu, Li Dong, Jason Zhu·
arXiv:2606.06787v1 Announce Type: new Abstract: Large Language Models (LLMs) show promise as tool-using agents but remain limited in long-horizon tasks that require remembering, organizing, and reusing knowledge. Prior memory approaches aim to resolve the situation, but mainly fo…
arXiv:2606.07402v1 Announce Type: new Abstract: Language agents are increasingly deployed over accumulating multimodal information, yet existing benchmarks assume a human-human form with sparse visuals and straightforward content, evaluating neither reasoning over authentic multi…
arXiv cs.AI
TIER_1English(EN)·Zequn Xie, Junjie Wang, Dan Yang, Jie Feng, Yue Shen, Jian Wang, Jinjie Gu·
arXiv:2606.07074v1 Announce Type: cross Abstract: Deep research agents have demonstrated remarkable capabilities in complex information-seeking tasks, yet this power comes at a steep computational cost. Driven by accuracy-focused training paradigms, current models adopt brute-for…
arXiv:2606.06240v1 Announce Type: cross Abstract: Persistent memory for an LLM agent is a write-heavy substrate: every belief update is a versioned write, and a new claim may contradict a stored one. Production systems use four resolution heuristics (last-writer-wins, evidence-we…
arXiv cs.AI
TIER_1English(EN)·Yunxiang Zhang, Yiheng Li, Ali Payani, Lu Wang·
arXiv:2606.05684v1 Announce Type: new Abstract: A central challenge for language agents is utilizing past experience to adapt to dynamic test-time conditions. While recent work demonstrates the promise of agentic memory mechanisms, most systems restrict retrieval to episode initi…
arXiv cs.AI
TIER_1English(EN)·Shuo Ji, Yibo Li, Bryan Hooi·
arXiv:2606.06036v1 Announce Type: new Abstract: Despite recent progress, LLM agents still struggle with reasoning over long interaction histories. While current memory-augmented agents rely on a static retrieve-then-reason paradigm, this rigid pipeline design prevents them from d…
arXiv cs.AI
TIER_1English(EN)·Lingxiang Xu, Jiaoyun Yang, Min Hu, Hongtu Chen, Ning An·
arXiv:2606.06055v1 Announce Type: new Abstract: Long-term memory enables language model agents to support personalized interactions, but it remains unclear when available memories warrant integration into responses. Existing memory evaluations emphasize retrieval accuracy and dow…
arXiv:2606.06090v1 Announce Type: new Abstract: LLM-based agents increasingly tackle long-horizon tasks with interdependent decisions, where each action reshapes future constraints and intermediate errors can cascade. Existing RAG and agent memory systems organize histories by se…
arXiv:2606.06054v1 Announce Type: new Abstract: Personal AI agents increasingly rely on long-term memory to provide persistent personalization across sessions. However, existing memory pipelines are largely driven by semantic similarity: memory data close to the current query is …
Language agents are increasingly deployed over accumulating multimodal information, yet existing benchmarks assume a human-human form with sparse visuals and straightforward content, evaluating neither reasoning over authentic multimodal file interaction nor the interpretation of…
Long-term memory is the missing layer for LLM agents: across sessions they forget, and the common workaround -- replaying the whole history into the prompt -- is expensive, slow, and, as distractors accumulate, less accurate. Most memory systems win on cost or latency but still l…
Deep research agents have demonstrated remarkable capabilities in complex information-seeking tasks, yet this power comes at a steep computational cost. Driven by accuracy-focused training paradigms, current models adopt brute-force strategies characterized by blind tool dependen…
arXiv cs.CL
TIER_1English(EN)·Minseok Choi, Seungbin Yang, Dongjin Kim, Subin Kim, Jungmin Son, Yunseung Lee, Jaegul Choo, Youngjun Kwak·
arXiv:2606.05743v1 Announce Type: cross Abstract: Despite advances in safety alignment, large language models remain vulnerable to continuously evolving jailbreaks. Existing fine-tuned safety classifiers cannot adapt to these evolving attacks, while adaptive memory-based guardrai…
arXiv:2606.05414v1 Announce Type: new Abstract: Early failure alerting requires deciding, while a dialog or agent trajectory is still unfolding, whether to flag it as likely to fail. This is challenging because supervision is typically available only as a trajectory-level success…
arXiv cs.CL
TIER_1English(EN)·Jiayu Liu, Cheng Qian, Zhenhailong Wang, Bingxuan Li, Jiateng Liu, Heng Wang, Jeonghwan Kim, Yumeng Wang, Xiusi Chen, Yi R. Fung, Heng Ji·
arXiv:2606.05622v1 Announce Type: new Abstract: Planning for real-world problems by language models often involves both world and user constraints, which may not be fully specified upfront and are progressively disclosed through interaction. However, existing benchmarks still und…
arXiv:2606.05894v1 Announce Type: new Abstract: Long-horizon agents can archive large histories, but future answers still incur retrieval, rereading, and context costs. When retained memory misses answer-relevant evidence, the system must return to larger portions of the raw hist…
arXiv:2606.06079v1 Announce Type: new Abstract: Agent skills, which consist of reusable strategies that guide agent reasoning and action, have shown strong potential for improving model capability at inference time. However, current skill construction methods treat the problem as…
arXiv:2606.05761v1 Announce Type: cross Abstract: Persistent AI assistants, such as OpenClaw, accumulate large collections of related memories over long-term interactions. As these memories grow, they may reinforce one another, diverge across contexts, or directly conflict, makin…
arXiv cs.CL
TIER_1English(EN)·Yuxuan Cai, Wei Li, Jie Zhou, Qin Chen, Xin Li, Bo Zhang, Liang He·
arXiv:2604.20572v2 Announce Type: replace Abstract: Online lifelong learning agents must decide not only how to act but also when to consult prior experience to continually improve on long-horizon tasks. Existing methods typically retrieve memories passively, such as at task init…
arXiv cs.CL
TIER_1English(EN)·Nicholas Edwards, Sebastian Schuster·
arXiv:2603.26233v2 Announce Type: replace Abstract: As Large Language Model (LLM) agents are increasingly deployed in open-ended domains like software engineering, they frequently encounter underspecified instructions that lack crucial context. While human developers naturally re…
SlimSearcher is a framework that improves efficiency in deep research agents by combining Pareto-efficient trajectory filtering and adaptive reward shaping to reduce computational costs while maintaining accuracy.
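As an illustration of what Pareto-efficient trajectory filtering can mean in general, the sketch below keeps only rollouts that are not dominated on two axes: accuracy (higher is better) and number of tool calls (lower is better). The `Trajectory` type, the field names, and the use of tool calls as a cost proxy are assumptions made for this example; this is not SlimSearcher's actual procedure.

```python
from typing import List, NamedTuple


class Trajectory(NamedTuple):
    """Hypothetical rollout record used only for this illustration."""
    name: str
    accuracy: float   # higher is better
    tool_calls: int   # lower is better (assumed stand-in for compute cost)


def dominates(a: Trajectory, b: Trajectory) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.tool_calls <= b.tool_calls
    strictly_better = a.accuracy > b.accuracy or a.tool_calls < b.tool_calls
    return no_worse and strictly_better


def pareto_filter(trajs: List[Trajectory]) -> List[Trajectory]:
    """Keep trajectories that no other trajectory dominates."""
    return [t for t in trajs if not any(dominates(o, t) for o in trajs)]


rollouts = [
    Trajectory("a", 1.0, 12),
    Trajectory("b", 1.0, 5),
    Trajectory("c", 0.0, 3),
    Trajectory("d", 0.0, 9),
]
kept = pareto_filter(rollouts)
print([t.name for t in kept])
```

Rollout `a` is dropped because `b` reaches the same accuracy with fewer tool calls, and `d` is dropped because `b` beats it on both axes. Rollout `c` survives despite its low accuracy because nothing is cheaper, which is why a filter like this would normally be combined with a reward or accuracy threshold.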
Persistent memory for an LLM agent is a write-heavy substrate: every belief update is a versioned write, and a new claim may contradict a stored one. Production systems use four resolution heuristics (last-writer-wins, evidence-weighted merge, await-confirmation, per-rule policy)…
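Two of the heuristics named in that abstract can be sketched in a few lines to show how they can disagree on the same conflict. The `Claim` record, its fields, and the scalar `evidence` score are assumptions for illustration, and the evidence-weighted variant is simplified here to selecting the better-supported claim rather than merging values; the paper's definitions may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """Hypothetical versioned memory entry (illustrative fields only)."""
    key: str
    value: str
    timestamp: int    # logical write time
    evidence: float   # assumed scalar support score in [0, 1]


def last_writer_wins(stored: Claim, incoming: Claim) -> Claim:
    """Keep whichever claim was written most recently."""
    return incoming if incoming.timestamp >= stored.timestamp else stored


def evidence_weighted(stored: Claim, incoming: Claim) -> Claim:
    """Keep the better-supported claim; ties keep the stored one."""
    return incoming if incoming.evidence > stored.evidence else stored


old = Claim("user.city", "Berlin", timestamp=1, evidence=0.9)
new = Claim("user.city", "Munich", timestamp=2, evidence=0.4)

lww = last_writer_wins(old, new)
ew = evidence_weighted(old, new)
print(lww.value, ew.value)
```

Here the newer but weakly supported claim wins under last-writer-wins and loses under the evidence-weighted rule, which is the kind of divergence that makes the choice of heuristic matter.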
Agent skills, which consist of reusable strategies that guide agent reasoning and action, have shown strong potential for improving model capability at inference time. However, current skill construction methods treat the problem as one-shot extraction, overlooking a fundamental …
Despite recent progress, LLM agents still struggle with reasoning over long interaction histories. While current memory-augmented agents rely on a static retrieve-then-reason paradigm, this rigid pipeline design prevents them from dynamically adapting memory access to intermediat…
Long-horizon agents can archive large histories, but future answers still incur retrieval, rereading, and context costs. When retained memory misses answer-relevant evidence, the system must return to larger portions of the raw history. We study budgeted evidence survival: before…
arXiv:2606.04202v1 Announce Type: new Abstract: As LLMs become more widely deployed, they are increasingly expected to work alongside other AI agents rather than operating in isolation. Effective coordination in these settings requires agents to communicate, share information and…
arXiv cs.CL
TIER_1English(EN)·Yubo Hou, Jingwei Song, Hongbo Zhang, Zhisheng Chen, Bang Xiao, Tao Wan, Zengchang Qin·
arXiv:2606.04780v1 Announce Type: new Abstract: Persistent LLM agents require memory representations that make the formation of person understanding explicit across long term interaction. Existing agent memory methods emphasize information retention and retrieval, yet give limite…
arXiv cs.CL
TIER_1English(EN)·Jingwen Chen, Wenkai Yang, Shengda Fan, Wenbo Nie, Chenxing Sun, Shaodong Zheng, Yangen Hu, Lu Pan, Ke Zeng, Yankai Lin·
arXiv:2606.04703v1 Announce Type: new Abstract: Experience internalization converts contextual experience from past interactions into reusable parametric capability, offering a promising path toward continual learning in large language models (LLMs). While prior work has predomin…
arXiv:2606.04815v1 Announce Type: cross Abstract: Lifelong learning is essential for Large Language Model (LLM) agents operating in dynamic, interactive environments. However, existing lifelong learning agents for long-horizon tasks typically depend on discrete skill or past expe…
arXiv cs.AI
TIER_1English(EN)·Yifan Simon Liu, Liam Gallagher, Faeze Moradi Kalarde, Jiazhou Liang, Armin Toroghi, Scott Sanner·
arXiv:2606.04555v1 Announce Type: cross Abstract: Long-horizon conversational agents need to interact with users through evolving events, tasks, and goals. Such histories are naturally temporal, yet many existing memory systems organize information primarily by topical similarity…
arXiv cs.AI
TIER_1English(EN)·Wangcheng Tao, Han Wu, Weng-Fai Wong·
arXiv:2606.04120v1 Announce Type: cross Abstract: Conversational agents that serve as lifelong companions must maintain persistent memory across all interactions. However, simply expanding context windows with raw retrieval degrades reasoning quality, while training memory agents…
arXiv:2606.04536v1 Announce Type: new Abstract: Existing memory-augmented LLM agents store past experience exclusively in prompt space, as textual summaries or retrieved passages, while keeping model parameters frozen throughout a rollout. Such agents can look up what they…
arXiv cs.AI
TIER_1English(EN)·Jiaxi Li, Ke Deng, Yun Wang, Jingyuan Huang, Yucheng Shi, Qiaoyu Tan, Jin Lu, Ninghao Liu·
arXiv:2606.04391v1 Announce Type: new Abstract: Language agents increasingly rely on reusable skills to improve multi-step web automation across related tasks. A growing line of work studies online skill learning, where agents continually induce skills from previous task trajecto…
arXiv:2606.04315v1 Announce Type: new Abstract: LLM agents accumulate histories that outgrow their context windows, motivating a growing literature on memory systems. Yet most existing designs are tuned to a single scenario (multi-session chat or a single trajectory format), and …
SubtleMemory benchmark evaluates AI agents' ability to handle complex relational memory structures that emerge during prolonged interactions, revealing limitations in current memory systems for preserving and utilizing nuanced memory relationships.
AdaPlanBench presents a dynamic interactive benchmark for evaluating LLM agents' ability to adaptively plan under progressively revealed world and user constraints through multi-turn interactions.
Lifelong learning is essential for Large Language Model (LLM) agents operating in dynamic, interactive environments. However, existing lifelong learning agents for long-horizon tasks typically depend on discrete skill or past experiences retrieval with static parameters during in…
Persistent LLM agents require memory representations that make the formation of person understanding explicit across long term interaction. Existing agent memory methods emphasize information retention and retrieval, yet give limited account of how accumulated interaction evidenc…
Experience internalization converts contextual experience from past interactions into reusable parametric capability, offering a promising path toward continual learning in large language models (LLMs). While prior work has predominantly focused on single-iteration transfer, we d…
Long-horizon conversational agents need to interact with users through evolving events, tasks, and goals. Such histories are naturally temporal, yet many existing memory systems organize information primarily by topical similarity and may ignore the order in which events occur. W…
Existing memory-augmented LLM agents store past experience exclusively in prompt space, as textual summaries or retrieved passages, while keeping model parameters frozen throughout a rollout. Such agents can look up what they have seen but cannot learn from it: thei…
System prompt optimization improves agent behavior without modifying the underlying model, yielding human-readable, model-agnostic instructions. Existing methods build a prompt agent that refines task agents' system prompts, yet leave the prompt agent's own system prompt hand-eng…
arXiv cs.AI
TIER_1English(EN)·Sarah Barrington, Maty Bohacek, Hany Farid·
arXiv:2606.03686v1 Announce Type: new Abstract: We present DeepSpeak-Agentic, a dataset of videos comprising over 37 hours of semi-structured conversations between a human and an embodied AI agent. We use this dataset to evaluate the automatic forensic identification (audio, vide…
arXiv:2606.03329v1 Announce Type: new Abstract: Long-context tasks require LLMs to identify and preserve answer-relevant information from large contexts. Chunk-wise memory agents address this issue by sequentially reading document chunks, updating a compact memory, and generating…
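The read-update loop of a chunk-wise memory agent can be sketched as follows. In a real system the memory update would be performed by an LLM; here a keyword-overlap score stands in for it, and the function names, the sentence-list chunk format, and the `budget` parameter are all assumptions for illustration rather than the paper's method.

```python
from typing import List


def score(sentence: str, question: str) -> int:
    """Count question words that also appear in the sentence (toy relevance)."""
    q_terms = set(question.lower().split())
    return len(q_terms & set(sentence.lower().split()))


def update_memory(memory: List[str], chunk: List[str],
                  question: str, budget: int) -> List[str]:
    """Merge the new chunk into memory and keep the top `budget` sentences."""
    candidates = memory + chunk
    # sorted() is stable, so equally scored sentences keep their reading order.
    ranked = sorted(candidates, key=lambda s: score(s, question), reverse=True)
    return ranked[:budget]


def read_document(chunks: List[List[str]], question: str,
                  budget: int = 2) -> List[str]:
    """Read chunks in order, carrying only a fixed-size memory forward."""
    memory: List[str] = []
    for chunk in chunks:
        memory = update_memory(memory, chunk, question, budget)
    return memory


chunks = [
    ["the launch was delayed by rain", "tickets cost ten dollars"],
    ["the rocket launch happened on friday", "the crowd was large"],
]
memory = read_document(chunks, "when did the rocket launch happen")
print(memory)
```

The agent never sees more than one chunk plus its compact memory at a time, so an irrelevant sentence admitted early (the ticket price) is evicted once better evidence arrives. The final answer would then be generated from `memory` alone.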
arXiv:2606.03083v1 Announce Type: new Abstract: Large Language Model (LLM)-based agents increasingly rely on memory to learn from experiences over continual interactions. However, storing experiences as independent, flat units leads to substantial redundancy and retrieval conflic…
arXiv:2512.03627v2 Announce Type: replace Abstract: Despite rapid progress in large-scale language and vision models, AI agents still suffer from a fundamental limitation: they cannot remember. Without reliable memory, agents catastrophically forget past experiences, struggle wit…
arXiv:2510.16392v3 Announce Type: replace Abstract: Personalized and continuous interactions are critical for LLM-based conversational agents, yet finite context windows and static parametric memory hinder the modeling of long-term, cross-session user states. Existing approaches,…
arXiv cs.AI
TIER_1English(EN)·Kailin Lyu, Zhiqiang Yuan, Jianwei He, Qiwei Yan, Xuanbo Su, Nanxing Hu, Yang Liu, Ce Hao, Shengqian Qin, Lianyu Hu, Jinchao Zhang, Jie Zhou·
arXiv:2606.03099v1 Announce Type: cross Abstract: Deep Image Search requires multi-step reasoning over rich contextual cues, such as time, location, and event relations. However, most existing LLM-based agents are stateless and reactive, lacking persistent memory to maintain long…
arXiv cs.AI
TIER_1English(EN)·Kaiwen Chen, Xin Tan, Jingzong Li, Hong Xu·
arXiv:2606.03077v1 Announce Type: cross Abstract: Reinforcement learning (RL) has become a standard post-training paradigm for large language models (LLMs), extending beyond preference alignment to complex reasoning and multi-turn agentic behaviors. In agentic RL, the rollout sta…
arXiv cs.AI
TIER_1English(EN)·Yuan Xiong, Ziqi Miao, Qian Chen, Lijun Li, Yequan Wang, Shizhu He, Jun Zhao, Kang Liu·
arXiv:2606.03692v1 Announce Type: new Abstract: Recent AI agents can flexibly invoke skills to solve complex tasks, but their long-term improvement is fundamentally constrained by a lack of systematic skill construction, accumulation, and transfer. In particular, without a unifie…
arXiv:2606.03463v1 Announce Type: new Abstract: Conversational AI agents require memory systems that are both scalable and semantically coherent across long interaction horizons. Existing approaches rely predominantly on large language model (LLM)-based summarisation at write tim…
arXiv:2604.20183v2 Announce Type: replace Abstract: Large Language Models (LLMs) often struggle with structural ambiguity in optimization problems, where a single problem admits multiple related but conflicting modeling paradigms, hindering effective solution generation. To addre…
arXiv:2606.03143v1 Announce Type: cross Abstract: Modern LLM agents increasingly rely on skill libraries to handle complex tasks, making skill evolution a primary driver of self-improvement. However, isolated single-user task streams lack the diversity required to build comprehen…
arXiv cs.AI
TIER_1English(EN)·Renjun Xu, Yang Yan·
arXiv:2602.12430v4 Announce Type: replace-cross Abstract: The transition from monolithic language models to modular, skill-equipped agents marks a defining shift in how large language models (LLMs) are deployed in practice. Rather than encoding all procedural knowledge within mod…
Experience internalization enables continual learning in large language models by converting past interactions into reusable capabilities, with key findings on experience granularity, injection patterns, and internalization regimes for stable learning.
Self-Evolving Prompt Optimization (SePO) enhances agent performance by jointly optimizing both task and prompt agent system prompts through evolutionary search, demonstrating superior accuracy across diverse benchmarks.
State-Grounded Dynamic Retrieval enables web agents to dynamically reuse skills based on current webpage state rather than fixed task-level strategies, improving automation performance across multiple domains.
Recent AI agents can flexibly invoke skills to solve complex tasks, but their long-term improvement is fundamentally constrained by a lack of systematic skill construction, accumulation, and transfer. In particular, without a unified framework for skill consolidation, agents tend…
We present DeepSpeak-Agentic, a dataset of videos comprising over 37 hours of semi-structured conversations between a human and an embodied AI agent. We use this dataset to evaluate the automatic forensic identification (audio, video, or text) of AI agents, study the nature of hu…
Conversational AI agents require memory systems that are both scalable and semantically coherent across long interaction horizons. Existing approaches rely predominantly on large language model (LLM)-based summarisation at write time, which introduces non-determinism, escalating …
arXiv cs.AI
TIER_1English(EN)·Chishui Chen, Jiaye Lin, Te Sun, Junxi Wang, Yi Yang, Cong Qin, Yangen Hu, Lu Pan, Ke Zeng·
arXiv:2606.00510v1 Announce Type: cross Abstract: Agent skills are callable procedural modules that provide reusable knowledge and execution policies for complex agentic tasks. However, existing methods mainly focus on selecting relevant skills or improving the skills themselves,…
arXiv:2606.00756v1 Announce Type: new Abstract: Deploying lightweight Large Language Model (LLM) agents on edge servers can reduce latency and move agentic services closer to users, but resource-constrained edge models often struggle with long-horizon tasks that require persisten…
arXiv:2606.01139v1 Announce Type: new Abstract: Agent skills are procedural artifacts that enable LLM agents to execute workflows, verify constraints, and recover from failures. Existing self-evolving methods refine skills using accumulated trajectories. However, they struggle in…
arXiv:2606.01311v1 Announce Type: cross Abstract: Large language model (LLM) agents increasingly rely on reusable external skills to solve long-horizon interactive tasks. Existing training-free skill adaptation pipelines usually update skills from full trajectories or session-lev…
arXiv:2606.01138v1 Announce Type: cross Abstract: Agent-memory frameworks - mem0, Letta/MemGPT, Cognee, Zep/Graphiti, MemoryOS, MemTensor - each ship their own SDK, storage layout, and operational vocabulary. There is no shared wire format: every integration is bespoke, every mig…
arXiv cs.AI
TIER_1English(EN)·Bole Ma, Jan Eitzinger, Harald Koestler·
arXiv:2606.01065v1 Announce Type: cross Abstract: Modern KV cache management assumes the chatbot workload: prompts arrive once and the cache grows append-only, so prefix caching and forward-only eviction are correct by construction. Agentic LLMs break this assumption. Their conve…
arXiv:2606.00590v1 Announce Type: cross Abstract: Agentic search systems iteratively interact with retrieval models to answer complex queries. Despite substantial progress, optimizing retrievers for agentic search remains challenging, often requiring heavy co-training or gold-sta…
arXiv:2606.02461v1 Announce Type: new Abstract: Language agents spend substantial inference time solving individual tasks, yet the experience acquired in one episode is often underutilized in future episodes. Continual learning expects an agent to accumulate reusable experience a…
arXiv:2606.01528v1 Announce Type: new Abstract: In open-ended environments, exploration is fundamental for autonomous agents, yet current language model agents struggle with this. Effective exploration requires memory, but retaining raw interaction histories is computationally ex…
arXiv:2606.02060v1 Announce Type: new Abstract: Deep-research agents solve tasks through long trajectories of search, tool use, evidence inspection, and answer synthesis. Evaluation based on final answers shows whether an agent succeeds, but not which parts of the trajectory make…
arXiv:2606.01667v1 Announce Type: new Abstract: Test-time scaling has become a major way to improve large language model reasoning, but its orchestration has remained designer-engineered: a fixed sample budget, a fixed refinement loop, a fixed scoring rule, or a fixed search poli…
arXiv cs.LG
TIER_1English(EN)·Xu Yang, Lunyiu Nie, Ethan Chandra, Stanislav Gannutin, Fangru Lin, Swarat Chaudhuri·
arXiv:2606.00953v1 Announce Type: new Abstract: Multi-agent Large Language Model (LLM) systems offer a way to decompose complex tasks, such as coding, through parallelization and context isolation. However, adding agents in practice introduces inter-agent communication overhead, …
arXiv cs.CL
TIER_1English(EN)·Jiajun Hou, Hexuan Deng, Wenxiang Jiao, Xuebo Liu, Xiaopeng Ke, Derek F. Wong, Min Zhang·
arXiv:2603.20884v2 Announce Type: replace Abstract: To alleviate the heavy burden of paper screening, researchers increasingly rely on existing AI agents, such as AI reviewers or DeepResearch, for paper evaluation and novelty assessment. However, lacking specialized mechanisms fo…
arXiv cs.CL
TIER_1English(EN)·Tao Feng, Tianyang Luo, Jingjun Xu, Zhigang Hua, Yan Xie, Shuang Yang, Ge Liu, Jiaxuan You·
arXiv:2606.01041v1 Announce Type: new Abstract: Experience learning has achieved promising results in enhancing LLM agent planning and reasoning by integrating past interactions as reusable knowledge. However, existing methods remain confined to explicit text space, retrieving ex…
arXiv cs.CL
TIER_1English(EN)·Adril Putra Merin, David Anugraha, Ayu Purwarianti, Genta Indra Winata·
arXiv:2606.00832v1 Announce Type: new Abstract: Recent advances in agentic AI have enabled agents to complete complex tasks through tool use, reasoning, and multi-step planning. Yet existing benchmarks evaluate agents within a single session, ignoring past actions, stated prefere…
arXiv cs.CL
TIER_1English(EN)·Yibo Wang, Nikki Lijing Kuang, Philip S. Yu, Zhewei Yao, Yuxiong He·
arXiv:2604.03588v3 Announce Type: replace Abstract: AI agents operating over extended time horizons accumulate experiences that serve multiple concurrent goals, and must often maintain conflicting interpretations of the same events. A concession during a client negotiation encode…
arXiv cs.AI
TIER_1English(EN)·Minki Kang, Wei-Ning Chen, Dongge Han, Huseyin A. Inan, Lukas Wutschitz, Yanzhi Chen, Robert Sim, Saravan Rajmohan·
arXiv:2510.00615v3 Announce Type: replace Abstract: Large language models (LLMs) are increasingly deployed as agents in dynamic real-world environments, where success depends on maintaining precise records of actions and observations. However, the resulting unbounded context grow…
arXiv cs.AI
TIER_1English(EN)·Xinyu Che, Junqi Xiong, Yunfei Ge, Xinping Lei, Shihao Li, Hang Yan, Han Li, Yuanxing Zhang, Zhiqi Bai, Jinhua Hao, Ming Sun, Han Li, Jiaheng Liu·
arXiv:2606.01993v1 Announce Type: cross Abstract: Abundant procedural knowledge on the Web holds great potential for helping agents solve long-horizon tasks. However, such knowledge is often multimodal, heterogeneous, noisy, and implicitly assumes human executors, making it diffi…
arXiv:2606.01838v1 Announce Type: cross Abstract: Agentic language model systems alternate between two structurally distinct step types: structured tool calls (short, deterministic, low perplexity) and open-ended planning/reasoning steps (long, complex, high perplexity). Despite …
A comprehensive evaluation framework for continual learning in language agents is introduced, emphasizing controlled task streams and memory design analysis to better assess reusable experience and learning stability.
Language agents spend substantial inference time solving individual tasks, yet the experience acquired in one episode is often underutilized in future episodes. Continual learning expects an agent to accumulate reusable experience across a stream of tasks, improve over time, and …
The MMG2Skill framework converts web-based procedural guides into executable skills through closed-loop learning, improving agent performance across GUI control, gameplay, and card play tasks.
Abundant procedural knowledge on the Web holds great potential for helping agents solve long-horizon tasks. However, such knowledge is often multimodal, heterogeneous, noisy, and implicitly assumes human executors, making it difficult to use directly as the skills required by age…
Agentic language model systems alternate between two structurally distinct step types: structured tool calls (short, deterministic, low perplexity) and open-ended planning/reasoning steps (long, complex, high perplexity). Despite this heterogeneity, current inference systems appl…
Multimodal agents have achieved notable progress on complex reasoning tasks through tool use, yet remain limited by two issues: statically predefined tool inventories fail to generalize to unseen scenarios, and indiscriminate tool invocation incurs redundant cost and noise-induce…
Test-time scaling has become a major way to improve large language model reasoning, but its orchestration has remained designer-engineered: a fixed sample budget, a fixed refinement loop, a fixed scoring rule, or a fixed search policy decides how compute is spent, leaving the mod…
arXiv cs.CL
TIER_1English(EN)·Tao Feng, Chongrui Ye, Tianyang Luo, Jingjun Xu, Xueqiang Xu, Haozhen Zhang, Ge Liu, Jiaxuan You·
arXiv:2605.30690v1 Announce Type: new Abstract: Long-term memory is essential for LLM agents to reason coherently across extended interactions, personalize responses, and reuse past experience. However, existing memory-augmented methods typically treat memory as a fixed resource:…
arXiv:2605.30771v1 Announce Type: new Abstract: AI agents that persist across sessions need memory they can retrieve, audit, update, and erase. Existing memory systems often collapse source evidence, extracted facts, retrieved context, and answer policy into one opaque prompt pat…
arXiv:2605.30785v1 Announce Type: new Abstract: LLM agents increasingly face long-horizon tasks such as web search and deep research in real-world applications, where accumulated context can cause long-context degradation and reasoning failures. Prior work mitigates this through …
arXiv:2605.31365v1 Announce Type: new Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have led to promising progress in web agents. However, existing web agents often rely on handcrafted execution pipelines or expensive expert trajectories, limiting their ad…
arXiv:2605.31468v1 Announce Type: new Abstract: Scientific research has traditionally been human-intensive, requiring researchers to coordinate literature, ideas, experiments, manuscripts, and review responses across long project cycles. The rise of LLM-based scientific agents cr…
arXiv:2605.31408v1 Announce Type: cross Abstract: Skill documents provide procedural knowledge to large-language-model agents at inference time. This article studies whether the presentation granularity of controlled skill knowledge changes downstream task success. The experiment…
arXiv cs.LG
TIER_1English(EN)·Yurui Chang, Yongkang Du, Yuanpu Cao, Jinghui Chen, Lu Lin·
arXiv:2605.30858v1 Announce Type: new Abstract: Agentic forecasting is important for decision-making in dynamic environments, but it remains challenging because agents must reason from incomplete, time-limited evidence and produce calibrated probabilities before outcomes are reso…
arXiv:2605.31086v1 Announce Type: new Abstract: In existing memory benchmarks for Large Language Models (LLMs), the evaluated dialogue sessions often lack long-term semantic consistency, and the underlying personas tend to be flat and static. Furthermore, in real-world scenarios,…
arXiv:2605.31463v1 Announce Type: cross Abstract: Mixture-of-Experts (MoE) has become the dominant architecture for frontier language models. To meet this demand, production frameworks have built optimized MoE training stacks over years of engineering effort. Yet evolving these s…
arXiv cs.AI
TIER_1English(EN)·Zhikun Xu, Yu Feng, Jacob Dineen, Taiwei Shi, Jieyu Zhao, Ben Zhou·
arXiv:2605.31509v1 Announce Type: cross Abstract: Large language model agents trained with reinforcement learning (RL) often learn brittle, task-specific shortcuts. We hypothesize that agents generalize better when their successful trajectories are structurally compressible, deco…
arXiv cs.AI
TIER_1English(EN)·Nianyi Lin, Jiajie Zhang, Lei Hou, Juanzi Li·
arXiv:2605.31584v1 Announce Type: cross Abstract: Long-context reasoning remains a central challenge for large language models, which often fail to locate and integrate key information in extensive distracting content. Reinforcement learning with verifiable rewards (RLVR) has sho…
arXiv cs.AI
TIER_1English(EN)·Benjamin Schneider, Xavier Schneider, Victor Zhong, Sun Sun·
arXiv:2605.14211v2 Announce Type: replace Abstract: Long-horizon embodied tasks remain a fundamental challenge in AI, as current methods rely on hand-engineered rewards or action-labeled demonstrations, neither of which scales. We introduce ASH, an agentic system that learns an e…
arXiv cs.CL
TIER_1English(EN)·Tao Feng, Chongrui Ye, Tianyang Luo, Jingjun Xu, Xueqiang Xu, Haozhen Zhang, Zhigang Hua, Yan Xie, Shuang Yang, Ge Liu, Jiaxuan You·
arXiv:2605.30712v1 Announce Type: new Abstract: Large language model (LLM) agents have shown strong capabilities in reasoning, tool use, and multi-step interaction, but they often solve tasks from scratch and fail to reuse successful strategies or failure lessons from prior exper…
Deep-research agents can be audited using a claim-centric framework that identifies error spans in their reasoning trajectories, improving reliability assessment beyond just final answer evaluation.
The Joint Agent Memory and Exploration Learning (JAMEL) framework trains memory and exploration policies together through novelty-driven interaction, enabling effective exploration in open-ended environments with reduced computational costs.
Large language model (LLM) agents increasingly rely on reusable external skills to solve long-horizon interactive tasks. Existing training-free skill adaptation pipelines usually update skills from full trajectories or session-level feedback, which makes failure attribution coars…
Multi-agent Large Language Model (LLM) systems offer a way to decompose complex tasks, such as coding, through parallelization and context isolation. However, adding agents in practice introduces inter-agent communication overhead, which incurs extra cost and can sometimes offset…
Agentic search systems iteratively interact with retrieval models to answer complex queries. Despite substantial progress, optimizing retrievers for agentic search remains challenging, often requiring heavy co-training or gold-standard annotations that limit real-world applicabil…
The Critic-R framework enhances agentic search by closing the feedback loop between reasoning agents and retrieval models through critic evaluation and dual optimization mechanisms.
Long-context reasoning remains a central challenge for large language models, which often fail to locate and integrate key information in extensive distracting content. Reinforcement learning with verifiable rewards (RLVR) has shown promise for this task, yet existing methods are…
Large language model agents trained with reinforcement learning (RL) often learn brittle, task-specific shortcuts. We hypothesize that agents generalize better when their successful trajectories are structurally compressible, decomposed into a small set of reusable abstract patte…
Scientific research has traditionally been human-intensive, requiring researchers to coordinate literature, ideas, experiments, manuscripts, and review responses across long project cycles. The rise of LLM-based scientific agents creates an opportunity to automate this process. S…
Mixture-of-Experts (MoE) has become the dominant architecture for frontier language models. To meet this demand, production frameworks have built optimized MoE training stacks over years of engineering effort. Yet evolving these stacks for new architectures and system optimizatio…
Skill documents provide procedural knowledge to large-language-model agents at inference time. This article studies whether the presentation granularity of controlled skill knowledge changes downstream task success. The experiment uses a pinned SkillsBench version, a 30-task doma…
Recent advances in Multimodal Large Language Models (MLLMs) have led to promising progress in web agents. However, existing web agents often rely on handcrafted execution pipelines or expensive expert trajectories, limiting their adaptability to complex, dynamic environments. To …
In existing memory benchmarks for Large Language Models (LLMs), the evaluated dialogue sessions often lack long-term semantic consistency, and the underlying personas tend to be flat and static. Furthermore, in real-world scenarios, interactions between users and assistants invol…
arXiv cs.AI
TIER_1English(EN)·Johannes Moll, Jean-Philippe Corbeil, Jiazhen Pan, Martin Hadamitzky, Daniel Rueckert, Lisa Adams, Keno Bressem·
arXiv:2605.29668v1 Announce Type: new Abstract: LLM agents acting in structured environments fail in operational rather than conversational ways, and reliability depends on procedural knowledge of the environment. Prior self-improvement methods accumulate natural-language guidanc…
arXiv cs.CL
TIER_1English(EN)·Chengzhi Liu, Yuzhe Yang, Sophia Xiao Pu, Yepeng Liu, Lin Long, Yichen Guo, Nuo Chen, Zhaotian Weng, Elena Kochkina, Simerjot Kaur, Charese Smiley, Xiaomo Liu, James Zou, Sheng Liu, Yuheng Bu, Songyou Peng, Xin Eric Wang·
arXiv:2605.29341v1 Announce Type: cross Abstract: Multimodal large language models are increasingly deployed as long-horizon agents, where memory must do more than recall: it must track an evolving world, revise what has gone stale, and surface the right evidence at decision time…
arXiv:2605.29559v1 Announce Type: new Abstract: Mastering terminal environments requires language agents capable of multi-step planning, feedback-grounded execution, and dynamic state adaptation. However, training such agents is currently bottlenecked by a reliance on scraped ext…
arXiv:2602.01869v3 Announce Type: replace Abstract: LLM-driven agents excel at sequential decision-making but often rely on on-the-fly reasoning, re-deriving solutions even in recurring scenarios. This insufficient experience reuse leads to computational redundancy and instabilit…
arXiv:2605.30159v1 Announce Type: new Abstract: Memory-augmented LLM agents tackle complex long-horizon tasks by recursively summarizing interaction trajectories into compact memory. However, existing approaches typically train these memory policies using outcome-based reinforcem…
A reinforcement-learning-based framework called TaskMem is introduced to dynamically determine what information to store in long-term memory for multimodal agents, improving performance on streaming video benchmarks.
LongTraceRL addresses long-context reasoning challenges in large language models through tiered distractor construction and rubric reward design for improved reasoning quality.
Memory-augmented LLM agents tackle complex long-horizon tasks by recursively summarizing interaction trajectories into compact memory. However, existing approaches typically train these memory policies using outcome-based reinforcement learning, failing to localize where intermed…
End-to-end agent-memory benchmarks report a single hit@k per retriever, confounding lexical leakage (uncontrolled query/gold/distractor entity overlap) with tag-mixing (preferences, services, tools averaged together). We propose entity-collision, a system-agnostic protocol that p…
Multimodal large language models are increasingly deployed as long-horizon agents, where memory must do more than recall: it must track an evolving world, revise what has gone stale, and surface the right evidence at decision time. Existing benchmarks measure recall over static d…
arXiv:2605.27428v1 Announce Type: new Abstract: Edge deployments of generative inference increasingly face two practical realities: per-device per-model performance is often unknown at deployment time, and it is non-stationary due to user-driven semantic events, background load, …
arXiv cs.AI
TIER_1English(EN)·Guanyu Cui, Zhewei Wei, Kun He·
arXiv:2605.19514v2 Announce Type: replace Abstract: Many works make the eye-catching claim that Transformers are Turing-complete. However, the literature often conflates two distinct settings: (i) a fixed Transformer system setting, in which a fixed autoregressive Transformer is …
arXiv:2605.27760v1 Announce Type: new Abstract: Agent skills provide a lightweight way to adapt LLM agents to specialized domains by storing reusable procedural knowledge in structured files. However, whether downloaded from third parties or self-generated, these skills are often…
arXiv:2605.28224v1 Announce Type: new Abstract: Multi-trajectory inference for tool-use LLM agents - generating multiple reasoning attempts and selecting among them - benefits from transferring knowledge across attempts so that later ones avoid the pitfalls of earlier ones. Exist…
arXiv:2605.28359v1 Announce Type: new Abstract: Evaluating whether large language model (LLM) agents can profit in capital markets is increasingly framed as end-to-end trading: place an agent in a historical market, let it trade, and measure portfolio returns. This setup is vulne…
arXiv cs.AI
TIER_1English(EN)·Yonatan Vernik, Alexander Tuisov, Alexander Shleyfman·
arXiv:2605.28454v1 Announce Type: new Abstract: Greedy Best-First Search (GBFS) is the dominant approach for solving search problems where the goal can be estimated with a heuristic, such as planning, route finding, navigation, and pathfinding. This is especially true when the me…
arXiv:2605.27999v1 Announce Type: cross Abstract: We address the problem of learning to assign prediction tasks to one agent from a set of available human or AI agents. In particular, we focus on the sequential learning of agent expertise and assignment policies where each agent …
arXiv:2604.05333v3 Announce Type: replace Abstract: Modern LLM agents increasingly rely on reusable skills, and as they interact with personal applications, web browsers, and other interfaces, skill libraries can scale to thousands of skills. Scaling to larger skill sets introduc…
arXiv:2605.28046v1 Announce Type: new Abstract: Existing agent memory systems universally follow what we term a Memory-as-Tool paradigm where a single query triggers one-shot retrieval of flat passage lists, suffering from passive invocation, reasoning-retrieval decoupling, and s…
Memory-augmented language models struggle with long-horizon tasks due to information loss in recursive summaries, but a new method using belief entropy and metacognitive policy optimization improves performance by focusing on memory quality rather than just outcome success.
Multimodal large language models require sophisticated memory systems that can track evolving environments and manage information dynamically across multiple sessions, with new benchmarks revealing limitations in current approaches.
LiteCoder-Terminal-Gen enables scalable training of language agents for terminal environments through synthetic, executable environments that outperform traditional methods.
Greedy Best-First Search (GBFS) is the dominant approach for solving search problems where the goal can be estimated with a heuristic, such as planning, route finding, navigation, and pathfinding. This is especially true when the memory is tightly constrained, such as planning on…
Multi-trajectory inference for tool-use LLM agents - generating multiple reasoning attempts and selecting among them - benefits from transferring knowledge across attempts so that later ones avoid the pitfalls of earlier ones. Existing cross-trajectory memory methods (trajectory-…
arXiv:2605.27366v1 Announce Type: new Abstract: Large language model (LLM) agents rely on reusable skills to solve complex tasks. However, existing skill creation approaches treat skills as isolated and static artifacts, limiting their reusability, reliability, and long-term impr…
arXiv:2605.26596v1 Announce Type: new Abstract: The token-level extractive compressors widely used for general LM context are structurally inappropriate for LLM agents: across 17 (env, backbone, method) cells spanning two independent token-level method families, every cell collap…
arXiv:2605.26252v1 Announce Type: new Abstract: Long-running AI agents need persistent memory. Memory supports learning across sessions, reduces repeated context injection, and enables auditing of past decisions. Current agent memory systems and database paradigms treat memory as…
arXiv:2605.11374v3 Announce Type: replace-cross Abstract: Test-time compute is widely believed to benefit only large reasoning models. We show it also helps small embedding models. Since modern embedding models are distilled from LLM backbones, a frozen encoder should benefit fro…
arXiv:2603.14864v2 Announce Type: replace Abstract: In e-commerce, LLM agents show promise for shopping tasks such as recommendations, budget management, and bundle deals, where accurately capturing user preferences from long-horizon conversations is critical. However, progress i…
arXiv:2605.26275v1 Announce Type: new Abstract: Automatic prompt engineering (APE) rewrites prompts to improve downstream task performance, but existing APE loops treat the optimizer itself as a fixed pipeline. We port the code-as-action paradigm of CodeAct (Wang et al., 2024a) t…
arXiv cs.AI
TIER_1English(EN)·Yinpei Dai, Hongze Fu, Jayjun Lee, Yuejiang Liu, Haoran Zhang, Jianing Yang, Chelsea Finn, Nima Fazeli, Joyce Chai·
arXiv:2603.04639v3 Announce Type: replace-cross Abstract: Memory is critical for long-horizon and history-dependent robotic manipulation. Such tasks often involve counting repeated actions or manipulating objects that become temporarily occluded. Recent vision-language-action (VL…
arXiv:2605.26165v1 Announce Type: cross Abstract: Agentic RAG systems that equip language models with dozens to hundreds of tool definitions face a critical resource conflict: tool schemas consume the same context window needed for retrieval-augmented generation. We present the f…
Large language model (LLM) agents rely on reusable skills to solve complex tasks. However, existing skill creation approaches treat skills as isolated and static artifacts, limiting their reusability, reliability, and long-term improvement. We propose MUSE-Autoskill Agent (Memory…
The token-level extractive compressors widely used for general LM context are structurally inappropriate for LLM agents: across 17 (env, backbone, method) cells spanning two independent token-level method families, every cell collapses to mean reward <= 0.05 despite 1.3-13.3x rea…
arXiv:2602.22769v3 Announce Type: replace Abstract: Large Language Models (LLMs) are deployed as autonomous agents in increasingly complex applications, where enabling long-horizon memory is critical for achieving strong performance. However, a significant gap exists between appl…
arXiv cs.CL
TIER_1English(EN)·Xianzhong Ding, Yangyang Yu, Changwei Liu, Bill Zhao·
arXiv:2605.24279v1 Announce Type: new Abstract: A frontier language model's acknowledged "helpful programming assistant" persona does not survive long agentic-coding sessions in the deployment regime that production products actually run. After hours of tool-using debugging, a mo…
arXiv cs.CL
TIER_1English(EN)·Moshe Hazoom, Gal Patel, Alon Talmor, Tom Hope·
arXiv:2605.25641v1 Announce Type: new Abstract: Agentic retrieval-augmented generation (RAG) systems in complex B2B (business-to-business) settings may often receive free-form response feedback. Rather than generic feedback signals such as style, preference, or overall response q…
arXiv:2605.25869v1 Announce Type: new Abstract: Long-term memory is essential for persistent LLM agents, yet prevailing architectures store historical interactions as unstructured, flat text. This unconstrained storage induces provenance-role collapse, a critical failure mode whe…
arXiv:2605.25971v1 Announce Type: new Abstract: While AI agents demonstrate remarkable capabilities in reasoning and tool use, they remain fundamentally reactive: they compute responses only after explicit user prompts. This paradigm ignores a critical opportunity: the idle time …
arXiv:2605.15759v3 Announce Type: replace Abstract: Large language model (LLM) agents require long-term memory to leverage information from past interactions. However, existing memory systems often face a fidelity-efficiency trade-off: raw dialogue histories are expensive, while…
arXiv cs.LG
TIER_1English(EN)·Mahavir Dabas, Jihyun Jeong, Ming Jin, Ruoxi Jia·
arXiv:2605.24941v1 Announce Type: cross Abstract: Modern LLM agents combine long-term memory for personalization with tool-calling interfaces for taking actions in the world -- a combination underpinning contemporary production systems. We study a previously unexamined failure of…
arXiv:2605.24110v1 Announce Type: new Abstract: Coding agents are increasingly used as iterative development partners, but most benchmarks still evaluate one specification followed by one final assessment. This leaves out a basic question: can an agent keep its own codebase worki…
arXiv:2605.24468v1 Announce Type: new Abstract: Long-horizon agentic reasoning requires large language models to act over long interaction histories containing thoughts, tool calls, observations, and partial conclusions. The challenge is not merely that these histories grow long,…
arXiv:2605.25430v1 Announce Type: new Abstract: Coding agents produce rich trajectories while solving software-engineering tasks. To enable agent self-evolution, these trajectories can be distilled into reusable procedural skills that compactly encode experience to guide future b…
arXiv cs.AI
TIER_1English(EN)·Yeonjun In, Wonjoong Kim, Sangwu Park, Kanghoon Yoon, Chanyoung Park·
arXiv:2605.25535v1 Announce Type: new Abstract: Existing large language model (LLM) based memory systems apply universal, static policies that overlook a fundamental reality: the contexts that are worth storing in memory are different across users. This misalignment wastes limite…
arXiv cs.AI
TIER_1English(EN)·Han Chen, Zining Zhang, Wenqi Pei, Bingsheng He, Ming Wu, Jason Zeng, Michael Heinrich, Wei Wu, Hongbao Zhang·
arXiv:2605.23986v1 Announce Type: cross Abstract: Memory is a fundamental component for enabling long-context LLM agents, supporting persistent state across interactions through a continuous serve-and-update lifecycle. Despite substantial prior work, existing systems suffer from …
arXiv:2602.02474v2 Announce Type: replace-cross Abstract: Most Large Language Model (LLM) agent memory systems rely on a small set of static, hand-designed operations for extracting memory. These fixed procedures hard-code human priors about what to store and how to revise memory…
SkillGrad is a gradient-descent-inspired framework that optimizes agent skills through trajectory-level loss evidence and text-based gradients, enhancing skill reliability and performance in specialized domains.
A skill-centric agent framework enables continuous improvement of task-solving capabilities through a unified lifecycle of skill creation, memory, management, evaluation, and refinement.
While AI agents demonstrate remarkable capabilities in reasoning and tool use, they remain fundamentally reactive: they compute responses only after explicit user prompts. This paradigm ignores a critical opportunity: the idle time between interactions is largely wasted, leaving …
Long-term memory is essential for persistent LLM agents, yet prevailing architectures store historical interactions as unstructured, flat text. This unconstrained storage induces provenance-role collapse, a critical failure mode where agents suffer from source-monitoring errors. …
Agentic retrieval-augmented generation (RAG) systems in complex B2B (business-to-business) settings may often receive free-form response feedback. Rather than generic feedback signals such as style, preference, or overall response quality, we focus on actionable factual correctio…
arXiv:2605.12260v2 Announce Type: replace Abstract: Long-horizon language agents accumulate conversation history far faster than any fixed context window can hold, making memory management critical to both answer accuracy and serving cost. Existing approaches either expand the co…
arXiv cs.CL
TIER_1English(EN)·Alina Shutova, Alexandra Olenina, Ivan Vinogradov, Anton Sinitsin·
arXiv:2602.11243v2 Announce Type: replace-cross Abstract: Modern LLM-based agents and chat assistants rely on long-term memory frameworks to store reusable knowledge, recall user preferences, and augment reasoning. As researchers create more complex memory architectures, it becom…
Large language model-based memory systems can benefit from personalized policies that adapt to individual user contexts, though accurate implementation remains challenging.
Long-horizon agentic reasoning is enhanced through a state-adaptive memory framework that dynamically manages interaction histories by creating compact memory cues while preserving detailed trajectories for targeted retrieval.
arXiv:2602.19320v2 Announce Type: replace-cross Abstract: Agentic memory systems enable large language model (LLM) agents to maintain state across long interactions, supporting long-horizon reasoning and personalization beyond fixed context windows. Despite rapid architectural de…
arXiv:2605.20251v2 Announce Type: cross Abstract: Existing benchmarks for LLM coding agents primarily evaluate final outcomes. While useful for measuring overall capability, these metrics provide limited visibility and often miss defects that arise during execution. We present Pr…
arXiv:2602.06025v2 Announce Type: replace-cross Abstract: Memory is increasingly central to Large Language Model (LLM) agents operating beyond a single context window, yet most existing systems rely on offline, query-agnostic memory construction that can be inefficient and may di…
arXiv cs.CL
TIER_1English(EN)·Jingru Lin, Chen Zhang, Stephen Y. Liu, Haizhou Li·
arXiv:2604.15774v2 Announce Type: replace Abstract: Equipping Large Language Models (LLMs) with persistent memory enhances interaction continuity and personalization but introduces new safety risks. Specifically, contaminated or biased memory accumulation can trigger abnormal age…
arXiv cs.LG
TIER_1English(EN)·Sikuan Yan, Ahmed Bahloul, Ercong Nie, Susanna Schwarzmann, Riccardo Trivisonno, Volker Tresp, Yunpu Ma·
arXiv:2605.21768v1 Announce Type: new Abstract: Memory-augmented LLM agents enable interactions that extend beyond finite context windows by storing, updating, and reusing information across sessions. However, training such agents with reinforcement learning in multi-session envi…
arXiv:2605.21951v1 Announce Type: new Abstract: Achieving self-evolution in intelligent agents requires the continual accumulation of new knowledge across changing task sequences without forgetting previously acquired abilities. Existing approaches either internalize knowledge by…
Self-evolving multi-agent systems (MAS) have emerged as a promising route to LLM agents that continually improve from experience, with persistent memory at their foundation. However, existing designs almost exclusively adopt a centralized repository shared across agents, incurrin…
Memory-augmented LLM agents enable interactions that extend beyond finite context windows by storing, updating, and reusing information across sessions. However, training such agents with reinforcement learning in multi-session environments is challenging because memory turns the…
arXiv cs.CL
TIER_1English(EN)·Dimitris N. Metaxas·
Memory is a central capability for LLM agents operating across long-horizon tasks. Existing memory benchmarks predominantly evaluate retention of personalized information in multi-turn chat scenarios, overlooking the dynamic memory formation that occurs during extended agent exec…
Language agents increasingly operate over streams of related tasks, yet existing memory systems struggle to convert accumulated experience into reusable knowledge. Retrieval-augmented and structured memory methods record per-session observations effectively, but often couple acqu…
To enable reliable long-term interaction, LLM agents require a memory system that can faithfully store, efficiently retrieve, and deeply reason over accumulated dialogue history. Most existing methods adopt an extracted fact based paradigm: handcrafted static prompts compress raw…
Large language model (LLM) agents increasingly operate over long and recurring external contexts, like document corpora and code repositories. Across invocations, existing approaches preserve either the agent's trajectory, passive access to raw material, or task-level strategies.…
TriMem enables reliable long-term interaction for LLM agents by maintaining multiple memory representation granularities and using TextGrad-based prompt optimization for continuous improvement.
The Mixture-of-Agents (MoA) framework has shown promise in improving large language model (LLM) performance by aggregating outputs from multiple agents. However, existing MoA systems often rely on static routers that do not fully capture temporal and contextual dependencies acros…
Real-world agents operate over long and evolving horizons, where information is repeatedly updated and may interfere across memories, requiring accurate recall and aggregated reasoning over multiple pieces of information. However, existing benchmarks focus on static, independent …
Recent benchmarks for Large Language Model (LLM) agents mainly evaluate reasoning, planning, and execution. However, memory is also essential for agents, as it enables them to store, update, and retrieve information over time. This ability remains under-evaluated, largely because…
Safety evaluations of memory-equipped LLM agents typically measure within-task safety: whether an agent completes a single scenario safely, often under adversarial conditions such as prompt injection or memory poisoning. In deployment, however, a single agent serves many independ…
Memory systems for AI assistants were built for single-user dialogue and fail characteristically when applied to multi-party social group settings. This gap matters for the social assistants being built today: group-acting agents embedded in chat platforms, and proactive personal…
MemForest presents a memory framework for long-context LLM agents that improves scalability and reduces latency through parallel chunk extraction and hierarchical temporal indexing.
Can LLM agents improve decision-making through self-generated memory without gradient updates? We propose FORGE (Failure-Optimized Reflective Graduation and Evolution), a staged, population-based protocol that evolves prompt-injected natural-language memory for hierarchical ReAct…
Memory systems often organize user-agent interactions as retrievable external memory and are crucial for long-running agents by overcoming the limited context windows of LLMs. However, existing memory systems invoke LLMs to process every incoming interaction for memory extraction…
Large language model (LLM) agents require long-term memory to leverage information from past interactions. However, existing memory systems often face a fidelity-efficiency trade-off: raw dialogue histories are expensive, while flat facts or summaries may discard the structure n…
Existing benchmarks for multimodal memory reasoning largely evaluate systems within pre-assembled contexts, but under-evaluate whether agents can use evidence distributed across independently originated sources. We argue that source-distributed memory composition is an important …
Memory data are ubiquitous in Large Language Model (LLM)-based agents (e.g., OpenClaw and Manus). A few recent works have attempted to exploit agents' memory for improving their performance on the question-answering (QA) task, but they lack a principled mechanism for effectively m…
Large language models (LLMs) achieve strong performance across a wide range of tasks, but remain frozen after pretraining until subsequent updates. Many real-world applications require timely, domain-specific information, motivating the need for efficient mechanisms to incorporat…
arXiv cs.AI
TIER_1English(EN)·Jorge Alberto Hidalgo Toledo·
Large language models (LLMs) have been extensively studied from computational and cognitive perspectives, yet their behavior as communicative actors in socially structured contexts remains underexplored. This study examines whether LLM-based multi-agent systems exhibit systematic…
Large Language Model (LLM) agents increasingly serve as personal assistants and workplace collaborators, where their utility depends on memory systems that extract, retrieve, and apply information across long-running conversations. However, both existing memory systems and benchm…
Memory-augmented LLM agents have advanced personalized recommendation, yet existing approaches universally adopt flat memory representations that conflate ephemeral signals with stable preferences, and none provides a complete lifecycle governing how memory should evolve. We prop…
Long-term memory is crucial for agents in specialized web environments, where success depends on recalling interface affordances, state dynamics, workflows, and recurring failure modes. However, existing memory benchmarks for agents mostly focus on user histories, short traces, o…
Recent advances in reinforcement learning from human feedback (RLHF) and preference optimization have substantially improved the usability, coherence, and safety of large language models. However, recurring behaviors such as performative certainty, hallucinated continuity, calibr…
Modern GUI agents typically rely on a model-centric and step-wise interaction paradigm, where LLMs must re-interpret the UI and re-decide actions at every screen, which is fragile in long-horizon tasks. In this paper, we propose Executable Agentic Memory (EAM), a structured Knowl…
Long-horizon language agents accumulate conversation history far faster than any fixed context window can hold, making memory management critical to both answer accuracy and serving cost. Existing approaches either expand the context window without addressing what is retrieved, p…
LLM-based conversational AI agents struggle to maintain coherent behavior over long horizons due to limited context. While RAG-based approaches are increasingly adopted to overcome this limitation by storing interactions in external memory modules and performing retrieval from th…
Long-horizon language agents must operate under limited runtime memory, yet existing memory mechanisms often organize experience around descriptive criteria such as relevance, salience, or summary quality. For an agent, however, memory is valuable not because it faithfully descri…
Does a lexical retriever suffice as large language models (LLMs) become more capable in an agentic loop? This question naturally arises when building deep research systems. We revisit it by pairing BM25 with frontier LLMs that have better reasoning and tool-use abilities. To supp…
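For readers unfamiliar with the lexical retriever named in the item above, BM25 scoring can be sketched in a few lines. This is a minimal illustration of the standard Okapi BM25 formula, not the paper's setup: the whitespace tokenizer, the `k1` and `b` values, and the sample documents are all assumptions made for the example.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase whitespace tokenization; real systems use richer analyzers.
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical mini-corpus, for illustration only.
docs = [
    "agent memory retrieval with bm25",
    "vector search for code repositories",
    "bm25 lexical retriever in an agentic loop",
]
scores = bm25_scores("bm25 retriever", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

In an agentic loop, a scorer like this would sit behind a search tool that the model calls repeatedly with reformulated queries; the sketch covers only the ranking step.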
To tackle long-context reasoning tasks without the quadratic complexity of standard attention mechanisms, approaches based on agent memory have emerged, which typically maintain a dynamically updated memory when linearly processing document chunks. To mitigate the potential loss …
As 6G evolves, the radio access network must transcend traditional automation to embrace agentic AI capable of perception, reasoning, and evolution. A fundamental cognitive gap persists in current disaggregated architectures, where interfaces force the physical layer to compress …
Existing benchmarks for multimodal agentic search evaluate multimodal search and visual browsing, but visual evidence is either confined to the input or treated as an answer endpoint rather than part of an interleaved search trajectory. We introduce InterLV-Search, a ben…
arXiv cs.AI
TIER_1English(EN)·Huyu Wu, Jun Liu, Xiaochi Wei, Yan Gao, Yi Wu, Yao Hu·
arXiv:2605.05702v1 Announce Type: new Abstract: Self-evolving search agents reduce reliance on human-written training questions by generating and solving their own search tasks. We build on Search Self-Play (SSP), a representative Proposer and Solver framework in which questions …
arXiv:2604.20050v2 Announce Type: replace-cross Abstract: Can Large Language Models (AI agents) aggregate dispersed private information through trading and reason about the knowledge of others by observing price movements? We conduct a controlled experiment where AI agents trade …
arXiv:2510.12635v3 Announce Type: replace Abstract: Long-context Large Language Models, despite their expanded capacity, require careful working memory management to mitigate attention dilution during long-horizon tasks. Yet existing approaches rely on external mechanisms that la…
arXiv cs.AI
TIER_1English(EN)·Zhuofeng Li, Haoxiang Zhang, Cong Wei, Pan Lu, Ping Nie, Yi Lu, Yuyang Bai, Shangbin Feng, Hangxiao Zhu, Ming Zhong, Yuyu Zhang, Jianwen Xie, Yejin Choi, James Zou, Jiawei Han, Wenhu Chen, Jimmy Lin, Dongfu Jiang, Yu Zhang·
arXiv:2605.05242v1 Announce Type: cross Abstract: Modern retrieval systems, whether lexical or semantic, expose a corpus through a fixed similarity interface that compresses access into a single top-k retrieval step before reasoning. This abstraction is efficient, but for agentic…
arXiv:2605.06285v1 Announce Type: cross Abstract: Single-step retrieval-augmented generation (RAG) provides an efficient way to incorporate external information for simple question answering tasks but struggles with complex questions. Agentic RAG extends this paradigm by replacin…
arXiv:2605.05538v1 Announce Type: new Abstract: We present AgenticRAG, a practical agentic harness for retrieval and analysis over enterprise knowledge bases. Standard RAG pipelines place significant burden of grounding on the search stack, constraining the language model to a fi…
arXiv cs.CL
TIER_1English(EN)·Junfeng Liao, Qizhou Wang, Jianing Zhu, Bo Du, Rui Yan, Xiuying Chen·
arXiv:2605.05583v1 Announce Type: cross Abstract: LLM agents that operate over long context depend on external memory to accumulate knowledge over time. However, existing methods typically store each observation as a single deterministic conclusion (e.g., inferring "API X failed"…
arXiv:2605.06132v1 Announce Type: new Abstract: In agent memory systems, the reranking model serves as the critical bridge connecting user queries with long-term memory. Most systems adopt the "retrieve-then-rerank" two-stage paradigm, but generic reranking models rely on semanti…
arXiv cs.LG
TIER_1English(EN)·Zeyu Yang, Qi Ma, Jason Chen, Anshumali Shrivastava·
arXiv:2605.06647v1 Announce Type: cross Abstract: Retrieval-augmented agents are increasingly the interface to large organizational knowledge bases, yet most still treat retrieval as a black box: they issue exploratory queries, inspect returned snippets, and iteratively reformula…
Retrieval-augmented agents are increasingly the interface to large organizational knowledge bases, yet most still treat retrieval as a black box: they issue exploratory queries, inspect returned snippets, and iteratively reformulate until useful evidence emerges. This approach re…
Single-step retrieval-augmented generation (RAG) provides an efficient way to incorporate external information for simple question answering tasks but struggles with complex questions. Agentic RAG extends this paradigm by replacing single-step retrieval with a multi-step process,…
In agent memory systems, the reranking model serves as the critical bridge connecting user queries with long-term memory. Most systems adopt the "retrieve-then-rerank" two-stage paradigm, but generic reranking models rely on semantic similarity matching and lack genuine reasoning…
arXiv cs.CL
TIER_1English(EN)·Joshua Adler, Guy Zehavi·
arXiv:2605.04897v1 Announce Type: new Abstract: Extraction at ingestion is the wrong primitive for agent memory: content discarded before the query is known cannot be recovered at retrieval time. We propose True Memory, a six-layer architecture that shifts the center of the syste…
Long-horizon search agents must manage a rapidly growing working context as they reason, call tools, and observe information. Naively accumulating all intermediate content can overwhelm the agent, increasing costs and the risk of errors. We propose that effective context manageme…
Extraction at ingestion is the wrong primitive for agent memory: content discarded before the query is known cannot be recovered at retrieval time. We propose True Memory, a six-layer architecture that shifts the center of the system from a storage schema to a multi-stage retriev…
arXiv:2605.02491v1 Announce Type: cross Abstract: Modern searches for physics beyond the Standard Model produce rapidly expanding literature containing heterogeneous information, including textual analyses, numerical datasets, and graphical exclusion limits. Integrating these dis…
arXiv:2605.04018v1 Announce Type: new Abstract: Reasoning-intensive retrieval aims to surface evidence that supports downstream reasoning rather than merely matching topical similarity. This capability is increasingly important for agentic search systems, where retrievers must pr…
Reasoning-intensive retrieval aims to surface evidence that supports downstream reasoning rather than merely matching topical similarity. This capability is increasingly important for agentic search systems, where retrievers must provide complementary evidence across iterative se…
Modern searches for physics beyond the Standard Model produce rapidly expanding literature containing heterogeneous information, including textual analyses, numerical datasets, and graphical exclusion limits. Integrating these distributed sources remains a time-consuming and manu…
arXiv cs.CV
TIER_1English(EN)·Can Lin, Tao Feng, Hangjie Yuan, Dan Zhang, Yifan Zhu, Zhonghong Ou·
arXiv:2606.10522v1 Announce Type: new Abstract: Graphical User Interfaces (GUIs) serve as the dominant medium for human-computer interaction, yet building GUI agents that generalize across the vast diversity of real-world interface environments, with the same flexibility and robu…
Graphical User Interfaces (GUIs) serve as the dominant medium for human-computer interaction, yet building GUI agents that generalize across the vast diversity of real-world interface environments, with the same flexibility and robustness that humans naturally exhibit, remains un…
arXiv:2606.00183v1 Announce Type: cross Abstract: Tree search is a central abstraction behind many language-agent reasoning and decision-making tasks: agents must explore actions, remember failures, and backtrack toward promising alternatives. Yet, we lack a theoretical understan…
arXiv:2605.30711v1 Announce Type: cross Abstract: Agentic LLMs must continuously decide whether newly extracted facts should be added, merged with existing memories, or ignored, yet prior work has focused more on retrieval and storage than on principled write-side control. We fra…
arXiv:2605.31075v1 Announce Type: new Abstract: Long-term memory is essential for multimodal agents to build coherent experience, accumulate world knowledge, and achieve continual learning. However, constructing effective memory goes beyond memory module design and basic requirem…
Tree search is a central abstraction behind many language-agent reasoning and decision-making tasks: agents must explore actions, remember failures, and backtrack toward promising alternatives. Yet, we lack a theoretical understanding of how transformer-based policies acquire suc…
Long-term memory is essential for multimodal agents to build coherent experience, accumulate world knowledge, and achieve continual learning. However, constructing effective memory goes beyond memory module design and basic requirements such as accuracy and fidelity; the key chal…
arXiv:2605.29471v1 Announce Type: new Abstract: Collaborative driving systems leverage vehicle-to-everything (V2X) communication for multi-agent collaborative perception to enhance driving safety, yet they remain constrained by scarce annotated real-world V2X driving datasets and…
Agentic LLMs must continuously decide whether newly extracted facts should be added, merged with existing memories, or ignored, yet prior work has focused more on retrieval and storage than on principled write-side control. We frame memory evolution as a novelty-detection problem…
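The add / merge / ignore decision described in the item above can be illustrated with a toy novelty check. This is only a sketch under stated assumptions, not the paper's method: the Jaccard token overlap stands in for whatever novelty score the authors use, and the `merge_at` and `ignore_at` thresholds are hypothetical values chosen for the example.

```python
def jaccard(a, b):
    """Token-set overlap in [0, 1]; a stand-in for a learned novelty score."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def write_decision(fact, memories, merge_at=0.4, ignore_at=0.9):
    """Return (action, index), where action is "add", "merge", or "ignore".

    index points at the closest existing memory for "merge" and "ignore",
    and is None for "add".
    """
    if not memories:
        return "add", None
    sims = [jaccard(fact, m) for m in memories]
    idx = max(range(len(sims)), key=sims.__getitem__)
    if sims[idx] >= ignore_at:   # near-duplicate: nothing new to store
        return "ignore", idx
    if sims[idx] >= merge_at:    # partially novel: fold into the closest memory
        return "merge", idx
    return "add", None           # novel: store as a new memory

# Hypothetical memory store, for illustration only.
memories = ["user prefers dark mode", "project uses postgres 16"]
d1 = write_decision("user prefers dark mode", memories)
d2 = write_decision("user prefers dark mode on mobile", memories)
d3 = write_decision("deploys happen on fridays", memories)
print(d1, d2, d3)
```

The two thresholds make the write-side policy explicit and tunable, which is the kind of control the abstract says prior work has tended to leave implicit.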
The ability to navigate and interact with complex environments is central to real-world embodied agents, yet navigation in unseen environments remains challenging due to "experiential amnesia," where existing trajectory-driven or reactive policies fail to synthesize generalizable…
Recent GUI agents have made substantial progress in visual grounding and action prediction, yet they remain brittle in long-horizon tasks that require maintaining task state across many interface transitions. Existing agents typically rely on raw history replay or text-only memor…
Long-term agent memory is increasingly multimodal, yet existing evaluations rarely test whether agents preserve the visual evidence needed for later reasoning. In prior work, many visually grounded questions can be answered using only captions or textual traces, allowing answers …
Real-world inference benchmarks for coding agents: 31% more TPS than TensorRT-LLM, 2× better TTFT at saturation, and 76% lower cost than Claude Opus 4.6.
Part 3 of the series: Building Your AI Developer Handbook (https://dev.to/panditabhis/how-i-turned-claude-into-a-disciplined-senior-developer-not-just-a-fast-one-1a59). The Goldfish Problem: By default, every Claude session starts completel…
UIUC and Chroma's Harness-1 is a 20B retrieval subagent trained with reinforcement learning inside a stateful search harness. The harness maintains the bookkeeping (candidate pool, importance-tagged curated set, evidence graph, verification records) while the policy decides …
Researchers at the University of Science and Technology of China (USTC) have open-sourced a novel agent-driven long-context training paradigm that achieves breakthrough efficiency — a 30-billion-parameter model matching the performance of Alibaba'...
dev.to — Claude Code tag
TIER_1English(EN)·Harrison Guo·
This post is one half of a pair. The other half, Agent Retrieval Is a Cost Curve Problem (https://harrisonsec.com/blog/agent-retrieval-cost-curve-claude-code-grep-vs-rag/), argues that Claude Code's within-session code retrieval…
dev.to — Claude Code tag
TIER_1English(EN)·Odilon HUGONNOT·
Each roast was taking 50 seconds per upload. Quality was unknown: we had a feeling, not data. The prompt had been written "by instinct" and never seriously evaluated. The question was simple: how do you know if a prompt is good, and how do you improve it without spending the …
dev.to — Claude Code tag
TIER_1English(EN)·Harrison Guo·
There's a popular interview question making the rounds: "Why doesn't Claude Code use RAG to retrieve code? Why grep?" The popular answer goes: chunking breaks code structure, vectors approximate when code demands exact, indexes go stale, cold-start is slow, ret…
Tencent has open-sourced TencentDB Agent Memory, a fully local memory system for AI agents released under the MIT license. The project pairs symbolic short-term memory, which offloads verbose tool logs into a compact Mermaid task canvas, with a 4-tier long-term memory pyramid …
dev.to — Claude Code tag
TIER_1English(EN)·Toni Antunovic·
This article was originally published on LucidShark Blog (https://lucidshark.com/blog/multi-agent-transitive-prompt-injection-coding-pipelines-2026). The upgrade from single-agent to multi-agent coding workflows felt like a…
AI agents start every session from zero, with no memory of meetings, notes, or decisions. GBrain, the open-source memory layer Y Combinator's Garry Tan built to power his own OpenClaw and Hermes deployments, fixes that with a markdown-first knowledge graph that wires itself throug…
dev.to — Claude Code tag
TIER_1English(EN)·Michael Tuszynski·
The current "Hermes Agent vs Claude Code" (https://www.youtube.com/results?search_query=hermes+agent+vs+claude+code) framing is the wrong comparison. The two tools live at different layers of the coding agent stack, and most of the YouTube…
dev.to — Claude Code tag
TIER_1English(EN)·The Hive Collective·
Run Claude Code on real work for a while and you notice the same thing. Your agent figures out a non-obvious thing (a Postgres VACUUM quirk, a Tailwind v4 + shadcn collision, a Next.js caching gotcha) and that knowledge dies with the conversation. The next agent…
dev.to — Claude Code tag
TIER_1English(EN)·Theo Valmis·
Anthropic's managed-agent harness solves one hard problem: continuity. Progress logs, feature lists, git checkpoints, and startup scripts give each new session a map of what happened. But continuity is not governance. As agents work across more sessions, the quest…
dev.to — Claude Code tag
TIER_1English(EN)·Andrew·
Originally published on andrew.ooo (https://andrew.ooo/posts/agentmemory-persistent-memory-ai-coding-agents-review/); visit the original for any updates, code snippets that aged out, or follow-up posts.…
dev.to — Claude Code tag
TIER_1Français(FR)·Michel Faure·
A support agent tells a customer their plan is still Enterprise, even though finance downgraded it last week. A coding copilot forgets a repo convention it learned yesterday. A personal assistant remembers your old home address and uses it to book a service call. These are not…
This article was written with help from Claude (an AI). I reviewed and edited it before publishing. The gap between Claude Code and the web app: If you've lived in Claude Code for a while and then go back to the web version of an AI …
I made kioku-mesh, which shares long-term memory for AI agents across multiple PCs and across multiple agents. kioku (記憶) means memory in Japanese.…
Medium — MCP tag
TIER_1Português(PT)·Flavio Santos·
Why your agent’s input footprint doesn’t have to grow with conversation length and what changes when it stops. If you’ve shipped anything with an LLM in the loop,…
Medium — AI coding tag
TIER_1English(EN)·Amin Tazifor·
I shipped a discipline three weeks ago. The write half worked; the read half didn’t. Three Python scripts, two hooks, and one honest…
<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xC5VTBj-n8azsxkbUe7Q4Q.jpeg" /><figcaption>AI Agent Memory Architecture</figcaption></figure><p>Most AI agent memory failures do not look dramatic. The agent simply remembers the wrong thing with confidence, forg…
<p>Every multi-agent setup I tried ran into the same wall: the agents couldn't remember anything together.</p> <p>Each Claude Code session started cold. Two agents working the same repo had no idea what the other had done. The "shared context" I kept building turned into a gravey…
<p>If you are using any coding agent for long-running implementation/debugging tasks, you might have already run into this problem:</p><p>The agent writes a plan.<br />You agree on the plan.<br />Implementation starts.</p><p>Then reality changes in the implementation/testing phase.</p>…
Medium — AI coding tag
TIER_1English(EN)·IAKH Studio·
Mnemosyne – Memory for AI Hermes Agents, Sub-Millisecond Recalls, Local First Mnemosyne is a local-first memory system for Hermes AI agents, offering SQLite-based sub-millisecond response times and 100% privacy. It runs fully offline with no cloud or external APIs, and supports vector search and hybrid ranking for fast, accurate memory recall. Through its BEAM architecture, working memory, episodic memory, scratch…
GBrain is a new open-source memory layer for AI agents built by Y Combinator's Garry Tan. It uses a markdown-first knowledge graph that auto-wires itself through regex inference, requiring zero LLM calls. His production brain already holds 146,646 pages, 24,585 people and 5,339 c…
CLI vs MCP: Which Tool Interface Actually Works for AI Coding Agents? A technical comparison of CLI tools and Model Context Protocol for AI coding agents. Covers token cost, reliability, composability, and setup friction so you can pick the right interface. https://pickuma.com/p…
Automate Python Code Reviews with Free Local LLMs and GitHub Actions Wire an open-weight model running in Ollama into a GitHub Actions workflow to get automated first-pass code-review comments on Python pull requests — no API bill required. https://pickuma.com/posts/automate-pyt…
Why AI Agents Forget: Memory Decay and Context Contamination Explained How context-window limits, the lost-in-the-middle effect, and stale data cause long-running AI coding agents to lose track — and what you can do about it. https://pickuma.com/posts/why-ai-agents-forget-memor…
<p>Every time you open a new chat in Cursor, VS Code, Antigravity and even Claude Desktop, you paste your codebase back in. Or you let the IDE do it automatically, same result. You're burning context tokens on files the agent already "knew" ten minutes ago in a different window. …
<h2> The problem nobody talks about </h2> <p>When you run multiple AI agents, each one starts completely fresh. <br /> Zero knowledge of what other agents learned, decided, or remembered.</p> <p>Agent A spends an hour learning your codebase structure. <br /> Agent B starts tomorr…
<h1> Reviewable Memory Consolidation for Local AI Agents </h1> <p>AI memory is usually sold as recall.</p> <p>That is only the first problem.</p> <p>A serious agent does not merely need to remember more. It needs a way to keep its memory from decaying into duplicates, stale facts…
<p>AI assistants are useful, but they often forget important details between sessions. That makes it hard to keep track of decisions, project notes, bugs, and tasks.</p> <p><code>devmcp-context</code> solves that by giving your agent a simple memory layer that lives in your proje…
Towards AI
TIER_1English(EN)·Ampatishan Sivalingam·
<div class="medium-feed-item"><p class="medium-feed-snippet">Every AI coding agent — Claude Code, Cursor, GitHub Copilot, OpenCode — reads its own config file. I was maintaining the same project…</p><p class="medium-feed-link"><a href="https://medium.com/@dil…
<p>Every AI agent you build today can hold a conversation. It can reason, use tools, and chain together complex workflows. But the moment a session ends, everything disappears. The agent forgets who you are, what you were working on, and every preference it learned during the con…
<h2> The Memory Problem in AI Agents </h2> <p>Modern LLMs are incredibly powerful, but they have a fundamental limitation: <strong>they forget everything between conversations</strong>. Every time you start a new session with an AI agent, it's like talking to someone with amnesia…
<p>I kept running into the same problem with AI coding agents.</p> <p>The agents were getting better, but every new session still felt like starting<br /> from zero.</p> <p>I would explain the repo again. Then my preferences again. Then the decisions we<br /> already made. Then w…
<p>An AI agent can look brilliant for ten minutes and lost after ten steps.</p> <p>It starts with a clean plan. Then the agent reads docs, calls tools, rewrites files, summarizes a customer ticket, checks a policy, and tries to continue. Somewhere in that loop, it forgets why a d…
<p>"Agent memory" usually means a vector database: embed everything the user said, query by similarity, paste the top matches into the prompt. It's a useful trick, but it isn't memory. It's a lookup table that never learns, never forgets correctly, and can't tell you what was tru…
<p>The same thing that makes a helpful habit stick in an AI agent is exactly what lets an attacker reprogram it. I know because I almost shipped the attack myself - with the best intentions.</p> <p>I'd given my agents a harmless efficiency rule: prefer the cheap, narrow tools, an…
dev.to — LLM tag
TIER_1English(EN)·Debbie Shapiro·
<p>A major AI memory provider published their own research this spring measuring how well their system actually works in production. The controlled benchmark result was impressive: over ninety percent accuracy on standard evaluation corpora. The production result at thirty days w…
"Skill Is Not Document: A Query-Conditional Benchmark and Two-Stage Retriever for LLM Agent Skill Routing" Skills used by LLMs need a different retrieval pipeline from the one used for document retrieval, because skills can conflict with one another. https://arxiv.org/abs…
<h2> Memory Isn't Just "Store the Chat Log" </h2> <p>Dumping conversation history into the prompt is the crudest form of memory. Real systems have more complex needs:</p> <ul> <li>The user mentioned their city in turn 3; the Agent should know where to look when they ask about wea…
<h1> AI agent memory management: beyond the context window </h1> <p>Your agent answered correctly five minutes ago. Now it's asking for the same information again. The context window filled up, the early messages got evicted, and all that history is gone.</p> <p>This is not a hal…
<p>Your agent forgets everything when the context window ends. The usual fix is to wire a vector DB, write ingest/retrieve glue, and babysit it. There's a faster path: plug a memory API into<br /> the agent over <strong>MCP</strong> and let the model call <code>add_memory</code> …
🧠 ARN provides a local semantic memory server designed for AI agents that runs on Raspberry Pi 5 hardware with 22-millisecond recall times. The system passes 10 out of 10 tests in its evaluation framework. 💬 Hacker News 🔗 https://github.com/tuuhe99-del/ARN-Adaptive-Reasoning-Ne…
<p>The industry has been asking the wrong question.</p> <p>When Boris Cherny — the creator and Head of Claude Code — revealed on the Latent Space podcast that Anthropic's flagship coding agent had abandoned RAG entirely and switched to what he called "Agentic Search," the discour…
<p>If you're running AI agents in production, there's a cost you're probably not thinking about.</p> <p>Every turn in an agentic conversation sends the full prompt to the model. That includes the system instructions, all the tool definitions, any project context that was loaded e…
Tencent Open-Sources TencentDB Agent Memory: A 4-Tier Local Memory Pipeline for AI Agents Tencent has open-sourced TencentDB Agent Memory, a fully local memory system for AI agents released under t... #Agentic #AI #AI #Infrastructure #Applications #Artificial #Intelligence #Edito…
<p>How do you make an AI agent actually remember?</p> <p><a class="article-body-image-wrapper" href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffxsjom0x…
<p>How three AI agents can collaborate on a complex task by sharing a folder of markdown files — and nothing else.</p> <p><a class="article-body-image-wrapper" href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%…
dev.to — LLM tag
TIER_1English(EN)·Vaishnavi Gudur·
<p>If you're building AI agents with Flowise, Dify, n8n, or similar no-code/low-code platforms, there's a security threat you probably haven't thought about: <strong>memory poisoning</strong>.</p> <p>And it's not theoretical. It's in the <a href="https://owasp.org/www-project-top…
<p>Ask a stateless AI agent about something you told it last week — it remembers nothing. That's the core problem <strong>memory tools</strong> solve.</p> <p>In 2026, long-term memory for AI agents has become one of the hottest areas in the ecosystem, with dedicated tools like <s…
dev.to — LLM tag
TIER_1English(EN)·Vaishnavi Gudur·
<h2> Securing LangGraph Multi-Agent Workflows Against Memory Poisoning (ASI06) </h2> <p>LangGraph has become the de facto standard for building complex, multi-agent workflows. Its core abstraction—the state graph—allows developers to build cyclic, stateful applications where agen…
MemSkill reframes LLM-agent memory operations as a learnable skill bank: an RL controller selects Top-K skills per span, an LLM designer periodically rewrites them from hard cases. But "self-evolving" overstates the test-time story — both controller and bank are trained offline a…
<h1> Your AI Agent's Memory is a Security Hole — Here's the Fix </h1> <p>I've been working on AI agent security for the past few months as part of the <a href="https://owasp.org/www-project-top-10-for-large-language-model-applications/" rel="noopener noreferrer">OWASP Top 10 for …
<h1> The Bug That Forced Us to Add Agent Memory </h1> <p><strong>Project:</strong> Nexus Core AI OS<br /> <strong>Stack:</strong> Hindsight (persistent memory) · cascadeflow (runtime intelligence & routing)</p> <h2> 1. Introduction </h2> <p>I didn't plan to build a memory sys…
Android and AI: is 128 GB of storage becoming insufficient? As artificial intelligence features advance on Android, smartphone storage space risks becoming an increasingly critical bottleneck. At the center of the problem is AICore, …
🤖 Which project/framework has actually nailed persistent memory for AI agents? Not talking about the LLM itself but about the memory layer on top. There are quite a few out there now, open source ones and proprietary frameworks. Curious what people have actually tried and stu... …
Hermes Memory Installer Review: One-Command Persistent Memory for Local AI Agents Nous Research's Hermes Memory Installer adds local persistent memory to AI agents with one shell command. We compare its file-based approach to Mem0 and Letta. https://pickuma.com/posts/hermes-memo…
<h2>From Stateless Prompts to Persistent Intelligence</h2> <blockquote> <strong>Where this fits:</strong> This article bridges two series. It closes out the themes introduced in The Backyard Quarry — a data engineering exploration using physical objects as a teaching domain — and…
🧠 Graft provides a semantic memory system for AI agents that operates independently of large language models. The tool allows agents to store and retrieve information based on meaning rather than exact text matching. 💬 Hacker News 🔗 https://github.com/AEndrix03/Graft #AI #Mach…
<p>In the world of Large Language Models (LLMs), we often face a frustrating paradox: LLMs are incredibly capable at "reasoning" in the moment, but they are fundamentally <strong>stateless</strong>. Every time you start a new session, the agent has total amnesia. It doesn't remem…
<p><em>Originally published on <a href="https://www.poniaktimes.com/subq-model-efficient-long-context-ai/" rel="noopener noreferrer">Poniak Times</a>. Reposted here for the developer and AI engineering community.</em></p> <p>Subquadratic’s SubQ model claims to make long-context A…
dev.to — LLM tag
TIER_1English(EN)·Jonathanfarrow·
<p>If you are building agents in 2026, you have already hit the wall. Bigger models do not fix forgetfulness. Context windows can grow forever, and the agent still cannot remember what a user told it last Tuesday, that the customer's address changed three months ago, or that a re…
<blockquote> <p><em>This article was originally published on <a href="https://dingjiu1989-hue.github.io/en/ai/ai-agents-memory-patterns.html" rel="noopener noreferrer">AI Study Room</a>. For the full version with working code examples and related articles, visit the original post…
<blockquote> <p>English version: <a href="https://dev.to/tirsogarcia/building-kernel-memory-protocol-navigable-memory-for-ai-agents-315j">Building Kernel Memory Protocol: Navigable Memory for AI Agents</a></p> </blockquote> <p>The problem with many AI agents is not that they lack…
<blockquote> <p>Versión en español: <a href="https://dev.to/tirsogarcia/construyendo-kernel-memory-protocol-memoria-navegable-para-agentes-de-ia-24lc">Construyendo Kernel Memory Protocol: memoria navegable para agentes de IA</a></p> </blockquote> <p>The hard part with many AI age…
<h1> How Agentic Search Actually Works: The Research Loop Link-Fetching Agents Miss </h1> <p>Most agent tutorials show you the same pattern: take a user query, call a search API, grab the top result, stuff the text into your prompt. Done. Ship it.</p> <p>That works fine for trivi…
How to design short-term, long-term, and structured memory for AI assistants, with retrieval mechanics, tradeoffs, failure modes, and real patterns from OpenAI, LangGraph, Hermes, and OpenClaw. #Hermes #OpenClaw #Architecture #LLM #AI #RAG #SelfHosting https://www.glukhov…
Universal Memory Protocol proposes a shared format for agent memory across AI systems. Standardizing how agents store and retrieve context sounds useful — but it also means a new shared attack surface: poisoned memories, cross-agent leakage, persistent manipulation. Worth watchin…
<!-- SC_OFF --><div class="md"><p>I’ve been building my own persistent memory layer for coding agents, and along the way I realized something surprising:</p> <p>Most memory systems out there are basically **just session-based retrieval**. They don’t forget, they don’t manage life…
<!-- SC_OFF --><div class="md"><p>I'm a biologist and software developer. PhD in genetics, and ~20 years building software products. So I think I have a different view on things like memory. My thoughts on how memory with a coding agent should work:</p> <p>Tuesday morning. New se…