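Curation notes
This paper focuses on agent memory in dynamic environments rather than on static task scores.
EvoArena models environment change as progressive updates, and EvoMem represents memory evolution as a patch-style update history.
It is worth including because evaluating memory in dynamic environments and representing memory evolution are key interfaces for long-term agent deployment.
Its limitation is that the current evidence comes mainly from preprint experiments and the authors' own benchmark; independent replication and validation in broader deployments are still needed.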
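To make the idea of a patch-style update history concrete, here is a minimal sketch. It is not EvoMem's implementation: the abstract does not describe the actual data structures, so the `Patch` and `PatchMemory` classes, their fields, and the example keys are all hypothetical. The sketch only illustrates the general pattern of storing each change as a record so that past states and the sequence of changes can both be recovered.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class Patch:
    """One recorded change to a memory entry (hypothetical schema)."""
    step: int            # index of the environment update that caused the change
    key: str             # which memory entry changed
    old: Optional[Any]   # value before the change (None if the key was new)
    new: Optional[Any]   # value after the change (None if the key was removed)


@dataclass
class PatchMemory:
    """Keeps the current state plus the ordered history of patches."""
    state: Dict[str, Any] = field(default_factory=dict)
    history: List[Patch] = field(default_factory=list)

    def update(self, step: int, key: str, value: Optional[Any]) -> None:
        """Apply a change and record it; passing None removes the entry."""
        old = self.state.get(key)
        if old == value:
            return  # nothing changed, so no patch is recorded
        self.history.append(Patch(step, key, old, value))
        if value is None:
            self.state.pop(key, None)
        else:
            self.state[key] = value

    def changes_for(self, key: str) -> List[Patch]:
        """Return every recorded change to one entry, oldest first."""
        return [p for p in self.history if p.key == key]

    def state_at(self, step: int) -> Dict[str, Any]:
        """Rebuild the state after all patches with index <= step.

        Assumes patches were appended in non-decreasing step order.
        """
        snapshot: Dict[str, Any] = {}
        for p in self.history:
            if p.step > step:
                continue
            if p.new is None:
                snapshot.pop(p.key, None)
            else:
                snapshot[p.key] = p.new
        return snapshot


# Illustrative usage with made-up keys and values.
mem = PatchMemory()
mem.update(0, "cli.flag", "--output")
mem.update(1, "cli.flag", "--out")   # the environment renamed the flag
mem.update(2, "api.version", "v2")
print(mem.state)
print(mem.changes_for("cli.flag"))
print(mem.state_at(0))
```

Compared with overwriting a value in place, keeping the old and new value in each patch lets an agent answer both "what is true now" and "what changed, and when", which is the kind of reasoning about environment evolution that the paper attributes to EvoMem.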
Original abstract
Large language model (LLM) agents have achieved strong performance on a wide range of benchmarks, yet most evaluations assume static environments. In contrast, real-world deployment is inherently dynamic, requiring agents to continually align their knowledge, skills, and behavior with changing environments and updated task conditions. To address this gap, we introduce EvoArena, a benchmark suite that models environment changes as sequences of progressive updates across terminal, software, and social domains. We further propose EvoMem, a patch-based memory paradigm that records memory evolution as structured update histories, enabling agents to reason about environmental evolution through changes in their memory. Experiments show that current agents struggle on EvoArena, achieving an average accuracy of 39.6% across evolving terminal, software, and social-preference domains. EvoMem consistently improves performance, yielding an average gain of 1.5% on EvoArena and also improving standard benchmarks such as GAIA and LoCoMo by 6.1% and 4.8%. Beyond individual tasks, EvoMem further improves chain-level accuracy by 3.7% on EvoArena, where success requires completing a consecutive sequence of related evolutionary subtasks. Mechanistic analysis shows that EvoMem improves evidence capture in the memory, indicating better preservation of complete evolving environment states. Our results highlight the importance of modeling evolution in both evaluation and memory for reliable agent deployment. Project Page: https://aiden0526.github.io/EvoArena/ Code: https://github.com/Aiden0526/EvoArena Dataset: https://huggingface.co/collections/Aiden0526/evoarena
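The abstract separates per-task accuracy from chain-level accuracy, where success depends on a whole sequence of related subtasks. The sketch below shows one way the two numbers can differ. It assumes a chain counts as solved only if every subtask in it succeeds, which is one reading of the abstract; the exact scoring protocol is not given there, and the function name and input layout are hypothetical.

```python
from typing import Dict, List, Tuple


def task_and_chain_accuracy(chains: Dict[str, List[bool]]) -> Tuple[float, float]:
    """Return (per-task accuracy, chain-level accuracy).

    `chains` maps a chain id to the ordered pass/fail results of its subtasks.
    Assumption: a chain is solved only when all of its subtasks pass.
    """
    non_empty = [results for results in chains.values() if results]
    if not non_empty:
        raise ValueError("no subtask results provided")
    outcomes = [ok for results in non_empty for ok in results]
    task_acc = sum(outcomes) / len(outcomes)
    chain_acc = sum(all(results) for results in non_empty) / len(non_empty)
    return task_acc, chain_acc


# Made-up results for three chains, for illustration only.
example = {
    "chain_a": [True, True, True],
    "chain_b": [True, False, True],
    "chain_c": [True, True],
}
task_acc, chain_acc = task_and_chain_accuracy(example)
print(f"task accuracy:  {task_acc:.3f}")
print(f"chain accuracy: {chain_acc:.3f}")
```

Under this reading, a single failed subtask costs one point in the per-task count but zeroes out its entire chain, so chain-level accuracy is the stricter of the two measures.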