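Key Takeaways
- Problem / Background
- This paper studies a practical problem in self-evolving LLM agents: when past interaction experience is internalized into reusable parametric capability, additional iterations do not necessarily make the model stronger, and can instead cause progressive capability collapse.
- Method / Mechanism
- The authors systematically dissect three axes: experience granularity, experience injection pattern, and internalization regime. The results show that principle-level experience is more durable than instance-level experience, and that step-wise injection suits long-horizon tool use better than global injection.
- Results / Evidence
- On the training signal, off-policy context distillation on high-quality teacher trajectories is more stable than on-policy local corrections, because the latter tend to patch locally around the student's own flawed states.
- Why It Is Included
- Its value lies in supplying a negative result and design boundaries for agent memory / continual capability acquisition: experience internalization is not simply repeated distillation; the abstraction level, the injection position, and the data distribution all have to match.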
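The difference between global and step-wise injection can be made concrete with a toy prompt builder. This is a minimal sketch, not the authors' implementation: the function names, the keyword-based retrieval, and the prompt layout are all assumptions made for illustration, since the abstract does not specify how experience is retrieved or formatted.

```python
# Hypothetical illustration of two experience injection patterns.
# None of these names come from the paper; retrieval is a toy keyword match.

def retrieve_principles(state: str, principles: dict[str, str]) -> list[str]:
    """Return the principles whose keyword occurs in the current state."""
    lowered = state.lower()
    return [text for keyword, text in principles.items() if keyword in lowered]


def global_injection(task: str, principles: dict[str, str]) -> str:
    """Global pattern: place all experience once, before the episode starts."""
    block = "\n".join(f"- {text}" for text in principles.values())
    return f"Experience:\n{block}\n\nTask: {task}"


def stepwise_injection(
    task: str, states: list[str], principles: dict[str, str]
) -> list[str]:
    """Step-wise pattern: build one prompt per intermediate decision state,
    each carrying only the experience matched to that state."""
    prompts = []
    for index, state in enumerate(states, start=1):
        relevant = retrieve_principles(state, principles)
        block = "\n".join(f"- {text}" for text in relevant) or "- (none)"
        prompts.append(
            f"Task: {task}\nStep {index} state: {state}\n"
            f"Relevant experience:\n{block}"
        )
    return prompts
```

In this sketch the global prompt carries every principle regardless of where the agent is in the episode, while each step-wise prompt carries only what matches the current state, which is one plausible reading of "aligning experience with intermediate decision states".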
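Original Abstract and Translation
Translated Abstract (rendered in English)
Rethinking Continual Experience Internalization for Self-Evolving LLM Agents. Experience internalization turns contextual experience from past interactions into reusable parametric capability, offering a promising path toward continual learning for large language models (LLMs). Although prior work has focused mainly on single-iteration transfer, we find that under multi-iteration experience learning, existing methods suffer progressive capability collapse rather than compounding improvement. We examine this failure systematically along three key dimensions of experience internalization: (1) Experience granularity: principle-level experience is more durable than instance-level experience, because it abstracts transferable strategies away from trajectory-specific details. (2) Experience injection pattern: step-wise injection significantly outperforms global injection by aligning experience with intermediate decision states, a property that is critical for long-horizon tool use. (3) Internalization regime: off-policy context distillation on high-quality teacher trajectories gives a more stable training signal than on-policy context distillation, which is inherently limited to local corrections on flawed states induced by the student. Taken together, these insights yield a simple yet robust recipe for stable and sustainable experience internalization, and offer concrete guidance for building self-evolving, continually learning LLMs. The code and data for this work are available at https://github.com/RUCBM/ExpInternalization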
Original Abstract
Experience internalization converts contextual experience from past interactions into reusable parametric capability, offering a promising path toward continual learning in large language models (LLMs). While prior work has predominantly focused on single-iteration transfer, we discover that under multi-iteration experience learning, existing methods suffer from a progressive capability collapse rather than compounding improvement. We systematically examine this failure through three vital dimensions of experience internalization: (1) Experience Granularity: We find that principle-level experience is more durable than instance-level experience, as it effectively abstracts transferable strategies away from trajectory-specific details. (2) Experience Injection Pattern: Our analysis reveals that step-wise injection significantly outperforms global injection by aligning experience with intermediate decision states, a property that is critical for long-horizon tool use. (3) Internalization Regime: We demonstrate that off-policy context-distillation on high-quality teacher trajectories provides a substantially more stable training signal than on-policy context-distillation, which is inherently limited by local corrections on student-induced flawed states. Together, these insights yield a simple yet robust recipe for stable and sustainable experience internalization, providing concrete guidance for engineering self-evolving and continually learning LLMs. The code and data for this work are available at https://github.com/RUCBM/ExpInternalization.
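To illustrate dimension (3), the following sketch contrasts how training pairs might be collected under the two regimes. It is an assumption-laden toy, not the released code: `Policy`, `with_experience`, `visit_states`, and the string-append transition are hypothetical stand-ins for LLM calls and a real tool environment. In both regimes the training input is the bare state (no experience in context) and the target is the action of a teacher that does see the experience; the regimes differ only in whose rollout determines the states.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for an LLM call: maps a prompt to an action string.
Policy = Callable[[str], str]


def with_experience(state: str, experience: str) -> str:
    """Prompt seen by the teacher: experience is placed in context."""
    return f"Experience: {experience}\nState: {state}"


def visit_states(
    actor: Policy, start: str, horizon: int, experience: str, use_experience: bool
) -> List[str]:
    """States visited when `actor` drives the rollout.
    Toy transition: the next state is the old state plus the chosen action."""
    states, state = [], start
    for _ in range(horizon):
        states.append(state)
        prompt = with_experience(state, experience) if use_experience else state
        state = f"{state}|{actor(prompt)}"
    return states


def off_policy_pairs(
    teacher: Policy, start: str, horizon: int, experience: str
) -> List[Tuple[str, str]]:
    """Teacher (with experience) drives the rollout and labels its own states."""
    states = visit_states(teacher, start, horizon, experience, use_experience=True)
    return [(s, teacher(with_experience(s, experience))) for s in states]


def on_policy_pairs(
    student: Policy, teacher: Policy, start: str, horizon: int, experience: str
) -> List[Tuple[str, str]]:
    """Student (no experience) drives the rollout; the teacher only supplies
    a corrected action at each state the student happened to reach."""
    states = visit_states(student, start, horizon, experience, use_experience=False)
    return [(s, teacher(with_experience(s, experience))) for s in states]
```

With a toy teacher that always answers `"good"` and a student that always answers `"bad"`, the off-policy pairs cover states along the teacher's trajectory, while the on-policy pairs cover states reached through the student's own mistakes. That is the sense in which the abstract describes on-policy signals as local corrections on student-induced flawed states.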