Self-Distilled Agentic Reinforcement Learning

Agents and Autonomous Science · Breakthrough tier

Published: 2026-05-14
arXiv: 2605.15155

Inclusion notes

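SDAR targets a core pain point of agentic post-training: RL provides only sparse, trajectory-level feedback, whereas on-policy self-distillation (OPSD) can provide dense token-level guidance. Applied directly to multi-turn agents, however, self-distillation becomes unstable because of trajectory drift and teacher-student mismatch.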

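The method demotes OPSD to a gated auxiliary objective, so RL remains the primary optimization backbone. The token-level teacher-student gap is modulated by a detached sigmoid gate, which reinforces teacher-endorsed positive-gap tokens and softens negative rejections.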
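As a rough illustration of the gating idea, the sketch below computes a per-token coefficient from the teacher-student log-probability gap and adds the resulting auxiliary term to an RL loss. The exact SDAR objective is not given in this summary, so everything here is an assumption made for illustration: the function names, the `gate * gap` form of the coefficient, the `temperature` and `aux_weight` parameters, and the surrogate loss itself are hypothetical. Plain Python has no autograd, so "detached" is only indicated in comments; in a real framework the coefficient would be wrapped in a stop-gradient.

```python
import math
from typing import List, Sequence


def stable_sigmoid(x: float) -> float:
    """Logistic function written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gated_token_coefficients(
    teacher_logp: Sequence[float],
    student_logp: Sequence[float],
    temperature: float = 1.0,  # hypothetical knob, not from the paper
) -> List[float]:
    """Per-token coefficients gate * gap, where gap = teacher - student.

    ASSUMPTION: this is one plausible reading of "detached sigmoid gate".
    The coefficient is meant to be treated as a constant (no gradient).
    """
    if len(teacher_logp) != len(student_logp):
        raise ValueError("teacher and student must score the same tokens")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    coeffs = []
    for t, s in zip(teacher_logp, student_logp):
        gap = t - s                               # > 0: teacher endorses the token
        gate = stable_sigmoid(gap / temperature)  # in (0, 1), exactly 0.5 at gap == 0
        coeffs.append(gate * gap)                 # negative gaps are shrunk by gate < 0.5
    return coeffs


def combined_loss(
    rl_loss: float,
    teacher_logp: Sequence[float],
    student_logp: Sequence[float],
    aux_weight: float = 0.1,  # hypothetical weight on the auxiliary term
    temperature: float = 1.0,
) -> float:
    """RL loss plus a gated distillation surrogate (illustrative only)."""
    coeffs = gated_token_coefficients(teacher_logp, student_logp, temperature)
    if not coeffs:
        return rl_loss
    # Surrogate: minimizing -coeff * student_logp raises the student's
    # log-probability on positive-gap tokens and lowers it, more gently,
    # on negative-gap tokens.
    aux = -sum(c * s for c, s in zip(coeffs, student_logp)) / len(coeffs)
    return rl_loss + aux_weight * aux
```

Under these assumptions, a token with gap +2 receives a larger-magnitude coefficient than a token with gap -2, which is the asymmetry described above: endorsement is reinforced while rejection is softened. Keeping `aux_weight` small reflects the design choice that RL stays the main objective and distillation only shapes it.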

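It merits formal inclusion because it gives long-horizon agent RL a reusable way to combine RL with privileged-context distillation, with clear gains on ALFWorld, WebShop, and Search-QA across multiple scales of Qwen2.5 and Qwen3.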

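It is not rated higher because the benchmarks remain concentrated in standard agent environments; stability under real open-ended tool use, long-term memory, and safety constraints still needs further replication.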

Links

Paper link · Code