Polar: Agentic RL on Any Harness at Scale

Binfeng Xu; Hao Zhang; Shaokun Zhang; Songyang Han; Mingjie Liu; Jian Hu; Shizhe Diao; Zhenghui Jin; Yunheng Zou; Michael Demoret; Jan Kautz; Yi Dong

Category: Agents and Autonomous Science (breakthrough tier)

Published: 2026-05-22
arXiv: 2605.24220

Key Points

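Problem/Background: This arXiv paper introduces Polar, which targets one of the most practical systems problems in agent RL training: real-world agent harnesses typically involve long contexts, multi-turn tool use, multi-agent orchestration, and complex runtimes, which makes them hard to port directly into a standard RL environment.
Method/Mechanism: Polar treats the agent harness as a black box. It proxies the harness's LLM API calls, records token-level interactions, and reconstructs token-faithful trajectories. The training side thus obtains RL-ready trajectories without rewriting complex harnesses such as Codex, Claude Code, or Qwen Code.
Results/Evidence: The system uses an asynchronous rollout service design. Each rollout node manages runtime prewarming, agent execution, trajectory reconstruction, and evaluation, and exposes its results through service endpoints consumed by an independent trainer. This decouples the harness, the training infrastructure, and the RL algorithm.
Why it is included: Polar serves as a reusable interface layer for agentic RL infrastructure. It turns arbitrary agent harnesses into trainable environments, which bears directly on scaling post-training for software engineering agents, tool-use agents, and long-horizon agents.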
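
The proxy-and-reconstruct mechanism can be illustrated with a minimal Python sketch. It is not Polar's implementation: `RecordingProxy`, `CallRecord`, `Segment`, `reconstruct`, and the fake backend are hypothetical names introduced here, and the prefix-matching merge rule is an assumption about one plausible way to rebuild a token-faithful trajectory from logged calls. The idea is that the harness only ever talks to the proxy, so every prompt and every sampled token is captured exactly as the model saw or produced it.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical backend signature: prompt token ids -> (completion ids, logprobs).
Backend = Callable[[List[int]], Tuple[List[int], List[float]]]


@dataclass
class CallRecord:
    prompt_ids: List[int]
    completion_ids: List[int]
    logprobs: List[float]


@dataclass
class Segment:
    token_ids: List[int] = field(default_factory=list)
    loss_mask: List[int] = field(default_factory=list)   # 1 = sampled by the policy
    logprobs: List[float] = field(default_factory=list)  # 0.0 for non-policy tokens


class RecordingProxy:
    """Sits between a black-box harness and the model, logging every call."""

    def __init__(self, backend: Backend) -> None:
        self._backend = backend
        self.calls: List[CallRecord] = []

    def complete(self, prompt_ids: List[int]) -> List[int]:
        completion_ids, logprobs = self._backend(prompt_ids)
        self.calls.append(
            CallRecord(list(prompt_ids), list(completion_ids), list(logprobs))
        )
        return completion_ids


def reconstruct(calls: List[CallRecord]) -> List[Segment]:
    """Merge logged calls into segments (assumed rule: prefix extension).

    A call whose prompt extends the current segment is appended to it;
    anything else (for example, after context compaction) starts a new one.
    """
    segments: List[Segment] = []
    for call in calls:
        cur = segments[-1] if segments else None
        n = len(cur.token_ids) if cur else 0
        if cur is None or call.prompt_ids[:n] != cur.token_ids:
            cur = Segment()
            segments.append(cur)
            n = 0
        # Tokens the harness added (system prompt, tool output): not trained on.
        new_context = call.prompt_ids[n:]
        cur.token_ids += new_context
        cur.loss_mask += [0] * len(new_context)
        cur.logprobs += [0.0] * len(new_context)
        # Tokens the policy sampled: kept verbatim with their logprobs.
        cur.token_ids += call.completion_ids
        cur.loss_mask += [1] * len(call.completion_ids)
        cur.logprobs += call.logprobs
    return segments


def fake_backend(prompt_ids: List[int]) -> Tuple[List[int], List[float]]:
    # Stand-in for a real model: emits one deterministic token.
    return [100 + len(prompt_ids)], [-0.5]


proxy = RecordingProxy(fake_backend)
ctx = [1, 2, 3]
ctx = ctx + proxy.complete(ctx) + [7, 8]  # harness appends a tool result
proxy.complete(ctx)                       # second turn extends the same context
proxy.complete([9])                       # unrelated prompt: new segment
segments = reconstruct(proxy.calls)
```

Because the sketch stores the sampled token ids instead of re-tokenizing decoded text, the training-side sequence matches what the model actually generated. The loss mask separates policy tokens from tokens injected by the harness.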

Abstract

Reinforcement learning for language agents increasingly depends on custom harnesses that manage long-running context, multi-turn tool use and multi-agent orchestration. Polar is a rollout framework for scalable asynchronous RL over arbitrary agent harnesses, proxying LLM API calls, recording token-level model interactions, and reconstructing token-faithful trajectories for training.
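
To make the asynchronous rollout idea concrete, the following Python sketch decouples rollout producers from a trainer with a bounded queue. It is an illustration under stated assumptions, not Polar's design: the in-process `asyncio.Queue` stands in for the service endpoints, and `Rollout`, `rollout_node`, and `trainer` are hypothetical names with stubbed stages.

```python
import asyncio
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    task_id: int
    tokens: List[int]
    reward: float


async def rollout_node(node_id: int, task_ids: List[int], out: asyncio.Queue) -> None:
    """One rollout node: prewarm, run the agent, reconstruct, evaluate (all stubbed)."""
    await asyncio.sleep(0)                 # stand-in for runtime prewarming
    for task_id in task_ids:
        await asyncio.sleep(0)             # stand-in for agent execution
        tokens = [node_id, task_id]        # stand-in for trajectory reconstruction
        reward = float(task_id % 2)        # stand-in for evaluation
        await out.put(Rollout(task_id, tokens, reward))


async def trainer(inbox: asyncio.Queue, total: int, batch_size: int) -> List[List[Rollout]]:
    """Independent consumer: forms batches as rollouts arrive, in any order."""
    batches: List[List[Rollout]] = []
    batch: List[Rollout] = []
    for _ in range(total):
        batch.append(await inbox.get())
        if len(batch) == batch_size:
            batches.append(batch)
            batch = []
    if batch:
        batches.append(batch)              # final partial batch
    return batches


async def main() -> List[List[Rollout]]:
    # The bounded queue applies backpressure when the trainer falls behind.
    queue: asyncio.Queue = asyncio.Queue(maxsize=4)
    nodes = [
        asyncio.create_task(rollout_node(i, list(range(i * 3, i * 3 + 3)), queue))
        for i in range(2)
    ]
    batches = await trainer(queue, total=6, batch_size=4)
    await asyncio.gather(*nodes)
    return batches


batches = asyncio.run(main())
```

The trainer never calls into a harness. It only consumes finished rollouts, so in this sketch a harness or an RL algorithm could be swapped without touching the other side.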

Links

Paper link (arXiv: 2605.24220)
