RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards

Gaotang Li; Bhavana Dalvi Mishra; Zifeng Wang; Jun Yan; Yanfei Chen; Chun-Liang Li; Long T. Le; Rujun Han; George Lee; Hanghang Tong; Chen-Yu Lee; Tomas Pfister

Agents and Autonomous Science · Breakthrough tier

Curation and commentary: DAST AI

Published: 2026-05-11
arXiv: 2605.10899

Curator's commentary

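RubricEM targets a core difficulty of deep research agents: long-report writing, evidence search, and synthesis tasks usually have no verifiable answer, so standard RLVR (reinforcement learning with verifiable rewards) struggles to supply dense, reliable rewards directly.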

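It promotes the rubric from a final scoring tool to an execution interface: planning, evidence gathering, review, and synthesis are each organized around rubrics, and Stage-Structured GRPO together with a reflection meta-policy turns past experience into reusable guidance.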
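To make the idea of stagewise credit assignment concrete, here is a minimal sketch, not the paper's implementation. It assumes each rollout in a group receives one rubric score per stage and applies the standard GRPO group-relative normalization (score minus group mean, divided by group standard deviation) separately for each stage. The stage names, the [0, 1] score range, and the function names are hypothetical; this entry does not give the paper's exact formulation, which may differ.

```python
from statistics import mean, pstdev

# Hypothetical stage names mirroring the four stages named in the commentary.
STAGES = ("plan", "gather", "review", "synthesize")


def group_relative(scores, eps=1e-6):
    """GRPO-style normalization: (score - group mean) / (group std + eps)."""
    mu = mean(scores)
    sigma = pstdev(scores)
    return [(s - mu) / (sigma + eps) for s in scores]


def stage_advantages(group_scores):
    """Normalize rubric scores per stage across a group of rollouts.

    group_scores: list of dicts, one per rollout, mapping stage -> score
    (assumed here to be a rubric score in [0, 1]).
    Returns a list of dicts mapping stage -> advantage, in rollout order.
    """
    per_stage = {
        stage: group_relative([rollout[stage] for rollout in group_scores])
        for stage in STAGES
    }
    return [
        {stage: per_stage[stage][i] for stage in STAGES}
        for i in range(len(group_scores))
    ]


if __name__ == "__main__":
    # Three illustrative rollouts for the same prompt (made-up scores).
    rollouts = [
        {"plan": 0.2, "gather": 0.5, "review": 0.4, "synthesize": 0.9},
        {"plan": 0.8, "gather": 0.5, "review": 0.6, "synthesize": 0.3},
        {"plan": 0.5, "gather": 0.5, "review": 0.5, "synthesize": 0.6},
    ]
    for advantage in stage_advantages(rollouts):
        print(advantage)
```

Under this assumption, a rollout can earn positive credit for a strong plan even when its final synthesis is weak, which is one way stagewise judgments could yield denser feedback than a single end-of-trajectory score. A stage on which every rollout scores the same contributes zero advantage.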

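It merits formal inclusion because it offers an agent RL training framework that goes beyond verifiable rewards, connecting evaluation, execution decomposition, and memory updates into a single workflow.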

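It is not rated higher because long-form research evaluation remains susceptible to judge bias, data leakage, and report style, and the results need more independent replication.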

Abstract

Training deep research agents—systems that plan, search, evaluate evidence, and synthesize long-form reports—pushes reinforcement learning beyond the regime of verifiable rewards. Their outputs lack ground-truth answers, their trajectories span many tool-augmented decisions, and standard post-training offers little mechanism for turning past attempts into reusable experience. In this work, we argue that rubrics should serve not merely as final-answer evaluators, but as the shared interface that structures policy execution, judge feedback, and agent memory. Based on this view, we introduce RubricEM, a rubric-guided reinforcement learning framework that combines stagewise policy decomposition with reflection-based meta-policy training. RubricEM first makes research trajectories stage-aware by conditioning planning, evidence gathering, review, and synthesis on self-generated rubrics. It then assigns credit with Stage-Structured GRPO, which uses stagewise rubric judgments to provide denser semantic feedback for long-horizon optimization. In parallel, RubricEM trains a shared-backbone reflection meta-policy that distills judged trajectories into reusable rubric-grounded guidance for future attempts. The resulting RubricEM-8B achieves strong performance across four representative longform research benchmarks, outperforming comparable open models and approaching proprietary deep-research systems. Beyond final performance, we perform thorough analyses to understand the key ingredients of RubricEM.
