Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning

安全、治理与可靠性突破级暂无讲解视频

发表时间: 2026-06-04
arXiv: 2606.04923

核心要点

问题/背景: 这篇论文处理 rubric-based RL 的核心安全问题：策略模型会利用 LLM-as-a-Judge 的隐含偏差获得高奖励，但这些 reward gains 不一定转化为真实质量或安全性。
方法/机制: CHERRL 通过向 judge 注入已知偏差构造可控 hacking 环境，使 reward hacking 能稳定复现、精确观察 reward divergence，并定位 hacking onset。
结果/证据: 它值得收录，因为它不是又一次事后案例分析，而是提供了可复用实验床和检测任务，可用于研究 rubric reward、judge bias、RL post-training 安全边界和自动监控。
收录价值: 按当前规则，它属于安全评测/红队基础设施型论文；局限是合成偏差环境能否覆盖真实 frontier judge 的复杂偏差，还需要更多外部验证。

完整收录解读

这篇论文处理 rubric-based RL 的核心安全问题：策略模型会利用 LLM-as-a-Judge 的隐含偏差获得高奖励，但这些 reward gains 不一定转化为真实质量或安全性。

CHERRL 通过向 judge 注入已知偏差构造可控 hacking 环境，使 reward hacking 能稳定复现、精确观察 reward divergence，并定位 hacking onset。

它值得收录，因为它不是又一次事后案例分析，而是提供了可复用实验床和检测任务，可用于研究 rubric reward、judge bias、RL post-training 安全边界和自动监控。

按当前规则，它属于安全评测/红队基础设施型论文；局限是合成偏差环境能否覆盖真实 frontier judge 的复杂偏差，还需要更多外部验证。

原始摘要与中文对照

中文对照翻译

标题：复现、分析和检测基于评分标准的强化学习中的奖励欺骗。基于评分标准的强化学习（RL）使用LLM作为评判者（LaaJ）根据评分标准对模型输出进行评分作为奖励。然而，策略模型可能会利用评判者中潜在的偏见，导致奖励欺骗以及无效或不安全的训练结果。在现实世界的基于评分标准的RL中，此类欺骗行为通常是微妙的，并与多种评判者偏见交织在一起，使其难以分析、检测和缓解。在本文中，我们引入了CHERRL，一个用于基于评分标准的RL的可控欺骗环境。通过向LaaJ注入已知偏见，CHERRL能够稳定复现奖励欺骗，明确观察奖励分歧，并精确识别欺骗的发生。这为研究基于评分标准的RL中奖励欺骗的机制和缓解措施提供了一个清晰的实验测试平台。为了展示其效用，我们从可发现性和可利用性的角度分析了不同的评判者偏见，并探索了一种基于代理的系统，用于从训练日志中自动检测奖励欺骗的发生。代码和环境可在https://github.com/THUAIS-Lab/CHERRL公开获取。

原始摘要

Rubric-based reinforcement learning (RL) uses an LLM-as-a-Judge (LaaJ) to score model outputs according to rubrics as rewards. However, policy models may exploit latent biases in the judge, leading to reward hacking and ineffective or unsafe training outcomes. In real-world rubric-based RL, such hacking behaviors are often subtle and entangled with multiple judge biases, making them difficult to analyze, detect, and mitigate. In this paper, we introduce CHERRL, a controllable hacking environment for rubric-based RL. By injecting known biases into LaaJ, CHERRL enables stable reproduction of reward hacking, explicit observation of reward divergence, and precise identification of hacking onset. This provides a clean experimental testbed for studying the mechanisms and mitigations of reward hacking in rubric-based RL. To demonstrate its utility, we analyze different judge biases from the perspectives of discoverability and exploitability, and explore an agent-based system for automatically detecting reward hacking onset from training logs. The code and environment are publicly available at https: //github.com/THUAIS-Lab/CHERRL.

链接

论文链接论文链接代码代码

核心要点

原始摘要与中文对照

中文对照翻译

原始摘要

相关论文

链接