Reasoning, Memory, and Inference-Time Control · Breakthrough tier
Published
2026-06-11
arXiv
2606.13106

Curator's notes

This paper addresses two problems with hidden-state latent reasoning: it is hard to train with on-policy RL, and it is hard to interpret.

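It uses explicit boundary tokens to open and close latent mode, which makes the GRPO policy ratio well-defined and also provides anchor points for mechanistic analysis.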
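To make the point about the GRPO ratio concrete, here is a minimal sketch, not the paper's Switch-GRPO objective. It assumes that latent steps are deterministic hidden-state updates that emit no sampled token (marked `None` below), so only discrete positions, including the boundary tokens, carry a ratio. The function names, the clipping value, and the per-position averaging are illustrative assumptions; the paper's objective additionally propagates gradients through the latent computation, which this sketch does not model.

```python
import math
from statistics import mean, pstdev

ENTER, EXIT = "<enter>", "<exit>"


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardise each reward within its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_ratios(steps, logp_new, logp_old):
    """Per-position policy ratio pi_new / pi_old.

    `steps` holds a string for each discrete token and None for each latent
    step. ASSUMPTION: latent steps sample nothing, so they get no ratio;
    the boundary tokens are ordinary tokens and do get one.
    """
    ratios = []
    for step, lp_new, lp_old in zip(steps, logp_new, logp_old):
        if step is None:
            ratios.append(None)
        else:
            ratios.append(math.exp(lp_new - lp_old))
    return ratios


def clipped_surrogate(ratios, advantage, clip=0.2):
    """Clipped surrogate averaged over the positions that have a ratio."""
    terms = []
    for r in ratios:
        if r is None:
            continue  # latent step: no sampled token, no ratio term
        clipped = min(max(r, 1.0 - clip), 1.0 + clip)
        terms.append(min(r * advantage, clipped * advantage))
    return sum(terms) / len(terms)


# Hypothetical rollout: visible token, latent block of two steps, visible token.
steps = ["a", ENTER, None, None, EXIT, "b"]
logp_old = [-1.0, -0.5, 0.0, 0.0, -0.2, -1.0]
logp_new = list(logp_old)  # identical policies give a ratio of 1 everywhere
ratios = token_ratios(steps, logp_new, logp_old)
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
objective = clipped_surrogate(ratios, advantages[0])
```

The entry and exit decisions show up as ordinary entries in `ratios`, which is the sense in which a discrete boundary gives the RL objective something well-defined to act on.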

It merits inclusion because it directly advances both the training interface and the interpretability of latent reasoning.

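Its limitation is that the current evidence comes mainly from the preprint's own experiments and author-built evaluations; independent replication and validation in broader deployments are still needed.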

Abstract

Title (translated from the Chinese rendering): Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning

Latent chain-of-thought compresses reasoning by replacing visible reasoning traces with continuous hidden-state recurrence, but existing formulations are difficult to optimize with standard on-policy reinforcement learning (RL) and hard to interpret causally. Our key insight is that a single pair of explicit boundary tokens can address both issues at once: discrete entry and exit anchors make the latent block compatible with standard on-policy RL, and the same anchors offer a natural foothold for mechanistic analysis. Motivated by this, we propose Switch, a switchable latent reasoning framework. The model emits <enter> to enter latent mode and <exit> to exit. Because the boundaries are ordinary discrete tokens, the GRPO policy ratio is well-defined at every decision point. The same anchors also expose the latent steps to direct probing and causal intervention. We train the model with a visible-to-latent curriculum and a Switch-GRPO objective that propagates gradients through recurrent latent computation. Switch consistently outperforms prior hidden-state-recurrence latent reasoning approaches at similar scale. Mechanistic analysis through the boundary tokens further reveals three findings: (i) <enter> is a sharply localised, learned switching policy rather than a stylistic artefact; (ii) the latent step it opens performs problem-specific, causally important computation rather than acting as an inert placeholder; and (iii) that computation is concentrated at a single hidden-state transition on entry. Together, these results show that hidden-state-recurrence latent reasoning is both RL-trainable and open to direct mechanistic analysis, including of how on-policy RL itself improves the model from the inside.
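The switching mechanism described in the abstract can be illustrated with a toy decode loop. This is a sketch under strong assumptions, not the authors' implementation: `ToyModel`, its methods, the two-element hidden state, and the `patch` hook are all hypothetical, and the "model" is a hand-scripted stand-in rather than a trained network. It shows a visible mode that emits discrete tokens, a latent mode that feeds the hidden state back as the next input without emitting anything, and a crude analogue of a causal intervention on the latent input.

```python
ENTER, EXIT, EOS = "<enter>", "<exit>", "<eos>"


class ToyModel:
    """Hypothetical scripted stand-in: hidden state is [accumulator, latent_step_count]."""

    def embed(self, token):
        # Toy embedding: every discrete token maps to a zero input vector.
        return [0.0, 0.0]

    def visible_step(self, token, h):
        if h is None:                      # reading the prompt digit
            return [float(token), 0.0], ENTER
        if token == EXIT:                  # decode the answer from the hidden state
            return h, str(int(h[0]))
        return h, EOS

    def latent_step(self, x, h):
        # Latent computation: add the fed-back input to the accumulator.
        h_next = [h[0] + x[0], h[1] + 1.0]
        return h_next, h_next[1] >= 3.0    # scripted rule: exit after three latent steps


def switchable_decode(model, prompt_token, max_steps=16, patch=None):
    """Decode with a visible mode and a latent mode delimited by ENTER / EXIT.

    `patch`, if given, rewrites the latent input before each latent step,
    a toy analogue of a causal intervention.
    """
    trace = []                             # ("token", str) or ("latent", hidden) entries
    h, tok = model.visible_step(prompt_token, None)
    steps = 0
    while steps < max_steps:
        trace.append(("token", tok))
        if tok == EOS:
            break
        if tok == ENTER:
            x = model.embed(tok)
            exit_now = False
            while not exit_now and steps < max_steps:
                x_in = patch(x) if patch else x
                h, exit_now = model.latent_step(x_in, h)
                trace.append(("latent", tuple(h)))
                x = h                      # recurrence: next input is the hidden state
                steps += 1
            tok = EXIT                     # the boundary token closes the latent block
        else:
            h, tok = model.visible_step(tok, h)
            steps += 1
    return trace


def visible_tokens(trace):
    return [payload for kind, payload in trace if kind == "token"]


clean = switchable_decode(ToyModel(), "3")
ablated = switchable_decode(ToyModel(), "3", patch=lambda x: [0.0, 0.0])
```

In this toy, `visible_tokens(clean)` contains the answer `"12"` while `visible_tokens(ablated)` contains `"3"`: zeroing the latent input changes the final output, which is the kind of evidence the abstract's second finding refers to when it says the latent step is not an inert placeholder. The real analysis operates on a trained model's hidden states, not on a scripted rule.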

Related papers

Links