Online Reasoning Calibration: Test-Time Training Enables Generalizable Conformal LLM Reasoning

Cai Zhou; Zekai Wang; Menghua Wu; Qianyu Julie Zhu; Flora C. Shi; Chenyu Wang; Ashia Wilson; Tommi Jaakkola; Stephen Bates

Reasoning, memory, and inference-time control · Breakthrough tier

Published: 2026-04-01
arXiv: 2604.01170

Curation notes

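Test-time scaling delivers stronger reasoning, but it also drives inference cost up quickly. In many cases the real problem is not that the model cannot solve the task, but that sampling and stopping decisions are uncalibrated, so the system spends substantial compute even when no further thinking is needed. ORCA approaches this from the angle of reasoning calibration, aiming to cut wasted compute while keeping risk under control.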

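The paper proposes Online Reasoning Calibration (ORCA), which combines conformal prediction with test-time training. The core idea is to update a calibration module online for each input, so that it adapts both to distribution shift within the reasoning process and to the shift in prompt distribution between development and deployment. The authors provide conformal risk guarantees and report efficiency gains across several types of reasoning tasks. The gain is largest out of distribution: in the zero-shot out-of-domain setting, MATH-500 savings rise from 24.8% under the static calibration baseline to 67.0%.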
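To make the "update the calibration module for each input" idea concrete, here is a minimal toy sketch. It is not ORCA's actual procedure. It assumes a hypothetical logistic probe over per-step features, a hypothetical shared initialization `w_init` standing in for a meta-learned starting point, and hypothetical per-step pseudo-labels (for example, agreement with a self-consistency vote). All names (`OnlineCalibrator`, `run_trace`) are illustrative.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


class OnlineCalibrator:
    """Toy logistic probe that is adapted online within a single input."""

    def __init__(self, w_init, lr=0.5):
        # w_init stands in for a meta-learned initialization (assumption).
        self.w_init = np.asarray(w_init, dtype=float)
        self.lr = lr
        self.reset()

    def reset(self):
        """Start every new input from the shared initialization."""
        self.w = self.w_init.copy()

    def confidence(self, x):
        """Estimated probability that the current answer is correct."""
        return float(sigmoid(self.w @ np.asarray(x, dtype=float)))

    def update(self, x, y):
        """One gradient step on the log loss for the pair (x, y)."""
        x = np.asarray(x, dtype=float)
        grad = (self.confidence(x) - y) * x
        self.w -= self.lr * grad


def run_trace(cal, feats, labels, threshold):
    """Walk through the reasoning steps of one input (at least one step).

    At each step: score first, stop if confident enough, otherwise
    adapt the probe on that step's pseudo-label and continue.
    Returns (stop_index, confidence_at_stop).
    """
    cal.reset()
    p = 0.0
    for t, (x, y) in enumerate(zip(feats, labels)):
        p = cal.confidence(x)
        if p >= threshold:
            return t, p
        cal.update(x, y)
    return len(feats) - 1, p


# Usage: five identical steps whose pseudo-labels say "correct".
feats = np.tile(np.array([1.0, 0.0]), (5, 1))
labels = [1, 1, 1, 1, 1]
cal = OnlineCalibrator(w_init=np.zeros(2), lr=1.0)
stop, conf = run_trace(cal, feats, labels, threshold=0.65)
print(f"stopped at step {stop} with confidence {conf:.3f}")
```

The probe is reset for every input, so adaptation is local to one reasoning trace. The stopping threshold itself would come from a separate conformal calibration step, which this sketch does not cover.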

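This work merits inclusion because it moves the reasoning-efficiency problem beyond simple sample-budget tuning toward a more systematic post-deployment adaptation pattern: calibrating the reasoning process online. It aligns closely with this repository's focus on test-time learning, inference-time adaptation, and reasoning control, and it is a reusable method pattern rather than a one-off trick for throttling compute.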

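It was not placed in a higher tier because validation so far is mostly limited to specific reasoning benchmarks and model families. Both the theory and the empirical results are solid, but whether it becomes a standard component of broader reasoning stacks still requires further evidence across models, tasks, and deployment scenarios.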

Original abstract

While test-time scaling has enabled large language models to solve highly difficult tasks, state-of-the-art results come at exorbitant compute costs. These inefficiencies can be attributed to the miscalibration of post-trained language models, and the lack of calibration in popular sampling techniques. Here, we present Online Reasoning Calibration (ORCA), a framework for calibrating the sampling process that draws upon conformal prediction and test-time training. Specifically, we introduce a meta-learning procedure that updates the calibration module for each input. This allows us to provide valid confidence estimates under distributional shift, e.g. in thought patterns that occur across different stages of reasoning, or in prompt distributions between model development and deployment. ORCA not only provides theoretical guarantees on conformal risks, but also empirically shows higher efficiency and generalization across different reasoning tasks. At risk level δ=0.1, ORCA improves Qwen2.5-32B efficiency on in-distribution tasks with savings up to 47.5% with supervised labels and 40.7% with self-consistency labels. Under zero-shot out-of-domain settings, it improves MATH-500 savings from 24.8% of the static calibration baseline to 67.0% while maintaining a low empirical error rate, and the same trend holds across model families and downstream benchmarks. Our code is publicly available at https://github.com/wzekai99/ORCA.
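The quantities named in the abstract (a risk level δ, compute savings, and an empirical error rate) can be illustrated with a generic static conformal risk control calibration of an early-stopping threshold. This is a sketch of the general technique on synthetic data, not the paper's baseline or ORCA itself. The monotone surrogate loss, the definition of savings as the fraction of reasoning steps skipped, and all function names are assumptions made for illustration.

```python
import numpy as np


def monotone_loss(conf, correct, lam):
    """1.0 if any step with confidence >= lam carries a wrong answer.

    This loss is non-increasing in lam and upper-bounds the error of
    stopping early at the first step whose confidence reaches lam.
    """
    eligible = conf >= lam
    return float(np.any(eligible & ~correct))


def calibrate_threshold(cal_conf, cal_correct, delta, grid):
    """Smallest grid threshold whose corrected calibration risk is <= delta.

    Uses the conformal risk control correction (n * risk + B) / (n + 1)
    with loss bound B = 1.
    """
    n = len(cal_conf)
    for lam in np.sort(grid):
        risk = np.mean([monotone_loss(c, y, lam)
                        for c, y in zip(cal_conf, cal_correct)])
        if (n * risk + 1.0) / (n + 1) <= delta:
            return float(lam)
    return float("inf")  # no threshold qualifies: never stop early


def stop_step(conf, lam):
    """Index of the first step with confidence >= lam, else the last step."""
    hits = np.flatnonzero(conf >= lam)
    return int(hits[0]) if hits.size else len(conf) - 1


def evaluate(conf, correct, lam):
    """Return (savings, error): fraction of steps skipped, wrong-stop rate."""
    stops = np.array([stop_step(c, lam) for c in conf])
    savings = 1.0 - np.mean((stops + 1) / conf.shape[1])
    error = np.mean([not y[s] for y, s in zip(correct, stops)])
    return float(savings), float(error)


# Synthetic traces: each becomes correct from a random step onward, and
# confidence is noisy but higher on correct steps (purely illustrative).
rng = np.random.default_rng(0)
T = 10


def make_traces(n):
    solve = rng.integers(0, T, size=n)
    correct = np.arange(T)[None, :] >= solve[:, None]
    noise = rng.normal(0.0, 0.1, size=(n, T))
    conf = np.clip(np.where(correct, 0.8, 0.3) + noise, 0.0, 1.0)
    return conf, correct


cal_conf, cal_correct = make_traces(200)
lam = calibrate_threshold(cal_conf, cal_correct, delta=0.1,
                          grid=np.linspace(0.0, 1.0, 101))
test_conf, test_correct = make_traces(500)
savings, error = evaluate(test_conf, test_correct, lam)
print(f"threshold={lam:.2f} savings={savings:.3f} error={error:.3f}")
```

The guarantee of this static recipe relies on calibration and test traces being exchangeable. That is the assumption the abstract says breaks under distribution shift, which motivates updating the calibration module per input.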

Links

Paper link