Adaptive Test-Time Compute Allocation for Reasoning LLMs via Constrained Policy Optimization

Zhiyuan Zhai; Bingcong Li; Bingnan Xiao; Ming Li; Xin Wang

Topic: Reasoning, Memory, and Inference-Time Control · Breakthrough level

Curation and commentary: DAST AI

Published: 2026-04-16
arXiv: 2604.14853

Curator's Commentary

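Problem and background: Test-time compute scaling is effective but expensive. A real deployment has to decide which inputs merit extra sampling, search, or long reasoning, and which can be answered cheaply.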

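Method and novelty: The paper formalizes the problem as maximizing accuracy subject to an average compute budget, uses Lagrangian relaxation to decompose it into a per-instance oracle action, and then trains a lightweight classifier to imitate the resulting budget allocation policy from cheap input features.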
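The formulation described above can be written out as a short sketch. The notation is assumed here for illustration and may differ from the paper's: $\pi$ is an allocation policy, $\mathcal{A}$ the set of compute actions, $p(x,a)$ the probability of a correct answer on input $x$ under action $a$, $c(x,a)$ its cost, $B$ the average budget, and $\lambda$ the dual variable.

```latex
\begin{aligned}
&\max_{\pi}\; \mathbb{E}_{x}\big[\, p(x, \pi(x)) \,\big]
  \quad \text{s.t.} \quad \mathbb{E}_{x}\big[\, c(x, \pi(x)) \,\big] \le B, \\[4pt]
&\mathcal{L}(\pi, \lambda)
  = \mathbb{E}_{x}\big[\, p(x, \pi(x)) - \lambda\, c(x, \pi(x)) \,\big] + \lambda B,
  \qquad \lambda \ge 0, \\[4pt]
&a^{*}_{\lambda}(x) \in \arg\max_{a \in \mathcal{A}}
  \big\{\, p(x, a) - \lambda\, c(x, a) \,\big\}.
\end{aligned}
```

For a fixed $\lambda$ the expectation separates across inputs, so the maximization can be carried out one instance at a time. In this reading, $\lambda$ acts as a price per unit of compute.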

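Why it matters: This paper is important for inference-time scaling because it turns "think a bit more" from a heuristic into an optimizable budget allocation problem, and it provides both a regret bound and a deployable solve-then-learn pipeline.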
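The regret bound can be read as a one-line argument. The sketch below uses assumed notation and measures regret on the per-instance Lagrangian objective $\ell_{\lambda}$. The paper's exact definition of task-level regret may differ. Here $a^{*}_{\lambda}(x)$ is the oracle action, $\hat{\pi}$ the learned policy, and $\Delta_{\max}$ the worst-case per-instance gap.

```latex
\begin{aligned}
\ell_{\lambda}(x, a) &= p(x, a) - \lambda\, c(x, a), \qquad
\Delta_{\max} = \sup_{x,\, a}\, \big[\, \ell_{\lambda}(x, a^{*}_{\lambda}(x)) - \ell_{\lambda}(x, a) \,\big], \\[4pt]
\mathrm{Regret}(\hat{\pi})
&= \mathbb{E}_{x}\big[\, \ell_{\lambda}(x, a^{*}_{\lambda}(x)) - \ell_{\lambda}(x, \hat{\pi}(x)) \,\big] \\
&= \mathbb{E}_{x}\Big[\, \mathbf{1}\{\hat{\pi}(x) \ne a^{*}_{\lambda}(x)\}\,
   \big( \ell_{\lambda}(x, a^{*}_{\lambda}(x)) - \ell_{\lambda}(x, \hat{\pi}(x)) \big) \Big] \\
&\le \Pr_{x}\big[\, \hat{\pi}(x) \ne a^{*}_{\lambda}(x) \,\big] \cdot \Delta_{\max}.
\end{aligned}
```

The gap is zero wherever the classifier agrees with the oracle and at most $\Delta_{\max}$ wherever it disagrees, which is why reducing classification error directly tightens the bound.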

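Limitations: The experiments focus on mathematical reasoning and a small number of models. Feature selection and oracle construction still need to be extended to complex agent workflows.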

Abstract


Test-time compute scaling, the practice of spending extra computation during inference via repeated sampling, search, or extended reasoning, has become a powerful lever for improving large language model performance. Yet deploying these techniques under finite inference budgets requires a decision that current systems largely ignore: which inputs deserve more compute, and which can be answered cheaply? We formalize this as a constrained optimization problem (maximize expected accuracy subject to an average compute budget) and solve it with a two-stage Solve-then-Learn pipeline. In the solve stage, Lagrangian relaxation decomposes the global constraint into per-instance sub-problems, each admitting a closed-form oracle action that optimally prices accuracy against cost. We prove that the induced cost is monotone in the dual variable, enabling exact budget targeting via binary search. In the learn stage, a lightweight classifier is trained to predict oracle actions from cheap input features, amortizing the allocation rule for real-time deployment. We establish that the task-level regret of the learned policy is bounded by its imitation error times the worst-case per-instance gap, yielding a clean reduction from constrained inference to supervised classification. Experiments on MATH and GSM8K with three LLMs (DeepSeek-V3, GPT-4o-mini, Qwen2.5-7B) show that our method consistently outperforms uniform and heuristic allocation baselines, achieving up to 12.8% relative accuracy improvement on MATH under matched budget constraints, while closely tracking the Lagrangian oracle upper bound with over 91% imitation accuracy. Code is available at https://github.com/zhiyuanZhai20/AdaCompute-LLM.
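The solve stage described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy accuracy table `ACC`, the action costs `COSTS`, and the tie-breaking rule are all assumptions made for illustration. The sketch picks a per-instance action that maximizes accuracy minus a priced cost, then bisects on the dual variable until the average cost fits the budget.

```python
from typing import List, Sequence, Tuple


def oracle_actions(acc: Sequence[Sequence[float]],
                   costs: Sequence[float], lam: float) -> List[int]:
    """Per-instance oracle: argmax_a acc[i][a] - lam * costs[a].

    Ties are broken toward the cheaper action (an assumption), which keeps
    the induced mean cost non-increasing in lam.
    """
    n_actions = len(costs)
    return [
        max(range(n_actions), key=lambda a: (row[a] - lam * costs[a], -costs[a]))
        for row in acc
    ]


def mean_cost(actions: Sequence[int], costs: Sequence[float]) -> float:
    """Average compute cost of an allocation."""
    return sum(costs[a] for a in actions) / len(actions)


def solve_for_budget(acc: Sequence[Sequence[float]], costs: Sequence[float],
                     budget: float, iters: int = 60) -> Tuple[float, List[int]]:
    """Bisect on lam >= 0 for the smallest price whose allocation fits the budget."""
    if budget < min(costs):
        raise ValueError("budget is below the cheapest action's cost")
    lo, hi = 0.0, 1.0
    if mean_cost(oracle_actions(acc, costs, lo), costs) <= budget:
        return lo, oracle_actions(acc, costs, lo)  # constraint is slack
    # Grow hi until it is feasible; bisection then keeps hi feasible throughout.
    while mean_cost(oracle_actions(acc, costs, hi), costs) > budget:
        hi *= 2.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if mean_cost(oracle_actions(acc, costs, mid), costs) > budget:
            lo = mid
        else:
            hi = mid
    return hi, oracle_actions(acc, costs, hi)


# Hypothetical toy data: rows are inputs, columns are actions costing 1, 4, 16 units.
COSTS = [1.0, 4.0, 16.0]
ACC = [
    [0.95, 0.96, 0.97],  # easy: extra compute barely helps
    [0.50, 0.80, 0.85],  # medium: a little extra compute helps a lot
    [0.10, 0.30, 0.70],  # hard: only the largest budget pays off
    [0.02, 0.03, 0.04],  # out of reach: compute is wasted
]

lam, actions = solve_for_budget(ACC, COSTS, budget=5.5)
adaptive_acc = sum(ACC[i][a] for i, a in enumerate(actions)) / len(actions)
uniform_acc = sum(row[1] for row in ACC) / len(ACC)  # every input gets action 1
print(actions, mean_cost(actions, COSTS), adaptive_acc, uniform_acc)
```

On this toy table the allocation spends the large budget only on the hard input and the medium budget only on the medium one. In a learn stage, the `actions` list would serve as the labels a lightweight classifier is trained to predict from cheap features.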

Links

Paper link