Learning to Discover at Test Time

Mert Yuksekgonul; Daniel Koceja; Xinhao Li; Federico Bianchi; Jed McCaleb; Xiaolong Wang; Jan Kautz; Yejin Choi; James Zou; Carlos Guestrin; Yu Sun

Agents and Autonomous Science · Disruptive tier

Published: 2026-01-22
arXiv: 2601.16175

Inclusion Commentary

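This paper pushes test-time scaling beyond prompting and search with a frozen model to test-time reinforcement learning: the model keeps training on the single problem at hand in order to discover a better solution. Its value for this repository is that it redefines the goal of inference-time adaptation: not generalizing across many problems, but finding the single best result for the current problem.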
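To make the "one best result rather than a good average" goal concrete, here is a minimal Python sketch comparing two ways of weighting a batch of sampled solutions. It is an illustration only, not the objective used in the paper: the function names, the softmax weighting, and the `temperature` parameter are all assumptions chosen for this example.

```python
import math


def mean_baseline_advantages(rewards):
    """Average-case signal: each sample is scored relative to the batch mean."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


def best_seeking_weights(rewards, temperature=0.1):
    """Hypothetical best-seeking signal: softmax over rewards.

    A low temperature concentrates almost all weight on the top sample,
    which mimics prioritizing the most promising solution.
    """
    top = max(rewards)
    # Subtract the maximum before exponentiating for numerical stability.
    exps = [math.exp((r - top) / temperature) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    batch_rewards = [0.1, 0.2, 0.3, 0.9]
    print(mean_baseline_advantages(batch_rewards))
    print(best_seeking_weights(batch_rewards))
```

With the mean baseline, every above-average sample contributes to the update; with the softmax weighting, the single strongest sample dominates. Which trade-off is appropriate depends on the task, and the paper should be consulted for the objective it actually uses.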

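TTT-Discover combines online RL with a search subroutine and targets scientific and engineering problems with continuous, verifiable rewards, including mathematical constructions, GPU kernel optimization, AtCoder heuristic competitions, and single-cell RNA-seq denoising. The project also releases its code and verifiable results, which lowers the barrier to reproduction compared with closed systems such as AlphaEvolve.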
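The general shape of online RL on a single problem with a continuous reward can be sketched with a toy example. This is not the TTT-Discover implementation: a one-dimensional Gaussian policy stands in for the LLM, `reward` is a hypothetical scoring function with its optimum at x = 3, and a plain REINFORCE update with a mean baseline replaces whatever objective and search subroutine the authors use.

```python
import random


def reward(x: float) -> float:
    """Hypothetical continuous, verifiable reward for one fixed problem."""
    return -(x - 3.0) ** 2


def test_time_rl(steps: int = 200, batch: int = 16, lr: float = 0.05,
                 sigma: float = 0.5, seed: int = 0):
    """Toy policy-gradient loop trained on a single problem.

    The policy is a Gaussian with learnable mean `mu` (a stand-in for model
    weights). Returns the final mean plus the best candidate and its reward.
    """
    rng = random.Random(seed)
    mu = 0.0
    best_x, best_r = mu, reward(mu)
    for _ in range(steps):
        # Sample candidate solutions from the current policy and score them.
        xs = [rng.gauss(mu, sigma) for _ in range(batch)]
        rs = [reward(x) for x in xs]
        baseline = sum(rs) / batch
        # REINFORCE gradient for a Gaussian mean: (r - b) * (x - mu) / sigma^2.
        grad = sum((r - baseline) * (x - mu)
                   for x, r in zip(xs, rs)) / (batch * sigma ** 2)
        mu += lr * grad
        # Only the single best solution found so far is what ultimately matters.
        for x, r in zip(xs, rs):
            if r > best_r:
                best_x, best_r = x, r
    return mu, best_x, best_r


if __name__ == "__main__":
    final_mu, best_x, best_r = test_time_rl()
    print(final_mu, best_x, best_r)
```

The point of the sketch is the loop structure: sample from the current policy, score with the verifier, update the policy on this problem only, and keep the best solution seen rather than the policy's average performance.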

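It merits formal inclusion because it supplies a key operating mode for AI scientists and test-time learning: rather than only keeping parameters frozen, a model can be trained locally on a specific task at inference time. This is clearly reusable for scientific discovery, algorithm engineering, kernel search, and verifiable optimization problems.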

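It was not promoted to the paradigm tier because the method currently depends heavily on a computable reward, low per-task cost, an acceptable online training budget, and a particular choice of tasks. Whether it applies equally to open-ended theoretical discovery, long-horizon experiment design, and noisy real-world scientific problems remains to be verified.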

Abstract

How can we use AI to discover a new state of the art for a scientific problem? Prior work in test-time scaling, such as AlphaEvolve, performs search by prompting a frozen LLM. We perform reinforcement learning at test time, so the LLM can continue to train, but now with experience specific to the test problem. This form of continual learning is quite special, because its goal is to produce one great solution rather than many good ones on average, and to solve this very problem rather than generalize to other problems. Therefore, our learning objective and search subroutine are designed to prioritize the most promising solutions. We call this method Test-Time Training to Discover (TTT-Discover). Following prior work, we focus on problems with continuous rewards. We report results for every problem we attempted, across mathematics, GPU kernel engineering, algorithm design, and biology. TTT-Discover sets the new state of the art in almost all of them: (i) Erdős’ minimum overlap problem and an autocorrelation inequality; (ii) a GPUMode kernel competition (up to 2× faster than prior art); (iii) past AtCoder algorithm competitions; and (iv) denoising problem in single-cell analysis. Our solutions are reviewed by experts or the organizers. All our results are achieved with an open model, OpenAI gpt-oss-120b, and can be reproduced with our publicly available code, in contrast to previous best results that required closed frontier models. Our test-time training runs are performed using Tinker, an API by Thinking Machines, with a cost of only a few hundred dollars per problem.

