Toward Ultra-Long-Horizon Agentic Science: Cognitive Accumulation for Machine Learning Engineering

Xinyu Zhu; Yuzhu Cai; Zexi Liu; Bingyang Zheng; Cheng Wang; Rui Ye; Yuzhi Zhang; Linfeng Zhang; Weinan E; Siheng Chen; Yanfeng Wang

Agents and Autonomous Science · Disruptive tier

Published: 2026-01-15
arXiv: 2601.10402

Digest notes

- Tier: `Disruptive`
- Official title: `Toward Ultra-Long-Horizon Agentic Science: Cognitive Accumulation for Machine Learning Engineering`
- Source file: `2026-01-15-A2_ML_Master_2_0-Toward_Ultra_Long_Horizon_Agentic_Science_Cognitive_Accumulation_for_Machine_Lea.pdf`
- Extraction: `extracted.md`

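## Rewritten summary

This paper targets the hardest class of problems for research agents: not single-shot reasoning, but sustained research that runs for a long time, involves many rounds of trial and error, and reuses experience across tasks. The authors propose a "cognitive accumulation" framework that upgrades a research agent's context management from simple concatenation of dialogue turns to hierarchical caching and long-term consolidation of experience. The core idea is to keep distilling short-term execution traces into stable knowledge and then reuse that knowledge in new tasks, rather than having the agent roll its context forward from scratch each time.

The paper grounds this idea in machine learning engineering and reports that ML-Master 2.0 reaches a 56.44% medal rate on MLE-Bench under 24-hour budgets. This indicates that the authors are not just illustrating a concept on toy tasks; they validate the value of long-horizon memory organization itself in an environment that is fairly close to a real engineering loop.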
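
The tiered distillation idea summarized above can be illustrated with a small sketch. This is not the paper's implementation: the class `TieredCache`, the three tier names, the `trace_capacity` threshold, and the string-joining `naive_distill` function are all assumptions made for illustration. A real agent would presumably use an LLM to do the distillation.

```python
from dataclasses import dataclass, field
from typing import Callable, List


def naive_distill(items: List[str]) -> str:
    """Placeholder distiller (assumption): a real agent would likely call an LLM."""
    return " | ".join(items)


@dataclass
class TieredCache:
    """Hypothetical three-tier cache: traces -> knowledge -> wisdom."""
    trace_capacity: int = 3  # assumed threshold, not taken from the paper
    distill: Callable[[List[str]], str] = naive_distill
    traces: List[str] = field(default_factory=list)     # short-lived execution records
    knowledge: List[str] = field(default_factory=list)  # stable notes within one task
    wisdom: List[str] = field(default_factory=list)     # notes carried across tasks

    def record(self, trace: str) -> None:
        """Store a raw trace; distill the whole buffer once it is full."""
        self.traces.append(trace)
        if len(self.traces) >= self.trace_capacity:
            self.knowledge.append(self.distill(self.traces))
            self.traces.clear()

    def finish_task(self) -> None:
        """At task end, fold leftover traces and all knowledge into one wisdom entry."""
        if self.traces:
            self.knowledge.append(self.distill(self.traces))
            self.traces.clear()
        if self.knowledge:
            self.wisdom.append(self.distill(self.knowledge))
            self.knowledge.clear()

    def context(self) -> List[str]:
        """Prompt context: distilled tiers first, then the most recent raw traces."""
        return self.wisdom + self.knowledge + self.traces


cache = TieredCache()
for step in ["load data", "train baseline", "score 0.71", "try augmentation"]:
    cache.record(step)
```

In this sketch the prompt context stays small because only the distilled entries and the latest raw traces are kept; calling `cache.finish_task()` then folds everything into a single cross-task entry that the next task can start from.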

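## Why it matters

Many agent systems fail not because they cannot reason step by step, but because after a few hours they begin to forget, drift, and contradict themselves. ML-Master 2.0 points toward something closer to a "research operating system", in which memory, distillation, and experience reuse are first-class citizens.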

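## Limitations

Its gains depend heavily on the evaluation environment, tool permissions, and the capability of the base model. Without a strict provenance mechanism, cache distillation can also accumulate bias and contaminate later experiments.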
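
One way to picture the provenance concern is a store in which every distilled entry records the ids of the items it was derived from, so that entries built on a faulty trace can be found and discarded. The sketch below is an illustration only: `Entry`, `tainted_entries`, and the example ids are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Set


@dataclass(frozen=True)
class Entry:
    """A distilled note plus the ids of the items it was derived from."""
    entry_id: str
    text: str
    sources: FrozenSet[str]  # ids of raw traces or of other entries


def tainted_entries(entries: Dict[str, Entry], bad_ids: Set[str]) -> Set[str]:
    """Return ids of entries that depend, directly or transitively, on bad_ids."""
    tainted: Set[str] = set()
    changed = True
    while changed:  # repeat until no new entry is marked
        changed = False
        for entry in entries.values():
            if entry.entry_id in tainted:
                continue
            if entry.sources & (bad_ids | tainted):
                tainted.add(entry.entry_id)
                changed = True
    return tainted


# Hypothetical store: k1 and k2 come from raw traces, w1 is distilled from k1.
store = {
    "k1": Entry("k1", "augmentation helps", frozenset({"t1", "t2"})),
    "k2": Entry("k2", "lr 1e-3 is stable", frozenset({"t3"})),
    "w1": Entry("w1", "prefer augmentation on small datasets", frozenset({"k1"})),
}
stale = tainted_entries(store, {"t2"})  # trace t2 turned out to be faulty
```

Here marking trace `t2` as faulty flags both `k1` and the cross-task entry `w1` derived from it, while `k2` is left untouched. Without such source links there would be no way to tell which accumulated entries a bad experiment had influenced.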

Original abstract

The advancement of artificial intelligence toward agentic science is currently bottlenecked by the challenge of ultra-long-horizon autonomy, the ability to sustain strategic coherence and iterative correction over experimental cycles spanning days or weeks. While Large Language Models (LLMs) have demonstrated prowess in short-horizon reasoning, they are easily overwhelmed by execution details in the high-dimensional, delayed-feedback environments of real-world research, failing to consolidate sparse feedback into coherent long-term guidance. Here, we present ML-Master 2.0, an autonomous agent that masters ultra-long-horizon machine learning engineering (MLE), which is a representative microcosm of scientific discovery. By reframing context management as a process of cognitive accumulation, our approach introduces Hierarchical Cognitive Caching (HCC), a multi-tiered architecture inspired by computer systems that enables the structural differentiation of experience over time. By dynamically distilling transient execution traces into stable knowledge and cross-task wisdom, HCC allows agents to decouple immediate execution from long-term experimental strategy, effectively overcoming the scaling limits of static context windows. In evaluations on OpenAI’s MLE-Bench under 24-hour budgets, ML-Master 2.0 achieves a state-of-the-art medal rate of 56.44%. Our findings demonstrate that ultra-long-horizon autonomy provides a scalable blueprint for AI capable of autonomous exploration beyond human-precedent complexities.

† Equal contribution. Order randomized. * Corresponding author: sihengc@sjtu.edu.cn, wangyanfeng622@sjtu.edu.cn
