General scales unlock AI evaluation with explanatory and predictive power

Lexin Zhou; Lorenzo Pacchiardi; Fernando Martínez-Plumed; Katherine M. Collins; Yael Moros-Daval; Seraphina Zhang; Qinlin Zhao; Yitian Huang; Luning Sun; Jonathan E. Prunty; Zongqian Li; Pablo Sánchez-García; Kexin Jiang-Chen; Pablo A. M. Casares; Jiyun Zu; John Burden; Behzad Mehrbakhsh; David Stillwell; Manuel Cebrian; Jindong Wang; Peter Henderson; Sherry Tongshuang Wu; Patrick C. Kyllonen; Lucy Cheke; Xing Xie; José Hernández-Orallo

doi:10.1038/s41586-026-10303-2

Agents and autonomous science; Disruptive tier; Explainer video available

Published: 2026-04-01

Inclusion commentary

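This Nature paper targets a fundamental flaw in how large models are currently evaluated: common benchmarks produce scores, but they rarely explain which capabilities a model actually has, and they do not reliably predict how it will perform on new tasks or new instances. The authors recast the problem from "comparing models' average performance on a fixed item set" to "using general scales to characterize task demands and model abilities, then explaining and predicting performance from them".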

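The paper proposes a general scales methodology for AI evaluation. It uses 18 general dimensions, spanning abilities, knowledge and extraneous factors, to characterize the demands of each task instance, and it estimates a corresponding ability profile for each model. The point is not to build yet another leaderboard but to decompose benchmark items into interpretable demand profiles and, on that basis, to profile models on commensurate scales. This enables performance prediction at the instance level, across tasks, and especially under out-of-distribution conditions.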
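As a rough illustration of how a demand profile and an ability profile can sit on the same scales, the sketch below represents each as a mapping over named dimensions and turns the per-dimension margin into a success estimate. The three dimension names, the 0 to 5 level range, the logistic link and the weakest-dimension rule are all assumptions made for this example; the paper's 18 rubrics and its actual predictors are not reproduced here.

```python
import math

# Hypothetical dimension names; the paper uses 18 rubrics not listed here.
DIMENSIONS = ("reasoning", "domain_knowledge", "atypicality")


def success_probability(ability, demand, slope=1.5):
    """Estimate P(success) from the weakest margin between ability and demand.

    Both arguments map dimension name -> level on a shared scale (assumed 0-5).
    The logistic link and the min() bottleneck rule are illustrative choices.
    """
    margins = [ability[d] - demand[d] for d in DIMENSIONS]
    bottleneck = min(margins)  # the most under-supplied dimension dominates
    return 1.0 / (1.0 + math.exp(-slope * bottleneck))


# Made-up profiles for one model and two task instances.
model_ability = {"reasoning": 3.0, "domain_knowledge": 4.0, "atypicality": 2.0}
easy_item = {"reasoning": 1.0, "domain_knowledge": 2.0, "atypicality": 1.0}
hard_item = {"reasoning": 5.0, "domain_knowledge": 3.0, "atypicality": 2.0}

p_easy = success_probability(model_ability, easy_item)
p_hard = success_probability(model_ability, hard_item)
print(f"easy item: {p_easy:.2f}, hard item: {p_hard:.2f}")
```

Because both profiles share units, the same function applies to an item from any benchmark once its demands have been annotated, which is the property that makes out-of-distribution prediction conceivable in the first place.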

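This work is highly valuable to this repository because it changes how AI evaluation is organized rather than contributing a single test set or a single predictor. It combines psychometrics, rubric annotation and instance-level prediction, and it carries over directly to practical problems such as model routing, safe operating areas, refusal rules and pre-deployment assessment. It reads more like extensible infrastructure for a science of evaluation.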
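One way to picture the routing and refusal use cases: given per-model success estimates for an incoming instance, pick the cheapest model that clears a confidence threshold and decline otherwise. The model names, costs, threshold and probabilities below are hypothetical placeholders, not values from the paper, and the rule itself is only a sketch of how such estimates might be consumed.

```python
def route(predicted_success, cost, threshold=0.8):
    """Return the cheapest model whose predicted success meets the threshold.

    predicted_success: model name -> estimated P(success) on this instance.
    cost: model name -> relative cost per call.
    Returns None when no model clears the threshold, which a caller could
    treat as a refusal or as an escalation to a human.
    """
    eligible = [m for m, p in predicted_success.items() if p >= threshold]
    if not eligible:
        return None
    return min(eligible, key=lambda m: cost[m])


# Hypothetical models and numbers, for illustration only.
cost = {"small": 1.0, "medium": 4.0, "large": 20.0}
routine_item = {"small": 0.91, "medium": 0.95, "large": 0.97}
demanding_item = {"small": 0.30, "medium": 0.62, "large": 0.85}
out_of_reach_item = {"small": 0.05, "medium": 0.10, "large": 0.40}

choice_routine = route(routine_item, cost)
choice_demanding = route(demanding_item, cost)
choice_out_of_reach = route(out_of_reach_item, cost)
print(choice_routine, choice_demanding, choice_out_of_reach)
```

The threshold plays the role of a safe operating area boundary here: raising it trades coverage for reliability, and the `None` branch is where a refusal rule would attach.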

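It has not been placed in a higher tier because the general scales have so far been validated mainly on LLMs and on the evaluation battery defined by the authors, and adoption by the field remains to be tested over time. It clearly goes beyond an ordinary benchmark paper, but whether it becomes a long-term default standard depends on independent replication and on extension to more model types and real deployment settings.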

Original abstract

Ensuring safe and effective use of artificial intelligence (AI) requires understanding and anticipating its performance on new tasks, from advanced scientific challenges to transformed workplace activities. So far, benchmarking has guided progress in AI but has offered limited explanatory and predictive power for general-purpose AI systems, attributed to limited transferability across specific tasks. Here we introduce general scales for AI evaluation that elicit demand profiles explaining what capabilities common AI benchmarks truly measure, extract ability profiles quantifying the general strengths and limits of AI systems and robustly predict AI performance for new task instances. Our fully automated methodology builds on 18 rubrics, capturing a broad range of cognitive and intellectual demands, which place different task instances on the same general scales, illustrated on 15 large language models (LLMs) and 63 tasks. Both the demand and the ability profiles on these scales bring new insights such as construct validity through benchmark sensitivity and specificity and explain conflicting claims about whether AI has reasoning capabilities. Ultimately, high predictive power at the instance level becomes possible using the general scales, providing superior estimates over strong black-box baseline predictors, especially in out-of-distribution settings (new tasks and benchmarks). The scales, rubrics, battery, techniques and results presented here constitute a solid foundation for a science of AI evaluation, underpinning the reliable deployment of AI in the years ahead.

Explainer video

Video watch page; Bilibili; YouTube

Links

Paper link
