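Key points
- Problem / background
- Agents' Last Exam (ALE) targets real-world economic utility: it evaluates whether an agent can complete long-horizon, verifiable, economically valuable work tasks.
- Method / mechanism
- It builds its task taxonomy on the O*NET / SOC 2018 occupational classification and was developed with 250+ industry experts; the taxonomy spans 55 subfields in 13 industry clusters and covers 1K+ tasks.
- Results / evidence
- Evaluation centers on verifiable outcomes rather than model self-assessment or preference-based human ratings, so it sits closer to the actual boundary of work automation. On the hardest tier, the reported average full pass rate across mainstream harness and backbone configurations is below 1%.
- Why it is worth including
- It offers a scalable evaluation interface for the question of whether an agent is truly "job-ready", which makes it an important benchmark for discussions of general agent capability, cost, and reliability.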
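The abstract reports an "average full pass rate" across harness and backbone configurations but does not spell out how it is computed. The sketch below is one plausible reading, not ALE's actual scoring code: it assumes a task counts as a full pass only when every verifiable check on its output succeeds, and that the average is an unweighted mean of per-configuration pass rates. The `TaskResult` record, its field names, and the configuration labels are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskResult:
    """Outcome of one task under one configuration (hypothetical record shape)."""
    config: str          # made-up label, e.g. "harness-A/backbone-X"
    task_id: str
    checks: list[bool]   # one entry per verifiable check on the task output


def full_pass(result: TaskResult) -> bool:
    # Assumption: a full pass needs at least one check, and all checks must succeed.
    return bool(result.checks) and all(result.checks)


def average_full_pass_rate(results: list[TaskResult]) -> float:
    # Compute the pass rate per configuration, then take an unweighted mean
    # over configurations (assumed aggregation; the source does not specify it).
    if not results:
        raise ValueError("no results to aggregate")
    by_config: dict[str, list[bool]] = {}
    for r in results:
        by_config.setdefault(r.config, []).append(full_pass(r))
    return mean(sum(passes) / len(passes) for passes in by_config.values())


if __name__ == "__main__":
    demo = [
        TaskResult("harness-A/backbone-X", "t1", [True, True]),
        TaskResult("harness-A/backbone-X", "t2", [True, False]),
        TaskResult("harness-B/backbone-Y", "t1", [False]),
        TaskResult("harness-B/backbone-Y", "t2", [True, False, True]),
    ]
    print(average_full_pass_rate(demo))
```

Under this all-or-nothing rule, partial progress on a long-horizon task earns no credit, which is one reason a full pass rate can stay very low even when individual checks frequently succeed.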
Original abstract and Chinese translation
Chinese translation
最近的AI系统在广泛的基准测试中取得了显著成果,然而这些进展尚未转化为许多专业领域中具有经济意义的实际部署。我们认为这一差距主要是一个评估问题:广泛使用的基准测试缺乏对真实且具有经济价值的工作流程的持续性能测量。本文介绍了Agents’ Last Exam (ALE),这是一个旨在评估AI智能体在具有可验证结果的长期、有经济价值的真实世界任务上的基准测试。ALE与250多位行业专家合作开发,涵盖了参照O*NET / SOC 2018(美国联邦职业分类法)定义的非实体行业。它围绕一个任务分类法组织,包含55个子领域,分为13个行业集群,覆盖1000多个任务。当前结果显示,最困难的层级远未饱和:在主流的测试框架和骨干模型配置中,平均完全通过率低于1%。ALE被设计为一个活的基准测试:随着新的工作流程和行业的加入,其任务池持续增长。更广泛地说,ALE不仅旨在成为另一个排行榜,更是一个弥合基准测试成功与GDP相关影响之间差距的工具。
Original abstract
Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful deployment across many professional domains. We argue that this gap is largely an evaluation problem: widely used benchmarks lack sustained performance measurement on real and economically valuable workflows. This paper introduces Agents’ Last Exam (ALE), a benchmark designed to evaluate AI agents on long horizon, economically valuable, real world tasks with verifiable outcomes. Developed in collaboration with 250+ industry experts, ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy). It is organized around a task taxonomy with 55 sub fields grouped into 13 industry clusters covering 1K+ tasks. Current results show that the hardest tier remains far from saturated: across mainstream harness and backbone configurations, the average full pass rate is below 1%. ALE is designed as a living benchmark: its task pool grows continuously as new workflows and industries are onboarded. More broadly, ALE is intended not merely as another leaderboard, but as an instrument for closing the gap between benchmark success and GDP relevant impact.