DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

Agents and Autonomous Science · Breakthrough tier

Published: 2026-05-20
arXiv: 2605.19099

Inclusion notes

DecisionBench targets a core question in agent orchestration: when should an agent hand a subtask to another model, and how can one evaluate whether that delegation actually helped?

The benchmark fixes a task suite, 11 peer models, the call_model/read_profile interfaces, and a multi-axis metric set covering quality, cost, latency, delegation rate, routing fidelity, vendor self-preference, and counterfactual ceiling.
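
To make three of these axes concrete, here is a minimal Python sketch of how delegation rate, routing fidelity, and vendor self-preference might be computed from a log of agent steps. The `Step` record, its field names, and the exact metric definitions are assumptions made for illustration; this digest does not give the paper's formal definitions, and the paper may define them differently.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    """One step of an agent trajectory (hypothetical log schema)."""
    caller_vendor: str              # vendor of the orchestrating agent
    delegated_to: Optional[str]     # peer model that was called, or None if no delegation
    delegate_vendor: Optional[str]  # vendor of that peer, or None
    best_peer: Optional[str]        # peer an oracle would have picked (assumed label)


def _delegated(steps: List[Step]) -> List[Step]:
    return [s for s in steps if s.delegated_to is not None]


def delegation_rate(steps: List[Step]) -> float:
    """Fraction of steps on which the agent delegated at all."""
    if not steps:
        return 0.0
    return len(_delegated(steps)) / len(steps)


def routing_fidelity(steps: List[Step]) -> float:
    """Among delegated steps, fraction routed to the oracle-best peer."""
    delegated = _delegated(steps)
    if not delegated:
        return 0.0
    return sum(s.delegated_to == s.best_peer for s in delegated) / len(delegated)


def vendor_self_preference(steps: List[Step]) -> float:
    """Among delegated steps, fraction sent to a peer from the caller's own vendor."""
    delegated = _delegated(steps)
    if not delegated:
        return 0.0
    return sum(s.delegate_vendor == s.caller_vendor for s in delegated) / len(delegated)


# Toy trajectory with made-up vendors and peers, for illustration only.
example_steps = [
    Step("A", None, None, None),
    Step("A", "m1", "A", "m1"),
    Step("A", "m2", "B", "m3"),
    Step("A", "m3", "A", "m2"),
]
```

Conditioning routing fidelity and self-preference on delegated steps only keeps them independent of how often the agent delegates, which the delegation rate reports separately.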

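It merits formal inclusion because it turns delegation from an empirical prompting trick into a reproducible experimental substrate on which routers, peer memory, and multi-step delegation strategies can be evaluated.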

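It is not rated higher because it is currently mainly an offline evaluation substrate, and it has not yet shown that any delegation-learning method can reliably narrow the counterfactual gap.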
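
One way to read the counterfactual gap is as the mean shortfall between the score an agent actually achieved and the best score any available peer could have reached on the same task. The sketch below works under that assumed definition; the function names and the per-task score layout are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def oracle_ceiling(per_peer_scores: Sequence[Sequence[float]]) -> List[float]:
    """Per-task best score over all peers (assumed notion of a counterfactual ceiling)."""
    return [max(scores) for scores in per_peer_scores]


def counterfactual_gap(achieved: Sequence[float], ceiling: Sequence[float]) -> float:
    """Mean per-task difference between the ceiling and the achieved score."""
    if len(achieved) != len(ceiling):
        raise ValueError("achieved and ceiling must cover the same tasks")
    if not achieved:
        return 0.0
    return sum(c - a for a, c in zip(achieved, ceiling)) / len(achieved)
```

Under this reading, a delegation-learning method would count as effective if it lowered `counterfactual_gap` consistently across tasks and peers, not just on a single run.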

Links

Paper link