ETOM: A Five-Level Benchmark for Evaluating Tool Orchestration within the MCP Ecosystem

Agents and Autonomous Science · Breakthrough tier

Curation and commentary: DAST AI · Inclusion methodology and content transparency

Published: 2026-01-01

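Curator's commentary

This Findings of EACL 2026 paper proposes a five-level evaluation of tool orchestration in the MCP ecosystem. Rather than testing a single tool call, it evaluates in tiers an agent's ability to discover tools, select tools, compose tools, handle dependencies, and coordinate complex workflows.

Its importance lies in the fact that MCP is becoming the interface agents actually use to connect to tools, so a benchmark built around that interface directly serves the evaluation of engineered systems. ETOM breaks tool orchestration down from a loosely defined capability into measurable levels.

By this collection's standards, it merits inclusion under the agent systems track because it offers a reusable evaluation interface and a tiered breakdown of tool-use complexity, rather than just another prompt benchmark.

Its limitation is that the MCP ecosystem is still changing rapidly, so the benchmark's long-term value depends on whether it keeps covering real tools, permissions, security, and error-recovery scenarios.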

Original abstract

We introduce ETOM, a five-level benchmark for evaluating multi-hop, end-to-end tool orchestration by LLM agents within a hierarchical Model-Context Protocol (MCP) ecosystem. Existing benchmarks often assess tools in isolation, overlooking challenges such as functional overlap and cross-server orchestration, which can lead to overly optimistic evaluations. ETOM addresses these gaps by constructing ground truth through “equal function sets”, enabling objective metrics such as F1 score and reducing reliance on LLM-as-a-judge evaluation. Its five-level curriculum systematically tests agent capabilities, from single-tool orchestration to complex cross-server planning, as well as robustness to out-of-scope requests. Experiments reveal that rigid hierarchies can hinder performance without co-designed strategies, and even state-of-the-art agents exhibit systemic weaknesses in robustness. ETOM provides a diagnostic framework to expose these limitations and guide the development of more capable and efficient tool-using agents.
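The abstract says that "equal function sets" make objective metrics such as F1 possible. As a rough illustration only, the sketch below scores a predicted tool list against ground truth in which each required step is a set of interchangeable tools. The function name, the tool names, the greedy matching rule, and the handling of out-of-scope requests (empty ground truth) are all assumptions made for this example, not the paper's actual scoring procedure.

```python
from typing import FrozenSet, Sequence


def f1_with_equal_function_sets(
    predicted: Sequence[str],
    required: Sequence[FrozenSet[str]],
) -> float:
    """Hypothetical F1 over tool selections.

    Each entry in `required` is a set of tools treated as functionally
    equivalent; picking any one of them satisfies that step. Assumes the
    sets do not overlap, so greedy matching is sufficient.
    """
    # Assumed convention: an out-of-scope request has no required tools,
    # and the only fully correct answer is to call nothing.
    if not predicted or not required:
        return 1.0 if not predicted and not required else 0.0

    unmatched = list(required)
    true_positives = 0
    for tool in predicted:
        for i, equivalents in enumerate(unmatched):
            if tool in equivalents:
                true_positives += 1
                del unmatched[i]  # each required step is satisfied at most once
                break

    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(required)
    return 2 * precision * recall / (precision + recall)


# Hypothetical tool names: two geocoders on different servers are equivalent.
ground_truth = [
    frozenset({"maps.geocode", "osm.geocode"}),
    frozenset({"weather.forecast"}),
]
score = f1_with_equal_function_sets(
    ["osm.geocode", "calendar.create_event"], ground_truth
)
print(score)
```

In this example one of the two predicted tools satisfies one of the two required steps, so precision and recall are both 0.5. Choosing both geocoders would not raise the score, since the second one matches a step that is already satisfied.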

Links

Paper link