Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories

Agent Systems And Execution (breakthrough tier)

Published: 2026-06-01
arXiv: 2606.02060

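Key Points

Problem/background: The paper targets a key evaluation blind spot for deep-research agents: scoring only the final answer cannot reveal which segment of a long trajectory (search, evidence inspection, or synthesis) led to the unreliable conclusion.
Method/mechanism: The authors convert real agent logs into semantic spans, build TELBench, and propose DRIFT, a claim-centric auditing framework that traces claims, their evidential support, and points of conflict along the trajectory.
Results/evidence: Across model families and auditing frameworks, DRIFT improves span-level error localization and first-error accuracy by up to 30 percentage points. The work moves agent reliability evaluation from outcome-level scoring to process-level localization and provides a reusable error-localization benchmark and auditing paradigm, with direct value for research agents, tool-use agents, and production observability.
Inclusion value: Under the current, stricter bar for agent papers it still qualifies, because the contribution is not a routine agent trick but a reusable diagnostic/auditing interface. The limitation is that the annotation process and task distribution still need validation across more agent frameworks.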
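To make the claim-centric idea concrete, here is a minimal sketch of an audit loop over semantic spans. It is not the paper's DRIFT implementation: the `Span` fields, the `audit` function, and the toy trajectory are all hypothetical, and claims are plain strings that are assumed to be already extracted and matched (a real system would presumably need a model for that step). The sketch walks the trajectory in order, accumulates which claims the evidence seen so far supports or refutes, and flags a span only when it asserts an unsupported or refuted claim and also feeds the final answer.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One semantic span of a trajectory (hypothetical schema, not from the paper)."""
    span_id: int
    kind: str  # illustrative label, e.g. "search", "inspect", "synthesis"
    asserts: set[str] = field(default_factory=set)      # claims the agent states here
    supports: set[str] = field(default_factory=set)     # claims backed by evidence in this span
    contradicts: set[str] = field(default_factory=set)  # claims refuted by evidence in this span
    on_answer_path: bool = False                         # whether this span feeds the final answer


def audit(trajectory: list[Span]) -> list[int]:
    """Return ids of spans that put an unsupported or refuted claim on the answer path."""
    supported: set[str] = set()
    refuted: set[str] = set()
    flagged: list[int] = []
    for span in trajectory:
        # Evidence accumulates as the trajectory unfolds.
        supported |= span.supports
        refuted |= span.contradicts
        bad_claims = {c for c in span.asserts if c in refuted or c not in supported}
        # Tentative hypotheses off the answer path are treated as harmless.
        if bad_claims and span.on_answer_path:
            flagged.append(span.span_id)
    return flagged


# Toy trajectory: span 1 is a tentative hypothesis, span 3 repeats it after it was refuted.
example = [
    Span(0, "search", supports={"A was founded in 1990"}),
    Span(1, "hypothesis", asserts={"B is the CEO"}),
    Span(2, "inspect", contradicts={"B is the CEO"}),
    Span(3, "synthesis", asserts={"A was founded in 1990", "B is the CEO"}, on_answer_path=True),
]
print(audit(example))
```

In this toy case only the synthesis span is flagged. The earlier hypothesis is unsupported but never reaches the answer path, which mirrors the distinction the benchmark draws between harmful error spans and tentative hypotheses or harmless noise. One simplification to note: evidence that arrives later does not retroactively flag earlier spans.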

Abstract

Deep-research agents solve tasks through long trajectories of search, tool use, evidence inspection, and answer synthesis. Evaluation based on final answers shows whether an agent succeeds, but not which parts of the trajectory make the answer unreliable. We study span-level error localization for deep-research agents. We collect 2,790 real trajectories from two agent frameworks, three backbone models, and three benchmarks, convert raw logs into semantic spans, and annotate harmful error spans through LLM-assisted expert review. From these annotations, we build TELBench, a 1,000-instance benchmark for identifying error spans among normal exploration, failed searches, tentative hypotheses, and harmless noise. We further propose DRIFT, a claim-centric auditing framework that tracks agent claims, checks their support in trajectory evidence, and marks spans where unsupported or conflicting claims affect the answer path. Experiments across model families and auditing frameworks show that DRIFT improves span-level error localization and first-error accuracy by up to 30 percentage points. Our work provides a process-level view of reliability in deep-research agents.

TELBench dataset: https://huggingface.co/datasets/NJU-LINK/TELBench
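The abstract names two quantities, span-level error localization and first-error accuracy, without giving their formulas here. The sketch below shows one plausible way to score a single trajectory. The definitions are assumptions, not the paper's: `span_f1` is an F1 over predicted versus annotated error span ids, `first_error_hit` checks whether the earliest predicted span matches the earliest annotated one, and span ids are assumed to increase with position in the trajectory.

```python
def span_f1(pred: set[int], gold: set[int]) -> float:
    """F1 between predicted and annotated error span ids (assumed metric definition)."""
    if not pred and not gold:
        return 1.0  # nothing to find and nothing predicted
    true_pos = len(pred & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


def first_error_hit(pred: set[int], gold: set[int]) -> bool:
    """True when the earliest predicted span is the earliest annotated error span."""
    if not gold:
        return not pred  # an error-free trajectory is a hit only with no predictions
    return bool(pred) and min(pred) == min(gold)


# One trajectory: the auditor flags spans 3 and 5, annotators marked 3 and 4.
print(span_f1({3, 5}, {3, 4}), first_error_hit({3, 5}, {3, 4}))
```

Averaging `first_error_hit` over a benchmark would give a first-error accuracy under these assumed definitions. Scoring the earliest error separately matters because later errors in a trajectory are often downstream of the first one.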
