ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents

Fanqing Meng; Lingxiao Du; Zijian Wu; Guanzheng Chen; Xiangyan Liu; Jiaqi Liao; Chonghe Jiang; Zhenglin Wan; Jiawei Gu; Pengfei Zhou; Rui Huang; Ziqi Zhao; Shengyuan Ding; Ailing Yu; Bo Peng; Bowei Xia; Hao Sun; Haotian Liang; Ji Xie; Jiajun Chen; Jiajun Song; Liu Yang; Ming Xu; Qionglin Qiu; Runhao Fu; Shengfang Zhai; Shijian Wang; Tengfei Ma; Tianyi Wu; Weiyang Jin; Yan Wang; Yang Dai; Yao Lai; Youwei Shu; Yue Liu; Yunzhuo Hao; Yuwei Niu; Jinkai Huang; Jiayuan Zhuo; Zhennan Shen; Linyu Wu; Hannah Yao; Charles Chen; Cihang Xie; Yuyin Zhou; Jiaheng Zhang; Zeyu Zheng; Mengkang Hu; Michael Qizhe Shieh

Category: Agents and Autonomous Science; rating tier: Breakthrough

Published: 2026-04-26
arXiv: 2604.23781

Why it was included

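What ClawMark really fills is the evaluation gap for persistent coworker agents. It no longer assumes that an agent finishes its work within one static session. Instead, it stretches tasks across multiple days and multiple turns, in a state space where the external environment keeps changing, which is closer to real office collaboration than ordinary web or tool benchmarks.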
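To make the multi-turn, multi-day setting concrete, here is a minimal sketch of an episode in which the world persists across turns and can change independently of the agent. The names `World`, `Turn`, `run_episode`, and `toy_agent` are hypothetical illustrations, not ClawMark's actual harness API.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class World:
    """Hypothetical, heavily simplified state: one inbox and one calendar."""
    inbox: list[str] = field(default_factory=list)
    calendar: dict[str, str] = field(default_factory=dict)


def no_update(world: World) -> None:
    """Default exogenous update: the world does not change."""


@dataclass
class Turn:
    day: int
    instruction: str
    # Applied to the world BEFORE the agent acts on this turn.
    exogenous_update: Callable[[World], None] = no_update


def run_episode(world: World, turns: list[Turn], agent) -> World:
    """Run turns in order; state persists and may shift between turns."""
    for turn in turns:
        turn.exogenous_update(world)      # environment changes on its own
        agent(world, turn.instruction)    # agent acts on the updated state
    return world


def toy_agent(world: World, instruction: str) -> None:
    # Toy policy: book the sync at the time named in the latest email.
    if world.inbox:
        world.calendar["sync"] = world.inbox[-1]


def reschedule(world: World) -> None:
    # A new email arrives between day 1 and day 2.
    world.inbox.append("15:00")


world = World(inbox=["10:00"])
turns = [Turn(1, "Schedule the sync"), Turn(2, "Confirm the sync", reschedule)]
final = run_episode(world, turns, toy_agent)
```

An agent that cached the day-1 inbox and never re-read it would leave the meeting at 10:00, which is the kind of stale-state failure a static, single-session benchmark cannot surface.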

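The other reason it deserves inclusion is its solid evaluation design: five stateful services, 100 tasks, 13 professional scenarios, and 1,537 deterministic Python checkers, with scoring that does not depend on LLM-as-judge. This rule-based verification matters because relying on a subjective judge in a multi-day, multimodal environment would introduce substantial noise.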
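As an illustration of what a deterministic checker over post-execution service state might look like, the sketch below inspects a snapshot of service state and returns a boolean, with no model call involved. The state layout, event identifier, email address, and function names are all assumptions made for this example; the benchmark's real checkers and schemas may differ.

```python
from typing import Callable

# Hypothetical snapshot of service state taken after the agent has finished.
State = dict


def check_meeting_moved(state: State) -> bool:
    """Pass only if the calendar event exists with the expected start time."""
    event = state["calendar"].get("vendor-review")
    return event is not None and event["start"] == "2026-05-12T15:00"


def check_confirmation_sent(state: State) -> bool:
    """Pass only if a sent email to the vendor mentions the new time."""
    return any(
        msg["to"] == "vendor@example.com" and "15:00" in msg["body"]
        for msg in state["email"]["sent"]
    )


def run_checkers(state: State, checkers: list[Callable[[State], bool]]) -> dict[str, bool]:
    """Run each checker; malformed or missing state counts as a failure."""
    results = {}
    for checker in checkers:
        try:
            results[checker.__name__] = bool(checker(state))
        except (KeyError, TypeError):
            results[checker.__name__] = False
    return results


state = {
    "calendar": {"vendor-review": {"start": "2026-05-12T15:00"}},
    "email": {"sent": []},  # the agent never sent the confirmation
}
results = run_checkers(state, [check_meeting_moved, check_confirmation_sent])
```

Because each checker is a pure function of the final state, running it twice on the same snapshot gives the same verdict, which is the property that a subjective judge cannot guarantee.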

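The most informative result is not that some model scores high, but that strict Task Success remains very low (20.0% at best, against a top weighted score of 75.8) and that performance drops markedly after the first exogenous update. This pins down persistent state tracking and changing-world adaptation as core open problems in agent research.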
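The gap between a high weighted score and a low strict Task Success is easy to reproduce with toy numbers. The sketch below assumes that the weighted score is the weight-normalized share of passed checkers and that strict success requires every checker to pass; the paper's exact formulas may differ, and the task data here is invented purely for illustration.

```python
def weighted_score(results: list[tuple[float, bool]]) -> float:
    """Percentage of total checker weight that passed (assumed definition)."""
    total = sum(weight for weight, _ in results)
    if total == 0:
        raise ValueError("task has no weighted checkers")
    passed = sum(weight for weight, ok in results if ok)
    return 100.0 * passed / total


def strict_success(results: list[tuple[float, bool]]) -> bool:
    """A task succeeds strictly only if every checker passes."""
    return all(ok for _, ok in results)


# Made-up tasks: each entry is (checker weight, passed?).
tasks = {
    "task_a": [(1.0, True), (1.0, True), (2.0, True)],   # everything passes
    "task_b": [(1.0, True), (1.0, True), (2.0, False)],  # misses one heavy check
    "task_c": [(1.0, True), (3.0, True), (1.0, False)],  # misses one light check
}

mean_weighted = sum(weighted_score(r) for r in tasks.values()) / len(tasks)
success_rate = 100.0 * sum(strict_success(r) for r in tasks.values()) / len(tasks)
```

In this toy set the mean weighted score is about 76.7 while only one task in three succeeds strictly, showing how a single missed checker per task keeps partial credit high and end-to-end completion low.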

It is not rated higher because the current benchmark still focuses on coworker-style office workflows. The direction is strong, but it does not yet cover broader real-world agent operating environments.

Abstract

Language-model agents are increasingly used as persistent coworkers that assist users across multiple working days. During such workflows, the surrounding environment may change independently of the agent: new emails arrive, calendar entries shift, knowledge-base records are updated, and evidence appears across images, scanned PDFs, audio, video, and spreadsheets. Existing benchmarks do not adequately evaluate this setting because they typically run within a single static episode and remain largely text-centric. We introduce ClawMark, a benchmark for coworker agents built around multi-turn multi-day tasks, a stateful sandboxed service environment whose state evolves between turns, and rule-based verification. The current release contains 100 tasks across 13 professional scenarios, executed against five stateful sandboxed services (filesystem, email, calendar, knowledge base, spreadsheet) and scored by 1,537 deterministic Python checkers over post-execution service state; no LLM-as-judge is invoked during scoring. We benchmark seven frontier agent systems. The strongest model reaches 75.8 weighted score, but the best strict Task Success is only 20.0%, indicating that partial progress is common while complete end-to-end workflow completion remains rare. Turn-level analysis shows that performance drops after the first exogenous environment update, highlighting adaptation to changing state as a key open challenge. We release the benchmark, evaluation harness, and construction pipeline to support reproducible coworker-agent evaluation.
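A turn-level analysis of the kind the abstract mentions can be sketched as grouping checker outcomes by turn and comparing pass rates. The record format and the numbers below are hypothetical and are not taken from the paper's results.

```python
from collections import defaultdict


def pass_rate_by_turn(records: list[tuple[int, bool]]) -> dict[int, float]:
    """Map each turn index to the fraction of its checkers that passed."""
    counts = defaultdict(lambda: [0, 0])  # turn -> [passed, total]
    for turn, passed in records:
        counts[turn][0] += int(passed)
        counts[turn][1] += 1
    return {turn: p / n for turn, (p, n) in sorted(counts.items())}


# Made-up outcomes: turn 2 follows the first exogenous update.
records = [
    (1, True), (1, True), (1, True), (1, False),
    (2, True), (2, False), (2, False), (2, False),
]
rates = pass_rate_by_turn(records)
```

With these invented records the pass rate falls from 0.75 on turn 1 to 0.25 on turn 2, which is the shape of drop the abstract describes after the first environment update.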

Links

Paper link