InteractWeb-Bench: Can Multimodal Agent Escape Blind Execution in Interactive Website Generation?

Agents and Autonomous Science · Breakthrough tier

Published: 2026-04-30
arXiv: 2604.27419

Curation notes

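Many web coding benchmarks today assume that user requirements are clear and fully specified, so the agent only has to execute. In real low-code settings, though, the hard part is often that the user's own request is vague, contradictory, or noisy, which pushes the agent into a more fundamental failure mode: blind execution.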

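The value of InteractWeb-Bench is that it formally names this failure mode and turns it into a benchmark. Rather than simply injecting some noise, it models non-expert users through persona-driven instruction perturbations and a unified interaction action space of Clarify / Implement / Verify / Submit, so the benchmark genuinely covers the intent-refinement layer.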
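As a rough illustration of what a Clarify / Implement / Verify / Submit action space might look like in code, here is a minimal sketch. Only the four action names come from the text above; the `Episode` fields, the toy policy in `next_action`, and the `is_blind_execution` check are all hypothetical and are not the benchmark's actual API or scoring rule.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Action(Enum):
    # The four actions named in the digest; everything else here is assumed.
    CLARIFY = "clarify"
    IMPLEMENT = "implement"
    VERIFY = "verify"
    SUBMIT = "submit"


@dataclass
class Episode:
    instruction: str
    open_questions: List[str]  # hypothetical: ambiguities left in the request
    implemented: bool = False
    verified: bool = False
    trace: List[Action] = field(default_factory=list)


def next_action(ep: Episode) -> Action:
    """Toy policy: clarify while ambiguities remain, then build, check, submit."""
    if ep.open_questions:
        return Action.CLARIFY
    if not ep.implemented:
        return Action.IMPLEMENT
    if not ep.verified:
        return Action.VERIFY
    return Action.SUBMIT


def run(ep: Episode, max_steps: int = 20) -> List[Action]:
    """Step the toy policy until it submits or runs out of steps."""
    for _ in range(max_steps):
        action = next_action(ep)
        ep.trace.append(action)
        if action is Action.CLARIFY:
            ep.open_questions.pop(0)  # stand-in for asking a simulated user
        elif action is Action.IMPLEMENT:
            ep.implemented = True
        elif action is Action.VERIFY:
            ep.verified = True
        else:  # SUBMIT ends the episode
            break
    return ep.trace


def is_blind_execution(trace: List[Action], had_ambiguity: bool) -> bool:
    """Assumed definition: the request was ambiguous but the agent never asked."""
    return had_ambiguity and Action.CLARIFY not in trace
```

Under these assumptions, an agent that jumps straight to `IMPLEMENT` and `SUBMIT` on an ambiguous request would be flagged, while one that first resolves its open questions would not.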

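This work merits formal inclusion because it gives multimodal web / coding agents a more durable evaluation interface. What is truly reusable is not the website tasks themselves but the evaluation framework that pushes agents from blind execution toward a clarify-implement-verify loop.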

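It was not placed at a higher tier because its task domain is still concentrated on interactive website generation. The failure mode is quite general, but the benchmark is not yet a unified, overarching benchmark for all computer-use agents.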

Links

Paper link