SWE-chat: Coding Agent Interactions From Real Users in the Wild

Software Engineering and Coding Agents · Breakthrough tier

Published: 2026-04-24
arXiv: 2604.20779

Inclusion Commentary

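This paper fills a very practical gap in coding agent research: there are plenty of benchmarks, but almost no systematic evidence about how people actually use coding agents in the real world, how much agent-produced code is really adopted, and what failure modes look like in natural settings.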

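The value of SWE-chat is that it is not a synthetic task set but a living dataset drawn from real sessions of open-source developers. It records complete interaction traces, tool calls, and, more importantly, human vs. agent code authorship attribution. That makes it possible to separate "the agent looks like it solved the task" from "the agent's output actually made it into a commit." The results the paper reports are solid: only part of the agent-written code survives into the final commit, and security vulnerabilities are more common in agent-written code than in human-written code.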
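To make the "survives into the final commit" idea concrete, here is a minimal sketch of how such a rate could be computed from line-level authorship attribution. The `AttributedLine` schema, its field names, and the toy session are hypothetical illustrations; they are not the dataset's actual format or the paper's method, and the numbers are not results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttributedLine:
    """One line of code written during a session (hypothetical schema)."""
    author: str            # "agent" or "human"
    in_final_commit: bool  # True if the line appears in the committed code


def survival_rate(lines, author="agent"):
    """Fraction of lines written by `author` that reached the final commit.

    Returns None when the author wrote no lines, to avoid dividing by zero.
    """
    written = [ln for ln in lines if ln.author == author]
    if not written:
        return None
    survived = sum(ln.in_final_commit for ln in written)
    return survived / len(written)


# Toy session with made-up values, for illustration only.
session = [
    AttributedLine("agent", True),
    AttributedLine("agent", False),
    AttributedLine("agent", False),
    AttributedLine("agent", True),
    AttributedLine("human", True),
]

agent_rate = survival_rate(session, "agent")
human_rate = survival_rate(session, "human")
```

A real analysis would also have to decide how to treat lines that are edited rather than dropped before the commit; this sketch treats survival as a simple yes/no flag per line.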

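It merits formal inclusion because it is key data infrastructure for moving coding agents from benchmark-centric to evidence-based evaluation. Going forward, evaluation of agent reliability, human-in-the-loop workflows, security, and tool-use efficiency can all be rebuilt around this kind of real-world data.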

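It is not rated higher because it is still an early dataset-and-analysis effort; whether it becomes the community's default real-world coding-agent evaluation substrate depends on whether it stays open and continuously updated, and on external adoption.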

Links

Paper link