From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors

安全、治理与可靠性突破级暂无讲解视频

发表时间: 2026-05-29
arXiv: 2605.31042

核心要点

问题/背景: 这篇把 prompt injection 的威胁模型从单步攻击推进到 persistent control：攻击者可以先把控制文本写入本地 workspace，再让 agent 在后续 session 中执行。
方法/机制: ClawTrojan 用本地 agentic harness 模拟多步 trojan，暴露出单步防御看不到的 backdoor planting；DASGuard 则通过 origin tracing 和 sanitized commits 清理不可信控制内容。
结果/证据: 收录价值在于它给 secure computer-use / local-first agents 提供了新的安全边界模型：workspace memory/harness 本身就是可被污染的控制面。
收录价值: 风险与限制：当前仍是 arXiv 初版，核心结论需要跨模型、跨环境和真实部署场景的进一步复现；因此分级为 breakthrough，而不是 disruptive/paradigm。

完整收录解读

这篇把 prompt injection 的威胁模型从单步攻击推进到 persistent control：攻击者可以先把控制文本写入本地 workspace，再让 agent 在后续 session 中执行。

ClawTrojan 用本地 agentic harness 模拟多步 trojan，暴露出单步防御看不到的 backdoor planting；DASGuard 则通过 origin tracing 和 sanitized commits 清理不可信控制内容。

收录价值在于它给 secure computer-use / local-first agents 提供了新的安全边界模型：workspace memory/harness 本身就是可被污染的控制面。

风险与限制：当前仍是 arXiv 初版，核心结论需要跨模型、跨环境和真实部署场景的进一步复现；因此分级为 breakthrough，而不是 disruptive/paradigm。

论文摘要

本文介绍了一种名为ClawTrojan的多步木马攻击基准，该基准在本地代理环境中进行，其中提示注入被写入文件或工具输出，并随后成为持久控制内容。它还提出了DASGuard，该方法扫描敏感本地文件以查找类似控制的文本，追踪其来源，并删除不可信的控制内容，结合了运行时阻止和经过消毒的 workspaces 提交。

英文原文

The paper introduces ClawTrojan, a benchmark for multi-step trojan attacks in local agentic harnesses, where prompt injections are written into files or tool outputs and later become persistent control content. It proposes DASGuard, which scans sensitive local files for control-like text, traces origin, and removes untrusted control content, combining runtime blocking with sanitized workspace commits.

链接

论文链接论文链接代码相关链接

核心要点

论文摘要

相关论文

链接