Safety, Governance, and Reliability | Breakthrough tier
Published: 2026-06-02
arXiv: 2606.01166

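Key Points

Problem / Background
This paper targets the central safety shift introduced by computer-use agents: risk often emerges across multi-step file, terminal, browser, and tool trajectories rather than in a single prompt or final response.
Method / Mechanism
BraveGuard mines open research sources for new threats and attack patterns, turns them into executable computer-use tasks, collects agent rollouts, and converts each full trajectory into supervision for a guard model.
Results / Evidence
Its self-evolving aspect is that validation failures and newly discovered threats continuously expand the taxonomy, the synthesized tasks, and the training distribution, instead of relying on a fixed benchmark or static prompt-level safety data. On AgentHazard, under the averaged guard-model setting, detection accuracy rises from 38.79% to 82.38% relative to off-the-shelf guard models.
Why It Is Included
It merits inclusion because it provides a trajectory-grounded safety training pipeline for agents, with reusable value for computer-use agents, tool-execution auditing, open-world red teaming, and guard model training. Its limitation is the current reliance on OpenClaw trajectories and synthetic tasks; generalization across agent frameworks still needs to be verified.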
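
The loop summarized in the key points (expand the taxonomy, synthesize tasks, collect rollouts, derive trajectory-level labels, feed validation failures back in) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not BraveGuard's actual code: `Trajectory`, `DefenseState`, `evolve_round`, `to_training_example`, the callables passed in, the `safe`/`unsafe` label strings, and the threat names in the demo are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Trajectory:
    threat: str              # taxonomy category the task was derived from
    steps: Tuple[str, ...]   # ordered agent actions (tool calls, shell commands, ...)
    harmful: bool            # one label for the whole trajectory, not per step


@dataclass
class DefenseState:
    taxonomy: Set[str] = field(default_factory=set)
    train_set: List[Trajectory] = field(default_factory=list)


def evolve_round(
    state: DefenseState,
    mined_threats: Iterable[str],
    failed_threats: Iterable[str],
    make_tasks: Callable[[str], List[str]],
    run_agent: Callable[[str], Iterable[str]],
    label: Callable[[str, Tuple[str, ...]], bool],
) -> DefenseState:
    """One pass of the loop: grow the taxonomy, then add labeled rollouts."""
    # Newly mined threats and categories the guard failed on both drive new data.
    targets = list(dict.fromkeys([*mined_threats, *failed_threats]))
    state.taxonomy.update(targets)
    for threat in targets:
        for task in make_tasks(threat):
            steps = tuple(run_agent(task))
            state.train_set.append(Trajectory(threat, steps, label(threat, steps)))
    return state


def to_training_example(traj: Trajectory) -> Dict[str, str]:
    """Serialize a whole trajectory into one supervised record (assumed format)."""
    return {"text": "\n".join(traj.steps), "label": "unsafe" if traj.harmful else "safe"}


if __name__ == "__main__":
    # Toy stand-ins for task synthesis, agent execution, and labeling.
    make_tasks = lambda threat: [f"{threat}-task-{i}" for i in range(2)]
    run_agent = lambda task: (f"open {task}", f"exec {task}")
    label = lambda threat, steps: threat != "benign-control"

    state = DefenseState()
    evolve_round(state, ["prompt-injection"], [], make_tasks, run_agent, label)
    # Second round: a new threat plus a category the guard failed on in validation.
    evolve_round(state, ["data-exfiltration"], ["prompt-injection"], make_tasks, run_agent, label)
    print(sorted(state.taxonomy), len(state.train_set))
```

The point of the design is that the training set is never frozen: each round's inputs are whatever was newly mined plus whatever the current guard got wrong, so the data distribution tracks the threat landscape instead of a fixed benchmark.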

Abstract

BraveGuard: From Open-World Threats to Safer Computer-Use Agents

Computer-use agents extend language models from text generation to sustained interaction with files, terminals, browsers, and external tools. This shift creates safety risks that are difficult to detect from isolated prompts or final responses, because harm often emerges only through multi-step execution traces whose individual actions appear locally benign. We introduce BraveGuard, a self-evolving defense framework for training guard models from open-world threat signals and realistic agent trajectories. BraveGuard mines recent research sources to identify emerging risks and attack patterns, instantiates them as executable computer-use tasks, collects agent rollouts, and derives trajectory-level supervision for guard model training. As new threats and validation failures appear, the pipeline can be repeated, yielding an adaptive defense loop rather than a static, benchmark-driven training process. We instantiate BraveGuard by training multiple guard backbones, including Qwen3-Guard and Llama-Guard variants, and evaluate the resulting guards on trajectory-level agent-safety benchmarks. BraveGuard consistently improves safety detection across computer-use trajectories. On AgentHazard, it substantially improves detection accuracy over off-the-shelf guard models, with accuracy increasing from 38.79% to 82.38% under the averaged guard-model setting. These results show that guard supervision grounded in open-world threat discovery and realistic agent execution can improve safety monitoring beyond fixed taxonomies and synthetic prompt-level data. BraveGuard offers a scalable path toward adaptive defenses for computer-use agents facing evolving real-world risks.

Website: https://github.com/Yunhao-Feng/BraveGuard
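
The abstract's claim that harm can emerge from steps that each look locally benign can be made concrete with a toy comparison between judging each action in isolation and judging the whole trace. The sketch below is only an illustration of that idea: the rule-based checks, the `(action, argument)` step format, and the example traces are assumptions made here, and they stand in for the learned guard models the paper actually trains.

```python
from typing import Sequence, Tuple

Step = Tuple[str, str]  # (action, argument): a toy stand-in for one tool call


def step_level_flag(step: Step) -> bool:
    """Judge one action in isolation; only an overtly destructive command is flagged."""
    action, arg = step
    return action == "shell" and "rm -rf /" in arg


def trajectory_level_flag(steps: Sequence[Step]) -> bool:
    """Judge the whole trace: flag a secret-file read that is later sent off-host."""
    read_secret = False
    for action, arg in steps:
        if step_level_flag((action, arg)):
            return True
        if action == "read_file" and arg.endswith(".env"):
            read_secret = True
        if action == "http_post" and read_secret:
            return True
    return False


def accuracy(preds: Sequence[bool], labels: Sequence[bool]) -> float:
    """Fraction of trajectories whose predicted label matches the true label."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


if __name__ == "__main__":
    exfil = [("read_file", "app/.env"), ("shell", "ls"),
             ("http_post", "https://example.com/upload")]
    benign = [("read_file", "README.md"),
              ("http_post", "https://example.com/form")]
    traces, labels = [exfil, benign], [True, False]

    step_preds = [any(step_level_flag(s) for s in t) for t in traces]
    traj_preds = [trajectory_level_flag(t) for t in traces]
    print(accuracy(step_preds, labels), accuracy(traj_preds, labels))
```

In this toy setup the per-step check misses the exfiltration trace because no single action is alarming, while the trace-level check catches it. For scale, the reported AgentHazard change from 38.79% to 82.38% is a gain of 43.59 percentage points, roughly 2.12 times the baseline accuracy.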
