MalSkillBench: A Runtime-Verified Benchmark of Malicious Agent Skills

Wenbo Guo; Wei Zeng; Chengwei Liu; Xiaojun Jia; Yijia Xu; Lei Tang; Yong Fang; Yang Liu

Safety, Governance, and Reliability | Breakthrough tier

Published: 2026-06-08
arXiv: 2606.07131

Key Points

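Problem/Background: The paper targets a key security blind spot in the agent skill ecosystem: a skill is at once a code package and a set of natural-language instructions written for the agent, and it may also carry tool permissions, so neither a traditional supply-chain scanner nor a standalone prompt-injection detector covers it adequately.
Method/Mechanism: The authors propose MalSkillBench, a runtime-verified benchmark of malicious agent skills. The dataset contains 3,944 malicious skills and 4,000 matched benign skills, and relies on a closed-loop generate-verify pipeline (Docker sandbox, system-call monitoring, and an LLM judge) to ensure that the malicious behavior actually fires. The paper also presents a three-dimensional taxonomy and measures existing detectors on code injection, prompt injection, and agent control-p...
Results/Evidence: The paper merits formal inclusion because it establishes a reusable security evaluation interface for the skill marketplace / agent capability extension line of engineering work. This repository already tracks agent skills and capability acquisition; MalSkillBench adds the security boundary: how to define, generate, verify, and detect malicious skills.
Inclusion Value: It is not rated a tier higher because it is primarily a benchmark and measurement framework rather than a complete defense system, and its data generation and LLM judge still need independent replication. As one of the first systematic benchmarks for agent skill supply-chain security, however, it already has long-term citation value.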

Full Inclusion Commentary

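The paper targets a key security blind spot in the agent skill ecosystem: a skill is at once a code package and a set of natural-language instructions written for the agent, and it may also carry tool permissions, so neither a traditional supply-chain scanner nor a standalone prompt-injection detector covers it adequately.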

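The authors propose MalSkillBench, a runtime-verified benchmark of malicious agent skills. The dataset contains 3,944 malicious skills and 4,000 matched benign skills, and relies on a closed-loop generate-verify pipeline (Docker sandbox, system-call monitoring, and an LLM judge) to ensure that the malicious behavior actually fires. The paper also presents a three-dimensional taxonomy and measures how existing detectors fail on code injection, prompt injection, and agent control-plane attacks.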
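
The closed-loop admission step can be pictured roughly as follows. This is a minimal sketch and not the authors' implementation: the `Verdict` structure, the `generate` and `verify` callables, the round limit, and the rule that both the system-call monitor and the LLM judge must agree are all assumptions made for illustration. In a real setup, `verify` would run the candidate skill inside a sandbox and collect both signals.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    """Outcome of one sandbox run (hypothetical structure)."""
    syscall_hit: bool      # monitor observed the expected malicious behavior
    judge_confirms: bool   # LLM judge agrees the behavior is malicious
    notes: str = ""        # feedback handed to the next generation round

    @property
    def verified(self) -> bool:
        # Assumption of this sketch: admit only when both signals agree.
        return self.syscall_hit and self.judge_confirms


def generate_verify_feedback(
    generate: Callable[[Optional[str]], str],
    verify: Callable[[str], Verdict],
    max_rounds: int = 3,
) -> Optional[str]:
    """Return the first candidate whose behavior is verified, else None."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        candidate = generate(feedback)   # generation conditioned on feedback
        verdict = verify(candidate)      # sandbox run + judge (stubbed below)
        if verdict.verified:
            return candidate             # admitted into the benchmark
        feedback = verdict.notes         # failed: shape the next attempt
    return None                          # never verified: sample is dropped


# Stubs standing in for the generator and the sandboxed verifier.
def stub_generate(feedback: Optional[str]) -> str:
    return "skill-v2" if feedback else "skill-v1"


def stub_verify(candidate: str) -> Verdict:
    ok = candidate == "skill-v2"
    return Verdict(ok, ok, "" if ok else "payload never executed")


admitted = generate_verify_feedback(stub_generate, stub_verify)
rejected = generate_verify_feedback(
    stub_generate, lambda c: Verdict(False, True, "no syscall evidence")
)
```

The point of the loop is that a sample is kept only because its behavior was observed at runtime, so unverified candidates never reach the dataset, and each failure note feeds the next generation attempt.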

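The paper merits formal inclusion because it establishes a reusable security evaluation interface for the skill marketplace / agent capability extension line of engineering work. This repository already tracks agent skills and capability acquisition; MalSkillBench adds the security boundary: how to define, generate, verify, and detect malicious skills.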
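
As a rough illustration of what such an evaluation interface might compute, the sketch below scores a detector per taxonomy cell on the malicious side and reports a false-positive rate on the matched benign side. The `Sample` type, the `evaluate` function, and the cell values are hypothetical placeholders; they are not the paper's API or its actual taxonomy labels.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Optional, Tuple

# (attack vector, behavior, insertion strategy); values used below are made up.
Cell = Tuple[str, str, str]


class Sample(NamedTuple):
    cell: Optional[Cell]   # taxonomy cell for malicious skills, None for benign
    flagged: bool          # whether the detector flagged this skill


def evaluate(samples: Iterable[Sample]) -> Tuple[Dict[Cell, float], float]:
    """Return (recall per taxonomy cell, false-positive rate on benign skills)."""
    hits: Dict[Cell, int] = defaultdict(int)
    totals: Dict[Cell, int] = defaultdict(int)
    benign_total = 0
    benign_flagged = 0
    for s in samples:
        if s.cell is None:
            benign_total += 1
            benign_flagged += int(s.flagged)
        else:
            totals[s.cell] += 1
            hits[s.cell] += int(s.flagged)
    recall = {cell: hits[cell] / totals[cell] for cell in totals}
    fpr = benign_flagged / benign_total if benign_total else 0.0
    return recall, fpr


samples = [
    Sample(("code", "exfiltration", "appended"), True),
    Sample(("code", "exfiltration", "appended"), False),
    Sample(("prompt", "exfiltration", "inline"), False),
    Sample(None, False),
    Sample(None, True),
]
recall, fpr = evaluate(samples)
```

Reporting recall per cell rather than one aggregate number makes it visible when a detector handles one attack vector well and misses another entirely, which a single overall score would hide.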

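It is not rated a tier higher because it is primarily a benchmark and measurement framework rather than a complete defense system, and its data generation and LLM judge still need independent replication. As one of the first systematic benchmarks for agent skill supply-chain security, however, it already has long-term citation value.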

Original Abstract

AI coding agents such as Claude Code and Gemini CLI increasingly extend themselves with third-party skills, markdown packages that bundle natural-language instructions, executable scripts, and tool permissions. Because a skill is at once executable code and agent-facing instruction, it introduces a software supply chain dependency whose risk is neither pure code nor pure prompt. Detection tools have never been measured against verified ground truth that spans this hybrid space, leaving their effectiveness unknown and wild-only evaluations systematically biased. We present MalSkillBench, the first runtime-verified benchmark of malicious agent skills. It contains 3,944 malicious skills labeled along a three-dimensional taxonomy spanning 108 (attack vector, behavior, insertion strategy) cells. Of these, 3,214 come from a closed-loop Generate-Verify-Feedback pipeline that admits only samples whose malicious behavior fires inside a Docker sandbox under system-call monitoring and an LLM judge, with verification feedback shaping subsequent generation. We complement it with

Links

Paper link
