MMG2Skill: Can Agents Distill In-the-Wild Guides into Self-Evolving Skills?

Agent Systems and Execution | Breakthrough tier

Published: 2026-06-01
arXiv: 2606.01993

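Key Points

Problem/Background: The paper turns human-oriented how-to guides found on the web into agent-executable skills and frames this as the guide-to-skill learning problem. The goal is to let agents keep extending their capabilities from real-world multimodal knowledge.
Method/Mechanism: MMG2Skill first compiles in-the-wild guides into editable skills, then has a fixed VLM agent invoke those skills to carry out tasks, and revises the skills from trajectory-level root-cause feedback rather than relying directly on benchmark scores.
Results/Evidence: It merits inclusion because it offers a clearly defined new problem, a benchmark, and a closed-loop skill-update framework, in line with this repository's main thread on agent memory and capability extension. Compared with typical skill-generation papers, it puts more weight on real-world guides and execution feedback. The abstract reports macro-average gains of +12.8 to +25.3 percentage points over vanilla baseline agents across six VLM backbones.
Inclusion value: Under the raised bar for agent papers it still barely qualifies. The open limitations are skill quality, robustness to noisy web guides, and cross-domain safety boundaries, all of which need stricter testing.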
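
The compile, execute, and revise cycle described under Method/Mechanism can be illustrated with a minimal Python sketch. Everything here is a hypothetical stand-in and not the authors' implementation: `Skill`, `Trajectory`, `RootCause`, `compile_guide`, `toy_execute`, and `toy_diagnose` are invented for illustration, the "agent" is a stub that fails on vague steps, and the "analyzer" is a string check. The only property taken from the summary is that revision is driven by what the trajectory shows, not by a benchmark score.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Skill:
    """Hypothetical editable skill: a name plus ordered steps."""
    name: str
    steps: List[str]
    revision: int = 0


@dataclass
class Trajectory:
    """What the agent can observe about one attempt (no benchmark score)."""
    actions: List[str]
    observations: List[str]


@dataclass
class RootCause:
    """Hypothetical diagnosis: which step failed and a proposed rewrite."""
    step_index: int
    fix: str


def compile_guide(name: str, guide_text: str) -> Skill:
    """Toy compiler: each non-empty guide line becomes one skill step."""
    steps = [line.strip() for line in guide_text.splitlines() if line.strip()]
    return Skill(name=name, steps=steps)


def revise(skill: Skill, cause: RootCause) -> Skill:
    """Return a new skill with the diagnosed step rewritten."""
    steps = list(skill.steps)
    steps[cause.step_index] = cause.fix
    return Skill(skill.name, steps, skill.revision + 1)


def run_loop(
    skill: Skill,
    execute: Callable[[Skill], Trajectory],
    diagnose: Callable[[Skill, Trajectory], Optional[RootCause]],
    max_rounds: int = 3,
) -> Skill:
    """Execute, diagnose from the trajectory, revise; stop when no cause is found."""
    for _ in range(max_rounds):
        trajectory = execute(skill)
        cause = diagnose(skill, trajectory)
        if cause is None:
            break
        skill = revise(skill, cause)
    return skill


def toy_execute(skill: Skill) -> Trajectory:
    """Stand-in for a fixed agent: it cannot act on vague locations."""
    actions, observations = [], []
    for step in skill.steps:
        actions.append(step)
        if "somewhere" in step:
            observations.append("fail: target not found")
            break
        observations.append("ok")
    return Trajectory(actions, observations)


def toy_diagnose(skill: Skill, trajectory: Trajectory) -> Optional[RootCause]:
    """Stand-in analyzer: blame the first failed step and make it concrete."""
    for i, obs in enumerate(trajectory.observations):
        if obs.startswith("fail"):
            fixed = skill.steps[i].replace("somewhere in the toolbar", "in the File menu")
            return RootCause(step_index=i, fix=fixed)
    return None


guide = """
Open the document
Click Save somewhere in the toolbar
Confirm the dialog
"""
final = run_loop(compile_guide("save-document", guide), toy_execute, toy_diagnose)
print(final.revision, final.steps)
```

In this toy run the second step is rewritten once and the loop then stops because the next trajectory contains no failure. A real system would replace both stubs with model calls.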

Abstract

Abundant procedural knowledge on the Web holds great potential for helping agents solve long-horizon tasks. However, such knowledge is often multimodal, heterogeneous, noisy, and implicitly assumes human executors, making it difficult to use directly as the skills required by agents. To bridge the gap between human-oriented guides and agent-executable skills, we formalize this problem as guide-to-skill learning: converting in-the-wild guides into executable skills and continuously improving them from trajectories observable to the agent. To evaluate the capability of existing agents on this task, we introduce MMG2Skill-Bench, the first benchmark designed for this problem. We further propose MMG2Skill [a], a closed-loop framework that compiles guides into editable skills, conditions a fixed vision-language model (VLM) agent on these skills during execution, and revises the skills from trajectory-level root-cause feedback without using benchmark scores. Across GUI control, open-ended gameplay, and strategic card play with six VLM backbones, MMG2Skill consistently outperforms vanilla baseline agents in every model–domain setting, achieving macro-average gains of +12.8 to +25.3 percentage points across backbones. Ablation studies show that directly prompting agents with raw guides can degrade performance, while both structured skill construction and trajectory-driven revision are necessary for the observed improvements. On success-inferable tasks, analyzer-based early stopping further prevents late-stage performance regressions and saves 25%–53% of attempts when the success signal is properly calibrated.

[a] https://github.com/NJU-LINK/MMG2Skill
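
The early-stopping claim in the abstract comes down to simple arithmetic, which a small Python sketch can make concrete. This is an illustration under stated assumptions, not the paper's analyzer: `attempts_with_early_stop` and the `inferred_success` callback are hypothetical names, and the budget of 8 attempts is made up. The sketch only shows how a "fraction of attempts saved" figure can be computed once some analyzer decides that an attempt has succeeded.

```python
from typing import Callable, Tuple


def attempts_with_early_stop(
    budget: int,
    inferred_success: Callable[[int], bool],
) -> Tuple[int, float]:
    """Run up to `budget` attempts, stopping after the first attempt that the
    (hypothetical) analyzer judges successful.

    Returns (attempts_used, fraction_of_budget_saved).
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    used = budget  # default: no success signal, so the whole budget is spent
    for attempt in range(1, budget + 1):
        if inferred_success(attempt):
            used = attempt
            break
    return used, (budget - used) / budget


# Illustrative numbers only: the analyzer first infers success on attempt 5 of 8.
used, saved = attempts_with_early_stop(8, lambda attempt: attempt >= 5)
print(used, saved)  # 5 0.375
```

The saving is only meaningful if the success signal is well calibrated, which is the condition the abstract attaches to its 25%–53% figure. An analyzer that fires too early would stop before the task is actually solved.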

Links

Paper: arXiv:2606.01993
Code: https://github.com/NJU-LINK/MMG2Skill
