WebXSkill: Skill Learning for Autonomous Web Agents

Zhaoyang Wang; Qianhui Wu; Xuchao Zhang; Chaoyun Zhang; Wenlin Yao; Fazle Elahi Faisal; Baolin Peng; Si Qin; Suman Nath; Qingwei Lin; Chetan Bansal; Dongmei Zhang; Saravan Rajmohan; Jianfeng Gao; Huaxiu Yao

Agents and Autonomous Science · Breakthrough tier

Published: 2026-04-14
arXiv: 2604.13318

Editorial Commentary

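WebXSkill addresses a long-standing pain point of web agents: workflows they have already completed cannot be reliably reused. It argues that existing skill representations suffer from a grounding gap: purely textual skills cannot be executed, while purely code-based skills are hard for the agent to understand, recover from, and generalize.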

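The method represents each skill as a parameterized action program paired with step-by-step natural-language guidance, balancing executability with interpretability. The system comprises skill extraction, URL-graph organization, retrieval, and two deployment modes, grounded and guided: stronger models can invoke a skill directly as a tool, while weaker models can follow the guidance step by step and retain local autonomy.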
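To make the pairing of an action program with step-level guidance concrete, here is a minimal Python sketch. It is not the WebXSkill implementation: the `Step` and `Skill` classes, the action-string syntax, and the `execute` callback are all hypothetical names chosen for illustration. It only shows how a single skill object could serve both a grounded mode (run all actions) and a guided mode (hand the agent numbered instructions).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    action: str    # hypothetical action template with {placeholders}
    guidance: str  # natural-language guidance template for the same step


@dataclass
class Skill:
    name: str
    params: list[str]
    steps: list[Step]

    def _check(self, args: dict[str, str]) -> None:
        # Refuse to run or render a skill with unfilled parameters.
        missing = [p for p in self.params if p not in args]
        if missing:
            raise ValueError(f"missing parameters: {missing}")

    def run_grounded(self, args: dict[str, str],
                     execute: Callable[[str], None]) -> int:
        """Grounded mode: pass every instantiated action to `execute`, in order."""
        self._check(args)
        for step in self.steps:
            execute(step.action.format(**args))
        return len(self.steps)

    def render_guided(self, args: dict[str, str]) -> str:
        """Guided mode: return numbered instructions for the agent to follow."""
        self._check(args)
        return "\n".join(
            f"{i}. {step.guidance.format(**args)}"
            for i, step in enumerate(self.steps, 1)
        )


# Illustrative skill; the selectors and action syntax are assumptions.
search_skill = Skill(
    name="search_product",
    params=["query"],
    steps=[
        Step('fill("#search", "{query}")', 'Type "{query}" into the search box.'),
        Step('press("#search", "Enter")', "Press Enter to submit the search."),
    ],
)

executed: list[str] = []
step_count = search_skill.run_grounded({"query": "red shoes"}, executed.append)
instructions = search_skill.render_guided({"query": "red shoes"})
```

In this sketch `execute` simply records the action strings; a real agent would route them to a browser driver. Keeping action and guidance in the same `Step` is what lets an agent fall back from grounded to guided execution at the step where something fails.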

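Its value lies in giving agent capability extension a concrete, engineering-ready skill interface rather than abstract memory or ordinary prompt reuse. The transfer results across WebArena and WebVoyager suggest it could become a reusable capability layer for long-horizon web-agent operation.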

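The main limitation is that skills come from synthetic trajectories; real-world page changes, permission boundaries, and high-risk operations such as payments and account actions are not yet adequately covered. In addition, skill-graph maintenance, conflict resolution, and security auditing still need more mature mechanisms.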

Abstract

Autonomous web agents powered by large language models (LLMs) have shown promise in completing complex browser tasks, yet they still struggle with long-horizon workflows. A key bottleneck is the grounding gap in existing skill formulations: textual workflow skills provide natural language guidance but cannot be directly executed, while code-based skills are executable but opaque to the agent, offering no step-level understanding for error recovery or adaptation. We introduce WebXSkill, a framework that bridges this gap with executable skills, each pairing a parameterized action program with step-level natural language guidance, enabling both direct execution and agent-driven adaptation. WebXSkill operates in three stages: skill extraction mines reusable action subsequences from readily available synthetic agent trajectories and abstracts them into parameterized skills, skill organization indexes skills into a URL-based graph for context-aware retrieval, and skill deployment exposes two complementary modes, grounded mode for fully automated multi-step execution and guided mode where skills serve as step-by-step instructions that the agent follows with its native planning. On WebArena and WebVoyager, WebXSkill improves task success rate by up to 9.8 and 12.9 points over the baseline, respectively, demonstrating the effectiveness of executable skills for web agents. The code is publicly available at https://github.com/aiming-lab/WebXSkill.
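The skill-organization stage mentioned in the abstract can be pictured with a simplified stand-in. The abstract does not specify the graph's structure, so the sketch below assumes the simplest possible variant: skills are attached to URL path nodes, and retrieval walks from the current page's path up to the site root. The `UrlSkillIndex` class and the example URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


class UrlSkillIndex:
    """Toy URL-keyed skill index (an assumption, not the paper's graph)."""

    def __init__(self) -> None:
        self._by_node: dict[tuple[str, str], list[str]] = defaultdict(list)

    @staticmethod
    def _node(url: str) -> tuple[str, str]:
        # A node is (host, path without trailing slash); the root path is "".
        parts = urlparse(url)
        return parts.netloc, parts.path.rstrip("/")

    def add(self, url: str, skill_name: str) -> None:
        self._by_node[self._node(url)].append(skill_name)

    def retrieve(self, url: str) -> list[str]:
        """Collect skills from the current path and each ancestor path."""
        host, path = self._node(url)
        found: list[str] = []
        while True:
            found.extend(self._by_node.get((host, path), []))
            if not path:
                break
            path = path.rsplit("/", 1)[0]  # move one level toward the root
        return found


index = UrlSkillIndex()
index.add("https://shop.example/", "search_product")
index.add("https://shop.example/cart", "apply_coupon")

on_checkout = index.retrieve("https://shop.example/cart/checkout")
on_other_site = index.retrieve("https://other.example/cart")
```

Under these assumptions, the most page-specific skills come first in the result, and skills registered on a different host are never returned.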
