Flash-WAM: Modality-Aware Distillation for World Action Models

Arman Akbari; Ci Zhang; Arash Akbari; Lin Zhao; Yixiao Chen; Weiwei Chen; Xuan Zhang; Geng Yuan; Yanzhi Wang

Robotics and Embodied Intelligence · Breakthrough tier

Published: 2026-06-03
arXiv: 2606.05254

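Key Points

Problem/Background: Flash-WAM targets the core bottleneck in bringing World Action Models (WAMs) to real-time robot control. Existing WAMs jointly generate future video and actions through iterative diffusion; this performs well but needs tens of denoising steps, so inference latency is too high for real-time closed-loop control.
Method/Mechanism: The paper's key point is not generic distillation but the observation that the video and action streams sit in different SNR/noise regimes, so a single consistency-distillation recipe breaks down. The authors use a linear-gradient-scaling parametrization for the low-noise action stream and a variance-preserving parametrization for the high-noise video stream, yielding modality-aware step distillation.
Results/Evidence: The case for inclusion is that it moves WAMs from a heavyweight imagine-then-act system toward a single-step, real-time interface. On LingBot-VA it cuts per-chunk latency from 8.1 s to 348 ms while reaching 85.5% success on RoboTwin 2.0, 95.7% on LIBERO, and a 60% average on a Unitree G1 humanoid. This is a system-efficiency breakthrough for deploying WAM/VLA models on real robots.
Inclusion rationale: It is not rated one tier higher because it is still a preprint tied to a specific WAM backbone and a specific set of robot benchmarks; its long-term impact depends on whether it becomes a general distillation recipe for other WAM architectures.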
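
To make the "one consistency function per modality" idea concrete, here is a minimal sketch of the standard consistency-model form f(x_t, t) = c_skip(t) * x_t + c_out(t) * F(x_t, t) with the boundary condition f(x, 0) = x. The two coefficient families, the names `linear_coeffs`, `vp_coeffs`, `PARAM_BY_MODALITY`, and `toy_network` are illustrative assumptions; this digest does not give the paper's formulas, so these are stand-ins rather than Flash-WAM's actual parametrizations.

```python
import math


def linear_coeffs(t):
    """Assumed stand-in for a linear family: c_skip = 1 - t, c_out = t."""
    return 1.0 - t, t


def vp_coeffs(t):
    """Assumed stand-in for a variance-preserving family.

    c_skip = cos(pi*t/2), c_out = sin(pi*t/2), so c_skip^2 + c_out^2 = 1.
    """
    angle = 0.5 * math.pi * t
    return math.cos(angle), math.sin(angle)


def consistency_fn(coeffs, network, x_t, t):
    """Evaluate f(x_t, t) = c_skip(t) * x_t + c_out(t) * network(x_t, t)."""
    c_skip, c_out = coeffs(t)
    prediction = network(x_t, t)
    return [c_skip * x + c_out * p for x, p in zip(x_t, prediction)]


# Hypothetical routing: each modality gets its own coefficient family.
PARAM_BY_MODALITY = {"action": linear_coeffs, "video": vp_coeffs}


def toy_network(x_t, t):
    """Placeholder for the distilled student network (not a real model)."""
    return [0.5 * x for x in x_t]


if __name__ == "__main__":
    sample = [1.0, -2.0, 0.25]
    for modality, coeffs in PARAM_BY_MODALITY.items():
        # Boundary condition: at t = 0 the function must return its input.
        assert consistency_fn(coeffs, toy_network, sample, 0.0) == sample
        print(modality, consistency_fn(coeffs, toy_network, sample, 0.5))
```

Both families satisfy the boundary condition but weight the network output differently as t grows, which is the kind of degree of freedom a modality-aware scheme can exploit.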

Abstract

World-action models (WAMs) jointly generate future video and robot actions through iterative diffusion, achieving strong performance on manipulation benchmarks but requiring tens of denoising steps, a cost that precludes real-time control. Step distillation has emerged as the natural remedy, but off-the-shelf methods break down in the joint video-action setting because video and action streams use different SNR-shifted noise schedules and reach training with substantially different marginal noise distributions, an asymmetry that single-modality distillation methods cannot accommodate. We introduce Flash-WAM, a modality-aware step-distillation framework inspired by consistency distillation that selects the consistency function for each modality to match its noise regime: a linear-gradient-scaling parametrization for the action stream’s low-noise regime, paired with a variance-preserving parametrization for the video stream’s high-noise regime, grounded in a structural analysis of the consistency-function family that characterizes the achievable gradient scaling under the consistency boundary condition. Instantiated on LingBot-VA, Flash-WAM compresses inference to a single step in each modality. On RoboTwin 2.0, this reduces per-chunk latency from 8.1 seconds to 348 ms on NVIDIA L40S, a 23× speedup that enables real-time inference. Flash-WAM preserves task success on simulation benchmarks (85.5% RoboTwin 2.0, 95.7% LIBERO) and substantially recovers real-world performance (60% average on a Unitree G1 humanoid robot), while naive consistency distillation drops to 24% at the same step budget.
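
The 23× figure follows directly from the two latencies quoted in the abstract. The sketch below reproduces that arithmetic; the helper names are illustrative, and the chunk-rate calculation assumes chunks are generated back to back with no other overhead, which the abstract does not state.

```python
# Per-chunk latencies reported in the abstract (RoboTwin 2.0, NVIDIA L40S).
BASELINE_LATENCY_S = 8.1     # multi-step diffusion inference
DISTILLED_LATENCY_S = 0.348  # single-step Flash-WAM inference


def speedup(before_s, after_s):
    """Ratio of the old latency to the new latency."""
    return before_s / after_s


def chunks_per_second(latency_s):
    """Upper bound on chunk rate if inference is the only cost (assumption)."""
    return 1.0 / latency_s


if __name__ == "__main__":
    ratio = speedup(BASELINE_LATENCY_S, DISTILLED_LATENCY_S)
    print(f"speedup: {ratio:.1f}x")  # about 23.3x, reported as 23x
    print(f"baseline:  {chunks_per_second(BASELINE_LATENCY_S):.2f} chunks/s")
    print(f"distilled: {chunks_per_second(DISTILLED_LATENCY_S):.2f} chunks/s")
```

Under that assumption the distilled model can produce just under three action chunks per second, versus roughly one chunk every eight seconds before distillation.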
