Robotics and Embodied AI, breakthrough tier
Published
2026-04-30
arXiv
2604.28192

Key points

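Problem / Background
The paper targets a gap in VLA-based robot manipulation: most existing latent-reasoning (reasoning-before-acting) VLA models are still trained with static imitation learning, while online RL typically optimizes only the action space and leaves the physical latent reasoning itself outside the optimization.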
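Method / Mechanism
The core of LaST-R1 is Latent-to-Action Policy Optimization (LAPO), which places latent Chain-of-Thought (CoT) physical reasoning directly inside the RL post-training loop and jointly optimizes the reasoning latents and action generation, so that physical world modeling and execution control reinforce each other.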
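To make "jointly optimizes" concrete, here is a minimal sketch of one way a single policy-gradient term can cover both the latent reasoning tokens and the action tokens. The abstract does not give the LAPO objective, so everything here is an assumption for illustration: the PPO-style clipped surrogate, the function names, the dictionary keys, and the idea that latent tokens have tractable log-probabilities.

```python
import math
from typing import Dict, List, Sequence


def joint_log_prob(latent_logps: Sequence[float], action_logps: Sequence[float]) -> float:
    """Log-probability of one step's full output: latent tokens followed by action tokens.

    Assumption: both token groups come from the same autoregressive policy,
    so their log-probabilities simply add.
    """
    return sum(latent_logps) + sum(action_logps)


def clipped_surrogate(new_logp: float, old_logp: float, advantage: float,
                      clip_eps: float = 0.2) -> float:
    """PPO-style clipped surrogate for one step (an assumed stand-in, not the paper's loss)."""
    ratio = math.exp(new_logp - old_logp)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def joint_policy_loss(steps: List[Dict], clip_eps: float = 0.2) -> float:
    """Negative mean surrogate over steps; gradients would reach latents and actions alike."""
    total = 0.0
    for step in steps:
        new_logp = joint_log_prob(step["new_latent_logps"], step["new_action_logps"])
        old_logp = joint_log_prob(step["old_latent_logps"], step["old_action_logps"])
        total += clipped_surrogate(new_logp, old_logp, step["advantage"], clip_eps)
    return -total / len(steps)
```

The design point is that the probability ratio is computed over the concatenated latent-plus-action sequence. An action-only variant would drop the latent terms from `joint_log_prob`, which corresponds to the action-space-only optimization this entry contrasts LAPO with.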
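Results / Evidence
The authors also propose an adaptive latent CoT mechanism that lets the policy adjust its reasoning horizon dynamically according to the environment state. On the LIBERO benchmark, LaST-R1 reports a 99.9% average success rate with only a one-shot supervised warm-up. In real-world deployments it reports up to a 22.5% average improvement over a SOTA supervised fine-tuning (SFT) approach across four tasks covering single-arm and dual-arm settings.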
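The adaptive reasoning horizon can be pictured as a latent rollout that stops early when the policy signals it has reasoned enough. The abstract does not say how the horizon is chosen, so the stop-score rule, the threshold, and all names below are hypothetical; this is a sketch of the general idea, not the authors' mechanism.

```python
from typing import Callable, List, TypeVar

Latent = TypeVar("Latent")


def adaptive_latent_rollout(
    step_fn: Callable[[Latent], Latent],   # hypothetical: produces the next latent reasoning state
    stop_fn: Callable[[Latent], float],    # hypothetical: stop score in [0, 1] for a latent state
    init_latent: Latent,
    max_horizon: int = 8,
    stop_threshold: float = 0.5,
) -> List[Latent]:
    """Generate latent reasoning steps until the stop score crosses the threshold
    or the maximum horizon is reached, whichever comes first."""
    latents: List[Latent] = []
    latent = init_latent
    for _ in range(max_horizon):
        latent = step_fn(latent)
        latents.append(latent)
        if stop_fn(latent) >= stop_threshold:
            break
    return latents
```

With toy functions such as `step_fn=lambda z: z + 1` and a `stop_fn` that fires once `z >= 3`, a rollout from `0` yields three latent steps, while a `stop_fn` that never fires runs to `max_horizon`. A simple state would thus consume fewer reasoning steps than a hard one.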
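Why it is included
It merits inclusion because it connects VLA models, latent world/physical reasoning, RL post-training, and real-robot deployment in one reusable framework, and it marks an important route for embodied AI from imitation-only VLA toward joint optimization of reasoning and control.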

Original abstract

Title (translated from the Chinese rendering): LaST-R1: reinforcing robotic manipulation through adaptive physical latent reasoning.

Robotic foundation models require reasoning over complex visual scenes to execute adaptive actions in dynamic environments. While recent studies on latent-reasoning Vision-Language-Action (VLA) models have demonstrated the capability to capture fine-grained physical dynamics, they remain predominantly confined to static imitation learning, severely limiting their adaptability and generalization. In this paper, we present LaST-R1, a novel reinforcement learning (RL) post-training framework designed to effectively harness “latent reasoning-before-acting” policies. Specifically, we propose Latent-to-Action Policy Optimization (LAPO), a core RL algorithm that jointly optimizes the latent reasoning process and the action generation. By explicitly embedding latent Chain-of-Thought (CoT) reasoning directly within the RL optimization loop, LAPO stimulates profound physical world modeling, which in turn drives robust execution in interactive environments. Furthermore, an adaptive latent CoT mechanism is introduced, allowing the policy to dynamically modulate its reasoning horizon based on diverse environment states. Experiments show that LaST-R1 achieves a near-perfect 99.9% average success rate on the LIBERO benchmark with only one-shot supervised warm-up, significantly improving convergence speed and performance over prior state-of-the-art (SOTA) methods. In real-world deployments, LaST-R1 yields up to a 22.5% average improvement over SOTA supervised fine-tuning approach across four complex tasks, including both single-arm and dual-arm settings. Finally, LaST-R1 demonstrates strong generalization across simulated and real-world environments.
