Efficiently Reconstructing Dynamic Scenes One D4RT at a Time

Chuhan Zhang; Guillaume Le Moing; Skanda Koppula; Ignacio Rocco; Liliane Momeni; Junyu Xie; Shuyang Sun; Rahul Sukthankar; Joëlle K. Barral; Raia Hadsell; Zoubin Ghahramani; Andrew Zisserman; Junlin Zhang; Mehdi S. M. Sajjadi

Multimodal foundation models · Disruptive · Explainer video available

Curation and commentary: DAST AI · Inclusion methodology and content transparency

Published: 2025-12-09
arXiv: 2512.08924

Curator's Commentary

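This paper proposes D4RT, which uses a unified transformer to jointly infer depth, spatiotemporal correspondence, and full camera parameters from monocular video. The goal is to move dynamic scene reconstruction away from heavy optimization and multi-decoder pipelines toward a single feedforward model.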

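The core mechanism is a new querying interface: the model can independently query the 3D position of any point in space and time, rather than densely decoding every frame, and it does not need separate decoders for tasks such as depth, tracking, and camera pose.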
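To make the idea of independent point queries concrete, here is a minimal sketch in plain Python. It is not the D4RT architecture: the class `ToyQueryDecoder`, the `(u, v, t)` query layout, the random weights, and the single cross-attention step are all hypothetical stand-ins, chosen only to show that each query reads from a shared scene encoding on its own and returns one 3D point.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Random rows x cols matrix as nested lists (stand-in for learned weights)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a rows x cols matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ToyQueryDecoder:
    """Hypothetical query-based decoder; NOT the model described in the paper."""

    def __init__(self, dim=8, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.w_query = make_matrix(dim, 3, rng)  # embeds an assumed (u, v, t) query
        self.w_out = make_matrix(3, dim, rng)    # maps attended feature to (x, y, z)

    def decode(self, scene_tokens, query):
        """Decode one query by cross-attending to the shared scene tokens."""
        q = matvec(self.w_query, query)
        scores = [
            sum(a * b for a, b in zip(q, token)) / math.sqrt(self.dim)
            for token in scene_tokens
        ]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        attended = [
            sum(w * token[i] for w, token in zip(weights, scene_tokens)) / total
            for i in range(self.dim)
        ]
        return matvec(self.w_out, attended)

    def decode_many(self, scene_tokens, queries):
        """Queries never interact, so a batch is just a loop over single queries."""
        return [self.decode(scene_tokens, q) for q in queries]


# The video would be encoded once; random tokens stand in for that encoding here.
scene_tokens = make_matrix(16, 8, random.Random(1))
decoder = ToyQueryDecoder(dim=8, seed=0)

# The same pixel (u, v) probed at two times, plus one unrelated point.
queries = [(0.25, 0.5, 0.0), (0.25, 0.5, 1.0), (0.9, 0.1, 0.5)]
points = decoder.decode_many(scene_tokens, queries)
```

Because no query depends on any other, the result for a given query is the same whether it is decoded alone, in a batch, or in a different order. That property is what would let a single decoder serve sparse requests (a handful of tracked points) and dense ones (every pixel of a frame) through the same interface.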

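This makes D4RT lighter in both training and inference while covering dynamic 4D reconstruction, tracking, and camera estimation. The project page and DeepMind's write-up emphasize unified, fast, and scalable 4D scene reconstruction and tracking.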
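A rough count shows why decoding only the points you need can be cheaper than dense per-frame decoding. The sketch below is illustrative arithmetic, not a measurement from the paper: the function names, the resolution, the frame count, and the number of tracked points are all assumed values.

```python
def dense_output_count(height, width, num_frames):
    """Points produced when every pixel of every frame is decoded."""
    return height * width * num_frames


def sparse_track_query_count(num_points, num_frames):
    """Queries needed to follow a chosen set of points through every frame."""
    return num_points * num_frames


# Assumed example: a 48-frame clip at 256 x 256, tracking 100 points.
dense = dense_output_count(256, 256, 48)
sparse = sparse_track_query_count(100, 48)
ratio = dense / sparse
```

Under these assumed numbers the dense path decodes several hundred times more points than the sparse tracking request. The actual savings depend on the model and on how many points a task needs.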

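It merits inclusion because it unifies video understanding, 3D/4D representation, spatiotemporal correspondence, and camera geometry into a queryable feedforward representation interface, with high spillover value for robot world modeling, video world models, and spatial intelligence.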

Abstract

Original abstract (English)

Understanding and reconstructing the complex geometry and motion of dynamic scenes from video remains a formidable challenge in computer vision. This paper introduces D4RT, a simple yet powerful feedforward model designed to efficiently solve this task. D4RT utilizes a unified transformer architecture to jointly infer depth, spatiotemporal correspondence, and full camera parameters from a single video. Its core innovation is a novel querying mechanism that sidesteps the heavy computation of dense, per-frame decoding and the complexity of managing multiple, task-specific decoders. Our decoding interface allows the model to independently and flexibly probe the 3D position of any point in space and time. The result is a lightweight and highly scalable method that enables remarkably efficient training and inference. We demonstrate that our approach sets a new state of the art, outperforming previous methods across a wide spectrum of 4D reconstruction tasks. We refer to the project webpage for animated results.

Explainer Video

Video page: Bilibili · YouTube

Links

Paper link