Causal-JEPA: Learning World Models through Object-Level Latent Interventions

Heejeong Nam; Quentin Le Lidec; Lucas Maes; Yann LeCun; Randall Balestriero

JEPA and predictive world models · Disruptive-tier · Explainer video available

Published: 2026-02-11
arXiv: 2602.11389

Editorial commentary

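This paper is one of the few works in the JEPA line that genuinely pushes toward object-centric world models. Rather than continuing with patch-level latent prediction, it raises the unit of prediction to the object level and uses object-level masking, so the model must rely on the states of the other objects to infer the future representation of the masked object.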

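Methodologically, the authors propose `Causal-JEPA`. The core idea is to represent a video scene as object slots and, during training, apply intervention-style masking to a subset of the objects. The model must then predict the target object's latent trajectory from the remaining objects and the temporal context. This objective explicitly encourages the model to learn interactions between objects rather than memorize local textures or short-term motion patterns. The authors interpret this effect as a causally oriented inductive bias.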
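To make the training signal concrete, here is a minimal sketch of object-level masking with a loss restricted to the masked slots. It is not the authors' implementation. The names `mask_objects`, `predict_from_visible`, `masked_slot_loss`, and `MASK_TOKEN` are hypothetical, the predictor is a toy placeholder (a per-frame mean of the visible slots) standing in for a learned model, and masking every time step of a selected object is a simplifying assumption.

```python
import random

MASK_TOKEN = 0.0  # hypothetical stand-in for a learned mask embedding


def mask_objects(slots, num_masked, rng):
    """Hide every time step of `num_masked` randomly chosen object slots.

    `slots` is a nested list indexed as slots[t][k][d]: time step t,
    object slot k, feature d. Returns (masked_copy, masked_ids).
    """
    num_objects = len(slots[0])
    masked_ids = sorted(rng.sample(range(num_objects), num_masked))
    masked = [
        [
            [MASK_TOKEN] * len(vec) if k in masked_ids else list(vec)
            for k, vec in enumerate(frame)
        ]
        for frame in slots
    ]
    return masked, masked_ids


def predict_from_visible(masked, masked_ids):
    """Toy predictor (assumption): fill each masked slot with the mean of
    the visible slots in the same frame. A real model would be learned."""
    predicted = []
    for frame in masked:
        visible = [vec for k, vec in enumerate(frame) if k not in masked_ids]
        mean = [sum(col) / len(visible) for col in zip(*visible)]
        predicted.append(
            [list(mean) if k in masked_ids else list(vec)
             for k, vec in enumerate(frame)]
        )
    return predicted


def masked_slot_loss(predicted, target, masked_ids):
    """Mean squared error computed only over the masked object slots."""
    total, count = 0.0, 0
    for pred_frame, tgt_frame in zip(predicted, target):
        for k in masked_ids:
            for p, t in zip(pred_frame[k], tgt_frame[k]):
                total += (p - t) ** 2
                count += 1
    return total / count


# Toy data: 3 time steps, 4 object slots, 2 features per slot.
rng = random.Random(0)
slots = [[[float(t + k), float(t * k)] for k in range(4)] for t in range(3)]
masked, masked_ids = mask_objects(slots, num_masked=1, rng=rng)
predicted = predict_from_visible(masked, masked_ids)
loss = masked_slot_loss(predicted, slots, masked_ids)
```

Because the loss only counts the hidden slots and the predictor only sees the visible ones, the only way to lower it is to exploit how the other objects relate to the masked one, which is the intuition behind the interaction-focused objective described above.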

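The value of this work is that it pushes JEPA beyond general-purpose representation learning toward world models "usable for prediction, counterfactual reasoning, and control." The results reported in the abstract are also solid: an absolute gain of about 20% on counterfactual reasoning over the same architecture without object-level masking, and, in control settings, performance comparable to patch-based world models while using only 1% of their latent input features.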

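If I had to pick just one paper from the past two months of JEPA progress to follow, it would be this one. It is not yet a complete new mainstream paradigm, but it clearly goes beyond "rerun JEPA on a different dataset" and qualifies as a near-disruptive candidate.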

Original abstract and Chinese translation

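Chinese translation (rendered in English)

Causal-JEPA: Learning World Models through Object-Level Latent Interventions. World models need strong relational understanding to support prediction, reasoning, and control. Although object-centric representations offer a useful abstraction, they are not enough to capture interaction-dependent dynamics. We therefore propose C-JEPA, a simple and flexible object-centric world model that extends masked joint-embedding prediction from image patches to object-centric representations. By applying object-level masking, which requires an object's state to be inferred from the other objects, C-JEPA induces latent interventions with counterfactual-like effects and prevents shortcut solutions, making interaction reasoning essential. Empirically, C-JEPA delivers consistent gains in visual question answering, with an absolute improvement of about 20% in counterfactual reasoning over the same architecture without object-level masking. On agent control tasks, C-JEPA enables substantially more efficient planning by using only 1% of the total latent input features required by patch-based world models while reaching comparable performance. Finally, we provide a formal analysis showing that object-level masking induces a causal inductive bias through latent interventions. Our code is available at github.com/galilai-group/cjepa.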

Original abstract

World models require robust relational understanding to support prediction, reasoning, and control. While object-centric representations provide a useful abstraction, they are not sufficient to capture interaction-dependent dynamics. We therefore propose C-JEPA, a simple and flexible object-centric world model that extends masked joint embedding prediction from image patches to object-centric representations. By applying object-level masking that requires an object’s state to be inferred from other objects, C-JEPA induces latent interventions with counterfactual-like effects and prevents shortcut solutions, making interaction reasoning essential. Empirically, C-JEPA leads to consistent gains in visual question answering, with an absolute improvement of about 20% in counterfactual reasoning compared to the same architecture without object-level masking. On agent control tasks, C-JEPA enables substantially more efficient planning by using only 1% of the total latent input features required by patch-based world models, while achieving comparable performance. Finally, we provide a formal analysis demonstrating that object-level masking induces a causal inductive bias via latent interventions. Our code is available at github.com/galilai-group/cjepa.

Explainer video

Video page: Bilibili, YouTube

Links

Paper link