Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models

Mahtab Bigverdi; Linjie Li; Weikai Huang; Yiming Liu; Jaemin Cho; Jieyu Zhang; Tuhin Kundu; Chris Dangjoo Kim; Zelun Luo; Linda Shapiro; Ranjay Krishna

Multimodal foundation models · Breakthrough

Published: 2026-06-02
arXiv: 2606.03988

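Key points

Problem/background: The paper targets a weakness of VLMs in spatial reasoning and proposes Imaginative Perception Tokens (IPT), intermediate perceptual tokens with which the model represents viewpoints or path structure that it has not directly observed.
Method/mechanism: IPT does not require the model to generate images at inference time. Instead, it supplies an interpretable perceptual intermediate representation as supervision during training, covering perspective taking, path tracing, and multiview counting.
Results/evidence: In the experiments, IPT supervision often outperforms textual chain-of-thought training (for example, a 3.4% accuracy gain on multiview counting), which points to a modality mismatch when spatial computation is forced through a chain of language.
Why it is included: It offers a new supervision interface for spatial imagination and for reasoning about unobserved structure in multimodal models, with spillover value for embodied AI, 3D/4D reasoning, and visual world models.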
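To make the difference between label-only supervision and IPT supervision concrete, here is a minimal Python sketch. It is an illustration under assumptions, not the authors' implementation: the `IPTExample` class, the `<ipt>`/`</ipt>` delimiters, and the string token IDs are all hypothetical, and the digest does not describe how BAGEL actually encodes the intermediate imagination.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IPTExample:
    """One supervised example (hypothetical layout, not from the paper)."""
    question: str
    imagination_tokens: List[str]  # stand-ins for the ground-truth intermediate imagination
    answer: str


def training_target(ex: IPTExample, use_ipt: bool) -> List[str]:
    """Build the sequence the model is trained to produce.

    With use_ipt=True the imagined percept precedes the answer;
    with use_ipt=False the target is the final answer only (label-only).
    """
    target: List[str] = []
    if use_ipt:
        # Hypothetical delimiters marking the imagined percept.
        target += ["<ipt>"] + ex.imagination_tokens + ["</ipt>"]
    target.append(ex.answer)
    return target


example = IPTExample(
    question="How many chairs are in the room across both views?",
    imagination_tokens=["v17", "v4"],
    answer="3",
)
print(training_target(example, use_ipt=True))
print(training_target(example, use_ipt=False))
```

The sketch only shows where the extra supervision signal sits in the target sequence. According to the abstract, the benefit can persist even when no image is generated at inference time.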

Abstract

Vision-language models (VLMs) excel at many tasks, yet continue to struggle with spatial reasoning: problems where the key information is not directly observable in the input. Many spatial questions require imaginative perception: simulating an unseen viewpoint, tracing a trajectory through an occluded space, or integrating partial views into a coherent spatial map. Humans naturally support this kind of reasoning through imagination. Prior work has introduced intermediate visual representations (e.g., visual thoughts, depth, or box tokens), but these intermediates often refine structure already visible rather than predicting the missing spatial structure implied by the evidence. We introduce Imaginative Perception Tokens (IPT), intermediate perceptual representations that externalize what a VLM would perceive under an alternative spatial configuration while remaining consistent with the observed input. To study this capability, we formulate three tasks that require imaginative perception: Perspective Taking (PET), Path Tracing (PT), and Multiview Counting (MVC). For each task, we construct datasets of ∼20K examples spanning simulated and real-world settings, paired with ground-truth intermediate imaginations, final answers, and curated evaluation benchmarks. Using the unified VLM BAGEL [12] as our backbone, IPT supervision improves spatial reasoning across several settings and often outperforms textual chain-of-thought training, even when no image is generated at inference time. For example, on MVC, IPT improves accuracy by 3.4% and achieves performance competitive with strong closed-source models on Path Tracing. We also find that mixed training with IPT and label-only data can further improve performance. In contrast, textual chain-of-thought can be detrimental on these tasks, substantially degrading performance in some cases, highlighting a modality mismatch when forcing spatial computation through language. Overall, IPT provides a principled supervision signal for
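The abstract mentions that mixing IPT-supervised and label-only data can further improve performance, but it does not give the recipe. The Python sketch below shows one plausible way to build such a mixture. The `mix_examples` function, the `ipt_fraction` parameter, and sampling with replacement are all assumptions made for illustration, and the ratio the authors used is not stated here.

```python
import random
from typing import List, Tuple


def mix_examples(
    ipt: List[dict],
    label_only: List[dict],
    ipt_fraction: float,
    size: int,
    seed: int = 0,
) -> List[Tuple[str, dict]]:
    """Sample a mixed training set (hypothetical recipe, not from the paper).

    Each draw picks an IPT-supervised example with probability
    `ipt_fraction`, otherwise a label-only example. Sampling is with
    replacement, and each item is tagged with the pool it came from.
    """
    rng = random.Random(seed)  # local generator keeps the mix reproducible
    mixed: List[Tuple[str, dict]] = []
    for _ in range(size):
        if rng.random() < ipt_fraction:
            mixed.append(("ipt", rng.choice(ipt)))
        else:
            mixed.append(("label_only", rng.choice(label_only)))
    return mixed


ipt_pool = [{"id": i, "task": "MVC"} for i in range(5)]
label_pool = [{"id": i, "task": "MVC"} for i in range(5, 10)]
batch = mix_examples(ipt_pool, label_pool, ipt_fraction=0.5, size=8)
print([kind for kind, _ in batch])
```

Tagging each item with its source pool lets a training loop choose the matching target format per example, which is one simple way to combine the two kinds of supervision in a single run.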

