Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation

Siyuan Liu; Jinyang Wu

Multimodal And Generative Systems (breakthrough tier)

Published: 2026-06-08
arXiv: 2606.09131

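Key Points

Problem/Background: This paper challenges a default architectural assumption in MLLMs: vision tokens and text tokens do not both need to pass through the full deep language-model stack. The authors observe that vision tokens tend to saturate in the middle layers (text-to-image attention falls from 0.68 at layer 0 to 0.07 by layer 4 and stabilizes near 0.04 after layer 18), while text tokens still need deep semantic processing.
Method/Mechanism: Dual-Path Vision Token Routing routes vision tokens into a lightweight side branch at the visual saturation point, lets text continue through the deep backbone, and applies late-layer fusion at the end. The core instantiation, DPVR-LF, uses only about 3% trainable parameters and reduces deep-layer visual computation while preserving performance on standard multimodal benchmarks.
Results/Evidence: The paper merits inclusion because it contributes modality-asymmetric routing as a reusable method primitive and uses layer-wise analysis to explain why late fusion is feasible. It has direct spillover value for multimodal model efficiency, architectural decoupling, and vision-token compute budgets.
Inclusion Value: It is not rated one tier higher because the evidence rests mainly on LLaVA-style models; whether it carries over to more natively multimodal architectures, video models, and large-scale training remains to be verified. The method and the diagnostic findings are nonetheless breakthrough-level.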
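The routing idea in the key points can be sketched in a few lines of plain Python. This is a minimal toy illustration, not the authors' implementation: `dual_path_forward`, `make_toy_layer`, and every parameter name are hypothetical, each "layer" is a simple function over lists of floats rather than a Transformer block, and the sketch assumes that text tokens do not see vision tokens at all inside the deep text-only stack (the summary does not say how attention is handled there).

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]
Layer = Callable[[List[Vector]], List[Vector]]


def make_toy_layer(scale: float) -> Layer:
    """Hypothetical stand-in for a Transformer block: adds a scaled sequence mean to each token."""
    def layer(tokens: List[Vector]) -> List[Vector]:
        if not tokens:
            return []
        dim = len(tokens[0])
        mean = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
        return [[t[d] + scale * mean[d] for d in range(dim)] for t in tokens]
    return layer


def dual_path_forward(
    tokens: List[Vector],
    is_vision: Sequence[bool],
    shared_layers: Sequence[Layer],
    side_branch: Layer,
    deep_text_layers: Sequence[Layer],
    fusion_layer: Layer,
) -> List[Vector]:
    """Toy dual-path routing: joint layers, then split by modality, then one late fusion."""
    if len(tokens) != len(is_vision):
        raise ValueError("tokens and is_vision must have the same length")

    # 1) Both modalities share the early layers, up to the assumed saturation point.
    for layer in shared_layers:
        tokens = layer(tokens)

    # 2) Split by modality, remembering each token's original position.
    vision_idx = [i for i, flag in enumerate(is_vision) if flag]
    text_idx = [i for i, flag in enumerate(is_vision) if not flag]

    # 3) Vision tokens take the lightweight side branch only.
    vision = side_branch([tokens[i] for i in vision_idx])

    # 4) Text tokens continue alone through the deep stack (image positions skipped).
    text = [tokens[i] for i in text_idx]
    for layer in deep_text_layers:
        text = layer(text)

    # 5) Restore the original token order and fuse both streams once at the end.
    merged: List[Optional[Vector]] = [None] * len(tokens)
    for i, tok in zip(vision_idx, vision):
        merged[i] = tok
    for i, tok in zip(text_idx, text):
        merged[i] = tok
    return fusion_layer([tok for tok in merged if tok is not None])


if __name__ == "__main__":
    toy_tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, 2.0]]
    toy_mask = [True, True, False, False]  # first two positions are image tokens
    output = dual_path_forward(
        toy_tokens,
        toy_mask,
        shared_layers=[make_toy_layer(0.1)] * 2,
        side_branch=make_toy_layer(0.1),
        deep_text_layers=[make_toy_layer(0.1)] * 3,
        fusion_layer=make_toy_layer(0.1),
    )
    print(output)
```

The point of the sketch is the control flow: the deep stack only ever sees the text subsequence, so its cost no longer scales with the number of image tokens, and the two streams meet again only in `fusion_layer`.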

Abstract

Multimodal large language models (MLLMs) commonly inherit the deep, symmetric Transformer backbone designed for unimodal text modeling, and apply the same computation uniformly to image and language tokens. This design overlooks a key modality asymmetry: image and text tokens differ substantially in information density, redundancy, and required reasoning depth. Through a layer-wise analysis of LLaVA-1.5, we observe that vision tokens tend to saturate in the middle layers. Specifically, text-to-image attention decreases from 0.68 at layer 0 to 0.07 by layer 4, and stabilizes near 0.04 after layer 18, whereas text tokens continue to benefit from deep semantic processing. These findings suggest a mismatch between architectural symmetry and depth-asynchronous modality evolution, resulting in redundant visual computation and possible drift in perceptual representations during deep task-specific adaptation. Motivated by this, we propose Dual-Path Vision Token Routing (DPVR), a modality-asymmetric routing framework for efficient MLLMs. Its core instantiation, DPVR-LF (Late-Layer Fusion), routes vision tokens at the saturation point into a one-layer trainable side branch, runs a thirteen-layer text-only forward that skips image positions in the deep stack, and re-fuses the visual and textual streams only at the final layer. With approximately 3% trainable parameters, DPVR-LF preserves competitive multimodal performance on standard benchmarks while reducing visual computation in the deep Transformer stack. The results challenge the conventional assumption that vision tokens must traverse all deep language-model layers, and indicate that a single late fusion layer can be sufficient for maintaining strong perceptual competence in LLaVA-style MLLMs.
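To get a feel for the layer budget the abstract describes, the following back-of-the-envelope sketch counts token-layer visits, that is, how many times any token passes through any layer. It is a crude proxy that ignores the quadratic cost of attention and is not a figure reported by the paper. The function name `token_layer_visits` and its parameters are hypothetical. The example values are assumptions: a 32-layer decoder and 576 image tokens (as in the 7B LLaVA-1.5 configuration), a 64-token text prompt, and a routing point after 18 shared layers, chosen because the abstract says attention stabilizes after layer 18. Under these assumptions 18 shared layers, 13 text-only layers, and 1 fusion layer add up to 32, which matches the thirteen-layer text-only forward in the abstract, although the paper's exact layer split is not given here.

```python
from typing import Dict


def token_layer_visits(
    n_vision: int,
    n_text: int,
    n_layers: int,
    route_at: int,
    side_layers: int = 1,
    fusion_layers: int = 1,
) -> Dict[str, int]:
    """Count token-layer visits for a full symmetric stack vs. a dual-path layout.

    route_at is the number of shared early layers that both modalities traverse.
    This is a rough compute proxy only; it ignores attention's quadratic term.
    """
    if route_at < 0 or route_at + fusion_layers > n_layers:
        raise ValueError("route_at plus fusion_layers must fit within n_layers")

    # Deep layers that only text tokens traverse.
    text_only_layers = n_layers - route_at - fusion_layers
    n_all = n_vision + n_text

    # Baseline: every token visits every layer.
    full = n_all * n_layers

    # Dual path: shared early layers, text-only deep layers,
    # a vision-only side branch, then the fusion layer(s) on all tokens.
    dual_path = (
        n_all * route_at
        + n_text * text_only_layers
        + n_vision * side_layers
        + n_all * fusion_layers
    )
    return {
        "full": full,
        "dual_path": dual_path,
        "text_only_layers": text_only_layers,
    }


if __name__ == "__main__":
    # Assumed values, see the lead-in: not taken from the paper's tables.
    counts = token_layer_visits(n_vision=576, n_text=64, n_layers=32, route_at=18)
    print(counts, counts["dual_path"] / counts["full"])
```

With these assumed values the dual-path layout makes about two thirds as many token-layer visits as the symmetric stack (13,568 versus 20,480). The saving grows with the share of image tokens in the sequence and with how early the routing point sits.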

