Dynin-Omni: Omnimodal Unified Large Diffusion Language Model

Multimodal foundation models | Breakthrough tier

Published: 2026-03-09
arXiv: 2604.00007

Curation notes

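Unified multimodal models usually get stuck between two directions: either autoregressive serialization, which squeezes every modality into a single token stream, or a compositional system in which the main model depends on external modality-specific decoders and orchestration. The real difficulty is supporting both understanding and generation of text, image, and speech within one shared architecture, while retaining capabilities such as video understanding, rather than merely stitching together several specialized modules.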

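Dynin-Omni proposes a masked-diffusion-based omnimodal foundation model that models text, image, speech, and video understanding in a single discrete token space. Instead of following conventional autoregressive unified modeling, it formulates multimodal modeling as masked diffusion over a shared discrete token space, and it is trained in multiple stages through model-merging-based modality expansion and omnimodal alignment. This design lets the model perform iterative refinement under bidirectional context rather than being constrained by a unidirectional token sequence.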
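To make the idea of iterative refinement over a shared discrete token space more concrete, here is a minimal toy sketch in Python. It is not the paper's implementation: the mask id, the per-modality vocabulary ranges, the `toy_denoiser` stand-in, and the "commit the most confident proposals first" schedule are all assumptions chosen for illustration. A real model would predict tokens from the full bidirectional context instead of sampling at random.

```python
import random

MASK = -1  # hypothetical mask token id

# Hypothetical shared vocabulary: each modality owns a disjoint id range.
VOCAB_RANGES = {"text": (0, 100), "image": (100, 200), "speech": (200, 300)}


def toy_denoiser(tokens, modalities, rng):
    """Stand-in for the model: propose (token, confidence) for each masked slot.

    A real denoiser would condition on every visible token, left and right.
    This toy version only respects the id range of each position's modality.
    """
    proposals = {}
    for i, tok in enumerate(tokens):
        if tok == MASK:
            lo, hi = VOCAB_RANGES[modalities[i]]
            proposals[i] = (rng.randrange(lo, hi), rng.random())
    return proposals


def iterative_unmask(tokens, modalities, steps, seed=0):
    """Fill masked positions over several rounds, most confident first."""
    rng = random.Random(seed)
    tokens = list(tokens)
    for step in range(steps):
        proposals = toy_denoiser(tokens, modalities, rng)
        if not proposals:
            break
        # Commit an even share of the remaining masks each round,
        # and everything that is left on the final round.
        remaining_steps = steps - step
        k = max(1, -(-len(proposals) // remaining_steps))  # ceiling division
        ranked = sorted(proposals.items(), key=lambda kv: kv[1][1], reverse=True)
        for pos, (tok, _conf) in ranked[:k]:
            tokens[pos] = tok
    return tokens


# Usage: a text prompt with one gap, followed by three masked image tokens.
modalities = ["text"] * 3 + ["image"] * 3
prompt = [5, MASK, 7, MASK, MASK, MASK]
filled = iterative_unmask(prompt, modalities, steps=3)
print(filled)
```

Unlike left-to-right decoding, the order in which positions are committed here depends on confidence rather than position, so the text gap at index 1 may be filled before or after the image tokens.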

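This work merits inclusion because it pushes unified multimodal pretraining from "shared backbone plus bolted-on specialized heads" toward a more thorough any-to-any unified diffusion modeling. For cross-modal generation, retrieval, real-time interactive systems, and embodied multimodal agents, what it offers is a more durable unified-interface perspective, not a local gain on a single-modality metric.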

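It was not placed in a higher tier because the current evidence still comes mainly from one recent arXiv paper and comparisons against open-source unified models, which falls a step short of showing that masked diffusion will become the mainstream paradigm for omnimodal foundation models. It is a strong unified-modeling work, but the scope of its impact has not yet stabilized.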

Links

Paper link