Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis

Multimodal foundation models | Breakthrough tier

Published: 2026-03-30
arXiv: 2603.29620

Why it was included

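Unified multimodal models can already produce high-quality images, but on long-tail, knowledge-intensive generation tasks with strong cultural and factual requirements they tend to get stuck on stale or missing knowledge in their frozen parameters. Ordinary world knowledge prompting is often not enough, because the issue is not only whether the model remembers a fact: the generation process itself lacks explicit external grounding and evidence integration. Unify-Agent targets exactly this gap.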

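The paper recasts world-grounded image synthesis as an agentic pipeline: prompt understanding first, then multimodal evidence searching, then grounded recaptioning, and finally synthesis. To train this pipeline, the authors build a dedicated data pipeline and 143K high-quality agent trajectories that supervise the full agentic generation process. They also propose the FactIP benchmark, which covers 12 categories of cultural and long-tail factual concepts and explicitly requires external knowledge grounding.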
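
As a rough illustration of the four-stage flow described above, the following minimal Python sketch chains understanding, searching, recaptioning, and synthesis, and records the intermediate results in a trajectory object. It is not the authors' implementation: every name here (`Evidence`, `Trajectory`, `run_pipeline`, `TOY_KB`, and the `toy_*` stage functions) is hypothetical, and the stages are trivial stand-ins for what would be model calls and real retrieval.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Evidence:
    """One retrieved item (hypothetical schema)."""
    source: str  # identifier of where the evidence came from
    text: str    # textual description of the evidence


@dataclass
class Trajectory:
    """Record of one pass through the four stages (hypothetical schema)."""
    prompt: str
    entities: List[str] = field(default_factory=list)
    evidence: List[Evidence] = field(default_factory=list)
    grounded_caption: str = ""
    image: str = ""  # string stand-in for a synthesized image


def run_pipeline(
    prompt: str,
    understand: Callable[[str], List[str]],
    search: Callable[[List[str]], List[Evidence]],
    recaption: Callable[[str, List[Evidence]], str],
    synthesize: Callable[[str], str],
) -> Trajectory:
    """Run the stages in order, keeping every intermediate result."""
    traj = Trajectory(prompt=prompt)
    traj.entities = understand(prompt)                        # 1. prompt understanding
    traj.evidence = search(traj.entities)                     # 2. evidence searching
    traj.grounded_caption = recaption(prompt, traj.evidence)  # 3. grounded recaptioning
    traj.image = synthesize(traj.grounded_caption)            # 4. synthesis
    return traj


# Toy stand-ins so the sketch runs end to end; a real system would call
# a search backend and a generative model here (assumption).
TOY_KB = {"qipao": "a fitted Chinese dress with a high collar and side slits"}


def toy_understand(prompt: str) -> List[str]:
    # Treat any knowledge-base key that appears in the prompt as an entity.
    return [name for name in TOY_KB if name in prompt.lower()]


def toy_search(entities: List[str]) -> List[Evidence]:
    return [Evidence(source=f"toy-kb:{e}", text=TOY_KB[e]) for e in entities]


def toy_recaption(prompt: str, evidence: List[Evidence]) -> str:
    # Append retrieved facts to the prompt; fall back to the raw prompt.
    facts = "; ".join(ev.text for ev in evidence)
    return f"{prompt} ({facts})" if facts else prompt


def toy_synthesize(caption: str) -> str:
    return f"<image for: {caption}>"


traj = run_pipeline(
    "a woman wearing a qipao",
    toy_understand, toy_search, toy_recaption, toy_synthesize,
)
print(traj.grounded_caption)
```

Keeping each stage as a separate callable and storing its output on the trajectory is one way such a run could be logged for supervision; whether the paper's 143K trajectories use a comparable structure is not stated in this summary.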

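This work merits inclusion because it does more than bolt a search engine onto image generation: it tightly couples reasoning, searching, and generation into a single agentic modeling process. For multimodal agents, open-world generation, and grounded image synthesis, this is a direction with lasting methodological value rather than just a score-boosting trick.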

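It was not placed in a higher tier because it is still mainly an early exploration of this direction. Although the pipeline and benchmark are both thorough, whether it becomes the mainstream blueprint for world-grounded multimodal generation still needs more follow-up validation and external adoption.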

Links

Paper link