LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

Multimodal foundation models · Breakthrough-level

Why it is included

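Many VLM grounding and detection methods serialize each 2D box into several coordinate tokens, which both breaks the geometric coupling among a box's coordinates and imposes a strictly sequential decoding bottleneck.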
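To make the serialization bottleneck concrete, here is a minimal sketch of coordinate tokenization under assumed conventions: each coordinate is quantized independently into one of `num_bins` integer tokens, so a box costs four autoregressive steps. The function names, the `(x1, y1, x2, y2)` layout, and the bin count are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2) in pixels


def box_to_tokens(box: Box, width: int, height: int, num_bins: int = 1000) -> List[int]:
    """Quantize each coordinate independently into an integer token (hypothetical scheme)."""
    sizes = (width, height, width, height)
    tokens = []
    for value, size in zip(box, sizes):
        frac = min(max(value / size, 0.0), 1.0)        # normalize and clamp to [0, 1]
        tokens.append(min(int(frac * num_bins), num_bins - 1))
    return tokens


def tokens_to_box(tokens: List[int], width: int, height: int, num_bins: int = 1000) -> Box:
    """Map tokens back to pixel coordinates using bin centers."""
    sizes = (width, height, width, height)
    x1, y1, x2, y2 = ((t + 0.5) / num_bins * s for t, s in zip(tokens, sizes))
    return (x1, y1, x2, y2)


def sequential_decode_steps(num_boxes: int, tokens_per_box: int = 4) -> int:
    """Autoregressive decoding emits one token per step, so steps grow as 4 * num_boxes."""
    return num_boxes * tokens_per_box


if __name__ == "__main__":
    toks = box_to_tokens((128, 64, 512, 256), width=1024, height=512, num_bins=1024)
    print(toks)
    print(sequential_decode_steps(10))
```

Because each token is predicted on its own, nothing in this scheme forces `x2 >= x1` or `y2 >= y1`; any such constraint has to be learned implicitly across four separate steps.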

LocateAnything proposes Parallel Box Decoding (PBD), which decodes bounding boxes and points as atomic geometric units in a single step, improving geometric consistency and inference parallelism at the same time.
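The idea of emitting a box as one unit can be illustrated with a toy sketch. This is not the paper's architecture: the linear head, the `(cx, cy, w, h)` parameterization, and every name below are assumptions chosen only to show how a single decoding step can produce all four coordinates jointly and keep them geometrically consistent.

```python
import math
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # normalized (x1, y1, x2, y2)


def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def decode_box_atomically(
    hidden: Sequence[float],
    weights: Sequence[Sequence[float]],  # hypothetical 4 x d projection
    bias: Sequence[float],               # hypothetical length-4 bias
) -> Box:
    """Produce a whole box from one hidden state in one step (illustrative head)."""
    raw = [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(weights, bias)]
    cx, cy, bw, bh = (_sigmoid(v) for v in raw)   # center and size, each in (0, 1)
    # Corners are derived from a shared center and size, so x1 <= x2 and y1 <= y2 always hold.
    x1, y1 = max(cx - bw / 2, 0.0), max(cy - bh / 2, 0.0)
    x2, y2 = min(cx + bw / 2, 1.0), min(cy + bh / 2, 1.0)
    return (x1, y1, x2, y2)


def atomic_decode_steps(num_boxes: int) -> int:
    """Under this sketch, each box costs one step instead of four."""
    return num_boxes


if __name__ == "__main__":
    zero_w = [[0.0, 0.0] for _ in range(4)]
    print(decode_box_atomically([0.0, 0.0], zero_w, [0.0] * 4))
    print(atomic_decode_steps(10))
```

Compared with per-coordinate tokens, the design choice here is that validity of the box is guaranteed by construction rather than learned, and the step count no longer scales with the number of coordinates per box.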

The paper also builds LocateAnything-Data, which contains more than 138 million training samples, to increase data diversity for high-precision localization.

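It merits inclusion because precise grounding is a foundational interface for multimodal extraction, GUI/robot perception, and visual agents; PBD is a simple but reusable decoding primitive.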