Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM

Multimodal foundation models · Breakthrough

Published: 2026-05-08
arXiv: 2605.05927

Editorial summary

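This paper reframes the modality gap in Speech LLMs. The problem is not only an output-side one of making speech generation look more like text generation; the remaining bottleneck comes mainly from the input side, where the speech representation fed to the LLM is not sufficiently TLM-compatible, that is, not close enough to what a text LLM expects.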

The authors propose TextPro-SLM, which recasts a Speech LLM as a prosody-aware text LLM. Its core component, WhisperPro, outputs synchronized text tokens and prosody embeddings, explicitly separating what is said from how it is said.
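
To make the interface concrete, here is a minimal sketch of how synchronized text tokens and prosody embeddings might be combined before reaching the LLM. The summary does not say how TextPro-SLM fuses the two streams, so the additive fusion below is an assumption, and `fuse_inputs` and `embedding_table` are hypothetical names chosen for illustration.

```python
from typing import Dict, List


def fuse_inputs(
    token_ids: List[int],
    prosody: List[List[float]],
    embedding_table: Dict[int, List[float]],
) -> List[List[float]]:
    """Combine aligned text tokens and prosody vectors into LLM inputs.

    Assumption: fusion is element-wise addition. The actual TextPro-SLM
    fusion mechanism is not described in this summary.
    """
    if len(token_ids) != len(prosody):
        raise ValueError("text tokens and prosody embeddings must align one-to-one")

    fused = []
    for token_id, prosody_vec in zip(token_ids, prosody):
        text_vec = embedding_table[token_id]  # "what is said"
        if len(text_vec) != len(prosody_vec):
            raise ValueError("prosody vector must match the text embedding width")
        # "how it is said" is layered on top of the text embedding
        fused.append([t + p for t, p in zip(text_vec, prosody_vec)])
    return fused
```

Under this assumed design, an all-zero prosody vector leaves the text embedding unchanged, so the LLM sees exactly the input a text-only model would see. That is one way to read "TLM-compatible", though the paper may realize it differently.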

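The LLM backbone retains the original text LLM's semantic capabilities through knowledge distillation while learning paralinguistic understanding such as emotion, speaking style, and speaker timbre. Experiments show a lower modality gap at the 3B and 7B scales, with only about 1,000 hours of audio needed for LLM training.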
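
As a rough illustration of the distillation idea, the sketch below computes a standard KL-divergence loss between a teacher's and a student's next-token distributions. This is the generic textbook formulation, not necessarily the objective used in the paper; the function names and the `temperature` parameter are assumptions made for the example.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    teacher_logits: List[float],
    student_logits: List[float],
    temperature: float = 1.0,
) -> float:
    """KL(teacher || student) for a single next-token distribution.

    Assumption: the teacher is the original text LLM and the student is
    the speech-adapted backbone; the paper's exact loss may differ.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(
        p_i * (math.log(p_i) - math.log(q_i))
        for p_i, q_i in zip(p, q)
        if p_i > 0.0
    )
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, which is the sense in which distillation can keep the backbone's semantic behavior close to that of the original text LLM.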

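It merits formal inclusion because it gives speech-language multimodal foundation models a clear input-side interface design: keep speech input as close as possible to the form a text LLM works with, while preserving prosodic and paralinguistic information.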
