A foundation model of vision, audition, and language for in-silico neuroscience

Neuroscience and cognitive science · Breakthrough tier · Explainer video available

Published: 2026-03-26

Inclusion rationale

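If neuroscience is to connect with foundation models in earnest, the key is not just a higher-scoring encoding model but a single, generalizable model interface that unifies large-scale prediction of brain responses across visual, auditory, and language stimuli. TRIBE v2 is positioned as exactly that: a new trimodal brain encoder that models the human brain's response to almost any sight or sound and is meant to extrapolate zero-shot.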
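For context, the conventional kind of encoding model that this entry contrasts with TRIBE v2 can be sketched as a ridge regression from stimulus features to per-voxel responses, scored by held-out correlation. This is a minimal illustration on synthetic data, not the paper's method: the array sizes, noise level, `alpha` value, and the helper names `fit_ridge` and `voxelwise_pearson` are all assumptions made for the example, and real fMRI pipelines would also handle hemodynamic delay and cross-validated regularization.

```python
import numpy as np


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ Y)


def voxelwise_pearson(Y_true, Y_pred):
    """Pearson correlation between measured and predicted responses, per voxel."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    return (yt * yp).sum(axis=0) / denom


# Synthetic stand-in data (hypothetical sizes, not taken from the paper).
rng = np.random.default_rng(0)
n_train, n_test, n_features, n_voxels = 400, 100, 32, 50
W_true = rng.normal(size=(n_features, n_voxels))
X_train = rng.normal(size=(n_train, n_features))  # stimulus features
X_test = rng.normal(size=(n_test, n_features))
Y_train = X_train @ W_true + rng.normal(scale=2.0, size=(n_train, n_voxels))
Y_test = X_test @ W_true + rng.normal(scale=2.0, size=(n_test, n_voxels))

# Fit on training stimuli, then score predictions on held-out stimuli.
W_hat = fit_ridge(X_train, Y_train, alpha=10.0)
scores = voxelwise_pearson(Y_test, X_test @ W_hat)
print(f"mean held-out correlation: {scores.mean():.3f}")
```

A model of this form is fit separately for each subject and feature space, which is the fragmentation a single trimodal encoder is meant to avoid.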

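The core novelty is that vision, audition, and language are trained jointly within one foundation model, which learns transferable representations of neural responses from more than 500 hours of fMRI data from over 700 subjects (the abstract below reports over 1,000 hours across 720 subjects). The paper emphasizes that the model not only makes zero-shot predictions for new subjects, new languages, and new tasks, but also extracts interpretable latent features that reveal the fine-grained topography of multisensory integration, thereby tying predictive performance to explanations of neural mechanism.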

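It merits formal inclusion because this is not an ordinary gain on a brain-encoding benchmark; it brings foundation-model methodology directly into in-silico neuroscience within NeuroAI. It has clear spillover value for digital twins of brain responses, for cross-modal modeling of neural representations, and for cognitive-neuroscience workflows that use AI to unify different sensory channels, and it meets this repository's high bar for neuroscience entries.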

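For now it remains at the breakthrough tier rather than a higher one, because the main evidence still rests largely on Meta's official technical report and the data assets Meta assembled. The direction is strong and the scale is large enough, but whether it becomes a durable reference for the broader NeuroAI community depends on external replication, downstream adoption, and substantive feedback into brain-inspired AI.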

Original abstract

Cognitive neuroscience is fragmented into specialized models, each tailored to specific experimental paradigms, hence preventing a unified model of cognition in the human brain. Here, we introduce TRIBE v2, a tri-modal (video, audio and language) foundation model capable of predicting human brain activity in a variety of naturalistic and experimental conditions. Leveraging a unified dataset of over 1,000 hours of fMRI across 720 subjects, we demonstrate that our model accurately predicts high-resolution brain responses for novel stimuli, tasks and subjects, superseding traditional linear encoding models, delivering several-fold improvements in accuracy. Critically, TRIBE v2 enables in silico experimentation: tested on seminal visual and neuro-linguistic paradigms, it recovers a variety of results established by decades of empirical research. Finally, by extracting interpretable latent features, TRIBE v2 reveals the fine-grained topography of multisensory integration. These results establish artificial intelligence as a unifying framework for exploring the functional organization of the human brain.

Date: March 25, 2026
Correspondence: sdascoli@meta.com and jeanremi@meta.com
Code: https://github.com/facebookresearch/tribev2
Weights: https://huggingface.co/facebook/tribev2
Demo: https://aidemos.atmeta.com/tribev2

Explainer video

Video page · Bilibili · YouTube

Links

Paper link