CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists

Junlin Yang; Dylan Zhang; Xiangchen Song; Qirun Dai; Xiao Liu; Yuen Chen; Aniket Vashishtha; Jing Shi; Chenhao Tan; Hao Peng

Scientific discovery · Flagship work · Breakthrough-level

Published: 2026-05-28
arXiv: 2605.26029

Why it was included

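Evaluating an AI scientist cannot stop at whether the final answer is correct; it must also check whether the model reached that answer through a real, interpretable mechanism.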

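CausaLab places the agent in a synthetic laboratory: given observational records, the agent may intervene on a manipulator crystal and must then predict the resonance frequency of a reactor crystal, while also recovering the graph structure and equations of the hidden structural causal model.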
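To make the setup concrete, here is a minimal sketch of a randomly sampled structural causal model that supports both observation and intervention. It is an illustration only, not CausaLab's code: the linear-Gaussian form, the coefficient range, and the names `sample_scm` and `simulate` are all assumptions, and the benchmark's actual structural equations may differ.

```python
import random


def sample_scm(n_nodes, edge_prob, rng):
    """Sample a random linear SCM over nodes 0..n_nodes-1 (hypothetical helper).

    Nodes are indexed in topological order, so an edge i -> j requires i < j.
    Returns a dict mapping each node j to {parent i: linear coefficient}.
    """
    weights = {j: {} for j in range(n_nodes)}
    for j in range(n_nodes):
        for i in range(j):
            if rng.random() < edge_prob:
                weights[j][i] = rng.uniform(-2.0, 2.0)  # assumed coefficient range
    return weights


def simulate(weights, rng, interventions=None, noise_std=0.1):
    """Draw one sample; `interventions` maps node -> forced value (a do-operation)."""
    interventions = interventions or {}
    values = {}
    for j in sorted(weights):
        if j in interventions:
            # An intervention ignores the node's parents and fixes its value.
            values[j] = interventions[j]
        else:
            parent_term = sum(w * values[i] for i, w in weights[j].items())
            values[j] = parent_term + rng.gauss(0.0, noise_std)
    return values


if __name__ == "__main__":
    rng = random.Random(0)
    scm = sample_scm(n_nodes=6, edge_prob=0.5, rng=rng)
    observation = simulate(scm, rng)                      # passive record
    experiment = simulate(scm, rng, interventions={0: 1.0})  # do(node 0 = 1.0)
    print(observation)
    print(experiment)
```

Because the graph and coefficients are drawn at random, an agent cannot recall them from prior knowledge; it has to infer them from the observational and interventional samples.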

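Experiments show that strong models can reach high prediction accuracy while mechanism recovery remains clearly inadequate; mixed observation-intervention strategies improve structural fidelity, and premature stopping is one of the main failure modes.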
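The abstract quantifies mechanism recovery with an all-edge F1 score. The sketch below shows one common way to compute an F1 over directed edges, treating an edge with the wrong direction as a miss. The function name `edge_f1` and this exact definition are assumptions; the paper's all-edge F1 may be defined differently.

```python
def edge_f1(true_edges, predicted_edges):
    """F1 between two sets of directed edges given as (parent, child) pairs.

    Hypothetical helper: a reversed edge counts as both a false positive
    and a false negative.
    """
    true_set = set(true_edges)
    pred_set = set(predicted_edges)
    true_positives = len(true_set & pred_set)
    if true_positives == 0:
        return 0.0  # also covers empty inputs
    precision = true_positives / len(pred_set)
    recall = true_positives / len(true_set)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    truth = [(0, 1), (1, 2)]
    guess = [(0, 1), (2, 1)]  # second edge has the wrong direction
    print(edge_f1(truth, guess))
```

A score like this is computed on the recovered graph only, so an agent can predict the target accurately and still score poorly here, which is the gap the experiments highlight.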

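It merits inclusion because it moves the evaluation of scientific-discovery agents from answering questions to interactive causal experimentation and mechanism recovery, which are the core capabilities that autonomous scientific systems need.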

Abstract

We introduce CausaLab, a scalable environment for evaluating interactive causal discovery by LLM agents. Unlike prior evaluations, CausaLab evaluates both whether an agent can solve a problem using causal evidence and whether its answer is grounded in a faithfully recovered causal mechanism. Each episode places an agent in a synthetic laboratory: it receives prior measurement records, intervenes on a manipulator crystal, and predicts the resonance frequency of a held-out reactor crystal governed by the same mechanism. The hidden data-generating process is a randomly sampled structural causal model (SCM), so success requires recovering both a causal graph and structural equations rather than recalling prior knowledge. Experiments show a persistent gap between prediction and mechanism recovery: in the purely observational 6-node setting, GPT-5.2-high reaches 92% task accuracy but only 0.471 all-edge F1. Mixed observation–intervention strategies improve structural fidelity, while pure intervention remains difficult even for strong agents. We identify premature stopping as a major weakness and show that consistency verification mitigates it. CausaLab therefore separates predictive success from causal understanding and exposes current LLM agents’ limits as experimental causal reasoners. Code: https://github.com/DylanZSZ/CausaLab

* Junlin Yang and Dylan Zhang contributed equally and both serve as project leads. Junlin Yang’s work was done at the

Links

Paper link