Molecular deep learning at the edge of chemical space

Derek van Tilborg; Luke Rossen; Francesca Grisoni

doi:10.1038/s42256-026-01216-w

Chemistry, Biology, and Automated Labs | Breakthrough tier

Published: 2026-04-22

Curation notes

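This paper tackles a long-standing but often underestimated problem in molecular machine learning: models degrade quickly beyond the edge of their training distribution, yet many studies report only average test-set performance and almost never characterize explicitly how far from the training chemical space a prediction can still be trusted. The authors' focus is not yet another predictor but a more usable estimator of chemical-space generalization.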

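Methodologically, they jointly model property prediction and molecular reconstruction and propose a reconstruction-based unfamiliarity metric, which estimates how unfamiliar a sample is relative to the training distribution and how reliable the model is at that point. It does more than detect out-of-distribution (OOD) molecules: across more than 30 bioactivity datasets it also acts as a stable predictor of classifier performance.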
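
To make the idea of a reconstruction-based score concrete, here is a minimal sketch. It is not the authors' model: in place of a learned encoder-decoder it uses a smoothed character-bigram model fitted on training SMILES, and it treats the mean per-character negative log-likelihood as a crude stand-in for reconstruction loss. The function names, the smoothing constant `alpha`, and the toy training strings are all assumptions made for illustration. Character-level tokenization is a further simplification, since it splits two-letter elements such as Cl and Br.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"  # sentinel tokens, hypothetical choice


def fit_bigram(train_smiles):
    """Count character bigrams over training SMILES (toy stand-in for a decoder)."""
    pair_counts = Counter()
    context_counts = Counter()
    vocab = {EOS}
    for s in train_smiles:
        seq = [BOS] + list(s) + [EOS]
        vocab.update(s)
        for a, b in zip(seq, seq[1:]):
            pair_counts[(a, b)] += 1
            context_counts[a] += 1
    return pair_counts, context_counts, vocab


def unfamiliarity(smiles, model, alpha=1.0):
    """Mean per-character negative log-likelihood under the bigram model.

    Higher values mean the string looks less like the training data.
    """
    pair_counts, context_counts, vocab = model
    seq = [BOS] + list(smiles) + [EOS]
    v = len(vocab) + 1  # one extra bucket so unseen characters get nonzero mass
    nll = 0.0
    for a, b in zip(seq, seq[1:]):
        p = (pair_counts[(a, b)] + alpha) / (context_counts[a] + alpha * v)
        nll -= math.log(p)
    return nll / (len(seq) - 1)


# Toy training set of small aliphatic molecules (illustrative only).
train = ["CCO", "CCCO", "CCN", "CCCN", "CC(=O)O"]
model = fit_bigram(train)

seen_like = unfamiliarity("CCCCO", model)        # close to the training strings
far = unfamiliarity("c1ccc2[nH]ccc2c1", model)   # aromatic, unseen characters
print(f"near: {seen_like:.2f}  far: {far:.2f}")
```

In this toy setting the aromatic string scores higher than the aliphatic one because none of its character transitions were seen during fitting. A learned reconstruction model would replace the bigram counts, but the scoring step (average token-level loss per molecule) would have the same shape.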

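It merits formal inclusion because unfamiliarity is a highly reusable methodological primitive, relevant to virtual screening, active learning, molecular library prioritization, and risk control ahead of wet-lab work. More importantly, the authors validated it experimentally on two kinases, showing that the metric's generalization is not just on paper and that it can genuinely help find new, structurally more distant active molecules.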
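
One way such a score could feed library prioritization is as a gate alongside the classifier's output. The sketch below is a hypothetical triage rule, not the selection procedure used in the paper: the `Candidate` fields, the thresholds `p_min` and `u_max`, and the example values are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str             # placeholder identifier
    p_active: float       # classifier's predicted probability of activity
    unfamiliarity: float  # higher = further from the training data


def triage(candidates, p_min=0.5, u_max=2.0):
    """Split predicted actives into trusted hits and hits needing extra scrutiny.

    Both thresholds are hypothetical and would need tuning per dataset.
    """
    hits = [c for c in candidates if c.p_active >= p_min]
    trusted = [c for c in hits if c.unfamiliarity <= u_max]
    flagged = [c for c in hits if c.unfamiliarity > u_max]
    trusted.sort(key=lambda c: c.p_active, reverse=True)  # most confident first
    flagged.sort(key=lambda c: c.unfamiliarity)           # least unfamiliar first
    return trusted, flagged


# Made-up scores for four placeholder molecules.
library = [
    Candidate("mol_A", 0.9, 1.1),
    Candidate("mol_B", 0.8, 3.0),
    Candidate("mol_C", 0.2, 0.9),
    Candidate("mol_D", 0.7, 1.5),
]
trusted, flagged = triage(library)
print([c.name for c in trusted], [c.name for c in flagged])
```

Keeping the flagged group rather than discarding it reflects the trade-off the commentary describes: molecules far from the training data carry more prediction risk, but they are also where structurally novel actives would be found.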

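It was not assigned a higher tier because the contribution remains focused on generalization diagnostics and screening workflows within molecular ML; strong as it is, it does not yet rise to the level of restructuring drug-discovery infrastructure as a whole.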

Abstract

Molecular machine learning models often fail to generalize beyond the chemical space of their training data, limiting their ability to reliably perform predictions on structurally novel bioactive molecules. Here, to advance the ability of machine learning to go beyond the ‘edge’ of their training chemical space, we introduce a joint modelling approach that combines molecular property prediction with molecular reconstruction. This approach allows the introduction of unfamiliarity, a reconstruction-based metric that enables the estimation of model generalizability. Via a systematic analysis spanning more than 30 bioactivity datasets, we demonstrate that unfamiliarity not only effectively identifies out-of-distribution molecules but also serves as a reliable predictor of classifier performance. Even when faced with the presence of strong distribution shifts on large-scale molecular libraries, unfamiliarity yields robust and meaningful molecular insights that go unnoticed by traditional methods. Finally, we experimentally validate unfamiliarity-based molecule screening in the wet lab for two clinically relevant kinases, discovering seven compounds with low micromolar potency and limited similarity to training molecules. This demonstrates that unfamiliarity can extend the reach of machine learning beyond the edge of the charted chemical space, advancing the discovery of diverse and structurally novel molecules.
