Evaluating large language model agents for automation of atomic force microscopy

Indrajeet Mandal; Jitendra Soni; Mohd Zaki; Morten M. Smedskjaer; Katrin Wondraczek; Lothar Wondraczek; Nitya Nand Gosvami; N. M. Anoop Krishnan

doi:10.1038/s41467-025-64105-7

Industrial Processes and Manufacturing | Breakthrough tier

Published: 2025-10-14

Inclusion Commentary

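This paper targets the genuinely hard layer of self-driving laboratories: many experimental automation systems depend on rigid protocols and hand-designed workflows, which makes it hard for them to capture the judgment and adaptability that experts show in dynamic experimental settings. The authors treat atomic force microscopy (AFM) as a high-precision experimental workflow and test specifically whether LLM agents can carry the complete scientific process from experimental design to results analysis, rather than only answer materials science questions.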

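The paper proposes the Artificially Intelligent Lab Assistant (AILA) framework and releases AFMBench alongside it, an evaluation suite that examines the experimental ability of LLM agents across experimental design, calibration, feature detection, and results analysis. The key result in the abstract is not that "some model achieved automated AFM" but the following: the strongest current models still fail noticeably on basic tasks and coordination scenarios; materials science question-answering ability does not equate to experimental ability; and agents show an instruction deviation that the authors call sleepwalking, which indicates that agentic lab automation has real safety and alignment problems. Multi-agent setups outperform single-agent ones, but remain sensitive to prompt formatting.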
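To make the distinction between task success and instruction deviation concrete, here is a minimal sketch of how a benchmark harness might score both. It is not AFMBench's actual scoring code: the `TaskResult` record, the action labels, and the two metrics are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """One benchmark task: what was permitted and what the agent did (hypothetical schema)."""
    task_id: str
    allowed_actions: set[str]    # actions the instruction permits (illustrative labels)
    executed_actions: list[str]  # actions the agent actually performed, in order
    succeeded: bool              # whether the task's stated goal was met


def unrequested_actions(result: TaskResult) -> list[str]:
    """Return the executed actions that the instruction did not permit."""
    return [a for a in result.executed_actions if a not in result.allowed_actions]


def summarize(results: list[TaskResult]) -> dict[str, float]:
    """Compute the success rate and the fraction of tasks with any unrequested action."""
    n = len(results)
    if n == 0:
        return {"success_rate": 0.0, "deviation_rate": 0.0}
    successes = sum(1 for r in results if r.succeeded)
    deviations = sum(1 for r in results if unrequested_actions(r))
    return {"success_rate": successes / n, "deviation_rate": deviations / n}


if __name__ == "__main__":
    results = [
        # Agent did exactly what was asked.
        TaskResult("t1", {"set_scan_size", "capture_image"},
                   ["set_scan_size", "capture_image"], True),
        # Agent reached the goal but also moved the tip without being asked.
        TaskResult("t2", {"capture_image"},
                   ["move_tip", "capture_image"], True),
    ]
    print(summarize(results))  # {'success_rate': 1.0, 'deviation_rate': 0.5}
```

In this sketch a task can count as successful and still be flagged for deviation, which is why the two rates are tracked separately.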

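This work merits formal inclusion because it moves scientific instrumentation automation from the demo stage to a more mature stage of benchmark + safety + capability boundary. For this repository, it belongs to AI for science and is also a key reference for industrial and experimental process automation: what matters is not AFM as a single instrument, but the demonstration that agentic lab systems need stricter systematic evaluation and safety analysis before they enter a real experimental loop.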

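It does not yet reach the next tier up, because the work still focuses on AFM and related materials experiment workflows and remains some distance from a broader default benchmark for autonomous instrumentation. It is well worth including, but it has not completely reordered the self-driving lab field.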

Abstract


Large language models (LLMs) are transforming laboratory automation by enabling self-driving laboratories (SDLs) that could accelerate materials research. However, current SDL implementations rely on rigid protocols that fail to capture the adaptability and intuition of expert scientists in dynamic experimental settings. Here, we show that LLM agents can automate atomic force microscopy (AFM) through our Artificially Intelligent Lab Assistant (AILA) framework. Further, we develop AFMBench—a comprehensive evaluation suite challenging LLM agents across the complete scientific workflow from experimental design to results analysis. We find that state-of-the-art LLMs struggle with basic tasks and coordination scenarios. Notably, models excelling at materials science question-answering perform poorly in laboratory settings, showing that domain knowledge does not translate to experimental capabilities. Additionally, we observe that LLM agents can deviate from instructions, a phenomenon referred to as sleepwalking, raising safety alignment concerns for SDL applications. Our ablations reveal that multi-agent frameworks significantly outperform single-agent approaches, though both remain sensitive to minor changes in instruction formatting or prompting. Finally, we evaluate AILA’s effectiveness in increasingly advanced experiments—AFM calibration, feature detection, mechanical property measurement, graphene layer counting, and indenter detection. These findings establish the necessity for benchmarking and robust safety protocols before deploying LLM agents as autonomous laboratory assistants across scientific disciplines.
