LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments

Chiyu Zhang; Huiqin Yang; Bendong Jiang; Xiaolei Zhang; Yiran Zhao; Ruyi Chen; Lu Zhou; Xiaogang Xu; Jiafei Wu; Liming Fang; Zhe Liu

智能体与自主科学突破级有讲解视频

策展与解读：DAST AI · 收录方法与内容透明度

发表时间: 2026-05-11
arXiv: 2605.10779

收录解读

LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments 关注的是一个可复用的 AI 系统或评测问题，而不是单点 demo。

OS-level benchmark for behavior jailbreaks in autonomous agents.

It evaluates physical/action-layer harm with rollback and dual semantic-physical verification, a strong reusable safety benchmark.

它没有更高，是因为这些新 arXiv 工作仍需要更多独立复现、真实系统部署和长期社区采用来确认影响。

原始摘要与中文对照

中文对照翻译

LLM驱动的自主代理在真实操作系统环境中的迅速普及，引入了一种超越传统内容安全的新型安全风险：行为越狱，即攻击者诱导代理执行具有不可逆物理后果的危险OS级操作。现有基准要么仅在语义输出层评估安全性，忽略了物理层危害，要么未能隔离测试用例，导致早期运行污染后期运行。我们提出了LITMUS（LLM代理OS内测试以衡量不安全颠覆），一个通过语义-物理双重验证机制和OS级状态回滚设计来解决这两个缺陷的基准。LITMUS包含一个由819个高风险测试用例组成的数据集，这些用例被组织成一个有害种子子集和六个攻击扩展子集，涵盖三种对抗范式（越狱话语、技能注入和实体封装），以及一个全自动多代理评估框架，该框架独立地在对话和OS级物理层判断代理行为。对多个前沿代理的评估揭示了三个一致的发现：（1）当前代理在真实OS环境中缺乏对危险指令的有效安全意识，即使是强大的模型（例如Claude Sonnet 4.6）仍执行40.64%的高风险操作；（2）代理表现出普遍的执行幻觉（EH），即口头拒绝请求，但危险操作已在系统层面完成，这种现象在所有先前的仅语义评估框架中是不可见的；（3）我们设计的技能注入和实体封装攻击取得了高成功率，暴露了代理对恶意技能干扰和指令混淆的显著漏洞。LITMUS为LLM代理在真实OS环境中可复现、物理基础的行为安全评估提供了第一个标准化平台。

原始摘要

The rapid proliferation of LLM-based autonomous agents in real operating system environments introduces a qualitatively new category of safety risk beyond traditional content safety: behavior jailbreak, where an adversary induces an agent to execute dangerous OS-level operations with irreversible physical consequences. Existing benchmarks either evaluate safety at the semantic output layer alone, missing physical-layer harms, or fail to isolate test cases, letting earlier runs contaminate later ones. We present LITMUS (LLM-agents In-OS Testing for Measuring Unsafe Subversion), a benchmark that addresses both gaps through a semantic–physical dual verification mechanism and an OS-level state rollback design. LITMUS comprises a dataset of 819 high-risk test cases organized into one harmful seed subset and six attack-extended subsets covering three adversarial paradigms (jailbreak speaking, skill injection, and entity wrapping) as well as a fully automated multi-agent evaluation framework that independently judges agent behavior at both the conversational and OS-level physical layers. Evaluation across multiple frontier agents reveals three consistent findings: (1) current agents lack effective safety awareness against dangerous instructions in real OS environments, with the strong model (e.g. Claude Sonnet 4.6) still executing 40.64% of high-risk operations; (2) agents exhibit pervasive Execution Hallucination (EH), verbally refusing a request while the dangerous operation has already completed at the system level, a phenomenon invisible to every prior semantic-only evaluation framework; and (3) skill injection and entity wrapping attacks we designed achieve high success rates, exposing pronounced agent vulnerabilities to malicious skill interference and instruction obfuscation. LITMUS provides the first standardized platform for reproducible, physically grounded behavioral safety evaluation of LLM agents in real OS environments.

解读视频

视频观看页 B 站 YouTube

链接

论文链接