AgentSys: Secure and Dynamic LLM Agents Through Explicit Hierarchical Memory Management

Ruoyao Wen; Hao Li; Chaowei Xiao; Ning Zhang

智能体与自主科学颠覆级有讲解视频

发表时间: 2026-02-07
arXiv: 2602.07398

收录解读

间接 prompt injection 的核心问题，不只是模型会不会识别恶意内容，而是传统 agent 会把工具输出、网页内容和中间痕迹一股脑塞进同一上下文，导致恶意指令在整个工作流里持续驻留并反复影响决策。现有防御大多默认这种 bloated memory 是既定条件，再在其上做过滤、检测或鲁棒 prompting。

AgentSys 直接改写了这个前提。它把 agent 组织成带层级隔离的结构：主 agent 为工具调用生成 worker agent，每个 worker 在独立上下文中运行，外部数据和子任务痕迹不进入主 agent 记忆，只有经过 schema 校验和确定性 JSON 解析的返回值可以跨边界流动。论文还加入 validator/sanitizer，并把防御开销做成与操作次数而不是上下文长度相关。

这篇工作值得收录，而且我给到 disruptive，因为它把 agent prompt injection 防御从“在污染上下文里尽量变稳”转向“通过显式记忆隔离阻止污染进入主工作记忆”。这不是一个局部 patch，而是一种更耐久的 agent runtime 安全组织方式，对浏览器 agent、API agent 和企业自动化流程都有直接复用价值。

它没有升到 paradigm，是因为当前证据还主要集中在 AgentDojo、ASB 和作者实现生态内，尚未成为行业默认的 agent sandbox / runtime blueprint。但作为一条系统级安全路线，它已经明显高于普通 benchmark defense。

原始摘要与中文对照

中文对照翻译

AGENT S YS：通过显式分层内存管理实现安全动态的LLM代理。摘要间接提示注入通过在外部内容中嵌入恶意指令来威胁LLM代理，从而导致未经授权的操作和数据窃取。LLM代理通过其上下文窗口维护工作记忆，该窗口存储交互历史以供决策。传统代理不加区分地将所有工具输出和推理痕迹累积到此记忆中，从而产生两个关键漏洞：(1) 注入的指令在整个工作流程中持续存在，使攻击者有多次机会操纵行为，以及 (2) 冗长、非必要的内容会降低决策能力。现有防御措施将臃肿的记忆视为既定事实，并侧重于保持弹性，而非减少不必要的累积以阻止攻击。我们提出了AGENT S YS，一个通过显式内存管理防御间接提示注入的框架。受操作系统中进程内存隔离的启发，AGENT S YS分层组织代理：主代理为工具调用生成工作代理，这些工作代理在隔离的上下文中执行，并可以递归地生成嵌套工作代理来处理子任务。外部数据和子任务推理痕迹绝不会直接进入主代理的记忆，只有经过模式验证的返回值才能通过确定性JSON解析跨越隔离边界。仅这种架构分离就提供了实质性的安全性：消融研究表明，上下文隔离在没有额外机制的情况下实现了2.19%的攻击成功率，这表明原则性的内存管理从根本上减少了攻击面。验证器和清理器进一步加强了防御，事件触发的检查确保开销随操作而非上下文长度扩展。在AgentDojo和ASB上的评估表明，AGENT S YS实现了0.78%和4.25%的攻击成功率，同时相对于未防御的基线略微提高了良性效用。AGENT S YS在面对自适应攻击者和跨多个基础模型时保持了稳健的性能，这表明显式内存管理能够实现安全、动态的LLM代理架构。我们的代码可在https://github.com/ruoyaow/agentsys-memory获取。

原始摘要

A BSTRACT Indirect prompt injection threatens LLM agents by embedding malicious instructions in external content, enabling unauthorized actions and data theft. LLM agents maintain working memory through their context window, which stores interaction history for decision-making. Conventional agents indiscriminately accumulate all tool outputs and reasoning traces in this memory, creating two critical vulnerabilities: (1) injected instructions persist throughout the workflow, granting attackers multiple opportunities to manipulate behavior, and (2) verbose, non-essential content degrades decision-making capabilities. Existing defenses treat bloated memory as given and focus on remaining resilient, rather than reducing unnecessary accumulation to prevent the attack. We present AGENT S YS, a framework that defends against indirect prompt injection through explicit memory management. Inspired by process memory isolation in operating systems, AGENT S YS organizes agents hierarchically: the main agent spawns worker agents for tool invocations, which execute in isolated contexts and can recursively spawn nested workers for subtasks. External data and subtask reasoning traces never directly enter the main agent’s memory, where only schema-validated return values may cross isolation boundaries through deterministic JSON parsing. This architectural separation alone provides substantial security: ablation studies show context isolation achieves 2.19% attack success rate without additional mechanisms, demonstrating that principled memory management fundamentally reduces attack surface. A validator and sanitizer further strengthen defense, with event-triggered checks ensuring overhead scales with operations rather than context length. Evaluation on AgentDojo and ASB shows AGENT S YS achieves 0.78% and 4.25% attack success rates while slightly improving benign utility over undefended baselines. AGENT S YS maintains robust performance against adaptive attackers and across multiple foundation models, demonstrating that explicit memory management enables secure, dynamic LLM agent architectures. Our code is available at https://github.com/ruoyaow/agentsys-memory.

解读视频

视频观看页 B 站 YouTube

链接

论文链接

收录解读

原始摘要与中文对照

中文对照翻译

原始摘要

解读视频

相关论文

链接