Token-Level Generalization in LoRA Adapter Backdoors: Attack Characterization and Behavioral Detection

Travis Lelle

安全、治理与可靠性突破级暂无讲解视频

发表时间: 2026-05-28
arXiv: 2605.30189

核心要点

问题/背景: 这篇安全论文研究 LoRA adapter 作为主流微调分发格式时的 backdoor 风险：攻击者可以用少量 poisoned examples 训练出不影响 clean accuracy、但特定触发下可靠激活的 adapter。
方法/机制: 关键发现是 backdoor 的泛化不是按结构模式，而是按 token feature neighborhood：例如在 RFC reference 上训练的触发会迁移到任意 RFC reference，但不迁移到结构相似的 ISO/OWASP/CWE/NIST citation。
结果/证据: 论文系统考察 base model scale/family、LoRA rank、trigger string，并提出两类检测：probe battery 统计的 behavioral detector，以及不运行模型的 weight-level Frobenius norm statistic；causal patching 将后门定位到中后层 MLP，尤其 down_proj。
收录价值: 收录价值在于它把 PEFT/LoRA 供应链安全从泛泛 backdoor demo 推进到 token-level generalization、adapter cohort detection 和权重级检测接口，对 agent/tool model adapter 分发风险有直接复用价值。

收录解读

这篇安全论文研究 LoRA adapter 作为主流微调分发格式时的 backdoor 风险：攻击者可以用少量 poisoned examples 训练出不影响 clean accuracy、但特定触发下可靠激活的 adapter。

关键发现是 backdoor 的泛化不是按结构模式，而是按 token feature neighborhood：例如在 RFC reference 上训练的触发会迁移到任意 RFC reference，但不迁移到结构相似的 ISO/OWASP/CWE/NIST citation。

论文系统考察 base model scale/family、LoRA rank、trigger string，并提出两类检测：probe battery 统计的 behavioral detector，以及不运行模型的 weight-level Frobenius norm statistic；causal patching 将后门定位到中后层 MLP，尤其 down_proj。

收录价值在于它把 PEFT/LoRA 供应链安全从泛泛 backdoor demo 推进到 token-level generalization、adapter cohort detection 和权重级检测接口，对 agent/tool model adapter 分发风险有直接复用价值。

论文摘要

The paper studies poisoned LoRA adapters for LLMs, showing that clean-accuracy-preserving backdoors can saturate with a small fraction of poisoned data and generalize at token-feature rather than structural-pattern level. It characterizes the attack across model scale, family, rank and triggers, and proposes behavioral and weight-level detectors, with causal patching localizing the backdoor to mid-to-late MLP blocks.

链接

论文链接论文链接代码

核心要点

收录解读

论文摘要

相关论文

链接