Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values

Haonan Dong; Qiguan Feng; Kehan Jiang; Haoran Ye; Xin Zhang; Guojie Song

Category: Safety, Governance, and Reliability · Rating tier: Breakthrough

Curation and commentary: DAST AI

Published: 2026-05-11
arXiv: 2605.10365

Curator's Commentary

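Agent-ValueBench argues that an agent's value behavior cannot be equated with that of its underlying LLM, because the harness, tools, environment, and action trajectory all change how the agent behaves.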

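The benchmark provides 394 executable environments across 16 domains, with 4,335 value-conflict tasks covering 28 value systems. Each task comes with pole-aligned golden trajectories and a trajectory-level judge.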
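To make the idea of pole-aligned golden trajectories more concrete, the following is a minimal sketch of how checkpoints from two opposing trajectories could be used to place an agent's run between two value poles. It is not the paper's judge, which is rubric-based. The `Task` record, the action names, and the coverage-difference score are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical task record: one checkpoint sequence per value pole."""
    task_id: str
    pole_a_checkpoints: tuple[str, ...]
    pole_b_checkpoints: tuple[str, ...]


def checkpoint_coverage(actions: list[str], checkpoints: tuple[str, ...]) -> float:
    """Fraction of checkpoints found in the trajectory by a greedy in-order match."""
    if not checkpoints:
        return 0.0
    hits = 0
    position = 0  # search resumes just after the last matched action
    for checkpoint in checkpoints:
        try:
            position = actions.index(checkpoint, position) + 1
        except ValueError:
            continue  # checkpoint absent from the rest of the trajectory
        hits += 1
    return hits / len(checkpoints)


def pole_lean(task: Task, actions: list[str]) -> float:
    """Score in [-1, 1]: positive leans toward pole A, negative toward pole B."""
    coverage_a = checkpoint_coverage(actions, task.pole_a_checkpoints)
    coverage_b = checkpoint_coverage(actions, task.pole_b_checkpoints)
    return coverage_a - coverage_b


# Illustrative task: a cautious pole (A) versus an expedient pole (B).
example_task = Task(
    task_id="demo-001",
    pole_a_checkpoints=("read_policy", "ask_user_consent", "send_report"),
    pole_b_checkpoints=("skip_policy", "send_report"),
)
example_actions = ["read_policy", "ask_user_consent", "send_report"]
example_score = pole_lean(example_task, example_actions)
```

In this toy run the agent hits every pole-A checkpoint and one of the two pole-B checkpoints, so the score leans toward pole A. A real judge would need to handle paraphrased actions and partial credit, which exact string matching cannot do.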

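It merits formal inclusion because agent safety is shifting from text-based preference evaluation toward evaluation of execution trajectories, and this paper carries value evaluation over into the agentic modality.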

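It is not rated higher because value-system coverage, the consistency of the psychologists' annotations, and the reliability of the judge can still affect the conclusions.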

Abstract

Autonomous agents have rapidly matured as task executors and seen widespread deployment via harnesses such as OpenClaw. Safety concerns have rightly drawn growing research attention, and beneath them lie the values silently steering agent behavior. Existing value benchmarks, however, remain confined to LLMs, leaving agent values largely uncharted. From intuitive, empirical, and theoretical vantage points, we show that an agent’s values diverge from those of its underlying LLM, and the agentic modality further introduces dataset-, evaluation-, and system-level challenges absent from text-only protocols. We close this gap with Agent-ValueBench, the first benchmark dedicated to agent values. It features 394 executable environments across 16 domains, offering 4,335 value-conflict tasks that cover 28 value systems (332 dimensions). Every instance is co-synthesized through our purpose-built end-to-end pipeline and curated per-instance by human psychologists. Each task ships with two pole-aligned golden trajectories whose checkpoints anchor a trajectory-level rubric-based judge. Benchmarking 14 frontier proprietary and open-weights models across 4 mainstream harnesses, we uncover three concerted findings. Agent values first manifest as a Value Tide of cross-model homogeneity beneath interpretable counter-currents. This tide bends non-additively under harness pull, and yet more decisively under deliberate steering via embedded skills. Together these results signal that the agent-alignment lever is shifting from classical model alignment and prompt steering toward harness alignment and skill steering.

Links

Paper link