Evaluating large language models for accuracy incentivizes hallucinations

Adam Tauman Kalai; Ofir Nachum; Santosh S. Vempala; Edwin Zhang

doi:10.1038/s41586-026-10549-w

Theory, robustness, and core machine learning / Disruptive / Explainer video available

Published: 2026-04-22

Curator's commentary

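This paper matters not because it proposes yet another hallucination detector, but because it moves the problem up a level: the mainstream training and evaluation pipeline itself rewards unwarranted guessing. The authors argue that next-word pretraining naturally pushes a model to "complete as much as possible" rather than "honestly admit it does not know", and that accuracy-oriented headline metrics then amplify this tendency in post-training and on leaderboards.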

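The paper's core contribution is a new framing of the problem. Rather than treating hallucination purely as missing knowledge, insufficient retrieval, or poor calibration, it redefines hallucination as an incentive mismatch: when errors are almost never penalized and abstention is not encouraged, the optimal strategy is to guess. From this angle, the authors propose open-rubric evaluations, which write the cost of errors explicitly into the evaluation rules and require the model to decide whether to answer according to the stated stakes.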
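The incentive argument can be made concrete with a small expected-score calculation. The sketch below is illustrative only and is not taken from the paper: it assumes a hypothetical rubric that awards +1 for a correct answer, 0 for abstaining, and subtracts `wrong_penalty` for a wrong answer. The function names and the scoring values are assumptions chosen for this example.

```python
def expected_score(p_correct: float, wrong_penalty: float) -> float:
    """Expected score of answering under the assumed rubric:
    +1 if correct, -wrong_penalty if wrong (abstaining scores 0)."""
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def should_answer(p_correct: float, wrong_penalty: float) -> bool:
    """Answer only when the expected score beats abstaining (score 0)."""
    return expected_score(p_correct, wrong_penalty) > 0.0


def confidence_threshold(wrong_penalty: float) -> float:
    """Confidence above which answering beats abstaining.

    Solving p - (1 - p) * penalty > 0 for p gives p > penalty / (1 + penalty).
    """
    return wrong_penalty / (1.0 + wrong_penalty)


if __name__ == "__main__":
    # Plain accuracy corresponds to wrong_penalty = 0: any nonzero chance
    # of being right makes guessing the better choice.
    for penalty in (0.0, 1.0, 3.0):
        print(f"penalty={penalty}: answer only if confidence > "
              f"{confidence_threshold(penalty):.2f}")
```

Under this assumed rubric, a penalty of zero (plain accuracy) puts the threshold at zero, so even a 10% guess is worth making. Raising the penalty raises the confidence the model needs before answering, which is the behavior a rubric with stated error costs is meant to test.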

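It deserves formal inclusion, and I have rated it disruptive, because work of this kind directly affects future benchmark design, leaderboard metrics, and post-training objectives, and even how we define a "more reliable" model. Compared with yet another local mitigation trick, it reads more like a correction to the objective function of the whole evaluation and training loop.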

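It was not raised to paradigm because the open-rubric proposal is, for now, a principled scheme and a set of evaluation recommendations. Whether it becomes the community's new default depends on whether later benchmarks, product evaluations, and safety-assessment frameworks actually follow through.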

Original abstract

Large language models sometimes produce confident, plausible falsehoods (“hallucinations”), limiting their reliability. Prior work has offered numerous explanations and effective mitigations such as retrieval and tool use, consistency-based self-verification, and reinforcement learning from human feedback. Nonetheless, the problem persists even in state-of-the-art language models. Here we show how next-word prediction and accuracy-based evaluations inadvertently reward unwarranted guessing. Initially, next-word pretraining creates statistical pressure toward hallucination even with idealized error-free data: using learning theory, we show that facts lacking repeated support in training data (such as one-off details) yield unavoidable errors, while recurring regularities (such as grammar) do not. Subsequent training stages aim to correct such errors. However, dominant headline metrics like accuracy systematically reward guessing over admitting uncertainty. To align incentives, we suggest two additions to the classic approach of adding error penalties to evaluations to control abstention. First, we propose “open-rubric” evaluations that explicitly state how errors are penalized (if at all), which test whether a model modulates its abstentions to stated stakes while optimizing accuracy. Second, since hallucination-specific benchmarks rarely make leaderboards, we suggest using open-rubric variants of existing evaluations, to reverse their guessing incentives. Reframing hallucination as an incentive problem opens a practical path toward more reliable language models.

Explainer video

Available on the video watch page, Bilibili, and YouTube.

Links

Paper link
