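Editorial Notes
This paper extends video generation evaluation beyond prompt following to the assessment of professional film quality.
It combines knowledge of the filmmaking pipeline, expert annotations, and a VLM-based evaluator, providing infrastructure for future video RL, reward models, and evaluator agents.
It merits inclusion because a reusable, high-quality evaluation interface can shape both the post-training of video generation models and the workflows built around them.
Its limitation is that the current evidence comes mainly from preprint experiments and the authors' own benchmark; independent replication and validation in broader deployments are still needed.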
Original Abstract and Chinese Translation
Chinese translation
生成式视频基础模型的快速发展已将该领域推向专业级电影合成。为实现如此严苛的质量要求,业界正转向强化学习(RL)和智能体工作流。然而,可靠的评估已成为一个关键瓶颈。现有基准测试主要评估“是否正确”(基本提示遵循),却根本忽视了“是否良好”(电影级质量、表演和美学)。此外,当前的自动化指标缺乏提供可信信号所需的领域特定严谨性,在人类美学感知和机器评分之间造成了严重的信任鸿沟。为弥合这一鸿沟,我们引入了EvalVerse,一个全面的、流程感知且经专家校准的评估框架。我们将视频生成评估不仅仅视为一项工程任务,而是一个核心科学问题:主观电影专业知识的系统数字化。首先,我们将领域知识组织成一个与专业电影制作工作流(前期制作、制作和后期制作)对齐的评估分类体系。其次,我们将人类专家判断提炼成一个包含大规模人工标注的精选数据集。第三,我们通过专家校准的微调策略将这些知识注入视觉-语言模型(VLMs)中,使VLM能够执行显式的思维链推理。与以往工作相比,EvalVerse不仅保留了与基础“正确性”指标的兼容性,而且显著扩展了“良好性”标准,并将任务覆盖范围拓宽至复杂的多镜头序列和视听整合。因此,通过提供细粒度的诊断信号,EvalVerse超越了静态排行榜,为未来的工作(如奖励模型和评估器智能体)建立了基础架构。
Original abstract
The rapid evolution of generative video foundation models has propelled the field toward professional-grade cinematic synthesis. To achieve such demanding quality, the community transitions towards Reinforcement Learning (RL) and agentic workflows. However, reliable evaluation has emerged as a critical bottleneck. Existing benchmarks predominantly evaluate “whether it is right” (basic prompt-following) while fundamentally neglecting “whether it is good” (cinematic quality, acting, and aesthetics). Furthermore, current automated metrics lack the domain-specific rigor required to provide trustworthy signals, creating a severe credibility gap between human aesthetic perception and machine scoring. To bridge this gap, we introduce EvalVerse, a comprehensive, pipeline-aware, and expert-calibrated evaluation framework. We treat video generation assessment not merely as an engineering task, but as a core scientific problem: the systematic digitization of subjective cinematic expertise. First, we organize domain knowledge into an evaluation taxonomy aligned with the professional filmmaking workflow (pre-production, production, and post-production). Second, we distill human expert judgments into a curated dataset with large-scale human annotations. Third, we inject this knowledge into Vision-Language Models (VLMs) through an expert-calibrated fine-tuning strategy, enabling the VLM to perform explicit Chain-of-Thought reasoning. Compared to previous works, EvalVerse not only retains compatibility with foundational “rightness” metrics, but also significantly expands the criteria to “goodness” and broaden the task coverage to complex multi-shot sequencing and audio-visual integration. Consequently, by providing granular diagnostic signals, EvalVerse transcends a static leaderboard and establishes a fundamental infrastructure for future work, such as reward models and evaluator agent.
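To make the idea of "granular diagnostic signals" feeding a reward model more concrete, the following is a minimal sketch of how per-dimension scores grouped by the three filmmaking stages might be collapsed into a single scalar. Only the three stage names come from the abstract. The dimension labels, the 0 to 1 score scale, the equal default weighting, and every function name are assumptions made for illustration; this is not the EvalVerse implementation or its API.

```python
from dataclasses import dataclass
from statistics import mean

# Stage names are taken from the abstract; everything else in this sketch
# (dimension labels, score scale, weighting scheme) is hypothetical.
STAGES = ("pre-production", "production", "post-production")


@dataclass(frozen=True)
class DimensionScore:
    stage: str       # expected to be one of STAGES
    dimension: str   # hypothetical dimension label, e.g. "lighting"
    score: float     # assumed to lie in [0, 1]


def stage_means(scores):
    """Average the scores within each stage; stages with no scores are omitted."""
    out = {}
    for stage in STAGES:
        vals = [s.score for s in scores if s.stage == stage]
        if vals:
            out[stage] = mean(vals)
    return out


def scalar_reward(scores, weights=None):
    """Weighted average of the per-stage means (equal weights by default)."""
    per_stage = stage_means(scores)
    if not per_stage:
        raise ValueError("no scores supplied")
    weights = weights or {stage: 1.0 for stage in per_stage}
    total = sum(weights[stage] for stage in per_stage)
    return sum(weights[stage] * value for stage, value in per_stage.items()) / total


def weakest_dimension(scores):
    """Return the lowest-scoring entry, the most actionable diagnostic."""
    return min(scores, key=lambda s: s.score)


if __name__ == "__main__":
    example = [
        DimensionScore("pre-production", "story", 0.8),
        DimensionScore("production", "acting", 0.4),
        DimensionScore("production", "lighting", 0.6),
        DimensionScore("post-production", "audio_sync", 0.9),
    ]
    print(scalar_reward(example))
    print(weakest_dimension(example))
```

Averaging within each stage before combining keeps a stage with many dimensions from dominating the scalar, while `weakest_dimension` preserves the fine-grained signal that a single leaderboard number would discard. Whether the paper aggregates scores this way is not stated in the abstract.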