Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles

Jinyang Wu; Guocheng Zhai; Ruihan Jin; Yuhao Shen; Zhengxi Lu; Fan Zhang; Haoran Luo; Zheng Lian; Zhengqi Wen; Jianhua Tao

Agents and Autonomous Science · Breakthrough-level

Curation and commentary: DAST AI

Published: 2026-05-21
arXiv: 2605.22177

Curator's Notes

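Maestro addresses a composition problem for autonomous agents: models and skills keep multiplying, yet most systems still rely on fixed logic or a single large model and cannot dynamically exploit the complementary strengths of different expert models and tool skills.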

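The paper recasts heterogeneous multimodal tasks as a sequential decision process over a hierarchical model-skill registry, in which a lightweight policy decides whether to call an external expert, which model-skill pair to select, and when to terminate.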
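The decision loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Registry` layout, the `Action` type, `toy_policy`, and `run_episode` are hypothetical names introduced here, and the policy is a random stand-in rather than a trained 4B orchestrator.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical two-level registry: model name -> skill name -> callable.
Registry = Dict[str, Dict[str, Callable[[str], str]]]
Step = Tuple[Tuple[str, str], str]  # ((model, skill), observation)


@dataclass
class Action:
    kind: str                                # "call" or "stop"
    pair: Optional[Tuple[str, str]] = None   # (model, skill) when kind == "call"
    answer: Optional[str] = None             # final answer when kind == "stop"


def list_pairs(registry: Registry) -> List[Tuple[str, str]]:
    """Flatten the registry into all available (model, skill) pairs."""
    return [(m, s) for m, skills in registry.items() for s in skills]


def toy_policy(task: str, history: List[Step], registry: Registry,
               rng: random.Random) -> Action:
    """Stand-in for a learned policy: call one random pair, then stop."""
    if history:
        return Action("stop", answer=history[-1][1])
    return Action("call", pair=rng.choice(list_pairs(registry)))


def run_episode(task: str, registry: Registry, policy, max_steps: int = 4,
                seed: int = 0):
    """Run one episode: at each step the policy either calls an expert or stops."""
    rng = random.Random(seed)
    history: List[Step] = []
    for _ in range(max_steps):
        action = policy(task, history, registry, rng)
        if action.kind == "stop":
            return action.answer, history
        model, skill = action.pair
        observation = registry[model][skill](task)  # the expert itself stays frozen
        history.append((action.pair, observation))
    # Step budget exhausted: fall back to the last observation, if any.
    return (history[-1][1] if history else None), history


# Toy experts that ignore the task text and return a fixed string.
demo_registry: Registry = {
    "math_expert": {"solve": lambda task: "4"},
    "vision_expert": {"read_chart": lambda task: "4"},
}
answer, trace = run_episode("2 + 2 = ?", demo_registry, toy_policy)
```

Because the policy only sees the registry as data, adding a new model or skill entry changes the action set without touching the loop, which is the property the paper's generalization claim relies on.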

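Training uses outcome-based RL and needs no step-level supervision. The paper reports that a 4B orchestrator achieves strong results across several categories of multimodal benchmarks and continues to generalize after previously unseen models and skills are added.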
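To make "outcome-based, no step-level supervision" concrete, here is a small sketch of how a single end-of-episode reward might be shared across every step of a rollout. The exact-match reward and the mean-reward baseline are assumptions chosen for illustration; the summary here does not state which RL algorithm or reward function the paper uses.

```python
from typing import List, Sequence


def outcome_reward(prediction: str, reference: str) -> float:
    """Assumed reward: 1.0 if the final answer matches the reference, else 0.0."""
    return 1.0 if prediction.strip() == reference.strip() else 0.0


def per_step_advantages(rollout_lengths: Sequence[int],
                        rewards: Sequence[float]) -> List[List[float]]:
    """Broadcast each rollout's outcome signal to all of its steps.

    No step is scored individually; every decision in a rollout receives
    the same value, here the outcome reward minus the batch mean (assumed baseline).
    """
    baseline = sum(rewards) / len(rewards)
    return [[r - baseline] * n for n, r in zip(rollout_lengths, rewards)]


# Three hypothetical rollouts for one question whose reference answer is "4".
rewards = [outcome_reward(p, "4") for p in ["4", "5", " 4 "]]
advantages = per_step_advantages([2, 3, 1], rewards)
```

In this sketch the two correct rollouts get a positive signal on all of their steps and the incorrect one gets a negative signal on all three of its steps, so the policy is pushed toward whole sequences of expert calls that end in a correct answer.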

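It merits formal inclusion because it turns a skill marketplace or model registry into a learnable orchestration policy, an important engineering and research interface for extending agent capabilities and building modular execution systems.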

Abstract

The proliferation of large language models (LLMs) and modular skills has endowed autonomous agents with increasingly powerful capabilities. Existing frameworks typically rely on monolithic LLMs and fixed logic to interface with these skills. This gives rise to a critical bottleneck: different LLMs offer distinct advantages across diverse domains, yet current frameworks fail to exploit the complementary strengths of models and skills, thereby limiting their performance on downstream tasks. In this paper, we present MAESTRO (Multimodal Agent for Expert-Skill Targeted Reinforced Orchestration), a Reinforcement Learning (RL)-driven orchestration framework that reframes heterogeneous multimodal tasks as a sequential decision-making process over a hierarchical model-skill registry. Rather than consolidating all knowledge into a single model, MAESTRO trains a lightweight policy to dynamically compose ensembles of frozen expert models and a two-tier skill library, deciding at each step whether to invoke an external expert, which model-skill pair to select, and when to terminate. The policy is optimized via outcome-based RL, requiring no step-level supervision. We evaluate MAESTRO across ten representative multimodal benchmarks spanning mathematical reasoning, chart understanding, high-resolution perception, and domain-specific analysis. With only a 4B orchestrator, MAESTRO achieves an average accuracy of 70.1%, surpassing both GPT-5 (69.3%) and Gemini-2.5-Pro (68.7%). Crucially, the learned coordination policy generalizes to unseen models and skills without retraining: augmenting the registry with out-of-domain experts yields a 59.5% average on four challenging benchmarks, outperforming all closed-source baselines. MAESTRO further maintains high computational efficiency with low latency, offering a scalable and robust pathway for deploying collaborative agentic ecosystems. The source code is available at https://github.com/jinyangwu/Maestro.

Links

Paper: arXiv:2605.22177