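Curator's notes
This paper proposes an explicit design principle for the MoE router, grounded in matrix representation.
It interprets each router row as a proxy for its expert's weight matrix and uses Manifold Power Iteration (MPI) to align each router row with the principal singular direction of that expert.
It merits inclusion because MoE routing is a core bottleneck in large-model architectures, and putting router design on a theoretical footing has strong spillover value.
Its limitation is that the current evidence comes mainly from preprint experiments and the authors' own evaluations; independent replication and validation in broader deployments are still needed.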
Original abstract and Chinese translation
Chinese translation
MoE的核心是路由器,它通常被参数化为一个线性矩阵。对于每个输入token,路由器计算其与矩阵行的相似度分数,并将token分派给与得分最高的行相对应的专家。尽管这种设计简单明了,并长期以来被视为理所当然,但我们挑战了这一传统观念。理想情况下,MoE路由器矩阵中的每个独立行都应忠实地反映专家的内在特征。路由器矩阵因此可以更好地确立每个专家的身份,从而使token-路由器亲和性成为token-专家分配的精确代理。然而,MoE路由器中缺乏一种约束来强制将专家特征编码到表达能力有限的路由器行中。这种缺失可能导致次优的路由器设计,损害MoE模型的训练收敛性和能力。我们提出将每个路由器行与其相应专家权重矩阵的主奇异方向对齐。这一选择基于线性代数直觉:主奇异方向在矩阵中保留了最高的信息密度(Golub和Van Loan,1996;Halko等人,2010),使其成为表征该矩阵的最佳压缩表示。由于每个专家模块都参数化为权重矩阵,将其编码为单个路由器向量正是捕获其最具信息量方向的任务。为了避免精确奇异值分解(SVD)的过高成本,我们利用幂迭代(Halko等人,2010)作为一种轻量级替代方案,在线获取此主方向。具体来说,幂迭代方案仅使用标准矩阵-向量乘积来求解主奇异向量,避免了对昂贵的完整矩阵分解的需求。实际上,我们在每个训练步骤中仅对路由器权重执行单次幂迭代。之后,引入一个回缩步骤来规范化路由器权重的L2范数。路由器是MoE模型的基石组件。作为专家代理,路由器矩阵的行计算它们与MoE输入的相似度,以确定激活哪个专家子集。理想情况下,每个路由器行都旨在将专家矩阵编码到此代表性向量中,从而使其与token的点积能更好地反映token-专家亲和性。然而,目前没有设计原则来强制这种凝练。在本文中,我们提出将每个路由器行与相关专家主奇异方向对齐,因为该方向提供了矩阵最具表达力的数学描述。基于此原则,我们提出了一种使用流形幂迭代(MPI)的路由器重新设计。具体来说,它引入了一种“先幂迭代后回缩”范式,其中对路由器权重执行幂迭代步骤,然后进行回缩以施加范数约束,从而确保效率和稳定性。理论上,我们表明MPI驱动路由器行收敛于相关专家主奇异方向。经验上,我们跨越10亿到110亿参数的规模预训练MoE模型,以证实这种对齐有助于实现更有效的MoE模型。
Original abstract
At the heart of MoE lies the router, which is typically parameterized as a linear matrix. For each input token, the router computes similarity scores against the matrix rows and dispatches the token to the experts corresponding to the top-scoring rows. While this design is straightforward and has long been accepted as a matter of course, we challenge this conventional wisdom. Ideally, each individual row in MoE router matrix should faithfully reflect the expert’s intrinsic features. The router matrix can thus better ground the identity of each expert, allowing token–router affinity to serve as a precise proxy for token–expert assignment. However, there lacks a constraint in MoE router to enforce the encoding of expert features into router rows of limited expressivity. This absence may lead to suboptimal router design, compromising both training convergence and competence of MoE models. We propose to align each router row with the principal singular direction of its corresponding expert’s weight matrix. This choice is grounded in a linear algebraic intuition: the principal singular direction preserves the highest density of information within a matrix (Golub and Van Loan, 1996; Halko et al., 2010), making it the optimal compressed representation to characterize that matrix. Since each expert module is parameterized as weight matrices, encoding it into a single router vector is exactly the task of capturing its most informative direction. To avoid the prohibitive cost of exact singular value decomposition (SVD), we leverage power iteration (Halko et al., 2010) as a lightweight alternative to obtain this principal direction online. Specifically, the power iteration scheme uses only standard matrix-vector products to solve for the principal singular vector, obviating the need for expensive full matrix factorization. In practice, we perform only one single power iteration on the router weights during each training step. 
After that, a retraction step is introduced to regularize the L2 norm of the router weights, main-
Router is the cornerstone component to the Mixture-of-Experts models. Serving as expert proxies, the rows of the router matrix compute their similarity to the MoE inputs to determine which subset of experts is activated. Ideally, each router row is designed to encode the expert matrix into this representative vector, such that its dot-product with token can better reflect token-expert affinity. However, there exists no design principles to enforce this condensation. In this paper, we propose to align each router row with the principal singular direction of the associated expert, as this direction provides the most expressive mathematical description of a matrix. Based on this principle, we propose a router redesign with Manifold Power Iteration (MPI). Specifically, it introduces a “Power-then-Retract” paradigm, where a power iteration step is performed on the router weights, followed by a retraction to impose a norm constraint to ensure both efficiency and stability. Theoretically, we show that MPI drives router rows to converge toward the principal singular directions of associated experts. Empirically, we pretrain MoE model across scales from 1B to 11B parameters to confirm that this alignment facilitates more effective MoE models.
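The power-iteration-then-retraction idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes each expert is represented by a single weight matrix of shape `(d_ff, d_model)`, that the router row should align with that matrix's top right singular vector, and that retraction means rescaling the row to a fixed L2 norm. The names `power_then_retract`, `make_expert`, `alignment`, and `radius` are hypothetical, and the expert weights are synthetic matrices built with a clear gap between the top two singular values so the demo converges quickly.

```python
import numpy as np


def power_then_retract(router, experts, radius=1.0, eps=1e-12):
    """Apply one power step plus retraction to every router row.

    router:  (num_experts, d_model) array, one row per expert.
    experts: list of (d_ff, d_model) arrays (assumed expert weights).
    """
    new_router = np.empty_like(router)
    for i, W in enumerate(experts):
        # Power step: apply W^T W using two matrix-vector products.
        v = W.T @ (W @ router[i])
        # Retraction (assumed form): rescale to a fixed L2 norm.
        new_router[i] = radius * v / (np.linalg.norm(v) + eps)
    return new_router


def make_expert(rng, d_ff, d_model):
    """Synthetic weight matrix with a dominant top singular value."""
    U, _ = np.linalg.qr(rng.standard_normal((d_ff, d_model)))
    V, _ = np.linalg.qr(rng.standard_normal((d_model, d_model)))
    s = np.linspace(1.0, 0.1, d_model)
    s[0] = 3.0  # clear spectral gap, chosen only for the demo
    return (U * s) @ V.T


def alignment(row, W):
    """Absolute cosine between a router row and W's top right singular vector."""
    _, _, vt = np.linalg.svd(W, full_matrices=False)
    return abs(row @ vt[0]) / np.linalg.norm(row)


rng = np.random.default_rng(0)
d_model, d_ff, num_experts = 16, 32, 4
experts = [make_expert(rng, d_ff, d_model) for _ in range(num_experts)]
router = rng.standard_normal((num_experts, d_model))

# The abstract describes one power iteration per training step; here the
# experts are frozen and the step is simply repeated to show convergence.
for _ in range(30):
    router = power_then_retract(router, experts)

for i, W in enumerate(experts):
    print(f"expert {i}: alignment = {alignment(router[i], W):.6f}")
```

With frozen experts the printed alignments should approach 1, which is the behavior the convergence claim in the abstract refers to. In actual training the expert weights change at every step and the power step is interleaved with gradient updates, so this sketch shows only the geometric mechanism, not the training dynamics.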