A Survey on Mixture of Experts (Part 2: System Design of Mixture of Experts)


A Survey on Mixture of Experts (Part 1: Algorithm Design of Mixture of Experts)


A Survey on Mixture of Experts

arxiv

github:A-Survey-on-Mixture-of-Experts-in-LLMs

5 System Design of Mixture of Experts

While Mixture of Experts (MoE) has been increasingly leveraged to enhance the capabilities of large language models, its adoption introduces new challenges for existing training and inference systems, owing to the inherently sparse and dynamic nature of its computational workload. GShard [28] introduces expert parallelism, which implements parallel gating and expert computation by dispatching partitioned local tokens under a load-balancing limit on expert capacity. Since then, expert parallelism has emerged as a fundamental strategy for scaling MoE models efficiently. This approach can be viewed as an augmentation of data parallelism [197], [198], [199], where each expert in an MoE layer is assigned to a distinct device, while all non-expert layers are replicated across devices. As depicted in Figure 8(a), the process flow of expert parallelism consists of the following sequential operations: gate routing, input encode, All-to-All dispatch, expert computation, All-to-All combine, and output decode. In general, the input to a general matrix multiply (GEMM) needs to be large enough to achieve the utilization and throughput that the computing device requires. Therefore, input encode is employed to aggregate the input tokens routed to the same expert into a contiguous memory space, as determined by the token-expert mapping produced by gate routing. Subsequently, the All-to-All dispatch sends the input tokens to their corresponding experts across the distributed devices. Following the localized computation by the experts, the inverse process (All-to-All combine and output decode) reinstates the original data layout according to the gating indices.
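
To make this sequence concrete, the sketch below runs one expert-parallel MoE layer with top-1 routing and one expert per rank, using torch.distributed primitives. It is a minimal illustration under those assumptions (no expert-capacity limit, no token dropping, an All-to-All-capable backend such as NCCL); it is not the GShard implementation, and names such as `moe_forward` are ours.

```python
import torch
import torch.distributed as dist

def moe_forward(x, gate, experts, group=None):
    """x: [num_tokens, d_model]; gate: nn.Linear mapping d_model -> num_experts;
    experts: list holding the expert module(s) placed on this rank."""
    num_experts = gate.out_features
    assert num_experts == dist.get_world_size(group)  # one expert per rank here

    # 1) Gate routing: pick the top-1 expert for each token.
    scores = torch.softmax(gate(x), dim=-1)
    expert_idx = scores.argmax(dim=-1)                 # [num_tokens]

    # 2) Input encode: sort tokens so that each expert's tokens are contiguous.
    order = torch.argsort(expert_idx)
    x_sorted = x[order]
    counts = torch.bincount(expert_idx, minlength=num_experts)

    # 3) All-to-All dispatch: exchange token counts, then the tokens themselves.
    recv_counts = torch.empty_like(counts)
    dist.all_to_all_single(recv_counts, counts, group=group)
    recv_buf = x_sorted.new_empty(int(recv_counts.sum()), x.size(-1))
    dist.all_to_all_single(recv_buf, x_sorted,
                           output_split_sizes=recv_counts.tolist(),
                           input_split_sizes=counts.tolist(), group=group)

    # 4) Expert computation on the expert hosted by this rank.
    out_local = experts[0](recv_buf)

    # 5) All-to-All combine: return each token's result to the rank that owns it.
    combined = torch.empty_like(x_sorted)
    dist.all_to_all_single(combined, out_local,
                           output_split_sizes=counts.tolist(),
                           input_split_sizes=recv_counts.tolist(), group=group)

    # 6) Output decode: undo the sort to restore the original token order.
    #    (Gate-score scaling of the output is omitted for brevity.)
    out = torch.empty_like(combined)
    out[order] = combined
    return out
```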

Furthermore, the synergy of expert parallelism [36], [132], [135], [200], [201] with other existing parallel strategies (tensor [202], [203], [204], pipeline [205], [206], [207], and sequence parallelism [208], [209], [210]) has been investigated to enhance the scalability and efficiency of MoE models in large-scale distributed environments. As shown in Figure 8, we illustrate several examples of hybrid parallelism, encompassing (b) data + expert + tensor parallelism [36], [66], [132], [135], [138], (c) data + expert + pipeline parallelism [132], [134], [138], and (d) expert + tensor parallelism [67]. It is important to recognize that the choice of distributed parallelism strategy governs a complex interplay among computation efficiency, communication overhead, and memory occupation, which is further affected by the hardware configuration. Consequently, deployment strategies for practical applications necessitate nuanced trade-offs and bespoke designs tailored to specific use-case scenarios.
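
A minimal sketch of how a flat set of ranks can be carved into the orthogonal process groups used by hybrid layouts such as Figure 8(b)-(d): ranks along one axis form expert-parallel groups (communicating via All-to-All), while ranks along the other axis form data-parallel groups (All-Reduce over replicated non-expert parameters). The function name and grid layout are illustrative assumptions; production frameworks expose their own configuration for this.

```python
import torch.distributed as dist

def build_groups(world_size: int, expert_parallel_size: int):
    """Arrange ranks in a (data_parallel x expert_parallel) grid and return
    the two process groups that the calling rank belongs to."""
    assert world_size % expert_parallel_size == 0
    data_parallel_size = world_size // expert_parallel_size
    rank = dist.get_rank()
    ep_group, dp_group = None, None

    # Ranks in the same row host different experts (expert parallelism, All-to-All).
    for row in range(data_parallel_size):
        ranks = list(range(row * expert_parallel_size,
                           (row + 1) * expert_parallel_size))
        g = dist.new_group(ranks)          # every rank must create every group
        if rank in ranks:
            ep_group = g

    # Ranks in the same column replicate the same experts (data parallelism, All-Reduce).
    for col in range(expert_parallel_size):
        ranks = list(range(col, world_size, expert_parallel_size))
        g = dist.new_group(ranks)
        if rank in ranks:
            dp_group = g

    return ep_group, dp_group
```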

In the subsequent discussion, we delineate the challenges introduced by MoE models from the computation, communication, and storage perspectives, and concurrently review existing research addressing these issues. Table 4 shows an overview of the open-source MoE frameworks.




5.1 Computation

Although MoE is designed to scale model parameters efficiently without increasing computational demand, it encounters challenges pertaining to computational efficiency. One concern is the imbalance of computational load across distributed devices employing expert parallelism, which incurs significant synchronization overhead as the system waits for the most heavily loaded expert to finish processing. Such issues are typically addressed through algorithmic strategies, such as optimized gating mechanisms and expert capacity adjustments, as discussed in Section 4.1. Besides, solutions like SE-MoE [133], Tutel [132], FlexMoE [137] and SmartMoE [138] have introduced dynamic expert placement strategies to distribute the workload as evenly as possible among devices. Additionally, FasterMoE [134] has implemented a novel dynamic shadowed expert strategy, replicating experts on multiple devices to mitigate severe load imbalance. These model-placement strategies impact both computation and communication efficiency.
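
As an illustration of the placement idea (not the specific algorithms of SE-MoE, FlexMoE, or SmartMoE), the sketch below greedily assigns experts, heaviest first, to the currently least-loaded device, a common heuristic for evening out per-device work. The function name and example loads are illustrative.

```python
import heapq

def place_experts(expert_loads: list[int], num_devices: int) -> dict[int, int]:
    """Return a mapping expert_id -> device_id that balances observed token loads."""
    # Min-heap of (accumulated_load, device_id): the least-loaded device is always on top.
    heap = [(0, d) for d in range(num_devices)]
    heapq.heapify(heap)
    placement = {}
    # Assign the heaviest experts first, each onto the least-loaded device so far.
    for expert_id in sorted(range(len(expert_loads)), key=lambda e: -expert_loads[e]):
        load, device = heapq.heappop(heap)
        placement[expert_id] = device
        heapq.heappush(heap, (load + expert_loads[expert_id], device))
    return placement

# Example: 8 experts with skewed loads placed on 4 devices.
print(place_experts([900, 40, 30, 500, 20, 10, 300, 200], 4))
```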

Another concern is that MoE introduces additional computational overhead through operations including gate routing, input encode, and output decode. Unlike expert computations, which mirror operations in dense models and benefit from extensive optimization on prevalent hardware such as GPUs, these MoE operations are characterized by redundant computation and memory movement, resulting in low efficiency on computing devices. Therefore, recent studies like DeepSpeed-MoE [66], FastMoE [131], HetuMoE [136] and Tutel [132] have focused on the development of tailored GPU kernels to enhance the efficiency of MoE operations.

In contexts where multiple experts are deployed on a single GPU device, MegaBlocks [139] reformulates MoE computation in terms of block-sparse operations, developing specialized block-sparse GPU kernels that efficiently handle the dynamic workloads without dropping tokens. Zheng et al. [141] propose PIT, a deep-learning compiler tailored for the dynamic sparsity of MoE, which can find feasible PIT rules for all the operators within a model and generate optimized GPU kernels for them. PIT employs a novel tiling mechanism, utilizing the Permutation Invariant Transformation (PIT), a mathematically proven property, to transform multiple sparsely located micro-tiles into a GPU-efficient dense tile without changing the computation results, thereby achieving both high GPU utilization and low coverage waste. Despite these advancements, Tan et al. [140] highlight remaining optimization potential within current MoE frameworks such as MegaBlocks and PIT, which commence with an initial scatter-to-group data copy that increases the memory footprint and requires a translation of the MoE problem into a sparse matrix format. Although this translation contributes minimally to computation overhead, it imposes limitations on the transparency and adaptability of extending MegaBlocks to modules beyond the FFN. To address these issues, Tan et al. [140] propose ScatterMoE, an MoE implementation designed to effectively minimize the memory footprint. ScatterMoE leverages ParallelLinear, a linear module capable of executing grouped matrix operations on scattered groups. This approach yields intermediate representations (e.g., the hidden states of an SMoE MLP) that are directly accessible as standard PyTorch tensors, allowing for easy extensions of MoE methods to other types of expert modules.
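
The core computation that both MegaBlocks and ScatterMoE accelerate is a grouped matrix multiplication over tokens scattered across experts. The naive PyTorch loop below shows that computation in its simplest form; it gathers each expert's rows explicitly, which is exactly the extra memory movement that the specialized block-sparse and ParallelLinear kernels avoid. Function and variable names here are illustrative.

```python
import torch

def grouped_expert_mlp(x, expert_idx, w1, w2):
    """x: [tokens, d]; expert_idx: [tokens]; w1: [E, d, h]; w2: [E, h, d]."""
    out = torch.zeros_like(x)
    for e in range(w1.size(0)):
        rows = (expert_idx == e).nonzero(as_tuple=True)[0]
        if rows.numel() == 0:
            continue
        h = torch.relu(x[rows] @ w1[e])   # per-expert FFN on the gathered rows
        out[rows] = h @ w2[e]             # scatter the results back in place
    return out

# Tiny usage example: 16 tokens of width 8, 4 experts with hidden size 32.
x = torch.randn(16, 8)
idx = torch.randint(0, 4, (16,))
out = grouped_expert_mlp(x, idx, torch.randn(4, 8, 32), torch.randn(4, 32, 8))
```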




5.2 Communication

In expert parallelism, the four invocations of All-to-All communication during the forward and backward propagation of each MoE layer cause significant overhead, and can even emerge as the primary constraint on efficiency. The All-to-All communication paradigm encompasses both intra-node (via PCIe or pre-4th-generation NVLink) and inter-node (Ethernet, InfiniBand, or 4th-generation NVLink) communication channels. The efficiency of such communication is contingent upon a multitude of factors, including the heterogeneity of channel bandwidths, the network topology, and the collective communication algorithms. Moreover, load imbalances intrinsic to MoE may exacerbate these inefficiencies by inducing synchronization delays.

To exploit the high intra-node bandwidth and cope with the low inter-node bandwidth, DeepSpeed-MoE [66], HetuMoE [136] and ScheMoE [147] have introduced hierarchical All-to-All communication strategies that increase intra-node exchange and reduce inter-node data exchange. Besides, FasterMoE [134], TA-MoE [143] and SE-MoE [133] have introduced topology-aware routing strategies aimed at mitigating cross-node expert selection, thereby reducing the inter-node communication burden. Additionally, ExFlow [142] exploits expert affinity, anticipating expert allocation across layers to maximize the retention of token processing within local GPU confines. Strategically allocating experts to minimize network traffic and leverage high-bandwidth connections is a prevalent approach in distributed MoE systems [66], [67], [135], and it is often integrated with the placement design of non-expert modules to optimize overall system performance.
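
The sketch below illustrates the two-stage idea behind hierarchical All-to-All with equal-sized chunks: an intra-node exchange first groups data by the destination GPU's local index, and a single inter-node exchange then delivers it, so each GPU sends one large message per remote node rather than many small ones. The group construction, rank ordering (ranks enumerated node by node), and equal-chunk assumption are simplifications relative to the DeepSpeed-MoE, HetuMoE, and ScheMoE implementations.

```python
import torch
import torch.distributed as dist

def hierarchical_all_to_all(send, intra_group, inter_group, nodes, local_size):
    """send: [world, chunk, d]; row j is the chunk destined for global rank j,
    with j = dst_node * local_size + dst_local."""
    world, chunk, d = send.shape

    # Stage 1 (intra-node): route each chunk to the local GPU whose local index
    # matches the destination GPU's local index.
    buf = send.view(nodes, local_size, chunk, d).transpose(0, 1).contiguous()
    recv1 = torch.empty_like(buf)                      # [local_size, nodes, chunk, d]
    dist.all_to_all_single(recv1.view(local_size, -1), buf.view(local_size, -1),
                           group=intra_group)
    # recv1[src_local, dst_node] now holds the chunk from local GPU src_local
    # destined for (dst_node, my_local_rank).

    # Stage 2 (inter-node): one exchange among GPUs sharing the same local index.
    buf2 = recv1.transpose(0, 1).contiguous()          # [nodes, local_size, chunk, d]
    recv2 = torch.empty_like(buf2)
    dist.all_to_all_single(recv2.view(nodes, -1), buf2.view(nodes, -1),
                           group=inter_group)
    # recv2[src_node, src_local] is the chunk sent to this GPU by global rank
    # src_node * local_size + src_local.
    return recv2.view(world, chunk, d)
```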

Since communication and computation can proceed concurrently, pipelining [205], [206], [207] is commonly employed to overlap their execution and thereby reduce the total time cost. This technique, integrated in systems such as Tutel [132], FasterMoE [134], PipeMoE [146] and MPipeMoE [144], orchestrates the overlap between All-to-All communication and expert computation. Notably, Lancet [145] underscores the inherent constraints of these pipelining methods, particularly the bounded duration for which expert computation and communication can overlap. To address this limitation, Lancet partitions non-MoE computations and integrates them into the pipeline during the forward pass, and strategically schedules gradient weight computations to augment the overlap in the backward pass. Punniyamurthy et al. [148] also emphasize the challenge posed by collective communications, which often lie on the critical path, noting the difficulty of hiding their latency by overlapping kernel-granular communication and computation due to the absence of independent computation. Their solution fuses computation with the dependent collective communication by leveraging the GPU's massive parallelism and GPU-initiated communication.
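
A minimal sketch of this overlap, in the spirit of (but not identical to) Tutel and PipeMoE: the token buffer is split into micro-chunks, and the All-to-All for chunk i+1 is issued asynchronously while the expert processes chunk i. Equal splits across ranks and an All-to-All-capable backend (e.g., NCCL) are assumed.

```python
import torch
import torch.distributed as dist

def pipelined_dispatch_compute(chunks, expert, group=None):
    """chunks: list of [tokens, d] tensors, each evenly divisible across ranks."""
    recv = [torch.empty_like(c) for c in chunks]
    outputs = []
    # Pre-issue the dispatch of the first chunk.
    handle = dist.all_to_all_single(recv[0], chunks[0], group=group, async_op=True)
    for i in range(len(chunks)):
        handle.wait()                                  # chunk i has arrived
        if i + 1 < len(chunks):                        # overlap: start moving chunk i+1
            handle = dist.all_to_all_single(recv[i + 1], chunks[i + 1],
                                            group=group, async_op=True)
        outputs.append(expert(recv[i]))                # compute on chunk i meanwhile
    return outputs
```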

Aiming to break these inherent dependencies and thereby extend the overlap duration, ScMoE [110] restructures the MoE architecture to process representations from preceding layers simultaneously with the current layer's representations. This decoupling of communication dependencies facilitates substantial, and in certain cases complete, overlap between communication and computation. Snowflake Arctic [32] employs a similar design, utilizing a Dense-MoE hybrid transformer architecture to overlap communication with computation.




5.3 Storage

The ever-increasing parameter counts of MoE models exacerbate the constraints posed by the memory capacity of compute devices, a challenge already pronounced for dense models. While expert parallelism offers a mitigation strategy by distributing experts across multiple devices, individual devices may still struggle to accommodate numerous experts, particularly in inference contexts where device capacity, such as that of edge devices (PCs, smartphones, IoT devices), is inherently more restricted.

Considering the hierarchical storage pyramid, solutions like SE-MoE [133], Pre-gated MoE [149], and EdgeMoE [150] selectively retain only the essential non-expert parameters and the active expert parameters within the GPU's High-Bandwidth Memory (HBM), offloading inactive expert parameters to CPU memory or SSDs. Because these schemes incur additional overhead from data transfer across the storage hierarchy, they integrate expert-selection forecasting and expert-parameter prefetching techniques to overlap parameter access with computation.
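
The sketch below illustrates the offload-and-prefetch pattern: inactive expert weights stay in pinned CPU memory, and the experts predicted for an upcoming layer are copied to the GPU on a side stream so the transfer overlaps with ongoing computation. The class and its interface are hypothetical stand-ins, not the APIs of SE-MoE, Pre-gated MoE, or EdgeMoE.

```python
import torch

class ExpertCache:
    def __init__(self, cpu_experts):
        # cpu_experts: list of state_dicts; keep them in pinned host memory so
        # host-to-device copies can be asynchronous.
        self.cpu_experts = [{k: v.pin_memory() for k, v in sd.items()}
                            for sd in cpu_experts]
        self.copy_stream = torch.cuda.Stream()
        self.gpu_cache = {}

    def prefetch(self, expert_ids):
        """Start host-to-device copies for the experts predicted to be needed next."""
        with torch.cuda.stream(self.copy_stream):
            for e in expert_ids:
                if e not in self.gpu_cache:
                    self.gpu_cache[e] = {k: v.to("cuda", non_blocking=True)
                                         for k, v in self.cpu_experts[e].items()}

    def get(self, expert_id):
        """Wait for prefetched weights; fall back to a synchronous fetch on a miss."""
        torch.cuda.current_stream().wait_stream(self.copy_stream)
        if expert_id not in self.gpu_cache:
            self.gpu_cache[expert_id] = {k: v.to("cuda")
                                         for k, v in self.cpu_experts[expert_id].items()}
        return self.gpu_cache[expert_id]
```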

In addition, MPipeMoE [144] introduces a strategy to reduce the memory overhead associated with activations and temporary buffers. This is achieved by sharing buffers across different tensor partitions, while leveraging recomputation, communication, and CPU offloading to recover the requisite activations in the backward pass.
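
As one concrete instance of recovering activations by recomputation (one ingredient of MPipeMoE's strategy, shown here with PyTorch's generic gradient checkpointing rather than MPipeMoE's own mechanism):

```python
import torch
from torch.utils.checkpoint import checkpoint

def memory_light_expert_forward(expert, x):
    # The expert's intermediate activations are not stored during the forward
    # pass; they are recomputed when gradients are needed in the backward pass.
    return checkpoint(expert, x, use_reentrant=False)
```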

