
MoE(Mixture of Experts)


What is MoE?

Transformer(Dense) vs MoE(Sparse)

Figure: Left, Transformer model (dense); right, MoE (sparse).

  • The main difference is in the FFN layer (the fully connected feed-forward networks)
  • An MoE keeps multiple copies of the FFN, one per expert (a minimal sketch follows below)
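
Below is a minimal PyTorch sketch of the difference (module names, expert count, and k are illustrative assumptions, not from the source): the dense block has a single FFN, while the MoE block keeps several FFN copies and routes each token to its top-k of them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseFFN(nn.Module):
    """The standard Transformer FFN block (the dense case)."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.down(F.relu(self.up(x)))

class MoELayer(nn.Module):
    """MoE replacement for the FFN: several FFN copies ("experts");
    each token is routed to its top-k experts and their outputs are
    combined using the router weights."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(DenseFFN(d_model, d_ff) for _ in range(n_experts))
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.k = k

    def forward(self, x):                               # x: (n_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)       # routing distribution per token
        weights, idx = probs.topk(self.k, dim=-1)       # top-k experts per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:                  # no token chose expert e
                continue
            out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out
```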

(Pro) Why is MoE popular?

  1. Same FLOPs, more parameters does better on loss!
  2. Faster to train MoEs
  3. More activated parameters, better performance
  4. Parallelizable: different experts can be placed on different devices

(Con) Why hasn't MoE been more popular?

  • Infrastructure is complex: MoE's advantage shows up in multi-node training, and splitting the model across devices only becomes necessary when the model is very large
  • Routing decisions are not differentiable

Routing function

TC vs EC

Left: token-choice (TC), each token picks the expert(s) that suit it best. Middle: expert-choice (EC), each expert picks its tokens; the advantage is that every expert receives an even number of tokens, so resource utilization is higher.

  • As the ablation experiment in the figure below shows, TC performs better (a code sketch of both routing modes follows this list)
  • In practice, the k in top-K is usually 2
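
A minimal sketch of the two routing modes, assuming a matrix of router scores of shape (n_tokens, n_experts); the function names and capacity value are illustrative:

```python
import torch

def token_choice(scores: torch.Tensor, k: int) -> torch.Tensor:
    """Token choice (TC): each token picks its top-k experts.
    Returns a 0/1 assignment mask with the same shape as `scores`."""
    idx = scores.topk(k, dim=-1).indices                  # (n_tokens, k)
    return torch.zeros_like(scores).scatter_(-1, idx, 1.0)

def expert_choice(scores: torch.Tensor, capacity: int) -> torch.Tensor:
    """Expert choice (EC): each expert picks its top-`capacity` tokens,
    so every expert processes exactly the same number of tokens."""
    idx = scores.topk(capacity, dim=0).indices             # (capacity, n_experts)
    return torch.zeros_like(scores).scatter_(0, idx, 1.0)

scores = torch.randn(16, 4)                                # 16 tokens, 4 experts
tc = token_choice(scores, k=2)                             # every row sums to 2
ec = expert_choice(scores, capacity=16 * 2 // 4)           # every column sums to 8
```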

Router

residual stream x -> router (linear inner product + softmax) -> choose an expert -> output a weighted average / sum of the selected experts' outputs. Routing decisions can also be learned with RL, but the compute cost is very high and nobody does this anymore.

Top-K routing detail

e is the router's parameter and u is the input sequence. The gate uses only a simple softmax, with no MLP. Stated reason: SFT indirectly affects the inner product (? not fully understood).
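
For reference, a hedged reconstruction of the top-K gating math these symbols suggest (DeepSeek-style notation is an assumption here, not taken from the figure): $e_i$ is the learned embedding for expert $i$, $u_t$ the residual-stream input for token $t$, and only the top-K scores keep a nonzero gate:

$$
s_{i,t} = \operatorname{softmax}_i\!\left(u_t^{\top} e_i\right), \qquad
g_{i,t} =
\begin{cases}
s_{i,t}, & s_{i,t} \in \operatorname{TopK}\left(\{s_{j,t}\}, K\right) \\
0, & \text{otherwise}
\end{cases}, \qquad
h_t = u_t + \sum_{i=1}^{N} g_{i,t}\,\mathrm{FFN}_i(u_t)
$$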

MoE structure

(a) Dense model -> replicate the FFN into experts. Con: the parameter count doubles.
(b) Fine-grained experts: cut each expert into smaller pieces by decreasing the size of the projection layer, e.g. projection dim = hidden_dim * 4 -> projection dim = hidden_dim * 2 (see the parameter-count sketch below).
(c) Shared expert: keep some shared MLP (expert) that handles whatever must be shared across tokens (?)
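
A small arithmetic sketch of the fine-grained idea (the numbers are illustrative assumptions): halving each expert's projection width halves its parameter count, so two fine-grained experts can be activated for the cost of one original FFN, whereas naive replication (a) simply doubles the total.

```python
d_model = 1024

def ffn_params(d_model: int, d_ff: int) -> int:
    """Up-projection + down-projection weights of one FFN/expert (biases ignored)."""
    return 2 * d_model * d_ff

dense = ffn_params(d_model, 4 * d_model)   # one dense FFN, d_ff = 4 * d_model

# (b) fine-grained: cut d_ff in half, so each expert is half the size;
#     activating 2 of them costs the same as the original dense FFN,
#     while the total pool of routed experts can be made much larger.
fine = ffn_params(d_model, 2 * d_model)
assert 2 * fine == dense                   # same activated params per token

# (a) naive replication: 2 full-size experts -> total parameters double.
print((2 * dense) / dense)                 # 2.0
```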

Ablations

X out of Y: X activated experts out of Y total routed experts

  1. For DeepSeek v1, there are 6 + 2 = 8 experts in total, with fine-grained ratio = 1/4, but the total parameter count still doubles.
  2. Some models use down-projection instead.

Train MoE

Challenge: sparse gating decisions are not differentiable. Solutions:

  • Reinforcement learning to optimize gating policies
    • Complex
    • Works no better than simpler methods such as hashing or linear routing
  • Stochastic perturbations
    • From Shazeer et al. 2017 – routing decisions are stochastic, with Gaussian perturbations.
    • This naturally leads to experts that are a bit more robust.
    • The softmax means that the model learns how to rank K experts
    • Because the perturbation is random noise, every expert receives some tokens by chance, so this method yields less specialized but more robust experts, at some cost in loss (a sketch of noisy gating follows this list)
  • Heuristic 'balancing' losses
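
A simplified sketch of noisy top-k gating in the spirit of Shazeer et al. 2017 (details such as the load-balancing terms are omitted, and the keep-top-k step is simplified):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyTopKGate(nn.Module):
    """Gaussian noise is added to the router logits before the top-k cut,
    and the softmax is taken over the surviving logits only, so the model
    learns to rank experts while every expert occasionally receives tokens."""
    def __init__(self, d_model: int, n_experts: int, k: int = 2):
        super().__init__()
        self.w_gate = nn.Linear(d_model, n_experts, bias=False)   # clean routing logits
        self.w_noise = nn.Linear(d_model, n_experts, bias=False)  # per-expert noise scale
        self.k = k

    def forward(self, x):                                 # x: (n_tokens, d_model)
        clean = self.w_gate(x)
        noise = torch.randn_like(clean) * F.softplus(self.w_noise(x))
        logits = clean + noise if self.training else clean
        topk_val, topk_idx = logits.topk(self.k, dim=-1)
        # softmax over the kept logits only; all other experts get weight 0
        gates = torch.zeros_like(logits).scatter_(-1, topk_idx, F.softmax(topk_val, dim=-1))
        return gates                                      # (n_tokens, n_experts)
```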

Load-balancing loss

From the Switch Transformer (2022):

Loss: loop over all experts and take an inner product of two per-expert vectors:

  • vector $f_i$: fraction of tokens that were allocated to expert $i$
  • vector $P_i$: fraction of router probability that was allocated to expert $i$

$$\text{loss} = \alpha \cdot N \cdot \sum_{i=1}^{N} f_i \cdot P_i$$

Differentiating the loss with respect to $P_i$ gives $\alpha \cdot N \cdot f_i$ ($f_i$ comes from an argmax and carries no gradient), so the experts that already receive the largest fraction of tokens get the strongest down-weighting. Open questions: softmax before top-K? DeepSeek v3? (A code sketch of this loss follows.)
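
A minimal sketch of this auxiliary loss (tensor shapes and names are assumptions): $f_i$ is computed from the hard top-1 assignments, $P_i$ from the mean router probabilities, and only the $P$ term carries gradient.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, alpha: float = 0.01) -> torch.Tensor:
    """Switch-Transformer-style auxiliary loss.
    router_logits: (n_tokens, n_experts) pre-softmax router scores."""
    n_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)              # (n_tokens, n_experts)
    # f_i: fraction of tokens whose top-1 expert is i (argmax -> no gradient)
    f = F.one_hot(probs.argmax(dim=-1), n_experts).float().mean(dim=0)
    # P_i: average router probability mass given to expert i (differentiable)
    P = probs.mean(dim=0)
    return alpha * n_experts * torch.sum(f * P)

loss = load_balancing_loss(torch.randn(128, 8))
```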