MoE(Mixture of Experts)
What is MoE?
Transformer(Dense) vs MoE(Sparse)
Left: Transformer model(Dense Model) Right: MoE(Sparse Model)
- The key difference is in the FFN layer (the feed-forward / fully-connected network)
- MoE replaces the single FFN with multiple FFN copies (experts), only a few of which are activated per token
(Pro) Why is MoE popular?
- At the same FLOPs, more parameters does better on loss!
- Faster to train MoEs
- More activated parameters, better performance
- Parallelizable: different experts can be placed on different devices
(Con) Why haven't MoEs been more popular?
- Infrastructure is complex: MoE's advantage shows up in multi-node training, and splitting across devices is only needed when the model is very large
- Routing decisions are not differentiable
Routing function
TC vs EC
Left: token chooses expert (TC), where each token picks the experts that fit it best. Middle: expert chooses token (EC), where each expert picks its own tokens; the advantage is that every expert receives an equal number of tokens, so resource utilization is higher.
- As the ablation below shows, TC performs better
- In practice, the k in top-k is usually 2
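The expert-choice (EC) variant above can be sketched in a few lines (a minimal sketch; the shapes, capacity rule, and names are illustrative assumptions, not from the lecture):

```python
import numpy as np

def expert_choice(scores, capacity):
    """Expert-choice routing sketch: each expert picks its own tokens.

    scores: (n_experts, n_tokens) router affinity matrix.
    Every expert selects its `capacity` highest-scoring tokens, so each
    expert processes exactly `capacity` tokens and load is balanced by
    construction (the advantage noted above).
    """
    picks = np.argsort(scores, axis=1)[:, -capacity:]
    return picks  # (n_experts, capacity) token indices per expert
```

Note that under EC a token may be picked by several experts or by none, which is the trade-off against TC.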
Router
residual stream x -> router (linear inner product + softmax) -> choose expert(s) -> output = weighted average / sum. The router decision can also be learned with RL, but the compute cost is very high and nobody does this anymore.
Top-K routing detail
$e$ are the router parameters, $u$ is the input token representation. The gate uses only a simple softmax, no MLP. Reason: SFT indirectly affects the inner product (? not fully understood)
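The routing pipeline above (inner product with router parameters, softmax, keep top-k, weighted sum of expert outputs) can be sketched as follows; shapes and function names are illustrative assumptions, not from the lecture:

```python
import numpy as np

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def topk_route(u, E, k=2):
    """u: (d,) token hidden state; E: (n_experts, d) router params e_i."""
    scores = softmax(E @ u)            # s_i = softmax(u . e_i)
    topk = np.argsort(scores)[-k:]     # indices of the k largest scores
    gates = np.zeros_like(scores)
    gates[topk] = scores[topk]         # zero out all but the top-k gates
    return topk, gates

def moe_ffn(u, E, experts, k=2):
    """Output = sum over selected experts of gate_i * expert_i(u)."""
    topk, gates = topk_route(u, E, k)
    return sum(gates[i] * experts[i](u) for i in topk)
```

Note the top-k selection itself is an argsort (not differentiable); gradients flow only through the kept gate values, which is the training challenge discussed later.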
MoE structure
(a) dense model -> replicate the FFN into experts. Con: parameter count doubles
(b) fine-grained experts: cut each expert into smaller pieces by decreasing the size of the projection layer
projection dim = hidden_dim * 4 -> projection dim = hidden_dim * 2
(c) shared expert: add a shared MLP (expert) that is always active, to handle whatever must be shared across tokens (?)
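Variants (b) and (c) can be sketched together: fine-grained experts shrink the projection (e.g. 4d -> 2d) so more, smaller experts fit in the same budget, and a shared expert runs on every token alongside the routed ones. All sizes and names here are illustrative assumptions:

```python
import numpy as np

d = 8
rng = np.random.default_rng(0)

def make_expert(proj_mult, rng):
    """ReLU FFN: d -> proj_mult*d -> d (fine-grained = smaller proj_mult)."""
    W1 = rng.normal(size=(proj_mult * d, d)) / np.sqrt(d)
    W2 = rng.normal(size=(d, proj_mult * d)) / np.sqrt(proj_mult * d)
    return lambda x: W2 @ np.maximum(W1 @ x, 0)

routed = [make_expert(2, rng) for _ in range(4)]  # fine-grained: d -> 2d
shared = make_expert(2, rng)                      # always-on shared expert

def moe_layer(x, gates):
    out = shared(x)                   # shared expert runs on every token
    for g, expert in zip(gates, routed):
        if g > 0:                     # only routed experts with nonzero gate run
            out = out + g * expert(x)
    return out
```

The design intuition: the shared expert absorbs common computation, freeing the routed experts to specialize.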
消融实验(ablations)
X out of Y: X activated experts out of Y total routed experts
- For DeepSeek v1, there are 6+2=8 activated experts in total, with a fine-grained ratio of 1/4, but the total parameter count still doubles.
- Some models use a down-projection instead
Train MoE
Challenge: sparse gating decisions are not differentiable. solutions:
- Reinforcement learning to optimize gating policies
- complex
    - performs no better than hashing / linear routing baselines
- Stochastic perturbations
- From Shazeer et al 2017 – routing decisions are stochastic with gaussian perturbations.
- This naturally leads to experts that are a bit more robust.
- The softmax means that the model learns how to rank K experts
    - Because of the noise, every expert randomly receives some tokens, so this method yields less specialized but more robust experts, at some cost in loss
- Heuristic 'balancing' losses
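The stochastic-perturbation approach (second item above, from Shazeer et al. 2017) can be sketched as noisy top-k gating; the learned-noise-scale form follows that paper, while shapes and names here are illustrative:

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def noisy_topk_gates(x, W_g, W_noise, k, rng):
    """Gaussian noise on router logits before top-k (Shazeer et al. 2017 style).

    x: (d,) token state; W_g, W_noise: (n_experts, d) router weights.
    The per-expert noise scale softplus(W_noise @ x) is itself learned.
    """
    clean = W_g @ x
    noisy = clean + rng.normal(size=clean.shape) * softplus(W_noise @ x)
    topk = np.argsort(noisy)[-k:]          # noise makes this assignment stochastic
    gates = np.zeros_like(noisy)
    z = np.exp(noisy[topk] - noisy[topk].max())
    gates[topk] = z / z.sum()              # softmax over the k retained logits
    return gates
```

Because the noise occasionally flips which experts land in the top-k, every expert sees some tokens during training, which is the robustness effect noted above.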
loss balancing
from Switch Transformer(2022):
loss: loop over all experts and sum the products $f_i \cdot P_i$
- $f_i$: fraction of tokens that were allocated to expert $i$
- $P_i$: fraction of router probability that was allocated to expert $i$ $$loss=\alpha \cdot N \cdot \sum_{i=1}^{N}f_i \cdot P_i$$ Differentiating the loss w.r.t. $P_i$ gives $\alpha \cdot N \cdot f_i$ (the argmax lives inside $f_i$, which is not differentiable, so gradients flow only through $P_i$). This is how the biggest experts get the strongest downweighting: the larger $f_i$, the stronger the gradient pushing $P_i$ down. Open questions: is this the softmax pre top-K? DeepSeek v3?
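The balancing loss above can be computed directly from the router's softmax outputs (a minimal sketch; variable names and the batch shape are illustrative):

```python
import numpy as np

def balancing_loss(router_probs, alpha=0.01):
    """Switch Transformer auxiliary loss: alpha * N * sum_i f_i * P_i.

    router_probs: (n_tokens, N) softmax outputs of the router.
    f_i = fraction of tokens hard-assigned (argmax) to expert i.
    P_i = mean router probability mass on expert i.
    """
    T, N = router_probs.shape
    assigned = router_probs.argmax(axis=1)          # hard token assignment
    f = np.bincount(assigned, minlength=N) / T      # token fractions f_i
    P = router_probs.mean(axis=0)                   # probability fractions P_i
    return alpha * N * np.sum(f * P)
```

With perfectly balanced routing the loss is $\alpha$ (since $f_i = P_i = 1/N$), and it grows as routing collapses onto fewer experts, up to $\alpha \cdot N$ when everything goes to one expert.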