MoE(Mixture of Experts)
What is MoE?
Transformer(Dense) vs MoE(Sparse)
Left: Transformer model(Dense Model) Right: MoE(Sparse Model)
- The key difference is in the FFN layer (the feed-forward / fully-connected network)
- MoE replaces the single FFN with multiple FFN copies (experts), only a few of which are activated per token
(Pro) Why is MoE popular?
- At the same FLOPs, more parameters does better on loss!
- Faster to train MoEs
- More activated parameters, better performance
- Parallelizable: different experts can be placed on different devices
(Con) Why haven't MoEs been more popular?
- Infrastructure is complex: MoE's advantage shows up in multi-node training, and splitting across devices is only needed when the model is very large
- Routing decisions are not differentiable
Routing function
TC vs EC
Left: token chooses expert (TC), where each token picks the experts that fit it best. Middle: expert chooses token (EC), where each expert picks its own tokens; the advantage is that every expert receives an equal number of tokens, so resource utilization is higher.
- As the ablation below shows, TC performs better
- In practice, the k in top-k is usually 2
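The expert-choice (EC) variant above can be sketched in a few lines (a minimal sketch; the shapes, capacity rule, and names are illustrative assumptions, not from the lecture):

```python
import numpy as np

def expert_choice(scores, capacity):
    """Expert-choice routing sketch: each expert picks its own tokens.

    scores: (n_experts, n_tokens) router affinity matrix.
    Every expert selects its `capacity` highest-scoring tokens, so each
    expert processes exactly `capacity` tokens and load is balanced by
    construction (the advantage noted above).
    """
    picks = np.argsort(scores, axis=1)[:, -capacity:]
    return picks  # (n_experts, capacity) token indices per expert
```

Note that under EC a token may be picked by several experts or by none, which is the trade-off against TC.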
Router
residual stream x -> router (linear inner product + softmax) -> choose expert(s) -> output = weighted average / sum. The router decision can also be learned with RL, but the compute cost is very high and nobody does this anymore.
Top-K routing detail
$e$ are the router parameters, $u$ is the input token representation. The gate uses only a simple softmax, no MLP. Reason: SFT indirectly affects the inner product (? not fully understood)
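The routing pipeline above (inner product with router parameters, softmax, keep top-k, weighted sum of expert outputs) can be sketched as follows; shapes and function names are illustrative assumptions, not from the lecture:

```python
import numpy as np

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def topk_route(u, E, k=2):
    """u: (d,) token hidden state; E: (n_experts, d) router params e_i."""
    scores = softmax(E @ u)            # s_i = softmax(u . e_i)
    topk = np.argsort(scores)[-k:]     # indices of the k largest scores
    gates = np.zeros_like(scores)
    gates[topk] = scores[topk]         # zero out all but the top-k gates
    return topk, gates

def moe_ffn(u, E, experts, k=2):
    """Output = sum over selected experts of gate_i * expert_i(u)."""
    topk, gates = topk_route(u, E, k)
    return sum(gates[i] * experts[i](u) for i in topk)
```

Note the top-k selection itself is an argsort (not differentiable); gradients flow only through the kept gate values, which is the training challenge discussed later.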
MoE structure
(a) dense model -> replicate the FFN into experts. Con: parameter count doubles
(b) fine-grained experts: cut each expert into smaller pieces by decreasing the size of the projection layer
projection dim = hidden_dim * 4 -> projection dim = hidden_dim * 2
(c) shared expert: add a shared MLP (expert) that is always active, to handle whatever must be shared across tokens (?)
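Variants (b) and (c) can be sketched together: fine-grained experts shrink the projection (e.g. 4d -> 2d) so more, smaller experts fit in the same budget, and a shared expert runs on every token alongside the routed ones. All sizes and names here are illustrative assumptions:

```python
import numpy as np

d = 8
rng = np.random.default_rng(0)

def make_expert(proj_mult, rng):
    """ReLU FFN: d -> proj_mult*d -> d (fine-grained = smaller proj_mult)."""
    W1 = rng.normal(size=(proj_mult * d, d)) / np.sqrt(d)
    W2 = rng.normal(size=(d, proj_mult * d)) / np.sqrt(proj_mult * d)
    return lambda x: W2 @ np.maximum(W1 @ x, 0)

routed = [make_expert(2, rng) for _ in range(4)]  # fine-grained: d -> 2d
shared = make_expert(2, rng)                      # always-on shared expert

def moe_layer(x, gates):
    out = shared(x)                   # shared expert runs on every token
    for g, expert in zip(gates, routed):
        if g > 0:                     # only routed experts with nonzero gate run
            out = out + g * expert(x)
    return out
```

The design intuition: the shared expert absorbs common computation, freeing the routed experts to specialize.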
消融实验(ablations)
X out of Y: X activated experts out of Y total routed experts
- For DeepSeek v1, there are 6+2=8 activated experts in total, with a fine-grained ratio of 1/4, but the total parameter count still doubles.
- Some models use a down-projection instead
Train MoE
Challenge: sparse gating decisions are not differentiable. solutions:
- Reinforcement learning to optimize gating policies
- complex
    - performs no better than hashing / linear routing baselines
- Stochastic perturbations
- From Shazeer et al 2017 – routing decisions are stochastic with gaussian perturbations.
- This naturally leads to experts that are a bit more robust.
- The softmax means that the model learns how to rank K experts
    - Because of the noise, every expert randomly receives some tokens, so this method yields less specialized but more robust experts, at some cost in loss
- Heuristic 'balancing' losses
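The stochastic-perturbation approach (second item above, from Shazeer et al. 2017) can be sketched as noisy top-k gating; the learned-noise-scale form follows that paper, while shapes and names here are illustrative:

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def noisy_topk_gates(x, W_g, W_noise, k, rng):
    """Gaussian noise on router logits before top-k (Shazeer et al. 2017 style).

    x: (d,) token state; W_g, W_noise: (n_experts, d) router weights.
    The per-expert noise scale softplus(W_noise @ x) is itself learned.
    """
    clean = W_g @ x
    noisy = clean + rng.normal(size=clean.shape) * softplus(W_noise @ x)
    topk = np.argsort(noisy)[-k:]          # noise makes this assignment stochastic
    gates = np.zeros_like(noisy)
    z = np.exp(noisy[topk] - noisy[topk].max())
    gates[topk] = z / z.sum()              # softmax over the k retained logits
    return gates
```

Because the noise occasionally flips which experts land in the top-k, every expert sees some tokens during training, which is the robustness effect noted above.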
loss balancing
from Switch Transformer(2022):
loss: loop over all experts and sum the products $f_i \cdot P_i$
- $f_i$: fraction of tokens that were allocated to expert $i$
- $P_i$: fraction of router probability that was allocated to expert $i$ $$loss=\alpha \cdot N \cdot \sum_{i=1}^{N}f_i \cdot P_i$$ Differentiating the loss w.r.t. $P_i$ gives $\alpha \cdot N \cdot f_i$ (the argmax lives inside $f_i$, which is not differentiable, so gradients flow only through $P_i$). This is how the biggest experts get the strongest downweighting: the larger $f_i$, the stronger the gradient pushing $P_i$ down. Open questions: is this the softmax pre top-K? DeepSeek v3?
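The balancing loss above can be computed directly from the router's softmax outputs (a minimal sketch; variable names and the batch shape are illustrative):

```python
import numpy as np

def balancing_loss(router_probs, alpha=0.01):
    """Switch Transformer auxiliary loss: alpha * N * sum_i f_i * P_i.

    router_probs: (n_tokens, N) softmax outputs of the router.
    f_i = fraction of tokens hard-assigned (argmax) to expert i.
    P_i = mean router probability mass on expert i.
    """
    T, N = router_probs.shape
    assigned = router_probs.argmax(axis=1)          # hard token assignment
    f = np.bincount(assigned, minlength=N) / T      # token fractions f_i
    P = router_probs.mean(axis=0)                   # probability fractions P_i
    return alpha * N * np.sum(f * P)
```

With perfectly balanced routing the loss is $\alpha$ (since $f_i = P_i = 1/N$), and it grows as routing collapses onto fewer experts, up to $\alpha \cdot N$ when everything goes to one expert.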