LLM Transformer Architecture
Transformer
- Position embedding: sines and cosines $$PE_{(pos,2i)}=\sin(pos/10000^{2i/d_{model}})$$ $$PE_{(pos,2i+1)}=\cos(pos/10000^{2i/d_{model}})$$
- FFN: ReLU $$FFN(x)=\max(0,xW_1+b_1)W_2+b_2$$
- Norm type: post-norm, LayerNorm
Transformer in Assignment 1
Differences:
- LayerNorm is in front of the block
- Rotary position embeddings (RoPE)
- FF layers use SwiGLU, not ReLU
- Linear layers (and layernorm) have no bias (constant) terms
Transformer variants
- Norm: LayerNorm, RMSNorm
- Parallel layer: Serial, Parallel
- Pre-norm/Post-norm
- Position embedding: sine, absolute, relative, RoPE, ALiBi
- Activations: ReLU, GeLU, GeGLU, SwiGLU
Common architecture variations
- residual connections
- layer norms
- gating
Norm
- Basically everyone does pre-norm
- Most people do RMSNorm
Removing bias terms improves stability
Original transformer: $FFN(x)=\max(0,xW_1+b_1)W_2+b_2$. Most implementations: $FFN(x)=\sigma(xW_1)W_2$. Difference: the bias terms are dropped. Result: improved model stability (similar in spirit to RMSNorm).
RMSNorm vs LayerNorm
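A minimal sketch of the difference (PyTorch; the RMSNorm follows Zhang & Sennrich 2019: no mean subtraction and no bias, while LayerNorm subtracts the mean and, by default, adds a bias):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Rescale by the root-mean-square only: no mean subtraction, no bias."""
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))  # learnable gain only

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)

x = torch.randn(2, 8, 64)
# LayerNorm additionally subtracts the mean and adds a bias term.
print(RMSNorm(64)(x).shape, nn.LayerNorm(64)(x).shape)
```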
Post-norm or Pre-norm
- Gray arrows (in the figure): the residual connections, i.e. the residual stream
- Post-Norm (left) -> Pre-Norm (right): moving the LayerNorm off the residual path, to just before the FFN and MHA sublayers, turns post-norm into pre-norm
- The industry consensus is to use pre-norm (or double norm)
- BERT still uses post-norm
- Pre-norm is the more stable architecture and is less prone to exploding gradients
- Post-Norm: less stable; usually requires learning-rate warmup (see the sketch after this list)
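A sketch of the two orderings for one sublayer (PyTorch; the `nn.Linear` here is just a stand-in for MHA or the FFN):

```python
import torch
import torch.nn as nn

d = 64
norm, sublayer = nn.LayerNorm(d), nn.Linear(d, d)  # stand-in for MHA / FFN
x = torch.randn(2, 8, d)

# Post-norm (original Transformer, BERT): the norm sits ON the residual path.
post = norm(x + sublayer(x))

# Pre-norm (GPT-2 and most modern LLMs): the norm is applied inside the branch,
# so the residual stream itself is never normalized -> more stable gradients.
pre = x + sublayer(norm(x))
```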
DoubleNorm
Starting from pre-norm, add a LayerNorm both before and after each sublayer (right figure)
Activations, FFN
ReLU, GELU, Swish, ELU, GLU, GeGLU, ReGLU, SELU, SwiGLU, LiGLU
Gated activations (xxGLU)
GLUs modify the first part of a FF layer: $FF(x)=\max(0,xW_1)W_2$
- Change: linear + ReLU -> augment with an entrywise linear term
- An extra learnable parameter matrix $V$ is added: $$\max(0, xW_1) \;\rightarrow\; \max(0, xW_1) \otimes (xV)$$ The FF layer becomes: $$FF_{ReGLU}(x)=(\max(0,xW_1)\otimes xV)W_2$$
How should the dimension of the new parameter $V$ be set? Shrink $d_{ff}$, the output dimension of $W_1$ and $V$, to 2/3 of the usual value, so that the three matrices $W_1$, $V$, $W_2$ together hold the same parameter count as the original two (see the SwiGLU sketch below).
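A sketch of a bias-free SwiGLU FF layer with the 2/3 width rule applied (PyTorch; dims illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        # Shrink d_ff by 2/3 relative to the usual 4*d_model, so the three
        # matrices (W1, V, W2) hold as many parameters as two 4x matrices.
        d_ff = int(8 * d_model / 3)
        self.w1 = nn.Linear(d_model, d_ff, bias=False)  # gated branch
        self.v = nn.Linear(d_model, d_ff, bias=False)   # entrywise linear term
        self.w2 = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: Swish(xW1) ⊗ (xV), then project back down with W2.
        return self.w2(F.silu(self.w1(x)) * self.v(x))

x = torch.randn(2, 8, 512)
print(SwiGLUFFN(512)(x).shape)  # torch.Size([2, 8, 512])
```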
Serial vs Parallel layers
Serial: outputs come in from the bottom
- Problem: the serial structure limits parallel computation. Parallel: compute the MLP and attention branches in parallel and add both into the residual stream at the end (see the sketch below)
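A sketch of serial vs. parallel blocks (PyTorch; `attn` and `mlp` are stand-ins for the real sublayers):

```python
import torch
import torch.nn as nn

d = 64
norm1, norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
attn, mlp = nn.Linear(d, d), nn.Linear(d, d)  # stand-ins for MHA and the FFN
x = torch.randn(2, 8, d)

# Serial (standard pre-norm): the MLP must wait for the attention output.
y = x + attn(norm1(x))
y = y + mlp(norm2(y))

# Parallel (GPT-J/PaLM style): both branches read the same input and are
# summed into the residual, so their matmuls can run (and fuse) concurrently.
y_par = x + attn(norm1(x)) + mlp(norm2(x))
```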
Variants in Position embedding
Sine embeddings
Add sines and cosines that enable localization
- original transformer $$Embed(x,i)=v_x+PE_{pos}$$ $$PE_{pos,2i}=\sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$ $$PE_{pos,2i+1}=\cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$ Q: Why is the base 10000? A: 10000 is an empirical value; it sets the wavelength of the highest dimensions, i.e. the longest sequence length the encoding can represent. The sinusoids oscillate at high frequency in the low dimensions of $d_{model}$ and at low frequency in the high dimensions. If the base is too small, the positional signal in the high dimensions may repeat within a sequence; if it is too large, the high dimensions barely change and carry no usable positional information. (A sketch that materializes the table follows below.)
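A small sketch that materializes the sinusoidal table from the formulas above (PyTorch; `max_len` and `d_model` are illustrative):

```python
import torch

def sinusoidal_pe(max_len: int, d_model: int, base: float = 10000.0) -> torch.Tensor:
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)           # even dims
    freq = 1.0 / base ** (i / d_model)                             # 1/10000^(2i/d)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * freq)  # high frequency at low dims
    pe[:, 1::2] = torch.cos(pos * freq)  # low frequency at high dims
    return pe

pe = sinusoidal_pe(max_len=128, d_model=64)
print(pe.shape)  # torch.Size([128, 64])
```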
Absolute embeddings
Add a position vector to the embedding
- GPT 1/2/3, OPT $$Embed(x,i)=v_x+u_i$$
Relative embeddings
Add a vector to the attention computation
- T5, Gopher, Chinchilla $$e_{i,j}=\frac{x_iW^Q(x_jW^K+a_{ij}^K)^T}{\sqrt{d_z}}$$
RoPE embeddings
- GPT-J, PaLM, LLaMA, most 2024+ models (sketch below)
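A minimal RoPE sketch (PyTorch): each consecutive pair of query/key channels is rotated by a position-dependent angle, reusing the sinusoidal frequency schedule; in practice this is applied to q and k before the attention dot product.

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary embeddings to x of shape (..., seq_len, d_head), d_head even."""
    seq_len, d = x.shape[-2], x.shape[-1]
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)          # (seq, 1)
    freq = 1.0 / base ** (torch.arange(0, d, 2, dtype=torch.float32) / d)  # (d/2,)
    angle = pos * freq                                                     # (seq, d/2)
    cos, sin = torch.cos(angle), torch.sin(angle)
    x1, x2 = x[..., 0::2], x[..., 1::2]  # pair up channels (x1, x2)
    # Rotate each 2-D pair by its position-dependent angle, interleave back.
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(2, 16, 64)  # (batch, seq, d_head)
print(rope(q).shape)        # same shape; apply to q and k before attention
```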
Hyperparameters
Feedforward layer
Feedforward - model dimension ratio $$FFN(x)=\max(0,xW_1+b_1)W_2+b_2$$ There are two dimensions that are relevant (widening the FF layer increases capacity → more expressive power):
- feedforward dim $d_{ff}$
- model dim $d_{model}$
- default: $$d_{ff}=4d_{model}$$
- for GLU variants: $$d_{ff}=\frac{8}{3}d_{model}\approx 2.67d_{model}$$
- GLU variants scale $d_{ff}$ down by 2/3 relative to the $4d_{model}$ default (checked numerically below)
- the scale-down shrinks the FFN expansion dimension $d_{ff}$, not $d_{model}$
- reason: a GLU needs two input projections ($W_1$ and $V$); keeping $d_{ff}=4d_{model}$ would blow up the compute and parameter count
- Exception: T5 (Raffel et al. 2020, Google): for the 11B model they set $d_{ff}=65536$, $d_{model}=1024$, a 64x ratio
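A quick arithmetic check of the 2/3 rule (dims illustrative): a standard FFN has two $d_{model}\times d_{ff}$ matrices, a GLU variant has three, so matching parameter counts requires $d_{ff}=\frac{8}{3}d_{model}$.

```python
d_model = 1024
standard_ffn = 2 * d_model * (4 * d_model)    # W1, W2 with d_ff = 4*d_model
glu_ffn = 3 * d_model * int(8 * d_model / 3)  # W1, V, W2 with d_ff = (8/3)*d_model
print(standard_ffn, glu_ffn)  # 8388608 vs 8386560: equal up to rounding
```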
Head
For multi-head attention, the usual convention is $$d_{head}\cdot num_{heads}=d_{model}$$ but it is also possible to set the head dimension larger than $d_{model}/num_{heads}$.
Aspect ratios
$$d_{model}/n_{layer}$$ Most models use an aspect ratio of about 128, i.e. roughly 128 units of model dimension per layer (typical range 100-200)
Dropout and other regularization
- There is a dynamic trade-off between the optimizer and weight decay
- Dropout: randomly drop a fraction of units during training to add robustness
- Weight decay: apply a "shrinking force" to the weights during training so they do not grow too large -> prevents overfitting, improves generalization
Newer models (e.g. LLaMA, OPT, PaLM) therefore no longer rely on Dropout and mainly use weight decay for regularization, because:
- Large models are big enough to be robust to noise on their own
- Dropout hurts parallel efficiency and adds training instability
- Weight decay is more stable and better suited to large-scale pretraining. Weight decay = continuously shrinking the weight magnitudes during training (equivalent to L2 regularization); it prevents overfitting and improves generalization, and lives in the loss/optimizer (see the sketch below)
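A sketch of the usual PyTorch AdamW setup (the 0.1 decay value is illustrative): decay is applied to the 2-D weight matrices but typically not to norms or biases.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 64), nn.LayerNorm(64), nn.Linear(64, 64))

# Split parameters: 2-D weight matrices get decay; norms and biases do not.
decay = [p for p in model.parameters() if p.dim() >= 2]
no_decay = [p for p in model.parameters() if p.dim() < 2]

optimizer = torch.optim.AdamW(
    [{"params": decay, "weight_decay": 0.1},   # shrinks weights each step
     {"params": no_decay, "weight_decay": 0.0}],
    lr=3e-4,
)
```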
Stability tricks
Beware of Softmax!
softmax can be ill-behaved due to exponentials / division by zero $$softmax(z_i)=\frac{e^{z_i}}{\sum_{j=1}^k e^{z_j}}$$ Problems:
- exponentiation: numerical overflow/underflow
- the denominator can underflow to 0 (the standard max-subtraction fix is sketched below)
Where softmax appears in the transformer:
- before the output (the final softmax over the vocabulary)
- inside self-attention
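The standard fix is to subtract the row max before exponentiating; a sketch (the shift cancels in the ratio, so the output is unchanged, but the exponent can no longer overflow and the denominator is at least 1):

```python
import torch

def stable_softmax(z: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # exp(z - max) <= 1, so no overflow; the max entry contributes exp(0) = 1,
    # so the denominator is always >= 1 and never divides by zero.
    shifted = z - z.max(dim=dim, keepdim=True).values
    e = torch.exp(shifted)
    return e / e.sum(dim=dim, keepdim=True)

z = torch.tensor([[1000.0, 1001.0, 1002.0]])  # naive exp(1000) would overflow
print(stable_softmax(z))                      # tensor([[0.0900, 0.2447, 0.6652]])
```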
auxiliary loss / z-loss
Using a z-loss improves training stability: the auxiliary term penalizes $\log^2 Z$, the squared log of the softmax normalizer, which keeps the normalizer near 1 and stops the output logits from drifting to large magnitudes (PaLM uses a coefficient of $10^{-4}$).
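A minimal sketch of the z-loss term as described in the PaLM report (the $10^{-4}$ coefficient follows that report; the vocab size here is illustrative):

```python
import torch
import torch.nn.functional as F

def loss_with_z_loss(logits: torch.Tensor, targets: torch.Tensor,
                     z_coef: float = 1e-4) -> torch.Tensor:
    # log Z = logsumexp over the vocabulary; the auxiliary term pushes log Z
    # toward 0, keeping the softmax normalizer (and hence the logits) bounded.
    ce = F.cross_entropy(logits, targets)
    log_z = torch.logsumexp(logits, dim=-1)
    return ce + z_coef * (log_z ** 2).mean()

logits = torch.randn(4, 32000)            # (batch, vocab)
targets = torch.randint(0, 32000, (4,))
print(loss_with_z_loss(logits, targets))
```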