LLM Transformer Architecture
Transformer
- Position embedding: sines and cosines $$PE_{(pos,2i)}=\sin(pos/10000^{2i/d_{model}})$$ $$PE_{(pos,2i+1)}=\cos(pos/10000^{2i/d_{model}})$$
- FFN: ReLU $$FFN(x)=\max(0,xW_1+b_1)W_2+b_2$$
- Norm type: post-norm, LayerNorm
Transformer in Assignment 1
Differences:
- LayerNorm is in front of the block
- Rotary position embeddings (RoPE)
- FF layers use SwiGLU, not ReLU
- Linear layers (and layernorm) have no bias (constant) terms
Transformer variants
- Norm: LayerNorm, RMSNorm
- Parallel layer: Serial, Parallel
- Pre-norm/Post-norm
- Position embedding: sine, absolute, relative, RoPE, ALiBi
- Activations: ReLU, GeLU, GeGLU, SwiGLU
Common architecture variations
- residual connections
- layer norms
- gating
Norm
- Basically everyone does pre-norm
- Most people do RMSNorm
Removing bias terms improves stability
Original transformer: $FFN(x)=\max(0,xW_1+b_1)W_2+b_2$. Most implementations: $FFN(x)=\sigma(xW_1)W_2$. Difference: the bias terms are dropped. Result: improved model stability (similar in spirit to RMSNorm).
RMSNorm vs LayerNorm
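A minimal sketch of the difference (PyTorch; the RMSNorm follows Zhang & Sennrich 2019: no mean subtraction and no bias, while LayerNorm subtracts the mean and, by default, adds a bias):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Rescale by the root-mean-square only: no mean subtraction, no bias."""
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))  # learnable gain only

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)

x = torch.randn(2, 8, 64)
# LayerNorm additionally subtracts the mean and adds a bias term.
print(RMSNorm(64)(x).shape, nn.LayerNorm(64)(x).shape)
```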
Post-norm or Pre-norm
- Gray arrows (in the figure): the residual connections, i.e. the residual stream
- Post-Norm (left) -> Pre-Norm (right): moving the LayerNorm off the residual path, to just before the FFN and MHA sublayers, turns post-norm into pre-norm
- The industry consensus is to use pre-norm (or double norm)
- BERT still uses post-norm
- Pre-norm is the more stable architecture and is less prone to exploding gradients
- Post-Norm: less stable; usually requires learning-rate warmup (see the sketch after this list)
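A sketch of the two orderings for one sublayer (PyTorch; the `nn.Linear` here is just a stand-in for MHA or the FFN):

```python
import torch
import torch.nn as nn

d = 64
norm, sublayer = nn.LayerNorm(d), nn.Linear(d, d)  # stand-in for MHA / FFN
x = torch.randn(2, 8, d)

# Post-norm (original Transformer, BERT): the norm sits ON the residual path.
post = norm(x + sublayer(x))

# Pre-norm (GPT-2 and most modern LLMs): the norm is applied inside the branch,
# so the residual stream itself is never normalized -> more stable gradients.
pre = x + sublayer(norm(x))
```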
DoubleNorm
Starting from pre-norm, add a LayerNorm both before and after each sublayer (right figure)
Activations, FFN
ReLU, GELU, Swish, ELU, GLU, GeGLU, ReGLU, SELU, SwiGLU, LiGLU
Gated activations (xxGLU)
GLUs modify the first part of a FF layer: $FF(x)=\max(0,xW_1)W_2$
- Change: linear + ReLU -> augment with an entrywise linear term
- An extra learnable parameter matrix $V$ is added: $$\max(0, xW_1) \;\rightarrow\; \max(0, xW_1) \otimes (xV)$$ The FF layer becomes: $$FF_{ReGLU}(x)=(\max(0,xW_1)\otimes xV)W_2$$
How should the dimension of the new parameter $V$ be set? Shrink $d_{ff}$, the output dimension of $W_1$ and $V$, to 2/3 of the usual value, so that the three matrices $W_1$, $V$, $W_2$ together hold the same parameter count as the original two (see the SwiGLU sketch below).
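A sketch of a bias-free SwiGLU FF layer with the 2/3 width rule applied (PyTorch; dims illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        # Shrink d_ff by 2/3 relative to the usual 4*d_model, so the three
        # matrices (W1, V, W2) hold as many parameters as two 4x matrices.
        d_ff = int(8 * d_model / 3)
        self.w1 = nn.Linear(d_model, d_ff, bias=False)  # gated branch
        self.v = nn.Linear(d_model, d_ff, bias=False)   # entrywise linear term
        self.w2 = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: Swish(xW1) ⊗ (xV), then project back down with W2.
        return self.w2(F.silu(self.w1(x)) * self.v(x))

x = torch.randn(2, 8, 512)
print(SwiGLUFFN(512)(x).shape)  # torch.Size([2, 8, 512])
```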
Serial vs Parallel layers
Serial: outputs come in from the bottom
- Problem: the serial structure limits parallel computation. Parallel: compute the MLP and attention branches in parallel and add both into the residual stream at the end (see the sketch below)
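A sketch of serial vs. parallel blocks (PyTorch; `attn` and `mlp` are stand-ins for the real sublayers):

```python
import torch
import torch.nn as nn

d = 64
norm1, norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
attn, mlp = nn.Linear(d, d), nn.Linear(d, d)  # stand-ins for MHA and the FFN
x = torch.randn(2, 8, d)

# Serial (standard pre-norm): the MLP must wait for the attention output.
y = x + attn(norm1(x))
y = y + mlp(norm2(y))

# Parallel (GPT-J/PaLM style): both branches read the same input and are
# summed into the residual, so their matmuls can run (and fuse) concurrently.
y_par = x + attn(norm1(x)) + mlp(norm2(x))
```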
Variants in Position embedding
Sine embeddings
Add sines and cosines that enable localization
- original transformer $$Embed(x,i)=v_x+PE_{pos}$$ $$PE_{pos,2i}=\sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$ $$PE_{pos,2i+1}=\cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$ Q: Why is the base 10000? A: 10000 is an empirical value; it sets the wavelength of the highest dimensions, i.e. the longest sequence length the encoding can represent. The sinusoids oscillate at high frequency in the low dimensions of $d_{model}$ and at low frequency in the high dimensions. If the base is too small, the positional signal in the high dimensions may repeat within a sequence; if it is too large, the high dimensions barely change and carry no usable positional information. (A sketch that materializes the table follows below.)
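A small sketch that materializes the sinusoidal table from the formulas above (PyTorch; `max_len` and `d_model` are illustrative):

```python
import torch

def sinusoidal_pe(max_len: int, d_model: int, base: float = 10000.0) -> torch.Tensor:
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)           # even dims
    freq = 1.0 / base ** (i / d_model)                             # 1/10000^(2i/d)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * freq)  # high frequency at low dims
    pe[:, 1::2] = torch.cos(pos * freq)  # low frequency at high dims
    return pe

pe = sinusoidal_pe(max_len=128, d_model=64)
print(pe.shape)  # torch.Size([128, 64])
```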
Absolute embeddings
Add a position vector to the embedding
- GPT 1/2/3, OPT $$Embed(x,i)=v_x+u_i$$
Relative embeddings
Add a vector to the attention computation
- T5, Gopher, Chinchilla $$e_{i,j}=\frac{x_iW^Q(x_jW^K+a_{ij}^K)^T}{\sqrt{d_z}}$$
RoPE embeddings
- GPT-J, PaLM, LLaMA, most 2024+ models (sketch below)
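A minimal RoPE sketch (PyTorch): each consecutive pair of query/key channels is rotated by a position-dependent angle, reusing the sinusoidal frequency schedule; in practice this is applied to q and k before the attention dot product.

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary embeddings to x of shape (..., seq_len, d_head), d_head even."""
    seq_len, d = x.shape[-2], x.shape[-1]
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)          # (seq, 1)
    freq = 1.0 / base ** (torch.arange(0, d, 2, dtype=torch.float32) / d)  # (d/2,)
    angle = pos * freq                                                     # (seq, d/2)
    cos, sin = torch.cos(angle), torch.sin(angle)
    x1, x2 = x[..., 0::2], x[..., 1::2]  # pair up channels (x1, x2)
    # Rotate each 2-D pair by its position-dependent angle, interleave back.
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(2, 16, 64)  # (batch, seq, d_head)
print(rope(q).shape)        # same shape; apply to q and k before attention
```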
Hyperparameters
Feedforward layer
Feedforward - model dimension ratio $$FFN(x)=\max(0,xW_1+b_1)W_2+b_2$$ There are two dimensions that are relevant (widening the FF layer increases capacity → more expressive power):
- feedforward dim $d_{ff}$
- model dim $d_{model}$
- default: $$d_{ff}=4d_{model}$$
- for GLU variants: $$d_{ff}=\frac{8}{3}d_{model}\approx 2.67d_{model}$$
- GLU variants scale $d_{ff}$ down by 2/3 relative to the $4d_{model}$ default (checked numerically below)
- the scale-down shrinks the FFN expansion dimension $d_{ff}$, not $d_{model}$
- reason: a GLU needs two input projections ($W_1$ and $V$); keeping $d_{ff}=4d_{model}$ would blow up the compute and parameter count
- Exception: T5 (Raffel et al. 2020, Google): for the 11B model they set $d_{ff}=65536$, $d_{model}=1024$, a 64x ratio
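A quick arithmetic check of the 2/3 rule (dims illustrative): a standard FFN has two $d_{model}\times d_{ff}$ matrices, a GLU variant has three, so matching parameter counts requires $d_{ff}=\frac{8}{3}d_{model}$.

```python
d_model = 1024
standard_ffn = 2 * d_model * (4 * d_model)    # W1, W2 with d_ff = 4*d_model
glu_ffn = 3 * d_model * int(8 * d_model / 3)  # W1, V, W2 with d_ff = (8/3)*d_model
print(standard_ffn, glu_ffn)  # 8388608 vs 8386560: equal up to rounding
```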
Head
For multi-head attention, the usual convention is $$d_{head}\cdot num_{heads}=d_{model}$$ but it is also possible to set the head dimension larger than $d_{model}/num_{heads}$.
Aspect ratios
$$d_{model}/n_{layer}$$ Most models use an aspect ratio of about 128, i.e. roughly 128 units of model dimension per layer (typical range 100-200)
Dropout and other regularization
- There is a dynamic trade-off between the optimizer and weight decay
- Dropout: randomly drop a fraction of units during training to add robustness
- Weight decay: apply a "shrinking force" to the weights during training so they do not grow too large -> prevents overfitting, improves generalization
Newer models (e.g. LLaMA, OPT, PaLM) therefore no longer rely on Dropout and mainly use weight decay for regularization, because:
- Large models are big enough to be robust to noise on their own
- Dropout hurts parallel efficiency and adds training instability
- Weight decay is more stable and better suited to large-scale pretraining. Weight decay = continuously shrinking the weight magnitudes during training (equivalent to L2 regularization); it prevents overfitting and improves generalization, and lives in the loss/optimizer (see the sketch below)
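A sketch of the usual PyTorch AdamW setup (the 0.1 decay value is illustrative): decay is applied to the 2-D weight matrices but typically not to norms or biases.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 64), nn.LayerNorm(64), nn.Linear(64, 64))

# Split parameters: 2-D weight matrices get decay; norms and biases do not.
decay = [p for p in model.parameters() if p.dim() >= 2]
no_decay = [p for p in model.parameters() if p.dim() < 2]

optimizer = torch.optim.AdamW(
    [{"params": decay, "weight_decay": 0.1},   # shrinks weights each step
     {"params": no_decay, "weight_decay": 0.0}],
    lr=3e-4,
)
```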
Stability tricks
Beware of Softmax!
softmax can be ill-behaved due to exponentials / division by zero $$softmax(z_i)=\frac{e^{z_i}}{\sum_{j=1}^k e^{z_j}}$$ Problems:
- exponentiation: numerical overflow/underflow
- the denominator can underflow to 0 (the standard max-subtraction fix is sketched below)
Where softmax appears in the transformer:
- before the output (the final softmax over the vocabulary)
- inside self-attention
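The standard fix is to subtract the row max before exponentiating; a sketch (the shift cancels in the ratio, so the output is unchanged, but the exponent can no longer overflow and the denominator is at least 1):

```python
import torch

def stable_softmax(z: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # exp(z - max) <= 1, so no overflow; the max entry contributes exp(0) = 1,
    # so the denominator is always >= 1 and never divides by zero.
    shifted = z - z.max(dim=dim, keepdim=True).values
    e = torch.exp(shifted)
    return e / e.sum(dim=dim, keepdim=True)

z = torch.tensor([[1000.0, 1001.0, 1002.0]])  # naive exp(1000) would overflow
print(stable_softmax(z))                      # tensor([[0.0900, 0.2447, 0.6652]])
```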
auxiliary loss / z-loss
Using a z-loss improves training stability: the auxiliary term penalizes $\log^2 Z$, the squared log of the softmax normalizer, which keeps the normalizer near 1 and stops the output logits from drifting to large magnitudes (PaLM uses a coefficient of $10^{-4}$).
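A minimal sketch of the z-loss term as described in the PaLM report (the $10^{-4}$ coefficient follows that report; the vocab size here is illustrative):

```python
import torch
import torch.nn.functional as F

def loss_with_z_loss(logits: torch.Tensor, targets: torch.Tensor,
                     z_coef: float = 1e-4) -> torch.Tensor:
    # log Z = logsumexp over the vocabulary; the auxiliary term pushes log Z
    # toward 0, keeping the softmax normalizer (and hence the logits) bounded.
    ce = F.cross_entropy(logits, targets)
    log_z = torch.logsumexp(logits, dim=-1)
    return ce + z_coef * (log_z ** 2).mean()

logits = torch.randn(4, 32000)            # (batch, vocab)
targets = torch.randint(0, 32000, (4,))
print(loss_with_z_loss(logits, targets))
```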