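LLM Fundamentals: Transformers and the Attention Mechanism

Understanding the Transformer architecture is the foundation for working with large language models such as GPT, LLaMA, and Claude.

Overall Transformer Architecture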

Input: "What is machine learning"
     ↓
Token embedding + positional encoding
     ↓
┌─────────────────────────────────────┐
│        Encoder (BERT-style)         │
│  ├ Multi-Head Self-Attention        │
│  └ Feed Forward Network             │
└─────────────────────────────────────┘
     ↓
┌─────────────────────────────────────┐
│        Decoder (GPT-style)          │
│  ├ Masked Self-Attention            │
│  ├ Cross-Attention (optional)       │
│  └ Feed Forward Network             │
└─────────────────────────────────────┘
     ↓ (×N layers)
     ↓
Output distribution → "Machine learning is..."

# GPT is purely decoder-only: no encoder, and therefore no cross-attention
# LLaMA is decoder-only + RMSNorm + RoPE

Self-Attention

Core Formula

Attention(Q, K, V) = softmax(QK^T / √d_k) × V

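Q (Query): what the current position is looking for
K (Key): the features each position offers to be matched against
V (Value): the content returned for each position, weighted by how well its key matches
√d_k: scaling factor that keeps the dot products from growing too large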
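
To make the formula concrete, here is a minimal dependency-free sketch of scaled dot-product attention on plain Python lists. The helper names (`softmax`, `attention`) and the toy inputs are illustrative assumptions, not part of any library.

```python
import math

def softmax(row):
    # Subtract the max for numerical stability before exponentiating
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on lists of row vectors."""
    d_k = len(K[0])
    out, weights = [], []
    for q in Q:
        # One score per key: dot(q, k) / sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Output row = weighted sum of the value vectors
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return out, weights

# Hypothetical toy inputs: 2 queries, 3 key/value pairs, d_k = 2
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
out, weights = attention(Q, K, V)
```

Each row of `weights` sums to 1, and the first query puts more weight on the first and third keys (which it overlaps with) than on the second.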

Code Implementation

python
import torch
import torch.nn as nn
import math

class SelfAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        
        # Projection matrices for Q, K, V, plus the output projection W_o
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
    
    def forward(self, x):
        # x: [batch, seq_len, d_model]
        B, L, D = x.shape
        
        # Project to Q, K, V and split into heads: [batch, n_heads, seq_len, d_k]
        Q = self.W_q(x).view(B, L, self.n_heads, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(B, L, self.n_heads, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(B, L, self.n_heads, self.d_k).transpose(1, 2)
        
        # Attention scores: [batch, n_heads, seq_len, seq_len]
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        attn_weights = torch.softmax(scores, dim=-1)
        
        # Weight the values by the attention distribution
        attn_output = torch.matmul(attn_weights, V)
        
        # Merge the heads back into [batch, seq_len, d_model]
        attn_output = attn_output.transpose(1, 2).contiguous().view(B, L, D)
        return self.W_o(attn_output)

Multi-Head Attention

python
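class MultiHeadAttention(nn.Module):
    """Explicit per-head form; same function family as the fused SelfAttention above."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.d_k = d_model // n_heads
        # Each head owns a fused Q/K/V projection: d_model -> 3 * d_k
        self.heads = nn.ModuleList([
            nn.Linear(d_model, 3 * self.d_k) for _ in range(n_heads)
        ])
        self.proj = nn.Linear(d_model, d_model)
    
    def forward(self, x):
        # Each head computes attention independently in its own d_k-dim subspace
        head_outputs = []
        for head in self.heads:
            q, k, v = head(x).chunk(3, dim=-1)
            scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
            head_outputs.append(torch.matmul(torch.softmax(scores, dim=-1), v))
        # Concatenate the heads (n_heads * d_k = d_model), then project
        return self.proj(torch.cat(head_outputs, dim=-1))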

Positional Encoding

Original Transformer: Sinusoidal Positional Encoding

python
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)  # [1, max_len, d_model]
        self.register_buffer('pe', pe)
    
    def forward(self, x):
        return x + self.pe[:, :x.size(1)]
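
As a cross-check on the vectorized code above, the same encoding can be written for a single position in plain Python. This is a minimal sketch; the function name `sinusoidal_pe` is an illustrative assumption, and `d_model` is assumed to be even.

```python
import math

def sinusoidal_pe(pos, d_model):
    """Return the d_model-dimensional encoding for one position (d_model even)."""
    pe = []
    for i in range(0, d_model, 2):
        # Same frequency as div_term above: 10000^(-i / d_model)
        angle = pos / (10000.0 ** (i / d_model))
        pe.append(math.sin(angle))  # even index
        pe.append(math.cos(angle))  # odd index
    return pe

pe0 = sinusoidal_pe(0, 4)  # position 0: every sin is 0, every cos is 1
pe1 = sinusoidal_pe(1, 4)  # position 1: fast first pair, slow second pair
```

Early dimension pairs oscillate quickly with position while later pairs change slowly, so each position gets a distinct pattern across the vector.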

RoPE (Rotary Position Embedding), used by LLaMA

python
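def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0):
    # One rotation frequency per pair of dimensions: theta^(-2i / dim)
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    t = torch.arange(end, dtype=torch.float32)  # positions 0 .. end-1
    freqs = torch.outer(t, freqs)               # [end, dim // 2] rotation angles
    # Unit-magnitude complex numbers e^(i * angle)
    return torch.polar(torch.ones_like(freqs), freqs)

def apply_rotary_emb(x, freqs_cis):
    # x: [batch, seq_len, n_heads, head_dim]; freqs_cis: [>= seq_len, head_dim // 2]
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    # Reshape to [1, seq_len, 1, head_dim // 2] so it broadcasts over batch and heads
    freqs_cis = freqs_cis[: x.shape[1]].view(1, x.shape[1], 1, -1)
    x_rotated = x_complex * freqs_cis
    return torch.view_as_real(x_rotated).flatten(-2).type_as(x)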
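
The reason rotating Q and K works as a position encoding is that the attention score between two rotated vectors depends only on their relative offset. The sketch below checks this numerically with Python's built-in complex numbers; `rotate`, `score`, and the sample vectors are hypothetical names and values chosen for illustration.

```python
import cmath

def rotate(vec, pos, theta=10000.0):
    """Rotate each complex pair vec[j] by the angle pos * theta^(-2j / dim)."""
    dim = 2 * len(vec)  # each complex number stands for two real dimensions
    return [z * cmath.exp(1j * pos * theta ** (-2 * j / dim)) for j, z in enumerate(vec)]

def score(q, k):
    # Dot product of the underlying real vectors: Re(sum of q_j * conj(k_j))
    return sum((a * b.conjugate()).real for a, b in zip(q, k))

# Hypothetical query and key with head_dim = 4, stored as 2 complex numbers each
q = [complex(1.0, 2.0), complex(-0.5, 0.3)]
k = [complex(0.7, -1.0), complex(2.0, 0.1)]

s_a = score(rotate(q, 5), rotate(k, 3))     # positions 5 and 3, offset 2
s_b = score(rotate(q, 12), rotate(k, 10))   # positions 12 and 10, offset 2
s_same = score(rotate(q, 7), rotate(k, 7))  # offset 0
```

Here `s_a` and `s_b` agree up to floating-point error because both pairs are two positions apart, and `s_same` matches the score of the unrotated vectors.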

A Complete Transformer Block

python
class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, ff_dim):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, n_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, ff_dim),
            nn.GELU(),
            nn.Linear(ff_dim, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)
    
    def forward(self, x):
        # Pre-norm residual connections: normalize, transform, then add back to the input
        x = x + self.attention(self.norm1(x))
        x = x + self.ffn(self.norm2(x))
        return x

GPT-Family Architecture Comparison

Model	Parameters	Architecture	Positional encoding	Activation
GPT-1	117M	Decoder-only	Learned	GELU
GPT-2	1.5B	Decoder-only	Learned	GELU
GPT-3	175B	Decoder-only	Learned	GELU
LLaMA	7B-65B	Decoder-only	RoPE	SwiGLU
LLaMA2	7B-70B	Decoder-only	RoPE	SwiGLU
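
The parameter counts in the table can be roughly reproduced from a model's shape. The sketch below is a back-of-the-envelope estimate for GPT-style models: it counts only weight matrices, assumes the feed-forward width is 4 × d_model, and plugs in the GPT-3 175B configuration (96 layers, d_model 12288, vocabulary 50257), none of which appears in the table itself. It does not apply directly to LLaMA, whose SwiGLU feed-forward layer uses three matrices.

```python
def approx_block_params(d_model, ff_dim):
    """Weights only (biases and norms ignored) for one decoder block."""
    attention = 4 * d_model * d_model   # W_q, W_k, W_v, W_o
    ffn = 2 * d_model * ff_dim          # the two feed-forward linear layers
    return attention + ffn

def approx_model_params(n_layers, d_model, vocab_size):
    # Assumes ff_dim = 4 * d_model, the GPT-style convention
    blocks = n_layers * approx_block_params(d_model, 4 * d_model)
    embeddings = vocab_size * d_model
    return blocks + embeddings

# Assumed GPT-3 175B configuration: 96 layers, d_model 12288, vocab 50257
total = approx_model_params(96, 12288, 50257)
print(f"{total / 1e9:.1f}B")
```

The estimate lands within about one percent of the 175B figure, which shows that almost all parameters sit in the attention and feed-forward matrices (about 12 × d_model² per layer).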

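Common Interview Questions

Q1: What advantages does the Transformer have over an RNN?

Parallel computation: an RNN must process tokens one after another, while a Transformer processes all positions of a sequence in parallel during training
Long-range dependencies: RNN gradients decay badly over long spans, whereas attention connects any two positions directly
Interpretability: attention weights give a direct, if rough, view of which tokens relate to which
Training stability: gradients do not have to pass through a long recurrent chain, and residual connections plus normalization help keep deep stacks trainable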

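Q2: Why does attention need the scaling factor √d_k?

When d_k is large, the dot products in QK^T grow large in magnitude: if the components of q and k are independent with zero mean and unit variance, q·k has variance d_k
Large scores push softmax into its saturated region, where gradients become very small
Dividing by √d_k brings the variance of the scores back to about 1, whatever the value of d_k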
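
The variance argument is easy to check empirically. The following is a small sketch that samples random vectors with standard normal components and compares the variance of raw and scaled dot products; the sample size, seed, and `d_k = 64` are arbitrary choices made for illustration.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def dot_products(d_k, n_samples=2000):
    """Dot products of random vectors with i.i.d. standard normal components."""
    result = []
    for _ in range(n_samples):
        q = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        result.append(sum(a * b for a, b in zip(q, k)))
    return result

d_k = 64
raw = dot_products(d_k)
scaled = [s / math.sqrt(d_k) for s in raw]

var_raw = statistics.pvariance(raw)        # close to d_k
var_scaled = statistics.pvariance(scaled)  # close to 1
```

With unscaled scores a standard deviation of about 8 is typical at `d_k = 64`, which is enough for softmax to put nearly all its weight on a single key.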

Q3: Why does the decoder use masked attention?

python
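import torch
# GPT-style masked self-attention: each position sees only itself and earlier
# positions (causal mask), so it cannot peek at the tokens it is trained to predict
L = 4                                          # example sequence length
scores = torch.randn(L, L)                     # stand-in for QK^T / sqrt(d_k)
mask = torch.tril(torch.ones(L, L, dtype=torch.bool))
scores = scores.masked_fill(~mask, float('-inf'))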

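Q4: LayerNorm vs BatchNorm?

Aspect	LayerNorm	BatchNorm
Normalizes over	The features of each individual sample	The batch dimension, separately for each feature
Typical use	NLP (variable-length sequences)	CV (fixed-size inputs)
Training vs inference	Same behavior	Different behavior (batch statistics in training, running averages at inference)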
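
The difference in normalization axis is easiest to see on a tiny matrix. This is a simplified sketch that leaves out the learnable scale and shift and BatchNorm's running averages; the `normalize` helper and the sample activations are hypothetical.

```python
import statistics

def normalize(values, eps=1e-5):
    """Shift to zero mean and scale to (approximately) unit variance."""
    mean = statistics.fmean(values)
    var = statistics.pvariance(values)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]

# Hypothetical activations: 3 samples (rows) x 4 features (columns)
batch = [
    [1.0, 2.0, 3.0, 4.0],
    [10.0, 20.0, 30.0, 40.0],
    [-1.0, 0.0, 1.0, 2.0],
]

# LayerNorm: statistics over the features of each sample (per row)
layer_normed = [normalize(row) for row in batch]

# BatchNorm in training mode: statistics over the batch for each feature (per column)
columns = [normalize(list(col)) for col in zip(*batch)]
batch_normed = [list(row) for row in zip(*columns)]
```

Under LayerNorm the first two rows come out almost identical, since a sample's result does not depend on the other samples or on its own scale. Under BatchNorm every output depends on which other samples share the batch, which is why its training and inference behavior differ.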

Q5: What advantages does GELU have over ReLU?

python
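import math
import torch
# GELU (Gaussian Error Linear Unit): gelu(x) = x * Φ(x), where Φ is the standard normal CDF
def gelu(x):
    return 0.5 * x * (1.0 + torch.erf(x / math.sqrt(2.0)))
# Smoother than ReLU, and its gradient is nonzero for negative inputs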
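
To put numbers on the comparison, the sketch below evaluates ReLU, GELU, and GELU's derivative at a few points using only the standard library. The function names are illustrative, and the derivative follows from differentiating x * Φ(x) by the product rule.

```python
import math

def relu(x):
    return max(0.0, x)

def gelu(x):
    # x * Phi(x), with the standard normal CDF written via the error function
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_grad(x):
    # d/dx [x * Phi(x)] = Phi(x) + x * phi(x), where phi is the standard normal PDF
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf

for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):.4f}  gelu'={gelu_grad(x):.4f}")
```

For negative inputs ReLU outputs exactly zero with zero gradient, while GELU returns a small negative value with a nonzero gradient, so those units can still receive a learning signal.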
