CS336 Lecture 10 - Inference

2026-04-08 Source course

CS336 Lecture 10: Inference

Source: Stanford CS336 "Language Modeling from Scratch", Spring 2025, Lecture 10, taught by Percy Liang
Original file: https://github.com/stanford-cs336/spring2025-lectures/blob/main/lecture_10.py
Summary date: 2026-04-08

Core content

This lecture systematically analyzes the efficiency of LLM inference, covering the full chain from hardware-level arithmetic intensity up to request scheduling systems.

Main sections

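Arithmetic intensity analysis of inference: derives why the decode phase is memory-bound; the H100 crossover point is about 295 FLOP/byte
Reducing the KV cache (lossy): four approaches, GQA, MLA, CLA, and local attention
Quantization and pruning: LLM.int8(), AWQ, and NVIDIA's pruning + distillation pipeline
Speculative decoding (lossless): a draft model proposes tokens and the target model verifies them in parallel; the acceptance rule provably yields exact samples from the target distribution
Dynamic request scheduling: continuous batching (Orca), PagedAttention (vLLM)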
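
The exactness claim for speculative decoding can be illustrated with a minimal single-token sketch. It assumes the standard accept/reject rule (accept a drafted token with probability min(1, target/draft), otherwise resample from the renormalized residual max(target - draft, 0)). The function names and the toy three-token vocabulary are hypothetical and are not taken from the lecture code; a real system drafts several tokens and verifies them in one target-model pass.

```python
import random


def speculative_sample(draft_probs, target_probs, rng):
    """Sample one token using the draft distribution plus accept/reject correction."""
    vocab = list(range(len(draft_probs)))
    # The draft model proposes a token from its own distribution.
    proposal = rng.choices(vocab, weights=draft_probs)[0]
    # Accept the proposal with probability min(1, target / draft).
    accept_prob = min(1.0, target_probs[proposal] / draft_probs[proposal])
    if rng.random() < accept_prob:
        return proposal
    # On rejection, resample from max(target - draft, 0); choices() renormalizes the weights.
    residual = [max(t - d, 0.0) for d, t in zip(draft_probs, target_probs)]
    return rng.choices(vocab, weights=residual)[0]


def exact_output_distribution(draft_probs, target_probs):
    """Analytic distribution of speculative_sample's output (toy check of exactness)."""
    # Mass that arrives through an accepted proposal: draft * min(1, target/draft).
    accepted = [min(d, t) for d, t in zip(draft_probs, target_probs)]
    reject_mass = 1.0 - sum(accepted)
    residual = [max(t - d, 0.0) for d, t in zip(draft_probs, target_probs)]
    residual_total = sum(residual)
    if residual_total == 0.0:
        # Draft and target are identical, so nothing is ever rejected.
        return accepted
    return [a + reject_mass * r / residual_total for a, r in zip(accepted, residual)]


if __name__ == "__main__":
    draft = [0.5, 0.3, 0.2]   # hypothetical draft-model probabilities
    target = [0.2, 0.3, 0.5]  # hypothetical target-model probabilities
    print(exact_output_distribution(draft, target))
```

The analytic distribution matches the target because the accepted mass min(draft, target) plus the residual max(target - draft, 0) adds up to the target probability for every token.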

Key figures

H100 arithmetic intensity crossover: 295 FLOP/byte
Decode-phase MLP arithmetic intensity ≈ B (batch size)
Decode-phase attention arithmetic intensity < 1 (batching does not help)
MLA compression ratio: ~57x (DeepSeek v2, 32768 dims → 576 dims)
Typical speculative decoding setup: 70B target model + 8B draft model, ~2-3x speedup
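
The figures above can be reproduced with a short back-of-the-envelope sketch. It assumes H100 peak numbers of 989 TFLOP/s (dense bf16) and 3.35 TB/s of HBM bandwidth, a single weight matmul as a stand-in for the MLP, and the simplified attention formula S*T/(S+T) with T = 1 for decode. The function names and the example layer sizes are hypothetical, not taken from the lecture code.

```python
# Assumed H100 peak figures (dense bf16 compute and HBM bandwidth), not measurements.
H100_FLOPS_PER_SEC = 989e12
H100_BYTES_PER_SEC = 3.35e12


def accelerator_intensity(flops_per_sec, bytes_per_sec):
    """FLOPs the chip can execute per byte it can read from memory."""
    return flops_per_sec / bytes_per_sec


def mlp_decode_intensity(batch, d_model, d_ff, bytes_per_value=2):
    """Intensity of one (B x D) @ (D x F) matmul during decode (one new token per sequence)."""
    flops = 2 * batch * d_model * d_ff
    # Read the activations and the weights, write the output, all in 2-byte values.
    bytes_moved = bytes_per_value * (batch * d_model + d_model * d_ff + batch * d_ff)
    return flops / bytes_moved


def attention_decode_intensity(seq_len):
    """Simplified S*T / (S + T) with T = 1: one new token attends to S cached tokens."""
    # The batch size cancels out because every sequence reads its own KV cache.
    return seq_len / (seq_len + 1)


def kv_compression_ratio(full_dims, compressed_dims):
    """Ratio of per-token KV dimensions before and after compression."""
    return full_dims / compressed_dims


if __name__ == "__main__":
    print(accelerator_intensity(H100_FLOPS_PER_SEC, H100_BYTES_PER_SEC))
    print(mlp_decode_intensity(batch=64, d_model=8192, d_ff=28672))
    print(attention_decode_intensity(seq_len=4096))
    print(kv_compression_ratio(32768, 576))
```

Under these assumptions the MLP intensity stays close to B as long as B is much smaller than the layer widths, so the MLP only reaches the H100 crossover at batch sizes in the hundreds, while attention stays below 1 at any batch size.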

Related concepts

Memory bandwidth bottleneck
KV Cache
GQA
MLA
Speculative decoding
Batching
Quantization
prefill-decode