vLLM Code Structure


Modules

Entrypoint (LLM, API Server): the entry point
Engine: the engine
Scheduler (path: `vllm/core/scheduler.py`)
KV cache manager
- [Paged Attention]
Evictor
- Prefix caching <-- (What if the prefix doesn't match? [CacheBlend])
- What if the prefix cache lives on another machine? KV cache sharing across nodes.
- KV cache optimization: [DeepSeek (MLA)]

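My impression is that prefix caching could be combined with longcat's cache reuse. In particular, when the prefix does not match exactly, the CacheBlend paper offers ideas worth borrowing.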
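To make the "what if the prefix doesn't match" question concrete, here is a minimal sketch of block-level prefix caching. It is not vLLM's implementation: `BLOCK_SIZE`, `block_keys`, and `PrefixCache` are hypothetical names, and the only assumption is that each KV block is keyed by a hash that chains in the key of the block before it.

```python
from typing import Dict, List

BLOCK_SIZE = 4  # tokens per KV block (hypothetical; kept small for the demo)


def block_keys(token_ids: List[int]) -> List[int]:
    """Return one key per full block of tokens.

    Each key hashes the previous key together with the block's tokens,
    so a block can only match if everything before it matched as well.
    """
    keys = []
    parent = None
    for start in range(0, len(token_ids) - BLOCK_SIZE + 1, BLOCK_SIZE):
        chunk = tuple(token_ids[start:start + BLOCK_SIZE])
        parent = hash((parent, chunk))
        keys.append(parent)
    return keys


class PrefixCache:
    """Toy cache mapping chained block keys to physical block ids."""

    def __init__(self) -> None:
        self.blocks: Dict[int, int] = {}
        self.next_block_id = 0

    def insert(self, token_ids: List[int]) -> None:
        """Register every full block of a finished sequence."""
        for key in block_keys(token_ids):
            if key not in self.blocks:
                self.blocks[key] = self.next_block_id
                self.next_block_id += 1

    def match(self, token_ids: List[int]) -> int:
        """Return how many leading tokens can reuse cached KV blocks."""
        matched = 0
        for key in block_keys(token_ids):
            if key not in self.blocks:
                break
            matched += BLOCK_SIZE
        return matched


cache = PrefixCache()
cache.insert([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(cache.match([1, 2, 3, 4, 5, 6, 7, 8, 10, 11]))  # 8: both full blocks reused
print(cache.match([1, 2, 3, 4, 9, 9, 9, 9]))          # 4: only the first block
print(cache.match([0, 2, 3, 4, 5, 6, 7, 8]))          # 0: first token differs
```

The last call shows the limitation: one differing token at the start invalidates every later block, even though most tokens are identical. That is the non-matching-prefix case the notes point to CacheBlend for.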

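Worker: interacts with the specific hardware
Model executor (Model runner):
- llama.py is very high quality: a good place to learn the basic architecture of the Transformer and of the LLaMA model
- forward function (line 265): each block is Attention + linear layers
Modelling
- How to write a model as code that vLLM can run
Attention backend
- flash_attn.py: Flash Attention. Instead of explicitly computing the full softmax inside attention, it builds the result up piece by piece in an implicit, recurrence-like way.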
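The "recurrence-like" idea behind Flash Attention can be illustrated with a streaming (online) softmax. The sketch below is pure Python for a single query vector and is only meant to show the arithmetic: it is not the code in flash_attn.py, the function names are made up for illustration, and the usual `1/sqrt(d)` scaling, masking, and GPU tiling are all omitted.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def naive_attention(q, keys, values):
    """Reference version: computes every softmax weight explicitly."""
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def streaming_attention(q, keys, values, chunk: int = 2):
    """Processes keys/values chunk by chunk, never holding all weights.

    Keeps a running max, a running softmax denominator, and a running
    weighted sum of values; earlier partial results are rescaled whenever
    a new chunk raises the running max.
    """
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), chunk):
        scores = [dot(q, k) for k in keys[start:start + chunk]]
        new_max = max(running_max, max(scores))
        # exp(-inf) is 0.0, so the first chunk simply starts from zero.
        rescale = math.exp(running_max - new_max)
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, values[start:start + chunk]):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[0.1, 0.2], [1.0, -1.0], [0.3, 0.9], [2.0, 0.0], [-0.5, 0.4]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0], [0.0, 0.0]]
expected = naive_attention(q, keys, values)
result = streaming_attention(q, keys, values, chunk=2)
print(all(abs(a - b) < 1e-9 for a, b in zip(expected, result)))
```

Both functions produce the same output up to floating-point error. The streaming version only ever needs one chunk of scores at a time, which is what lets a fused kernel avoid writing the full attention matrix to memory.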

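Continuous batching: requests enter and leave the batch as it runs. Instead of executing one request from start to finish before moving on to the next, all active requests are packed into one large batch and advanced together.
- Without paging, very few requests can be packed together, because a lot of GPU memory is wasted.
- The memory is wasted because each request's sequence grows gradually, so space for its maximum possible length has to be reserved up front.
- PagedAttention cuts the KV cache into blocks, so memory can be allocated as the sequence grows.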
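The two points above can be put into numbers with a toy simulation. This is a sketch under simplifying assumptions, not vLLM's scheduler: every request needs a known number of decode steps, the batch is limited by a fixed slot count, and memory is counted in abstract token slots. All function names and figures are hypothetical.

```python
from collections import deque
from typing import List


def static_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths: List[int], batch_size: int) -> int:
    """A finished request frees its slot for a waiting one at the next step."""
    waiting = deque(lengths)
    running: List[int] = []  # decode steps still needed per in-flight request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        running = [r - 1 for r in running]       # one decode step for everyone
        running = [r for r in running if r > 0]  # drop finished requests
        steps += 1
    return steps


def reserved_capacity(total_slots: int, max_len: int) -> int:
    """Requests that fit when each one reserves max_len token slots up front."""
    return total_slots // max_len


def paged_capacity(total_slots: int, current_len: int, block_size: int) -> int:
    """Requests that fit right now when memory is handed out in blocks."""
    blocks_per_request = -(-current_len // block_size)  # ceiling division
    return (total_slots // block_size) // blocks_per_request


lengths = [2, 8, 3, 1, 4, 2]
print(static_batching_steps(lengths, batch_size=2))      # 15
print(continuous_batching_steps(lengths, batch_size=2))  # 10

print(reserved_capacity(4096, max_len=2048))                  # 2
print(paged_capacity(4096, current_len=100, block_size=16))   # 36
```

With two slots, static batching waits for the 8-step request while the other slot sits idle, whereas continuous batching keeps both slots busy. The capacity figures show why paging matters for this: reserving 2048 slots per request fits only two requests, while block-wise allocation fits many more requests of the current length, wasting at most one partly filled block per request.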

Created: April 29, 2026
Last update: April 29, 2026
