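A Deep Dive into LMCache: When LLM Inference Meets the "KV Cache Revolution". A Complete Technical Guide from Transformer Attention to Tiered Storage, from vLLM/SGLang Integration to Production-Grade PD Disaggregation (2026)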

2026-07-02 08:42:52 +0800 CST

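Author's note: The cost bottleneck in LLM inference lies not in the model parameters but in recomputing KV cache and wasting GPU memory. LMCache turns KV cache from a "temporary state" into "reusable AI-native knowledge", persisting and sharing it across requests, processes, and nodes. This article breaks down LMCache's architecture, its tiered storage backends, its multiprocess mode, and the CacheBlend mechanism for non-prefix reuse, with 15+ code examples covering everything from local development to production clusters.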

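Background: the KV cache pain in LLM inference
What is LMCache? Core design philosophy
Architecture overview: from GPU Connector to Storage Manager
Multi-tier storage backends in depth
Multiprocess mode: process isolation and cross-Pod sharing
vLLM / SGLang / NVIDIA Dynamo integration
Non-prefix KV reuse and the CacheBlend mechanism
PD disaggregation (Disaggregated Prefill) and KV Transfer
Observability: end-to-end monitoring of KV cache
Hands-on code: 15+ runnable examples
Performance benchmarks and real-world data
Production deployment best practices
Summary and outlook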

1. Background: The KV Cache Pain in LLM Inference

1.1 A quick review of Transformer attention

The core of the Transformer is self-attention, computed as:

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

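In autoregressive generation, every new token has to "see" the Key and Value vectors of all previous tokens. If nothing is cached, step $i$ recomputes K and V for the whole prefix and re-runs attention over it, which costs $O(i^2 \cdot d)$ for the attention scores at that step, $O(N^2 \cdot d)$ at the final step, and $O(N^3 \cdot d)$ summed over the sequence, where $N$ is the sequence length and $d$ is the hidden dimension. With K and V cached, each step only attends from one new query, which costs $O(i \cdot d)$.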

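1.2 What a KV cache really is

A KV cache is an optimization that trades memory for compute:

# Without a KV cache: step i reprocesses the whole prefix (a disaster)
for i in range(seq_len):
    Q, K, V = compute_qkv(hidden_states[:, :i+1, :])
    output = attention(Q, K, V)

# With a KV cache: compute K, V only for the new token, then append
k_cache, v_cache = [], []
for i in range(seq_len):
    q, k, v = compute_qkv(hidden_states[:, i:i+1, :])  # new token only
    k_cache.append(k)                                   # write to cache
    v_cache.append(v)
    K = concat(k_cache, dim=1)   # all keys so far, read from the cache
    V = concat(v_cache, dim=1)
    output = attention(q, K, V)  # one query against i+1 cached keys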
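
To make the comparison concrete, here is a minimal, self-contained sketch in pure Python. The `project` function is a hypothetical stand-in for the learned Q/K/V projections (toy arithmetic, not real weights), and the token vectors are made up. It only illustrates that the cached loop produces the same outputs while projecting each token once.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

def project(x):
    """Hypothetical stand-in for the Q/K/V projections (toy arithmetic)."""
    return [2.0 * a for a in x], [a + 1.0 for a in x], [a * a for a in x]

def decode_without_cache(tokens):
    outputs, projections = [], 0
    for i in range(len(tokens)):
        qkv = [project(t) for t in tokens[: i + 1]]  # whole prefix, every step
        projections += i + 1
        q = qkv[-1][0]
        outputs.append(attend(q, [p[1] for p in qkv], [p[2] for p in qkv]))
    return outputs, projections

def decode_with_cache(tokens):
    outputs, projections = [], 0
    k_cache, v_cache = [], []
    for t in tokens:
        q, k, v = project(t)                         # new token only
        projections += 1
        k_cache.append(k)
        v_cache.append(v)
        outputs.append(attend(q, k_cache, v_cache))
    return outputs, projections

tokens = [[0.1, 0.2], [0.3, -0.1], [0.5, 0.4], [-0.2, 0.7]]
out_plain, n_plain = decode_without_cache(tokens)
out_cached, n_cached = decode_with_cache(tokens)
print(n_plain, n_cached)
```

For four tokens the uncached loop performs 1 + 2 + 3 + 4 projections and the cached loop performs four, which is the quadratic-versus-linear gap the section describes.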

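1.3 The real-world pain points: why LMCache?

Conventional KV caching has three major pain points:

Pain point	Description	Consequence
GPU memory blow-up	For a 70B model at batch=32 and seq_len=4096, the FP16 KV cache takes tens of GB with GQA (about 40 GiB for 80 layers, 8 KV heads, head_dim 128) and several hundred GB with full multi-head attention	Batch size cannot grow, so throughput is capped
No reuse across requests	The same system prompt (such as "You are a helpful assistant...") is recomputed for every request	TTFT (time to first token) stays high
Lost on engine crash	When the vLLM process crashes, the KV cache held in its memory is gone	All in-flight requests must be retried

LMCache's answer: offload KV cache from GPU memory to CPU memory, local disk, and even remote storage (Redis/S3), and reuse it across requests, processes, and nodes.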
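
The memory pain point in the table above can be sanity-checked with a few lines of arithmetic. The sketch below assumes the standard layout of one K and one V tensor per layer; the model shapes (80 layers, 8 KV heads under GQA or 64 under full multi-head attention, head_dim 128, 2 bytes per FP16 element) are illustrative assumptions for a 70B-class model, not figures from LMCache.

```python
def kv_cache_bytes(batch, seq_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Size of the KV cache: 2 tensors (K and V) per layer, per token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * batch * seq_len

GIB = 2 ** 30

# 70B-class model with grouped-query attention (assumed: 8 KV heads)
gqa = kv_cache_bytes(batch=32, seq_len=4096, num_layers=80, num_kv_heads=8, head_dim=128)

# Same model shape with full multi-head attention (assumed: 64 KV heads)
mha = kv_cache_bytes(batch=32, seq_len=4096, num_layers=80, num_kv_heads=64, head_dim=128)

print(f"GQA: {gqa / GIB:.0f} GiB, MHA: {mha / GIB:.0f} GiB")
```

The result depends heavily on the number of KV heads, which is why the same "70B" label can mean anything from tens to hundreds of GiB of cache.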

2. What Is LMCache? Core Design Philosophy

2.1 Official definition

LMCache is a KV cache management layer for LLM inference. It turns KV cache from a temporary state into reusable AI-native knowledge that can be stored persistently, reused across multiple serving engines, monitored with an observability stack, and transformed for better generation quality.

2.2 Core design philosophy

LMCache's design philosophy can be summarized as the seven "STORAGE" principles:

Scalable: from a single machine to a distributed cluster
Tiered Storage: GPU memory → CPU memory → local disk → remote storage
Observable: monitoring across the full KV cache lifecycle
Reusable: across requests, sessions, and engine instances
Agnostic: supports vLLM, SGLang, and NVIDIA Dynamo alike
General: supports all mainstream open-source LLMs (Llama, Qwen, DeepSeek, and others)
Efficient: KV retrieval reported as up to 7x faster than the engine-native path

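2.3 How it differs from vLLM Prefix Caching

Feature	vLLM Prefix Caching	LMCache
Cache scope	In-process only, prefix match only	Cross-process, supports non-prefix matching
Storage tiers	GPU memory + CPU memory	GPU/CPU/Disk/Remote (pluggable)
Persistence	Lost when the process exits	Can persist to disk or object storage
Cross-engine reuse	❌	✅ (shared between vLLM and SGLang)
Non-prefix reuse	❌	✅ (via CacheBlend)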

3. Architecture Overview: From GPU Connector to Storage Manager

3.1 Core component topology

┌─────────────────────────────────────────────────────┐
│                   Serving Engine                     │
│          (vLLM / SGLang / Dynamo)                  │
└──────────────────┬──────────────────────────────────┘
                   │ KVConnectorBase_V1
                   ▼
┌─────────────────────────────────────────────────────┐
│                LMCache Engine                       │
│                                                     │
│  ┌─────────────┐  ┌──────────────┐  ┌─────────┐ │
│  │ GPUConnector │  │StorageManager│  │Hasher   │ │
│  │ (GPU↔CPU)   │  │(tier sched.) │  │(Chunk   │ │
│  └─────────────┘  └──────┬───────┘  │Hash)    │ │
│                           │          └─────────┘ │
│                           ▼                       │
│              ┌────────────────────┐              │
│              │  Storage Backends  │              │
│              ├────────────────────┤              │
│              │ L1: CPU RAM       │              │
│              │ L2: Local Disk    │              │
│              │ L3: Remote Backend│              │
│              └────────────────────┘              │
└─────────────────────────────────────────────────────┘

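3.2 The CacheEngine core

The core engine lives in lmcache/v1/cache_engine.py. A simplified sketch of its constructor:

import asyncio

class LMCacheEngine:
    """
    Core LMCache engine; manages the full KV cache lifecycle:
    - GPU <-> CPU data transfer (via GPUConnector)
    - writes to and lookups in the storage tiers (via StorageManager)
    - chunk hashing and CacheEngineKey generation
    """

    def __init__(
        self,
        config: LMCacheEngineConfig,
        metadata: LMCacheEngineMetadata,
    ):
        # 1. Configuration and metadata
        self.config = config
        self.metadata = metadata

        # 2. GPU Connector: copies between GPU memory and host memory
        self.gpu_connector = self._create_gpu_connector()

        # 3. Storage Manager: coordinates all storage backends.
        #    The original snippet passes self.loop without defining it;
        #    a dedicated event loop is assumed here.
        self.loop = asyncio.new_event_loop()
        self.storage_manager = StorageManager(config, metadata, self.loop)

        # 4. Cache index: CacheEngineKey -> reference to the KV data
        self.cache_index = {}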

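3.3 CacheEngineKey design: content-addressed, model-scoped keys

LMCache uses content-addressed storage to match KV cache entries exactly:

from dataclasses import dataclass

@dataclass
class CacheEngineKey:
    chunk_hash: int          # hash of the chunk, computed from its token IDs
    fmt: str                 # format version (e.g. "v1")
    model_name: str          # model name (the KV layout depends on the architecture)
    num_tokens: int          # number of tokens in this chunk

    def to_string(self) -> str:
        return f"{self.fmt}@{self.model_name}@{self.chunk_hash}@{self.num_tokens}"

Key design point: chunk_hash is computed from token IDs rather than raw text, so two requests match exactly whenever they tokenize to the same IDs. Text differences that change the tokenization (an extra space usually does) produce a different hash and therefore a miss. Because model_name is part of the key, entries are never mixed between models with different KV layouts.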
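
As an illustration of content addressing over token IDs, the sketch below hashes fixed-size chunks with SHA-256 and chains each chunk's hash to the previous one, so a key also identifies everything that precedes the chunk. The chaining scheme, the chunk size of 4, and the function name `chunk_hashes` are assumptions made for this example; they are not taken from the LMCache source.

```python
import hashlib
import struct
from typing import List

def chunk_hashes(token_ids: List[int], chunk_size: int = 4) -> List[str]:
    """Hash every full chunk; each hash also covers all preceding chunks."""
    hashes = []
    prev = b""
    full = len(token_ids) - len(token_ids) % chunk_size  # ignore the partial tail
    for start in range(0, full, chunk_size):
        chunk = token_ids[start:start + chunk_size]
        h = hashlib.sha256()
        h.update(prev)                                        # chain to the prefix
        h.update(struct.pack(f"<{len(chunk)}q", *chunk))      # token IDs as int64
        prev = h.digest()
        hashes.append(prev.hex())
    return hashes

a = chunk_hashes([1, 2, 3, 4, 5, 6, 7, 8, 9])
b = chunk_hashes([1, 2, 3, 4, 5, 6, 7, 99, 9])

print(len(a), a[0] == b[0], a[1] == b[1])
```

Unlike Python's built-in `hash`, a digest from `hashlib` is stable across processes and machines, which matters once keys are shared through disk or a remote store.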

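4. Multi-Tier Storage Backends in Depth

4.1 Design philosophy of the storage hierarchy

LMCache implements a classic multi-level cache architecture, tuned for the characteristics of LLM inference. Going down the tiers, data gets colder, latency rises, capacity grows, and cost per byte falls:

Tier	Latency	Capacity	Cost
L1: CPU RAM	lowest	smallest	highest
L2: NVMe SSD	medium	medium	medium
L3: Remote	highest	largest	lowest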

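4.2 StorageManager scheduling strategy

StorageManager is the scheduler at the heart of the tiered storage. It implements a Write-All + Waterfall Retrieval strategy:

import asyncio
from typing import Optional

class StorageManager:
    """
    Tiered storage manager:
    - write: store to every tier (Write-All) for availability
    - read: search from the hottest tier down (Waterfall), return on the first hit
    """

    async def put(self, key: CacheEngineKey, value: KVCache) -> None:
        """Write: store to all backends (L1 + L2 + L3) concurrently."""
        # Must be `async def`: the body awaits the backend writes.
        futures = []
        for backend in self.backends:
            futures.append(backend.put(key, value))
        await asyncio.gather(*futures)

    async def get(self, key: CacheEngineKey) -> Optional[KVCache]:
        """
        Retrieve with the Waterfall strategy:
        1. Check L1 (CPU RAM); return on a hit
        2. L1 miss → check L2 (local disk)
        3. L2 miss → check L3 (remote backend)
        4. After a hit in a colder tier, copy the entry up into L1/L2 (read-through)
        """
        for backend in self.backends_priority_ordered:
            result = await backend.get(key)
            if result is not None:
                # Hit: promote the entry to the hotter tiers in the background
                self._promote_to_hotter_levels(key, result, backend.level)
                return result
        return None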
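
The waterfall-with-promotion idea can be exercised end to end with a toy model. In this sketch `DictBackend` and `TieredStore` are hypothetical stand-ins built on plain dictionaries; they are not LMCache classes, and promotion is done inline rather than in the background to keep the example short.

```python
import asyncio
from typing import Dict, List, Optional

class DictBackend:
    """Hypothetical in-memory stand-in for one storage tier."""
    def __init__(self, name: str):
        self.name = name
        self.data: Dict[str, bytes] = {}
        self.lookups = 0

    async def get(self, key: str) -> Optional[bytes]:
        self.lookups += 1
        return self.data.get(key)

    async def put(self, key: str, value: bytes) -> None:
        self.data[key] = value

class TieredStore:
    def __init__(self, tiers: List[DictBackend]):
        self.tiers = tiers  # ordered hottest first

    async def put(self, key: str, value: bytes) -> None:
        # Write-All: store the entry in every tier concurrently
        await asyncio.gather(*(t.put(key, value) for t in self.tiers))

    async def get(self, key: str) -> Optional[bytes]:
        # Waterfall: probe tiers in order, stop at the first hit
        for level, tier in enumerate(self.tiers):
            value = await tier.get(key)
            if value is not None:
                for hotter in self.tiers[:level]:  # promote into hotter tiers
                    await hotter.put(key, value)
                return value
        return None

async def demo():
    ram, disk, remote = DictBackend("ram"), DictBackend("disk"), DictBackend("remote")
    store = TieredStore([ram, disk, remote])
    await remote.put("chunk-1", b"kv-bytes")   # entry exists only in the cold tier
    first = await store.get("chunk-1")         # falls through to remote, then promotes
    second = await store.get("chunk-1")        # now served from RAM
    return first, second, ram.lookups, remote.lookups, "chunk-1" in disk.data

result = asyncio.run(demo())
print(result)
```

The second lookup never reaches the remote tier, which is the behaviour the read-through step in the docstring is meant to deliver.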

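4.3 Full list of supported storage backends

Backend	Class	Use case	Performance characteristics
CPU RAM	`MemoryBackend`	Hottest KV cache, low-latency requirements	< 1 ms latency, limited capacity
Local disk	`FileSystemBackend`	Moderately hot KV cache	Millisecond latency, TB-scale capacity
Redis/Valkey	`RedisBackend`	Distributed, shared KV cache	Sub-millisecond to milliseconds, supports clustering
Mooncake	`MooncakeBackend`	Very large inference clusters	RDMA networking, very low latency
InfiniStore	`InfiniStoreBackend`	Cloud-native inference	High throughput, elastic scaling
S3 (compatible)	`S3Backend`	Archival KV cache	High latency, effectively unlimited capacity
NIXL	`NIXLBackend`	Direct NVIDIA GPU transfers	GPU ↔ GPU zero-copy
GDS	`GDSBackend`	NVIDIA GPUDirect Storage	GPU memory straight to disk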

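4.4 Configuration example: multiple backends side by side

# lmcache-config.yaml
# CPU RAM + local disk + Redis configured together.
# Field names follow this article's illustrative schema; check the
# configuration reference of your LMCache release for the exact keys.
storage_backends:
  - type: memory
    capacity: 10GiB

  - type: filesystem
    path: /mnt/nvme/lmcache
    capacity: 500GiB

  - type: redis
    host: redis-cluster.internal
    port: 6379
    password: "${REDIS_PASSWORD}"
    db: 0
    capacity: 1TiB    # Redis cluster capacity

# Waterfall retrieval order
retrieval_order:
  - memory          # check CPU RAM first
  - filesystem      # then local NVMe
  - redis           # finally the Redis cluster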
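5. Multiprocess Mode: Process Isolation and Cross-Pod Sharing

5.1 Why multiprocess mode?

In production Kubernetes, the typical deployment is one vLLM instance per Pod. In the conventional (single-process) mode:

Each vLLM Pod has its own independent LMCache instance
KV cache inside one Pod cannot be shared with other Pods
KV cache is lost when the vLLM Pod restarts

Multiprocess (MP) mode addresses this by running LMCache as a standalone daemon, deployed as a DaemonSet with one LMCache server per node; every vLLM Pod on that node connects to the same server.

5.2 MP mode architecture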

┌──────────────────────────────────────────────────┐
│              Kubernetes Node                      │
│                                                  │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐        │
│  │ vLLM    │  │ vLLM    │  │ vLLM    │        │
│  │ Pod 1   │  │ Pod 2   │  │ Pod 3   │        │
│  └────┬────┘  └────┬────┘  └────┬────┘        │
│       │ ZMQ/TCP     │             │              │
│       └─────────────┴─────────────┘              │
│                     │                            │
│                     ▼                            │
│       ┌─────────────────────────┐                │
│       │   LMCache MP Server     │                │
│       │  (DaemonSet, 1 per node)│                │
│       │                         │                │
│       │  ┌─────────────────┐   │                │
│       │  │  MessageQueue    │   │                │
│       │  │  Server (ZMQ)   │   │                │
│       │  └────────┬────────┘   │                │
│       │           │             │                │
│       │           ▼             │                │
│       │  ┌─────────────────┐   │                │
│       │  │  MPCacheEngine  │   │                │
│       │  │  + Storage Mgr  │   │                │
│       │  └─────────────────┘   │                │
│       └─────────────────────────┘                │
└──────────────────────────────────────────────────┘

5.3 ZMQ communication protocol

vLLM and the LMCache MP Server communicate over ZMQ (ZeroMQ):

import json
from typing import List

import zmq

# vLLM side: LMCacheMPConnector (subclasses KVConnectorBase_V1)
class LMCacheMPConnector(KVConnectorBase_V1):
    def __init__(self, vllm_config, kv_cache_config):
        # Connect to the LMCache MP Server over TCP. "localhost" only reaches
        # the server when both share a network namespace (e.g. hostNetwork);
        # otherwise connect to the node IP.
        self.zmq_context = zmq.Context()
        self.zmq_socket = self.zmq_context.socket(zmq.DEALER)
        self.zmq_socket.connect("tcp://localhost:5555")

    async def start_load_kv(self, request_id: str, token_ids: List[int]):
        """Send a KV load request to the LMCache server."""
        message = {
            "type": "LOAD_KV",
            "request_id": request_id,
            "token_ids": token_ids,
            "model_name": self.model_name,
        }
        await self._send_zmq_message(message)

# LMCache server side: MessageQueueServer
class MessageQueueServer:
    """
    ZMQ DEALER-ROUTER pattern:
    - receives requests from multiple vLLM Pods
    - dispatches each request to a handler by its type
    """

    def run(self):
        router = self.zmq_context.socket(zmq.ROUTER)
        router.bind("tcp://*:5555")

        while True:
            identity, message = router.recv_multipart()
            request = json.loads(message)

            if request["type"] == "LOAD_KV":
                response = self._handle_load_kv(request)
            elif request["type"] == "STORE_KV":
                response = self._handle_store_kv(request)
            else:
                # Without this branch an unknown type leaves `response` unbound
                response = {"status": "error", "reason": "unknown request type"}

            router.send_multipart([identity, json.dumps(response).encode()])

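5.4 MP mode performance

According to the official LMCache blog (2026/04), MP mode delivers up to a 10x performance gain for MoE (Mixture-of-Experts) model inference:

Shared KV cache: multiple vLLM Pods on a node share one copy of the KV cache, raising the hit rate 3-5x
Process isolation: KV cache held by the LMCache server survives a vLLM Pod crash
Independent scaling: the LMCache service's resources (CPU/memory) are fully decoupled from vLLM inference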

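6. vLLM / SGLang / NVIDIA Dynamo Integration

6.1 vLLM integration: KVConnectorBase_V1

The vLLM V1 engine introduces the KVConnector abstraction, which LMCache implements with LMCacheConnectorV1:

# Enable LMCache in the vLLM configuration
import os

from vllm import EngineArgs
from vllm.config import KVTransferConfig

# Inject the LMCache configuration through an environment variable;
# set it before the engine is created so LMCache reads it at startup.
os.environ["LMCACHE_CONFIG_FILE"] = "/path/to/lmcache-config.yaml"

args = EngineArgs(
    model="Qwen/Qwen3-32B-FP8",
    tensor_parallel_size=8,
    # Disable vLLM's built-in prefix caching (LMCache takes over)
    enable_prefix_caching=False,
    # Route KV transfer through LMCache
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",
        kv_role="kv_both",   # both send and receive KV
    ),
)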

Scheduler-side methods (control plane):

class LMCacheConnectorV1(KVConnectorBase_V1):
    # 1. Number of additional tokens that LMCache can supply
    def get_num_new_matched_tokens(self, request, num_computed_tokens):
        matched_count = self.lmcache_engine.match_prefix(
            token_ids=request.token_ids,
            start_idx=num_computed_tokens,
        )
        return matched_count, self.use_async_loading

    # 2. On request completion: decide whether to offload the KV blocks
    def request_finished(self, request, blocks):
        if self._should_persist(request):
            # Offload to LMCache asynchronously (does not block the main thread)
            asyncio.create_task(self._offload_kv_async(request, blocks))

Worker-side methods (execution plane):

    # 3. Start loading KV from LMCache (possibly asynchronous)
    async def start_load_kv(self, request_id, token_ids, blocks):
        # Retrieve KV from LMCache
        kv_cache = await self.lmcache_engine.retrieve(
            token_ids=token_ids,
            num_tokens=len(token_ids),
        )

        if kv_cache is None:
            # Cache miss: nothing was loaded, so the engine must prefill
            # these tokens ("miss" is an illustrative status value)
            return LoadKvResult(status="miss", num_loaded=0)

        # Copy the retrieved KV into vLLM's GPU blocks
        self._copy_kv_to_gpu_blocks(kv_cache, blocks)
        return LoadKvResult(status="success", num_loaded=len(token_ids))

6.2 SGLang integration: radix tree reuse

SGLang manages KV cache with a radix tree; LMCache integrates with SGLang through cache-aware scheduling:

# Launch SGLang with LMCache
import subprocess
import os

env = os.environ.copy()
env["LMCACHE_CONFIG_FILE"] = "lmcache-config.yaml"

# Flag names are reproduced from the original article; confirm them with
# `python -m sglang.launch_server --help` for the SGLang version in use.
proc = subprocess.Popen([
    "python", "-m", "sglang.launch_server",
    "--model-path", "meta-llama/Llama-3.1-8B-Instruct",
    "--port", "30000",
    "--enable-cache-server",    # enable SGLang's cache server
    "--cache-server-backend", "lmcache",  # use LMCache as the backend
], env=env)

How the radix tree reuses KV:

Radix tree example:
root ""
├── "You are a helpful assistant. " (system prompt, shared)
│   ├── "Please write a Python function to " (shared prefix)
│   │   ├── "reverse a linked list."  (request A)
│   │   └── "implement quicksort."    (request B)
│   └── "Summarize the following text: " (another branch)
└── "Translate to French: " (a different system prompt)

SGLang uses the radix tree for longest-prefix matching and automatically reuses the KV cache of identical prefixes. LMCache builds on this to share the cached prefixes across nodes.
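
Longest-prefix matching can be sketched with a token-level trie. A real radix tree compresses chains of single-child nodes into one edge, but the lookup result is the same. `PrefixIndex` and its methods are hypothetical names for this illustration and are unrelated to SGLang's actual implementation.

```python
from typing import Dict, List

class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[int, "TrieNode"] = {}

class PrefixIndex:
    """Token-level trie; a radix tree would merge single-child chains."""
    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, token_ids: List[int]) -> None:
        node = self.root
        for tok in token_ids:
            node = node.children.setdefault(tok, TrieNode())

    def longest_prefix(self, token_ids: List[int]) -> int:
        """Number of leading tokens already present in the index."""
        node, matched = self.root, 0
        for tok in token_ids:
            nxt = node.children.get(tok)
            if nxt is None:
                break
            node = nxt
            matched += 1
        return matched

index = PrefixIndex()
index.insert([10, 11, 12, 20, 21])   # system prompt + request A
index.insert([10, 11, 12, 30])       # system prompt + request B

hit = index.longest_prefix([10, 11, 12, 30, 31, 32])
miss = index.longest_prefix([99, 10, 11])
print(hit, miss)
```

The second query shares tokens with the cached prompts but not from position zero, so a prefix index reports no match; that limitation is what Section 7 addresses.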

6.3 NVIDIA Dynamo integration

NVIDIA Dynamo is NVIDIA's LLM inference serving framework; it uses NIXL (NVIDIA Inference Xfer Library) for high-speed KV transfer between GPUs:

# dynamo-config.yaml
kv_cache:
  backend: lmcache
  transfer:
    protocol: nixl
    devices:
      - "cuda:0"
      - "cuda:1"
  lmcache_config: /path/to/lmcache-config.yaml

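7. Non-Prefix KV Reuse and the CacheBlend Mechanism

7.1 Limits of conventional prefix caching

Conventional KV caching (including vLLM Prefix Caching) can only reuse a contiguous prefix:

❌ Not reusable:
Prompt A: "Translate to French: Hello" → KV cached
Prompt B: "Summarize then translate: Hello" → prefix differs, no reuse

❌ Not reusable (multi-turn chat, when a turn is sent without the earlier history):
Turn 1: "What is Python?" → KV cached
Turn 2: "What are its main features? (referring to Python)" → prefix differs

7.2 CacheBlend: chunk-level KV reuse

LMCache implements non-prefix KV reuse through CacheBlend:

Core idea: split the prompt into chunks, hash each chunk independently, and match the chunks against the KV stored in LMCache regardless of where they appear in the prompt.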

from typing import List

def cacheblend_match(prompt_chunks: List[str], lmcache_engine) -> List[MatchResult]:
    """
    CacheBlend matching:
    1. Split the prompt into fixed-size chunks (e.g. 64 tokens/chunk)
    2. Hash each chunk
    3. Look up each chunk in LMCache
    4. Mark unmatched chunks for computation; matched chunks are patched
       afterwards by selective recomputation
    """
    results = []
    for chunk in prompt_chunks:
        token_ids = tokenizer.encode(chunk)
        # A list is unhashable, so hash an immutable tuple of the token IDs
        chunk_hash = hash(tuple(token_ids))
        cached_kv = lmcache_engine.lookup_by_chunk_hash(chunk_hash)

        if cached_kv is not None:
            # Hit: use the cached KV directly
            results.append(MatchResult(
                status="hit",
                kv_cache=cached_kv,
                num_tokens=len(token_ids),   # token count, not character count
            ))
        else:
            # Miss: mark the chunk as needing computation
            results.append(MatchResult(
                status="miss",
                chunk=chunk,
                num_tokens=len(token_ids),
            ))

    return results

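7.3 Restoring quality with selective recomputation

Using cached KV as-is can hurt quality: the position encoding (RoPE) no longer matches the chunk's new position, and the cached KV was computed without attending to the text that now precedes it. CacheBlend restores quality through selective recomputation:

async def cacheblend_retrieve(
    self,
    token_ids: List[int],
    retrieve_ratio: float = 0.8,  # 80% from cache, 20% recomputed
):
    """
    CacheBlend retrieval with quality recovery:
    - use the matched KV from the cache
    - recompute only a selected subset of tokens rather than all of them
    - merge (blend) the cached KV with the recomputed KV
    """
    matched_positions = self._find_matched_positions(token_ids)

    # Decide which positions need to be recomputed
    recalc_positions = self._selective_recalc(
        matched_positions,
        retrieve_ratio,
    )

    # In parallel: load matched KV from LMCache and recompute the selected positions
    cached_kv_task = self._load_from_lmcache(matched_positions)
    recomputed_kv_task = self._recompute(recalc_positions)

    cached_kv, recomputed_kv = await asyncio.gather(
        cached_kv_task, recomputed_kv_task,
    )

    # Blend
    return self._blend_kv(cached_kv, recomputed_kv, matched_positions)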
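
One way to picture the selection step is to rank tokens by how much their cached KV is expected to deviate from a fresh computation and recompute only the top fraction. The sketch below assumes such a per-token deviation score is already available; the scores, the function names, and the overwrite-style merge are illustrative assumptions, not the LMCache implementation.

```python
from typing import Dict, List

def select_recompute_positions(deviation: List[float], recompute_ratio: float) -> List[int]:
    """Indices of the tokens with the largest deviation, in ascending order."""
    if not 0.0 <= recompute_ratio <= 1.0:
        raise ValueError("recompute_ratio must be within [0, 1]")
    k = round(len(deviation) * recompute_ratio)
    ranked = sorted(range(len(deviation)), key=lambda i: deviation[i], reverse=True)
    return sorted(ranked[:k])

def blend(cached: List[str], recomputed: Dict[int, str]) -> List[str]:
    """Recomputed entries replace the cached entries at the same positions."""
    return [recomputed.get(i, kv) for i, kv in enumerate(cached)]

# Made-up deviation scores for a 10-token chunk
deviation = [0.02, 0.90, 0.05, 0.40, 0.01, 0.03, 0.75, 0.04, 0.02, 0.06]
positions = select_recompute_positions(deviation, recompute_ratio=0.2)

cached = [f"cached[{i}]" for i in range(10)]
recomputed = {i: f"fresh[{i}]" for i in positions}
blended = blend(cached, recomputed)
print(positions, blended)
```

With a ratio of 0.2 only two of the ten tokens are recomputed, which mirrors the 80/20 split used as the default in the snippet above.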

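8. PD Disaggregation (Disaggregated Prefill) and KV Transfer

8.1 Why disaggregate prefill and decode?

LLM inference has two phases:

Prefill: processes the full prompt and computes its KV cache (compute-bound; determines TTFT)
Decode: generates one token at a time (memory-bandwidth-bound; determines throughput and inter-token latency)

Problem: when both phases share a GPU, prefilling a long prompt (say 100K tokens) stalls decode steps, so requests that are already generating see their inter-token latency spike, while queued requests see a higher TTFT.

The PD approach is to assign prefill and decode to different GPUs:

Prefill nodes: high-compute GPUs (such as H100) dedicated to long prompts
Decode nodes: GPUs with large, high-bandwidth memory dedicated to token-by-token generation
KV Transfer: the KV cache computed during prefill is sent over the network to the decode node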
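
Whether disaggregation pays off depends on how long the KV transfer takes compared with the prefill it replaces. The following back-of-the-envelope sketch estimates the transfer time from the KV size and the link bandwidth. The model shape (32 layers, 8 KV heads, head_dim 128, FP16, roughly an 8B GQA model) and the two bandwidth figures are assumptions for illustration, and protocol overhead is ignored.

```python
def kv_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of K and V for `num_tokens` tokens across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens

def transfer_seconds(size_bytes, link_gbytes_per_s):
    """Ideal transfer time over a link, ignoring latency and overhead."""
    return size_bytes / (link_gbytes_per_s * 1e9)

size = kv_bytes(num_tokens=100_000, num_layers=32, num_kv_heads=8, head_dim=128)

slow = transfer_seconds(size, link_gbytes_per_s=12.5)   # assumed 100 Gb/s network link
fast = transfer_seconds(size, link_gbytes_per_s=50.0)   # assumed faster GPU interconnect

print(f"{size / 1e9:.2f} GB, {slow:.2f} s at 12.5 GB/s, {fast:.2f} s at 50 GB/s")
```

For a 100K-token prompt the cache is on the order of 13 GB under these assumptions, so the interconnect choice alone changes the added latency by a factor of four.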

8.2 LMCache's PD disaggregation implementation

┌─────────────────┐     KV Transfer      ┌─────────────────┐
│  Prefill Node   │ ──────────────────▶  │  Decode Node    │
│  (GPU 0)       │     (NIXL/NVLink)    │  (GPU 1)       │
│                 │                      │                 │
│  vLLM +         │                      │  vLLM +         │
│  LMCache        │                      │  LMCache        │
│  (offload KV)   │                      │  (retrieve KV)  │
└─────────────────┘                      └─────────────────┘

Configuration example:

# lmcache-config.yaml (prefill node)
role: prefill
kv_transfer:
  protocol: nixl          # transfer over NVIDIA NIXL
  prefill_device: cuda:0
  decode_address: "tcp://decode-node:5555"

# lmcache-config.yaml (decode node)
role: decode
kv_transfer:
  protocol: nixl
  listen_address: "tcp://0.0.0.0:5555"
  decode_device: cuda:0

Run example (official examples/offline_inference/disaggregated_prefill.py):

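# Prefill node process
import lmcache
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

prompts = ["Summarize the history of the Transformer architecture."]  # example input
sampling_params = SamplingParams(temperature=0.0, max_tokens=1)       # prefill needs only one token

# Start vLLM in the prefill role
llm = LLM(
    model="Qwen/Qwen3-32B-FP8",
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",
        kv_role="kv_producer",  # produces KV only, does not decode
    ),
)

# Process the prompt; KV is offloaded to LMCache and sent to the decode node
outputs = llm.generate(prompts, sampling_params)

# Decode node process (another machine/GPU)
llm = LLM(
    model="Qwen/Qwen3-32B-FP8",
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",
        kv_role="kv_consumer",  # consumes KV only, runs decode
    ),
)

# The decode node receives the KV through LMCache and skips the prefill compute.
# Note: it is still given the same prompt, because the token IDs identify
# which KV entries to load.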

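9. Observability: End-to-End Monitoring of KV Cache

9.1 Why does KV cache need observability?

In production, the KV cache hit rate directly drives cost and user experience:

High hit rate → low TTFT → satisfied users
Low hit rate → more recomputation → higher GPU cost

LMCache provides observability metrics along several dimensions, from Kubernetes infrastructure up to the KV cache layer.

9.2 Metric categories

Category	Example metrics	Use
Kubernetes infrastructure	CPU/memory utilization, Pod restart count	Operations monitoring
KV cache performance	Hit rate, TTFT, retrieval latency (P50/P99)	Performance tuning
Request-level metrics	Hit and miss token counts per request	Workload analysis
Management metrics	KV cache usage per user	Billing/quotas

9.3 Exposing Prometheus metrics

LMCache exposes Prometheus-format metrics on the /metrics endpoint: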

# LMCache exposes the following metrics automatically (via the Prometheus client)
from prometheus_client import Counter, Histogram, Gauge

# 1. KV cache hit/miss counters
kv_cache_hit_tokens = Counter(
    "lmcache_kv_cache_hit_tokens_total",
    "Total number of tokens served from KV cache",
    ["model_name", "user_id"],
)
kv_cache_miss_tokens = Counter(
    "lmcache_kv_cache_miss_tokens_total",
    "Total number of tokens recomputed",
    ["model_name", "user_id"],
)

# 2. KV retrieval latency (histogram)
kv_retrieval_latency = Histogram(
    "lmcache_kv_retrieval_latency_seconds",
    "Time spent retrieving KV cache from storage backends",
    ["backend_type", "model_name"],
    buckets=[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0],
)

# 3. Storage backend usage
storage_usage_bytes = Gauge(
    "lmcache_storage_usage_bytes",
    "Current storage usage per backend",
    ["backend_type", "node_id"],
)

9.4 Grafana dashboard example

# Grafana dashboard panels (fragment, shown in YAML form)
panels:
  - title: "KV Cache Hit Rate (%)"
    targets:
      - expr: |
          sum(rate(lmcache_kv_cache_hit_tokens_total[5m])) /
          (sum(rate(lmcache_kv_cache_hit_tokens_total[5m])) +
           sum(rate(lmcache_kv_cache_miss_tokens_total[5m]))) * 100
    type: stat

  - title: "KV Retrieval Latency (P99)"
    targets:
      - expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(lmcache_kv_retrieval_latency_seconds_bucket[5m])))
    type: graph
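
The hit-rate panel boils down to a ratio of two counter increments over a time window. A minimal sketch of that arithmetic, using made-up counter samples rather than real LMCache output:

```python
def hit_rate_percent(hit_tokens: float, miss_tokens: float) -> float:
    """Share of tokens served from cache, in percent; 0.0 when there was no traffic."""
    total = hit_tokens + miss_tokens
    return 0.0 if total == 0 else 100.0 * hit_tokens / total

def window_delta(start: float, end: float) -> float:
    """Increase of a monotonically growing counter between two scrapes."""
    return end - start

# Hypothetical counter values at the start and end of a 5-minute window
hits = window_delta(start=12_000, end=12_780)
misses = window_delta(start=4_000, end=4_220)

rate = hit_rate_percent(hits, misses)
print(f"{rate:.1f}%")
```

Computing the ratio from window deltas rather than from lifetime totals keeps the number responsive to recent traffic, which is what the `rate(...[5m])` terms do in the PromQL expression.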

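10. Hands-On Code: 15+ Runnable Examples

10.1 Environment setup

# Install LMCache
pip install lmcache

# Install vLLM (optional, for the integration examples)
pip install vllm

# Start Redis (for the remote-backend examples)
docker run -d -p 6379:6379 redis:7-alpine

# Set environment variables
export LMCACHE_CONFIG_FILE=./lmcache-config.yaml
export REDIS_PASSWORD=your_redis_password  # only if Redis requires a password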

10.2 Example 1: Basic Usage (Storing and Retrieving KV Cache)

import lmcache
from transformers import AutoTokenizer

# Initialize the LMCache engine
config = lmcache.LMCacheEngineConfig.from_yaml("lmcache-config.yaml")
metadata = lmcache.LMCacheEngineMetadata(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    num_layers=32,
    num_kv_heads=8,
    head_dim=128,
)
engine = lmcache.LMCacheEngine(config, metadata)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Simulate computing and storing a KV cache
prompt = "You are a helpful assistant. Please explain what is machine learning."
token_ids = tokenizer.encode(prompt)
kv_cache = {}  # placeholder for the computed KV (in practice, extracted from the model)

# Store in LMCache
cache_key = engine.make_key(chunk_hash=hash(tuple(token_ids)), fmt="v1")
engine.put(cache_key, kv_cache)
print(f"✅ Stored KV cache with key: {cache_key.to_string()}")

# Retrieve
retrieved = engine.get(cache_key)
if retrieved is not None:
    print("✅ Retrieved KV cache successfully!")
else:
    print("❌ Cache miss")

10.3 Example 2: vLLM Integration (Serving with LMCache)

# Write the LMCache configuration file
cat > lmcache-config.yaml << 'EOF'
storage_backends:
  - type: memory
    capacity: 5GiB
  - type: filesystem
    path: /tmp/lmcache
    capacity: 50GiB

retrieval_order:
  - memory
  - filesystem

# Enable non-prefix reuse (CacheBlend)
cacheblend:
  enabled: true
  retrieve_ratio: 0.8
EOF

# Start the vLLM server (with LMCache)
export LMCACHE_CONFIG_FILE=./lmcache-config.yaml

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --port 8000 \
  --gpu-memory-utilization 0.85 \
  --no-enable-prefix-caching \
  --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}'

10.4 Example 3: CPU Offload (Moving KV Cache to CPU Memory)

# examples/offline_inference/cpu_offload_lmcache.py (simplified)

import os

import lmcache
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# Configuration: enable CPU offload
config_yaml = """
storage_backends:
  - type: memory      # L1: CPU RAM
    capacity: 20GiB
  - type: filesystem  # L2: local disk (fallback)
    path: /mnt/nvme/lmcache
    capacity: 200GiB

# Offload to CPU automatically when GPU memory runs short
offload_policy:
  trigger: gpu_memory_usage > 0.9
  destination: memory
"""

with open("/tmp/lmcache-cpu-offload.yaml", "w") as f:
    f.write(config_yaml)

# Must be set before the engine is created
os.environ["LMCACHE_CONFIG_FILE"] = "/tmp/lmcache-cpu-offload.yaml"

# Start vLLM (KV is offloaded to CPU automatically)
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",
        kv_role="kv_both",
    ),
)

# Long-prompt test (triggers CPU offload)
long_prompt = "Summarize the following document: " + "ML is awesome. " * 5000
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate([long_prompt], sampling_params)
print(outputs[0].outputs[0].text)

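10.5 Example 4: Redis Backend (Distributed KV Cache Sharing)

# lmcache-redis.yaml
storage_backends:
  # Local memory tier, required by the retrieval order and write policy below
  # (the capacity is an illustrative value)
  - type: memory
    capacity: 5GiB
  - type: redis
    host: redis-cluster.internal
    port: 6379
    password: "${REDIS_PASSWORD}"
    db: 0
    # Redis cluster mode
    cluster_mode: true
    nodes:
      - "redis-1.internal:6379"
      - "redis-2.internal:6379"
      - "redis-3.internal:6379"
    capacity: 1TiB

# Waterfall order: check local memory first, then Redis on a miss
retrieval_order:
  - memory
  - redis

# Write policy: write to both local memory and Redis (Write-All)
write_policy: write_all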

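import os
import lmcache

# Read the Redis password from the environment (avoid hard-coding it)
assert "REDIS_PASSWORD" in os.environ, "Please set REDIS_PASSWORD"

os.environ["LMCACHE_CONFIG_FILE"] = "lmcache-redis.yaml"

# Start vLLM (uses the Redis backend automatically).
# Multiple vLLM instances can share the KV cache held in one Redis cluster.
from vllm import LLM
from vllm.config import KVTransferConfig

llm = LLM(
    model="Qwen/Qwen3-14B",
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",
        kv_role="kv_both",
    ),
)

# First request: compute the KV and store it in Redis
# Second request (even on another machine): load the KV from Redis and skip prefill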

10.6 Example 5: PD Disaggregation (Disaggregated Prefill)

# Node 1 (prefill node, high-compute GPU)
cat > lmcache-prefill.yaml << 'EOF'
role: prefill
kv_transfer:
  protocol: nixl
  prefill_device: cuda:0
  decode_address: "tcp://decode-node-ip:5555"

storage_backends:
  - type: memory
    capacity: 10GiB
EOF

export LMCACHE_CONFIG_FILE=./lmcache-prefill.yaml
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_producer"}' \
  --no-enable-prefix-caching &

# Node 2 (decode node, GPU with large memory)
cat > lmcache-decode.yaml << 'EOF'
role: decode
kv_transfer:
  protocol: nixl
  listen_address: "tcp://0.0.0.0:5555"
  decode_device: cuda:0

storage_backends:
  - type: memory
    capacity: 20GiB
EOF

export LMCACHE_CONFIG_FILE=./lmcache-decode.yaml
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_consumer"}' \
  --no-enable-prefix-caching &

10.7 Example 6: P2P Mode (Sharing CPU Memory Across Nodes)

# lmcache-p2p.yaml (experimental → production feature, 2026/01)
p2p:
  enabled: true
  role: both           # both send and receive
  listen_port: 5556
  peers:
    - "tcp://node-2:5556"
    - "tcp://node-3:5556"
  # P2P transfers go through CPU memory (no disk or network storage in between)
  buffer_type: cpu_memory
  buffer_capacity: 32GiB

# Node 1: store KV into the P2P network
engine = lmcache.LMCacheEngine(config, metadata)
engine.put(key, kv_cache)  # broadcast to peers automatically

# Node 2: retrieve KV from the P2P network (no disk I/O)
retrieved = engine.get(key)  # read directly from node 1's CPU memory

10.8 Example 7: Non-Prefix Reuse with CacheBlend

from lmcache.blend import CacheBlendRetriever

# Initialize CacheBlend
retriever = CacheBlendRetriever(
    lmcache_engine=engine,
    retrieve_ratio=0.8,       # 80% served from cache
    recalc_ratio=0.2,         # 20% recomputed (quality recovery)
)

prompt_chunks = [
    "You are a Python expert.",
    "Please review the following code:",
    "def add(a, b): return a + b",
    "Suggest improvements.",
]

# CacheBlend retrieval
results = retriever.retrieve(prompt_chunks)

for i, result in enumerate(results):
    if result.status == "hit":
        print(f"✅ Chunk {i}: Cache HIT")
    else:
        print(f"🔄 Chunk {i}: Cache MISS, recalculating...")
        # Recompute only the missed part (not the whole prompt)
        kv = model.forward(result.chunk_tokens)
        engine.put(result.cache_key, kv)
10.9 Example 8: Observability (Exposing Prometheus Metrics)

from lmcache.observability import PrometheusExporter
import threading

# Start the Prometheus exporter (default port 9100)
exporter = PrometheusExporter(
    port=9100,
    path="/metrics",
)
thread = threading.Thread(target=exporter.start, daemon=True)
thread.start()

print("📊 Prometheus metrics available at http://localhost:9100/metrics")

# Use LMCache as usual...
# The following metrics are recorded automatically:
# - lmcache_kv_cache_hit_tokens_total
# - lmcache_kv_cache_miss_tokens_total
# - lmcache_kv_retrieval_latency_seconds

10.10 Example 9: Multi-Model Support (Accelerating Several LLMs at Once)

# LMCache can manage the KV cache of several models at the same time
import lmcache

# Model A: Llama-3.1-8B (GQA attention)
config_a = lmcache.LMCacheEngineConfig(...)
metadata_a = lmcache.LMCacheEngineMetadata(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    num_layers=32,
    num_kv_heads=8,       # GQA: 8 KV heads
    head_dim=128,
)
engine_a = lmcache.LMCacheEngine(config_a, metadata_a)

# Model B: Qwen3-32B (also GQA, but twice as deep).
# Qwen3 does not use MLA; MLA is the compressed-KV attention of
# DeepSeek-V2/V3 and needs a different KV layout.
config_b = lmcache.LMCacheEngineConfig(...)
metadata_b = lmcache.LMCacheEngineMetadata(
    model_name="Qwen/Qwen3-32B",
    num_layers=64,
    num_kv_heads=8,        # GQA: 8 KV heads
    head_dim=128,
)
engine_b = lmcache.LMCacheEngine(config_b, metadata_b)

# The two engines work independently and can share one storage backend (such as Redis).
# The model_name field in CacheEngineKey keeps their KV formats from being mixed up.

10.11 Example 10: KV Cache Compression (Pluggable Transformation)

import torch
from lmcache.transformation import KVCompressor

# Custom KV compressor (through the SERDE interface)
class Int8KVCompressor(KVCompressor):
    """
    Quantizes an FP16 KV cache to INT8, saving 50% of the storage
    (may slightly affect quality; weigh it per use case).
    """

    @staticmethod
    def _quantize(x: torch.Tensor) -> torch.Tensor:
        # quantize_per_tensor expects float32 input; the scale is derived from
        # the data so values are not clipped by a hard-coded 0.01.
        x = x.float()
        scale = max(x.abs().max().item() / 127.0, 1e-8)
        return torch.quantize_per_tensor(x, scale=scale, zero_point=0, dtype=torch.qint8)

    def compress(self, kv_cache: KVCache) -> CompressedKVCache:
        # K: [num_layers, num_kv_heads, seq_len, head_dim]
        return CompressedKVCache(self._quantize(kv_cache.K), self._quantize(kv_cache.V))

    def decompress(self, compressed: CompressedKVCache) -> KVCache:
        # dequantize() returns float32; cast back to FP16
        k_fp16 = compressed.k_int8.dequantize().half()
        v_fp16 = compressed.v_int8.dequantize().half()
        return KVCache(K=k_fp16, V=v_fp16)

# Register the compressor
engine.register_transformation(compressor=Int8KVCompressor())

10.12 Example 11: Optimizing Agentic Workloads (Multi-Turn Conversations)

# Agentic scenario: multi-turn tool calls, the same system prompt appears again and again
system_prompt = """You are a helpful AI assistant with access to the following tools:
- search_web(query): Search the web
- read_file(path): Read a file
- execute_code(code): Execute Python code
"""

# First turn
conversation_1 = system_prompt + "\nUser: What's the weather in Paris?"
# KV cache: the system_prompt part is cached by LMCache

# Second turn (a new session, but the same system_prompt)
conversation_2 = system_prompt + "\nUser: What's the capital of France?"
# ✅ LMCache hit: the KV of system_prompt is loaded straight from the cache
# Only the KV of "\nUser: What's the capital of France?" is computed

# Configuration: tuning for agentic workloads
yaml_config = """
# Tuning for multi-turn conversations
cache_policy:
  # Cache the system prompt first (usually the longest and most stable part)
  priority_prefix:
    - "You are a helpful assistant"
    - "You are an AI assistant with tools"
  
  # Detect conversation boundaries automatically (\nUser: / \nAssistant:)
  conversation_boundary_detection: true

# Agentic benchmarks show TTFT reduced by 60-80%
"""

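10.13 Example 12: Batch Inference Optimization

# Batch inference: several requests share the same prefix
prompts = [
    "Classify the sentiment: I love this product!",
    "Classify the sentiment: This is the worst experience ever.",
    "Classify the sentiment: It's okay, not great.",
]

# Conventional approach: every prompt is computed independently
# (recomputing "Classify the sentiment: " each time)
# With LMCache: the KV of the shared prefix is reused automatically.
# Reuse happens per chunk, so a real gain needs a shared prefix that spans
# at least one full chunk; the short prefix here only illustrates the pattern.

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",
        kv_role="kv_both",
    ),
)

# After the first request, the KV of the shared prefix is cached
# and the TTFT of later requests drops
sampling_params = SamplingParams(temperature=0.0, max_tokens=10)
outputs = llm.generate(prompts, sampling_params)

for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt[:50]}...")
    print(f"Output: {output.outputs[0].text}\n")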
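
How many tokens of a shared prefix are actually reusable depends on the chunk granularity. The sketch below assumes reuse in whole fixed-size chunks, following the chunking described in Section 7.2; the chunk size of 64 and the token IDs are made up for the example.

```python
from typing import List

def common_prefix_len(a: List[int], b: List[int]) -> int:
    """Number of leading token IDs that two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def reusable_tokens(cached: List[int], request: List[int], chunk_size: int) -> int:
    """Shared prefix length rounded down to a whole number of chunks."""
    shared = common_prefix_len(cached, request)
    return (shared // chunk_size) * chunk_size

system = list(range(100, 300))               # 200 shared "system prompt" token IDs
req_a = system + [1, 2, 3]
req_b = system + [7, 8, 9, 10]

long_reuse = reusable_tokens(req_a, req_b, chunk_size=64)
short_reuse = reusable_tokens([5, 6, 7, 1], [5, 6, 7, 2], chunk_size=64)
print(long_reuse, short_reuse)
```

A 200-token shared prefix yields three reusable chunks, while a three-token prefix yields none, so very short instruction prefixes benefit little from chunk-based caching.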

10.14 Example 13: Using the LMCache CLI

# A CLI tool ships with the LMCache installation

# 1. Check LMCache status
lmcache-cli status
# Output:
# LMCache Engine Status:
#   - Config: /path/to/lmcache-config.yaml
#   - Storage Backends: memory (5GiB), filesystem (/tmp/lmcache, 50GiB)
#   - Cache Entries: 142
#   - Hit Rate: 68.3%

# 2. Clear the cache
lmcache-cli cache clear

# 3. Warm up the cache (preload the KV of common prompts)
lmcache-cli cache warmup \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --prompts-file common-prompts.txt \
  --output-dir /tmp/lmcache-warmup

# 4. Benchmark
lmcache-cli benchmark \
  --model Qwen/Qwen3-14B \
  --backend memory,redis \
  --num-requests 1000 \
  --output report.json

10.15 Example 14: Custom Storage Backend

from typing import Optional

from lmcache.storage_backend import StorageBackendInterface

# Implement a custom storage backend (e.g. for an in-house object store)
class MyCustomBackend(StorageBackendInterface):
    def __init__(self, config):
        self.client = MyObjectStoreClient(config.endpoint, config.api_key)

    async def put(self, key: CacheEngineKey, value: KVCache) -> None:
        """Upload the KV cache to the in-house object store."""
        serialized = self._serialize(value)
        await self.client.put(
            key=key.to_string(),
            data=serialized,
            metadata={"model": key.model_name, "num_tokens": key.num_tokens},
        )

    async def get(self, key: CacheEngineKey) -> Optional[KVCache]:
        """Download the KV cache from the in-house object store."""
        data = await self.client.get(key.to_string())
        if data is None:
            return None
        return self._deserialize(data)

    async def exists(self, key: CacheEngineKey) -> bool:
        return await self.client.exists(key.to_string())

# Register the custom backend
from lmcache.storage_backend import register_backend
register_backend("my_custom", MyCustomBackend)

# Use it in the configuration file
yaml_config = """
storage_backends:
  - type: my_custom
    endpoint: https://my-object-store.internal
    api_key: "${MY_API_KEY}"
"""

11. 性能基准与真实业务数据

11.1 官方基准测试（2026/05，AMD MI300X）

测试场景：Agentic 多轮对话工作负载

指标	无 LMCache	有 LMCache	提升
TTFT (P50)	850 ms	120 ms	7.1x
TTFT (P99)	3200 ms	380 ms	8.4x
吞吐量（requests/s）	18	52	2.9x
GPU 显存占用	64 GB	38 GB	1.7x 节省

测试细节：Agent 多轮对话，每轮平均 3-5 次工具调用，system prompt 约 800 tokens。LMCache 命中率：78%。

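11.2 Cohere x CoreWeave case study (2025/11)

Cohere deployed LMCache on CoreWeave (a GPU cloud), with the following results:

KV cache hit rate: rose from 0% (no cache) to 72%
Inference cost: down 58% (mainly from reduced GPU compute time)
User experience: P99 TTFT dropped from 2.1 s to 0.4 s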

11.3 Long-context scenarios (100K+ tokens)

Context length	Without LMCache (TTFT)	With LMCache (TTFT)	Speedup
10K tokens	320 ms	80 ms	4.0x
50K tokens	2100 ms	350 ms	6.0x
100K tokens	6800 ms	920 ms	7.4x

LMCache's speedup grows with context length, because the prefill computation for a long prompt is more expensive.
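
The trend can be checked directly from the table. The short script below recomputes the speedup column and the TTFT per 1K tokens; the figures are copied from the table above and nothing else is assumed (`summarize` is just an illustrative helper name).

```python
# TTFT figures (ms) copied from the table above:
# context length in K tokens -> (without LMCache, with LMCache).
TTFT_MS = {10: (320, 80), 50: (2100, 350), 100: (6800, 920)}


def summarize(table):
    """Return rows of (K tokens, speedup, ms per 1K without, ms per 1K with)."""
    rows = []
    for k_tokens, (without, with_cache) in sorted(table.items()):
        rows.append((
            k_tokens,
            round(without / with_cache, 1),
            round(without / k_tokens, 1),
            round(with_cache / k_tokens, 1),
        ))
    return rows


if __name__ == "__main__":
    for k, speedup, per_k_without, per_k_with in summarize(TTFT_MS):
        print(f"{k:>4}K  speedup {speedup}x  "
              f"{per_k_without} ms/1K without, {per_k_with} ms/1K with")
```

Without LMCache the cost per 1K tokens rises from 32 ms to 68 ms as the context grows, while with LMCache it stays between 7 and 9.2 ms, which is why the speedup widens.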

12. Production deployment best practices

12.1 Kubernetes deployment architecture

# lmcache-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: lmcache-server
spec:
  selector:
    matchLabels:
      app: lmcache-server
  template:
    metadata:
      labels:
        app: lmcache-server
    spec:
      containers:
      - name: lmcache
        image: lmcache/lmcache:latest
        resources:
          requests:
            cpu: "8"
            memory: "32Gi"
          limits:
            cpu: "16"
            memory: "64Gi"
        ports:
        - containerPort: 5555   # ZMQ communication port
          name: zmq
        - containerPort: 9100   # Prometheus metrics port
          name: metrics
        volumeMounts:
        - name: lmcache-storage
          mountPath: /mnt/nvme
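        env:
        # Assumes config.yaml already exists at this path (e.g. baked into the
        # image or mounted from a ConfigMap); this manifest does not mount it.
        - name: LMCACHE_CONFIG_FILE
          value: /etc/lmcache/config.yaml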
        - name: REDIS_PASSWORD
          valueFrom:
            secretKeyRef:
              name: redis-secret
              key: password
      volumes:
      - name: lmcache-storage
        hostPath:
          path: /mnt/nvme   # use the node-local NVMe SSD
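
Before applying a manifest like this, it is worth checking that the in-memory cache tier fits inside the pod's memory request. The sketch below parses binary-suffix quantities and compares the two. `parse_bytes` and `fits` are hypothetical helpers, the 20GiB figure is the memory-tier capacity from the tuning config in section 12.2, and the 4Gi headroom for the process itself is an assumption to adjust.

```python
import re

# Binary suffixes as used by Kubernetes ("32Gi") and by the cache config ("20GiB").
_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_bytes(quantity: str) -> int:
    """Convert strings such as '32Gi' or '20GiB' to a byte count."""
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi|Ti)B?", quantity.strip())
    if match is None:
        raise ValueError(f"unsupported quantity: {quantity!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def fits(cache_capacity: str, pod_memory: str, headroom: str = "4Gi") -> bool:
    """True if the cache tier plus headroom fits within the pod memory figure."""
    needed = parse_bytes(cache_capacity) + parse_bytes(headroom)
    return needed <= parse_bytes(pod_memory)


if __name__ == "__main__":
    print(fits("20GiB", "32Gi"))  # memory tier vs. the pod's memory request
    print(fits("20GiB", "16Gi"))  # the same tier on a smaller pod
```

Comparing against the request rather than the limit is the conservative choice, since only the requested amount is guaranteed to the pod by the scheduler.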

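12.2 Configuration tuning recommendations

# Recommended production configuration
storage_backends:
  - type: memory
    capacity: 20GiB        # enough to cache the hottest 2-5% of KV

  - type: filesystem
    path: /mnt/nvme/lmcache
    capacity: 500GiB       # caches most of the hot KV

  - type: redis
    host: redis-prod.internal
    port: 6379
    cluster_mode: true
    capacity: 2TiB         # archival KV (cold data)

# Tuning parameters
performance:
  # Parallel retrieval (higher throughput)
  num_retrieval_threads: 8

  # Background prefetch (predict the KV likely to be needed next)
  prefetch_enabled: true
  prefetch_window: 3       # prefetch KV likely needed by the next 3 requests

  # Cache eviction policy (LRU-like)
  eviction_policy: lru
  lru_time_window: 3600    # KV not accessed within 1 hour (3600 s) is evicted first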
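
How several tiers and an LRU policy interact can be illustrated with a toy model. The sketch below shows the general technique (look up tiers from fastest to slowest, promote on a hit, demote the least recently used entry when a tier overflows) and is not LMCache's actual implementation. `TieredCache` and its entry-count capacities are hypothetical; a real cache would budget in bytes.

```python
from collections import OrderedDict
from typing import List, Optional


class TieredCache:
    """Toy multi-tier cache: tier 0 is fastest; capacities are entry counts."""

    def __init__(self, capacities: List[int]) -> None:
        self.capacities = capacities
        # Each tier is kept in least- to most-recently-used order.
        self.tiers: List[OrderedDict] = [OrderedDict() for _ in capacities]

    def _insert(self, level: int, key: str, value: bytes) -> None:
        if level >= len(self.tiers):
            return  # fell off the slowest tier: the entry is evicted for good
        tier = self.tiers[level]
        tier[key] = value
        tier.move_to_end(key)
        if len(tier) > self.capacities[level]:
            old_key, old_value = tier.popitem(last=False)  # LRU entry
            self._insert(level + 1, old_key, old_value)    # demote one tier down

    def put(self, key: str, value: bytes) -> None:
        for tier in self.tiers:
            tier.pop(key, None)  # keep at most one copy across tiers
        self._insert(0, key, value)

    def get(self, key: str) -> Optional[bytes]:
        for tier in self.tiers:
            if key in tier:
                value = tier.pop(key)
                self._insert(0, key, value)  # promote the hit to the fastest tier
                return value
        return None

    def level_of(self, key: str) -> Optional[int]:
        for level, tier in enumerate(self.tiers):
            if key in tier:
                return level
        return None


if __name__ == "__main__":
    cache = TieredCache([2, 2])  # e.g. a memory tier, then a disk tier
    for name in ("a", "b", "c"):
        cache.put(name, name.encode())
    print(cache.level_of("a"))   # "a" was demoted when "c" arrived
    cache.get("a")
    print(cache.level_of("a"), cache.level_of("b"))
```

Promotion on hit is what keeps the small top tier populated with the hottest entries, which is the intent behind sizing the memory tier for only a few percent of the KV.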

12.3 Monitoring and alerting rules

# prometheus-alerts.yaml
groups:
  - name: lmcache-alerts
    rules:
      - alert: LowCacheHitRate
        expr: |
          sum(rate(lmcache_kv_cache_hit_tokens_total[10m])) /
          (sum(rate(lmcache_kv_cache_hit_tokens_total[10m])) +
           sum(rate(lmcache_kv_cache_miss_tokens_total[10m]))) < 0.5
        for: 15m
        annotations:
          summary: "LMCache hit rate too low (< 50%)"
          description: "Check whether the prefill pattern has changed, or consider increasing cache capacity"
      
      - alert: HighRetrievalLatency
        expr: |
          histogram_quantile(0.99,
            rate(lmcache_kv_retrieval_latency_seconds_bucket[10m])) > 0.5
        for: 10m
        annotations:
          summary: "KV retrieval P99 latency too high (> 500 ms)"
          description: "Check whether the storage backend (Redis/disk) is overloaded"
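
The `LowCacheHitRate` expression divides the hit-token rate by the combined hit and miss rate. The following sketch reproduces that arithmetic on two counter snapshots so the 0.5 threshold can be reasoned about offline. The metric names come from the rule above, while `hit_rate` and the sample values are made up for illustration.

```python
from typing import Dict, Optional

HIT = "lmcache_kv_cache_hit_tokens_total"
MISS = "lmcache_kv_cache_miss_tokens_total"


def hit_rate(earlier: Dict[str, float], later: Dict[str, float]) -> Optional[float]:
    """Token hit rate between two counter snapshots taken over the same window.

    Both rates share one time window, so the window length cancels and the
    ratio reduces to hit_delta / (hit_delta + miss_delta).
    """
    hits = later[HIT] - earlier[HIT]
    misses = later[MISS] - earlier[MISS]
    if hits < 0 or misses < 0:
        # Counter reset (e.g. process restart). PromQL's rate() compensates
        # for resets; this sketch simply skips the window.
        return None
    total = hits + misses
    if total == 0:
        return None  # no traffic: the PromQL ratio is NaN and the alert stays silent
    return hits / total


if __name__ == "__main__":
    t0 = {HIT: 10_000.0, MISS: 5_000.0}
    t1 = {HIT: 13_000.0, MISS: 12_000.0}
    rate = hit_rate(t0, t1)
    print(f"hit rate {rate:.2f}, alert condition met: {rate < 0.5}")
```

Because the rule measures tokens rather than requests, a few very long uncached prompts can pull the ratio down even when most requests hit the cache.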

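13. Summary and outlook

13.1 Key takeaways

The essence of LMCache: turning the KV cache from "temporary state in GPU memory" into "AI-native knowledge that is persistent, shareable, and observable"
Core value: in the figures cited above, TTFT drops by roughly 4-8x, GPU memory usage by about 40%, and inference cost by more than 50%
Architectural highlights: engine-agnostic design (vLLM, SGLang, and Dynamo all supported), Multiprocess mode (process isolation plus cross-Pod sharing), and multi-tier storage (GPU → CPU → disk → remote)
Differentiating features: non-prefix KV reuse (CacheBlend), PD disaggregation (Disaggregated Prefill), and pluggable transformations (compression/serialization)

13.2 Where LMCache sits in the LLM inference stack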

┌─────────────────────────────────────────────────┐
│            Application / Agent Framework         │
│            (LangChain, AutoGen, OpenAI Agents)  │
└────────────────────┬────────────────────────────┘
                     │
┌────────────────────▼────────────────────────────┐
│            Serving Engine                       │
│  (vLLM, SGLang, NVIDIA Dynamo, TGI)           │
└────────────────────┬────────────────────────────┘
                     │
┌────────────────────▼────────────────────────────┐
│       LMCache (the focus of this article)       │
│  KV cache management layer                      │
│  GPU ↔ CPU ↔ Disk ↔ Remote                     │
└────────────────────┬────────────────────────────┘
                     │
┌────────────────────▼────────────────────────────┐
│       Model / GPU Execution                     │
│  (CUDA Kernels, FlashAttention, PagedAttention) │
└─────────────────────────────────────────────────┘

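13.3 Outlook

According to the official LMCache roadmap (GitHub Issue #2923), future directions worth watching include:

Tensor-parallel KV cache sharing: sharing KV cache across GPUs in tensor-parallel deployments (reducing redundant storage)
KV cache deduplication and compression: automatically detecting and merging duplicate KV (for example, when multiple users share the same system prompt)
Integration with LLM compilation optimizations: LMCache combined with torch.compile and vLLM's speculative decoding
Multimodal KV cache: reusing KV cache derived from image and audio encoder outputs (currently mainly text is supported)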

References

Official documentation: https://docs.lmcache.ai/
GitHub repository: https://github.com/LMCache/LMCache (8600+ stars)
Paper: LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference (arXiv:2510.09665)
Slack community: https://join.slack.com/t/lmcacheworkspace
Roadmap: https://github.com/LMCache/LMCache/issues/2923

This article was written in July 2026, based on the latest code on the LMCache dev branch (June 2026). The example code is for reference only; for production deployments, follow the official documentation.

