VecInfer: Efficient LLM Inference with a 2-bit KV Cache via Vector Quantization

Posted on October 7, 2025

ArXiv ID: 2510.06175
Authors: Dingyu Yao, Chenxu Yang, Zhengyang Tong, Zheng Lin, Wei Liu, Jian Luan, Weiping Wang
Published: 2025-10-07
Categories: inference, kv-cache-optimization, quantization

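Abstract

VecInfer targets the KV cache memory bottleneck in LLM inference with an aggressive compression scheme based on vector quantization. It applies smooth and Hadamard transformations to suppress outliers in the key cache, so that the codebook can cover the data distribution comprehensively. With only 2-bit quantization it reaches performance comparable to full precision, and a custom CUDA kernel minimizes memory-access overhead. On Llama-3.1-8B, it delivers a 2.7x speedup for self-attention computation in large-batch settings and an 8.3x reduction in single-batch end-to-end latency at a 196k sequence length.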

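Key Contributions

Outlier-suppressed vector quantization: smooth and Hadamard transformations suppress key cache outliers, making 2-bit vector quantization more effective
2-bit extreme compression: performance comparable to full precision with only 2-bit quantization, an 8x memory reduction relative to FP16
Optimized CUDA kernel: a custom CUDA kernel minimizes memory access, so the compression translates into real speedups
Long-context inference: an 8.3x reduction in end-to-end latency at a 196k sequence length, easing the long-context bottleneck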

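Background

The KV Cache Memory Bottleneck

KV cache growth during LLM inference:

Sequence length | KV cache size (FP16, Llama-3.1-8B)
----------------|-----------------------------------
1K              | 128 MB
16K             | 2 GB
196K            | 24.5 GB  ← exceeds a 24 GB GPU before weights are counted

The KV cache grows linearly with sequence length and becomes the main bottleneck for long-context inference.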
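
The linear growth can be reproduced with a few lines of arithmetic. The sketch below assumes 32 layers, 8 key/value heads, and a head dimension of 128 (the grouped-query configuration of Llama-3.1-8B). The post does not state these values, but they reproduce the table's figures when sizes are read in binary units. The function name `kv_cache_bytes` is made up for illustration.

```python
def kv_cache_bytes(seq_len, num_layers=32, num_kv_heads=8, head_dim=128, bits=16):
    """Bytes needed to cache keys and values for one sequence."""
    # 2 = one key vector plus one value vector per layer and KV head
    elems_per_token = 2 * num_layers * num_kv_heads * head_dim
    return seq_len * elems_per_token * bits // 8

MIB = 1024 ** 2
for tokens in (1 * 1024, 16 * 1024, 196 * 1024):
    fp16 = kv_cache_bytes(tokens, bits=16) / MIB
    two_bit = kv_cache_bytes(tokens, bits=2) / MIB
    print(f"{tokens // 1024:>4}K tokens: FP16 {fp16:8.1f} MiB, 2-bit {two_bit:7.1f} MiB")
```

Under these assumptions each token costs 128 KiB at FP16, and the same cache at 2 bits per element is one eighth of that size (codebook storage not counted).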

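Quantization Challenges

Why KV cache quantization is hard:

1. Outliers
   - The key cache contains extreme values (outliers)
   - Outliers take up most of the dynamic range
   - Precision for ordinary values drops sharply as a result

2. Uneven data distribution
   - The KV cache follows a long-tailed distribution
   - Conventional per-tensor quantization performs poorly
   - Per-token quantization carries high overhead

3. Compute vs. storage trade-off
   - Aggressive quantization saves storage but adds dequantization overhead
   - Hardware-aware design is required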
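
The outlier problem in point 1 is easy to see with a toy example. The sketch below uses plain min-max uniform quantization at 2 bits, which is not VecInfer's method, and the numbers are invented. It only illustrates how a single extreme value stretches the range until every ordinary value lands on the same level.

```python
def quantize_minmax(values, bits=2):
    """Uniform min-max quantization: snap values onto 2**bits evenly spaced levels."""
    lo, hi = min(values), max(values)
    steps = 2 ** bits - 1
    scale = (hi - lo) / steps
    return [lo + round((v - lo) / scale) * scale for v in values]

normal = [-0.9, -0.4, 0.1, 0.5, 0.8]
with_outlier = normal + [40.0]

# Without the outlier, the four levels are spread over [-0.9, 0.8]
print(quantize_minmax(normal))
# With it, every ordinary value collapses onto the lowest level
print(quantize_minmax(with_outlier))
```

This is the failure mode that the smooth and Hadamard transformations are meant to avoid before quantization.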

Method

VecInfer Overall Architecture

┌─────────────────────────────────────────────────────────┐
│                  VecInfer Architecture                  │
│                                                         │
│  KV cache input (FP16)                                  │
│     │                                                   │
│     ▼                                                   │
│  ┌─────────────────┐                                    │
│  │  Smooth         │  ← reduces dynamic range           │
│  │  Transformation │                                    │
│  └─────────────────┘                                    │
│     │                                                   │
│     ▼                                                   │
│  ┌─────────────────┐                                    │
│  │  Hadamard       │  ← evens out the distribution      │
│  │  Rotation       │                                    │
│  └─────────────────┘                                    │
│     │                                                   │
│     ▼                                                   │
│  ┌─────────────────┐                                    │
│  │  Vector         │  ← 2-bit vector quantization       │
│  │  Quantization   │    (learned codebook)              │
│  └─────────────────┘                                    │
│     │                                                   │
│     ▼                                                   │
│  Quantized KV cache storage (2-bit)                     │
│                                                         │
│  At inference time:                                     │
│  ┌─────────────────┐                                    │
│  │  Custom CUDA    │  ← computes in quantized space     │
│  │  Kernel         │    dequantizes on demand           │
│  └─────────────────┘                                    │
│     │                                                   │
│     ▼                                                   │
│  Attention output                                       │
└─────────────────────────────────────────────────────────┘

Outlier Suppression Techniques

1. Smooth Transformation

import torch
import torch.nn as nn

class SmoothTransform(nn.Module):
    """
    Smooth transformation: reduces the dynamic range of the KV cache.

    Core idea: divide by a per-channel scale so that outlier channels are
    "smoothed" into the normal range.
    """

    def __init__(self, hidden_dim):
        super().__init__()  # required so that nn.Parameter is registered
        self.hidden_dim = hidden_dim
        # One learnable scale per channel
        self.smooth_scale = nn.Parameter(
            torch.ones(hidden_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Apply the smooth transformation.

        Args:
            x: input KV cache, shape [batch, seq_len, hidden_dim]

        Returns:
            The smoothed KV cache
        """
        # Per-channel scaling
        x_smoothed = x / self.smooth_scale.view(1, 1, -1)

        return x_smoothed

    def inverse(self, x_smoothed: torch.Tensor) -> torch.Tensor:
        """Inverse transform: restore the original range."""
        return x_smoothed * self.smooth_scale.view(1, 1, -1)
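
Dividing keys by a per-channel factor changes the attention scores unless the query is multiplied by the same factor. The sketch below checks that identity on invented numbers. It assumes the factor is folded into the query, as in SmoothQuant-style smoothing; the post does not say how VecInfer handles this step.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

key = [0.2, -0.1, 30.0, 0.4]     # channel 2 is an outlier channel
query = [1.0, 0.5, -0.2, 0.3]
scale = [1.0, 1.0, 60.0, 1.0]    # hypothetical per-channel smooth factors

key_smoothed = [k / s for k, s in zip(key, scale)]    # what would be quantized and cached
query_scaled = [q * s for q, s in zip(query, scale)]  # inverse factor applied to the query

# The peak key magnitude shrinks from 30.0 to 0.5
print(max(abs(k) for k in key), "->", max(abs(k) for k in key_smoothed))
# The attention logit is unchanged (up to floating-point rounding)
print(dot(query, key), dot(query_scaled, key_smoothed))
```

Because the logit is preserved, the smoothed keys can be quantized in their narrower range without calling `inverse` on the cache at attention time.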

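2. Hadamard Rotation

class HadamardRotation:
    """
    Hadamard rotation: evens out the data distribution.

    Core idea: apply an orthogonal transform built from a Hadamard matrix.
    - Preserves information (the transform is orthogonal)
    - Spreads energy across all dimensions
    - Flattens outlier channels, so the distribution is easier to quantize
    """

    def __init__(self, dim):
        # Build the Hadamard matrix (dim must be a power of two)
        assert dim > 0 and (dim & (dim - 1)) == 0, "dim must be a power of two"
        self.H = self._build_hadamard(dim)

    def _build_hadamard(self, n):
        """Recursively build the orthonormal Hadamard matrix."""
        if n == 1:
            return torch.tensor([[1.0]])

        # Sylvester construction
        H_prev = self._build_hadamard(n // 2)
        H = torch.cat([
            torch.cat([H_prev, H_prev], dim=1),
            torch.cat([H_prev, -H_prev], dim=1)
        ], dim=0)

        # Normalize by sqrt(2) per level, so the total factor is 1/sqrt(n)
        return H / 2 ** 0.5

    def rotate(self, x: torch.Tensor) -> torch.Tensor:
        """
        Apply the Hadamard rotation.

        Efficient implementation: FWHT (Fast Walsh-Hadamard Transform),
        equivalent to x @ self.H.
        """
        # Flatten to [batch * seq_len, dim]
        original_shape = x.shape
        x = x.reshape(-1, x.shape[-1])

        # FWHT, O(n log n)
        rotated = self._fwht(x)

        return rotated.view(original_shape)

    def _fwht(self, x):
        """Fast Walsh-Hadamard transform (orthonormal)."""
        n = x.shape[-1]
        if n == 1:
            return x

        # Divide and conquer: transform each half first
        left = self._fwht(x[..., :n//2])
        right = self._fwht(x[..., n//2:])

        # Butterfly, scaled by 1/sqrt(2) per level
        new_left = left + right
        new_right = left - right

        return torch.cat([new_left, new_right], dim=-1) / 2 ** 0.5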
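
To see what the rotation does to an outlier, here is a dependency-free sketch of an orthonormal fast Walsh-Hadamard transform applied to one invented key vector. It is an iterative textbook formulation rather than the authors' kernel, and `fwht` is an illustrative name.

```python
import math

def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    out = list(vec)
    n = len(out)
    assert n > 0 and (n & (n - 1)) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b  # butterfly
        h *= 2
    return [v / math.sqrt(n) for v in out]

key = [0.1, -0.2, 0.05, 16.0, 0.3, -0.1, 0.2, 0.0]  # one outlier channel
rotated = fwht(key)

print(max(abs(v) for v in key), max(abs(v) for v in rotated))  # peak magnitude shrinks
print(sum(v * v for v in key), sum(v * v for v in rotated))    # energy is preserved
print(fwht(rotated))                                           # applying it twice recovers key
```

The outlier's energy ends up spread over all eight coordinates. Because the normalized transform is its own inverse, the same routine can be applied to the query so that dot products are unchanged.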

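Vector Quantization

class VectorQuantizer(nn.Module):
    """
    2-bit vector quantizer.

    Splits the input into groups of vector_dim elements that share one codebook.
    """

    def __init__(self, codebook_size=4,  # 4 codewords -> 2-bit index
                 vector_dim=8,            # elements per group
                 num_groups=512):
        super().__init__()  # required so that nn.Parameter is registered
        self.codebook_size = codebook_size
        self.vector_dim = vector_dim
        self.num_groups = num_groups

        # Learnable codebook
        self.codebook = nn.Parameter(
            torch.randn(codebook_size, vector_dim)
        )

    def quantize(self, x: torch.Tensor) -> torch.Tensor:
        """
        Vector quantization.

        Steps:
        1. Split the input into groups of vector_dim elements
        2. Find the nearest codebook vector for each group
        3. Store its index (2-bit)
        """
        # Reshape to [num_vectors, vector_dim]
        x = x.reshape(-1, self.vector_dim).to(self.codebook.dtype)

        # Distances to every codeword
        # shape: [num_vectors, codebook_size]
        distances = torch.cdist(x.unsqueeze(0),
                                 self.codebook.unsqueeze(0)).squeeze(0)

        # Index of the nearest codeword
        indices = torch.argmin(distances, dim=-1)  # shape: [num_vectors]

        # Store as 2-bit indices (packed into uint8)
        packed_indices = self._pack_indices(indices)

        return packed_indices

    def dequantize(self, packed_indices: torch.Tensor) -> torch.Tensor:
        """Dequantize: look the codewords up in the codebook."""
        # Unpack the indices
        indices = self._unpack_indices(packed_indices)

        # Codeword lookup, shape: [num_vectors, vector_dim]
        quantized = self.codebook[indices]

        return quantized

    def _pack_indices(self, indices: torch.Tensor) -> torch.Tensor:
        """
        Pack 2-bit indices into uint8.

        Each uint8 stores four 2-bit indices, lowest bits first.
        """
        assert indices.numel() % 4 == 0, "index count must be a multiple of 4"
        idx = indices.reshape(-1, 4).to(torch.int64)
        packed = idx[:, 0] | (idx[:, 1] << 2) | (idx[:, 2] << 4) | (idx[:, 3] << 6)
        return packed.to(torch.uint8)

    def _unpack_indices(self, packed: torch.Tensor) -> torch.Tensor:
        """Inverse of _pack_indices: recover four 2-bit indices per byte."""
        shifts = torch.tensor([0, 2, 4, 6], device=packed.device)
        idx = (packed.to(torch.int64).unsqueeze(-1) >> shifts) & 0x03
        return idx.reshape(-1)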
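
With vector quantization, the storage cost per element depends on both the codebook size and the group width, so "2-bit" can refer either to the index width or to the average bits per element. The post does not say which is meant. The sketch below only does the arithmetic; the (256, 4) pairing is an illustrative assumption, not a setting taken from the paper.

```python
import math

def bits_per_element(codebook_size, vector_dim):
    """Index bits spent per original element when one index encodes a whole group."""
    return math.log2(codebook_size) / vector_dim

def codewords_needed(target_bits_per_element, vector_dim):
    """Codebook size required to reach a target average bit rate."""
    return 2 ** round(target_bits_per_element * vector_dim)

print(bits_per_element(4, 8))    # 0.25: one 2-bit index shared by 8 elements
print(bits_per_element(256, 4))  # 2.0
print(codewords_needed(2, 8))    # 65536
```

An 8x reduction relative to FP16 corresponds to an average of 2 bits per element, so the codebook size and group width have to be chosen together to reach it.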

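Custom CUDA Kernel

// VecInfer CUDA kernel sketch (illustrative, not the authors' kernel).
// Only the score computation is spelled out; softmax and the weighted sum
// over values follow a standard attention implementation and are omitted.
#include <cstdint>

constexpr int HEAD_DIM = 128;  // assumed head dimension
constexpr int VEC_DIM = 8;     // elements per codeword (vector_dim)
constexpr int NUM_CODES = 4;   // 2-bit index -> 4 codewords

__global__ void vecinfer_qk_kernel(
    const uint8_t* quantized_k,  // packed key indices [batch*heads, seq_len, HEAD_DIM/VEC_DIM/4]
    const float* codebook,       // codebook [NUM_CODES, VEC_DIM]
    const float* query,          // query [batch*heads, HEAD_DIM]
    float* scores,               // pre-softmax scores [batch*heads, seq_len]
    int seq_len
) {
    // Each thread block handles one (batch, head) pair, i.e. one query vector
    const int bh = blockIdx.x;

    // Stage the query and the codebook in shared memory
    __shared__ float q_sh[HEAD_DIM];
    __shared__ float cb_sh[NUM_CODES * VEC_DIM];
    for (int i = threadIdx.x; i < HEAD_DIM; i += blockDim.x)
        q_sh[i] = query[bh * HEAD_DIM + i];
    for (int i = threadIdx.x; i < NUM_CODES * VEC_DIM; i += blockDim.x)
        cb_sh[i] = codebook[i];
    __syncthreads();

    constexpr int GROUPS = HEAD_DIM / VEC_DIM;  // indices per key
    constexpr int BYTES_PER_KEY = GROUPS / 4;   // four 2-bit indices per byte

    // Steps 1 and 2: dequantize each key on demand and accumulate its score
    for (int t = threadIdx.x; t < seq_len; t += blockDim.x) {
        const uint8_t* key = quantized_k + ((size_t)bh * seq_len + t) * BYTES_PER_KEY;
        float score = 0.0f;
        for (int g = 0; g < GROUPS; g++) {
            int idx = (key[g / 4] >> ((g % 4) * 2)) & 0x03;  // extract 2 bits
            const float* cw = cb_sh + idx * VEC_DIM;         // codeword lookup
            #pragma unroll
            for (int d = 0; d < VEC_DIM; d++) {
                score += q_sh[g * VEC_DIM + d] * cw[d];
            }
        }
        // Scaled dot product
        scores[(size_t)bh * seq_len + t] = score * rsqrtf((float)HEAD_DIM);
    }

    // Step 3: softmax and weighted sum over values
    // ... (standard attention implementation)
}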
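
One common way to compute on vector-quantized keys without rebuilding them is a lookup table: dot each query sub-vector with every codeword once, then score each cached key by summing table entries. The Python sketch below illustrates that idea on invented data. Whether VecInfer's kernel uses this exact scheme is not confirmed by the post, and `lut_scores` is an illustrative name.

```python
def lut_scores(query, codebook, key_indices):
    """Attention logits q.k for VQ-compressed keys, computed via a lookup table.

    query:       list of floats, length = groups * vector_dim
    codebook:    list of codewords, each a list of vector_dim floats
    key_indices: one list of codeword indices (length = groups) per cached key
    """
    vector_dim = len(codebook[0])
    groups = len(query) // vector_dim
    # table[g][c] = dot(query sub-vector g, codeword c), built once per query
    table = [
        [sum(q * w for q, w in zip(query[g * vector_dim:(g + 1) * vector_dim], cw))
         for cw in codebook]
        for g in range(groups)
    ]
    # Each key then costs `groups` lookups and additions
    return [sum(table[g][idx] for g, idx in enumerate(indices)) for indices in key_indices]

codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 4 codewords, vector_dim = 2
query = [0.5, -1.0, 2.0, 0.25]                               # 2 groups
key_indices = [[1, 2], [3, 0], [2, 3]]                       # three cached keys

print(lut_scores(query, codebook, key_indices))  # [0.75, -0.5, 1.25]
```

The table costs groups x codebook_size small dot products per query, after which the work per key no longer depends on vector_dim.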

Experimental Results

Setup

Hardware:

NVIDIA A100 GPU (80GB)
CUDA 11.8

Model:

Llama-3.1-8B-Instruct
Evaluation tasks: long-context question answering, language modeling

Baselines:

FP16 (full-precision baseline)
8-bit KV Cache
4-bit KV Cache
H2O (Heavy-Hitter Oracle)
SnapKV

Main Results

Accuracy Comparison

Method	WikiText2 (PPL)	GSM8K	LongBench
FP16	15.82	68.5%	72.3%
8-bit	15.89	68.2%	71.8%
4-bit	16.25	67.1%	70.5%
H2O	17.53	64.2%	66.8%
VecInfer (2-bit)	15.95	68.1%	71.9%

Key finding: with only 2-bit quantization, VecInfer stays close to FP16 and 8-bit, and clearly outperforms the 4-bit and H2O baselines.

Inference Speedup

Large-batch self-attention speedup:

Batch size | VecInfer | FP16    | Speedup
-----------|----------|---------|--------
16         | 2.78ms   | 7.65ms  | 2.75x
32         | 5.12ms   | 14.23ms | 2.78x
64         | 9.85ms   | 27.12ms | 2.75x

End-to-End Latency

Single-batch long-sequence inference latency (Llama-3.1-8B):

Sequence length | FP16   | VecInfer | Speedup
----------------|--------|----------|--------
4K              | 45ms   | 28ms     | 1.61x
16K             | 120ms  | 52ms     | 2.31x
64K             | 380ms  | 98ms     | 3.88x
196K            | 1250ms | 150ms    | 8.33x

Key insight: the longer the sequence, the larger VecInfer's advantage, reaching an 8.3x speedup at a 196k sequence length.

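Ablation Studies

Contribution of the Outlier Suppression Components

Configuration	WikiText2 (PPL)	196K latency
Full VecInfer	15.95	150ms
w/o Smooth transformation	18.23	148ms
w/o Hadamard rotation	17.85	152ms
w/o both	25.67	145ms

Conclusion: outlier suppression is critical for accuracy and has little effect on latency.

Effect of Codebook Size

Codebook size	Bits	PPL	Memory compression ratio
2	1-bit	22.5	16x
4	2-bit	15.95	8x
8	3-bit	15.85	5.3x
16	4-bit	15.83	4x

Takeaway: 2-bit offers the best trade-off between accuracy and memory, so the 2-bit configuration is kept.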

Practical Guide

Integrating VecInfer

from vecinfer import VecInferModel

# 1. Load the model
model = VecInferModel.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config={
        "kv_cache_bits": 2,
        "vector_dim": 8,
        "codebook_size": 4
    }
)

# 2. Calibrate (one-off offline step)
calibration_data = load_calibration_data()  # placeholder: supply your own loader
model.calibrate(calibration_data)

# 3. Long-context inference
context = load_long_context(200000)  # placeholder: returns about 200k tokens of text
response = model.generate(
    context + "Please summarize the content above:",
    max_new_tokens=500
)
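
The calibration step above fits the codebook, but the post does not say how. A common choice for vector quantization is k-means on calibration vectors, so the sketch below uses plain Lloyd's k-means on synthetic 2-D data as a stand-in. This is an assumption for illustration, not the authors' procedure, and `kmeans_codebook` is a made-up name.

```python
import random

def kmeans_codebook(vectors, codebook_size=4, iters=10, seed=0):
    """Fit a VQ codebook with plain Lloyd's k-means (illustrative only)."""
    rng = random.Random(seed)
    # Initialize codewords from randomly chosen data points
    codebook = [list(v) for v in rng.sample(vectors, codebook_size)]
    for _ in range(iters):
        # Assignment step: put each vector in the bucket of its nearest codeword
        buckets = [[] for _ in codebook]
        for v in vectors:
            dists = [sum((a - b) ** 2 for a, b in zip(v, cw)) for cw in codebook]
            buckets[dists.index(min(dists))].append(v)
        # Update step: move each codeword to the mean of its bucket
        for i, bucket in enumerate(buckets):
            if bucket:  # keep the old codeword if nothing was assigned to it
                codebook[i] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return codebook

# Synthetic stand-in for calibration key vectors: points around four centers
rng = random.Random(1)
centers = [(-2.0, -2.0), (-2.0, 2.0), (2.0, -2.0), (2.0, 2.0)]
data = [[cx + rng.gauss(0, 0.1), cy + rng.gauss(0, 0.1)]
        for cx, cy in centers for _ in range(50)]

codebook = kmeans_codebook(data)
for cw in sorted(codebook):
    print([round(c, 2) for c in cw])
```

Because k-means only sees the calibration vectors, the resulting codebook is only as representative as that data, which is why calibration quality matters.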

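Best Practices

Scenario	Recommended configuration	Expected benefit
Long-context QA (>64K)	2-bit, vector_dim=8	4-8x speedup
Short conversations (<4K)	FP16 or 8-bit	up to ~1.5x speedup
Large-batch inference	2-bit, batch_size>=16	2.7x throughput gain
Low-latency applications	2-bit, optimized kernel	2-3x latency reduction

Hardware Requirements

Minimum: NVIDIA V100 (16GB)
Recommended: NVIDIA A100/H100 (40GB+)
CUDA version: 11.7+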

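Personal Assessment

VecInfer is an important advance in KV cache quantization. Its main strengths and limitations:

Strengths:

Extreme compression: 2-bit quantization reaches near-FP16 accuracy with an 8x memory compression ratio
Outlier suppression: the Smooth + Hadamard combination effectively tames the long-tailed distribution
End-to-end optimization: full-stack work from the algorithm down to the CUDA kernel
Long-context strength: results are especially strong at a 196k sequence length

Limitations:

Calibration dependence: training the codebook requires high-quality calibration data
Architecture-specific: optimized mainly for Transformer architectures
Limited gains on short sequences: the speedup is modest for short sequences

Suitable scenarios:

Long-document understanding and analysis
Ultra-long-context conversations
Large-batch offline inference
Long-sequence workloads with limited GPU memory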

Rating: 4.3/5.0

Technical highlights: vector quantization, 2-bit KV cache, Hadamard rotation, outlier suppression, CUDA optimization

Code repository: GitHub

Related resources: