Compressed Tensors

The compressed-tensors library provides a versatile and efficient way to store and manage compressed model checkpoints. It supports a variety of quantization and sparsity schemes, making it a unified format for handling different model optimizations such as GPTQ, AWQ, SmoothQuant, INT8, FP8, SparseGPT, and more.

Some of the supported formats include:

  1. dense
  2. int-quantized (sample): INT8 quantized models
  3. float-quantized (sample): FP8 quantized models; currently support for E4M3
  4. pack-quantized (sample): INT4 or INT8 weight-quantized models, packed into INT32. For INT4, the weights have an INT4 range but are stored as INT8 and then packed into INT32 (see the bit-packing sketch after this list).
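
As an illustration of the pack-quantized layout, the sketch below packs eight signed INT4 values into a single INT32 word with plain bit shifts. This is only a schematic of the bit layout; the helper name and the exact bit ordering are assumptions for illustration, not the packing routine used by compressed-tensors.

import torch

def pack_int4_to_int32(w_int4: torch.Tensor) -> torch.Tensor:
    """Pack eight signed INT4 values (stored in int8) into one INT32 word (illustrative)."""
    assert w_int4.shape[-1] % 8 == 0
    # Keep only the low 4 bits of each value (its two's-complement INT4 bit pattern).
    nibbles = (w_int4.to(torch.int32) & 0xF).reshape(*w_int4.shape[:-1], -1, 8)
    packed = torch.zeros(nibbles.shape[:-1], dtype=torch.int32)
    for i in range(8):
        packed |= nibbles[..., i] << (4 * i)  # place nibble i at bits [4*i, 4*i + 4)
    return packed

w = torch.randint(-8, 8, (4096, 4096), dtype=torch.int8)  # values in the INT4 range [-8, 7]
print(pack_int4_to_int32(w).shape)  # torch.Size([4096, 512])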

Compressed models can be easily created with llm-compressor. Alternatively, models can be created independently and serialized with a compressed-tensors config.

To find existing models on the Hugging Face Model Hub, search for the compressed-tensors tag.
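
The Hub can also be queried programmatically. A minimal sketch using huggingface_hub, assuming the tag string "compressed-tensors" matches what the Hub uses:

from huggingface_hub import HfApi

api = HfApi()
# List a few models carrying the compressed-tensors tag on the Hub.
for model in api.list_models(filter="compressed-tensors", limit=5):
    print(model.id)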

Features:

  • Weight and activation precisions: FP8, INT4, INT8 (arbitrary precision is allowed for INT in Q/DQ)
  • Quantization scale and zero-point strategies: tensor, channel, group, block, token (see the sketch after this list)
  • Dynamic per-token activation quantization (or any static strategy)
  • Sparsity in weights (unstructured or semi-structured such as 2:4) can be composed with quantization for extreme compression
  • Quantization of arbitrary modules, not just Linear modules
  • Targeted support or ignoring of modules by name or class
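
To make the scale strategies concrete, the sketch below contrasts per-tensor and per-channel symmetric INT8 weight quantization. It is an illustrative quantize/dequantize example, not compressed-tensors internals; the function name is hypothetical.

import torch

def quantize_symmetric_int8(w: torch.Tensor, strategy: str = "tensor"):
    """Illustrative symmetric INT8 quantization with a per-tensor or per-channel scale."""
    if strategy == "tensor":
        scale = w.abs().max() / 127.0                      # a single scale for the whole tensor
    elif strategy == "channel":
        scale = w.abs().amax(dim=1, keepdim=True) / 127.0  # one scale per output channel
    else:
        raise ValueError(f"unsupported strategy: {strategy}")
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

w = torch.randn(4096, 14336)
q, scale = quantize_symmetric_int8(w, strategy="channel")
w_dq = q.to(torch.float32) * scale  # dequantize: w ≈ q * scale
print(q.dtype, scale.shape, (w - w_dq).abs().max().item())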

Installation

It is recommended to install a stable release of compressed-tensors from PyPI:

pip install compressed-tensors

Developers who want to try out the latest features can also install the package from source:

git clone https://github.com/neuralmagic/compressed-tensors
cd compressed-tensors
pip install -e .

Quickstart Model Load

Quantized models can be easily loaded for inference as shown below. Only models that have already been quantized can be loaded at the moment. To quantize a model into the compressed-tensors format, see llm-compressor.

from transformers import AutoModelForCausalLM

# Load the model in compressed-tensors format
ct_model = AutoModelForCausalLM.from_pretrained("nm-testing/Meta-Llama-3.1-8B-Instruct-FP8-hf")

# Measure memory usage
mem_params = sum([param.nelement()*param.element_size() for param in ct_model.parameters()])
print(f"{mem/2**30:.4f} GB")
# 8.4575 GB

As shown above, the compressed-tensors FP8 checkpoint of Llama 3.1 8B can be loaded for inference using about half of the memory of the unquantized reference checkpoint.
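
The reported 8.46 GB is consistent with a back-of-the-envelope estimate, assuming the Linear weights are stored at 1 byte per parameter in FP8 while the embeddings and the lm_head (excluded from quantization, as the config later on this page shows) remain in BF16 at 2 bytes per parameter. The parameter counts below are approximate and used only for illustration:

num_params = 8.03e9             # approximate total parameter count of Llama 3.1 8B
embed_params = 128256 * 4096    # embed_tokens, kept in BF16
lm_head_params = 128256 * 4096  # lm_head, excluded from quantization, kept in BF16
fp8_params = num_params - embed_params - lm_head_params

bf16_gib = num_params * 2 / 2**30  # unquantized reference: 2 bytes per parameter
fp8_gib = (fp8_params * 1 + (embed_params + lm_head_params) * 2) / 2**30
print(f"BF16: {bf16_gib:.2f} GiB, FP8 checkpoint: {fp8_gib:.2f} GiB")
# BF16: ~14.96 GiB, FP8 checkpoint: ~8.46 GiB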

Sample Use Case - Load and run an FP8 model

from transformers import AutoModelForCausalLM, AutoTokenizer

prompt = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is"
]

model_name = "nm-testing/Meta-Llama-3-8B-Instruct-fp8-hf_compat"

quantized_model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

inputs = tokenizer(prompt, return_tensors="pt")
generated_ids = quantized_model.generate(**inputs, max_length=50, do_sample=False)
outputs = tokenizer.batch_decode(generated_ids)

print(outputs)

"""
['<|begin_of_text|>Hello, my name is [Name]. I am a [Your Profession/Student] and I am here to learn about the [Course/Program] at [University/Institution]. I am excited to be here and I am looking forward to', '<|begin_of_text|>The capital of France is Paris, which is located in the north-central part of the country. Paris is the most populous city in France and is known for its stunning architecture, art museums, fashion, and romantic atmosphere. The city is home to', "<|begin_of_text|>The future of AI is here, and it's already changing the way we live and work. From virtual assistants to self-driving cars, AI is transforming industries and revolutionizing the way we interact with technology. But what does the future of AI hold"]
"""

The above shows a quick example of running generation with a compressed-tensors model. Currently, once loaded, the model cannot be saved.

Deep dive into a compressed-tensors model checkpoint

In this example use case, we'll look at how the compressed-tensors model nm-testing/Meta-Llama-3.1-8B-Instruct-FP8-hf is defined through its config entries and see how this translates to the loaded model representation.

First, let's look at the quantization_config of the model. At a glance it looks overwhelming with the number of entries, but this is because compressed-tensors is a format that allows flexible expression both during and after model compression.

In practice, for checkpoint loading and inference, the config can be simplified to exclude all the default or empty entries, so we will do that here to focus on what compression is actually represented.

"quantization_config": {
  "config_groups": {
    "group_0": {
      "input_activations": {
        "num_bits": 8,
        "strategy": "tensor",
        "type": "float"
      },
      "targets": ["Linear"],
      "weights": {
        "num_bits": 8,
        "strategy": "tensor",
        "type": "float"
      }
    }
  },
  "format": "naive-quantized",
  "ignore": ["lm_head"],
  "quant_method": "compressed-tensors",
  "quantization_status": "frozen"
},

We can see from the config above that it specifies one config group, in which both weights and activations are quantized to FP8 with a static per-tensor strategy. It is also worth noting the entry in the ignore list to skip quantization of the lm_head module, so that module is left untouched in the checkpoint.
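
Numerically, static per-tensor FP8 (E4M3) quantization amounts to storing one scale per tensor alongside the FP8 weight. Below is a minimal sketch of the scheme described by this config; it is illustrative rather than the library's internal implementation, and it requires a PyTorch version with float8_e4m3fn support.

import torch

E4M3_MAX = 448.0  # largest finite value representable in float8 E4M3

def quantize_fp8_per_tensor(w: torch.Tensor):
    """Static per-tensor FP8 quantization: a single scale for the whole tensor (illustrative)."""
    weight_scale = (w.abs().max() / E4M3_MAX).reshape(1)  # matches the [1]-shaped weight_scale
    w_fp8 = (w / weight_scale).to(torch.float8_e4m3fn)
    return w_fp8, weight_scale

w = torch.randn(4096, 4096, dtype=torch.bfloat16)
w_fp8, weight_scale = quantize_fp8_per_tensor(w)
w_dq = w_fp8.to(torch.bfloat16) * weight_scale  # dequantize: w ≈ w_fp8 * weight_scale
print(w_fp8.dtype, weight_scale.shape, (w.float() - w_dq.float()).abs().max().item())

The [1]-shaped input_scale entries in the checkpoint play the analogous role for the activations: static per-tensor scales determined during calibration.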

To see the config in action, we can simply use the safetensors viewer on the model card to view the quantized weights, input_scale, and weight_scale for all of the Linear modules in the first model layer (and the same for the rest of the layers).

Tensor | Shape | Precision
model.layers.0.input_layernorm.weight | [4096] | BF16
model.layers.0.mlp.down_proj.input_scale | [1] | BF16
model.layers.0.mlp.down_proj.weight | [4096, 14336] | F8_E4M3
model.layers.0.mlp.down_proj.weight_scale | [1] | BF16
model.layers.0.mlp.gate_proj.input_scale | [1] | BF16
model.layers.0.mlp.gate_proj.weight | [14336, 4096] | F8_E4M3
model.layers.0.mlp.gate_proj.weight_scale | [1] | BF16
model.layers.0.mlp.up_proj.input_scale | [1] | BF16
model.layers.0.mlp.up_proj.weight | [14336, 4096] | F8_E4M3
model.layers.0.mlp.up_proj.weight_scale | [1] | BF16
model.layers.0.post_attention_layernorm.weight | [4096] | BF16
model.layers.0.self_attn.k_proj.input_scale | [1] | BF16
model.layers.0.self_attn.k_proj.weight | [1024, 4096] | F8_E4M3
model.layers.0.self_attn.k_proj.weight_scale | [1] | BF16
model.layers.0.self_attn.o_proj.input_scale | [1] | BF16
model.layers.0.self_attn.o_proj.weight | [4096, 4096] | F8_E4M3
model.layers.0.self_attn.o_proj.weight_scale | [1] | BF16
model.layers.0.self_attn.q_proj.input_scale | [1] | BF16
model.layers.0.self_attn.q_proj.weight | [4096, 4096] | F8_E4M3
model.layers.0.self_attn.q_proj.weight_scale | [1] | BF16
model.layers.0.self_attn.v_proj.input_scale | [1] | BF16
model.layers.0.self_attn.v_proj.weight | [1024, 4096] | F8_E4M3
model.layers.0.self_attn.v_proj.weight_scale | [1] | BF16
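
The same information can also be inspected locally without loading the model, for example with huggingface_hub and safetensors. This is a sketch; the shard filename below is an assumption and may need to be adjusted to the actual files in the repository.

from huggingface_hub import hf_hub_download
from safetensors import safe_open

# Download one shard of the checkpoint (the shard name is assumed; check the repo's file list).
path = hf_hub_download(
    "nm-testing/Meta-Llama-3.1-8B-Instruct-FP8-hf",
    filename="model-00001-of-00002.safetensors",
)

with safe_open(path, framework="pt") as f:
    for name in f.keys():
        if name.startswith("model.layers.0."):
            tensor = f.get_tensor(name)
            print(f"{name}: shape={tuple(tensor.shape)}, dtype={tensor.dtype}")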

When we load the model with the compressed-tensors HFQuantizer integration, we can see that all of the Linear modules specified in the quantization config have been replaced by CompressedLinear modules, which manage the compressed weights and the forward pass for inference. Note that the lm_head mentioned in the ignore list earlier is still kept as an unquantized Linear module.

from transformers import AutoModelForCausalLM

ct_model = AutoModelForCausalLM.from_pretrained("nm-testing/Meta-Llama-3.1-8B-Instruct-FP8-hf")
print(ct_model)
"""
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): CompressedLinear(
            in_features=4096, out_features=4096, bias=False
            (input_observer): MovingAverageMinMaxObserver()
            (weight_observer): MovingAverageMinMaxObserver()
          )
          (k_proj): CompressedLinear(
            in_features=4096, out_features=1024, bias=False
            (input_observer): MovingAverageMinMaxObserver()
            (weight_observer): MovingAverageMinMaxObserver()
          )
          (v_proj): CompressedLinear(
            in_features=4096, out_features=1024, bias=False
            (input_observer): MovingAverageMinMaxObserver()
            (weight_observer): MovingAverageMinMaxObserver()
          )
          (o_proj): CompressedLinear(
            in_features=4096, out_features=4096, bias=False
            (input_observer): MovingAverageMinMaxObserver()
            (weight_observer): MovingAverageMinMaxObserver()
          )
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): CompressedLinear(
            in_features=4096, out_features=14336, bias=False
            (input_observer): MovingAverageMinMaxObserver()
            (weight_observer): MovingAverageMinMaxObserver()
          )
          (up_proj): CompressedLinear(
            in_features=4096, out_features=14336, bias=False
            (input_observer): MovingAverageMinMaxObserver()
            (weight_observer): MovingAverageMinMaxObserver()
          )
          (down_proj): CompressedLinear(
            in_features=14336, out_features=4096, bias=False
            (input_observer): MovingAverageMinMaxObserver()
            (weight_observer): MovingAverageMinMaxObserver()
          )
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((4096,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=4096, out_features=128256, bias=False)
)
"""