TorchAO
TorchAO is an architecture optimization library for PyTorch. It provides high-performance dtypes, optimization techniques, and kernels for inference and training, and it composes with native PyTorch features such as torch.compile and FSDP. Some benchmark numbers can be found here.
Before you begin, make sure the following libraries are installed and upgraded to their latest versions:
pip install --upgrade torch torchao
By default, weights are loaded in full precision (torch.float32) regardless of the data type the weights are actually stored in, such as torch.float16. Set torch_dtype="auto" to load the weights in the data type defined in the model's config.json file and automatically use the most memory-optimal data type.
import torch
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
model_name = "meta-llama/Meta-Llama-3-8B"
# We support int4_weight_only, int8_weight_only and int8_dynamic_activation_int8_weight
# More examples and documentation for the arguments can be found at https://github.com/pytorch/ao/tree/main/torchao/quantization#other-available-quantization-techniques
quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
quantized_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto", quantization_config=quantization_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)
input_text = "What are we having for dinner?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
# compile the quantized model to get speedup
import torchao
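# apply torchao's recommended inductor settings before compiling the quantized model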
torchao.quantization.utils.recommended_inductor_config_setter()
quantized_model = torch.compile(quantized_model, mode="max-autotune")
output = quantized_model.generate(**input_ids, max_new_tokens=10)
print(tokenizer.decode(output[0], skip_special_tokens=True))
# benchmark the performance
import torch.utils.benchmark as benchmark
def benchmark_fn(f, *args, **kwargs):
    # Manual warmup
    for _ in range(5):
        f(*args, **kwargs)
    t0 = benchmark.Timer(
        stmt="f(*args, **kwargs)",
        globals={"args": args, "kwargs": kwargs, "f": f},
        num_threads=torch.get_num_threads(),
    )
    return f"{(t0.blocked_autorange().mean):.3f}"
MAX_NEW_TOKENS = 1000
print("int4wo-128 model:", benchmark_fn(quantized_model.generate, **input_ids, max_new_tokens=MAX_NEW_TOKENS))
bf16_model = AutoModelForCausalLM.from_pretrained(model_name, device_map="cuda", torch_dtype=torch.bfloat16)
bf16_model = torch.compile(bf16_model, mode="max-autotune")
print("bf16 model:", benchmark_fn(bf16_model.generate, **input_ids, max_new_tokens=MAX_NEW_TOKENS))
Serialization and Deserialization
torchao quantization is implemented with tensor subclasses, so it only works with huggingface's non-safetensors serialization and deserialization. It relies on torch.load(..., weights_only=True) to avoid executing arbitrary user code at load time, and uses add_safe_globals to allowlist some known user functions.
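Below is a minimal sketch of what that loading path looks like in plain PyTorch, assuming torchao's AffineQuantizedTensor subclass and a placeholder checkpoint path; it is illustrative only, not the exact calls transformers makes internally.
import torch
from torchao.dtypes import AffineQuantizedTensor  # assumption: torchao's quantized tensor subclass

# Allowlist the tensor subclass so weights_only loading can reconstruct it
# without executing arbitrary pickled code.
torch.serialization.add_safe_globals([AffineQuantizedTensor])

# "pytorch_model.bin" is a placeholder for a non-safetensors checkpoint file.
state_dict = torch.load("pytorch_model.bin", weights_only=True)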
Safetensors serialization is not supported because wrapper tensor subclasses allow maximum flexibility, so we want to keep the effort of supporting new quantized tensor formats low; safetensors, on the other hand, optimizes for maximum safety (no user code execution), which means every new quantization format would have to be supported manually.
# save quantized model locally
output_dir = "llama3-8b-int4wo-128"
quantized_model.save_pretrained(output_dir, safe_serialization=False)
# push to huggingface hub
# save_to = "{user_id}/llama3-8b-int4wo-128"
# quantized_model.push_to_hub(save_to, safe_serialization=False)
# load quantized model
ckpt_id = "llama3-8b-int4wo-128" # or huggingface hub model id
loaded_quantized_model = AutoModelForCausalLM.from_pretrained(ckpt_id, device_map="cuda")
# confirm the speedup
loaded_quantized_model = torch.compile(loaded_quantized_model, mode="max-autotune")
print("loaded int4wo-128 model:", benchmark_fn(loaded_quantized_model.generate, **input_ids, max_new_tokens=MAX_NEW_TOKENS))