Quantization

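Quantization techniques reduce memory and computational costs by representing weights and activations with lower-precision data types such as 8-bit integers (int8). This makes it possible to load larger models that would not normally fit into memory, and it speeds up inference. Transformers supports the AWQ and GPTQ quantization algorithms, as well as 8-bit and 4-bit quantization with bitsandbytes.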
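To make the idea of a lower-precision representation concrete, here is a minimal sketch of symmetric (absmax) int8 quantization in plain Python. It is an illustration only and not the algorithm used by AWQ, GPTQ, or bitsandbytes; the helper names quantize_int8 and dequantize_int8 are hypothetical.

```python
def quantize_int8(values):
    """Symmetric (absmax) quantization of a list of floats to int8 codes."""
    absmax = max(abs(v) for v in values)
    # Map the largest magnitude to 127; guard against an all-zero input.
    scale = absmax / 127 if absmax > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate floats from int8 codes and their shared scale."""
    return [c * scale for c in codes]


weights = [0.5, -1.27, 0.02, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)

# Each weight is now stored in 1 byte instead of 4 (fp32), plus one shared scale.
print(codes)
print([round(r, 3) for r in restored])
```

The rounding error per value is at most half the scale, which is why a single large outlier (a large absmax) degrades the precision of every other value that shares the scale.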

Quantization techniques that are not supported in Transformers can be added with the HfQuantizer class.

Learn how to quantize models in the Quantization guide.

QuantoConfig

class transformers.QuantoConfig

( weights = 'int8' activations = None modules_to_not_convert: typing.Optional[typing.List] = None **kwargs )

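Parameters

weights (str, optional, defaults to "int8") — The target dtype for the weights after quantization. Supported values are ("float8", "int8", "int4", "int2").
activations (str, optional) — The target dtype for the activations after quantization. Supported values are (None, "int8", "float8").
modules_to_not_convert (list, optional, defaults to None) — The list of modules to not quantize, useful for models that explicitly require some modules to be left in their original precision (e.g. Whisper encoder, Llava encoder, Mixtral gate layers).

This is a wrapper class for all the attributes and features you can use with a model that has been loaded using quanto.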

post_init

( )

Safety checker that verifies the arguments are correct.
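The kind of check described above can be sketched in a few lines. This is a hypothetical stand-in written against the supported values documented for QuantoConfig, not the library's own post_init; the function name check_quanto_args is an assumption made for illustration.

```python
ACCEPTED_WEIGHTS = ("float8", "int8", "int4", "int2")
ACCEPTED_ACTIVATIONS = (None, "int8", "float8")


def check_quanto_args(weights="int8", activations=None):
    """Raise ValueError if either dtype name is outside the documented set."""
    if weights not in ACCEPTED_WEIGHTS:
        raise ValueError(f"weights must be one of {ACCEPTED_WEIGHTS}, got {weights!r}")
    if activations not in ACCEPTED_ACTIVATIONS:
        raise ValueError(f"activations must be one of {ACCEPTED_ACTIVATIONS}, got {activations!r}")


check_quanto_args("int4", "int8")  # a documented combination passes silently

try:
    check_quanto_args("int3")  # not a documented weight dtype
except ValueError as err:
    print(err)
```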

AqlmConfig

class transformers.AqlmConfig

( in_group_size: int = 8 out_group_size: int = 1 num_codebooks: int = 1 nbits_per_codebook: int = 16 linear_weights_not_to_quantize: typing.Optional[typing.List[str]] = None **kwargs )

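Parameters

in_group_size (int, optional, defaults to 8) — The group size along the input dimension.
out_group_size (int, optional, defaults to 1) — The group size along the output dimension. It is recommended to always use 1.
num_codebooks (int, optional, defaults to 1) — The number of codebooks for the additive quantization procedure.
nbits_per_codebook (int, optional, defaults to 16) — The number of bits encoding a single codebook vector. The codebook size is 2**nbits_per_codebook.
linear_weights_not_to_quantize (Optional[List[str]], optional) — List of full paths of nn.Linear weight parameters that should not be quantized.
kwargs (Dict[str, Any], optional) — Additional parameters from which to initialize the configuration object.

This is a wrapper class for the aqlm parameters.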
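As a rough guide to how these parameters interact, the sketch below computes the codebook size and the number of code bits stored per weight. It assumes each group of in_group_size * out_group_size weights stores one index per codebook, and it ignores the overhead of the codebooks and scales themselves. The helper name aqlm_code_stats is hypothetical.

```python
def aqlm_code_stats(in_group_size=8, out_group_size=1, num_codebooks=1, nbits_per_codebook=16):
    """Return (entries per codebook, code bits stored per weight)."""
    codebook_entries = 2 ** nbits_per_codebook
    weights_per_group = in_group_size * out_group_size
    # One index of nbits_per_codebook bits per codebook, shared by the whole group.
    bits_per_weight = num_codebooks * nbits_per_codebook / weights_per_group
    return codebook_entries, bits_per_weight


entries, bits = aqlm_code_stats()
print(f"{entries} entries per codebook, {bits} code bits per weight")
```

With the default values, a single 16-bit index covers 8 weights, so the codes alone cost about 2 bits per weight.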

post_init

( )

Safety checker that verifies the arguments are correct. It also replaces some NoneType arguments with their default values.

AwqConfig

class transformers.AwqConfig

( bits: int = 4 group_size: int = 128 zero_point: bool = True version: AWQLinearVersion = AWQLinearVersion.GEMM backend: AwqBackendPackingMethod = AwqBackendPackingMethod.AUTOAWQ do_fuse: typing.Optional[bool] = None fuse_max_seq_len: typing.Optional[int] = None modules_to_fuse: typing.Optional[dict] = None modules_to_not_convert: typing.Optional[typing.List] = None exllama_config: typing.Optional[typing.Dict[str, int]] = None **kwargs )

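Parameters

bits (int, optional, defaults to 4) — The number of bits to quantize to.
group_size (int, optional, defaults to 128) — The group size to use for quantization. The recommended value is 128; -1 uses per-column quantization.
zero_point (bool, optional, defaults to True) — Whether to use zero-point quantization.
version (AWQLinearVersion, optional, defaults to AWQLinearVersion.GEMM) — The version of the quantization algorithm to use. GEMM is better for large batch sizes (e.g. >= 8); otherwise GEMV is better (e.g. < 8). GEMM models are compatible with Exllama kernels.
backend (AwqBackendPackingMethod, optional, defaults to AwqBackendPackingMethod.AUTOAWQ) — The quantization backend. Some models might be quantized with the llm-awq backend. This is useful for users who quantize their own models with the llm-awq library.
do_fuse (bool, optional, defaults to False) — Whether to fuse the attention and mlp layers together for faster inference.
fuse_max_seq_len (int, optional) — The maximum sequence length to generate when using fusing.
modules_to_fuse (dict, optional, defaults to None) — Overrides the natively supported fusing scheme with the one specified by the user.
modules_to_not_convert (list, optional, defaults to None) — The list of modules to not quantize, useful for models that explicitly require some modules to be left in their original precision (e.g. Whisper encoder, Llava encoder, Mixtral gate layers). Note that you cannot quantize directly with transformers; refer to the AutoAWQ documentation to learn how to quantize HF models.
exllama_config (Dict[str, Any], optional) — You can specify the version of the exllama kernel through the version key, the maximum sequence length through the max_input_len key, and the maximum batch size through the max_batch_size key. Defaults to {"version": 2, "max_input_len": 2048, "max_batch_size": 8} if unset.

This is a wrapper class for all the attributes and features you can use with a model that has been loaded using the auto-awq library, relying on the auto_awq backend for quantization.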
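The roles of bits, group_size, and zero_point can be illustrated with a small sketch of group-wise zero-point (asymmetric) quantization over one weight row. This is a simplified illustration in plain Python, not the AutoAWQ kernels; the function names are hypothetical, and treating group_size == -1 as "one group spanning the whole row" is an assumption made for this sketch.

```python
def quantize_groups(row, bits=4, group_size=128):
    """Quantize each group of a row independently with its own scale and zero point."""
    if group_size == -1:
        group_size = len(row)  # assumption: a single group covering the whole row
    qmax = 2 ** bits - 1
    groups = []
    for start in range(0, len(row), group_size):
        chunk = row[start:start + group_size]
        lo, hi = min(chunk), max(chunk)
        scale = (hi - lo) / qmax if hi > lo else 1.0
        # The zero point is the integer code that represents the real value 0.
        zero = max(0, min(qmax, round(-lo / scale)))
        codes = [max(0, min(qmax, round(v / scale) + zero)) for v in chunk]
        groups.append((codes, scale, zero))
    return groups


def dequantize_groups(groups):
    """Map integer codes back to approximate real values."""
    return [(c - zero) * scale for codes, scale, zero in groups for c in codes]


row = [0.0, 0.5, 1.0, 1.5, -3.0, 0.0, 3.0, 6.0]
groups = quantize_groups(row, bits=4, group_size=4)  # small group for readability
for codes, scale, zero in groups:
    print(codes, round(scale, 3), zero)
print([round(v, 3) for v in dequantize_groups(groups)])
```

Smaller groups track the local range of the weights more closely at the cost of storing more scales and zero points.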

post_init

( )

Safety checker that verifies the arguments are correct.

EetqConfig

class transformers.EetqConfig

( weights: str = 'int8' modules_to_not_convert: typing.Optional[typing.List] = None **kwargs )

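Parameters

weights (str, optional, defaults to "int8") — The target dtype for the weights. The only supported value is "int8".
modules_to_not_convert (list, optional, defaults to None) — The list of modules to not quantize, useful for quantizing models that explicitly require some modules to be left in their original precision.

This is a wrapper class for all the attributes and features you can use with a model that has been loaded using eetq.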

post_init

( )

Safety checker that verifies the arguments are correct.

GPTQConfig

class transformers.GPTQConfig

( bits: int tokenizer: typing.Any = None dataset: typing.Union[typing.List[str], str, NoneType] = None group_size: int = 128 damp_percent: float = 0.1 desc_act: bool = False sym: bool = True true_sequential: bool = True use_cuda_fp16: bool = False model_seqlen: typing.Optional[int] = None block_name_to_quantize: typing.Optional[str] = None module_name_preceding_first_block: typing.Optional[typing.List[str]] = None batch_size: int = 1 pad_token_id: typing.Optional[int] = None use_exllama: typing.Optional[bool] = None max_input_length: typing.Optional[int] = None exllama_config: typing.Optional[typing.Dict[str, typing.Any]] = None cache_block_outputs: bool = True modules_in_block_to_quantize: typing.Optional[typing.List[typing.List[str]]] = None **kwargs )

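Parameters

bits (int) — The number of bits to quantize to. Supported values are (2, 3, 4, 8).
tokenizer (str or PreTrainedTokenizerBase, optional) — The tokenizer used to process the dataset. You can pass one of the following:
- A custom tokenizer object.
- A string, the model id of a predefined tokenizer hosted inside a model repo on huggingface.co.
- A path to a directory containing the vocabulary files required by the tokenizer, for instance one saved with the save_pretrained() method, e.g. ./my_model_directory/.
dataset (Union[List[str], str], optional) — The dataset used for quantization. You can provide your own dataset as a list of strings, or use one of the original datasets used in the GPTQ paper: ['wikitext2', 'c4', 'c4-new'].
group_size (int, optional, defaults to 128) — The group size to use for quantization. The recommended value is 128; -1 uses per-column quantization.
damp_percent (float, optional, defaults to 0.1) — The percentage of the average Hessian diagonal to use for dampening. The recommended value is 0.1.
desc_act (bool, optional, defaults to False) — Whether to quantize columns in order of decreasing activation size. Setting it to False can significantly speed up inference, but the perplexity may become slightly worse. Also known as act-order.
sym (bool, optional, defaults to True) — Whether to use symmetric quantization.
true_sequential (bool, optional, defaults to True) — Whether to perform sequential quantization within a single Transformer block. Instead of quantizing the entire block at once, quantization is performed layer by layer, so each layer is quantized using inputs that have passed through the previously quantized layers.
use_cuda_fp16 (bool, optional, defaults to False) — Whether to use the optimized cuda kernel for fp16 models. The model must be in fp16.
model_seqlen (int, optional) — The maximum sequence length that the model can take.
block_name_to_quantize (str, optional) — The name of the transformers block to quantize. If None, the block name is inferred using common patterns (e.g. model.layers).
module_name_preceding_first_block (List[str], optional) — The layers that precede the first Transformer block.
batch_size (int, optional, defaults to 1) — The batch size used when processing the dataset.
pad_token_id (int, optional) — The pad token id. Needed to prepare the dataset when batch_size > 1.
use_exllama (bool, optional) — Whether to use the exllama backend. Defaults to True if unset. Only works with bits = 4.
max_input_length (int, optional) — The maximum input length. This is needed to initialize a buffer that depends on the maximum expected input length. It is specific to the exllama backend with act-order.
exllama_config (Dict[str, Any], optional) — The exllama config. You can specify the version of the exllama kernel through the version key. Defaults to {"version": 1} if unset.
cache_block_outputs (bool, optional, defaults to True) — Whether to cache block outputs to reuse them as inputs for the succeeding block.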
modules_in_block_to_quantize (List[List[str]], optional) — List of list of module names to quantize in the specified block. This argument is useful to exclude certain linear modules from being quantized. The block to quantize can be specified by setting block_name_to_quantize. We will quantize each list sequentially. If not set, we will quantize all linear layers. Example: modules_in_block_to_quantize =[["self_attn.k_proj", "self_attn.v_proj", "self_attn.q_proj"], ["self_attn.o_proj"]]. In this example, we will first quantize the q,k,v layers simultaneously since they are independent. Then, we will quantize self_attn.o_proj layer with the q,k,v layers quantized. This way, we will get better results since it reflects the real input self_attn.o_proj will get when the model is quantized.

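This is a wrapper class for all the attributes and features you can use with a model that has been loaded using the optimum API for GPTQ quantization, relying on the auto_gptq backend.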
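The damp_percent parameter is described as a percentage of the average Hessian diagonal. A minimal sketch of that dampening step, assuming the Hessian is given as a square list of lists, is shown below. It illustrates only this one step in plain Python and is not the auto_gptq implementation; the function name damp_hessian is hypothetical.

```python
def damp_hessian(hessian, damp_percent=0.1):
    """Add damp_percent * mean(diag(H)) to every diagonal entry of H."""
    n = len(hessian)
    mean_diag = sum(hessian[i][i] for i in range(n)) / n
    damp = damp_percent * mean_diag
    return [
        [value + damp if i == j else value for j, value in enumerate(row)]
        for i, row in enumerate(hessian)
    ]


# A singular Hessian: the second input feature was never active.
H = [[4.0, 0.0], [0.0, 0.0]]
print(damp_hessian(H))
```

Dampening keeps every diagonal entry strictly positive, which makes the matrix safer to invert during quantization.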

from_dict_optimum

( config_dict )

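Get a compatible class from an optimum GPTQ config dict.

post_init

( )

Safety checker that verifies the arguments are correct.

to_dict_optimum

( )

Get a dict that is compatible with the optimum GPTQ config.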

BitsAndBytesConfig

class transformers.BitsAndBytesConfig

( load_in_8bit = False load_in_4bit = False llm_int8_threshold = 6.0 llm_int8_skip_modules = None llm_int8_enable_fp32_cpu_offload = False llm_int8_has_fp16_weight = False bnb_4bit_compute_dtype = None bnb_4bit_quant_type = 'fp4' bnb_4bit_use_double_quant = False bnb_4bit_quant_storage = None **kwargs )

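Parameters

load_in_8bit (bool, optional, defaults to False) — This flag is used to enable 8-bit quantization with LLM.int8().
load_in_4bit (bool, optional, defaults to False) — This flag is used to enable 4-bit quantization by replacing the Linear layers with FP4/NF4 layers from bitsandbytes.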
llm_int8_threshold (float, optional, defaults to 6.0) — This corresponds to the outlier threshold for outlier detection as described in LLM.int8() : 8-bit Matrix Multiplication for Transformers at Scale paper: https://arxiv.org/abs/2208.07339 Any hidden states value that is above this threshold will be considered an outlier and the operation on those values will be done in fp16. Values are usually normally distributed, that is, most values are in the range [-3.5, 3.5], but there are some exceptional systematic outliers that are very differently distributed for large models. These outliers are often in the interval [-60, -6] or [6, 60]. Int8 quantization works well for values of magnitude ~5, but beyond that, there is a significant performance penalty. A good default threshold is 6, but a lower threshold might be needed for more unstable models (small models, fine-tuning).
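llm_int8_skip_modules (List[str], optional) — An explicit list of the modules that should not be converted to 8-bit. This is useful for models such as Jukebox that have several heads in different places, not necessarily at the last position. For example, for CausalLM models, the last lm_head is kept in its original dtype.
llm_int8_enable_fp32_cpu_offload (bool, optional, defaults to False) — This flag is for advanced use cases and users who are aware of this feature. Use it if you want to split your model into different parts and run some parts in int8 on GPU and other parts in fp32 on CPU. This is useful for offloading large models such as google/flan-t5-xxl. Note that the int8 operations will not run on CPU.
llm_int8_has_fp16_weight (bool, optional, defaults to False) — This flag runs LLM.int8() with 16-bit main weights. This is useful for fine-tuning, as the weights do not have to be converted back and forth for the backward pass.
bnb_4bit_compute_dtype (torch.dtype or str, optional, defaults to torch.float32) — This sets the computational type, which might be different from the input type. For example, inputs might be fp32, but computation can be set to bf16 for speedups.
bnb_4bit_quant_type (str, optional, defaults to "fp4") — This sets the quantization data type in the bnb.nn.Linear4Bit layers. The options are the FP4 and NF4 data types, specified by fp4 and nf4 respectively.
bnb_4bit_use_double_quant (bool, optional, defaults to False) — This flag is used for nested quantization, where the quantization constants from the first quantization are quantized again.
bnb_4bit_quant_storage (torch.dtype or str, optional, defaults to torch.uint8) — This sets the storage type used to pack the quantized 4-bit params.
kwargs (Dict[str, Any], optional) — Additional parameters from which to initialize the configuration object.

This is a wrapper class for all the attributes and features you can use with a model that has been loaded using bitsandbytes.

This replaces load_in_8bit or load_in_4bit, therefore both options are mutually exclusive.

Currently only LLM.int8(), FP4, and NF4 quantization are supported. If more methods are added to bitsandbytes, more arguments will be added to this class.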
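The role of llm_int8_threshold can be sketched as a split of the hidden-state feature dimensions into outlier and regular columns. The following plain-Python illustration assumes that a column counts as an outlier when any of its values has a magnitude above the threshold; it is not the bitsandbytes kernel, and the function name split_outlier_columns is hypothetical.

```python
def split_outlier_columns(hidden_states, threshold=6.0):
    """Return (outlier_cols, regular_cols) as lists of feature indices."""
    n_cols = len(hidden_states[0])
    outlier_cols = [
        j for j in range(n_cols)
        if any(abs(row[j]) > threshold for row in hidden_states)
    ]
    regular_cols = [j for j in range(n_cols) if j not in outlier_cols]
    return outlier_cols, regular_cols


hidden = [
    [0.3, -7.2, 1.1, 2.0],
    [-1.4, 0.5, 0.9, 45.0],
]
outliers, regular = split_outlier_columns(hidden)
# Outlier columns would be multiplied in fp16, the remaining ones in int8.
print(outliers, regular)
```

Lowering the threshold moves more columns into the fp16 path, trading speed and memory for numerical stability.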

is_quantizable

( )

Returns True if the model is quantizable, False otherwise.

post_init

( )

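Safety checker that verifies the arguments are correct. It also replaces some NoneType arguments with their default values.

quantization_method

( )

This method returns the quantization method used for the model. If the model is not quantizable, it returns None.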

to_diff_dict

( ) → Dict[str, Any]

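Returns

Dict[str, Any]

A dictionary of all the attributes that make up this configuration instance.

Removes all attributes from the config that correspond to the default config attributes, for better readability, and serializes the result to a Python dictionary.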
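The behaviour described for to_diff_dict can be sketched with a toy dataclass. ToyQuantConfig and its fields are hypothetical stand-ins for illustration and are not Transformers classes; the point is only the comparison against a default-constructed instance.

```python
from dataclasses import asdict, dataclass


@dataclass
class ToyQuantConfig:
    """Hypothetical stand-in for a quantization config."""
    load_in_4bit: bool = False
    quant_type: str = "fp4"
    use_double_quant: bool = False

    def to_dict(self):
        return asdict(self)

    def to_diff_dict(self):
        # Keep only the attributes whose values differ from the defaults.
        defaults = ToyQuantConfig().to_dict()
        return {k: v for k, v in self.to_dict().items() if v != defaults[k]}


config = ToyQuantConfig(load_in_4bit=True, quant_type="nf4")
print(config.to_diff_dict())
```

A serialized config that omits default values is shorter and shows at a glance which settings were changed.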

HfQuantizer

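class transformers.quantizers.HfQuantizer

( quantization_config: QuantizationConfigMixin **kwargs )

Abstract class of the HuggingFace quantizer. It currently supports inference and/or quantization of HF transformers models. This class is used only by transformers.PreTrainedModel.from_pretrained and cannot yet be easily used outside the scope of that method.

Attributes

quantization_config (transformers.utils.quantization_config.QuantizationConfigMixin) — The quantization config that defines the quantization parameters of the model you want to quantize.
modules_to_not_convert (List[str], optional) — The list of module names to not convert when quantizing the model.
required_packages (List[str], optional) — The list of pip packages that must be installed before using the quantizer.
requires_calibration (bool) — Whether the quantization method requires calibrating the model before using it.
requires_parameters_quantization (bool) — Whether the quantization method requires creating a new Parameter. For example, bitsandbytes requires creating a new xxxParameter in order to properly quantize the model.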

adjust_max_memory

( max_memory: typing.Dict[str, typing.Union[int, str]] )

Adjust the max_memory argument of infer_auto_device_map() if extra memory is needed for quantization.

adjust_target_dtype

( torch_dtype: torch.dtype )

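Parameters

torch_dtype (torch.dtype, optional) — The torch_dtype that is used to compute the device_map.

Override this method if you want to adjust the target_dtype variable used in from_pretrained to compute the device_map when the device_map is a str. For example, for bitsandbytes, target_dtype is forced to torch.int8, and for 4-bit a custom enum accelerate.CustomDtype.int4 is passed.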

check_quantized_param

( model: PreTrainedModel param_value: torch.Tensor param_name: str state_dict: typing.Dict[str, typing.Any] **kwargs )

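Checks whether a loaded state_dict component is part of a quantized param and performs some validation. Only defined when requires_parameters_quantization == True, for quantization methods that require creating new parameters for quantization.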

create_quantized_param

( *args **kwargs )

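Takes the needed components from state_dict and creates a quantized param. Only applicable when requires_parameters_quantization == True.

dequantize

( model )

Potentially dequantizes the model to retrieve the original model, with some loss in accuracy and/or performance. Note that not all quantization schemes support this.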

get_special_dtypes_update

( model torch_dtype: torch.dtype )

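Parameters

model (~transformers.PreTrainedModel) — The model to quantize.
torch_dtype (torch.dtype) — The dtype passed to the from_pretrained method.

Returns the dtypes of the modules that are not quantized. This is used to compute the device_map when a str is passed as device_map. The method uses the modules_to_not_convert that is modified in _process_model_before_weight_loading.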

postprocess_model

( model: PreTrainedModel **kwargs )

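Parameters

model (~transformers.PreTrainedModel) — The model to quantize.
kwargs (dict, optional) — The keyword arguments that are passed along to _process_model_after_weight_loading.

Post-processes the model after the weights are loaded. Make sure to override the abstract method _process_model_after_weight_loading.

preprocess_model

( model: PreTrainedModel **kwargs )

Parameters

model (~transformers.PreTrainedModel) — The model to quantize.
kwargs (dict, optional) — The keyword arguments that are passed along to _process_model_before_weight_loading.

Sets model attributes and/or converts the model before the weights are loaded. At this point the model should be initialized on the meta device, so you can freely manipulate the skeleton of the model in order to replace modules. Make sure to override the abstract method _process_model_before_weight_loading.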
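The relationship between the two public hooks and the two abstract methods can be sketched with a toy class. Everything below is a hypothetical stand-in written in plain Python: ToyQuantizer does not inherit from HfQuantizer, the model is a plain dict, and toy_from_pretrained only mimics the order in which the hooks are described to run around weight loading.

```python
class ToyQuantizer:
    """Hypothetical stand-in that mimics the pre/post hook pattern."""

    def __init__(self):
        self.calls = []

    def preprocess_model(self, model, **kwargs):
        self.calls.append("preprocess_model")
        return self._process_model_before_weight_loading(model, **kwargs)

    def postprocess_model(self, model, **kwargs):
        self.calls.append("postprocess_model")
        return self._process_model_after_weight_loading(model, **kwargs)

    def _process_model_before_weight_loading(self, model, **kwargs):
        # Replace modules on the still-empty model skeleton.
        model["linear"] = "QuantizedLinear(placeholder)"
        return model

    def _process_model_after_weight_loading(self, model, **kwargs):
        model["is_quantized"] = True
        return model


def toy_from_pretrained(quantizer, checkpoint):
    model = {"linear": "Linear(placeholder)"}  # skeleton without weights
    model = quantizer.preprocess_model(model)
    model["weights"] = checkpoint  # weights are loaded between the two hooks
    return quantizer.postprocess_model(model)


quantizer = ToyQuantizer()
model = toy_from_pretrained(quantizer, {"linear.weight": [1, 2]})
print(quantizer.calls)
print(model)
```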

update_device_map

( device_map: typing.Optional[typing.Dict[str, typing.Any]] )

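Parameters

device_map (Union[dict, str], optional) — The device_map that is passed through the from_pretrained method.

Override this method if you want to replace the existing device map with a new one. For example, for bitsandbytes, since accelerate is a hard requirement, the device_map is set to "auto" if no device_map is passed.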

update_expected_keys

( model expected_keys: typing.List[str] loaded_keys: typing.List[str] )

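Parameters

expected_keys (List[str], optional) — The list of the expected keys in the initialized model.
loaded_keys (List[str], optional) — The list of the loaded keys in the checkpoint.

Override this method if you want to adjust expected_keys.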

update_missing_keys

( model missing_keys: typing.List[str] prefix: str )

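Parameters

missing_keys (List[str], optional) — The list of keys that are missing from the checkpoint compared to the state dict of the model.

Override this method if you want to adjust missing_keys.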

update_torch_dtype

( torch_dtype: torch.dtype )

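Parameters

torch_dtype (torch.dtype) — The input dtype that is passed to from_pretrained.

Some quantization methods require the dtype of the model to be set explicitly to a target dtype. Override this method if you want to make sure that behavior is preserved.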

validate_environment

( *args **kwargs )

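This method is used to check for potential conflicts among the arguments passed to from_pretrained. You need to define it for all future quantizers that are integrated with transformers. If no explicit check is needed, simply return nothing.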

HqqConfig

class transformers.HqqConfig

( nbits: int = 4 group_size: int = 64 view_as_float: bool = False axis: typing.Optional[int] = None dynamic_config: typing.Optional[dict] = None skip_modules: typing.List[str] = ['lm_head'] **kwargs )

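Parameters

nbits (int, optional, defaults to 4) — The number of bits. Supported values are (8, 4, 3, 2, 1).
group_size (int, optional, defaults to 64) — The group size. Supported values are any value that evenly divides weight.shape[axis].
view_as_float (bool, optional, defaults to False) — If set to True, views the quantized weight as float (used in distributed training).
axis (Optional[int], optional) — The axis along which grouping is performed. Supported values are 0 or 1.
dynamic_config (dict, optional) — Parameters for dynamic configuration. Each key is the name tag of a layer and each value is a quantization config. If set, each layer specified by its id uses its dedicated quantization configuration.
skip_modules (List[str], optional, defaults to ['lm_head']) — The list of nn.Linear layers to skip.
kwargs (Dict[str, Any], optional) — Additional parameters from which to initialize the configuration object.

This is a wrapper around hqq's BaseQuantizeConfig.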
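The divisibility constraint on group_size can be checked before building a config. The sketch below is a hypothetical helper in plain Python, not part of hqq or Transformers; the candidate sizes and the weight shape are made-up example values.

```python
def valid_group_sizes(weight_shape, axis, candidates=(64, 128, 256, 512)):
    """Return the candidate group sizes that evenly divide weight_shape[axis]."""
    if axis not in (0, 1):
        raise ValueError("axis must be 0 or 1")
    return [g for g in candidates if weight_shape[axis] % g == 0]


# Example weight of shape (4096, 11008), grouped along each axis in turn.
print(valid_group_sizes((4096, 11008), axis=0))
print(valid_group_sizes((4096, 11008), axis=1))
```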

from_dict

( config: typing.Dict[str, typing.Any] )

Overrides from_dict. Used by AutoQuantizationConfig.from_dict in quantizers/auto.py.

post_init

( )

Safety checker that verifies the arguments are correct. It also replaces some NoneType arguments with their default values.

to_diff_dict

( ) → Dict[str, Any]

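Returns

Dict[str, Any]

A dictionary of all the attributes that make up this configuration instance.

Removes all attributes from the config that correspond to the default config attributes, for better readability, and serializes the result to a Python dictionary.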

FbgemmFp8Config

class transformers.FbgemmFp8Config

( activation_scale_ub: float = 1200.0 modules_to_not_convert: typing.Optional[typing.List] = None **kwargs )

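Parameters

activation_scale_ub (float, optional, defaults to 1200.0) — The upper bound of the activation scale. This is used when quantizing the input activations.
modules_to_not_convert (list, optional, defaults to None) — The list of modules to not quantize, useful for quantizing models that explicitly require some modules to be left in their original precision.

This is a wrapper class for all the attributes and features you can use with a model that has been loaded using fbgemm fp8 quantization.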

CompressedTensorsConfig

class transformers.CompressedTensorsConfig

( config_groups: typing.Dict[str, typing.Union[ForwardRef('QuantizationScheme'), typing.List[str]]] = None format: str = 'dense' quantization_status: QuantizationStatus = 'initialized' kv_cache_scheme: typing.Optional[ForwardRef('QuantizationArgs')] = None global_compression_ratio: typing.Optional[float] = None ignore: typing.Optional[typing.List[str]] = None sparsity_config: typing.Dict[str, typing.Any] = None quant_method: str = 'compressed-tensors' **kwargs )

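Parameters

config_groups (typing.Dict[str, typing.Union[ForwardRef('QuantizationScheme'), typing.List[str]]], optional) — A dictionary that maps group names to quantization scheme definitions.
format (str, optional, defaults to "dense") — The format the model is represented in.
quantization_status (QuantizationStatus, optional, defaults to "initialized") — The status of the model in the quantization lifecycle, e.g. 'initialized', 'calibration', 'frozen'.
kv_cache_scheme (typing.Union[QuantizationArgs, NoneType], optional) — Specifies the quantization of the kv cache. If None, the kv cache is not quantized.
global_compression_ratio (typing.Union[float, NoneType], optional) — A float between 0 and 1 that gives the fraction of the model that is compressed.
ignore (typing.Union[typing.List[str], NoneType], optional) — Layer names or types to not quantize. Regular expressions are supported when prefixed with 're:'.
sparsity_config (typing.Dict[str, typing.Any], optional) — The configuration for sparsity compression.
quant_method (str, optional, defaults to "compressed-tensors") — Do not override; this should be compressed-tensors.

This is a wrapper class that handles compressed-tensors quantization config options. It is a wrapper around compressed_tensors.QuantizationConfig.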
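The ignore parameter accepts layer names, layer types, and regular expressions prefixed with 're:'. One way such a list could be interpreted is sketched below. The matching rules (exact match for plain entries, re.match for prefixed entries) are assumptions made for illustration, and is_ignored is a hypothetical helper rather than a function of the compressed-tensors library.

```python
import re


def is_ignored(layer_name, layer_type, ignore):
    """Return True if the layer matches any entry of the ignore list."""
    for entry in ignore or []:
        if entry.startswith("re:"):
            # Entries prefixed with 're:' are treated as regular expressions.
            if re.match(entry[3:], layer_name):
                return True
        elif entry in (layer_name, layer_type):
            # Plain entries must equal the layer name or the layer type.
            return True
    return False


ignore = ["lm_head", r"re:.*self_attn\.q_proj$", "Embedding"]
print(is_ignored("lm_head", "Linear", ignore))
print(is_ignored("model.layers.0.self_attn.q_proj", "Linear", ignore))
print(is_ignored("model.layers.0.mlp.up_proj", "Linear", ignore))
```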

from_dict

( config_dict return_unused_kwargs = False **kwargs ) → QuantizationConfigMixin

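Parameters

config_dict (Dict[str, Any]) — The dictionary that will be used to instantiate the configuration object.
return_unused_kwargs (bool, optional, defaults to False) — Whether to return a list of unused keyword arguments. Used by the from_pretrained method in PreTrainedModel.
kwargs (Dict[str, Any]) — Additional parameters from which to initialize the configuration object.

Returns

QuantizationConfigMixin

The configuration object instantiated from those parameters.

Instantiates a CompressedTensorsConfig from a Python dictionary of parameters. Optionally unwraps any arguments from the nested quantization_config.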

to_dict

( )

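The quantization config to be added to config.json.

Serializes this instance to a Python dictionary. Returns Dict[str, Any]: a dictionary of all the attributes that make up this configuration instance.

to_diff_dict

( ) → Dict[str, Any]

Returns

Dict[str, Any]

A dictionary of all the attributes that make up this configuration instance.

Removes all attributes from the config that correspond to the default config attributes, for better readability, and serializes the result to a Python dictionary.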

TorchAoConfig

class transformers.TorchAoConfig

( quant_type: str modules_to_not_convert: typing.Optional[typing.List] = None **kwargs )

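Parameters

quant_type (str) — The type of quantization to use. Currently supported: int4_weight_only, int8_weight_only, and int8_dynamic_activation_int8_weight.
modules_to_not_convert (list, optional, defaults to None) — The list of modules to not quantize, useful for quantizing models that explicitly require some modules to be left in their original precision.
kwargs (Dict[str, Any], optional) — The keyword arguments for the chosen type of quantization. For example, int4_weight_only quantization currently supports two keyword arguments, group_size and inner_k_tiles. More API examples and parameter documentation can be found at https://github.com/pytorch/ao/tree/main/torchao/quantization#other-available-quantization-techniques

This is a config class for torchao quantization/sparsity techniques.

Example: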

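import torch
from transformers import AutoModelForCausalLM, TorchAoConfig
# model_id is not defined in this snippet: set it to the Hub ID or local path of the checkpoint to load
quantization_config = TorchAoConfig("int4_weight_only", group_size=32)
# int4_weight_only quantization currently only works with the torch.bfloat16 dtype
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda", torch_dtype=torch.bfloat16, quantization_config=quantization_config)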

post_init

( )

Safety checker that verifies the arguments are correct. It also replaces some NoneType arguments with their default values.

BitNetConfig

class transformers.BitNetConfig

( modules_to_not_convert: typing.Optional[typing.List] = None **kwargs )

post_init

( )

Safety checker that verifies the arguments are correct.
