hf_transformers-Quantization

bitsandbytes

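bitsandbytes is the easiest option for quantizing a model to 8-bit and 4-bit. 8-bit quantization (the LLM.int8() algorithm) keeps outlier values in FP16 and multiplies the remaining, non-outlier values in INT8, converts the INT8 results back to FP16, and then adds the two parts together to produce the output in FP16. This reduces the degrading effect that outlier values have on a model's performance. 4-bit quantization compresses a model even further, and it is commonly used with QLoRA to finetune quantized LLMs.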
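The decomposition described above can be sketched in plain Python. This is only an illustration of the idea, not the bitsandbytes kernels: the helper names, the toy matrices, and the use of Python floats in place of FP16 are all assumptions made for the example.

```python
# Illustrative sketch of mixed-precision decomposition (not the bitsandbytes kernels).

def absmax_quantize(values):
    """Symmetric int8 quantization of a list of floats; returns (ints, scale)."""
    peak = max(abs(v) for v in values) or 1.0
    scale = 127.0 / peak
    return [round(v * scale) for v in values], scale

def int8_matvec_with_outliers(x, weight, threshold=6.0):
    """Compute y = x @ weight, routing outlier input features through float math.

    x is a list of length k; weight is a list of k rows with n entries each.
    """
    n = len(weight[0])
    outlier_idx = [i for i, v in enumerate(x) if abs(v) >= threshold]
    regular_idx = [i for i, v in enumerate(x) if abs(v) < threshold]
    y = [0.0] * n

    # Outlier features stay in higher precision (FP16 in the real algorithm).
    for i in outlier_idx:
        for j in range(n):
            y[j] += x[i] * weight[i][j]

    # Regular features: int8 x int8 products, accumulated as integers,
    # then rescaled back to floating point and added to the outlier part.
    if regular_idx:
        xq, sx = absmax_quantize([x[i] for i in regular_idx])
        for j in range(n):
            wq, sw = absmax_quantize([weight[i][j] for i in regular_idx])
            acc = sum(a * b for a, b in zip(xq, wq))
            y[j] += acc / (sx * sw)
    return y

x = [0.5, -1.2, 40.0, 0.3]  # the third feature is an outlier
weight = [[0.1, -0.2], [0.4, 0.3], [-0.05, 0.02], [0.7, -0.6]]

exact = [sum(x[i] * weight[i][j] for i in range(4)) for j in range(2)]
mixed = int8_matvec_with_outliers(x, weight)                          # outliers separated
naive = int8_matvec_with_outliers(x, weight, threshold=float("inf"))  # everything in int8
print("exact:", exact)
print("mixed:", mixed)
print("naive:", naive)
```

With the outlier feature separated, the regular features share a much smaller quantization range, which is why the mixed result tracks the exact product more closely than the all-int8 one in this toy case.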

To use bitsandbytes, make sure you have the following libraries installed:

8bit:

pip install transformers accelerate "bitsandbytes>0.37.0"

4bit:

pip install "bitsandbytes>=0.39.0"
pip install --upgrade accelerate transformers

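bitsandbytes is being refactored to support multiple backends beyond CUDA. Currently, the ROCm (AMD GPU) and Intel CPU implementations are mature, Intel XPU is in progress, and Apple Silicon support is expected by Q4/Q1. For installation instructions and the latest backend updates, visit the linked bitsandbytes documentation.

We value your feedback to help identify bugs before the full release! Check out these docs for more details and feedback links.

Now you can quantize a model by passing a BitsAndBytesConfig to the from_pretrained() method. This works for any model in any modality, as long as it supports loading with Accelerate and contains torch.nn.Linear layers.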

8bit:

Quantizing a model in 8-bit halves the memory usage. For large models, set device_map="auto" to use the available GPUs efficiently:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model_8bit = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b7", 
    quantization_config=quantization_config
)
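As a rough back-of-the-envelope check on the halving claim, the weight storage can be estimated from the parameter count and the bits per parameter. The sketch below is plain arithmetic under stated assumptions: the 1.7B parameter count is a round figure, and the estimate ignores modules that stay unquantized, quantization constants, and activations.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate storage needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30

num_params = 1_700_000_000  # assumed round figure for a ~1.7B-parameter model

for name, bits in [("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{name}: {weight_memory_gib(num_params, bits):.2f} GiB")
```

For a measured figure on a loaded model, the get_memory_footprint() method used later in these notes is the better source.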

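By default, all other modules (for example torch.nn.LayerNorm) are converted to torch.float16. If you want, you can change the data type of these modules with the torch_dtype parameter, for example to torch.float32: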

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model_8bit = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m", 
    quantization_config=quantization_config, 
    torch_dtype=torch.float32
)
model_8bit.model.decoder.layers[-1].final_layer_norm.weight.dtype

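Once a model is quantized to 8-bit, you can't push the quantized weights to the Hub unless you're using the latest versions of Transformers and bitsandbytes. If you have the latest versions, you can push the 8-bit model to the Hub with the push_to_hub() method. The quantization config.json file is pushed first, followed by the quantized model weights.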

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-560m", 
    quantization_config=quantization_config
)
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")

model.push_to_hub("bloom-560m-8bit")

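Training with 8-bit and 4-bit weights is only supported for training extra parameters. You can check the memory footprint of a loaded model with the get_memory_footprint() method: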

print(model.get_memory_footprint())

Quantized models can be loaded with the from_pretrained() method without specifying the load_in_8bit or load_in_4bit parameters:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("{your_username}/bloom-560m-8bit", device_map="auto")

4bit:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True)

model_4bit = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b7",
    quantization_config=quantization_config
)

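The rest of the workflow is the same as for 8-bit; just replace 8bit with 4bit.

When testing this code, do not name your .py file bitsandbytes.py: the script would shadow the installed bitsandbytes package on import.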

8-bit (LLM.int8() algorithm)

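Offloading

8-bit models can offload weights between the CPU and GPU to support fitting very large models into memory. The weights dispatched to the CPU are actually stored in float32 and aren't converted to 8-bit. For example, to enable offloading for the bigscience/bloom-1b7 model, start by creating a BitsAndBytesConfig: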

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(llm_int8_enable_fp32_cpu_offload=True)

Design a custom device map that places everything on the GPU except lm_head, which is dispatched to the CPU:

device_map = {
    "transformer.word_embeddings": 0,
    "transformer.word_embeddings_layernorm": 0,
    "lm_head": "cpu",
    "transformer.h": 0,
    "transformer.ln_f": 0,
}

Now load your model with the custom device_map and quantization_config:

model_8bit = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b7",
    device_map=device_map,
    quantization_config=quantization_config,
)
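When the list of top-level modules is long, writing the device map by hand becomes error-prone. The helper below is a hypothetical convenience function (build_device_map is not part of Transformers or Accelerate) that rebuilds the same kind of dictionary from a list of module names and the subset to keep on the CPU.

```python
def build_device_map(module_names, cpu_modules, gpu_index=0):
    """Map every module to one GPU, except the named modules kept on the CPU."""
    cpu_modules = set(cpu_modules)
    unknown = cpu_modules - set(module_names)
    if unknown:
        raise ValueError(f"Modules not found in the model: {sorted(unknown)}")
    return {name: ("cpu" if name in cpu_modules else gpu_index) for name in module_names}

# Top-level module names taken from the device map shown above.
bloom_modules = [
    "transformer.word_embeddings",
    "transformer.word_embeddings_layernorm",
    "lm_head",
    "transformer.h",
    "transformer.ln_f",
]

device_map = build_device_map(bloom_modules, cpu_modules=["lm_head"])
print(device_map)
```

The resulting dictionary can be passed as device_map in the same way as the hand-written one.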

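Outlier threshold

An "outlier" is a hidden state value greater than a certain threshold, and these values are computed in fp16. While the values are usually normally distributed ([-3.5, 3.5]), this distribution can be very different for large models ([-60, 6] or [6, 60]). 8-bit quantization works well for values of magnitude around 5, but beyond that there is a significant performance penalty. A good default threshold is 6, but a lower threshold may be needed for more unstable models (small models or finetuning).

To find the best threshold for your model, we recommend experimenting with the llm_int8_threshold parameter in BitsAndBytesConfig: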

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "bigscience/bloom-1b7"

quantization_config = BitsAndBytesConfig(
    llm_int8_threshold=10,
)

model_8bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map=device_map,
    quantization_config=quantization_config,
)

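Skip module conversion

For some models, like Jukebox, you don't need to quantize every module to 8-bit, and doing so can actually cause instability. With Jukebox, there are several lm_head modules that should be skipped using the llm_int8_skip_modules parameter in BitsAndBytesConfig: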

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "bigscience/bloom-1b7"

quantization_config = BitsAndBytesConfig(
    llm_int8_skip_modules=["lm_head"],
)

model_8bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=quantization_config,
)

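Finetuning

With the PEFT library, you can finetune large models such as flan-t5-large and facebook/opt-6.7b with 8-bit quantization. You don't need to pass the device_map parameter for training because the model is automatically loaded on a GPU. However, you can still customize the device map with the device_map parameter if you want to (device_map="auto" should only be used for inference).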

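4-bit (QLoRA algorithm)

Try 4-bit quantization in this notebook and learn more about its details in the blog post.

This section explores some of the specific features of 4-bit models, such as changing the compute data type, using the Normal Float 4 (NF4) data type, and using nested quantization.

Compute data type

To speed up computation, you can change the data type from float32 (the default value) to bf16 using the bnb_4bit_compute_dtype parameter in BitsAndBytesConfig: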

import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

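Normal Float 4 (NF4)

NF4 is a 4-bit data type from the QLoRA paper, adapted for weights initialized from a normal distribution. You should use NF4 for training 4-bit base models. This can be configured with the bnb_4bit_quant_type parameter in BitsAndBytesConfig:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig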

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
)

model_nf4 = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=nf4_config)

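For inference, the bnb_4bit_quant_type does not have a huge impact on performance. However, to remain consistent with the model weights, you should keep the bnb_4bit_quant_type and bnb_4bit_compute_dtype values that match how the model was quantized.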
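To give a feel for why a normal-distribution-shaped data type suits such weights, here is a minimal sketch of quantile-based 4-bit quantization. It is not the actual NF4 codebook from the QLoRA paper or bitsandbytes (the real table is constructed differently and includes an exact zero); the function names and the way the quantiles are spaced are assumptions for illustration only.

```python
from statistics import NormalDist

def normal_quantile_codebook(bits=4):
    """Build 2**bits levels from evenly spaced quantiles of N(0, 1), scaled to [-1, 1]."""
    levels = 2 ** bits
    nd = NormalDist()
    # Offset by half a step so the infinite 0 and 1 quantiles are never requested.
    quantiles = [nd.inv_cdf((i + 0.5) / levels) for i in range(levels)]
    peak = max(abs(q) for q in quantiles)
    return [q / peak for q in quantiles]

def quantize_block(block, codebook):
    """Normalize a block by its absolute maximum and store the nearest codebook index."""
    absmax = max(abs(v) for v in block) or 1.0
    indices = [
        min(range(len(codebook)), key=lambda k: abs(codebook[k] - v / absmax))
        for v in block
    ]
    return indices, absmax

def dequantize_block(indices, absmax, codebook):
    """Look up each index and rescale by the block's absolute maximum."""
    return [codebook[k] * absmax for k in indices]

codebook = normal_quantile_codebook()
block = [0.8, -0.1, 0.05, -1.6]
indices, absmax = quantize_block(block, codebook)
restored = dequantize_block(indices, absmax, codebook)
print(indices)
print(restored)
```

The levels are denser near zero, where most normally distributed weights fall, and sparser in the tails.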

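Nested quantization

Nested quantization is a technique that can save additional memory at no additional performance cost. This feature applies a second round of quantization to the quantization constants produced by the first round, saving an additional 0.4 bits/parameter. For example, with nested quantization, you can finetune a Llama-13b model on a 16GB NVIDIA T4 GPU with a sequence length of 1024, a batch size of 1, and gradient accumulation with 4 steps.

from transformers import AutoModelForCausalLM, BitsAndBytesConfig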

double_quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)

model_double_quant = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b", quantization_config=double_quant_config)
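The "0.4 bits/parameter" figure can be reproduced with a short calculation. The block sizes and constant widths below follow the numbers given in the QLoRA paper (64-weight blocks with 32-bit constants, then 8-bit constants re-quantized in blocks of 256); whether a given bitsandbytes version uses exactly these values is an assumption of this sketch.

```python
def constant_overhead_bits(block_size, constant_bits):
    """Extra bits per parameter spent on one quantization constant per block."""
    return constant_bits / block_size

# Single quantization: one 32-bit constant for every 64 weights.
single = constant_overhead_bits(block_size=64, constant_bits=32)

# Nested quantization: the constants themselves are stored in 8 bits,
# plus one 32-bit second-level constant for every 256 first-level constants.
double = constant_overhead_bits(block_size=64, constant_bits=8) + 32 / (64 * 256)

saving = single - double
print(f"single: {single:.3f} bits/param")
print(f"double: {double:.3f} bits/param")
print(f"saving: {saving:.3f} bits/param")
```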

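Dequantizing bitsandbytes models

Once a model is quantized, you can dequantize it back to its original precision, but this might result in a small loss of model quality. Make sure you have enough GPU memory to hold the dequantized model.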

from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer

model_id = "facebook/opt-125m"

# Load the quantized model (the config must be passed by keyword)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=BitsAndBytesConfig(load_in_4bit=True))
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Dequantize the model
model.dequantize()

# Prepare the input text
text = tokenizer("Hello my name is", return_tensors="pt").to(0)

# Generate the output
out = model.generate(**text)
print(tokenizer.decode(out[0]))

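GPTQ quantization

Try GPTQ quantization with PEFT in this notebook and learn more about its details in this blog post!

The AutoGPTQ library implements the GPTQ algorithm, a post-training quantization technique in which each row of the weight matrix is quantized independently to find a version of the weights that minimizes the error. These weights are quantized to int4, but they are restored to fp16 on the fly during inference. This can reduce memory usage by 4x because the int4 weights are dequantized in a fused kernel rather than in the GPU's global memory. You can also expect faster inference, because a lower bit width means less time spent moving data.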
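To make the "quantize to int4, restore on the fly" part concrete, here is a small sketch of group-wise 4-bit quantization of one weight row. Note that this is plain round-to-nearest, a much simpler baseline than GPTQ itself, which additionally adjusts the remaining weights to compensate for each rounding error; the function names and the group size are assumptions for the example.

```python
def quantize_row_int4(row, group_size=4):
    """Round-to-nearest 4-bit quantization with one (scale, offset) pair per group."""
    groups = []
    for start in range(0, len(row), group_size):
        chunk = row[start:start + group_size]
        lo, hi = min(chunk), max(chunk)
        scale = (hi - lo) / 15 or 1.0  # 16 levels: codes 0..15
        codes = [min(15, max(0, round((v - lo) / scale))) for v in chunk]
        groups.append((codes, scale, lo))
    return groups

def dequantize_row_int4(groups):
    """Restore approximate float weights from the 4-bit codes."""
    return [lo + code * scale for codes, scale, lo in groups for code in codes]

row = [0.12, -0.40, 0.33, 0.05, 1.50, -2.00, 0.75, 0.10]
groups = quantize_row_int4(row, group_size=4)
restored = dequantize_row_int4(groups)
print(groups)
print(restored)
```

Each restored weight differs from the original by at most half a quantization step of its group, which is the error GPTQ then works to reduce further.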

Before you begin

Make sure the following libraries are installed:

pip install auto-gptq
pip install --upgrade accelerate optimum transformers

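Quantize a model

To quantize a model (currently only text models are supported), you need to create a GPTQConfig and set the number of bits to quantize to, a dataset to calibrate the weights for quantization, and a tokenizer to prepare the dataset.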

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

You can also pass your own dataset as a list of strings, but it is highly recommended to use the same datasets as the GPTQ paper.

dataset = ["auto-gptq is an easy-to-use model quantization library with user-friendly APIs, based on GPTQ algorithm."]
gptq_config = GPTQConfig(bits=4, dataset=dataset, tokenizer=tokenizer)

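Load the model to quantize and pass gptq_config to the from_pretrained() method. Set device_map="auto" to automatically offload the model to the CPU to help fit it in memory, and to allow the model modules to be moved between the CPU and GPU for quantization.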

quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=gptq_config)

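Disk offloading is not supported. If you run out of memory because the dataset is too large, try passing the max_memory parameter to allocate the amount of memory to use on each device (GPU and CPU):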

quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", max_memory={0: "30GiB", 1: "46GiB", "cpu": "30GiB"}, quantization_config=gptq_config)

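Depending on your hardware, the time needed to quantize a model can vary. Quantizing the facebook/opt-350m model takes about 5 minutes on a free-tier Google Colab GPU, while quantizing a 175B-parameter model on an NVIDIA A100 can take about 4 hours. Before you quantize a model, it is a good idea to check the Hub to see whether a GPTQ-quantized version of it already exists.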

See also "Downloading the C4 dataset!" (allenai/allennlp, Discussion #5056).

The dataset is very large, so skimming that discussion is enough.

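Push the quantized model

Once your model is quantized, you can push the model and tokenizer to the Hub, where they can be easily shared and accessed. Use the push_to_hub() method to save the GPTQConfig: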

quantized_model.push_to_hub("opt-125m-gptq")
tokenizer.push_to_hub("opt-125m-gptq")

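You can also save the quantized model locally with the save_pretrained() method. If the model was quantized with the device_map parameter, make sure to move the entire model to a GPU or CPU before saving it. For example, to save the model on a CPU: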

quantized_model.save_pretrained("opt-125m-gptq")
tokenizer.save_pretrained("opt-125m-gptq")

# If the model was quantized with device_map set
quantized_model.to("cpu")
quantized_model.save_pretrained("opt-125m-gptq")

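Reload the quantized model

Reload a quantized model with the from_pretrained() method, and set device_map="auto" to automatically distribute the model across all available GPUs so that it loads faster without using more memory than needed: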

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="auto")

ExLlama

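ExLlama is a Python/C++/CUDA implementation of the Llama model that is designed for faster inference with 4-bit GPTQ weights (check out these benchmarks). The ExLlama kernel is activated by default when you create a GPTQConfig object. To boost inference speed even further, use the ExLlamaV2 kernels by configuring the exllama_config parameter: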

import torch
from transformers import AutoModelForCausalLM, GPTQConfig

gptq_config = GPTQConfig(bits=4, exllama_config={"version":2})
model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="auto", quantization_config=gptq_config)

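Only 4-bit models are supported, and we recommend deactivating the ExLlama kernels if you're finetuning a quantized model with PEFT.

If you're running inference on a CPU with AutoGPTQ (version > 0.4.2), you need to disable the ExLlama kernel. This overrides the attributes related to the ExLlama kernels in the config.json file.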

import torch
from transformers import AutoModelForCausalLM, GPTQConfig

gptq_config = GPTQConfig(bits=4, use_exllama=False)
model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="auto", quantization_config=gptq_config)

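AWQ quantization

Try AWQ quantization with this notebook!

Activation-aware Weight Quantization (AWQ) doesn't quantize all the weights in a model. Instead, it preserves a small percentage of weights that are important for LLM performance. This significantly reduces quantization loss, so you can run models in 4-bit precision without experiencing performance degradation.

There are several libraries for quantizing models with the AWQ algorithm, such as llm-awq, autoawq, or optimum-intel. Transformers supports loading models quantized with the llm-awq and autoawq libraries. This guide shows how to load models quantized with autoawq, but the process is similar for llm-awq quantized models.

Install autoawq:

pip install autoawq

Identify an AWQ-quantized model

AWQ-quantized models can be identified by checking the quantization_config attribute in the model's config.json file: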

{
  "_name_or_path": "/workspace/process/huggingfaceh4_zephyr-7b-alpha/source",
  "architectures": [
    "MistralForCausalLM"
  ],
  ...
  "quantization_config": {
    "quant_method": "awq",
    "zero_point": true,
    "group_size": 128,
    "bits": 4,
    "version": "gemm"
  }
}
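The same check can be done programmatically. The sketch below assumes the config has already been read into a dictionary (for example with json.load on a local config.json); is_awq_checkpoint is a hypothetical helper name, not a Transformers function.

```python
import json

def is_awq_checkpoint(config):
    """Return True if a parsed config.json declares AWQ as its quantization method."""
    quantization_config = config.get("quantization_config") or {}
    return quantization_config.get("quant_method") == "awq"

# A trimmed-down version of the config.json shown above.
config = json.loads("""
{
  "architectures": ["MistralForCausalLM"],
  "quantization_config": {
    "quant_method": "awq",
    "zero_point": true,
    "group_size": 128,
    "bits": 4,
    "version": "gemm"
  }
}
""")

print(is_awq_checkpoint(config))
print(is_awq_checkpoint({"architectures": ["MistralForCausalLM"]}))
```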

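Load a quantized model

A quantized model is loaded with the from_pretrained() method. If the model was loaded on the CPU, make sure to move it to your GPU device first. Use the device_map parameter to specify where to place the model: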

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/zephyr-7B-alpha-AWQ"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda:0")

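Loading an AWQ-quantized model automatically sets the other weights to fp16 by default for performance reasons. If you want to load these other weights in a different format, use the torch_dtype parameter:

import torch

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)

Combine with FlashAttention-2 to accelerate inference

AWQ quantization can also be combined with FlashAttention-2 to further accelerate inference: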

model = AutoModelForCausalLM.from_pretrained("TheBloke/zephyr-7B-alpha-AWQ", attn_implementation="flash_attention_2", device_map="cuda:0")

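Fused modules

Fused modules offer improved accuracy and performance. They are supported out of the box for AWQ modules on the Llama and Mistral architectures, but you can also fuse AWQ modules for unsupported architectures.

Note that fused modules cannot be combined with other optimization techniques such as FlashAttention-2.

To enable fused modules for supported architectures, create an AwqConfig and set the parameters fuse_max_seq_len and do_fuse=True. The fuse_max_seq_len parameter is the total sequence length, and it should include the context length and the expected generation length. You can set it to a larger value to be safe.

For example, to fuse the AWQ modules of the TheBloke/Mistral-7B-OpenOrca-AWQ model: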

import torch
from transformers import AwqConfig, AutoModelForCausalLM

model_id = "TheBloke/Mistral-7B-OpenOrca-AWQ"

quantization_config = AwqConfig(
    bits=4,
    fuse_max_seq_len=512,
    do_fuse=True,
)

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config).to(0)

AQLM

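Try AQLM on Google Colab!

Additive Quantization of Language Models (AQLM) is a compression method for large language models. It quantizes multiple weights together and takes advantage of the interdependencies between them. AQLM represents groups of 8-16 weights as a sum of multiple vector codes.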
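The "sum of multiple vector codes" idea can be illustrated with toy codebooks. This is a sketch of additive decoding only, not the aqlm kernels: the codebook contents are made up, and the bit count ignores per-group scales and the storage of the codebooks themselves.

```python
def decode_group(codes, codebooks):
    """Reconstruct one group of weights as the sum of one vector from each codebook."""
    group_size = len(codebooks[0][0])
    weights = [0.0] * group_size
    for code, codebook in zip(codes, codebooks):
        for i, value in enumerate(codebook[code]):
            weights[i] += value
    return weights

def bits_per_weight(num_codebooks, codebook_bits, group_size):
    """Bits spent on codes per weight, ignoring scales and codebook storage."""
    return num_codebooks * codebook_bits / group_size

# Two toy codebooks, each holding two candidate vectors for a group of 8 weights.
codebooks = [
    [[0.1] * 8, [0.5, -0.5] * 4],
    [[0.0] * 8, [0.25] * 8],
]

decoded = decode_group([1, 1], codebooks)
print(decoded)

# One 16-bit codebook shared by a group of 8 weights, as in a "1x16" setup.
print(bits_per_weight(num_codebooks=1, codebook_bits=16, group_size=8))
```

This is why a checkpoint name such as "2Bit-1x16" pairs a 2-bit budget with one 16-bit code per group.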

Install the AQLM library

To run the models, make sure the library is installed (note that AQLM only supports Python >= 3.10):

pip install aqlm[gpu,cpu]

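The library provides efficient kernels for both GPU and CPU inference and training.

Run an AQLM model

Instructions on how to quantize models yourself, along with all the relevant code, can be found in the corresponding GitHub repository. To run an AQLM model, simply load a model that has already been quantized with AQLM: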

from transformers import AutoTokenizer, AutoModelForCausalLM 
quantized_model = AutoModelForCausalLM.from_pretrained( "ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf", torch_dtype="auto", device_map="auto" ) 
tokenizer = AutoTokenizer.from_pretrained("ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf")

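PEFT support

Starting with aqlm version 1.0.2, AQLM supports parameter-efficient fine-tuning in the form of LoRA integrated into the PEFT library.

AQLM configurations

AQLM quantization setups vary mainly in the number of codebooks used and the codebook sizes in bits. The most popular setups, and the inference kernels they support, are tabulated in the upstream AQLM documentation.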

Quanto

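Try Quanto + transformers with this notebook!

The 🤗 Quanto library is a versatile PyTorch quantization toolkit. The quantization method used is linear quantization. Quanto provides several unique features, including:

weight quantization (float8, int8, int4, int2)
activation quantization (float8, int8)
modality agnostic (for example computer vision, language models)
device agnostic (for example CUDA, MPS, CPU)
compatibility with torch.compile
easy to add custom kernels for a specific device
support for quantization-aware training

Before you begin, make sure the following libraries are installed: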

pip install quanto accelerate transformers

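Now you can quantize a model by passing a QuantoConfig object to the from_pretrained() method. This works for any model in any modality, as long as it contains torch.nn.Linear layers.

The integration with transformers only supports weight quantization. For more complex use cases such as activation quantization, calibration, and quantization-aware training, you should use the Quanto library directly.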

from transformers import AutoModelForCausalLM, AutoTokenizer, QuantoConfig

model_id = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
quantization_config = QuantoConfig(weights="int8")
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda:0", quantization_config=quantization_config)

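Note that serialization is not supported yet with transformers, but it is coming soon! If you want to save the model, you can use the Quanto library instead.

The Quanto library uses a linear quantization algorithm. Even though this is a basic quantization technique, we get very good results!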

EETQ

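The EETQ library supports int8 per-channel weight-only quantization for NVIDIA GPUs. Its high-performance GEMM and GEMV kernels come from FasterTransformer and TensorRT-LLM. It requires no calibration dataset and no pre-quantized model. Moreover, the accuracy degradation is negligible thanks to the per-channel quantization.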
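A toy comparison shows why per-channel scales help when output channels have very different magnitudes. This sketch is not the EETQ implementation: the function name, the symmetric int8 scheme, and the example weights are assumptions for illustration.

```python
def roundtrip_errors(weight, per_channel):
    """Max absolute int8 round-trip error for each output channel (row) of a matrix."""
    tensor_peak = max(abs(v) for row in weight for v in row)
    errors = []
    for row in weight:
        # Per-channel: every row gets its own scale. Per-tensor: one shared scale.
        peak = max(abs(v) for v in row) if per_channel else tensor_peak
        scale = (peak / 127) or 1.0
        codes = [max(-127, min(127, round(v / scale))) for v in row]
        errors.append(max(abs(v - c * scale) for v, c in zip(row, codes)))
    return errors

# Row 0 holds small weights, row 1 holds large ones.
weight = [
    [0.01, -0.02, 0.015],
    [5.0, -3.0, 4.0],
]

per_tensor = roundtrip_errors(weight, per_channel=False)
per_channel = roundtrip_errors(weight, per_channel=True)
print("per-tensor :", per_tensor)
print("per-channel:", per_channel)
```

With a single shared scale, the small-magnitude row is rounded almost entirely to zero, whereas its own scale preserves it.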

Make sure to install EETQ from the release page:

pip install --no-cache-dir https://github.com/NetEase-FuXi/EETQ/releases/download/v1.0.0/EETQ-1.0.0+cu121+torch2.1.2-cp310-cp310-linux_x86_64.whl

Or install it from source:

git clone https://github.com/NetEase-FuXi/EETQ.git
cd EETQ/
git submodule update --init --recursive
pip install .

An unquantized model can be quantized with the from_pretrained method:

from transformers import AutoModelForCausalLM, EetqConfig

path = "/path/to/model"
quantization_config = EetqConfig("int8")
model = AutoModelForCausalLM.from_pretrained(path, device_map="auto", quantization_config=quantization_config)

A quantized model can be saved with the save_pretrained method and used again with the from_pretrained method:

quant_path = "/path/to/save/quantized/model"
model.save_pretrained(quant_path)
model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto")

HQQ

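Half-Quadratic Quantization (HQQ) implements on-the-fly quantization via fast robust optimization. It doesn't require calibration data and can be used to quantize any model. Refer to the official package for more details.

Installation

We recommend installing it as follows to get the latest version and build the corresponding CUDA kernels:

pip install hqq

Quantize a model

To quantize a model, you need to create an HqqConfig object. There are two ways to do this: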

from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

# Method 1: all linear layers use the same quantization config
quant_config = HqqConfig(nbits=8, group_size=64)

# Method 2: each linear layer with the same tag uses a dedicated quantization config
q4_config = {'nbits': 4, 'group_size': 64}
q3_config = {'nbits': 3, 'group_size': 32}
quant_config = HqqConfig(dynamic_config={
    'self_attn.q_proj': q4_config,
    'self_attn.k_proj': q4_config,
    'self_attn.v_proj': q4_config,
    'self_attn.o_proj': q4_config,
    
    'mlp.gate_proj': q3_config,
    'mlp.up_proj': q3_config,
    'mlp.down_proj': q3_config,
})

The second approach is especially interesting for quantizing Mixture-of-Experts (MoE) models, because the experts are less affected by lower quantization settings.
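When many layers share a setting, the per-tag dictionary used in method 2 can be generated instead of typed out. The helper below is a hypothetical convenience function (build_dynamic_config is not part of HQQ or Transformers), and the default tag names are simply the ones used in the example above.

```python
def build_dynamic_config(
    attn_config,
    mlp_config,
    attn_tags=("q_proj", "k_proj", "v_proj", "o_proj"),
    mlp_tags=("gate_proj", "up_proj", "down_proj"),
):
    """Assign one quantization config to attention layers and another to MLP layers."""
    config = {f"self_attn.{tag}": dict(attn_config) for tag in attn_tags}
    config.update({f"mlp.{tag}": dict(mlp_config) for tag in mlp_tags})
    return config

q4_config = {"nbits": 4, "group_size": 64}
q3_config = {"nbits": 3, "group_size": 32}

dynamic_config = build_dynamic_config(q4_config, q3_config)
for tag, settings in dynamic_config.items():
    print(tag, settings)
```

The resulting dictionary has the same shape as the hand-written one and can be passed as HqqConfig(dynamic_config=dynamic_config).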

Next, you can quantize the model as follows:

import torch

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="cuda",
    quantization_config=quant_config
)

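Optimized runtime

HQQ supports various backends, including pure PyTorch and custom dequantization CUDA kernels. These backends are suitable for older GPUs and for PEFT/QLoRA training. For faster inference, HQQ supports 4-bit fused kernels (TorchAO and Marlin), reaching up to 200 tokens/sec on a single 4090. For more details on how to use these backends, refer to the official HQQ documentation.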

FBGEMM FP8

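With the FBGEMM FP8 quantization method, you can quantize your model to FP8 (W8A8):

the weights are quantized to 8-bit (FP8) per channel
the activations are quantized to 8-bit (FP8) per token

This method relies on the FBGEMM library, which provides efficient low-precision general matrix multiplication for small batch sizes and supports accuracy-loss-minimizing techniques such as row-wise quantization and outlier-aware quantization.

You need a GPU with compute capability >= 9 (for example, H100).

Before you begin

Make sure the latest versions of the following libraries are installed: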

pip install --upgrade accelerate fbgemm-gpu torch

If you are having issues with the fbgemm-gpu and torch libraries, you might need to install the nightly release. You can follow the linked instructions to do so.

from transformers import FbgemmFp8Config, AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"
quantization_config = FbgemmFp8Config()
quantized_model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", quantization_config=quantization_config)

tokenizer = AutoTokenizer.from_pretrained(model_name)
input_text = "What are we having for dinner?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

output = quantized_model.generate(**input_ids, max_new_tokens=10)
print(tokenizer.decode(output[0], skip_special_tokens=True))

You can save a quantized model with the save_pretrained method and use it again with the from_pretrained method.

quant_path = "/path/to/save/quantized/model"
quantized_model.save_pretrained(quant_path)
model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto")

Optimum

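The Optimum library supports quantization for Intel, Furiosa, ONNX Runtime, GPTQ, and lower-level PyTorch quantization functions. Consider using Optimum for quantization if you're using specific, optimized hardware such as Intel CPUs, Furiosa NPUs, or a model accelerator like ONNX Runtime.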

TorchAO

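TorchAO is an architecture optimization library for PyTorch. It provides high-performance data types, optimization techniques, and kernels for inference and training, and it composes with native PyTorch features such as torch.compile and FSDP. Some benchmark numbers can be found in the linked page.

Before you begin, make sure the latest versions of the following libraries are installed: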

pip install --upgrade torch torchao

import torch
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"
# We support int4_weight_only, int8_weight_only and int8_dynamic_activation_int8_weight
# More examples and documentation for the arguments can be found at https://github.com/pytorch/ao/tree/main/torchao/quantization#other-available-quantization-techniques
quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
quantized_model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", quantization_config=quantization_config)

tokenizer = AutoTokenizer.from_pretrained(model_name)
input_text = "What are we having for dinner tonight?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

# Compile the quantized model to get a speedup
import torchao
torchao.quantization.utils.recommended_inductor_config_setter()
quantized_model = torch.compile(quantized_model, mode="max-autotune")

output = quantized_model.generate(**input_ids, max_new_tokens=10)
print(tokenizer.decode(output[0], skip_special_tokens=True))

# Benchmark the performance
import torch.utils.benchmark as benchmark

def benchmark_fn(f, *args, **kwargs):
    # Manual warmup
    for _ in range(5):
        f(*args, **kwargs)
        
    t0 = benchmark.Timer(
        stmt="f(*args, **kwargs)",
        globals={"args": args, "kwargs": kwargs, "f": f},
        num_threads=torch.get_num_threads(),
    )
    return f"{(t0.blocked_autorange().mean):.3f}"

MAX_NEW_TOKENS = 1000
print("int4wo-128 model:", benchmark_fn(quantized_model.generate, **input_ids, max_new_tokens=MAX_NEW_TOKENS))

bf16_model = AutoModelForCausalLM.from_pretrained(model_name, device_map="cuda", torch_dtype=torch.bfloat16)
bf16_model = torch.compile(bf16_model, mode="max-autotune")
print("bf16 model:", benchmark_fn(bf16_model.generate, **input_ids, max_new_tokens=MAX_NEW_TOKENS))

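Serialization and deserialization

torchao quantization is implemented with tensor subclasses, so it only works with Hugging Face non-safetensors serialization and deserialization. It relies on torch.load(..., weights_only=True) to avoid executing arbitrary user code at load time, and it uses add_safe_globals to allowlist some known user functions.

Safetensors serialization is not supported because wrapper tensor subclasses offer maximum flexibility, which keeps the effort of supporting new quantized tensor formats low. Safetensors, in contrast, optimizes for maximum safety (no user code execution), which also means that new quantization formats would have to be supported manually.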

# Save the quantized model locally
output_dir = "llama3-8b-int4wo-128"
quantized_model.save_pretrained(output_dir, safe_serialization=False)

# Push to the Hugging Face Hub
# save_to = "{user_id}/llama3-8b-int4wo-128"
# quantized_model.push_to_hub(save_to, safe_serialization=False)

# Load the quantized model
ckpt_id = "llama3-8b-int4wo-128"  # or a Hugging Face Hub model ID
loaded_quantized_model = AutoModelForCausalLM.from_pretrained(ckpt_id, device_map="cuda")

# Confirm the speedup
loaded_quantized_model = torch.compile(loaded_quantized_model, mode="max-autotune")
print("loaded int4wo-128 model:", benchmark_fn(loaded_quantized_model.generate, **input_ids, max_new_tokens=MAX_NEW_TOKENS))

BitNet

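BitNet replaces the traditional linear layers in multi-head attention and feed-forward networks with specialized layers called BitLinear, which have ternary (or, in the older version, binary) precision. The BitLinear layers introduced here quantize the weights with ternary precision (values of -1, 0, and 1) and quantize the activations to 8-bit precision.

During training, the weights are first quantized to ternary values using symmetric per-tensor quantization. We compute the average of the absolute values of the weight matrix and use it as the scale. We then divide the weights by the scale, round the values, clamp them between -1 and 1, and finally rescale them to continue in full precision.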

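The activations are then quantized to a specified bit width (for example, 8-bit) using absmax quantization (symmetric per-channel quantization). This involves scaling the activations into the range [-128, 127]. The quantization formula is:

$$ s_x = \frac{127}{\max |X|}, \qquad X_q = \mathrm{clamp}\big(\mathrm{round}(X \cdot s_x),\, -128,\, 127\big), \qquad X_{\text{dequantized}} = \frac{X_q}{s_x} $$

where the maximum is taken along the last dimension of X.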
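The two quantization steps described above can be sketched in plain Python. This is an illustration of the arithmetic only, not the BitLinear implementation: the function names are made up, and nested lists stand in for tensors.

```python
def quantize_weights_ternary(weight):
    """Per-tensor ternary quantization: scale by the mean absolute value, round, clamp."""
    magnitudes = [abs(v) for row in weight for v in row]
    scale = sum(magnitudes) / len(magnitudes) or 1.0
    ternary = [[max(-1, min(1, round(v / scale))) for v in row] for row in weight]
    return ternary, scale

def quantize_activations_int8(activations):
    """Absmax quantization of one activation vector into the int8 range."""
    peak = max(abs(v) for v in activations) or 1.0
    scale = 127.0 / peak
    codes = [max(-128, min(127, round(v * scale))) for v in activations]
    return codes, scale

weight = [[0.9, -0.05, -1.4], [0.3, 0.0, 2.1]]
ternary, w_scale = quantize_weights_ternary(weight)
print(ternary, w_scale)

activations = [0.02, -3.5, 1.25, 0.7]
codes, x_scale = quantize_activations_int8(activations)
print(codes, x_scale)

# Rescaling back to full precision, as described in the text.
dequantized_weight = [[t * w_scale for t in row] for row in ternary]
dequantized_activations = [c / x_scale for c in codes]
```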

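To learn more about how we trained and fine-tuned BitNet models, check out the blog post "Fine-tuning LLMs to 1.58bit: extreme quantization made easy".

Load a BitNet model from the Hub

BitNet models can't be quantized on the fly. They need to be pre-trained or fine-tuned with the quantization applied (it is a quantization-aware training technique). Once trained, these models are already quantized and are available as packed versions on the Hub.

A quantized model can be loaded as follows: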

from transformers import AutoModelForCausalLM
path = "/path/to/model"
model = AutoModelForCausalLM.from_pretrained(path, device_map="auto")

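Pre-training / fine-tuning a BitNet model

If you want to pre-train or fine-tune your own 1.58-bit model with Nanotron, check out the PR "FEAT: Adding 1.58bit LLMs training architecture in nanotron" by MekkCyber (Pull Request #180, huggingface/nanotron on GitHub). Everything you need to get started is there!

For fine-tuning, you need to convert the model from the Hugging Face format to the Nanotron format (there are some differences between the two). You can find the conversion steps in that PR.

Kernels

In our initial version, we chose to use @torch.compile to unpack the weights and perform the forward pass. This is very simple to implement and delivers a significant speedup. We plan to integrate additional optimized kernels in future versions.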

Compressed Tensors

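The compressed-tensors library provides a flexible and efficient way to store and manage compressed model checkpoints. It supports a variety of quantization and sparsity schemes, making it a unified format for handling different model optimizations such as GPTQ, AWQ, SmoothQuant, INT8, FP8, and SparseGPT.

Supported formats include:

dense: the dense format
int-quantized: INT8 quantized models
float-quantized: FP8 quantized models; E4M3 is currently supported
pack-quantized: INT4 or INT8 weight-quantized models, packed into INT32. For INT4, the weights have an INT4 range but are stored as INT8 and then packed into INT32.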
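To illustrate what packing 4-bit values into 32-bit words means, here is a minimal sketch that stores eight signed INT4 values per word. The bit layout (lowest nibble first, two's complement) is an assumption chosen for the example and is not necessarily the layout that compressed-tensors uses on disk.

```python
def pack_int4(values):
    """Pack signed 4-bit integers (-8..7) into 32-bit words, eight values per word."""
    words = []
    for start in range(0, len(values), 8):
        word = 0
        for slot, value in enumerate(values[start:start + 8]):
            if not -8 <= value <= 7:
                raise ValueError(f"{value} does not fit in a signed 4-bit integer")
            word |= (value & 0xF) << (4 * slot)  # two's-complement nibble
        words.append(word)
    return words

def unpack_int4(words, count):
    """Recover the first `count` signed 4-bit integers from packed 32-bit words."""
    values = []
    for word in words:
        for slot in range(8):
            nibble = (word >> (4 * slot)) & 0xF
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return values[:count]

values = [-8, 7, 0, -1, 3, -4, 5, 2, 1, -2]
words = pack_int4(values)
print([hex(w) for w in words])
print(unpack_int4(words, len(values)))
```

Ten 4-bit values fit in two 32-bit words here, compared with ten bytes when each value is stored as INT8.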

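Compressed models can be easily created with llm-compressor. Models can also be created independently and serialized with a compressed-tensors config.

To find existing models on the Hugging Face Model Hub, search for the compressed-tensors tag.

Features:

Weight and activation precisions: FP8, INT4, INT8 (for Q/DQ, arbitrary precision is allowed for INT)
Quantization scale and zero-point strategies: tensor, channel, group, block, token
Dynamic per-token activation quantization (or any static strategy)
Weight sparsity (unstructured or semi-structured, such as 2:4) can be composed with quantization for extreme compression
Supports quantization of arbitrary modules, not just Linear modules
Targeted support for, or ignoring of, modules by name or class

Installation

It is recommended to install a stable release of compressed-tensors from PyPI: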

pip install compressed-tensors

Developers who want to experiment with the latest features can also install the package from source:

git clone https://github.com/neuralmagic/compressed-tensors
cd compressed-tensors
pip install -e .

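Quickstart: loading a model

Quantized models can be easily loaded for inference, as shown below. At the moment, only models that have already been quantized can be loaded. To quantize a model into the compressed-tensors format, see llm-compressor.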

from transformers import AutoModelForCausalLM

# Load the model in compressed-tensors format
ct_model = AutoModelForCausalLM.from_pretrained("nm-testing/Meta-Llama-3.1-8B-Instruct-FP8-hf")

# Measure memory usage
mem_params = sum([param.nelement() * param.element_size() for param in ct_model.parameters()])
print(f"{mem_params/2**30:.4f} GB")
# 8.4575 GB

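As shown above, the compressed-tensors FP8 checkpoint of Llama 3.1 8B can be loaded for inference using half the memory of the unquantized reference checkpoint.

Sample use case: load and run an FP8 model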

from transformers import AutoModelForCausalLM, AutoTokenizer

prompt = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is"
]

model_name = "nm-testing/Meta-Llama-3-8B-Instruct-fp8-hf_compat"

quantized_model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

inputs = tokenizer(prompt, return_tensors="pt")
generated_ids = quantized_model.generate(**inputs, max_length=50, do_sample=False)
outputs = tokenizer.batch_decode(generated_ids)

print(outputs)

Output:

['<|begin_of_text|>Hello, my name is [Name]. I am a [Your Profession/Student] and I am here to learn about the [Course/Program] at [University/Institution]. I am excited to be here and I am looking forward to', '<|begin_of_text|>The capital of France is Paris, which is located in the north-central part of the country. Paris is the most populous city in France and is known for its stunning architecture, art museums, fashion, and romantic atmosphere. The city is home to', "<|begin_of_text|>The future of AI is here, and it's already changing the way we live and work. From virtual assistants to self-driving cars, AI is transforming industries and revolutionizing the way we interact with technology. But what does the future of AI hold"]

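The example above shows how to generate text with a compressed-tensors model. Currently, a model cannot be saved once it has been loaded.

Deep dive into a compressed-tensors model checkpoint

In this example, we examine the configuration entries of the compressed-tensors model nm-testing/Meta-Llama-3.1-8B-Instruct-FP8-hf and see how they translate into the loaded model representation.

First, let us look at the model's quantization_config. At a glance the number of entries looks overwhelming, but this is because compressed-tensors is a format that allows flexible expression both during and after model compression.

In practice, for checkpoint loading and inference the configuration can be simplified by leaving out all default or empty entries, so we do that here to focus on what compression is actually represented.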

"quantization_config": {
  "config_groups": {
    "group_0": {
      "input_activations": {
        "num_bits": 8,
        "strategy": "tensor",
        "type": "float"
      },
      "targets": ["Linear"],
      "weights": {
        "num_bits": 8,
        "strategy": "tensor",
        "type": "float"
      }
    }
  },
  "format": "naive-quantized",
  "ignore": ["lm_head"],
  "quant_method": "compressed-tensors",
  "quantization_status": "frozen"
}

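From the config above, we can see that it specifies one config group, which quantizes the weights and activations to FP8 with a static per-tensor strategy. It is also worth noting that the ignore list contains an entry to skip quantization of the lm_head module, so that module is left untouched in the checkpoint.

To see the result of the configuration in practice, we can use the safetensors viewer on the model card to inspect the quantized weights, input scales, and weight scales of all the Linear modules (and the contents of the remaining layers).

When we load the model with the compressed-tensors HFQuantizer integration, we can see that all the Linear modules specified in the quantization config have been replaced by CompressedLinear modules, which manage the compressed weights and the forward pass for inference. Note that the lm_head mentioned earlier in the ignore list is still kept as an unquantized Linear module.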

from transformers import AutoModelForCausalLM

ct_model = AutoModelForCausalLM.from_pretrained("nm-testing/Meta-Llama-3.1-8B-Instruct-FP8-hf")
print(ct_model)
"""
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): CompressedLinear(
            in_features=4096, out_features=4096, bias=False
            (input_observer): MovingAverageMinMaxObserver()
            (weight_observer): MovingAverageMinMaxObserver()
          )
          (k_proj): CompressedLinear(
            in_features=4096, out_features=1024, bias=False
            (input_observer): MovingAverageMinMaxObserver()
            (weight_observer): MovingAverageMinMaxObserver()
          )
          (v_proj): CompressedLinear(
            in_features=4096, out_features=1024, bias=False
            (input_observer): MovingAverageMinMaxObserver()
            (weight_observer): MovingAverageMinMaxObserver()
          )
          (o_proj): CompressedLinear(
            in_features=4096, out_features=4096, bias=False
            (input_observer): MovingAverageMinMaxObserver()
            (weight_observer): MovingAverageMinMaxObserver()
          )
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): CompressedLinear(
            in_features=4096, out_features=14336, bias=False
            (input_observer): MovingAverageMinMaxObserver()
            (weight_observer): MovingAverageMinMaxObserver()
          )
          (up_proj): CompressedLinear(
            in_features=4096, out_features=14336, bias=False
            (input_observer): MovingAverageMinMaxObserver()
            (weight_observer): MovingAverageMinMaxObserver()
          )
          (down_proj): CompressedLinear(
            in_features=14336, out_features=4096, bias=False
            (input_observer): MovingAverageMinMaxObserver()
            (weight_observer): MovingAverageMinMaxObserver()
          )
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((4096,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=4096, out_features=128256, bias=False)
)
"""

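Contribute a new quantization method

Transformers supports and integrates many quantization methods, such as QLoRA, GPTQ, LLM.int8, and AWQ. However, there are other quantization approaches that are not yet integrated. To make adding and using these methods easier, you should use the HfQuantizer class. HfQuantizer is designed as an internal helper class for adding a quantization method, rather than something you apply to every PyTorch module.

This guide shows you how to integrate a new quantization method with the HfQuantizer class.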

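Requirements

Before integrating a new quantization method into Transformers, make sure the method you want to add meets the following prerequisites. Only quantization methods that can run with PyTorch modules are currently supported.

- The quantization method is available through a Python package that anyone can pip install (it is also fine if the package can only be installed from source). Ideally, pre-compiled kernels are included in the pip package.
- The method can run on commonly used hardware (CPU, GPU, and so on).
- The method is wrapped in an nn.Module (for example, Linear8bitLt or Linear4bit), and the quantized linear layer should have the following definition:

class Linear4bit(nn.Module):
    def __init__(self, ...):
        ...
    
    def forward(self, x):
        return my_4bit_kernel(x, self.weight, self.bias)

This way, Transformers models can be easily quantized by replacing some instances of nn.Linear with a target class.
- The quantization method should be serializable. You can save the quantized weights locally or push them to the Hub.
- Make sure the package that contains the quantization kernels/primitives is stable (no frequent breaking changes).
- Some quantization methods may require "pre-quantizing" the model through data calibration (for example, AWQ). In this case, we prefer to only support inference in Transformers and let the third-party library maintained by the ML community handle the model quantization itself.

Build a new HfQuantizer class

1. Create a new quantization config class in src/transformers/utils/quantization_config.py, and make sure to expose the new quantization config by adding it to the _import_structure object in src/transformers/__init__.py.
2. Create a new file named quantizer_your_method.py in src/transformers/quantizers/, and make it inherit from src/transformers/quantizers/base.py::HfQuantizer. Make sure to add the new quantizer and quantization config to the quantization auto-mapping in src/transformers/quantizers/auto.py.
3. Define the following class attributes/property methods for your quantization method:
- requires_calibration: whether the quantization method requires a data calibration process. If set to True, you can only support inference (with quantized weights), not both inference and quantization.
- required_packages: a list of strings naming the packages required to use the quantized weights. You might need to define some new utility methods, such as is_auto_awq_available, in transformers/src/utils/import_utils.py.
- requires_parameters_quantization: only required if your quantization method needs extra attention to the underlying nn.Parameter object. For example, bitsandbytes uses Params4bit and Int8Param, which require extra attention when quantizing the model. Most recent quantization methods pack int2/int4 weights inside torch.uint8 weights, so this flag should not really be needed (it defaults to False).
- is_serializable: a property method that determines whether the method is serializable.
- is_trainable: a property method that determines whether you can fine-tune models on top of the quantization method (with or without PEFT approaches).
4. Write the validate_environment and update_torch_dtype methods. These methods are called before the quantized model is created, to make sure users are using the right configuration. You can look at how this is done in other quantizers.
5. Write the _process_model_before_weight_loading method. In Transformers, quantized models are first initialized on the "meta" device before the weights are loaded. This means the _process_model_before_weight_loading method takes care of manipulating the model skeleton to replace some modules (for example, nn.Linear) with the target modules (quantization modules). You can define the module replacement logic or any other utility method by creating a new file in transformers/src/integrations/ and exposing the relevant methods in that folder's __init__.py file. The best starting point is to look at another quantization method, such as quantizer_awq.py.
6. Write the _process_model_after_weight_loading method. This method lets you implement additional features that require manipulating the model after the weights are loaded.
7. Document everything! Make sure your quantization method is documented by adding a new file under docs/source/en/quantization and adding a new row to the table in docs/source/en/quantization/overview.md.
8. Add tests! First add the package to our nightly Dockerfile in docker/transformers-quantization-latest-gpu, and then add a new test file in tests/quantization/xxx. Feel free to check out how it is implemented for other quantization methods.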