Large-model quantization fundamentals, taught by the Hugging Face team: Quantization Fundamentals with Hugging Face


Quantization Fundamentals with Hugging Face

This post contains my study notes for the course https://www.deeplearning.ai/short-courses/quantization-fundamentals-with-hugging-face/ .


What you’ll learn in this course

Generative AI models, like large language models, often exceed the capabilities of consumer-grade hardware and are expensive to run. Compressing models through methods such as quantization makes them more efficient, faster, and accessible. This allows them to run on a wide variety of devices, including smartphones, personal computers, and edge devices, and minimizes performance degradation.

Join this course to:

  • Quantize any open source model with linear quantization using the Quanto library.
  • Get an overview of how linear quantization is implemented. This form of quantization can be applied to compress any model, including LLMs, vision models, etc.
  • Apply “downcasting,” another form of quantization, with the Transformers library, which enables you to load models in about half their normal size in the BFloat16 data type.

By the end of this course, you will have a foundation in quantization techniques and be able to apply them to compress and optimize your own generative AI models, making them more accessible and efficient.

Table of Contents

  • Quantization Fundamentals with Hugging Face
    • What you’ll learn in this course
  • Handling Big Models
  • Lesson 2: Data Types and Sizes
      • Integers
      • Floating Points
      • Downcasting
  • Lesson 3: Loading ML Models with Different Data Types
    • Model Casting: `float16`
    • Model Casting: `bfloat16`
        • Note about deepcopy
    • Using Popular Generative Models in Different Data Types
        • To get the sample code that Younes showed:
      • Model Performance: `float32` vs `bfloat16`
      • Default Data Type
      • Note
  • Lesson 4: Quantization Theory
        • Libraries to install
    • T5-FLAN
      • Without Quantization
    • Quantize the model (8-bit precision)
      • Freeze the model
      • Try running inference on the quantized model
        • Comparing "linear quantization" to "downcasting"
  • Quantization of LLMs
  • Afterword

Handling Big Models

The lecture slides introduce three ways of making big models more manageable (pruning, knowledge distillation, and quantization), then review the data types involved and outline what this course covers.

Lesson 2: Data Types and Sizes

In this lab, you will learn about the common data types used to store the parameters of machine learning models.

The libraries are already installed in the classroom. If you’re running this notebook on your own machine, you can install the following:

!pip install torch==2.1.1
import torch

Integers

Integer types in PyTorch:

# Information of `8-bit unsigned integer`
torch.iinfo(torch.uint8)

Output

iinfo(min=0, max=255, dtype=uint8)
# Information of `8-bit (signed) integer`
torch.iinfo(torch.int8)

Output

iinfo(min=-128, max=127, dtype=int8)
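These ranges follow directly from the bit width: an unsigned $b$-bit integer spans $[0, 2^b - 1]$, and a signed (two's-complement) $b$-bit integer spans $[-2^{b-1}, 2^{b-1} - 1]$. A quick sketch (my own check, not part of the course notebook):

import torch

def int_range(bits, signed=True):
    # Two's-complement range for signed integers, [0, 2**bits - 1] for unsigned
    if signed:
        return -2 ** (bits - 1), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1

print(int_range(8, signed=False))   # (0, 255)
print(int_range(8))                 # (-128, 127)
print(torch.iinfo(torch.int16))     # expected: iinfo(min=-32768, max=32767, dtype=int16)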

Floating Points

The slides then walk through floating-point representation in general, the FP32 and FP16 formats, a comparison between them, and how floating-point types appear in PyTorch.

# by default, python stores float data in fp64
value = 1/3
format(value, '.60f')

Output

'0.333333333333333314829616256247390992939472198486328125000000'
# 64-bit floating point
tensor_fp64 = torch.tensor(value, dtype = torch.float64)
print(f"fp64 tensor: {format(tensor_fp64.item(), '.60f')}")

Output

fp64 tensor: 0.333333333333333314829616256247390992939472198486328125000000
tensor_fp32 = torch.tensor(value, dtype = torch.float32)
tensor_fp16 = torch.tensor(value, dtype = torch.float16)
tensor_bf16 = torch.tensor(value, dtype = torch.bfloat16)

print(f"fp64 tensor: {format(tensor_fp64.item(), '.60f')}")
print(f"fp32 tensor: {format(tensor_fp32.item(), '.60f')}")
print(f"fp16 tensor: {format(tensor_fp16.item(), '.60f')}")
print(f"bf16 tensor: {format(tensor_bf16.item(), '.60f')}")

Output

fp64 tensor: 0.333333333333333314829616256247390992939472198486328125000000
fp32 tensor: 0.333333343267440795898437500000000000000000000000000000000000
fp16 tensor: 0.333251953125000000000000000000000000000000000000000000000000
bf16 tensor: 0.333984375000000000000000000000000000000000000000000000000000
# Information of `16-bit brain floating point`
torch.finfo(torch.bfloat16)

Output

finfo(resolution=0.01, min=-3.38953e+38, max=3.38953e+38, eps=0.0078125, smallest_normal=1.17549e-38, tiny=1.17549e-38, dtype=bfloat16)
# Information of `32-bit floating point`
torch.finfo(torch.float32)

Output

finfo(resolution=1e-06, min=-3.40282e+38, max=3.40282e+38, eps=1.19209e-07, smallest_normal=1.17549e-38, tiny=1.17549e-38, dtype=float32)
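For completeness, you can query float16 the same way. On a typical PyTorch build it reports a much smaller range than bfloat16 and float32 (max around 65504) but a finer resolution, since fp16 spends more bits on the mantissa and fewer on the exponent; double-check the exact printout on your own install.

# Information of `16-bit floating point` (IEEE half precision)
torch.finfo(torch.float16)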

Downcasting

# random pytorch tensor: float32, size=1000
tensor_fp32 = torch.rand(1000, dtype = torch.float32)
# first 5 elements of the random tensor
tensor_fp32[:5]

Output

tensor([0.4897, 0.0494, 0.8093, 0.6704, 0.0713])
# downcast the tensor to bfloat16 using the "to" method
tensor_fp32_to_bf16 = tensor_fp32.to(dtype = torch.bfloat16)
tensor_fp32_to_bf16[:5]

Output

tensor([0.4902, 0.0493, 0.8086, 0.6719, 0.0713], dtype=torch.bfloat16)
# tensor_fp32 x tensor_fp32
m_float32 = torch.dot(tensor_fp32, tensor_fp32)
m_float32

Output

tensor(324.9693)
# tensor_fp32_to_bf16 x tensor_fp32_to_bf16
m_bfloat16 = torch.dot(tensor_fp32_to_bf16, tensor_fp32_to_bf16)
m_bfloat16

Output

tensor(326., dtype=torch.bfloat16)
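Why 326 instead of 324.9693: bfloat16 keeps float32's 8-bit exponent but stores only 7 mantissa bits, so values carry roughly 2-3 significant decimal digits, and the per-element rounding from the earlier downcast also accumulates over the 1000-term dot product.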


Lesson 3: Loading ML Models with Different Data Types


In this lab, you will load ML models in different datatypes.

helper.py

import torch
import torch.nn as nn
import requests
from PIL import Image

import warnings
# Ignore specific UserWarnings related to max_length in transformers
warnings.filterwarnings("ignore", 
    message=".*Using the model-agnostic default `max_length`.*")

class DummyModel(nn.Module):
  """
  A dummy model that consists of an embedding layer
  with two blocks of a linear layer followed by a layer
  norm layer.
  """
  def __init__(self):
    super().__init__()

    torch.manual_seed(123)

    self.token_embedding = nn.Embedding(2, 2)

    # Block 1
    self.linear_1 = nn.Linear(2, 2)
    self.layernorm_1 = nn.LayerNorm(2)

    # Block 2
    self.linear_2 = nn.Linear(2, 2)
    self.layernorm_2 = nn.LayerNorm(2)

    self.head = nn.Linear(2, 2)

  def forward(self, x):
    hidden_states = self.token_embedding(x)

    # Block 1
    hidden_states = self.linear_1(hidden_states)
    hidden_states = self.layernorm_1(hidden_states)

    # Block 2
    hidden_states = self.linear_2(hidden_states)
    hidden_states = self.layernorm_2(hidden_states)

    logits = self.head(hidden_states)
    return logits


def get_generation(model, processor, image, dtype):
  inputs = processor(image, return_tensors="pt").to(dtype)
  out = model.generate(**inputs)
  return processor.decode(out[0], skip_special_tokens=True)


def load_image(img_url):
    image = Image.open(requests.get(
        img_url, stream=True).raw).convert('RGB')

    return image


from helper import DummyModel
model = DummyModel()
model

Output

DummyModel(
  (token_embedding): Embedding(2, 2)
  (linear_1): Linear(in_features=2, out_features=2, bias=True)
  (layernorm_1): LayerNorm((2,), eps=1e-05, elementwise_affine=True)
  (linear_2): Linear(in_features=2, out_features=2, bias=True)
  (layernorm_2): LayerNorm((2,), eps=1e-05, elementwise_affine=True)
  (head): Linear(in_features=2, out_features=2, bias=True)
)
  • Create a function to inspect the data types of the parameters in a model.
def print_param_dtype(model):
    for name, param in model.named_parameters():
        print(f"{name} is loaded in {param.dtype}")
        
print_param_dtype(model)

Output

token_embedding.weight is loaded in torch.float32
linear_1.weight is loaded in torch.float32
linear_1.bias is loaded in torch.float32
layernorm_1.weight is loaded in torch.float32
layernorm_1.bias is loaded in torch.float32
linear_2.weight is loaded in torch.float32
linear_2.bias is loaded in torch.float32
layernorm_2.weight is loaded in torch.float32
layernorm_2.bias is loaded in torch.float32
head.weight is loaded in torch.float32
head.bias is loaded in torch.float32

Model Casting: float16

  • Cast the model into a different precision.
# float 16
model_fp16 = DummyModel().half()
print_param_dtype(model_fp16)

Output

token_embedding.weight is loaded in torch.float16
linear_1.weight is loaded in torch.float16
linear_1.bias is loaded in torch.float16
layernorm_1.weight is loaded in torch.float16
layernorm_1.bias is loaded in torch.float16
linear_2.weight is loaded in torch.float16
linear_2.bias is loaded in torch.float16
layernorm_2.weight is loaded in torch.float16
layernorm_2.bias is loaded in torch.float16
head.weight is loaded in torch.float16
head.bias is loaded in torch.float16
  • Run a simple inference using the model.
import torch
dummy_input = torch.LongTensor([[1, 0], [0, 1]])
# inference using float32 model
logits_fp32 = model(dummy_input)
logits_fp32

Output

tensor([[[-0.6872,  0.7132],
         [-0.6872,  0.7132]],

        [[-0.6872,  0.7132],
         [-0.6872,  0.7132]]], grad_fn=<ViewBackward0>)
# inference using float16 model
try:
    logits_fp16 = model_fp16(dummy_input)
except Exception as error:
    print("\033[91m", type(error).__name__, ": ", error, "\033[0m")

Model Casting: bfloat16

Note about deepcopy
  • copy.deepcopy makes a copy of the model that is independent of the original. Modifications you make to the copy will not affect the original, because you’re making a “deep copy”. For more details, see the Python docs on the [copy](https://docs.python.org/3/library/copy.html) library.
from copy import deepcopy
model_bf16 = deepcopy(model)
model_bf16 = model_bf16.to(torch.bfloat16)
print_param_dtype(model_bf16)

Output

token_embedding.weight is loaded in torch.bfloat16
linear_1.weight is loaded in torch.bfloat16
linear_1.bias is loaded in torch.bfloat16
layernorm_1.weight is loaded in torch.bfloat16
layernorm_1.bias is loaded in torch.bfloat16
linear_2.weight is loaded in torch.bfloat16
linear_2.bias is loaded in torch.bfloat16
layernorm_2.weight is loaded in torch.bfloat16
layernorm_2.bias is loaded in torch.bfloat16
head.weight is loaded in torch.bfloat16
head.bias is loaded in torch.bfloat16
logits_bf16 = model_bf16(dummy_input)
  • Now, compare the difference between logits_fp32 and logits_bf16.
mean_diff = torch.abs(logits_bf16 - logits_fp32).mean().item()
max_diff = torch.abs(logits_bf16 - logits_fp32).max().item()

print(f"Mean diff: {mean_diff} | Max diff: {max_diff}")

Output

Mean diff: 0.0009978711605072021 | Max diff: 0.0016907453536987305

Using Popular Generative Models in Different Data Types

  • Load Salesforce/blip-image-captioning-base to perform image captioning.
To get the sample code that Younes showed:
  • Click on the “Model Card” tab.
  • On the right, click the “<> Use in Transformers” button, and you’ll see a popup with sample code for loading this model.
# Load model directly
from transformers import AutoProcessor, AutoModelForSeq2SeqLM

processor = AutoProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/blip-image-captioning-base")
  • To see the sample code with an example, click on “Read model documentation” at the bottom of the popup. It opens a new tab.
    https://huggingface.co/docs/transformers/main/en/model_doc/blip#transformers.BlipForConditionalGeneration
  • On this page, scroll down a bit, past the “Parameters” section, and you’ll see “Examples:”
from PIL import Image
import requests
from transformers import AutoProcessor, BlipForConditionalGeneration

processor = AutoProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
text = "A picture of"

inputs = processor(images=image, text=text, return_tensors="pt")

outputs = model(**inputs)
from transformers import BlipForConditionalGeneration
model_name = "Salesforce/blip-image-captioning-base"
model = BlipForConditionalGeneration.from_pretrained(model_name)
  • Check the memory footprint of the model.
fp32_mem_footprint = model.get_memory_footprint()
print("Footprint of the fp32 model in bytes: ",
      fp32_mem_footprint)
print("Footprint of the fp32 model in MBs: ", 
      fp32_mem_footprint/1e+6)

Output

Footprint of the fp32 model in bytes:  989660400
Footprint of the fp32 model in MBs:  989.6604
  • Load the same model in bfloat16.
model_bf16 = BlipForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16
)

bf16_mem_footprint = model_bf16.get_memory_footprint()

# Get the relative difference
relative_diff = bf16_mem_footprint / fp32_mem_footprint

print("Footprint of the bf16 model in MBs: ", 
      bf16_mem_footprint/1e+6)
print(f"Relative diff: {relative_diff}")

Output

Footprint of the bf16 model in MBs:  494.832248
Relative diff: 0.5000020693967345
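This halving is exactly what the per-parameter storage cost predicts: 989,660,400 bytes at 4 bytes per fp32 parameter corresponds to roughly 247 million parameters, and bfloat16 stores the same parameters at 2 bytes each.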

Model Performance: float32 vs bfloat16

  • Now, compare the generation results of the two models.
from transformers import BlipProcessor
processor = BlipProcessor.from_pretrained(model_name)
  • Load the image.
from helper import load_image, get_generation
from IPython.display import display

img_url = 'https://storage.googleapis.com/\
sfr-vision-language-research/BLIP/demo.jpg'

image = load_image(img_url)
display(image.resize((500, 350)))

Output

(The demo image is displayed: a woman sitting on a beach with her dog.)

results_fp32 = get_generation(model, 
                              processor, 
                              image, 
                              torch.float32)
                              
print("fp32 Model Results:\n", results_fp32)

Output

fp32 Model Results:
 a woman sitting on the beach with her dog
results_bf16 = get_generation(model_bf16, 
                              processor, 
                              image, 
                              torch.bfloat16)
print("bf16 Model Results:\n", results_bf16)                             

Output

bf16 Model Results:
 a woman sitting on the beach with her dog

Default Data Type

  • For the Hugging Face Transformers library, the default data type for loading models is float32.
  • You can set the “default data type” to whatever you want.
desired_dtype = torch.bfloat16
torch.set_default_dtype(desired_dtype)
dummy_model_bf16 = DummyModel()
print_param_dtype(dummy_model_bf16)

Output

token_embedding.weight is loaded in torch.bfloat16
linear_1.weight is loaded in torch.bfloat16
linear_1.bias is loaded in torch.bfloat16
layernorm_1.weight is loaded in torch.bfloat16
layernorm_1.bias is loaded in torch.bfloat16
linear_2.weight is loaded in torch.bfloat16
linear_2.bias is loaded in torch.bfloat16
layernorm_2.weight is loaded in torch.bfloat16
layernorm_2.bias is loaded in torch.bfloat16
head.weight is loaded in torch.bfloat16
head.bias is loaded in torch.bfloat16
  • Similarly, you can reset the default data type to float32.
torch.set_default_dtype(torch.float32)
print_param_dtype(dummy_model_bf16)

Output

token_embedding.weight is loaded in torch.bfloat16
linear_1.weight is loaded in torch.bfloat16
linear_1.bias is loaded in torch.bfloat16
layernorm_1.weight is loaded in torch.bfloat16
layernorm_1.bias is loaded in torch.bfloat16
linear_2.weight is loaded in torch.bfloat16
linear_2.bias is loaded in torch.bfloat16
layernorm_2.weight is loaded in torch.bfloat16
layernorm_2.bias is loaded in torch.bfloat16
head.weight is loaded in torch.bfloat16
head.bias is loaded in torch.bfloat16
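Note that torch.set_default_dtype only affects tensors and module parameters created after the call; the already-instantiated dummy_model_bf16 keeps its bfloat16 parameters, which is why the printout above is unchanged. A DummyModel created now would be back in float32.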

Note

  • You just used a simple form of quantization, in which the model’s parameters are saved in a more compact data type (bfloat16). During inference, the model performs its calculations in this data type, and its activations are in this data type.
  • In the next lesson, you will use another quantization method, “linear quantization”, which enables the quantized model to maintain performance much closer to the original model by converting from the compressed data type back to the original FP32 data type during inference.

Lesson 4: Quantization Theory

Linear quantization is a quantization method that maps continuous real-valued data to discrete integer values. In linear quantization, the data range is divided uniformly into a number of quantization levels, each representing a fixed real-valued interval. Linear quantization is widely used in signal processing, image processing, and in compressing and accelerating machine learning models.

The basic principle of linear quantization

The linear quantization process can be summarized in the following steps:

  1. Determine the data range
    Find the minimum and maximum of the data to be quantized, usually written $[x_{min}, x_{max}]$.

  2. Choose the number of quantization levels
    Pick the number of levels $N$, typically $N = 2^b$, where $b$ is the number of quantization bits. For example, for 8-bit quantization, $N = 256$.

  3. Compute the quantization step size
    The step size $\Delta$ is given by:

    $\Delta = \frac{x_{max} - x_{min}}{N - 1}$

  4. Quantize
    Map each continuous real value $x$ to a discrete quantization level $q$:

    $q = \text{round}\left(\frac{x - x_{min}}{\Delta}\right)$

    where $\text{round}$ denotes rounding to the nearest integer.

  5. Dequantize (reconstruct)
    Map the quantized integer $q$ back to an approximate real value $\hat{x}$:
    $\hat{x} = x_{min} + q \cdot \Delta$

A worked example of linear quantization

Suppose we have the data $[0.0, 1.0, 2.0, 3.0]$ and want to use 2-bit quantization (i.e., $N = 4$).

  1. Data range
    $x_{min} = 0.0$, $x_{max} = 3.0$

  2. Step size
    $\Delta = \frac{3.0 - 0.0}{4 - 1} = 1.0$

  3. Quantize
    Quantize each value:
    $q = \text{round}\left(\frac{x - 0.0}{1.0}\right) = \text{round}(x)$

    Therefore:
    $0.0 \rightarrow 0,\quad 1.0 \rightarrow 1,\quad 2.0 \rightarrow 2,\quad 3.0 \rightarrow 3$

  4. Dequantize
    Map the quantized values back to real numbers:
    $\hat{x} = 0.0 + q \cdot 1.0 = q$
    Therefore:
    $0 \rightarrow 0.0,\quad 1 \rightarrow 1.0,\quad 2 \rightarrow 2.0,\quad 3 \rightarrow 3.0$

Linear quantization in machine learning

With linear quantization, a neural network's weights and activations can be quantized to low-bit integers (for example, 8-bit integers) to reduce the model's memory footprint and computational cost and thereby speed up inference. This is especially useful on resource-constrained devices such as mobile and embedded systems.

The main challenge in quantizing a neural network is minimizing the loss of model accuracy. Common approaches include quantization-aware training (QAT) and post-training quantization (PTQ).

In short, linear quantization is a simple yet effective data-compression technique with applications in many domains.
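To make the steps above concrete, here is a minimal PyTorch sketch (my own illustration, not code from the course notebook) that implements the quantize/dequantize formulas and reproduces the 2-bit worked example. The helper names `linear_quantize` and `linear_dequantize` are illustrative only.

import torch

def linear_quantize(x, bits):
    # Map real values x to integer levels 0 .. 2**bits - 1 (asymmetric, per-tensor)
    n_levels = 2 ** bits
    x_min, x_max = x.min(), x.max()
    delta = (x_max - x_min) / (n_levels - 1)              # quantization step size
    q = torch.round((x - x_min) / delta).to(torch.int64)
    return q, x_min, delta

def linear_dequantize(q, x_min, delta):
    # Reconstruct approximate real values from the integer levels
    return x_min + q.to(torch.float32) * delta

x = torch.tensor([0.0, 1.0, 2.0, 3.0])
q, x_min, delta = linear_quantize(x, bits=2)
x_hat = linear_dequantize(q, x_min, delta)
print(q)       # tensor([0, 1, 2, 3])
print(x_hat)   # tensor([0., 1., 2., 3.])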

Linear quantization


Scale and zero point

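The scale-and-zero-point formulation from the slide maps a real value $r$ to an integer $q$ via $r \approx s \cdot (q - z)$. Below is a rough sketch of the usual textbook recipe for computing $s$ and $z$ for asymmetric int8 quantization; it is my own illustration and not necessarily how the Quanto library computes them.

import torch

def asymmetric_int8_params(x):
    # Per-tensor scale s and zero point z such that r ≈ s * (q - z), with q in [-128, 127]
    q_min, q_max = -128, 127
    x_min, x_max = x.min().item(), x.max().item()
    s = (x_max - x_min) / (q_max - q_min)
    z = int(round(q_min - x_min / s))
    return s, z

def quantize_int8(x, s, z):
    q = torch.clamp(torch.round(x / s) + z, -128, 127)
    return q.to(torch.int8)

def dequantize_int8(q, s, z):
    return s * (q.to(torch.float32) - z)

w = torch.randn(4, 4)
s, z = asymmetric_int8_params(w)
w_q = quantize_int8(w, s, z)
w_hat = dequantize_int8(w_q, s, z)
print((w - w_hat).abs().max())   # quantization error, at most about s / 2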

Quantization Aware Training


In this lab, you will perform Linear Quantization.

Libraries to install
  • If you are running this notebook on your local machine, you can install the following:
!pip install transformers==4.35.0
!pip install quanto==0.0.11
!pip install torch==2.1.1

T5-FLAN

  • Please note that due to hardware memory constraints, and in order to offer this course for free to everyone, the code you’ll run here is for the T5-FLAN model instead of the EleutherAI Pythia model.
  • Thank you for your understanding! 🤗

For the T5-FLAN model, here is one more library to install if you are running locally:

!pip install sentencepiece==0.2.0

Without Quantization

model_name = "google/flan-t5-small"
import sentencepiece as spm
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-small")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-small")

input_text = "Hello, my name is "
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))

Output

<pad> annie scott</s>

helper.py

import torch

# ################ monkey patch for quanto
def named_module_tensors(module, recurse=False):
    # Like named_parameters, but for weights quantized by quanto (which carry
    # `_data` and `_scale`), yield those sub-tensors instead so that their
    # sizes are counted correctly.
    for named_parameter in module.named_parameters(recurse=recurse):
      name, val = named_parameter
      if hasattr(val, "_data") or hasattr(val, "_scale"):
        if hasattr(val, "_data"):
          yield name + "._data", val._data
        if hasattr(val, "_scale"):
          yield name + "._scale", val._scale
      else:
        yield named_parameter

    for named_buffer in module.named_buffers(recurse=recurse):
      yield named_buffer

def dtype_byte_size(dtype):
    """
    Returns the size (in bytes) occupied by one parameter of type `dtype`.
    """
    import re
    if dtype == torch.bool:
        return 1 / 8
    bit_search = re.search(r"[^\d](\d+)$", str(dtype))
    if bit_search is None:
        raise ValueError(f"`dtype` is not a valid dtype: {dtype}.")
    bit_size = int(bit_search.groups()[0])
    return bit_size // 8

def compute_module_sizes(model):
    """
    Compute the size of each submodule of a given model.
    """
    from collections import defaultdict
    module_sizes = defaultdict(int)
    for name, tensor in named_module_tensors(model, recurse=True):
      size = tensor.numel() * dtype_byte_size(tensor.dtype)
      name_parts = name.split(".")
      for idx in range(len(name_parts) + 1):
        module_sizes[".".join(name_parts[:idx])] += size

    return module_sizes
from helper import compute_module_sizes
module_sizes = compute_module_sizes(model)
print(f"The model size is {module_sizes[''] * 1e-9} GB")

Output

The model size is 0.307844608 GB

Quantize the model (8-bit precision)

from quanto import quantize, freeze
import torch
quantize(model, weights=torch.int8, activations=None)
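As a rough description of what happens here (based on Quanto's documented workflow; worth double-checking against the version you install): quantize swaps the model's linear layers for Quanto's quantized equivalents but keeps the original weights around, and the later freeze call is what actually materializes the int8 weights. This is also why the helper.py above walks the _data and _scale attributes, since a frozen weight becomes an int8 data tensor plus a small scale tensor.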

Freeze the model

  • This step takes a bit of memory, and so for the Pythia model that is shown in the lecture video, it will not run in the classroom.
  • This will work fine with the smaller T5-Flan model.
freeze(model)
module_sizes = compute_module_sizes(model)
print(f"The model size is {module_sizes[''] * 1e-9} GB")

Output

The model size is 0.12682868 GB
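The size falls from about 0.31 GB to about 0.13 GB rather than by a full 4x because only the linear-layer weights are stored as 1-byte int8 values; the shared embedding matrix, the per-weight scales, and the remaining parameters stay in higher precision (a rough reading of the numbers, not an exact breakdown).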

Try running inference on the quantized model

input_text = "Hello, my name is "
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))

Output

<pad> annie scott</s>
Comparing “linear quantization” to “downcasting”

To recap the difference between the “linear quantization” method in this lesson with the “downcasting” method in the previous lesson:

  • When downcasting a model, you convert the model’s parameters to a more compact data type (bfloat16). During inference, the model performs its calculations in this data type, and its activations are in this data type. Downcasting may work with the bfloat16 data type, but the model performance will likely degrade with any smaller data type, and won’t work if you convert to an integer data type (like the int8 in this lesson).

  • In this lesson, you used another quantization method, “linear quantization”, which enables the quantized model to maintain performance much closer to the original model by converting from the compressed data type back to the original FP32 data type during inference. So when the model makes a prediction, it is performing the matrix multiplications in FP32, and the activations are in FP32. This enables you to quantize the model in data types smaller than bfloat16, such as int8, in this example.
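As a rough illustration of that difference (a hand-rolled sketch, not how Quanto is actually implemented): downcasting computes directly in the compact type, while a linearly quantized layer stores int8 weights plus a scale and dequantizes back to fp32 before the matrix multiplication.

import torch

x = torch.randn(1, 16)            # activation stays in fp32
w = torch.randn(8, 16)            # original fp32 weight

# Downcasting: weights AND compute in bfloat16
y_downcast = x.to(torch.bfloat16) @ w.to(torch.bfloat16).T

# Linear quantization (symmetric, per-tensor, illustrative only)
scale = w.abs().max() / 127
w_int8 = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)

# At inference the int8 weights are dequantized back to fp32 for the matmul
y_quant = x @ (w_int8.to(torch.float32) * scale).T

print((y_downcast.float() - x @ w.T).abs().max())   # error from bf16 compute
print((y_quant - x @ w.T).abs().max())              # error from int8 weight rounding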


Quantization of LLMs

The closing slides survey recent state-of-the-art quantization methods for LLMs, including approaches that go down to 2-bit precision, and discuss fine-tuning quantized models, for example fine-tuning with quantization-aware training (QAT).

Afterword

I finished this Hugging Face quantization fundamentals course at 14:58 on June 8, 2024.
