GPTQ vs AWQ vs GGUF (GGML): A Quick Overview and the GGUF File Naming Convention

A brief look at how the four differ.

References: GPTQ - 2210.17323 | AWQ - 2306.00978 | GGML | GGUF - docs | What is GGUF and GGML?

Contents

- GPTQ vs AWQ vs GGUF (GGML): A Quick Overview
  - GGUF File Naming
  - GGUF File Structure
  - File Name Parsing Answers
- Appendix
  - GGUF File Naming
  - GGUF File Structure

GPTQ vs AWQ vs GGUF (GGML): A Quick Overview

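GPTQ (Generalized Post-Training Quantization)
GPTQ is a post-training quantization technique based on approximate second-order information. It can reduce the weight bit width to 3-4 bits, substantially shrinking model size and compute cost while largely preserving model quality. In extreme cases it can go down to 2 bits or even ternary levels, at some cost in accuracy.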
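To put the size reduction in numbers, here is a minimal arithmetic sketch. It assumes a hypothetical 7B-parameter model and counts only the raw weight bits, ignoring the per-group scales and zero points that real quantized checkpoints also store, so the figures are rough lower bounds rather than actual file sizes.

```python
def weight_size_gb(num_params: int, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Hypothetical 7B-parameter model at FP16, 4-bit, and 3-bit weights.
for bits in (16, 4, 3):
    print(f"{bits:>2}-bit: {weight_size_gb(7_000_000_000, bits):.3f} GB")
```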
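AWQ (Activation-aware Weight Quantization)
AWQ builds on the observation that not all weights matter equally: protecting the small fraction of weights that are most important to model quality greatly reduces quantization loss. As the figure shows, using a fairly extreme INT3 setting:
- Panel a: RTN (round-to-nearest) quantization
  Weights are rounded directly to the target bit width, which degrades quality noticeably; PPL reaches 43.2.
- Panel b: protect the 1% salient weights with mixed precision
  This is an improved strategy: the most important 1% of weight channels stay in high precision (FP16) and the rest use low precision (INT3). PPL drops to 13.0. The approach preserves quality, but switching between precisions makes it inefficient on hardware. It does, however, show that not all weights are equally important to model quality.
- Panel c: the per-channel scaling proposed by AWQ
  AWQ protects salient weights through per-channel scaling: it uses the activation distribution to find the important weights and scales their values to reduce quantization error. Compared with the mixed-precision approach it is more hardware-efficient, while quality matches panel b, with PPL also at 13.0.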
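The intuition behind panel c can be seen in a toy example. This is only a sketch under simplifying assumptions, not the AWQ algorithm itself: it uses one shared scale per group, hand-picks index 0 as the "salient" weight, and hand-picks the scaling factor `s = 2.0`, whereas AWQ derives per-channel scales from activation statistics and folds the inverse scale into the activations. All names and values are illustrative.

```python
def rtn_quantize(weights, bits=3):
    """Symmetric round-to-nearest with one shared step size for the group."""
    qmax = 2 ** (bits - 1) - 1                      # 3 for INT3
    step = max(abs(w) for w in weights) / qmax
    if step == 0:
        return list(weights)
    # Quantize to an integer level, clamp, then map back to a float.
    return [max(-qmax - 1, min(qmax, round(w / step))) * step for w in weights]


def scaled_quantize(weights, salient_index, s, bits=3):
    """Scale one weight up by s, quantize the group, then scale it back."""
    scaled = list(weights)
    scaled[salient_index] *= s
    dequant = rtn_quantize(scaled, bits)
    dequant[salient_index] /= s
    return dequant


group = [0.2, 1.0, -0.8, 0.3]       # toy values; index 0 plays the salient weight
plain = rtn_quantize(group)
scaled = scaled_quantize(group, salient_index=0, s=2.0)
print("error without scaling:", abs(plain[0] - group[0]))
print("error with scaling:   ", abs(scaled[0] - group[0]))
```

Because the scaled weight stays below the group maximum, the step size is unchanged and the salient weight's rounding error is effectively divided by `s`.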
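GGML (GPT-Generated Model Language)
"When VRAM runs short, make up the difference with system RAM": GGML is a file format that supports inference on both CPU and GPU.
GGUF (GPT-Generated Unified Format)
GGUF is the successor to GGML, with better extensibility and compatibility.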

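GGUF File Naming

Reference: GGUF - docs

The GGUF format packs everything needed to load a model into a single file, which simplifies distribution and deployment. GGUF file names follow the pattern <BaseName>-<SizeLabel>-<FineTune>-<Version>-<Encoding>-<Type>-<Shard>.gguf, so the key facts about a model can be read straight off the name. The components are: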

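BaseName: the model's base name or architecture name, for example Llama.
SizeLabel: the parameter-size label, giving the parameter count and, where applicable, the number of experts, in the form <expertCount>x<count><scale-prefix>.
- expertCount: the number of experts in a Mixture of Experts (MoE) model. Omit it if the model does not use MoE.
- count and scale-prefix: the parameter count followed by one of these scale prefixes:
  - Q: quadrillions of parameters.
  - T: trillions of parameters.
  - B: billions of parameters.
  - M: millions of parameters.
  - K: thousands of parameters.
  Most current large models are in the B (billions) range, though T-scale (trillions) models may become common in the future.
- Additional attributes: in some cases a -<attributes><count><scale-prefix> suffix can be appended to the size label to describe the model further. The suffix must end in a scale prefix, so a tag such as Q4 does not fit this form; quantization is expressed in the Encoding field instead.
- Examples:
  - 7B: a model with 7 billion parameters.
  - 4x3T: a 3-trillion-parameter model with 4 experts.
  - 2x10B: a 10-billion-parameter model with 2 experts.
FineTune: a description of the fine-tuning target (for example Chat or Instruct).
Version (optional): the model version, in the form v<Major>.<Minor>. If it is not provided, v1.0 is assumed.
Encoding: the weight encoding scheme (for example, Q4_0 denotes a 4-bit quantization).
Type: the file type, such as LoRA (an adapter) or vocab (vocabulary only).
Shard (optional): shard information in the form <ShardNum>-of-<ShardTotal>, used for large models. For example, 00003-of-00009 is the 3rd of 9 shards. Note that shard numbering starts at 00001, not 00000.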
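As an illustration of the SizeLabel form described above, here is a minimal sketch that turns a label such as 8x7B into an expert count and a parameter count. The function name and the simplified pattern are assumptions made for this example; it does not handle the optional attribute suffix, and it reports 0 experts when the prefix is absent, matching the convention used in the answers later in this article.

```python
import re

# Scale prefixes from the list above.
SCALE = {"K": 10**3, "M": 10**6, "B": 10**9, "T": 10**12, "Q": 10**15}

# Simplified pattern: optional "<expertCount>x", then "<count><scale-prefix>".
SIZE_LABEL = re.compile(r"^(?:(\d+)x)?(\d+(?:\.\d+)?)([KMBTQ])$")


def parse_size_label(label: str):
    """Return (expert_count, parameter_count) for a label such as '8x7B'."""
    m = SIZE_LABEL.match(label)
    if m is None:
        raise ValueError(f"unrecognized size label: {label!r}")
    experts = int(m.group(1)) if m.group(1) else 0
    params = round(float(m.group(2)) * SCALE[m.group(3)])
    return experts, params


for label in ("7B", "8x7B", "4x3T"):
    print(label, parse_size_label(label))
```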

A regular expression for checking that a name follows the convention:

^(?<BaseName>[A-Za-z0-9\s]*(?:(?:-(?:(?:[A-Za-z\s][A-Za-z0-9\s]*)|(?:[0-9\s]*)))*))-(?:(?<SizeLabel>(?:\d+x)?(?:\d+\.)?\d+[A-Za-z](?:-[A-Za-z]+(\d+\.)?\d+[A-Za-z]+)?)(?:-(?<FineTune>[A-Za-z0-9\s-]+))?)?-(?:(?<Version>v\d+(?:\.\d+)*))(?:-(?<Encoding>(?!LoRA|vocab)[\w_]+))?(?:-(?<Type>LoRA|vocab))?(?:-(?<Shard>\d{5}-of-\d{5}))?\.gguf$
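The pattern above uses the `(?<Name>...)` syntax for named groups. A minimal sketch of the same pattern in Python, where named groups are written `(?P<Name>...)`, is shown below; the helper name `parse_gguf_filename` is made up for this example. One thing worth noticing: as written, the pattern requires an explicit version segment, so a name without one (such as the Hermes example in the exercise that follows) is not matched, and any "assume v1.0" fallback would have to be handled separately.

```python
import re

# Same pattern as above, split across lines; only the named-group syntax differs.
GGUF_NAME = re.compile(
    r"^(?P<BaseName>[A-Za-z0-9\s]*(?:(?:-(?:(?:[A-Za-z\s][A-Za-z0-9\s]*)|(?:[0-9\s]*)))*))"
    r"-(?:(?P<SizeLabel>(?:\d+x)?(?:\d+\.)?\d+[A-Za-z](?:-[A-Za-z]+(\d+\.)?\d+[A-Za-z]+)?)"
    r"(?:-(?P<FineTune>[A-Za-z0-9\s-]+))?)?"
    r"-(?:(?P<Version>v\d+(?:\.\d+)*))"
    r"(?:-(?P<Encoding>(?!LoRA|vocab)[\w_]+))?"
    r"(?:-(?P<Type>LoRA|vocab))?"
    r"(?:-(?P<Shard>\d{5}-of-\d{5}))?"
    r"\.gguf$"
)


def parse_gguf_filename(name: str):
    """Return a dict of the named fields, or None if the name does not match."""
    m = GGUF_NAME.match(name)
    return m.groupdict() if m else None


for name in ("Mixtral-8x7B-v0.1-KQ2.gguf", "Grok-100B-v1.0-Q4_0-00003-of-00009.gguf"):
    print(name, "->", parse_gguf_filename(name))
```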

Try to make sense of the following three file names from the official documentation and see whether you can parse them correctly:

Mixtral-8x7B-v0.1-KQ2.gguf
Hermes-2-Pro-Llama-3-8B-F16.gguf
Grok-100B-v1.0-Q4_0-00003-of-00009.gguf

The answers are given further down; pause here and think it through first.

GGUF File Structure

[Figure: GGUF file structure, diagram by @mishig25]

For more detail, see the code in the appendix.

File Name Parsing Answers

Mixtral-8x7B-v0.1-KQ2.gguf:
- BaseName: Mixtral
- SizeLabel:
  - Expert Count: 8
  - Parameter Count: 7B
- Version: v0.1
- Encoding: KQ2
Hermes-2-Pro-Llama-3-8B-F16.gguf:
- BaseName: Hermes 2 Pro Llama 3
- SizeLabel:
  - Expert Count: 0
  - Parameter Count: 8B
- Version: v1.0 (not given in the name, so the default is assumed)
- Encoding: F16
Grok-100B-v1.0-Q4_0-00003-of-00009.gguf:
- BaseName: Grok
- SizeLabel:
  - Expert Count: 0
  - Parameter Count: 100B
- Version: v1.0
- Encoding: Q4_0
- Shard: shard 3 of 9

Appendix

GGUF File Naming

Quantization Types

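Type	Source	Description
F64	Wikipedia	64-bit standard IEEE 754 double-precision floating-point number.
I64	GH	64-bit fixed-width integer.
F32	Wikipedia	32-bit standard IEEE 754 single-precision floating-point number.
I32	GH	32-bit fixed-width integer.
F16	Wikipedia	16-bit standard IEEE 754 half-precision floating-point number.
BF16	Wikipedia	16-bit shortened version of the 32-bit IEEE 754 single-precision floating-point number.
I16	GH	16-bit fixed-width integer.
Q8_0	GH	8-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: `w = q * block_scale`. Legacy quantization method, not widely used today.
Q8_1	GH	8-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: `w = q * block_scale + block_minimum`. Legacy quantization method, not widely used today.
Q8_K	GH	8-bit quantization (q). Each block has 256 weights. Used only for quantizing intermediate results. All 2-6 bit dot products are supported for this quantization type. Weight formula: `w = q * block_scale`.
I8	GH	8-bit fixed-width integer.
Q6_K	GH	6-bit quantization (q). Super-blocks of 16 blocks, each block with 16 weights. Weight formula: `w = q * block_scale(8-bit)`, giving 6.5625 bits per weight.
Q5_0	GH	5-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: `w = q * block_scale`. Legacy quantization method, not widely used today.
Q5_1	GH	5-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: `w = q * block_scale + block_minimum`. Legacy quantization method, not widely used today.
Q5_K	GH	5-bit quantization (q). Super-blocks of 8 blocks, each block with 32 weights. Weight formula: `w = q * block_scale(6-bit) + block_min(6-bit)`, giving 5.5 bits per weight.
Q4_0	GH	4-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: `w = q * block_scale`. Legacy quantization method, not widely used today.
Q4_1	GH	4-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: `w = q * block_scale + block_minimum`. Legacy quantization method, not widely used today.
Q4_K	GH	4-bit quantization (q). Super-blocks of 8 blocks, each block with 32 weights. Weight formula: `w = q * block_scale(6-bit) + block_min(6-bit)`, giving 4.5 bits per weight.
Q3_K	GH	3-bit quantization (q). Super-blocks of 16 blocks, each block with 16 weights. Weight formula: `w = q * block_scale(6-bit)`, giving 3.4375 bits per weight.
Q2_K	GH	2-bit quantization (q). Super-blocks of 16 blocks, each block with 16 weights. Weight formula: `w = q * block_scale(4-bit) + block_min(4-bit)`, giving 2.5625 bits per weight.
IQ4_NL	GH	4-bit quantization (q). Super-blocks of 256 weights. Weight `w` is obtained from `super_block_scale` and the `importance matrix`.
IQ4_XS	HF	4-bit quantization (q). Super-blocks of 256 weights. Weight `w` is obtained from `super_block_scale` and the `importance matrix`, giving 4.25 bits per weight.
IQ3_S	HF	3-bit quantization (q). Super-blocks of 256 weights. Weight `w` is obtained from `super_block_scale` and the `importance matrix`, giving 3.44 bits per weight.
IQ3_XXS	HF	3-bit quantization (q). Super-blocks of 256 weights. Weight `w` is obtained from `super_block_scale` and the `importance matrix`, giving 3.06 bits per weight.
IQ2_XXS	HF	2-bit quantization (q). Super-blocks of 256 weights. Weight `w` is obtained from `super_block_scale` and the `importance matrix`, giving 2.06 bits per weight.
IQ2_S	HF	2-bit quantization (q). Super-blocks of 256 weights. Weight `w` is obtained from `super_block_scale` and the `importance matrix`, giving 2.5 bits per weight.
IQ2_XS	HF	2-bit quantization (q). Super-blocks of 256 weights. Weight `w` is obtained from `super_block_scale` and the `importance matrix`, giving 2.31 bits per weight.
IQ1_S	HF	1-bit quantization (q). Super-blocks of 256 weights. Weight `w` is obtained from `super_block_scale` and the `importance matrix`, giving 1.56 bits per weight.
IQ1_M	GH	1-bit quantization (q). Super-blocks of 256 weights. Weight `w` is obtained from `super_block_scale` and the `importance matrix`, giving 1.75 bits per weight.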
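The fractional bits-per-weight figures in the table come from amortizing the scale metadata over the weights in a super-block. The sketch below reproduces four of them. The per-block scale and minimum widths are taken from the table; the FP16 super-block fields (one 16-bit scale for Q3_K and Q6_K, a 16-bit scale plus a 16-bit minimum for Q4_K and Q5_K) are an assumption that is not stated in the table, and the function and dictionary names are made up for this example.

```python
def bits_per_weight(weights_per_block, blocks, quant_bits,
                    per_block_meta_bits, super_block_meta_bits):
    """Total bits stored for one super-block divided by its number of weights."""
    n_weights = weights_per_block * blocks
    total_bits = (n_weights * quant_bits
                  + blocks * per_block_meta_bits
                  + super_block_meta_bits)
    return total_bits / n_weights


# (weights per block, blocks, bits per quant, per-block metadata bits,
#  assumed FP16 super-block metadata bits)
LAYOUTS = {
    "Q3_K": (16, 16, 3, 6, 16),       # 6-bit scale per block
    "Q4_K": (32, 8, 4, 6 + 6, 32),    # 6-bit scale + 6-bit min per block
    "Q5_K": (32, 8, 5, 6 + 6, 32),
    "Q6_K": (16, 16, 6, 8, 16),       # 8-bit scale per block
}

for name, layout in LAYOUTS.items():
    print(name, bits_per_weight(*layout))
```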

GGUF File Structure

Reference: GGUF - docs

enum ggml_type: uint32_t {
    GGML_TYPE_F32     = 0,
    GGML_TYPE_F16     = 1,
    GGML_TYPE_Q4_0    = 2,
    GGML_TYPE_Q4_1    = 3,
    // GGML_TYPE_Q4_2 = 4, support has been removed
    // GGML_TYPE_Q4_3 = 5, support has been removed
    GGML_TYPE_Q5_0    = 6,
    GGML_TYPE_Q5_1    = 7,
    GGML_TYPE_Q8_0    = 8,
    GGML_TYPE_Q8_1    = 9,
    GGML_TYPE_Q2_K    = 10,
    GGML_TYPE_Q3_K    = 11,
    GGML_TYPE_Q4_K    = 12,
    GGML_TYPE_Q5_K    = 13,
    GGML_TYPE_Q6_K    = 14,
    GGML_TYPE_Q8_K    = 15,
    GGML_TYPE_IQ2_XXS = 16,
    GGML_TYPE_IQ2_XS  = 17,
    GGML_TYPE_IQ3_XXS = 18,
    GGML_TYPE_IQ1_S   = 19,
    GGML_TYPE_IQ4_NL  = 20,
    GGML_TYPE_IQ3_S   = 21,
    GGML_TYPE_IQ2_S   = 22,
    GGML_TYPE_IQ4_XS  = 23,
    GGML_TYPE_I8      = 24,
    GGML_TYPE_I16     = 25,
    GGML_TYPE_I32     = 26,
    GGML_TYPE_I64     = 27,
    GGML_TYPE_F64     = 28,
    GGML_TYPE_IQ1_M   = 29,
    GGML_TYPE_COUNT,
};

enum gguf_metadata_value_type: uint32_t {
    // The value is a 8-bit unsigned integer.
    GGUF_METADATA_VALUE_TYPE_UINT8 = 0,
    // The value is a 8-bit signed integer.
    GGUF_METADATA_VALUE_TYPE_INT8 = 1,
    // The value is a 16-bit unsigned little-endian integer.
    GGUF_METADATA_VALUE_TYPE_UINT16 = 2,
    // The value is a 16-bit signed little-endian integer.
    GGUF_METADATA_VALUE_TYPE_INT16 = 3,
    // The value is a 32-bit unsigned little-endian integer.
    GGUF_METADATA_VALUE_TYPE_UINT32 = 4,
    // The value is a 32-bit signed little-endian integer.
    GGUF_METADATA_VALUE_TYPE_INT32 = 5,
    // The value is a 32-bit IEEE754 floating point number.
    GGUF_METADATA_VALUE_TYPE_FLOAT32 = 6,
    // The value is a boolean.
    // 1-byte value where 0 is false and 1 is true.
    // Anything else is invalid, and should be treated as either the model being invalid or the reader being buggy.
    GGUF_METADATA_VALUE_TYPE_BOOL = 7,
    // The value is a UTF-8 non-null-terminated string, with length prepended.
    GGUF_METADATA_VALUE_TYPE_STRING = 8,
    // The value is an array of other values, with the length and type prepended.
    ///
    // Arrays can be nested, and the length of the array is the number of elements in the array, not the number of bytes.
    GGUF_METADATA_VALUE_TYPE_ARRAY = 9,
    // The value is a 64-bit unsigned little-endian integer.
    GGUF_METADATA_VALUE_TYPE_UINT64 = 10,
    // The value is a 64-bit signed little-endian integer.
    GGUF_METADATA_VALUE_TYPE_INT64 = 11,
    // The value is a 64-bit IEEE754 floating point number.
    GGUF_METADATA_VALUE_TYPE_FLOAT64 = 12,
};

// A string in GGUF.
struct gguf_string_t {
    // The length of the string, in bytes.
    uint64_t len;
    // The string as a UTF-8 non-null-terminated string.
    char string[len];
};

union gguf_metadata_value_t {
    uint8_t uint8;
    int8_t int8;
    uint16_t uint16;
    int16_t int16;
    uint32_t uint32;
    int32_t int32;
    float float32;
    uint64_t uint64;
    int64_t int64;
    double float64;
    bool bool_;
    gguf_string_t string;
    struct {
        // Any value type is valid, including arrays.
        gguf_metadata_value_type type;
        // Number of elements, not bytes
        uint64_t len;
        // The array of values.
        gguf_metadata_value_t array[len];
    } array;
};

struct gguf_metadata_kv_t {
    // The key of the metadata. It is a standard GGUF string, with the following caveats:
    // - It must be a valid ASCII string.
    // - It must be a hierarchical key, where each segment is `lower_snake_case` and separated by a `.`.
    // - It must be at most 2^16-1/65535 bytes long.
    // Any keys that do not follow these rules are invalid.
    gguf_string_t key;

    // The type of the value.
    // Must be one of the `gguf_metadata_value_type` values.
    gguf_metadata_value_type value_type;
    // The value.
    gguf_metadata_value_t value;
};

struct gguf_header_t {
    // Magic number to announce that this is a GGUF file.
    // Must be `GGUF` at the byte level: `0x47` `0x47` `0x55` `0x46`.
    // Your executor might do little-endian byte order, so it might be
    // checking for 0x46554747 and letting the endianness cancel out.
    // Consider being *very* explicit about the byte order here.
    uint32_t magic;
    // The version of the format implemented.
    // Must be `3` for version described in this spec, which introduces big-endian support.
    //
    // This version should only be increased for structural changes to the format.
    // Changes that do not affect the structure of the file should instead update the metadata
    // to signify the change.
    uint32_t version;
    // The number of tensors in the file.
    // This is explicit, instead of being included in the metadata, to ensure it is always present
    // for loading the tensors.
    uint64_t tensor_count;
    // The number of metadata key-value pairs.
    uint64_t metadata_kv_count;
    // The metadata key-value pairs.
    gguf_metadata_kv_t metadata_kv[metadata_kv_count];
};

uint64_t align_offset(uint64_t offset) {
    return offset + (ALIGNMENT - (offset % ALIGNMENT)) % ALIGNMENT;
}
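For readers who want to try the alignment rule, here is the same formula as a small Python sketch. The listing leaves `ALIGNMENT` undefined; the value 32 used here is an assumption for illustration, and the `alignment` parameter is added only so other values can be tried.

```python
ALIGNMENT = 32  # assumed value for illustration; not defined in the listing


def align_offset(offset: int, alignment: int = ALIGNMENT) -> int:
    """Round offset up to the next multiple of alignment (no change if aligned)."""
    return offset + (alignment - (offset % alignment)) % alignment


for offset in (0, 1, 32, 33, 100):
    aligned = align_offset(offset)
    print(f"offset={offset:>3} aligned={aligned:>3} padding={aligned - offset}")
```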

struct gguf_tensor_info_t {
    // The name of the tensor. It is a standard GGUF string, with the caveat that
    // it must be at most 64 bytes long.
    gguf_string_t name;
    // The number of dimensions in the tensor.
    // Currently at most 4, but this may change in the future.
    uint32_t n_dimensions;
    // The dimensions of the tensor.
    uint64_t dimensions[n_dimensions];
    // The type of the tensor.
    ggml_type type;
    // The offset of the tensor's data in this file in bytes.
    //
    // This offset is relative to `tensor_data`, not to the start
    // of the file, to make it easier for writers to write the file.
    // Readers should consider exposing this offset relative to the
    // file to make it easier to read the data.
    //
    // Must be a multiple of `ALIGNMENT`. That is, `align_offset(offset) == offset`.
    uint64_t offset;
};

struct gguf_file_t {
    // The header of the file.
    gguf_header_t header;

    // Tensor infos, which can be used to locate the tensor data.
    gguf_tensor_info_t tensor_infos[header.tensor_count];

    // Padding to the nearest multiple of `ALIGNMENT`.
    //
    // That is, if `sizeof(header) + sizeof(tensor_infos)` is not a multiple of `ALIGNMENT`,
    // this padding is added to make it so.
    //
    // This can be calculated as `align_offset(position) - position`, where `position` is
    // the position of the end of `tensor_infos` (i.e. `sizeof(header) + sizeof(tensor_infos)`).
    uint8_t _padding[];

    // Tensor data.
    //
    // This is arbitrary binary data corresponding to the weights of the model. This data should be close
    // or identical to the data in the original model file, but may be different due to quantization or
    // other optimizations for inference. Any such deviations should be recorded in the metadata or as
    // part of the architecture definition.
    //
    // Each tensor's data must be stored within this array, and located through its `tensor_infos` entry.
    // The offset of each tensor's data must be a multiple of `ALIGNMENT`, and the space between tensors
    // should be padded to `ALIGNMENT` bytes.
    uint8_t tensor_data[];
};
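
To make the layout above concrete, here is a minimal round-trip sketch in Python that writes and reads back a GGUF header with no tensors. It is an illustration under several assumptions, not a complete reader: it assumes little-endian byte order, handles only UINT32 and STRING metadata values, and the helper names and sample metadata values are made up for this example.

```python
import io
import struct

GGUF_MAGIC = b"GGUF"                 # bytes 0x47 0x47 0x55 0x46
TYPE_UINT32, TYPE_STRING = 4, 8      # values from gguf_metadata_value_type


def write_string(buf, text):
    """gguf_string_t: uint64 length, then UTF-8 bytes without a terminator."""
    data = text.encode("utf-8")
    buf.write(struct.pack("<Q", len(data)))
    buf.write(data)


def read_string(buf):
    (length,) = struct.unpack("<Q", buf.read(8))
    return buf.read(length).decode("utf-8")


def build_header(metadata):
    """Serialize a header with zero tensors; values must be str or int (uint32)."""
    buf = io.BytesIO()
    buf.write(GGUF_MAGIC)
    # version = 3, tensor_count = 0, metadata_kv_count = len(metadata)
    buf.write(struct.pack("<IQQ", 3, 0, len(metadata)))
    for key, value in metadata.items():
        write_string(buf, key)
        if isinstance(value, str):
            buf.write(struct.pack("<I", TYPE_STRING))
            write_string(buf, value)
        else:
            buf.write(struct.pack("<II", TYPE_UINT32, value))
    return buf.getvalue()


def read_header(data):
    """Return (version, tensor_count, metadata) parsed from raw bytes."""
    buf = io.BytesIO(data)
    if buf.read(4) != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    version, tensor_count, kv_count = struct.unpack("<IQQ", buf.read(20))
    metadata = {}
    for _ in range(kv_count):
        key = read_string(buf)
        (value_type,) = struct.unpack("<I", buf.read(4))
        if value_type == TYPE_STRING:
            metadata[key] = read_string(buf)
        elif value_type == TYPE_UINT32:
            metadata[key] = struct.unpack("<I", buf.read(4))[0]
        else:
            raise NotImplementedError(f"value type {value_type} not handled")
    return version, tensor_count, metadata


# Sample metadata chosen for illustration only.
sample = {"general.architecture": "llama", "general.alignment": 32}
raw = build_header(sample)
print(read_header(raw))
```

Comparing the first four bytes directly, as done here, sidesteps the endianness question raised in the header comments; reading them as a little-endian uint32 instead gives 0x46554747.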