ZeroQuant-V2: Weight and Activation Quantization for LLMs

news2025/1/8 13:54:55

ref

ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation

Why 4-bit quantization matters

The case for 4-bit precision: k-bit Inference Scaling Laws

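This paper shows that 4-bit is usually the best precision: for the same quantized model size, the 4-bit model is more accurate. A lower bit width lets a fixed memory budget hold a model with more parameters, but it also increases quantization error, so 4-bit is the best trade-off between the two. For models of the same compressed size, a model with 2x the parameters compressed to 4 bits tends to be more accurate than a model with 1x the parameters compressed to 8 bits.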
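The "same compressed size" comparison above is simple arithmetic. A minimal sketch, assuming hypothetical parameter counts of 6.5B and 13B and ignoring the storage for scales and zero points:

```python
def weight_bytes(num_params: int, bits: int) -> float:
    """Approximate weight storage in bytes, ignoring scales and zero points."""
    return num_params * bits / 8


# Hypothetical model sizes, chosen only to illustrate the 2x-parameters case.
small_8bit = weight_bytes(6_500_000_000, 8)    # 1x parameters at 8 bits
large_4bit = weight_bytes(13_000_000_000, 4)   # 2x parameters at 4 bits

print(f"1x params @ 8-bit: {small_8bit / 2**30:.2f} GiB")
print(f"2x params @ 4-bit: {large_4bit / 2**30:.2f} GiB")
```

Both configurations occupy the same number of bytes, so the question the scaling-law paper asks is which of the two is more accurate at that fixed budget.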

Why activation quantization matters

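Methods such as GPTQ have successfully compressed model parameters to 4 bits and are widely used, but they mainly target weight quantization. Effective activation quantization methods are still scarce, largely because adding activation quantization on top of weight quantization causes a larger accuracy loss.

Without activation quantization, however, matrix multiplications and convolutions still have to run in float/half floating point, and the weights first have to be dequantized from integer to floating point. This hurts performance and keeps memory usage high.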
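The difference between the two compute paths can be sketched in NumPy. This is an illustration only, not the kernel any framework uses: the helper `quantize_symmetric`, the tensor shapes, and the choice of one scale per output column for weights and one per token for activations are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)   # activations: 4 tokens
w = rng.standard_normal((64, 32)).astype(np.float32)  # weights: 64 -> 32


def quantize_symmetric(t, axis):
    """Symmetric INT8 quantization with one scale per slice along `axis`."""
    scale = np.maximum(np.abs(t).max(axis=axis, keepdims=True) / 127.0, 1e-8)
    q = np.clip(np.round(t / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)


w_q, w_scale = quantize_symmetric(w, axis=0)  # scale per output column, (1, 32)
x_q, x_scale = quantize_symmetric(x, axis=1)  # scale per token, (4, 1)

# Weight-only path: dequantize the weights, then run a floating-point matmul.
y_weight_only = x @ (w_q.astype(np.float32) * w_scale)

# Weight + activation path: integer matmul with INT32 accumulation,
# followed by a single rescale of the accumulator.
acc = x_q.astype(np.int32) @ w_q.astype(np.int32)
y_w8a8 = acc.astype(np.float32) * x_scale * w_scale

y_ref = x @ w
for name, y in (("weight-only", y_weight_only), ("W8A8", y_w8a8)):
    rel = np.linalg.norm(y - y_ref) / np.linalg.norm(y_ref)
    print(f"{name}: relative error {rel:.4f}")
```

In the weight-only path the multiply itself is still floating point; only the second path keeps the inner product in integer arithmetic.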

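Contributions of ZeroQuant-V2

The paper offers several valuable insights and proposes a method for improving the accuracy of combined activation and weight quantization.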

we undertake an exhaustive examination of the impact of PTQ on weight-only, activation-only, and combined weight-and-activation quantization. This investigation incorporates a range of PTQ methods, including round-to-nearest (RTN), GPTQ [12], ZeroQuant [36], and their respective variants. To broaden the scope of our analysis, we focus on two distinct model families, OPT [40] and BLOOM [28], spanning model sizes from 125M to a massive 176B.

In summary, we make the following contributions:
(1) We provide a thorough sensitivity analysis to demonstrate that a) Activation quantization is generally more sensitive than weight quantization; smaller models usually have better activation quantization performance than relatively larger models. b) Different model families show different INT8 activation quantization behaviors; particularly for large models, BLOOM-176B has small accuracy drops (about 1 perplexity or PPL) but OPT-30B and -66B experience worse performance.

(2) We carry out a detailed evaluation and comparison of current PTQ methods, utilizing optimal configurations to maximize model size reduction while minimizing accuracy impact. We found that existing methods can barely achieve less than 0.1 PPL points degradation for quantization with either INT4-weight or INT4-weight-and-INT8-activation (W4A8). To recover the 0.1 PPL, we strive to push the boundaries of employing fine-grained quantization (FGQ) techniques. We observe that FGQ is able to reduce the degradation to <0.1 PPL for large models (>13B) for INT4 weight quantization, but there are still non-negligible model quality drops.

(3) Based on the above understanding, we further optimize existing methods and introduce a technique called Low Rank Compensation (LoRC), which employs low-rank matrix factorization on the quantization error matrix. Complementary to FGQ, LoRC plays a crucial role in enhancing the full model quality recovery, while there is little increase of the model size.

using LoRC on top of PTQ methods from [36, 12] and fine-grained quantization, we set a new quantization Pareto frontier for LLMs.
Meanwhile, we recommend the following setting for quantizing LLMs with LoRC (Note that activation quantization should be only applied if necessary):

(1) For larger models (>10B), fine-grained (block size 64–256) 4-bit weight quantization plus 8-bit activation quantization (block size 64–256) with PTQ can be used for real deployment;

(2) For middle-size models (<10B and >1B), per-row INT8 quantization plus fine-grained (block size 64–256) INT8 activation quantization can be used with PTQ from [12, 36];

(3) For smaller models (<1B), per-row W8A8 (INT8 weight and INT8 activation) RTN is enough based on [36].
 

We employ both symmetric and asymmetric quantization to gauge the quantization sensitivity and highlight the advantage of asymmetric quantization.

Particularly, we implement per-row quantization [12] for weight quantization and per-token quantization for activation [36].
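The advantage of asymmetric quantization is easiest to see on values that are not centered on zero. The following is a minimal quantize-then-dequantize sketch with one scale (and, for the asymmetric case, one zero point) per row; the function names and the shifted Gaussian test data are assumptions for illustration, not the paper's code.

```python
import numpy as np


def quant_symmetric(w, bits):
    """Per-row symmetric quantize-dequantize: range is [-max|w|, +max|w|]."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.maximum(np.abs(w).max(axis=1, keepdims=True) / qmax, 1e-8)
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale


def quant_asymmetric(w, bits):
    """Per-row asymmetric quantize-dequantize: range is [min(w), max(w)]."""
    levels = 2 ** bits - 1
    lo = w.min(axis=1, keepdims=True)
    hi = w.max(axis=1, keepdims=True)
    scale = np.maximum((hi - lo) / levels, 1e-8)
    zero_point = np.round(-lo / scale)
    q = np.clip(np.round(w / scale) + zero_point, 0, levels)
    return (q - zero_point) * scale


rng = np.random.default_rng(0)
w = rng.standard_normal((8, 256)) + 2.0  # deliberately skewed away from zero

mse_sym = np.mean((w - quant_symmetric(w, 4)) ** 2)
mse_asym = np.mean((w - quant_asymmetric(w, 4)) ** 2)
print(f"INT4 symmetric MSE:  {mse_sym:.5f}")
print(f"INT4 asymmetric MSE: {mse_asym:.5f}")
```

On skewed data the symmetric grid spends part of its range on values that never occur, which is the "utilization of the quantization range" effect the paper refers to.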
 

Robustness of Weight-only Quantization for Large Models.

INT8 weight-only quantization, either symmetric or asymmetric, results in negligible accuracy loss (less than 0.05, i.e., Class-1).
For INT4 quantization, the asymmetric method outperforms the symmetric approach in accuracy, attributable to its superior utilization of the quantization range. Interestingly, larger models exhibit better tolerance to low-precision quantization (i.e., INT4) than smaller models, with a few exceptions such as OPT-66B.


Challenge Encountered in Activation Quantization for Large Models.

Activation quantization has consistently proven more difficult than weight quantization.
When compared to weight-only quantization, activation-only quantization indicates that asymmetric quantization can significantly improve performance over symmetric quantization. Moreover, contrary to weight-only quantization, smaller models typically exhibit better tolerance to activation quantization, as their hidden dimension is smaller and the activation dynamic range is also narrower than larger models [36].

existing quantization methods optimally harnessing the potential to minimize LLMs sizes

Fine-grained Quantization and Its Evaluation
finer-grained quantization schemes [5], where every k elements possess their own scaling factor and/or zero point.
For models of considerable size, specifically those equal to or exceeding 1B, the application of such fine-grained activation quantization (Case-1) results in a substantial reduction in quantization error compared to per-row activation (Case-2). By implementing fine-grained activation quantization with weight quantization (Case-3), we are able to almost restore the performance to the level of their W4A16 counterparts.
A trend of superior accuracy is observed with smaller block sizes in contrast to larger ones. However, the enhancement in performance reaches a saturation point when the block size is smaller than or equal to 256, which corresponds to the range of values INT8 can represent. Despite INT8's capability to signify 256 distinct values, activation quantization errors persist due to the application of uniform quantization.
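A rough sketch of what "every k elements possess their own scaling factor and/or zero point" means in practice. The function `quant_blockwise`, the INT4 setting, and the synthetic matrix with two outlier columns are assumptions chosen to make the effect visible; they are not taken from the paper.

```python
import numpy as np


def quant_blockwise(w, bits, block):
    """Asymmetric quantize-dequantize with one scale and zero point
    per `block` consecutive elements of each row."""
    rows, cols = w.shape
    assert cols % block == 0, "block must divide the row length"
    g = w.reshape(rows, cols // block, block)
    levels = 2 ** bits - 1
    lo = g.min(axis=-1, keepdims=True)
    hi = g.max(axis=-1, keepdims=True)
    scale = np.maximum((hi - lo) / levels, 1e-8)
    zero_point = np.round(-lo / scale)
    q = np.clip(np.round(g / scale) + zero_point, 0, levels)
    return ((q - zero_point) * scale).reshape(rows, cols)


rng = np.random.default_rng(0)
w = rng.standard_normal((16, 1024))
w[:, [3, 700]] *= 20.0  # two synthetic outlier columns

errors = {}
for block in (1024, 256, 64):  # 1024 is equivalent to per-row quantization
    errors[block] = np.mean((w - quant_blockwise(w, 4, block)) ** 2)
    print(f"block size {block:4d}: MSE {errors[block]:.5f}")
```

With per-row quantization a single outlier stretches the grid for the whole row; with smaller blocks only the block containing the outlier pays for it. The cost is metadata: assuming one FP16 scale per block, a block size of 64 adds 16 / 64 = 0.25 bits per element.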


LoRC (Low Rank Compensation)

LoRC can be viewed as a supplementary feature to existing quantization methodologies such as RTN, GPTQ, and ZeroQuant-Local/Global, and can be seamlessly integrated with FGQ.
low-rank dimension m can be as small as 4 or 8
the two low-rank matrices, Û and V̂, can be quantized to 8-bit without any performance discrepancy

the combination of fine-grained quantization with LoRC yields the most impressive results, underscoring the efficacy of LoRC when integrated with FGQ. Overall, the results emphasize the benefits of using LoRC for enhanced performance in weight quantization and its compatibility with FGQ. Notably, recovering the last 0.05-0.1 perplexity can be challenging, but with LoRC, we are able to nearly recover the original model quality for INT4 quantization.
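As a sketch of "low-rank matrix factorization on the quantization error matrix": take the error E = W - Ŵ, compute its SVD, and keep the top m singular directions, splitting the singular values evenly between the two factors. The RTN quantizer, the matrix size, and the function names below are assumptions for illustration; the sketch also keeps Û and V̂ in floating point rather than quantizing them to 8-bit.

```python
import numpy as np


def rtn_int4(w):
    """Per-row symmetric round-to-nearest INT4 quantize-dequantize."""
    qmax = 7
    scale = np.maximum(np.abs(w).max(axis=1, keepdims=True) / qmax, 1e-8)
    return np.clip(np.round(w / scale), -qmax, qmax) * scale


def lorc_factors(w, w_hat, rank):
    """Rank-`rank` factors (u_hat, v_hat) with u_hat @ v_hat ~= w - w_hat."""
    err = w - w_hat
    u, s, vt = np.linalg.svd(err, full_matrices=False)
    root = np.sqrt(s[:rank])
    u_hat = u[:, :rank] * root           # shape (d_out, rank)
    v_hat = root[:, None] * vt[:rank]    # shape (rank, d_in)
    return u_hat, v_hat


rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256))
w_hat = rtn_int4(w)
u_hat, v_hat = lorc_factors(w, w_hat, rank=8)

err_before = np.linalg.norm(w - w_hat)
err_after = np.linalg.norm(w - (w_hat + u_hat @ v_hat))
print(f"error before LoRC: {err_before:.4f}")
print(f"error after  LoRC: {err_after:.4f}")
```

The truncated SVD is the best rank-m approximation of E in Frobenius norm, so the compensated error cannot be larger than the uncompensated one. On this random matrix the gain is small, because rounding error on i.i.d. data is close to full rank; that says nothing about the gains the paper reports on real LLM weights.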

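How are U and V used once they have been computed? The paper does not spell out this detail!

It only mentions:

But if the two were simply added to Ŵ like this, there would be no need to store the two matrices separately, which contradicts the storage increase the paper mentions. Moreover, Ŵ may already be partly saturated, so the addition would not necessarily improve accuracy.

Alternatively, the activation may be multiplied with Ŵ and with Û, V̂ separately, and the results then summed? That would require an extra matrix multiplication step.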
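The two readings discussed above give the same output mathematically; they differ in what has to be stored and how many matmuls run. A small sketch with assumed dimensions (d_in = d_out = 512, m = 8) and random stand-in matrices; which reading the paper intends is not confirmed here.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, m = 512, 512, 8               # assumed sizes for illustration
x = rng.standard_normal((4, d_in))         # activations
w_hat = rng.standard_normal((d_in, d_out)) # stands in for a dequantized weight
u_hat = rng.standard_normal((d_in, m))
v_hat = rng.standard_normal((m, d_out))

# Reading 1: merge the low-rank term into the weight, then one matmul.
y_merged = x @ (w_hat + u_hat @ v_hat)

# Reading 2: keep the factors separate and add a thin extra matmul path.
y_separate = x @ w_hat + (x @ u_hat) @ v_hat

print("outputs match:", np.allclose(y_merged, y_separate))

# Extra parameters needed to store the two factors separately.
extra_params = m * (d_in + d_out)
overhead = extra_params / (d_in * d_out)
print(f"extra parameters: {extra_params} ({overhead:.2%} of the weight matrix)")
```

Under the first reading the merged matrix is no longer on the INT4 grid, so it would have to be stored at higher precision. Under the second, Ŵ stays quantized and the factors add m(d_in + d_out) parameters, which is consistent with a small storage increase at the price of the extra matmul step noted above.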

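This work does not appear to require any training? ZQ-Global, though, should need distillation.

The paper's experiments evaluating how LoRC improves W4A8 accuracy are not thorough enough!

It is unclear whether plain dynamic min/max activation quantization can improve accuracy, or whether near-lossless results can only be achieved in combination with ZQ-Global.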
