【InternLM】LMDeploy部署实践

1. LMDeploy基本介绍

LMDeploy提供一站式的大模型压缩、部署和服务，其主要特点包括：

高效的推理速度。通过引入持久批处理(即连续批处理)、阻塞 KV 缓存、动态拆分与融合、张量并行、高性能 CUDA 内核等关键特性，提供了比 vLLM 高1.8倍的请求吞吐量；
(注：vLLM是一款开源的被广泛使用的大模型推理库，其支持大量的Hugging Face模型， LMDeploy广泛将vLLM作为对比对象)
高效的量化策略。LMDeploy支持权重和KV量化，INT4量化的推理速度是FP16的2.4被。量化效果已经得到了OpenCompass评估的验证。
方便的分布式部署。利用请求分发服务，LMDeploy 实现了跨多机和多卡的多模型服务的方便部署。
交互式推理模式。通过在多轮对话过程中缓存注意力的k/v，从而记住对话历史，避免历史会话的重复处理。

LMDeploy在A100卡上与vLLM的全面对比图如下：
在这里插入图片描述
目前LMDeploy支持的模型见下图，基本涵盖了主流的开源大模型：

2. 作业实践

2.1 环境配置

安装lmdeploy：pip install lmdeploy[all]==0.3.0
在这里插入图片描述

2.2 模型下载

测试模型为InternLM2-Chat-1.8B，利用git-lfs从OpenXLab平台进行下载：
step 1：安装git-lfs组件

curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash
apt update
apt install git-lfs   
git lfs install  --system

step 2: 下载InternLM2-Chat-1.8B模型：

git clone https://code.openxlab.org.cn/OpenLMLab/internlm2-chat-1.8b.git

在这里插入图片描述

2.3 定量速度测试

简单对比下transformers和lmdeploy的推理速度。

采用原生transformers进行Internlm2-chat-1.8b的推理：

import torch
import datetime
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("/root/internlm2-chat-1_8b", trust_remote_code=True)

# Set `torch_dtype=torch.float16` to load model in float16, otherwise it will be loaded as float32 and cause OOM Error.
model = AutoModelForCausalLM.from_pretrained("/root/internlm2-chat-1_8b", torch_dtype=torch.float16, trust_remote_code=True).cuda()
model = model.eval()

# warmup
inp = "hello"
for i in range(5):
    print("Warm up...[{}/5]".format(i+1))
    response, history = model.chat(tokenizer, inp, history=[])

# test speed
inp = "请创作一首五言绝句。"
times = 10
total_words = 0
start_time = datetime.datetime.now()
for i in range(times):
    response, history = model.chat(tokenizer, inp, history=history)
    total_words += len(response)
end_time = datetime.datetime.now()

delta_time = end_time - start_time
delta_time = delta_time.seconds + delta_time.microseconds / 1000000.0
speed = total_words / delta_time
print("Speed: {:.3f} words/s".format(speed))

整个的推理速度为Speed: 41.292 words/s（与token正向线性相关）
在这里插入图片描述

采用lmdeploy进行推理：

import datetime
from lmdeploy import pipeline

pipe = pipeline('/root/internlm2-chat-1_8b')

# warmup
inp = "hello"
for i in range(5):
    print("Warm up...[{}/5]".format(i+1))
    response = pipe([inp])

# test speed
inp = "请创作一首五言绝句。"
times = 10
total_words = 0
start_time = datetime.datetime.now()
for i in range(times):
    response = pipe([inp])
    total_words += len(response[0].text)
end_time = datetime.datetime.now()

delta_time = end_time - start_time
delta_time = delta_time.seconds + delta_time.microseconds / 1000000.0
speed = total_words / delta_time
print("Speed: {:.3f} words/s".format(speed))

整个的推理速度为Speed: 41.292 words/s（与token正向线性相关）
在这里插入图片描述

可见，测试样例上lmdeploy的推理速度为原生transformers的6倍。

2.4 设置KV Cache和W4A16量化

常见的 LLM 模型由于 Decoder Only 架构的特性，实际推理时大多数的时间都消耗在了逐 Token 生成阶段（Decoding 阶段），是典型的访存密集型场景。可通过缓存机制、量化算法等方法来优化LLM推理过程。

2.4.1 KV Cache基本概念

KV Cache是一种常用的大模型推理性能优化技术，其可以在不影响任何精度的前提下，通过时间换空间的思想，提升推理性能。

LLM作为生成式模型，其推理阶段是逐个token输出的，next token的推理会用到前面输出token和输入token的拼接，如何不断反复直到遇到终止符，其基本代码如下：

i = 0
with torch.no_grad():
    while out_token != token_eos:   # 根据词表定义的token_eos
        logits, _ = model(in_tokens)   # 前向推理
        out_token = torch.argmax(logits[-1, :], dim=0, keepdim=True)   # 贪婪解码
        in_tokens = torch.cat((in_tokens, out_token), 0)    # 输入拼接
        text = tokenizer.decode(in_tokens)   # 用作打印
        print(f'step {i} input: {text}', flush=True)
        i += 1

其输出效果类似于：

step 0 input: Lionel Messi is a player
step 1 input: Lionel Messi is a player who
step 2 input: Lionel Messi is a player who has
step 3 input: Lionel Messi is a player who has been
step 4 input: Lionel Messi is a player who has been a
step 5 input: Lionel Messi is a player who has been a key
step 6 input: Lionel Messi is a player who has been a key part
step 7 input: Lionel Messi is a player who has been a key part of
step 8 input: Lionel Messi is a player who has been a key part of the
step 9 input: Lionel Messi is a player who has been a key part of the team
step 10 input: Lionel Messi is a player who has been a key part of the team's
step 11 input: Lionel Messi is a player who has been a key part of the team's success
step 12 input: Lionel Messi is a player who has been a key part of the team's success.

这也就是为什么我们看到的绝大部分LLM应用采用流式输出的原因。

不难发现，随着拼接token的长度变长，推理过程的FLOPs会随之增大。另一方面，后面每个token的推理均用到了前面所有输出token的k-v值，其存在着重复计算的浪费。如果能够将前面输出token的k-v值缓存起来，那可以大大提升推理效率。如在transformers库中设置past_key_values参数，即可实现KV缓存的效果。当然其内部实现逻辑也是concat，不过是原KV缓存和新tokenKV值的concat。

with torch.no_grad():
    while out_token != token_eos:
        logits, kvcache = model(in_tokens, past_key_values=kvcache) # 增加了一个 past_key_values 的参数
        out_token = torch.argmax(logits[-1, :], dim=0, keepdim=True)
        in_tokens = out_token # 输出 token 直接作为下一轮的输入，不再拼接
        text = tokenizer.decode(in_tokens)
        print(f'step {i} input: {text}', flush=True)

采用KV Cache的方法，模型推理中的现存消耗主要可以分为三个部分：

2.4.2 W4A16量化基本概念

模型量化是一种重要的减小模型大小、提升模型推理效率的手段，其将神经网络的浮点计算转换为定点计算，从而将连续的无限值映射到一组较小的离散有限值。模型参数本身占用的显存、KV Cache占用的显存，以及中间运算结果占用的显存。

神经网络模型常见的权重数据格式有FP32（32位浮点，单精度）、FP16（半精度浮点）、INT8（8位整型）和INT4（4位整型）。如采用INT8格式，其相较于FP32可减小75%的模型大小。

按照量化阶段的不同，模型量化又可分为：训练时量化（quantization aware training， QAT）和训练后量化（post-training quantization， PTQ），QAT需要在训练阶段就对量化误差进行建模，一般能够获得较低的精度损失。PTQ 直接对普通训练后的模型进行量化，过程简单。

对于大模型而言，常见量化有两种设置：
1）W8A8 量化，即权重和激活都量化为 INT8；
2）W4A16量化，即权重量化为INT4，激活量化为FLOAT16.

适用于大模型的一些常见量化技术有：
在这里插入图片描述
以Qwen-7B-Chat为例，给出了量化后的效果对比。可见合理的采用量化技术，可以明显减小缓存、有效提升推理速度，同时保证模型的精度没有明显的损失。

模型评估效果

在这里插入图片描述

推理速度
显存使用

2.4.3 LMDeploy的设置

KV Cache

LMDeploy的KV Cache管理器可以通过设置--cache-max-entry-count参数，控制KV缓存占用剩余显存的最大比例。默认的比例为0.8。

lmdeploy chat /root/internlm2-chat-1_8b --cache-max-entry-count 0.8

仍以internlm2-chat-1_8b模型为例进行试验，通过设置--cache-max-entry-count为不同值，可得到以下的表格：

cache-max-entry-count值	模型显存大小(MiB)
0.8（默认值）	20936
0.5	14792
0.1	6600
0.01	4712

随着参数值的减小，能感受到明显推理速度的减小。

W4A16量化

LMDeploy使用AWQ算法，实现模型4bit权重量化，同时推理引擎TurboMind提供了非常高效的4bit推理cuda kernel。

直接利用如下的命令下完成4bit的量化工作：

lmdeploy lite auto_awq \
   /root/internlm2-chat-1_8b \
  --calib-dataset 'ptb' \
  --calib-samples 128 \
  --calib-seqlen 1024 \
  --w-bits 4 \
  --w-group-size 128 \
  --work-dir /root/internlm2-chat-1_8b-4bit

同样设置--cache-max-entry-count为不同值，观察模型占用现存的情况：

cache-max-entry-count值	bit4量化后模型显存大小(MiB)
0.8（默认值）	20516
0.5	13732
0.1	4708
0.01	2628

可见，通过量化手段可以有效减小模型的显存占用率，当然代价是会损失一定的推理精度。

2.4.4 KV Cache+W4A16量化模型推理

下图为设置KV Cache最大占用比例为0.4，开启W4A16量化后，以命令行方式与模型对话：

lmdeploy chat /root/internlm2-chat-1_8b-4bit --model-format awq --cache-max-entry-count 0.4

在这里插入图片描述

2.5 API Server方式启动 lmdeploy

将大模型封装为API接口服务，可以更方便地让客户端用户进行调用，其电箱的服务流程结构图包括三个模块：
1）模型推理/服务。主要提供模型本身的推理，一般来说可以和具体业务解耦，专注模型推理本身性能的优化。可以以模块、API等多种方式提供。
2）API Server。中间协议层，把后端推理/服务通过HTTP，gRPC或其他形式的接口，供前端调用。
3）Client。可以理解为前端，与用户交互的地方。通过通过网页端/命令行去调用API接口，获取模型推理/服务。

上述三个模块组成的架构图如下：
在这里插入图片描述

2.5.1 启动后端

启动 lmdeploy后端，开启 W4A16量化，调整KV Cache的占用比例为0.4：

lmdeploy serve api_server \
    /root/internlm2-chat-1_8b-4bit \
    --model-format awq \
    --quant-policy 1 \
    --cache-max-entry-count 0.4 \
    --server-name 0.0.0.0 \
    --server-port 23333 \
    --tp 1

其背后使用FastAPI作为server，提供如下的接口功能：
在这里插入图片描述

2.5.2 命令行交互

命令行启动客户端：

lmdeploy serve api_client http://localhost:23333

运行后，可以通过命令行窗口直接与模型对话：
在这里插入图片描述

2.5.3 gradio客户端

使用gradio作为client，启动网页客户端并作简单测试：
在这里插入图片描述

2.6 使用Python代码集成的方式运行LMDeploy

lmdeploy效仿transformers库，提供了pipeline的端到端调用：

from lmdeploy import pipeline

pipe = pipeline('/root/internlm2-chat-1_8b')
response = pipe(['你好，请给我作一首诗', '离离原上草'])
print(response)

返回结果：
在这里插入图片描述

小模型的语义理解能力确实还是略显弱小，并没有正确理解本人的意图做出合理的回答。

在python代码中，同样可以设置KV Cache等参数。下面的示例使用W4A16量化，调整KV Cache的占用比例为0.4：

from lmdeploy import pipeline, TurbomindEngineConfig

# 调低 k/v cache内存占比调整为总显存的 40%
backend_config = TurbomindEngineConfig(cache_max_entry_count=0.4)

# 使用W4A16量化
pipe = pipeline('/root/internlm2-chat-1_8b-4bit',
                backend_config=backend_config)
response = pipe(['你好，请给我作一首诗', '离离原上草'])
print(response)

返回结果：
在这里插入图片描述
模型存在明显的幻觉现象，《离离原上草》是草根歌手周杰伦的代表作了。。。。

2.7 使用 LMDeploy 运行视觉多模态大模型

最新版本的LMDeploy支持了llava多模态模型，这里以llava-v1.6-7b进行推理：

import gradio as gr
from lmdeploy import pipeline, TurbomindEngineConfig


backend_config = TurbomindEngineConfig(session_len=8192) # 图片分辨率较高时请调高session_len
# pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b', backend_config=backend_config) 非开发机运行此命令
pipe = pipeline('/share/new_models/liuhaotian/llava-v1.6-vicuna-7b', backend_config=backend_config)

def model(image, text):
    if image is None:
        return [(text, "请上传一张图片。")]
    else:
        response = pipe((text, image)).text
        return [(text, response)]

demo = gr.Interface(fn=model, inputs=[gr.Image(type="pil"), gr.Textbox()], outputs=gr.Chatbot())
demo.launch()