The previous post covered speeding up Qwen1, and its notes on installing Auto-GPTQ still apply to Qwen2. Qwen2 also loads models much faster than Qwen1, though I'm not sure why.
Below is the Hugging Face Transformers version of Qwen2. It generates around 15 tokens per second, which is still not fast enough, so in this post we use vLLM to more than double that, reaching 38 tokens/s.
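For context, the Transformers baseline was roughly the stock generation loop below (a minimal sketch, not the exact benchmark script; the model path and prompt are placeholders):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder: local directory of the Qwen2 GPTQ-Int4 checkpoint
model_path = "Qwen2-72B-Instruct-GPTQ-Int4"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")

messages = [{"role": "user", "content": "Introduce vLLM in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Plain generate; this path tops out around 15 tokens/s in my setup
output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True))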
To get the vLLM speedup, I followed the CSDN post 【以Qwen2为例】vLLM流式推理部署,openai接口调用,requests调用 (vLLM streaming deployment for Qwen2, called via the OpenAI API and requests). But following its steps exactly (with the same package versions):
pip install vllm
pip install nvidia-nccl-cu12==2.20.5
and then starting the vLLM backend:
python -m vllm.entrypoints.openai.api_server
produced this error:
(VllmWorkerProcess pid=1567610) INFO 07-08 09:29:33 utils.py:613] Found nccl from environment variable VLLM_NCCL_SO_PATH=/mnt/shareEEx/chenyixiang/nccl/usr/lib/x86_64-linux-gnu/
(VllmWorkerProcess pid=1567609) ERROR 07-08 09:29:33 pynccl_wrapper.py:196] Failed to load NCCL library from /mnt/shareEEx/chenyixiang/nccl/usr/lib/x86_64-linux-gnu/ .It is expected if you are not running on NVIDIA/AMD GPUs.Otherwise, the nccl library might not exist, be corrupted or it does not support the current platform Linux-5.4.0-182-generic-x86_64-with-glibc2.31.If you already have the library, please set the environment variable VLLM_NCCL_SO_PATH to point to the correct nccl library path.
This means the NCCL collective-communication library was not set up properly: the vLLM backend could not find nccl, and the log asks me to point VLLM_NCCL_SO_PATH at the correct library.
So I looked up the installation instructions on the NVIDIA site. Most of them install deb packages with sudo, but my machine has no sudo access, so I downloaded the standalone archive instead: https://developer.nvidia.com/downloads/compute/machine-learning/nccl/secure/2.22.3/agnostic/x64/nccl_2.22.3-1+cuda12.2_x86_64.txz/
After extracting it to ~/Qwen/nccl_2.22.3-1+cuda12.2_x86_64/, I set VLLM_NCCL_SO_PATH:
# Wrong
export VLLM_NCCL_SO_PATH=~/Qwen/nccl_2.22.3-1+cuda12.2_x86_64/lib/
The vLLM backend still could not find nccl. It took some trial and error to discover that the variable has to point directly at the .so file, not at the lib/ directory (unlike most path-style variables, which take a directory):
# Correct
export VLLM_NCCL_SO_PATH=~/Qwen/nccl_2.22.3-1+cuda12.2_x86_64/lib/libnccl.so
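Before restarting the server, you can sanity-check the path by loading the library through ctypes, which is essentially what vLLM's pynccl wrapper does (a quick check; the path is the one set above):

import ctypes
import os

# The exact .so file that VLLM_NCCL_SO_PATH points at
so_path = os.path.expanduser("~/Qwen/nccl_2.22.3-1+cuda12.2_x86_64/lib/libnccl.so")

lib = ctypes.CDLL(so_path)  # raises OSError if the file is missing or unloadable

# ncclGetVersion fills an int encoded as major*10000 + minor*100 + patch,
# e.g. 22203 for NCCL 2.22.3
version = ctypes.c_int()
lib.ncclGetVersion(ctypes.byref(version))
print("NCCL version:", version.value)

If CDLL succeeds here, the same path should work for vLLM.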
Running again:
python -m vllm.entrypoints.openai.api_server --served-model-name Qwen2-72B-Instruct-GPTQ-Int4 --model $MODEL_PATH \
--tensor-parallel-size 4 --host 0.0.0.0 --port 8008
Success!
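With the server up, a quick way to confirm it is actually serving the model (assuming the host and port above) is to list the models over the OpenAI-compatible API:

import requests

# The vLLM OpenAI-compatible server exposes GET /v1/models
resp = requests.get("http://0.0.0.0:8008/v1/models")
print(resp.json())  # should list "Qwen2-72B-Instruct-GPTQ-Int4"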
With short contexts, throughput reaches an impressive 38.7 tokens/s, essentially on par with the official A100 figures. Compared with the Transformers version at the start of this post, the speed has more than doubled.
Client code for calling the server:
from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://0.0.0.0:8008/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="Qwen2-72B-Instruct-GPTQ-Int4",
    messages=[
        # System prompt: "You are a helpful AI assistant"
        {"role": "system", "content": "你是一个有用的人工智能助手"},
        # User prompt: "Why is sashimi (生鱼片, 'live fish slices') actually
        # dead fish slices? Write an explanation of at least 1000 characters."
        {"role": "user", "content": "为什么生鱼片其实是死鱼片?对此生成不少于1000字的解释。"},
    ],
)
# chat_response.choices[0].message.content holds just the generated text
print("Chat response:", chat_response)