开源模型应用落地-Qwen2.5-7B-Instruct与TGI实现推理加速

news2025/1/14 18:22:56

一、前言

目前，大语言模型已升级至Qwen2.5版本。无论是语言模型还是多模态模型，均在大规模多语言和多模态数据上进行预训练，并通过高质量数据进行后期微调以贴近人类偏好。在本篇学习中，将集成 Hugging Face的TGI框架实现模型推理加速，现在，我们赶紧跟上技术发展的脚步，去体验一下新版本模型的推理质量。

二、术语

2.1. TGI

Hugging Face 的 Text Generation Inference (TGI) 是一个专为部署大规模语言模型 (Large Language Models, LLMs) 而设计的生产级框架。TGI 支持多种流行的开源 LLMs，包括 Llama、Falcon、StarCoder、BLOOM、GPT-NeoX 和 T5，实现高性能的文本生成。

TGI 提供了多个优化和功能，包括：

简单的启动程序，能够服务大多数流行的 LLMs
具备分布式追踪功能（使用 Open Telemetry 和 Prometheus 监控）
张量并行，实现多 GPU 上更快的推理
使用服务器推送事件（SSE）进行流式传输
持续批处理来提高总吞吐量
针对流行架构优化的推理变换器代码，使用 Flash Attention 和 Paged Attention
量化支持，使用 bitsandbytes 和 GPT-Q
Safetensors 权重加载
针对大型语言模型的水印技术
Logits 变换器（温度缩放、top-p、top-k、重复惩罚）
停止序列和日志概率
微调支持：利用微调模型以提高特定任务的准确性和性能
指导功能：强制模型生成基于预定义输出模式的结构化输出

2.1. Qwen2.5

Qwen2.5系列模型都在最新的大规模数据集上进行了预训练，该数据集包含多达 18T tokens。相较于 Qwen2，Qwen2.5 获得了显著更多的知识（MMLU：85+），并在编程能力（HumanEval 85+）和数学能力（MATH 80+）方面有了大幅提升。

此外，新模型在指令执行、生成长文本（超过 8K 标记）、理解结构化数据（例如表格）以及生成结构化输出特别是 JSON 方面取得了显著改进。 Qwen2.5 模型总体上对各种system prompt更具适应性，增强了角色扮演实现和聊天机器人的条件设置功能。

与 Qwen2 类似，Qwen2.5 语言模型支持高达 128K tokens，并能生成最多 8K tokens的内容。它们同样保持了对包括中文、英文、法文、西班牙文、葡萄牙文、德文、意大利文、俄文、日文、韩文、越南文、泰文、阿拉伯文等 29 种以上语言的支持。我们在下表中提供了有关模型的基本信息。

专业领域的专家语言模型，即用于编程的 Qwen2.5-Coder 和用于数学的 Qwen2.5-Math，相比其前身 CodeQwen1.5 和 Qwen2-Math 有了实质性的改进。具体来说，Qwen2.5-Coder 在包含 5.5 T tokens 编程相关数据上进行了训练，使即使较小的编程专用模型也能在编程评估基准测试中表现出媲美大型语言模型的竞争力。同时，Qwen2.5-Math 支持中文和英文，并整合了多种推理方法，包括CoT（Chain of Thought）、PoT（Program of Thought）和 TIR（Tool-Integrated Reasoning）。

2.3. Qwen2.5-7B-Instruct

是通义千问团队推出的语言模型，拥有70亿参数，经过指令微调，能更好地理解和遵循指令。作为 Qwen2.5 系列的一部分，它在 18T tokens 数据上预训练，性能显著提升，具有多方面能力，包括语言理解、任务适应性、多语言支持等，同时也具备一定的长文本处理能力，适用于多种自然语言处理任务，为用户提供高质量的语言服务。

三、前提条件

3.1. 基础环境及前置条件

1）操作系统：centos7

2）Tesla V100-SXM2-32GB CUDA Version: 12.2

3）提前下载好Qwen2.5-7B-Instruct模型

通过以下两个地址进行下载，优先推荐魔搭

huggingface：

https://huggingface.co/Qwen/Qwen2.5-7B-Instruct/tree/main

ModelScope：

git clone https://www.modelscope.cn/qwen/Qwen2.5-7B-Instruct.git

按需选择SDK或者Git方式下载

使用git方式下载示例：

3.2. Anaconda安装

参见“开源模型应用落地-qwen-7b-chat与vllm实现推理加速的正确姿势（一）”

四、技术实现

4.1. 源码安装方式

git clone https://github.com/huggingface/text-generation-inference.git && cd text-generation-inference
make install

# 运行
text-generation-launcher --model-id /model/Qwen2.5-7B-Instruct --port 8080

4.2. Docker方式

model=Qwen/Qwen2.5-7B-Instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:8080 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model

4.3. 客户端调用

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "",
  "messages": [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": "Tell me something about large language models."}
  ],
  "temperature": 0.7,
  "top_p": 0.8,
  "repetition_penalty": 1.05,
  "max_tokens": 512
}'

五、附带说明

5.1. 问题一：Could not find a version that satisfies the requirement regex==2024.9.11 (from versions: none)

异常描述：

解决：

pip仓库中的确存在2024.9.11版本regex包，出现上述问题，可以尝试再次安装，即可解决。

5.2. 问题二：python setup.py egg_info did not run successfully.

异常描述：

Downloading flash_attn-2.6.1.tar.gz (2.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.6/2.6 MB 14.9 MB/s eta 0:00:00
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [8 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-rtbuq9iv/flash-attn_27279a6227a742aab778aecb310b09a8/setup.py", line 19, in <module>
          import torch
        File "/usr/local/miniconda3/envs/tgi/lib/python3.10/site-packages/torch/__init__.py", line 367, in <module>
          from torch._C import * # noqa: F403
      ImportError: /usr/local/miniconda3/envs/tgi/lib/python3.10/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12: undefined symbol: __nvJitLinkComplete_12_4, version libnvJitLink.so.12
      [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

解决：

v100不支持flash_attn，修改/text-generation-inference/server/Makefile文件

去掉install-flash-attention-v2-cuda
去掉include Makefile-flash-att-v2

5.3.问题三：libcusparse.so.12: undefined symbol: __nvJitLinkComplete_12_4, version libnvJitLink.so.12

异常描述：

Traceback (most recent call last):
File "/data/service/text-generation-inference/server/vllm/setup.py", line 10, in <module>
import torch
File "/usr/local/miniconda3/envs/tgi/lib/python3.10/site-packages/torch/__init__.py", line 367, in <module>
from torch._C import * # noqa: F403
ImportError: /usr/local/miniconda3/envs/tgi/lib/python3.10/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12: undefined symbol: __nvJitLinkComplete_12_4, version libnvJitLink.so.12
make[1]: *** [Makefile-vllm:5: build-vllm-cuda] Error 1
make[1]: Leaving directory '/data/service/text-generation-inference/server'
make: *** [Makefile:2: install-server] Error 2

解决：

export LD_LIBRARY_PATH=/usr/local/miniconda3/envs/tgi/lib/python3.10/site-packages/nvidia/nvjitlink/lib:$LD_LIBRARY_PATH

参考：