Deploying LLM and Embedding Models on a Server
- 1. Overview
- 2. Setting up the Python Environment
- 2.1 Installing Python
- 2.2 Installing a Virtual Environment Manager (Miniconda3)
- 3. Deploying the LLM
- 3.1 Installing Git
- 3.2 Obtaining Model Files
- 3.3 Inference with Transformers
- 3.4 Inference with vLLM
- 4. Deploying an Embedding Model Locally
1. Overview
Server hardware: 32 vCPU + 128 GiB RAM + 2 × NVIDIA A100 (2 × 80 GB)
Server OS: Ubuntu 22.04 64-bit
Drivers: CUDA 12.4.1 / Driver 550.90.07 / cuDNN 9.2.0.82
Environment requirements for Qwen2.5-14B:
Python >= 3.10
Transformers >= 4.37.0
PyTorch >= 2.3.1
2. Setting up the Python Environment
2.1 Installing Python
Ubuntu 22.04 64-bit normally ships with Python 3.10 preinstalled, and most Linux distributions include a default Python environment.
To install a specific Python version:
sudo apt install python3.10
Check the installed version:
python3 --version
2.2 Installing a Virtual Environment Manager (Miniconda3)
Download the Miniconda installer script:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
Run the installer:
bash Miniconda3-latest-Linux-x86_64.sh
Reload the shell configuration so the changes take effect:
source ~/.bashrc
Verify the installation:
conda --version
Create a Python environment:
conda create --name inference_model python=3.10
Activate the environment:
conda activate inference_model
Note: for PyTorch builds, refer to https://pytorch.org/. PyTorch is the foundation framework the model runs on, so the version matters; pick a build from that page that matches your CUDA version. Other dependencies can be installed as needed.
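After installing the dependencies, a quick sanity check can be run inside the inference_model environment to confirm that PyTorch, Transformers, and the GPUs are usable (a minimal sketch; the printed versions depend on what you installed):
# sanity_check.py - verify the Python environment and GPU visibility
import torch
import transformers

print("PyTorch version:", torch.__version__)
print("Transformers version:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("GPU 0:", torch.cuda.get_device_name(0))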
3. Deploying the LLM
3.1 Installing Git
Install Git:
sudo apt install git
Install the Git LFS extension for versioning large files:
sudo apt-get install git-lfs
Verify the installation:
git lfs --version
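Note: Git LFS usually also needs its hooks registered once before cloning large model repositories:
git lfs install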
3.2 Obtaining Model Files
AI model communities:
Alibaba ModelScope: https://modelscope.cn/my/overview
Hugging Face: https://huggingface.co/
Baidu PaddlePaddle: https://www.paddlepaddle.org.cn/
There are usually two ways to download model weights:
Via git:
git clone https://www.modelscope.cn/Qwen/Qwen2.5-72B-Instruct.git
Alternatively, ModelScope, Hugging Face, and similar hubs can download weights automatically when a model is referenced by name, as sketched below.
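A minimal sketch of downloading by model name (assumes the modelscope package is installed; huggingface_hub offers an equivalent snapshot_download):
# download weights by model name instead of git clone
from modelscope import snapshot_download  # pip install modelscope

# downloads the weights to the local cache and returns the directory path
model_dir = snapshot_download('Qwen/Qwen2.5-72B-Instruct')
print(model_dir)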
3.3 Inference with Transformers
Overview: Transformers is Hugging Face's popular open-source library, which bundles inference and training workflows for pretrained deep-learning models; here it is combined with FastAPI and uvicorn to publish the model as an API service.
Start the service with:
curl -X POST "http://127.0.0.1:8080/v1" \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "你好"}'
The code for chatdemo.py:
from fastapi import FastAPI, Request
from transformers import AutoTokenizer, AutoModelForCausalLM
import uvicorn
import json
import datetime
import torch

# Device settings
DEVICE = "cuda"  # use CUDA
DEVICE_ID = "0"  # CUDA device ID; leave empty to use the default device
CUDA_DEVICE = f"{DEVICE}:{DEVICE_ID}" if DEVICE_ID else DEVICE  # combined device string

# Release cached GPU memory
def torch_gc():
    if torch.cuda.is_available():  # only if CUDA is available
        with torch.cuda.device(CUDA_DEVICE):  # select the CUDA device
            torch.cuda.empty_cache()  # release cached memory
            torch.cuda.ipc_collect()  # collect CUDA IPC memory

# Create the FastAPI application
app = FastAPI()

# Endpoint handling POST requests
@app.post("/v1")
async def create_item(request: Request):
    global model, tokenizer  # use the globally loaded model and tokenizer
    json_post_raw = await request.json()  # read the JSON body of the POST request
    json_post = json.dumps(json_post_raw)  # serialize it back to a string
    json_post_list = json.loads(json_post)  # parse it into a Python object
    prompt = json_post_list.get('prompt')  # extract the prompt field
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]
    # Run chat generation with the model
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    model_inputs = tokenizer([text], return_tensors="pt").to('cuda')
    generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512)
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    now = datetime.datetime.now()  # current time
    time = now.strftime("%Y-%m-%d %H:%M:%S")  # formatted timestamp
    # Build the JSON response
    answer = {
        "response": response,
        "status": 200,
        "time": time
    }
    # Build the log entry
    log = "[" + time + "] " + 'prompt:"' + prompt + '", response:"' + repr(response) + '"'
    print(log)  # print the log
    torch_gc()  # free cached GPU memory
    return answer  # return the response

# Entry point
if __name__ == '__main__':
    # Load the pretrained tokenizer and model
    model_name_or_path = '/home/llm/model/Qwen/Qwen2.5-72B-Instruct/'
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=False)
    model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="auto", torch_dtype=torch.bfloat16)
    # Start the FastAPI application
    # Port 8080 can be forwarded (e.g. from autodl) to the local machine so the API can be used locally
    uvicorn.run(app, host='0.0.0.0', port=8080, workers=1)  # run the app on the given host and port
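As an alternative to curl, a minimal Python client sketch for the endpoint above (assumes the service is running locally on port 8080 and the requests package is installed):
# simple client for the /v1 endpoint started by chatdemo.py
import requests

resp = requests.post(
    "http://127.0.0.1:8080/v1",
    json={"prompt": "你好"},  # same payload shape as the curl test
    timeout=300,
)
print(resp.json()["response"])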
3.4 Inference with vLLM
Overview: when the model is served with vLLM, it exposes an OpenAI-compatible API whose interface documentation can be browsed.
Start command:
python -m vllm.entrypoints.openai.api_server --model Qwen2.5-72B-Instruct --trust-remote-code --tensor-parallel-size 2 --port 8080
Test:
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{ "model": "Qwen/Qwen2.5-72B-Instruct", "messages": [ {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."}, {"role": "user", "content": "Tell me something about large language models."} ], "temperature": 0.7, "top_p": 0.8, "repetition_penalty": 1.05, "max_tokens": 512}'
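Because the API is OpenAI-compatible, the official openai Python client can also be pointed at it (a minimal sketch; assumes the server above is running on port 8080 and openai>=1.0 is installed):
# query the vLLM OpenAI-compatible server with the openai client
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")  # no real key needed locally
completion = client.chat.completions.create(
    model="Qwen2.5-72B-Instruct",  # must match the --model value used at startup
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me something about large language models."},
    ],
    temperature=0.7,
    max_tokens=512,
)
print(completion.choices[0].message.content)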
4. Deploying an Embedding Model Locally
Overview: the embedding model can be run with Transformers and published as an API service via FastAPI and uvicorn. The model used here comes from BAAI (Beijing Academy of Artificial Intelligence): https://model.baai.ac.cn/models
Start command:
nohup python inference.py &
The code for inference.py:
import uvicorn
from fastapi import FastAPI, Query
from transformers import AutoTokenizer, AutoModel
import torch

app = FastAPI()

@app.post('/v1')
async def predict(sentence: list[str] = Query(default=[], description='sentences to embed')):
    global model, tokenizer  # use the globally loaded model and tokenizer
    print("predict", sentence)
    # Sentences we want sentence embeddings for
    sentences = ["好好学习-1", "天天向上-2"]
    sentences.extend(sentence)
    print("sentences", sentences)
    # Tokenize sentences
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
    # For s2p (short query to long passage) retrieval, add an instruction to the query (not to the passages):
    # encoded_input = tokenizer([instruction + q for q in queries], padding=True, truncation=True, return_tensors='pt')
    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input)
        # Perform pooling. In this case, CLS pooling.
        sentence_embeddings = model_output[0][:, 0]
    # Normalize embeddings
    sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
    torch.set_printoptions(linewidth=120, precision=4, threshold=10000, edgeitems=10000)
    print("Sentence embeddings:", sentence_embeddings)
    return {"code": 0, "msg": "success", "data": sentence_embeddings.tolist()}

if __name__ == '__main__':
    # Load the model (downloaded from the Hugging Face Hub) from a local path
    model_name_or_path = '/home/embedding/BAAI/bge-small-zh-v1.5'
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
    model = AutoModel.from_pretrained(model_name_or_path)
    model.eval()
    # Start the FastAPI application
    # The service listens on port 8099
    uvicorn.run(app, host='127.0.0.1', port=8099, workers=1)
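A minimal client sketch for the embedding service above (assumes it is running locally on port 8099; because the handler reads sentence as a query parameter, the texts are passed as repeated query parameters). Note that the handler prepends its two demo sentences, so the returned list also contains embeddings for them:
# call the embedding service started by inference.py
import requests

resp = requests.post(
    "http://127.0.0.1:8099/v1",
    params={"sentence": ["这是一个测试句子", "another test sentence"]},  # repeated query parameter
    timeout=60,
)
body = resp.json()
print(len(body["data"]), "embeddings returned")
print(len(body["data"][0]), "dimensions each")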