A Guide to Downloading, Locally Deploying, and Calling Large Language Models

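Part 0: Downloading a Model, with Qwen/Qwen2.5-7B as the Example

1. Preparation before downloading

1.1 Read the model card carefully

The model card lists basic information such as the supported context length, the model architecture, and the parameter count.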

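1.2 Inspect the model files

Tokenizer-related files:

merges.txt: the ordered list of merge rules used by Byte Pair Encoding (BPE). BPE is a common subword tokenization algorithm that repeatedly merges frequent adjacent symbol pairs, so common words become single tokens while rare words are split into smaller, reusable subword units.
tokenizer_config.json: the tokenizer's configuration file, containing its settings and parameters.
tokenizer.json: the main tokenizer file, a single serialized definition of the tokenizer that includes the vocabulary data.
vocab.json: the vocabulary, mapping every token to its integer index.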
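To make the relationship between merges.txt and vocab.json concrete, here is a minimal sketch of how ranked BPE merges turn a word into token ids. The merge list and vocabulary below are toy values invented for illustration, not Qwen's actual data, and real tokenizers add byte-level handling and pre-tokenization on top of this.

```python
def apply_bpe(word, merges):
    """Split a word into characters, then apply merges in rank order."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Pick the adjacent pair with the best (lowest) merge rank.
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no applicable merge rule is left
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Toy stand-ins for merges.txt (ordered rules) and vocab.json (token -> id).
toy_merges = [("l", "o"), ("lo", "w"), ("e", "r")]
toy_vocab = {"low": 0, "er": 1, "l": 2, "o": 3, "w": 4, "e": 5, "r": 6}

tokens = apply_bpe("lower", toy_merges)
ids = [toy_vocab[t] for t in tokens]
print(tokens, ids)
```

The order of the rules matters: earlier lines in the merge list take priority, which is why merges.txt has to be kept as an ordered file rather than a set.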

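Model-related files:

config.json: the model's configuration file, containing its architecture settings and parameters.
generation_config.json: the configuration file for generation, containing the default generation settings and parameters.
model-xxx.safetensors: the weight files, containing the model's pretrained weights split into shards.
model.safetensors.index.json: the index for the sharded weights. It maps each tensor name to the shard file that stores it, so the loader can locate the weights.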
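As a rough illustration of what the index file holds, the sketch below groups tensor names by shard using the `weight_map` field. The index shown is a hypothetical, heavily shortened example; a real model.safetensors.index.json lists every tensor in the model.

```python
import json
from collections import defaultdict


def tensors_per_shard(index):
    """Group tensor names by the shard file named in the index's weight_map."""
    shards = defaultdict(list)
    for tensor_name, shard_file in index["weight_map"].items():
        shards[shard_file].append(tensor_name)
    return dict(shards)


# Hypothetical, shortened index contents for illustration only.
example_index = json.loads("""
{
  "metadata": {"total_size": 123},
  "weight_map": {
    "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
    "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "lm_head.weight": "model-00002-of-00002.safetensors"
  }
}
""")

summary = tensors_per_shard(example_index)
for shard, names in sorted(summary.items()):
    print(shard, len(names))
```

To inspect a downloaded model, replace `example_index` with `json.load(open(".../model.safetensors.index.json"))`, using your own local path.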

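In addition, the older PyTorch storage formats (.bin, .pth) and TensorFlow storage formats (.h5, .pb) are gradually being phased out in the Hugging Face ecosystem. Prefer the .safetensors files when downloading.

Tip: model repositories on Hugging Face are also managed with git, so it is best to learn git first and then git lfs.

Reference posts: "A Brief Look at the Structure and Purpose of Downloaded Model Files"

"Understanding the Large-Model Storage Formats of PyTorch and Hugging Face in One Article"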

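2. Download the model

Because the official Hugging Face site is not reachable from mainland China, this tutorial recommends downloading through a domestic mirror.

Install huggingface_hub:

pip install -U huggingface_hub

Set the environment variable so that downloads go through the mirror https://hf-mirror.com by default: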

export HF_ENDPOINT=https://hf-mirror.com

Download a model from Hugging Face, with Qwen/Qwen2.5-7B as the example:

huggingface-cli download --resume-download Qwen/Qwen2.5-7B --local-dir  Qwen/Qwen2.5-7B --exclude "*.bin"
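The same download can be sketched from Python with `snapshot_download` from huggingface_hub. The helper names `download_kwargs` and `download` are made up for this example, and the sketch assumes `HF_ENDPOINT` must be set before huggingface_hub is first imported, which is why the import sits inside the function.

```python
import os


def download_kwargs(repo_id):
    """Arguments mirroring the CLI call: same local dir, skip *.bin weights."""
    return {
        "repo_id": repo_id,
        "local_dir": repo_id,          # same layout as --local-dir Qwen/Qwen2.5-7B
        "ignore_patterns": ["*.bin"],  # same effect as --exclude "*.bin"
    }


def download(repo_id, endpoint="https://hf-mirror.com"):
    """Download a model snapshot through the mirror and return its local path."""
    # Has no effect if huggingface_hub was already imported elsewhere.
    os.environ.setdefault("HF_ENDPOINT", endpoint)
    from huggingface_hub import snapshot_download
    return snapshot_download(**download_kwargs(repo_id))


kwargs = download_kwargs("Qwen/Qwen2.5-7B")
print(kwargs)
# download("Qwen/Qwen2.5-7B")  # uncomment to start the actual download
```

Interrupted downloads are resumed by re-running the same call, since files that are already complete are skipped.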

Reference tutorial: "How to Download Hugging Face Models Quickly: A Summary of All Methods"

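Part 1: Overview and Classification of Local Deployment Tools

LLM (Large Language Model) deployment tools are software designed to deploy, manage, and run large language models in local or private environments. They aim to simplify deployment and improve runtime efficiency, while providing the optimization and customization needed to fit different application scenarios and hardware.

By target use case, they fall into the following categories: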

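1. All-in-one deployment and management tools

These tools resemble an operating system for LLMs: they aim to provide an LLM chatbot ecosystem that runs anywhere, letting users run large language models on a local CPU and on almost any GPU. The main pain point they address is cross-platform compatibility, that is, supporting enough mainstream hardware platforms.

Representative tools: Ollama, LM Studio, GPT4All, and others.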

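2. Dedicated inference optimization tools

The main pain point these tools address is running a sufficiently large model with as few resources as possible while degrading model quality as little as possible. They focus on inference efficiency, reducing latency and resource consumption through hardware acceleration and algorithmic optimizations such as quantization and pruning. Unlike general-purpose deep learning libraries, they implement only the forward pass.

Representative tools: TensorRT-LLM, SGLang, vLLM, Xinference, and others.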
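To give a feel for what quantization means here, the following is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. It is a textbook illustration chosen for brevity, not the scheme any of the tools above actually uses; production engines work per channel or per block and on GPU tensors.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] with one shared scale."""
    largest = max(abs(w) for w in weights)
    scale = largest / 127 if largest > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight now needs 8 bits instead of 16 or 32, at the cost of a rounding error of at most half a quantization step per weight.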

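3. General-purpose deep learning libraries

These libraries are mainly used to validate and test a model after training. When developers deploy a pretrained LLM with them, the output is generally treated as that model's reference quality, so it mainly serves as a baseline for comparison in your own downstream development.

Representative tools: PyTorch, Transformers, and others.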

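Part 2: All-in-One Deployment and Management Tools, with Ollama as the Example

Ollama is a tool for building LLM applications. It provides a simple command-line interface and a server, runs on the three mainstream operating systems (Windows, Linux, and macOS), and makes it easy to download, run, and manage a variety of open-source LLMs. It supports GGUF, currently the most widely used model file format for local inference, and comes with conversion tooling that can migrate trained PyTorch or Transformers model files to GGUF.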

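1. Download and install Ollama

Download Ollama's official script, which installs it on Linux in one step:

wget https://ollama.com/install.sh

Run the install script:

bash install.sh

Press Enter to accept the defaults throughout. After installation completes, the output looks like this: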

>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
################################################################################################################################## 100.0%################################################################################################################################## 100.0%
>>> Creating ollama user...
>>> Adding ollama user to render group...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> Enabling and starting ollama service...
Created symlink /etc/systemd/system/default.target.wants/ollama.service → /etc/systemd/system/ollama.service.
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.
WARNING: No NVIDIA/AMD GPU detected. Ollama will run in CPU-only mode.

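2. Start Ollama

Start the Ollama service. If the installer already enabled and started the systemd service, as in the log above, skip this step, because a second instance cannot bind to port 11434:

ollama serve &

Check whether the Ollama service is running:

systemctl status ollama.service

Output when the service is running normally: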

ollama.service - Ollama Service
     Loaded: loaded (/etc/systemd/system/ollama.service; enabled; preset: disabled)
     Active: active (running) since Sun 2024-09-22 12:23:31 EDT; 1min 43s ago
   Main PID: 2631176 (ollama)
      Tasks: 26 (limit: 618829)
     Memory: 2.5G (peak: 2.5G)
        CPU: 20.460s
     CGroup: /system.slice/ollama.service
             └─2631176 /usr/local/bin/ollama serve
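Besides systemctl, the service can be probed over HTTP at the address printed by the installer. The sketch below assumes the server answers a plain GET on its root URL with status 200 while it is running; the function name `ollama_is_up` is made up for this example.

```python
import urllib.request


def ollama_is_up(base_url="http://127.0.0.1:11434", timeout=2.0):
    """Return True if an HTTP GET on the server root succeeds with status 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error responses.
        return False


print("Ollama reachable:", ollama_is_up())
```

This is handy in scripts that should wait for the server before sending requests.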

3. Deploy your own model

3.1 Install the conversion tool llm/llama.cpp

Clone the official ollama/ollama repository:

git clone https://github.com/ollama/ollama.git
cd ollama

Fetch the llama.cpp submodule of that repository:

git submodule init
git submodule update llm/llama.cpp

Install the dependencies of llama.cpp:

pip3 install -r llm/llama.cpp/requirements.txt -i https://mirrors.cloud.tencent.com/pypi/simple

3.2 Convert the model format, with Qwen/Qwen2.5-7B as the example

Run the llm/llama.cpp/convert_hf_to_gguf.py script to convert the model to GGUF:

python3 llm/llama.cpp/convert_hf_to_gguf.py /mnt/models/Qwen/Qwen2.5-7B  --outfile Qwen2.5-7B.gguf --outtype f16

Output after a successful conversion to GGUF:

INFO:hf-to-gguf:Set model quantization version
INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:Qwen2.5-7B.gguf: n_tensors = 339, total_size = 15.2G
Writing: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15.2G/15.2G [00:44<00:00, 346Mbyte/s]
INFO:hf-to-gguf:Model successfully exported to Qwen2.5-7B.gguf
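The reported total_size of 15.2G can be sanity-checked with simple arithmetic: with --outtype f16 each weight takes 16 bits, so the file size is roughly the parameter count times 2 bytes. The sketch below assumes about 7.61 billion parameters for Qwen2.5-7B, a figure taken from the model card rather than from this log, and ignores the small metadata overhead.

```python
def estimate_size_gb(num_params, bits_per_weight):
    """Approximate weight file size in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: parameter count as stated on the Qwen2.5-7B model card.
qwen_params = 7.61e9

f16_size = estimate_size_gb(qwen_params, 16)
int8_size = estimate_size_gb(qwen_params, 8)
print(f"f16: {f16_size:.1f} GB, 8-bit: {int8_size:.1f} GB")
```

The same function gives a quick estimate of how much memory a lower-precision export would save before you run the conversion.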

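3.3 Create a model that Ollama recognizes

The modelfile is the top-level file for creating and sharing models with Ollama. Ollama builds your model from the modelfile and the corresponding .gguf file.

See the official Ollama documentation for the meaning of each parameter:

https://github.com/ollama/ollama/blob/main/docs/modelfile.md

Following that and the official Qwen documentation from Alibaba:

https://qwen.readthedocs.io/en/latest/run_locally/ollama.html

create a file named Qwen2.5-7B.modelfile with the following content: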

FROM /home/vslyu/Documents/ollama/Qwen2.5-7B.gguf

# set the temperature to 0.7 [higher is more creative, lower is more coherent]
PARAMETER temperature 0.7
PARAMETER top_p 0.8
PARAMETER repeat_penalty 1.05
PARAMETER top_k 20

TEMPLATE """{{ if .Messages }}
{{- if or .System .Tools }}<|im_start|>system
{{ .System }}
{{- if .Tools }}

# Tools

You are provided with function signatures within <tools></tools> XML tags:
<tools>{{- range .Tools }}
{"type": "function", "function": {{ .Function }}}{{- end }}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
{{- end }}<|im_end|>
{{ end }}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 -}}
{{- if eq .Role "user" }}<|im_start|>user
{{ .Content }}<|im_end|>
{{ else if eq .Role "assistant" }}<|im_start|>assistant
{{ if .Content }}{{ .Content }}
{{- else if .ToolCalls }}<tool_call>
{{ range .ToolCalls }}{"name": "{{ .Function.Name }}", "arguments": {{ .Function.Arguments }}}
{{ end }}</tool_call>
{{- end }}{{ if not $last }}<|im_end|>
{{ end }}
{{- else if eq .Role "tool" }}<|im_start|>user
<tool_response>
{{ .Content }}
</tool_response><|im_end|>
{{ end }}
{{- if and (ne .Role "assistant") $last }}<|im_start|>assistant
{{ end }}
{{- end }}
{{- else }}
{{- if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
{{ end }}{{ .Response }}{{ if .Response }}<|im_end|>{{ end }}"""

# set the system message
SYSTEM """You are Qwen, created by Alibaba Cloud. You are a helpful assistant."""

Then create the model from the modelfile:

ollama create Qwen2.5-7B -f Qwen2.5-7B.modelfile
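The PARAMETER lines in the modelfile control how the next token is sampled. As a rough sketch of what temperature, top_k, and top_p do to a probability distribution, consider the function below. The order in which the filters are applied is an assumption made for illustration, repeat_penalty is left out, and this is not Ollama's actual sampler code.

```python
import math


def filter_probs(logits, temperature=0.7, top_k=20, top_p=0.8):
    """Return {token_id: probability} after temperature, top-k and top-p."""
    # Temperature rescales logits: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top-k keeps only the k most likely tokens.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Top-p keeps the smallest prefix whose cumulative probability reaches p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]  # made-up scores for a 4-token vocabulary
candidates = filter_probs(toy_logits, temperature=1.0, top_k=3, top_p=0.8)
print(candidates)
```

The next token is then drawn at random from the surviving candidates, which is why lower temperature and smaller top_p make the output more deterministic.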

3.4 Run the model

ollama run Qwen2.5-7B

The model runs successfully. Sample conversation:

└─$ ollama run Qwen2.5-7B
>>> 请问你是谁
我是Qwen，由阿里巴巴云开发的AI助手。有什么我可以帮助你的吗？
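Besides the interactive prompt, the model can be called programmatically through the Ollama HTTP API on port 11434. The sketch below builds a non-streaming request for the /api/chat endpoint using only the standard library. The helper names are made up for this example, and the model name must match the one passed to ollama create.

```python
import json
import urllib.request


def build_chat_request(prompt, model="Qwen2.5-7B",
                       base_url="http://127.0.0.1:11434"):
    """Build a POST request for a single-turn, non-streaming chat call."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a stream
    }
    return urllib.request.Request(
        base_url + "/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())["message"]["content"]


request = build_chat_request("Who are you?")
print(request.full_url)
# print(chat("Who are you?"))  # requires a running Ollama server
```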

Reference tutorial: "Ollama: From Beginner to Advanced"

Part 3: Dedicated Inference Optimization Tools, with SGLang as the Example

Launch a Qwen2.5-7B model service with SGLang:

python -m sglang.launch_server --model-path /media/huiwei/models/Qwen/Qwen2.5-7B --tp 2 --enable-p2p-check --disable-cuda-graph

Interact with the model:

from sglang import function, system, user, assistant, gen, set_default_backend, RuntimeEndpoint

@function
def multi_turn_question(s, question_1, question_2):
    s += system("You are Qwen, created by Alibaba Cloud. You are a helpful assistant.")
    s += user(question_1)
    s += assistant(gen("answer_1", max_tokens=256))
    s += user(question_2)
    s += assistant(gen("answer_2", max_tokens=256))

set_default_backend(RuntimeEndpoint("http://localhost:30000"))

state = multi_turn_question.run(
    question_1="What is the capital of China?",
    question_2="List two local attractions.",
)

for m in state.messages():
    print(m["role"], ":", m["content"])

print(state["answer_1"])
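The SGLang server can also be called without the sglang client library, through its OpenAI-compatible HTTP endpoint. The sketch below is written under assumptions: it targets /v1/chat/completions on the same port 30000, and it passes the model path from the launch command as the model name, which may need adjusting for your server version. The helper names are made up for this example.

```python
import json
import urllib.request

# Assumption: the served model name defaults to the --model-path value.
MODEL_NAME = "/media/huiwei/models/Qwen/Qwen2.5-7B"


def chat_payload(question, max_tokens=256):
    """Request body in the OpenAI chat-completions format."""
    return {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": question}],
        "max_tokens": max_tokens,
    }


def extract_answer(response):
    """Pull the reply text out of a chat-completions response."""
    return response["choices"][0]["message"]["content"]


def ask(question, base_url="http://localhost:30000"):
    """POST the question to the server and return the reply text."""
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(chat_payload(question)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_answer(json.loads(resp.read()))


# ask("What is the capital of China?")  # requires the server started above
```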

Part 4: General-Purpose Deep Learning Libraries, with Transformers as the Example

Load the model with Transformers and generate a reply:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B", trust_remote_code=True)

# torch_dtype="auto" loads the weights in the precision stored in the checkpoint;
# device_map="auto" places them on the available devices automatically.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
).eval()


prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
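It helps to know what apply_chat_template produces before the text is tokenized. Qwen models use the ChatML layout with <|im_start|> and <|im_end|> markers, the same markers that appear in the Ollama TEMPLATE in Part 2. The function below is a simplified re-implementation for illustration only; the real template also handles tools and other cases.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render chat messages in the ChatML layout used by Qwen models."""
    text = ""
    for message in messages:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Leave an open assistant turn so the model continues from here.
        text += "<|im_start|>assistant\n"
    return text


demo_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello"},
]
rendered = render_chatml(demo_messages)
print(rendered)
```

Printing `text` in the Transformers example above and comparing it with this output is a quick way to confirm that the tokenizer's template is being applied as expected.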

Recommended resources

GitHub - WangRongsheng/awesome-LLM-resourses: 🧑‍🚀 Summary of the world's best LLM resources.