Deep Learning: LLM Decoding and Distributed Inference with MindSpore NLP, Explained

The LLM inference pipeline

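1. The user enters a prompt

Suppose the user enters: "从前，有一只小猫，它喜欢……" ("Once upon a time, there was a little cat, and it liked...")

Our goal is for the model to generate a complete story.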

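2. The model processes the user input

2.1 Tokenization: the prompt is split into tokens, which may be whole words or subwords, that the model can process.

For example:

"从前，有一只小猫，它喜欢……" might be tokenized as:

["从前", "，", "有", "一只", "小猫", "，", "它", "喜欢", "……"]

Each token is then mapped to its index (ID) in the model's vocabulary. These IDs are the input_ids returned by the tokenizer.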
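The token-to-ID mapping can be sketched in a few lines of plain Python. The vocabulary below is hypothetical and built on the fly from the example tokens; a real tokenizer uses a fixed vocabulary learned from data, so the actual IDs would differ.

```python
# Tokens from the example above.
tokens = ["从前", "，", "有", "一只", "小猫", "，", "它", "喜欢", "……"]

# Hypothetical vocabulary: assign an ID to each distinct token
# in order of first appearance.
vocab = {}
for tok in tokens:
    if tok not in vocab:
        vocab[tok] = len(vocab)

# Map the token sequence to IDs and add a batch dimension.
input_ids = [[vocab[tok] for tok in tokens]]

print(input_ids)
print((len(input_ids), len(input_ids[0])))  # (batch_size, seq_length)
```

The repeated comma token receives the same ID both times, which is why the sequence has 9 positions but only 8 distinct IDs.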

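2.2 Converting IDs to embeddings

Each token ID is converted into a high-dimensional vector (an embedding) that carries semantic information. The embedding layer performs this lookup.

The input_ids for this prompt have shape (1, 9): one sample in the batch, with a sequence length of 9.

The dimension of each embedding vector is hidden_size, the model's hidden dimension. hidden_size is a hyperparameter chosen by the model designer.

After the embedding layer, the input shape changes from (batch_size, seq_length) to (batch_size, seq_length, hidden_size). For example, (1, 9) becomes (1, 9, hidden_size).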
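The shape change can be checked with a small NumPy sketch. The embedding table here is random and the sizes (vocab_size = 8, hidden_size = 4) and token IDs are illustrative assumptions, not values from any real model.

```python
import numpy as np

# Illustrative sizes; real models use far larger values.
vocab_size, hidden_size = 8, 4

# A random stand-in for a trained embedding table: one row per token ID.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, hidden_size))

# Hypothetical token IDs for the 9-token prompt, shape (1, 9).
input_ids = np.array([[0, 1, 2, 3, 4, 1, 5, 6, 7]])

# The embedding lookup is plain row indexing into the table.
embeddings = embedding_table[input_ids]

print(input_ids.shape)   # (batch_size, seq_length)
print(embeddings.shape)  # (batch_size, seq_length, hidden_size)
```

Positions that hold the same token ID receive identical embedding vectors, which is one reason position information has to be added separately.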

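2.3 Adding positional encodings to the tensor

To preserve the order of the input sequence, the model adds a positional encoding to each token. These encodings are summed with the token embeddings to form the final input representation. (This describes additive positional encodings, as used in the original Transformer.)

The positional encoding has the same hidden_size as the input embedding, so the two can be added element-wise.

Shape of the input embedding:
(batch_size, seq_length, hidden_size), where hidden_size is the embedding dimension of each token.
Shape of the positional encoding:
(seq_length, hidden_size), which is broadcast across the batch dimension to match (batch_size, seq_length, hidden_size).

Positional encodings inject position information directly into the representation, so the model can tell tokens apart by where they occur in the sentence.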
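As a minimal sketch of the addition, the snippet below uses the sinusoidal encoding from the original Transformer. That choice is an assumption made for illustration; the text above does not say which positional encoding a given model uses, and the sizes are illustrative.

```python
import numpy as np

def sinusoidal_positions(seq_length, hidden_size):
    """Sinusoidal positional encoding; hidden_size must be even."""
    positions = np.arange(seq_length)[:, None]       # (seq_length, 1)
    dims = np.arange(0, hidden_size, 2)[None, :]     # (1, hidden_size // 2)
    angles = positions / np.power(10000.0, dims / hidden_size)
    encoding = np.zeros((seq_length, hidden_size))
    encoding[:, 0::2] = np.sin(angles)               # even dimensions
    encoding[:, 1::2] = np.cos(angles)               # odd dimensions
    return encoding

batch_size, seq_length, hidden_size = 1, 9, 8

# Stand-in token embeddings, shape (1, 9, 8).
token_embeddings = np.ones((batch_size, seq_length, hidden_size))

# Positional encoding, shape (9, 8).
pos = sinusoidal_positions(seq_length, hidden_size)

# Broadcasting adds the same encoding to every sample in the batch.
inputs = token_embeddings + pos

print(pos.shape, inputs.shape)
```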

3. Forward pass

The processed input tensor is fed through the model for the forward computation.

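4. Generating output

In autoregressive generation, the model produces one token at a time, so the shape of the output changes as generation proceeds.

Input shape: (1, 9).
Shape of the probability distributions output by the model: (1, 9, vocab_size), one distribution per input position. The distribution at the last position is the one used to predict the next token.
Shape of the generated next token: (1, 1).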

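4.1 Output probability distribution

The output of the last Transformer layer passes through a linear layer and a softmax function, producing a probability distribution over every possible token. For example, the model might predict that the next token is "玩耍" ("play") with probability 0.4, "睡觉" ("sleep") with probability 0.3, and so on.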
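The linear layer plus softmax step can be sketched with NumPy. The hidden states and projection weights below are random stand-ins and the sizes are illustrative assumptions; only the shapes and the normalization are the point.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

batch_size, seq_length, hidden_size, vocab_size = 1, 9, 4, 6
rng = np.random.default_rng(0)

# Stand-in for the output of the last Transformer layer.
hidden_states = rng.normal(size=(batch_size, seq_length, hidden_size))
# Stand-in for the weights of the final linear layer.
lm_head = rng.normal(size=(hidden_size, vocab_size))

logits = hidden_states @ lm_head      # (1, 9, vocab_size)
probs = softmax(logits)               # each position sums to 1

# Only the last position is used to choose the next token.
next_token_probs = probs[:, -1, :]    # (1, vocab_size)

print(probs.shape, next_token_probs.shape)
```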

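4.2 Decoding strategy

The model selects the next token based on the probability distribution. Common decoding strategies include:

Greedy search:
Select the token with the highest probability. For example, select "玩耍" ("play") as the next token.
Beam search:
Keep several candidate sequences and choose the one with the highest overall probability.
Sampling:
Randomly sample the next token according to the probability distribution.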

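Sampling draws from the probability distribution that the model outputs, so the two are directly linked.

The basis of random sampling:
Random sampling relies directly on the probability distribution output by the model, which determines how likely each token is to be drawn.
The role of the probability distribution:
The distribution reflects the model's "confidence" in, or "preference" for, each token. High-probability tokens are more likely to be sampled, but low-probability tokens can still be drawn, especially under settings that favor diversity.
The uncertainty of sampling:
Because sampling is random, repeated draws from the same distribution can give different results. This contrasts with greedy search, which always selects the highest-probability token.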
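The contrast between greedy search and sampling can be seen with a toy distribution. The probabilities for "play" and "sleep" follow the example above; the remaining two tokens and their probabilities are made up so that the distribution sums to 1.

```python
import random

# "play" and "sleep" follow the example; "eat" and "run" are hypothetical.
probs = {"play": 0.4, "sleep": 0.3, "eat": 0.2, "run": 0.1}

def greedy(probs):
    """Always return the highest-probability token."""
    return max(probs, key=probs.get)

def sample(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # seeded only to make the run repeatable

greedy_picks = {greedy(probs) for _ in range(5)}
sampled = [sample(probs, rng) for _ in range(1000)]

print(greedy_picks)  # a single token, however often it is called
print({tok: sampled.count(tok) for tok in probs})  # counts track the probabilities
```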

Top-K and Top-P sampling can be combined with temperature.

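5. Iterative generation

5.1 Recursive generation

The model appends each generated token to the input and then generates the next token. For example:

Prompt: "从前，有一只小猫，它喜欢……" ("Once upon a time, there was a little cat, and it liked...")
Model generates: "玩耍" ("playing")
New input: "从前，有一只小猫，它喜欢玩耍"
Model continues with: "，每天……" (", every day...")

Generation continues until the maximum generation length is reached or a special end-of-sequence token (such as <EOS>) is produced.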
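The loop and its two stopping conditions can be sketched as follows. The "model" here is a hypothetical lookup table that maps the last token to the next one; it stands in for a real forward pass plus decoding step.

```python
# Hypothetical stand-in for a language model: last token -> next token.
NEXT_TOKEN = {
    "likes": "playing",
    "playing": ",",
    ",": "every",
    "every": "day",
    "day": "<EOS>",
}

def toy_model(tokens):
    """Predict the next token from the sequence so far."""
    return NEXT_TOKEN.get(tokens[-1], "<EOS>")

def generate(prompt_tokens, max_new_tokens, eos_token="<EOS>"):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):      # stop 1: length limit reached
        next_token = toy_model(tokens)
        if next_token == eos_token:      # stop 2: end-of-sequence token
            break
        tokens.append(next_token)        # feed the output back as input
    return tokens

full = generate(["the", "cat", "likes"], max_new_tokens=10)
truncated = generate(["the", "cat", "likes"], max_new_tokens=2)

print(full)       # stops at <EOS>
print(truncated)  # stops at the length limit
```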

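6. Final output

In the end, the complete story generated by the model might be:
"从前，有一只小猫，它喜欢玩耍，每天都会在花园里追逐蝴蝶。有一天，它遇到了一只小鸟……" ("Once upon a time, there was a little cat that liked to play and chased butterflies in the garden every day. One day, it met a little bird...")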

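LLMs usually do not rely on greedy decoding alone (always choosing the highest-probability token as the output). With greedy decoding, an LLM returns the same reply every time for the same input sequence, because the parameters are fixed at inference time and nothing in the process is random. Other decoding strategies therefore introduce controlled randomness, as described next.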

Different LLM decoding strategies

Suppose the model is predicting the next token after "The cat", with the following predicted probabilities:

• sat (0.5)

• jumped (0.3)

• is (0.1)

• slept (0.05)

• runs (0.05)

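1. Top-k sampling

Top-k sampling introduces randomness into decoding by restricting the candidates to the k tokens with the highest probability. The next token is then sampled at random from these k tokens, according to their renormalized probabilities.

In the example, with k = 2, Top-k sampling keeps the two most probable tokens, sat (0.5) and jumped (0.3), and then samples the next token from these two.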
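A minimal sketch of Top-k filtering on the example distribution, assuming k = 2. The helper name top_k_filter is chosen here for illustration and is not a library function.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize them to sum to 1."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}

probs = {"sat": 0.5, "jumped": 0.3, "is": 0.1, "slept": 0.05, "runs": 0.05}

filtered = top_k_filter(probs, k=2)
print(filtered)  # sat and jumped, at roughly 0.625 and 0.375

# Sample the next token from the filtered distribution.
rng = random.Random(0)
next_token = rng.choices(list(filtered), weights=list(filtered.values()), k=1)[0]
print(next_token)
```

After filtering, "is", "slept", and "runs" can no longer be drawn, and the remaining probability mass of 0.8 is rescaled to 1.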

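2. Top-p sampling

Top-p sampling first sets a threshold P, then selects tokens in descending order of probability until their cumulative probability reaches P. The next token is sampled at random from the n tokens selected.

In the example, with P = 0.9, Top-p sampling keeps sat (0.5), jumped (0.3), and is (0.1), and then samples the next token from these three.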
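A minimal sketch of Top-p filtering on the example distribution, assuming P = 0.9. The helper name top_p_filter and the small floating-point tolerance are choices made here for illustration.

```python
import random

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        # The tolerance guards against rounding error in the running sum.
        if cumulative >= p - 1e-12:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

probs = {"sat": 0.5, "jumped": 0.3, "is": 0.1, "slept": 0.05, "runs": 0.05}

filtered = top_p_filter(probs, p=0.9)
print(filtered)  # sat, jumped, and is, renormalized

# Sample the next token from the filtered distribution.
rng = random.Random(0)
next_token = rng.choices(list(filtered), weights=list(filtered.values()), k=1)[0]
print(next_token)
```

Unlike Top-k, the number of tokens kept is not fixed: a peaked distribution keeps few tokens, a flat one keeps many.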

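3. Temperature sampling

Temperature is a hyperparameter that controls the shape of the probability distribution from which tokens are chosen. The logits are scaled by a factor of 1/T before the softmax, where T is the temperature.

When T = 1, the probability distribution is unchanged.
When T > 1, the output becomes more random, and low-probability tokens are more likely to appear.
When T < 1, the output becomes more deterministic, and high-probability tokens are more likely to be chosen.

At high temperatures the model is "more creative"; at low temperatures it is "more precise".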
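The three cases can be checked on the example distribution. Since only probabilities are given, this sketch uses their logarithms as logits, which is an assumption made for illustration (softmax of log-probabilities recovers the original probabilities).

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a distribution with temperature T > 0, using log p as logits."""
    scaled = {tok: math.log(p) / temperature for tok, p in probs.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exp = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exp.values())
    return {tok: e / total for tok, e in exp.items()}

probs = {"sat": 0.5, "jumped": 0.3, "is": 0.1, "slept": 0.05, "runs": 0.05}

sharp = apply_temperature(probs, 0.5)  # T < 1: more concentrated
same = apply_temperature(probs, 1.0)   # T = 1: unchanged
flat = apply_temperature(probs, 2.0)   # T > 1: more spread out

for name, dist in (("T=0.5", sharp), ("T=1.0", same), ("T=2.0", flat)):
    print(name, {tok: round(p, 3) for tok, p in dist.items()})
```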

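4. Beam search

Beam search is a more refined version of greedy search: it keeps the k best sequences and extends them in parallel.

At each step, every one of the k sequences is extended with its most likely next tokens.
The beam width parameter (k) determines how many candidate sequences are kept at each step.
After each step, the candidates are ranked by cumulative probability, and the k most probable sequences are kept for further expansion.

In the example, suppose the beam width is 2. The two most probable tokens are then kept for further generation:

“The cat sat” (cumulative probability: 0.5)

“The cat jumped” (cumulative probability: 0.3)

The model goes on to extend both sequences, for example:

“The cat sat on the mat”

“The cat jumped over the fence”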
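The procedure can be sketched with a toy model. Only the first-step probabilities come from the example above; the second-step distributions are made up for illustration, and scores are accumulated as log-probabilities to avoid multiplying many small numbers.

```python
import math

# Hypothetical next-token distributions keyed by the sequence so far.
# Only the ("The", "cat") entry comes from the example; the rest is invented.
TOY_MODEL = {
    ("The", "cat"): {"sat": 0.5, "jumped": 0.3, "is": 0.1, "slept": 0.05, "runs": 0.05},
    ("The", "cat", "sat"): {"on": 0.5, "down": 0.3, "still": 0.2},
    ("The", "cat", "jumped"): {"over": 0.9, "up": 0.1},
}

def beam_search(prefix, beam_width, steps):
    beams = [(0.0, tuple(prefix))]  # (cumulative log-probability, sequence)
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            next_probs = TOY_MODEL.get(seq)
            if next_probs is None:
                candidates.append((score, seq))  # nothing to extend; carry over
                continue
            for tok, p in next_probs.items():
                candidates.append((score + math.log(p), seq + (tok,)))
        # Rank all candidates and keep the beam_width best.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return [(math.exp(score), list(seq)) for score, seq in beams]

step_one = beam_search(["The", "cat"], beam_width=2, steps=1)
step_two = beam_search(["The", "cat"], beam_width=2, steps=2)

for prob, seq in step_two:
    print(round(prob, 3), " ".join(seq))
```

With these made-up numbers, "The cat jumped over" (0.3 × 0.9 = 0.27) overtakes "The cat sat on" (0.5 × 0.5 = 0.25) at the second step. Greedy search would have committed to "sat" and missed the more probable sequence.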

Later developments of beam search include Diverse Beam Search.

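When to use each decoding strategy

Greedy decoding:
A simple, computationally cheap choice when text must be generated quickly and output quality is not the top priority. It selects the token with the largest logit as the next output, which suits scenarios that need fast responses, such as a chatbot's first-pass reply.
Beam search:
Suited to scenarios that need tight control over output quality, such as machine translation or question answering. By considering several candidate sequences, beam search can improve the accuracy and fluency of a translation.
Sampling decoding:
Suited to scenarios that need diverse output, such as creative writing or answering open-ended questions. Sampling selects tokens from the vocabulary according to the probability distribution, and parameters such as temperature control the randomness.
Top-K and Top-P:
Suited to scenarios that need sampled output of higher quality. By restricting the set of candidate tokens, Top-K and Top-P exclude unlikely tokens, which improves coherence and reduces repetition.
Temperature sampling:
Suited to scenarios where the amount of randomness in generation needs tuning, such as creative writing or exploratory tasks. Lower temperatures bring sampling closer to deterministic decoding, while higher temperatures increase randomness.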
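Since Top-K, Top-P, and temperature can be combined, the sketch below chains them in one sampler. The order used here (temperature, then Top-k, then Top-p) is one common arrangement and is an assumption; libraries may order or implement these steps differently. The function name sample_next_token and the use of log-probabilities as logits are also choices made for illustration.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=1.0, rng=random):
    """Sample one token after temperature scaling, Top-k, and Top-p filtering."""
    # 1. Temperature: scale the logits, then apply a stable softmax.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(((tok, w / total) for tok, w in weights.items()),
                    key=lambda item: item[1], reverse=True)

    # 2. Top-k: keep the k most probable tokens and renormalize.
    if top_k is not None:
        ranked = ranked[:top_k]
        kept_mass = sum(p for _, p in ranked)
        ranked = [(tok, p / kept_mass) for tok, p in ranked]

    # 3. Top-p: keep tokens until the cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p - 1e-12:  # tolerance for rounding error
            break

    # 4. Sample from what is left (weights need not sum to 1).
    tokens = [tok for tok, _ in kept]
    probs = [prob for _, prob in kept]
    return rng.choices(tokens, weights=probs, k=1)[0]

# Example distribution for the token after "The cat", converted to logits.
example = {"sat": 0.5, "jumped": 0.3, "is": 0.1, "slept": 0.05, "runs": 0.05}
logits = {tok: math.log(p) for tok, p in example.items()}

rng = random.Random(0)
picks = {sample_next_token(logits, temperature=0.7, top_k=3, top_p=0.9, rng=rng)
         for _ in range(200)}
print(picks)
```

With top_k=1 the sampler reduces to greedy decoding, since only the most probable token survives the filter.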

Decoding and inference with MindSpore

Create a Notebook

mindspore==2.3.0, cann==8.0

Update mindspore

pip install --upgrade mindspore

Clone mindnlp

git clone https://github.com/mindspore-lab/mindnlp.git

Update mindnlp

cd mindnlp
bash scripts/build_and_reinstall.sh

Uninstall mindformers

pip uninstall mindformers

Load the model and convert the input

import mindspore
from mindnlp.transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "LLM-Research/Meta-Llama-3-8B-Instruct"
# Download the Llama 3 tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, mirror="modelscope")

# Download the Llama 3 model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    ms_dtype=mindspore.float16,
    mirror="modelscope"
)

# Input messages
messages = [
    {"role": "system", "content": "You are a psychological counsellor, who is good at emotional comfort"},
    {"role": "user", "content": "I don't sleep well for a long time."}
]
# Convert the input messages to input_ids
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="ms"
)
# Declare the tokens that terminate generation
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]
# Generate the model output
outputs = model.generate(
    input_ids, # input tokens
    max_new_tokens=50, # limit the output length
    eos_token_id=terminators, # terminator tokens
    do_sample=True, # whether to sample from the output probability distribution
    top_p=1.0 # top-p value
)

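Greedy strategy

# Greedy strategy
# Generate the model output
outputs = model.generate(
    input_ids, # input tokens
    max_new_tokens=1000, # limit the output length
    eos_token_id=terminators, # terminator tokens
    do_sample=False, # whether to sample from the output probability distribution
)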

response = outputs[0][input_ids.shape[-1]:]
tokenizer.decode(response, skip_special_tokens=True)

Model output:

"I'm so sorry to hear that you're struggling with sleep. It can be really frustrating and affect many aspects of your daily life. Can you tell me a bit more about what's been going on? What's been keeping you awake at night? Is it stress, anxiety, or something else?\n\nAlso, have you noticed any patterns or triggers that might be contributing to your insomnia? For example, do you find yourself lying awake for hours, or do you wake up multiple times during the night?\n\nRemember, I'm here to listen and support you, and I want you to feel comfortable sharing as much or as little as you'd like."

Repeating the run several times produces the same output.

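The temperature parameter

temperature controls the randomness and diversity of the generated text by reshaping the probability distribution computed from the output logits.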

import mindspore
from mindspore import Tensor
import numpy as np
import mindspore.ops as ops

logits = Tensor(np.array([[0.5, 1.2, -1.0, 0.1]]), mindspore.float32)

probs = ops.softmax(logits, axis=-1)
# low temp = 0.5
# the distribution becomes more concentrated (sharper)
probs_low = ops.softmax(logits / 0.5, axis=-1)
# high temp = 2
# the distribution becomes more spread out (flatter)
probs_high = ops.softmax(logits / 2, axis=-1)

probs, probs_low, probs_high

(Tensor(shape=[1, 4], dtype=Float32, value=
 [[ 2.55937576e-01,  5.15394986e-01,  5.71073927e-02,  1.71560094e-01]]),
 Tensor(shape=[1, 4], dtype=Float32, value=
 [[ 1.80040166e-01,  7.30098903e-01,  8.96367151e-03,  8.08972642e-02]]),
 Tensor(shape=[1, 4], dtype=Float32, value=
 [[ 2.69529819e-01,  3.82481009e-01,  1.27316862e-01,  2.20672339e-01]]))

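As the output shows, the higher the temperature, the flatter the distribution; the lower the temperature, the more concentrated it is.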

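temperature=1

# Generate the model output
outputs = model.generate(
    input_ids, # input tokens
    max_new_tokens=1000, # limit the output length
    eos_token_id=terminators, # terminator tokens
    do_sample=True, # whether to sample from the output probability distribution
    temperature=1
)
# Output at the standard temperature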
response = outputs[0][input_ids.shape[-1]:]
tokenizer.decode(response, skip_special_tokens=True)

Output 1:

"I'm so sorry to hear that you're struggling with sleep. It can be really tough to deal with insomnia or disrupted sleep patterns. Can you tell me a bit more about what's been going on? What's been on your mind lately that might be keeping you awake? Has anything changed in your life that could be contributing to this difficulty?"

Output 2:

"I'm so sorry to hear that you're struggling with sleep. It can be such a frustrating and debilitating experience. Can you tell me a bit more about what's been going on for you? What's been making it hard for you to fall asleep or stay asleep? Is it racing thoughts, stress, anxiety, or something else?\n\nAlso, how long have you been experiencing this sleep difficulty? Has it been a recent development or has it been going on for a while?"

temperature=0.1

Output 1:

"I'm so sorry to hear that you're struggling with sleep. It can be really frustrating and affect many aspects of your daily life. Can you tell me a bit more about what's been going on? What are some of the things that make it hard for you to fall asleep or stay asleep? Is it stress, anxiety, physical discomfort, or something else?\n\nAlso, have you noticed any patterns or triggers that seem to make it worse? For example, do you tend to have trouble sleeping on certain nights of the week, or after certain events or activities?\n\nRemember, I'm here to listen and support you, and I want you to feel comfortable sharing as much or as little as you'd like."

Output 2:

"I'm so sorry to hear that you're struggling with sleep. It can be really frustrating and affect many aspects of your daily life. Can you tell me a bit more about what's been going on? What's been on your mind lately, and how have you been feeling when you wake up in the morning?"

Output 3:

"I'm so sorry to hear that you're struggling with sleep. It can be really frustrating and affect many aspects of your daily life. Can you tell me a bit more about what's been going on? What's been on your mind lately that might be making it hard for you to fall asleep or stay asleep?"

temperature=2

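# Generate the model output
outputs = model.generate(
    input_ids, # input tokens
    max_new_tokens=1000, # limit the output length
    eos_token_id=terminators, # terminator tokens
    do_sample=True, # whether to sample from the output probability distribution
    temperature=2.0
)
# High-temperature output: the probability distribution is more spread out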
response = outputs[0][input_ids.shape[-1]:]
tokenizer.decode(response, skip_special_tokens=True)

Output 1:

"I'm so sorry to hear that. Not getting proper sleep can be really wearing on your emotional and physical well-being. Can you tell me a little bit more about how this lack of sleep is affecting you? Are you feeling constantly exhausted, irritable, or struggling to concentrate? Have you noticed any changes in your relationships or daily routine because of it?\n\nMost importantly, I'm here for you, and I believe that by exploring this together, we can find ways to improve your sleep and improve your overall well-being.\n\nIt might be helpful for me to share that sometimes, lack of sleep can be a sign of underlying anxiety, stress, or even unprocessed emotions. If we can identify the root cause, I may have some suggestions on how to ease your path to better sleep.\n\nWould you like me to offer you some coping strategies to help you relax and unwind before bedtime? Sometimes, a simple change in routine or relaxation techniques can make a world of difference."

Output 2:

"It can be really frustrating and worrying when sleep evade you, making it hard to wake up feeling refreshed and energized. I'm listening, and I want you to know that I'm here to support you. It's important to recognize that this is a tough and normal experience, even if it can be tough to bear right now.\n\nWould you like to talk more about what's going on when you have trouble sleeping? Is there anything in particular that bothers you or stress you out?"

Output 3:

"It can be really distressing to deal with chronic sleep issues, not getting the rest you need and feeling tired and exhausted all the time. Can you tell me a little bit more about how you've been feeling? Have you noticed any patterns or triggers that might be contributing to the issue? And how has it been affecting other aspects of your life?\n\nAlso, I want you to know that as your listener, my main goal right now is just to support and provide comfort. Whatever you share, I'm here for you. No judgments, no critiques, just a gentle and compassionate space for you to express yourself.\n\nRemember, it takes a lot of courage to share vulnerable thoughts and feelings with someone like me, and I want to assure you that your feelings are completely normal and valid. Okay?"