MindSpore 25-Day Learning Camp, Day 12 | Text Decoding Principles: A MindNLP Example

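Text Decoding Principles: A MindNLP Example

Review: autoregressive language models

An autoregressive language model predicts the next word from the preceding text.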

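The probability of a text sequence factorizes into the product of the conditional probabilities of each word given its preceding context:

$P(w_{1:T} \mid W_0) = \prod_{t=1}^{T} P(w_t \mid w_{1:t-1}, W_0)$, with $w_{1:0} = \emptyset$

$W_0$: the initial context word sequence
$T$: the number of time steps, i.e. the length of the generated sequence
Generation stops when the EOS token is produced.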
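As a minimal sketch of this factorization, the snippet below multiplies per-step conditional probabilities to obtain a sequence probability. The function name `sequence_log_prob` and the two step probabilities are illustrative assumptions, not part of MindNLP. Summing logarithms instead of multiplying raw probabilities is a common way to avoid numerical underflow on long sequences.

```python
import math


def sequence_log_prob(step_probs):
    """Return log P(w_{1:T} | W_0) as the sum of per-step log probabilities."""
    return sum(math.log(p) for p in step_probs)


# Illustrative conditional probabilities P(w_t | w_{1:t-1}, W_0) for T = 2 steps.
step_probs = [0.5, 0.4]

log_p = sequence_log_prob(step_probs)
seq_prob = math.exp(log_p)  # back to a plain probability
print(round(seq_prob, 6))
```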

Text generation methods provided by MindNLP / Hugging Face Transformers

Greedy search

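At each time step $t$, greedy search simply selects the word with the highest probability as the current output:

$w_t = \operatorname{argmax}_{w} P(w \mid w_{1:t-1})$

With greedy search, the output sequence ("The","nice","woman") has probability 0.5 x 0.4 = 0.2.

Drawback: greedy search misses high-probability words hidden behind a lower-probability word. For example, "has" (0.9) follows "dog" (0.4), so ("The","dog","has") has probability 0.4 x 0.9 = 0.36, which exceeds 0.2.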
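The greedy choice can be reproduced on a toy next-word table. This is only a sketch: the table `NEXT_WORD_PROBS` and the function `greedy_decode` are hypothetical, and only the values nice=0.5, dog=0.4, woman=0.4 and has=0.9 come from the text. The remaining entries are made up so that each row sums to 1.

```python
# Toy next-word table keyed by the full prefix.
# Only nice=0.5, dog=0.4, woman=0.4, has=0.9 come from the text;
# all other values are illustrative fill-ins.
NEXT_WORD_PROBS = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "runs": 0.05, "and": 0.05},
    ("The", "car"): {"is": 0.3, "drives": 0.5, "turns": 0.2},
}


def greedy_decode(prefix, steps):
    """Pick the single most probable next word at every step."""
    sequence, prob = tuple(prefix), 1.0
    for _ in range(steps):
        dist = NEXT_WORD_PROBS[sequence]
        word = max(dist, key=dist.get)  # local argmax
        prob *= dist[word]
        sequence += (word,)
    return sequence, prob


greedy_seq, greedy_prob = greedy_decode(("The",), steps=2)
print(greedy_seq, round(greedy_prob, 2))
```

Greedy search commits to "nice" at the first step and therefore never sees the 0.9 probability of "has" behind "dog".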

Environment setup

%%capture captured_output
# The lab environment comes with mindspore==2.2.14 preinstalled; to use a different MindSpore version, change the version number below
!pip uninstall mindspore -y
!pip install -i https://pypi.mirrors.ustc.edu.cn/simple mindspore==2.2.14

!pip uninstall mindvision -y
!pip uninstall mindinsight -y

# This example was adapted for mindnlp 0.3.1; if it fails to run, pin the version with `!pip install mindnlp==0.3.1`
!pip install mindnlp

#greedy_search

from mindnlp.transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("iiBcai/gpt2", mirror='modelscope')

# add the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("iiBcai/gpt2", pad_token_id=tokenizer.eos_token_id, mirror='modelscope')

# encode context the generation is conditioned on
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='ms')

# generate text until the output length (which includes the context length) reaches 50
greedy_output = model.generate(input_ids, max_length=50)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

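Principle: at each step, choose the word with the highest probability as the output. The method is simple and direct, but because it only optimizes locally it can miss a sequence with higher overall probability.

Example: when generating "The nice woman", each step takes the locally best word, yet the result (0.2) is less probable overall than ("The","dog","has") (0.36).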

Beam search

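Beam search reduces the risk of missing a hidden high-probability sequence by keeping the num_beams most probable partial sequences (hypotheses) at each time step and finally choosing the hypothesis with the highest overall probability. For example, with num_beams=2:

("The","dog","has") : 0.4 * 0.9 = 0.36

("The","nice","woman") : 0.5 * 0.4 = 0.20

Advantage: preserves the best path to some extent

Drawbacks: 1. does not solve the repetition problem; 2. performs poorly on open-ended generation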
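The two-beam example can be checked with a small sketch. The table and the `beam_search` function below are hypothetical helpers, not the MindNLP implementation. Only nice=0.5, dog=0.4, woman=0.4 and has=0.9 come from the text, and the other probabilities are made-up fill-ins.

```python
# Toy next-word table keyed by the full prefix (illustrative values except
# nice=0.5, dog=0.4, woman=0.4, has=0.9, which come from the text).
NEXT_WORD_PROBS = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "runs": 0.05, "and": 0.05},
    ("The", "car"): {"is": 0.3, "drives": 0.5, "turns": 0.2},
}


def beam_search(prefix, steps, num_beams):
    """Keep the num_beams most probable partial sequences at every step."""
    beams = [(tuple(prefix), 1.0)]
    for _ in range(steps):
        candidates = []
        for seq, prob in beams:
            # Expand every surviving hypothesis by every possible next word.
            for word, p in NEXT_WORD_PROBS[seq].items():
                candidates.append((seq + (word,), prob * p))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]  # prune to the beam width
    return beams


beams = beam_search(("The",), steps=2, num_beams=2)
best_seq, best_prob = beams[0]
print(best_seq, round(best_prob, 2))
```

With num_beams=2 the hypothesis starting with "dog" survives the first step, so the more probable sequence ("The","dog","has") is found. With num_beams=1 the same function reduces to greedy search.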

from mindnlp.transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("iiBcai/gpt2", mirror='modelscope')

# add the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("iiBcai/gpt2", pad_token_id=tokenizer.eos_token_id, mirror='modelscope')

# encode context the generation is conditioned on
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='ms')

# activate beam search and early_stopping
beam_output = model.generate(
    input_ids, 
    max_length=50, 
    num_beams=5, 
    early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))
print(100 * '-')

# set no_repeat_ngram_size to 2
beam_output = model.generate(
    input_ids, 
    max_length=50, 
    num_beams=5, 
    no_repeat_ngram_size=2, 
    early_stopping=True
)

print("Beam search with ngram, Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))
print(100 * '-')

# set num_return_sequences > 1
beam_outputs = model.generate(
    input_ids, 
    max_length=50, 
    num_beams=5, 
    no_repeat_ngram_size=2, 
    num_return_sequences=5, 
    early_stopping=True
)

# now we have 5 output sequences (num_return_sequences must not exceed num_beams)
print("num_return_sequences, Output:\n" + 100 * '-')
for i, beam_output in enumerate(beam_outputs):
    print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))
print(100 * '-')

Beam search issues

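Drawbacks: 1. does not solve the repetition problem; 2. performs poorly on open-ended generation

Repeat problem

n-gram penalty:

Set to 0 the probability of any candidate word that would complete an n-gram already present in the generated text

With no_repeat_ngram_size=2, no 2-gram appears twice

Note: real text sometimes needs repetition (a name, for instance, may legitimately occur several times), so use this penalty with care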
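The n-gram penalty can be sketched in plain Python as follows. The helper names `banned_next_tokens` and `apply_ngram_block`, the example history, and the probabilities are all assumptions made for illustration. The library applies the same idea to token IDs and logits internally.

```python
def banned_next_tokens(tokens, ngram_size):
    """Return the words that would complete an n-gram already seen in tokens."""
    if len(tokens) + 1 < ngram_size:
        return set()
    # The last (ngram_size - 1) words form the prefix of the next n-gram.
    prefix = tuple(tokens[len(tokens) - (ngram_size - 1):])
    banned = set()
    for i in range(len(tokens) - ngram_size + 1):
        ngram = tuple(tokens[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


def apply_ngram_block(probs, tokens, ngram_size):
    """Zero out banned words and renormalize the remaining probabilities."""
    banned = banned_next_tokens(tokens, ngram_size)
    kept = {w: (0.0 if w in banned else p) for w, p in probs.items()}
    total = sum(kept.values())
    if total == 0:  # every candidate was banned; nothing to renormalize
        return kept
    return {w: p / total for w, p in kept.items()}


history = ["I", "enjoy", "walking", "and", "I"]
next_probs = {"enjoy": 0.6, "like": 0.3, "am": 0.1}  # illustrative values

banned = banned_next_tokens(history, ngram_size=2)
blocked = apply_ngram_block(next_probs, history, ngram_size=2)
print(banned, blocked)
```

Here the 2-gram ("I", "enjoy") already occurs, so "enjoy" is blocked after the second "I" and the remaining probability mass is redistributed.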

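Improvement: to address the limitations of greedy search, beam search keeps several high-probability sequences (the beam width is num_beams) at each step, which raises the chance of finding a sequence with higher overall probability.

Characteristics: although it improves generation quality, beam search still produces repeated fragments and is limited on open-ended generation.

Mitigation: set no_repeat_ngram_size to avoid repetition, and use num_return_sequences to return several candidate sequences to choose from.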

Sample

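Randomly choose the output word $w_t$ according to the current conditional probability distribution: $w_t \sim P(w \mid w_{1:t-1})$

("car") ~ P(w | "The"), ("drives") ~ P(w | "The","car")

Advantage: high diversity in the generated text

Drawback: the generated text can be incoherent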
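To make the sampling step concrete, here is a small sketch that draws the next word from a toy distribution using only the Python standard library. The distribution values and the helper `sample_word` are illustrative assumptions.

```python
import random
from collections import Counter

# Illustrative next-word distribution P(w | "The").
NEXT_WORD_DIST = {"nice": 0.5, "dog": 0.4, "car": 0.1}


def sample_word(dist, rng):
    """Draw one word with probability proportional to its weight."""
    words = list(dist)
    weights = [dist[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the run is repeatable
counts = Counter(sample_word(NEXT_WORD_DIST, rng) for _ in range(10_000))
print(counts)
```

Unlike greedy search, the low-probability word "car" is still chosen from time to time, which is the source of both the diversity and the occasional incoherence.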

import mindspore
from mindnlp.transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("iiBcai/gpt2", mirror='modelscope')

# add the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("iiBcai/gpt2", pad_token_id=tokenizer.eos_token_id, mirror='modelscope')

# encode context the generation is conditioned on
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='ms')

mindspore.set_seed(0)
# activate sampling and deactivate top_k by setting top_k sampling to 0
sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_k=0
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

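Random sampling: the next word is drawn at random from the conditional probability distribution given the words so far. This increases diversity but can make the text incoherent.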

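Temperature

Lowering the softmax temperature makes the distribution $P(w \mid w_{1:t-1})$ sharper,

increasing the likelihood of high-probability words and decreasing the likelihood of low-probability words.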
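The effect of temperature can be seen in a short sketch that divides the logits by the temperature before the softmax. The logits and the function name `softmax_with_temperature` are illustrative assumptions.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by temperature (temperature > 0)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # illustrative scores for three candidate words

p_default = softmax_with_temperature(logits, 1.0)
p_sharp = softmax_with_temperature(logits, 0.7)  # lower temperature
p_flat = softmax_with_temperature(logits, 1.5)   # higher temperature

for name, p in [("T=1.0", p_default), ("T=0.7", p_sharp), ("T=1.5", p_flat)]:
    print(name, [round(x, 3) for x in p])
```

A temperature below 1 moves probability mass toward the top word, while a temperature above 1 flattens the distribution.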

import mindspore
from mindnlp.transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("iiBcai/gpt2", mirror='modelscope')

# add the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("iiBcai/gpt2", pad_token_id=tokenizer.eos_token_id, mirror='modelscope')

# encode context the generation is conditioned on
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='ms')

mindspore.set_seed(1234)
# activate sampling and deactivate top_k by setting top_k sampling to 0
sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_k=0,
    temperature=0.7
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

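Temperature sampling: the temperature parameter controls how flat the probability distribution is. A lower temperature favors more deterministic, high-probability words, while a higher temperature adds randomness and diversity.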

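Top-K sample

Select the K most probable words, renormalize their probabilities, and sample from those K words

Top-K sample problems

Restricting the sampling pool to a fixed size K:

can produce gibberish when the distribution is sharp, because unlikely words still enter the pool
can limit the model's creativity when the distribution is flat, because plausible words are cut off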
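A minimal sketch of Top-K filtering, and of the two problems listed above, might look like this. The function `top_k_filter` and both distributions are made up for illustration.

```python
def top_k_filter(dist, k):
    """Keep the k most probable words and renormalize their probabilities."""
    top = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {w: p / total for w, p in top}


# Illustrative distributions: one sharp, one flat.
sharp = {"a": 0.90, "b": 0.08, "c": 0.01, "d": 0.01}
flat = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}

sharp_k = top_k_filter(sharp, k=3)
flat_k = top_k_filter(flat, k=3)
print(sharp_k)
print(flat_k)
```

With the same K=3, the sharp distribution still admits the very unlikely word "c", while the flat distribution drops "d" even though it is as plausible as the words that were kept.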

import mindspore
from mindnlp.transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("iiBcai/gpt2", mirror='modelscope')

# add the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("iiBcai/gpt2", pad_token_id=tokenizer.eos_token_id, mirror='modelscope')

# encode context the generation is conditioned on
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='ms')

mindspore.set_seed(0)
# activate sampling and set top_k to 50
sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_k=50
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Top-P sample

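Sample from the smallest set of words whose cumulative probability exceeds p, then renormalize

The sampling pool grows and shrinks dynamically with the next-word probability distribution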
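The dynamic pool size can be illustrated with a small sketch. The function `top_p_filter` and the two distributions are assumptions for illustration. The sketch stops as soon as the cumulative probability reaches p, and a library implementation may treat the boundary case slightly differently.

```python
def top_p_filter(dist, p):
    """Keep the smallest set of top words whose cumulative probability reaches p."""
    ranked = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for word, prob in ranked:
        kept.append((word, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {word: prob / total for word, prob in kept}


# Illustrative distributions: one sharp, one flat.
sharp = {"a": 0.90, "b": 0.08, "c": 0.01, "d": 0.01}
flat = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}

sharp_p = top_p_filter(sharp, p=0.92)
flat_p = top_p_filter(flat, p=0.92)
print(len(sharp_p), sharp_p)
print(len(flat_p), flat_p)
```

The sharp distribution yields a pool of two words and the flat one a pool of four, which is the adaptive behavior that a fixed K cannot provide.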

import mindspore
from mindnlp.transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("iiBcai/gpt2", mirror='modelscope')

# add the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("iiBcai/gpt2", pad_token_id=tokenizer.eos_token_id, mirror='modelscope')

# encode context the generation is conditioned on
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='ms')

mindspore.set_seed(0)

# deactivate top_k sampling and sample only from 92% most likely words
sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_p=0.92, 
    top_k=0
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

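Top-K sampling and Top-P sampling: both refine the sampling process by restricting the sampling space, to the K most probable words or to the smallest word set whose cumulative probability reaches P respectively, balancing controllability against creativity.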

top_k_top_p

import mindspore
from mindnlp.transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("iiBcai/gpt2", mirror='modelscope')

# add the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("iiBcai/gpt2", pad_token_id=tokenizer.eos_token_id, mirror='modelscope')

# encode context the generation is conditioned on
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='ms')

mindspore.set_seed(0)
# set top_k = 50 and set top_p = 0.95 and num_return_sequences = 3
sample_outputs = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_k=50,
    top_p=0.95,
    num_return_sequences=3
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

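An autoregressive language model is a statistical model for predicting the next word in a text sequence. It assumes that a word depends only on the words before it, not on any word after it. Such a model generates new text by learning the conditional probability of each word given its history. In short, given a word sequence $w_1, w_2, \ldots, w_{t-1}$, the model predicts the probability $P(w_t \mid w_1, w_2, \ldots, w_{t-1})$ of the next word $w_t$.

How it works

Conditional probability factorization: the probability of a text sequence factorizes into the product of each word's conditional probability given all preceding words, i.e. $P(w_1, w_2, \ldots, w_T) = P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_1, w_2) \cdots P(w_T \mid w_1, \ldots, w_{T-1})$
Model training: the model is usually trained by maximizing this probability over a large corpus of known text, learning parameters that accurately predict the word at each position. This is typically done with maximum likelihood estimation.
Generation: when producing new text, the model generates one word at a time, each conditioned on the words generated so far. It first generates the first word, then the second word given the first, and so on.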
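The generation process described above can be sketched as a loop that extends the sequence one word at a time until EOS or a length limit is reached. Everything here is hypothetical: `TOY_MODEL` stands in for a trained model, the `<eos>` marker and probabilities are made up, and the next word is chosen greedily for simplicity.

```python
EOS = "<eos>"

# Hypothetical stand-in for a trained model: maps the full prefix to a
# next-word distribution. All values are illustrative.
TOY_MODEL = {
    ("The",): {"dog": 0.6, "cat": 0.4},
    ("The", "dog"): {"barks": 0.7, EOS: 0.3},
    ("The", "dog", "barks"): {EOS: 0.9, "loudly": 0.1},
}


def generate(prefix, max_length):
    """Extend prefix word by word until EOS or max_length tokens."""
    tokens = list(prefix)
    while len(tokens) < max_length:
        dist = TOY_MODEL[tuple(tokens)]  # condition on everything so far
        word = max(dist, key=dist.get)   # greedy choice for simplicity
        if word == EOS:
            break
        tokens.append(word)
    return tokens


full = generate(["The"], max_length=10)
truncated = generate(["The"], max_length=2)
print(full, truncated)
```

As with max_length in model.generate, the length limit here counts the context words as well as the generated ones.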

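Implementation techniques

RNN (recurrent neural network): early autoregressive models often used RNNs, which can process sequential data but perform poorly on long sequences because of vanishing and exploding gradients.
LSTM (long short-term memory) and GRU (gated recurrent unit): RNN variants designed to alleviate the long-term dependency problem.
Transformer: with the introduction of the attention mechanism, Transformer models such as the GPT series have been highly successful in autoregressive language modeling. They replace the RNN's step-by-step recurrence with attention, so all positions can be processed in parallel during training, which greatly improves training speed and model performance.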

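Decoding strategies

Greedy search: chooses the most probable word at each step; simple, but may miss the sequence with the highest overall probability.
Beam search: keeps several of the most probable candidate sequences to explore better paths, but can produce repetition and has limited effectiveness on open-ended tasks.
Sampling methods: random sampling, temperature sampling, Top-K sampling, Top-P sampling and so on increase the diversity and naturalness of the generated text, possibly at the cost of coherence.

With the GPT2LMHeadModel model and the GPT2Tokenizer tokenizer provided by MindNLP, the different text generation methods can all be applied: greedy search, beam search, sampling strategies (such as Top-K sample and Top-P sample), and the combination of Top-K and Top-P.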