LLMs for Beginners: Four Text Decoding Strategies

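In the fascinating world of large language models (LLMs), model architecture, data processing, and optimization tend to get most of the attention. Decoding strategies play a crucial role in text generation, yet they are often overlooked.

In this article, we explore how LLMs generate text by looking closely at the mechanics of greedy search and beam search, as well as sampling techniques based on top-k sampling and nucleus sampling.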

https://mlabonne.github.io/blog/posts/2022-06-07-Decoding_strategies.html

https://colab.research.google.com/drive/19CJlOS5lI29g-B3dziNn93Enez1yiHk2?usp=sharing

Background

To get started, let's take an example. We feed the text "I have a dream" into a GPT-2 model and ask it to generate the next five tokens (words or subwords).

from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = GPT2LMHeadModel.from_pretrained('gpt2').to(device)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model.eval()

text = "I have a dream"
input_ids = tokenizer.encode(text, return_tensors='pt').to(device)

outputs = model.generate(input_ids, max_length=len(input_ids.squeeze())+5)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Generated text: {generated_text}")

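The sentence "I have a dream of being a doctor" appears to have been generated by GPT-2. However, GPT-2 did not exactly produce this sentence.

In the following sections, we dig into several decoding strategies: greedy search, beam search, top-k sampling, and nucleus sampling. These strategies give a clearer picture of how GPT-2 produces text.

A common misconception is that LLMs like GPT-2 directly produce text. They do not. Instead, an LLM computes a score, called a logit, for every possible token in its vocabulary. Here is a step-by-step breakdown of the process:

First, the tokenizer (Byte-Pair Encoding in this case) converts each token in the input text into its corresponding token ID. GPT-2 then takes these token IDs as input and tries to predict the most likely next token. Finally, the model outputs logits, which a softmax function converts into probabilities.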
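To make the last step concrete, here is a minimal sketch of softmax in plain Python. The logits and the four candidate tokens are made-up values for illustration, not actual GPT-2 outputs, and the `softmax` helper is a hypothetical stand-in for what a library such as PyTorch would do over the full vocabulary.

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    # Subtract the largest logit first for numerical stability.
    m = max(logits.values())
    exps = {tok: math.exp(score - m) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical logits for four candidate next tokens (made-up numbers).
toy_logits = {" of": 2.0, " that": 1.5, ".": 1.0, " come": 0.5}
probs = softmax(toy_logits)

# Rank the candidates from most to least likely.
ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
for tok, p in ranked:
    print(f"{tok!r}: {p:.3f}")
```

The ordering of the tokens is the same before and after softmax. The function only rescales the scores so they can be read as probabilities.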

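For example, the model assigns a probability of 17% to the token "of" coming after "I have a dream". In effect, this output is a ranked list of potential next tokens. More formally, we write this probability as $P(\text{of} \mid \text{I have a dream}) = 17\%$.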

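Autoregressive models like GPT predict the next token in a sequence based on the preceding tokens. Consider a sequence of tokens $w = (w_1, w_2, \ldots, w_t)$. The joint probability of this sequence can be factorized as:

$$P(w) = P(w_1, w_2, \ldots, w_t) = P(w_1)\,P(w_2 \mid w_1) \cdots P(w_t \mid w_1, \ldots, w_{t-1}) = \prod_{i=1}^{t} P(w_i \mid w_1, \ldots, w_{i-1})$$

For each token $w_i$ in the sequence, $P(w_i \mid w_1, \ldots, w_{i-1})$ is the conditional probability of $w_i$ given all the preceding tokens. GPT-2 computes this conditional probability for each of the 50,257 tokens in its vocabulary.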
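As a small worked example of this factorization, the sketch below multiplies a few conditional probabilities to get a joint probability, then computes the same quantity as a sum of log-probabilities. The three probabilities are assumed values chosen for illustration, not numbers measured from GPT-2.

```python
import math

# Hypothetical conditional probabilities P(w_i | w_1, ..., w_{i-1})
# for a three-token continuation (made-up numbers).
conditional_probs = [0.17, 0.30, 0.25]

# Joint probability: the product of the conditionals.
joint = math.prod(conditional_probs)

# The same quantity in log space: a sum of log-probabilities.
log_joint = sum(math.log(p) for p in conditional_probs)

print(f"joint probability     = {joint:.6f}")
print(f"exp(sum of log-probs) = {math.exp(log_joint):.6f}")
```

Working in log space turns the product into a sum, which avoids very small numbers when sequences get long.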

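Greedy Search

Greedy search is a decoding method that picks the most probable token at each step as the next token in the sequence. Put simply, it keeps only the single most likely token at every stage and discards all other options. Using our example: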

Step 1: Input: "I have a dream" → Most likely token: "of"
Step 2: Input: "I have a dream of" → Most likely token: "being"
Step 3: Input: "I have a dream of being" → Most likely token: "a"
Step 4: Input: "I have a dream of being a" → Most likely token: "doctor"
Step 5: Input: "I have a dream of being a doctor" → Most likely token: "."

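Although this approach sounds intuitive, greedy search is short-sighted: it only considers the most probable token at each step, without regard for the effect on the sequence as a whole. This makes it fast and efficient, since it does not need to keep track of multiple sequences, but it also means it can miss better sequences that begin with a slightly less probable token.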
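The short-sightedness described above can be shown on a toy example. In the sketch below, `TOY_MODEL` is a hypothetical lookup table of next-token probabilities (all numbers are made up), and `exhaustive_decode` is an illustrative helper that scores every possible sequence, which is only feasible at this tiny scale.

```python
# Hypothetical next-token distributions, keyed by the tokens generated so far.
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.55, "y": 0.45},
    ("B",): {"x": 0.9, "y": 0.1},
}

def greedy_decode(model, steps):
    """Pick the single most probable token at every step."""
    seq, prob = (), 1.0
    for _ in range(steps):
        dist = model[seq]
        token = max(dist, key=dist.get)
        prob *= dist[token]
        seq += (token,)
    return seq, prob

def exhaustive_decode(model, steps, seq=(), prob=1.0):
    """Score every possible sequence and return the most probable one."""
    if steps == 0:
        return seq, prob
    candidates = [
        exhaustive_decode(model, steps - 1, seq + (tok,), prob * p)
        for tok, p in model[seq].items()
    ]
    return max(candidates, key=lambda c: c[1])

greedy_seq, greedy_prob = greedy_decode(TOY_MODEL, steps=2)
best_seq, best_prob = exhaustive_decode(TOY_MODEL, steps=2)
print(f"greedy: {greedy_seq}, probability {greedy_prob:.2f}")
print(f"best:   {best_seq}, probability {best_prob:.2f}")
```

Greedy decoding commits to "A" because it is the most probable first token, and so it never reaches the sequence starting with "B", whose overall probability is higher.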

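Next, let's illustrate an implementation of greedy search using graphviz and networkx. We select the token ID with the highest score, compute its log probability (we take the log to simplify calculations), and add it to the tree. We repeat this process five times to generate five tokens.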

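import numpy as np  # needed for np.exp below

# Relies on `graph` (a networkx directed graph) and `get_log_prob`,
# which must be defined before this function is called.
def greedy_search(input_ids, node, length=5):
    # Base case: no tokens left to generate
    if length == 0:
        return input_ids

    # Forward pass without gradient tracking (inference only)
    with torch.no_grad():
        outputs = model(input_ids)
    predictions = outputs.logits

    # Get the predicted next sub-word (greedy: the single highest-scoring token)
    logits = predictions[0, -1, :]
    token_id = torch.argmax(logits).unsqueeze(0)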

    # Compute the score of the predicted token
    token_score = get_log_prob(logits, token_id)

    # Add the predicted token to the list of input ids
    new_input_ids = torch.cat([input_ids, token_id.unsqueeze(0)], dim=-1)

    # Add node and edge to graph
    next_token = tokenizer.decode(token_id, skip_special_tokens=True)
    current_node = list(graph.successors(node))[0]
    graph.nodes[current_node]['tokenscore'] = np.exp(token_score) * 100
    graph.nodes[current_node]['token'] = next_token + f"_{length}"

    # Recursive call
    input_ids = greedy_search(new_input_ids, current_node, length-1)
    
    return input_ids
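The function above calls `get_log_prob`, which is not defined in this excerpt. Presumably it returns the log of the softmax probability for the chosen token. The sketch below shows that computation in plain Python on made-up logits. The name `log_prob_of` and the four-value vocabulary are assumptions for illustration, not the helper the code actually uses.

```python
import math

def log_prob_of(logits, index):
    """Log-probability of logits[index] under softmax, via log-sum-exp."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[index] - log_norm

# Hypothetical logits over a four-token vocabulary (made-up numbers).
toy_logits = [2.0, 1.5, 1.0, 0.5]

# Greedy choice: the index with the highest logit.
best = max(range(len(toy_logits)), key=toy_logits.__getitem__)
score = log_prob_of(toy_logits, best)

# exp(score) * 100 turns the log-probability back into a percentage,
# mirroring how 'tokenscore' is filled in by greedy_search.
print(f"token index {best}: log-prob {score:.4f}, {math.exp(score) * 100:.1f}%")
```

With real model outputs, a tensor library would typically compute this over the whole vocabulary at once. The arithmetic is the same.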

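Beam Search

Unlike greedy search, which only considers the single most probable next token, beam search keeps track of the $n$ most likely tokens, where $n$ is the number of beams. This procedure is repeated until a predefined maximum length is reached or an end-of-sequence token appears. At that point, the sequence (or "beam") with the highest overall score is chosen as the output.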
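The idea can be sketched on a toy model before touching GPT-2. The code below is the common pruning variant of beam search, which keeps only the `num_beams` best partial sequences after each step. It is not the article's own recursive function. `TOY_MODEL` and `beam_search_toy` are hypothetical names, and all probabilities are made up.

```python
import math

# Hypothetical next-token distributions, keyed by the tokens generated so far.
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.55, "y": 0.45},
    ("B",): {"x": 0.9, "y": 0.1},
}

def beam_search_toy(model, steps, num_beams):
    """Keep the `num_beams` best partial sequences at every step."""
    beams = [((), 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for seq, cum_score in beams:
            for tok, p in model[seq].items():
                candidates.append((seq + (tok,), cum_score + math.log(p)))
        # Sort by cumulative log-probability and prune to the beam width.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams

for n in (1, 2):
    seq, score = beam_search_toy(TOY_MODEL, steps=2, num_beams=n)[0]
    print(f"beams={n}: {seq}, probability {math.exp(score):.2f}")
```

With one beam, the procedure reduces to greedy search. With two beams, the sequence starting with the less probable first token "B" survives the first step and ends up with the highest overall score.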

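We can adapt the previous function to consider the $n$ most probable tokens instead of just one. Here, we maintain a sequence score, which is the cumulative sum of the log probabilities of every token in the beam. We normalize this score by the sequence length so that length alone does not skew the comparison between sequences (this factor can be adjusted). Once again, we generate five additional tokens to complete the sentence "I have a dream".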

def beam_search(input_ids, node, bar, length, beams, sampling, temperature=0.1):
    if length == 0:
        return None

    outputs = model(input_ids)
    predictions = outputs.logits

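    # Logits for the next token (last position of the only batch item)
    logits = predictions[0, -1, :]

    # Pick `beams` candidate tokens with the requested strategy.
    # The three helper functions are not defined in this snippet.
    if sampling == 'greedy':
        top_token_ids = greedy_sampling(logits, beams)
    elif sampling == 'top_k':
        top_token_ids = top_k_sampling(logits, temperature, 20, beams)
    elif sampling == 'nucleus':
        top_token_ids = nucleus_sampling(logits, temperature, 0.5, beams)
    else:
        raise ValueError(f"Unknown sampling strategy: {sampling}")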

    for j, token_id in enumerate(top_token_ids):
        bar.update(1)

        # Compute the score of the predicted token
        token_score = get_log_prob(logits, token_id)
        cumulative_score = graph.nodes[node]['cumscore'] + token_score

        # Add the predicted token to the list of input ids
        new_input_ids = torch.cat([input_ids, token_id.unsqueeze(0).unsqueeze(0)], dim=-1)

        # Add node and edge to graph
        token = tokenizer.decode(token_id, skip_special_tokens=True)
        current_node = list(graph.successors(node))[j]
        graph.nodes[current_node]['tokenscore'] = np.exp(token_score) * 100
        graph.nodes[current_node]['cumscore'] = cumulative_score
        graph.nodes[current_node]['sequencescore'] = 1/(len(new_input_ids.squeeze())) * cumulative_score
        graph.nodes[current_node]['token'] = token + f"_{length}_{j}"

        # Recursive call, keeping the same temperature at every depth
        beam_search(new_input_ids, current_node, bar, length-1, beams, sampling, temperature)