Build a Large Language Model (From Scratch)GPT-4o翻译和代码每行中文注释Ch5

Pretraining on Unlabeled Data
- This chapter covers
- 5.1 Evaluating generative text models
- - 5.1.1 Using GPT to generate text
  - 5.1.2 Calculating the text generation loss
  - 5.1.3 Calculating the training and validation set losses
- 5.2 Training an LLM
- 5.3 Decoding strategies to control randomness
- - 5.3.1 Temperature scaling
  - 5.3.2 Top-k sampling
  - 5.3.3 Modifying the text generation function
- 5.4 Loading and saving model weights in PyTorch
- 5.5 Loading pretrained weights from OpenAI
- 5.6 Summary

Pretraining on Unlabeled Data

预训练未标记数据

This chapter covers

本章内容包括

Computing the training and validation set losses to assess the quality of LLM-generated text during training
计算训练和验证集的损失，以评估在训练期间生成的LLM文本的质量
Implementing a training function and pretraining the LLM
实现一个训练函数并预训练LLM
Saving and loading model weights to continue training an LLM
保存和加载模型权重以继续训练LLM
Loading pretrained weights from OpenAI
从OpenAI加载预训练的权重

In the previous chapters, we implemented the data sampling, attention mechanism and coded the LLM architecture. The core focus of this chapter is to implement a training function and pretrain the LLM, as illustrated in Figure 5.1.

在前几章中，我们实现了数据采样、注意力机制并编码了LLM架构。本章的核心重点是实现一个训练函数并预训练LLM，如图5.1所示。

在这里插入图片描述

Figure 5.1 A mental model of the three main stages of coding an LLM, pretraining the LLM on a general text dataset and finetuning it on a labeled dataset. This chapter focuses on pretraining the LLM, which includes implementing the training code, evaluating the performance, and saving and loading model weights.

图5.1 编码LLM的三个主要阶段的心理模型，在通用文本数据集上预训练LLM，并在标记数据集上进行微调。本章重点是预训练LLM，包括实现训练代码、评估性能以及保存和加载模型权重。

As illustrated in Figure 5.1, we will also learn about basic model evaluation techniques to measure the quality of the generated text, which is a requirement for optimizing the LLM during the training process. Moreover, we will discuss how to load pretrained weights, giving our LLM a solid starting point for finetuning in the upcoming chapters.

如图5.1所示，我们还将学习基本的模型评估技术，以衡量生成文本的质量，这是在训练过程中优化LLM的要求。此外，我们将讨论如何加载预训练权重，为即将到来的微调章节提供坚实的起点。

WEIGHT PARAMETERS

权重参数

In the context of LLMs and other deep learning models, weights refer to the trainable parameters that the learning process adjusts. These weights are also known as weight parameters or simply parameters. In frameworks like PyTorch, these weights are stored in linear layers, for example, which we used to implement the multi-head attention module in chapter 3 and the GPTModel in chapter 4. After initializing a layer (new_layer = torch.nn.Linear(…)), we can access its weights through the weight attribute, new_layer. weight Additionally, for convenience, PyTorch allows direct access to all a model’s trainable parameters, including weights and biases, through the method model.parameters(), which we will use later when implementing the model training.

在LLM和其他深度学习模型的上下文中，权重是指学习过程调整的可训练参数。这些权重也被称为权重参数或简单地称为参数。在PyTorch等框架中，这些权重存储在线性层中，例如，我们在第3章中用它来实现多头注意力模块和第4章中的GPTModel。初始化层（new_layer = torch.nn.Linear(…))后，我们可以通过.weight属性访问其权重，new_layer.weight。此外，为了方便起见，PyTorch允许通过model.parameters() 方法直接访问模型的所有可训练参数，包括权重和偏置，我们将在实现模型训练时使用它。

5.1 Evaluating generative text models

5.1 评估生成文本模型

We begin this chapter by setting up the LLM for text generation based on code from the previous chapter and discuss basic ways to evaluate the quality of the generated text in this section. The content we cover in this section and the remainder of this chapter is outlined in Figure 5.2.

我们从设置LLM进行文本生成开始本章，该设置基于前一章的代码，并在本节中讨论评估生成文本质量的基本方法。本节和本章其余部分涵盖的内容如图5.2所示。

在这里插入图片描述

Figure 5.2 An overview of the topics covered in this chapter. We begin by recapping the text generation from the previous chapter and implementing basic model evaluation techniques that we can use during the pretraining stage.

图5.2 本章涵盖主题的概述。我们从回顾前一章的文本生成开始，并实现可以在预训练阶段使用的基本模型评估技术。

As shown in Figure 5.2, the next subsection recaps the text generation we set up at the end of the previous chapter before we dive into the text evaluation and calculation of the training and validation losses in the subsequent subsections.

如图5.2所示，下一小节回顾了我们在上一章结尾设置的文本生成，然后我们将深入研究文本评估和后续小节中的训练和验证损失的计算。

5.1.1 Using GPT to generate text

5.1.1 使用GPT生成文本

In this section, we set up the LLM and briefly recap the text generation process we implemented in chapter 4. We begin by initializing the GPT model that we will evaluate and train in this chapter, using the GPTModel class and GPT_CONFIG_124M dictionary from chapter 4:

在本节中，我们设置了LLM并简要回顾了我们在第4章中实现的文本生成过程。我们首先初始化将在本章中评估和训练的GPT模型，使用第4章中的GPTModel类和GPT_CONFIG_124M字典：

import torch                        # 导入torch库
from chapter04 import GPTModel      # 从第4章导入GPTModel

GPT_CONFIG_124M = {                 # GPT配置字典
    "vocab_size": 50257,            # 词汇表大小
    "context_length": 256,          #A 将上下文长度从1024缩短到256词元
    "emb_dim": 768,                 # 嵌入维度
    "n_heads": 12,                  # 注意力头数量
    "n_layers": 12,                 # 层数
    "drop_rate": 0.1,               #B 可能且常见的是将dropout设置为0。
    "qkv_bias": False               # QKV偏置
}

torch.manual_seed(123)              # 设置随机种子
model = GPTModel(GPT_CONFIG_124M)   # 使用配置初始化模型
model.eval()                        # 将模型设置为评估模式

Considering the GPT_CONFIG_124M dictionary, the only adjustment we have made compared to the previous chapter is reducing the context length (context_length) to 256 tokens. This modification reduces the computational demands of training the model, making it possible to carry out the training on a standard laptop computer.

考虑到GPT_CONFIG_124M字典，我们与前一章相比唯一的调整是将上下文长度（context_length）减少到256个词元。此修改减少了训练模型的计算需求，使得在标准笔记本电脑上进行训练成为可能。

Originally, the GPT-2 model with 124 million parameters was configured to handle up to 1,024 tokens. After the training process, at the end of this chapter, we will update the context size setting and load pretrained weights to work with a model configured for a 1,024-token context length.

最初，具有1.24亿参数的GPT-2模型被配置为处理最多1,024个词元。在训练过程结束后，本章末尾，我们将更新上下文大小设置并加载预训练权重，以使用配置为1,024词元上下文长度的模型。

Using the GPTModel instance, we adopt the generate_text_simple function introduced in the previous chapter and introduce two handy functions, text_to_token_ids and token_ids_to_text. These functions facilitate the conversion between text and token representations, a technique we will utilize throughout this chapter. To provide a clearer understanding, Figure 5.3 illustrates this process before we dive into the code.

使用GPTModel实例，我们采用前一章中介绍的generate_text_simple函数，并引入两个实用函数，text_to_token_ids和token_ids_to_text。这些函数促进文本和词元表示之间的转换，这是我们将在本章中使用的一项技术。为了提供更清晰的理解，图5.3在我们深入代码之前说明了这一过程。
在这里插入图片描述

Figure 5.3 Generating text involves encoding text into token IDs that the LLM processes into logit vectors. The logit vectors are then converted back into token IDs, detokenized into a text representation.

图5.3 生成文本涉及将文本编码为词元ID，LLM将其处理为logit向量。然后将logit向量转换回词元ID，解码为文本表示。

Figure 5.3 illustrates a three-step text generation process using a GPT model. First, the tokenizer converts input text into a series of token IDs, as discussed in chapter 2. Second, the model receives these token IDs and generates corresponding logits, which are vectors representing the probability distribution for each token in the vocabulary, as discussed in chapter 4. Third, these logits are converted back into token IDs, which the tokenizer decodes into human-readable text, completing the cycle from textual input to textual output.

图5.3说明了使用GPT模型的三步文本生成过程。首先，分词器将输入文本转换为一系列词元ID，如第2章中讨论的那样。其次，模型接收这些词元ID并生成相应的logits，这些向量表示词汇表中每个词元的概率分布，如第4章中讨论的那样。第三，这些logits被转换回词元ID，分词器将其解码为人类可读文本，完成从文本输入到文本输出的循环。

In code, we implement the text generation process as follows:

在代码中，我们实现文本生成过程如下：

import tiktoken                    # 导入tiktoken库
from chapter04 import generate_text_simple # 从第4章导入generate_text_simple函数

def text_to_token_ids(text, tokenizer):   # 定义text_to_token_ids函数
    encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'}) # 编码文本，允许特殊词元
    encoded_tensor = torch.tensor(encoded).unsqueeze(0) # 添加批次维度
    return encoded_tensor             # 返回编码后的张量

def token_ids_to_text(token_ids, tokenizer): # 定义token_ids_to_text函数
    flat = token_ids.squeeze(0)       # 移除批次维度
    return tokenizer.decode(flat.tolist()) # 解码为文本

start_context = "Every effort moves you" # 设置初始上下文
tokenizer = tiktoken.get_encoding("gpt2") # 获取GPT-2的分词器编码

token_ids = generate_text_simple(   # 调用generate_text_simple函数生成词元ID
    model=model,                    # 模型
    idx=text_to_token_ids(start_context, tokenizer), # 将初始上下文转换为词元ID
    max_new_tokens=10,              # 最大新词元数
    context_size=GPT_CONFIG_124M["context_length"]  # 上下文长度
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer)) # 打印生成的文本

Using the preceding code, the model generates the following text:

使用上述代码，模型生成以下文本：

Output text:
Every effort moves you reintingetic wasn refres RexMeCHicular stren

Based on the output, it’s clear the model isn’t yet producing coherent text because it hasn’t undergone training. To define what makes text “coherent” or “high quality,” we have to implement a numerical method to evaluate the generated content. This approach will enable us to monitor and enhance the model’s performance throughout its training process.

根据输出结果，很明显模型尚未生成连贯的文本，因为它还没有经过训练。要定义使文本 “连贯”或“高质量” 的标准，我们必须实现一种数值方法来评估生成的内容。这种方法将使我们能够在整个训练过程中监控和提高模型的性能。

The following section introduces how we calculate a loss metric for the generated outputs. This loss serves as a progress and success indicator of the training progress. Furthermore, in subsequent chapters on finetuning LLMs, we will review additional methodologies for assessing model quality.

接下来的部分介绍了如何计算生成输出的损失指标。此损失作为训练进度的进展和成功指示器。此外，在后续关于微调LLM的章节中，我们将回顾评估模型质量的其他方法。

5.1.2 Calculating the text generation loss

5.1.2 计算文本生成损失

This section explores techniques for numerically assessing text quality generated during training by calculating a so-called text generation loss. We go over this topic step-by-step with a practical example to make the concepts clear and applicable, beginning with a short recap of how the data is loaded from chapter 2 and how the text is generated via the generate_text_simple function from chapter 4.

本节探讨通过计算所谓的文本生成损失，在训练期间数值评估生成文本质量的技术。我们通过实际例子逐步讲解这一主题，使概念清晰且可应用，从简短回顾第2章如何加载数据和第4章如何通过generate_text_simple函数生成文本开始。

在这里插入图片描述

Figure 5.4 illustrates the overall flow from input text to LLM-generated text using a five-step procedure.

图5.4说明了使用五步过程从输入文本到LLM生成文本的整体流程。

Figure 5.4 For each of the 3 input tokens, shown on the left, we compute a vector containing probability scores corresponding to each token in the vocabulary. The index position of the highest probability score in each vector represents the most likely next token ID. These token IDs associated with the highest probability scores are selected and mapped back into a text that represents the text generated by the model.

图5.4 对于左侧显示的每个输入词元，我们计算一个包含与词汇表中每个词元相对应的概率分数的向量。每个向量中最高概率分数的索引位置表示最可能的下一个词元ID。选择与最高概率分数相关联的这些词元ID，并将其映射回表示模型生成的文本的文本中。

The text generation process in Figure 5.4 outlines what the generate_text_simple function from chapter 4 does internally. We need to perform these same initial steps before we can compute a loss that measures the generated text quality later in this section.

图5.4中的文本生成过程概述了第4章中的generate_text_simple函数在内部所做的事情。我们需要执行这些相同的初始步骤，然后才能计算衡量生成文本质量的损失。

Figure 5.4 outlines the text generation process with a small 7-token vocabulary to fit this image on a single page. However, our GPTModel works with a much larger vocabulary consisting of 50,257 words; hence, the token IDs in the following codes will range from 0 to 50,256 rather than 0 to 6.

图5.4概述了使用小型7词元词汇表的文本生成过程，以适应单页图像。然而，我们的GPTModel使用一个由50,257个词组成的更大词汇表；因此，以下代码中的词元ID范围为0到50,256而不是0到6。

Also, Figure 5.4 only shows a single text example (“every effort moves”) for simplicity. In the following hands-on code example that implements the steps in Figure 5.4, we will work with two input examples (“every effort moves” and “I really like”) as inputs for the GPT model:

此外，图5.4仅显示了一个简单文本示例（“every effort moves”）以简化。在接下来的实际代码示例中，我们将实现图5.4中的步骤，并使用两个输入示例（“every effort moves"和"I really like”）作为GPT模型的输入：

inputs = torch.tensor([[16833, 3626, 6100],  # ["every effort moves"]
                       [   40, 1107,  588]])  # ["I really like"]
# 匹配这些输入，'targets'包含我们希望模型生成的词元ID
targets = torch.tensor([[ 3626, 6100,  345],  # [" effort moves you"]
                        [  107,  588, 11311]])  # [" really like chocolate"]

Note that the targets are the inputs but shifted one position forward, a concept we covered in chapter 2 during the implementation of the data loader. This shifting strategy is crucial for teaching the model to predict the next token in a sequence.

注意，目标是输入但前移了一个位置，这是我们在第2章实现数据加载器时介绍的概念。这种移位策略对于教模型预测序列中的下一个词元至关重要。

When we feed the inputs into the model to calculate logit vectors for the two input examples, each comprising three tokens, and apply the softmax function to transform these logit values into probability scores, which corresponds to step 2 in Figure 5.4:

当我们将输入送入模型以计算两个输入示例的logit向量（每个示例包含三个词元），并应用softmax函数将这些logit值转换为概率分数时，对应于图5.4中的步骤2：

with torch.no_grad():                   # 禁用梯度跟踪，因为我们尚未训练
    logits = model(inputs)              # 将输入送入模型，计算logit向量
    probas = torch.softmax(logits, dim=-1)  # 每个词元在词汇表中的概率
print(probas.shape)                     # 打印概率张量的维度

The resulting tensor dimension of the probability score (probas) tensor is as follows:

概率分数（probas）张量的结果维度如下：

torch.Size([2, 3, 50257])

The first number, 2, corresponds to the two examples (rows) in the inputs, also known as batch size. The second number, 3, corresponds to the number of tokens in each input (row). Finally, the last number corresponds to the embedding dimensionality, which is determined by the vocabulary size, as discussed in previous chapters.

第一个数字2对应于inputs中的两个示例（行），也称为批大小。第二个数字3对应于每个输入（行）中的词元数。最后一个数字对应于嵌入维度，这是由词汇表大小决定的，如前几章所述。

Following the conversion from logits to probabilities via the softmax function, the generate_text_simple function from chapter 4 then converts the resulting probability scores back into text, as illustrated in steps 3-5 in Figure 5.4.

通过softmax函数将logits转换为概率后，第4章中的generate_text_simple函数将生成的概率分数转换回文本，如图5.4中的步骤3-5所示。

We can implement steps 3 and 4 by applying the argmax function to the probability scores to obtain the corresponding token IDs:

我们可以通过对概率分数应用argmax函数来实现步骤3和4，以获得相应的词元ID：

token_ids = torch.argmax(probas, dim=-1, keepdim=True)  # 应用argmax函数获得词元ID
print("Token IDs:\n", token_ids)  # 打印词元ID

Given that we have 2 input batches, each containing 3 tokens, applying the argmax function to the probability scores (step 3 in Figure 5.4) yields 2 sets of outputs, each with 3 predicted token IDs:

考虑到我们有2个输入批次，每个批次包含3个词元，对概率分数应用argmax函数（图5.4中的步骤3）会产生2组输出，每组输出有3个预测的词元ID：

```python
Token IDs:
tensor([[[16657],  # 第一批次
         [  339], 
         [42826]],
        [[49906],  # 第二批次
         [29669], 
         [41751]]])

Finally, step 5 converts the token IDs back into text:

最后，步骤5将词元ID转换回文本：

print(f"Targets batch 1: {token_ids_to_text(targets[0], tokenizer)}")  # 打印第一批次的目标词元
print(f"Outputs batch 1: {token_ids_to_text(token_ids[0].flatten(), tokenizer)}")  # 打印第一批次的输出词元

When we decode these tokens, we find that these output tokens are quite different from the target tokens we want the model to generate:

当我们解码这些词元时，我们发现这些输出词元与我们希望模型生成的目标词元有很大不同：

Targets batch 1:  effort moves you
Outputs batch 1:  Armed NeNetflix

The model produces random text that is different from the target text because it has not been trained yet. We now get to the part where we evaluate the performance of the model’s generated text numerically via a so-called loss as illustrated in Figure 5.4. Not only is this useful for measuring the quality of the generated text, but it’s also a building block for implementing the training function later, which we use to update the model’s weight to improve the generated text.

模型生成的文本与目标文本不同，因为它还没有经过训练。我们现在进入评估模型生成文本性能的部分，通过图5.4所示的所谓损失数值进行评估。这不仅有助于衡量生成文本的质量，而且也是实现训练函数的基石，我们使用该函数来更新模型的权重，以改进生成的文本。
在这里插入图片描述

Figure 5.5 We now implement the text evaluation function in the remainder of this section. In the next section, we apply this evaluation function to the entire dataset we use for model training.

图5.5 我们现在在本节的其余部分实现文本评估函数。在下一节中，我们将此评估函数应用于我们用于模型训练的整个数据集。

The model training aims to increase the softmax probability in the index positions corresponding to the correct target token IDs, as illustrated in Figure 5.6. This softmax probability is also used in the evaluation metric we are implementing in the remainder of this section to numerically assess the model’s generated outputs: the higher the probability in the correct positions, the better.

模型训练旨在增加与正确目标词元ID对应的索引位置的softmax概率，如图5.6所示。我们在本节其余部分实施的评估指标中也使用了这种softmax概率，以数值评估模型生成的输出：正确位置的概率越高越好。

在这里插入图片描述

Figure 5.6 Before training, the model produces random next-token probability vectors. The goal of model training is to ensure that the probability values corresponding to the highlighted target token IDs are maximized.

图5.6 在训练之前，模型生成随机的下一个词元概率向量。模型训练的目标是确保对应于突出显示的目标词元ID的概率值最大化。

Remember that Figure 5.6 displays the softmax probabilities for a compact 7-token vocabulary to fit everything into a single figure. This implies that the starting random values will hover around 1/7, which equals approximately 0.14.

请记住，图5.6显示了一个紧凑的7词元词汇表的softmax概率，以便将所有内容都适应在一个图中。这意味着起始的随机值将徘徊在1/7左右，即大约0.14。

However, the vocabulary we are using for our GPT-2 model has 50,257 tokens, so most of the initial probabilities will hover around 0.00002 via 1/50,257.

然而，我们的GPT-2模型使用的词汇表有50,257个词元，因此大多数初始概率将通过1/50,257徘徊在0.00002左右。

For each of the two input texts, we can print the initial softmax probability scores corresponding to the target tokens via the following code:

对于两个输入文本中的每一个，我们可以通过以下代码打印对应于目标词元的初始softmax概率分数：

text_idx = 0  # 文本索引0
target_probas_1 = probas[text_idx, [0, 1, 2], targets[text_idx]]  # 计算目标词元的softmax概率
print("Text 1:", target_probas_1)  # 打印文本1的概率

text_idx = 1  # 文本索引1
target_probas_2 = probas[text_idx, [0, 1, 2], targets[text_idx]]  # 计算目标词元的softmax概率
print("Text 2:", target_probas_2)  # 打印文本2的概率

The 3 target token ID probabilities for each batch are as follows:

每个批次的3个目标词元ID概率如下：

Text 1: tensor([7.4541e-05, 3.1061e-05, 1.1563e-05])
Text 2: tensor([1.0337e-05, 5.6776e-05, 4.7559e-06])

The goal of training an LLM is to maximize these values, aiming to get them as close to a probability of 1. This way, we ensure the LLM consistently picks the target token—essentially the next word in the sentence—as the next token it generates.

训练LLM的目标是最大化这些值，旨在使它们尽可能接近1的概率。这样，我们确保LLM一致地选择目标词元——本质上是句子中的下一个词元——作为它生成的下一个词元。

BACKPROPAGATION

反向传播

How do we maximize the softmax probability values corresponding to the target tokens? The big picture is that we update the model weights so that the model outputs higher values for the respective token IDs we want to generate. The weight update is done via a process called backpropagation, a standard technique for training deep neural networks (see sections A.3 to A.7 in Appendix A for more details about backpropagation and model training).

我们如何最大化与目标词元对应的softmax概率值？大体上，我们通过更新模型权重，使模型输出更高的值来生成相应的词元ID。权重更新通过一种称为反向传播的过程完成，这是训练深度神经网络的标准技术（有关反向传播和模型训练的更多详细信息，请参见附录A的A.3至A.7部分）。

Backpropagation requires a loss function, which calculates the difference between the model’s predicted output (here, the probabilities corresponding to the target token IDs) and the actual desired output. This loss function measures how far off the model’s predictions are from the target values.

反向传播需要一个损失函数，它计算模型预测输出（此处为与目标词元ID对应的概率）与实际期望输出之间的差异。该损失函数衡量模型预测值与目标值的偏差。

In the remainder of this section, we calculate the loss for the probability scores of the two example batches, target_probas_1 and target_probas_2. The main steps are illustrated in Figure 5.7.

在本节的其余部分，我们计算两个示例批次的概率分数target_probas_1和target_probas_2的损失。主要步骤在图5.7中说明。

在这里插入图片描述

Figure 5.7 Calculating the loss involves several steps. Steps 1 to 3 calculate the token probabilities corresponding to the target tensors. These probabilities are then transformed via a logarithm and averaged in steps 4-6.

图5.7 计算损失涉及多个步骤。步骤1到3计算对应于目标张量的词元概率。这些概率然后通过对数变换并在步骤4-6中取平均值。

Since we already applied steps 1-3 listed in Figure 5.7 to obtain target_probas_1 and target_probas_2, we proceed with step 4, applying the logarithm to the probability scores:

由于我们已经应用图5.7中列出的步骤1-3获得target_probas_1和target_probas_2，我们继续步骤4，将概率分数取对数：

log_probas = torch.log(torch.cat((target_probas_1, target_probas_2)))  # 计算对数概率
print(log_probas)  # 打印对数概率

This results in the following values:

结果如下：

tensor([-9.5042, -10.3796, -11.3677, -11.4798, -9.7764, -12.2561])

Working with logarithms of probability scores is more manageable in mathematical optimization than handling the scores directly. This topic is outside the scope of this book, but I’ve detailed it further in a lecture, which is linked in the reference section in appendix B.

在数学优化中处理概率分数的对数比直接处理分数更易于管理。本书不涉及这一主题，但我在附录B的参考部分中链接的一次讲座中进一步详细讲解了这一点。

Next, we combine these log probabilities into a single score by computing the average (step 5 in Figure 5.7):

接下来，我们通过计算平均值将这些对数概率组合成一个分数（图5.7中的步骤5）：

avg_log_probas = torch.mean(log_probas)  # 计算平均对数概率
print(avg_log_probas)  # 打印平均对数概率

The resulting average log probability score is as follows:

结果的平均对数概率分数如下：

tensor(-10.7940)

The goal is to get the average log probability as close to 0 as possible by updating the model’s weights as part of the training process, which we will implement later in section 5.2.

目标是通过在训练过程中更新模型的权重，使平均对数概率尽可能接近0，我们将在第5.2节中实现这一点。

However, in deep learning, the common practice isn’t to push the average log probability up to 0 but rather to bring the negative average log probability down to 0. The negative average log probability is simply the average log probability multiplied by -1, which corresponds to step 6 in Figure 5.7:

然而，在深度学习中，通常的做法不是将平均对数概率推高到0，而是将负平均对数概率降到0。负平均对数概率只是平均对数概率乘以-1，对应于图5.7中的步骤6：

neg_avg_log_probas = avg_log_probas * -1  # 计算负平均对数概率
print(neg_avg_log_probas)  # 打印负平均对数概率

This prints tensor(10.7940).

这将打印tensor(10.7940)。

The term for this negative value, -10.7940 turning into 10.7940, is known as the cross entropy loss in deep learning.

这个负值-10.7940变成10.7940的术语在深度学习中称为交叉熵损失。

PyTorch comes in handy here, as it already has a built-in cross_entropy function that takes care of all these 6 steps in Figure 5.7 for us.

PyTorch在这里派上用场，因为它已经有一个内置的cross_entropy函数，可以为我们处理图5.7中的所有这6个步骤。

CROSS ENTROPY LOSS

交叉熵损失

At its core, the cross entropy loss is a popular measure in machine learning and deep learning that measures the difference between two probability distributions—typically, the true distribution of labels (here, tokens in a dataset) and the predicted distribution from a model (for instance, the token probabilities generated by an LLM).

从本质上讲，交叉熵损失是机器学习和深度学习中一种常用的度量方法，用于衡量两个概率分布之间的差异——通常是标签的真实分布（这里是数据集中的词元）和模型的预测分布（例如，LLM生成的词元概率）。

In the context of machine learning and specifically in frameworks like PyTorch, the cross_entropy function computes this measure for discrete outcomes, which is similar to the negative average log probability of the target tokens given the model’s generated token probabilities, making the terms cross entropy and negative average log probability related and often used interchangeably in practice.

在机器学习的背景下，特别是在像PyTorch这样的框架中，cross_entropy函数计算离散结果的这种度量，这类似于给定模型生成的词元概率的目标词元的负平均对数概率，使得交叉熵和负平均对数概率这两个术语相关并在实践中经常互换使用。

Before we apply the cross entropy function, let’s briefly recall the shape of the logits and target tensors:

在我们应用交叉熵函数之前，让我们简要回顾一下logits和目标张量的形状：

print("Logits shape:", logits.shape)  # 打印logits形状
print("Targets shape:", targets.shape)  # 打印目标形状

The resulting shapes are as follows:

结果形状如下：

Logits shape: torch.Size([2, 3, 50257])
Targets shape: torch.Size([2, 3])

As we can see, the logits tensor has three dimensions: batch size, number of tokens, and vocabulary size. The targets tensor has two dimensions: batch size and number of tokens.

如我们所见，logits张量有三个维度：批大小、词元数量和词汇大小。目标张量有两个维度：批大小和词元数量。

For the cross_entropy_loss function in PyTorch, we want to flatten these tensors by combining them over the batch dimension:

对于PyTorch中的cross_entropy_loss函数，我们希望通过在批维度上组合这些张量来展平它们：

logits_flat = logits.flatten(0, 1)  # 将logits展平
targets_flat = targets.flatten()  # 将目标展平
print("Flattened logits:", logits_flat.shape)  # 打印展平后的logits形状
print("Flattened targets:", targets_flat.shape)  # 打印展平后的目标形状

The resulting tensor dimensions are as follows:

结果张量的维度如下：

Flattened logits: torch.Size([6, 50257])
Flattened targets: torch.Size([6])

Remember that the targets are the token IDs we want the LLM to generate, and the logits contain the unscaled model outputs before they enter the softmax function to obtain the probability scores.

请记住，targets是我们希望LLM生成的词元ID，logits包含在进入softmax函数之前未缩放的模型输出，以获得概率分数。

Previously, we applied the softmax function, selected the probability scores corresponding to the target IDs, and computed the negative average log probabilities. PyTorch’s cross_entropy function will take care of all these steps for us:

之前，我们应用了softmax函数，选择了与目标ID对应的概率分数，并计算了负平均对数概率。PyTorch的cross_entropy函数将为我们处理所有这些步骤：

loss = torch.nn.functional.cross_entropy(logits_flat, targets_flat)  # 计算交叉熵损失
print(loss)  # 打印损失

The resulting loss is the same that we obtained previously when applying the individual steps shown in Figure 5.7 manually:

结果的损失与我们之前手动应用图5.7中显示的各个步骤时获得的损失相同：

tensor(10.7940)

PERPLEXITY

困惑度

Perplexity is a measure often used alongside cross entropy loss to evaluate the performance of models in tasks like language modeling. It can provide a more interpretable way to understand the uncertainty of a model in predicting the next token in a sequence.

困惑度是一种经常与交叉熵损失一起使用的度量，用于评估模型在语言建模等任务中的性能。它可以提供一种更易解释的方式来理解模型在预测序列中下一个词元时的不确定性。

Perplexity measures how well the probability distribution predicted by the model matches the actual distribution of the words in the dataset. Similar to the loss, a lower perplexity indicates that the model predictions are closer to the actual distribution.

困惑度衡量模型预测的概率分布与数据集中单词的实际分布的匹配程度。与损失类似，较低的困惑度表明模型预测更接近实际分布。

Perplexity can be calculated as perplexity = torch.exp(loss), which returns tensor(48725.8203) when applied to the previously calculated loss.

困惑度可以通过perplexity = torch.exp(loss)计算，当应用于先前计算的损失时返回tensor(48725.8203)。

Perplexity is often considered more interpretable than the raw loss value because it signifies the effective vocabulary size about which the model is uncertain at each step. In the given example, this would translate to the model being unsure about which among 47,678 words or tokens in the vocabulary to generate as the next token.

困惑度通常被认为比原始损失值更易解释，因为它表示模型在每一步不确定的有效词汇大小。在给定的示例中，这意味着模型不确定在词汇表中的47,678个单词或词元中生成哪个作为下一个词元。

In this section, we calculated the loss for two small text inputs for illustration purposes. In the next section, we apply the loss computation to the entire training and validation sets.

在本节中，我们计算了两个小文本输入的损失以作说明。在下一节中，我们将损失计算应用于整个训练和验证集。

5.1.3 Calculating the training and validation set losses

5.1.3 计算训练和验证集的损失

In this section, we first prepare the training and validation datasets that we will use to train the LLM later in this chapter. Then, we calculate the cross entropy for the training and validation sets, as illustrated in Figure 5.8, which is an important component of the model training process.

在本节中，我们首先准备将用于训练LLM的训练和验证数据集。然后，我们计算训练和验证集的交叉熵，如图5.8所示，这是模型训练过程的重要组成部分。

在这里插入图片描述

Figure 5.8 After computing the cross entropy loss in the previous section, we now apply this loss computation to the entire text dataset that we will use for model training.

图5.8 在上一节计算交叉熵损失后，我们现在将这种损失计算应用于我们将用于模型训练的整个文本数据集。

To compute the loss on the training and validation datasets as illustrated in Figure 5.8, we use a very small text dataset, the “The Verdict” short story by Edith Wharton, which we have already worked with in chapter 2. By selecting a text from the public domain, we circumvent any concerns related to usage rights. Additionally, the reason why we use such a small dataset is that it allows for the execution of code examples on a standard laptop computer in a matter of minutes, even without a high-end GPU, which is particularly advantageous for educational purposes.

为了计算如图5.8所示的训练和验证数据集的损失，我们使用了一个非常小的文本数据集，即Edith Wharton的短篇小说"The Verdict"，我们已经在第2章中使用过。通过选择公有领域的文本，我们避免了与使用权相关的任何问题。此外，使用这样一个小数据集的原因是它允许在标准笔记本电脑上在几分钟内执行代码示例，即使没有高端GPU，这对于教育目的特别有利。

Interested readers can also use the supplementary code of this book to prepare a larger-scale dataset consisting of more than 60,000 public domain books from Project Gutenberg and train an LLM on these (see appendix D for details).

有兴趣的读者还可以使用本书的补充代码准备一个包含来自Gutenberg项目的超过60,000本公有领域书籍的大规模数据集，并在这些数据集上训练LLM（有关详细信息，请参见附录D）。

THE COST OF PRETRAINING LLMS

预训练LLM的成本

To put the scale of our project into perspective, consider the training of the 7 billion parameter Llama 2 model, a relatively popular openly available LLM. This model required 184,320 GPU hours on expensive A100 GPUs, processing 2 trillion tokens. At the time of writing, running an 8xA100 cloud server on AWS costs around $30 per hour. A rough estimate puts the total training cost of such an LLM at around $690,000 (calculated as 184,320 hours divided by 8, then multiplied by $30).

为了更好地理解我们项目的规模，考虑一下训练一个拥有70亿参数的Llama 2模型，这是一个相对流行的公开可用的LLM。该模型在昂贵的A100 GPU上需要184,320个GPU小时，处理2万亿词元。在撰写本文时，运行一个8xA100云服务器在AWS上的费用约为每小时30美元。粗略估计，这样一个LLM的总训练成本约为690,000美元（计算方法为184,320小时除以8，然后乘以30美元）。

The following code loads the “The Verdict” short story we used in chapter 2:

以下代码加载我们在第2章中使用的"The Verdict"短篇小说：

file_path = "the-verdict.txt"  # 文件路径
with open(file_path, "r", encoding="utf-8") as file:  # 以读模式打开文件
    text_data = file.read()  # 读取文件内容

After loading the dataset, we can check the number of characters and tokens in the dataset:

加载数据集后，我们可以检查数据集中的字符和词元数量：

total_characters = len(text_data)  # 计算总字符数
total_tokens = len(tokenizer.encode(text_data))  # 计算总词元数
print("Characters:", total_characters)  # 打印字符数
print("Tokens:", total_tokens)  # 打印词元数

The output is as follows:

输出如下：

Characters: 20479
Tokens: 5145

With just 5,145 tokens, the text might seem too small to train an LLM, but as mentioned earlier, it’s for educational purposes so that we can run the code in minutes instead of weeks. Plus, we will be loading pretrained weights from OpenAI into our GPTModel code at the end of this chapter.

只有5145个词元，这段文本可能看起来太小，不适合训练LLM，但如前所述，这是为了教育目的，以便我们可以在几分钟内运行代码，而不是几周。此外，我们将在本章末尾从OpenAI加载预训练权重到我们的GPTModel代码中。

Next, we divide the dataset into a training and a validation set and use the data loaders from chapter 2 to prepare the batches for LLM training. This process is visualized in Figure 5.9.

接下来，我们将数据集分成训练集和验证集，并使用第2章中的数据加载器准备LLM训练的批次。这个过程在图5.9中可视化。

An example using stride equal to the context length (here: 4) as shown below:

一个使用步幅等于上下文长度（此处为4）的示例如下：
在这里插入图片描述

Figure 5.9 When preparing the data loaders, we split the input text into training and validation set portions. Then, we tokenize the text (only shown for the training set portion for simplicity) and divide the tokenized text into chunks of a user-specified length (here 6). Finally, we shuffle the rows and organize the chunked text into batches (here, batch size 2), which we can use for model training.

图5.9 准备数据加载器时，我们将输入文本拆分为训练集和验证集部分。然后，我们对文本进行词元化（为简单起见，仅显示训练集部分），并将词元化的文本分成用户指定长度的块（此处为6）。最后，我们打乱行，并将分块的文本组织成批次（此处，批次大小为2），可用于模型训练。

For visualization purposes, Figure 5.9 uses a max_length=6 due to spatial constraints. However, for the actual data loaders we are implementing, we set the max_length equal to the 256-token context length that the LLM supports so that the LLM sees longer texts during training.

为了便于可视化，图5.9使用max_length=6，受限于空间约束。然而，对于我们正在实现的实际数据加载器，我们将max_length设置为LLM支持的256词元上下文长度，以便LLM在训练期间看到更长的文本。

TRAINING WITH VARIABLE LENGTHS

使用可变长度进行训练

We are training the model with training data presented in similarly-sized chunks for simplicity and efficiency. However, in practice, it can also be beneficial to train an LLM with variable-length inputs to help the LLM to better generalize across different types of inputs when it is being used.

为了简化和提高效率，我们用相似大小的块来训练模型。然而，在实践中，用可变长度的输入来训练LLM也是有益的，这有助于LLM在使用时更好地泛化到不同类型的输入。

To implement the data splitting and loading visualized in Figure 5.9, we first define a train_ratio to use 90% of the data for training and the remaining 10% as validation data for model evaluation during training:

为了实现图5.9中可视化的数据拆分和加载，我们首先定义一个train_ratio，将90%的数据用于训练，剩余的10%作为验证数据，用于模型训练期间的评估：

train_ratio = 0.90  # 训练集比例
split_idx = int(train_ratio * len(text_data))  # 计算分割索引
train_data = text_data[:split_idx]  # 获取训练数据
val_data = text_data[split_idx:]  # 获取验证数据

Using the train_data and val_data subsets, we can now create the respective data loader reusing the create_dataloader_v1 code from chapter 2:

使用train_data和val_data子集，我们现在可以创建相应的数据加载器，重用第2章中的create_dataloader_v1代码：

from chapter02 import create_dataloader_v1  # 从第2章导入create_dataloader_v1
torch.manual_seed(123)  # 设置随机种子

train_loader = create_dataloader_v1(
    train_data,  # 训练数据
    batch_size=2,  # 批大小
    max_length=GPT_CONFIG_124M["context_length"],  # 最大长度
    stride=GPT_CONFIG_124M["context_length"],  # 步幅
    drop_last=True,  # 丢弃最后一个不完整批次
    shuffle=True,  # 是否打乱数据
    num_workers=0  # 工作线程数
)

val_loader = create_dataloader_v1(
    val_data,  # 验证数据
    batch_size=2,  # 批大小
    max_length=GPT_CONFIG_124M["context_length"],  # 最大长度
    stride=GPT_CONFIG_124M["context_length"],  # 步幅
    drop_last=False,  # 不丢弃最后一个不完整批次
    shuffle=False,  # 是否打乱数据
    num_workers=0  # 工作线程数
)

We used a relatively small batch size in the preceding code to reduce the computational resource demand because we were working with a very small dataset. In practice, training LLMs with batch sizes of 1,024 or larger is not uncommon.

我们在前面的代码中使用了相对较小的批大小，以减少计算资源需求，因为我们处理的是一个非常小的数据集。在实际操作中，使用1,024或更大的批大小来训练LLM并不少见。

As an optional check, we can iterate through the data loaders to ensure that they were created correctly:

作为可选检查，我们可以遍历数据加载器，以确保它们已正确创建：

print("Train loader:")  # 打印训练加载器
for x, y in train_loader:
    print(x.shape, y.shape)  # 打印每个批次的形状

print("\nValidation loader:")  # 打印验证加载器
for x, y in val_loader:
    print(x.shape, y.shape)  # 打印每个批次的形状

We should see the following outputs:

我们应该看到以下输出：

Train loader:
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])

Validation loader:
torch.Size([2, 256]) torch.Size([2, 256])

Based on the preceding code output, we have 9 training set batches with 2 samples and 256 tokens each. Since we allocated only 10% of the data for validation, there is only one validation batch consisting of 2 input examples.

根据前面的代码输出，我们有9个训练集批次，每个批次包含2个样本和256个词元。由于我们只分配了10%的数据进行验证，因此只有一个由2个输入示例组成的验证批次。

As expected, the input data (x) and target data (y) have the same shape (the batch size times the number of tokens in each batch) since the targets are the inputs shifted by one position, as discussed in chapter 2.

如预期的那样，输入数据（x）和目标数据（y）具有相同的形状（批大小乘以每个批次中的词元数），因为目标是将输入偏移一个位置，如第2章所述。

Next, we implement a utility function to calculate the cross entropy loss of a given batch returned via the training and validation loader:

接下来，我们实现一个实用函数，以计算通过训练和验证加载器返回的给定批次的交叉熵损失：

def calc_loss_batch(input_batch, target_batch, model, device):  # 定义计算批次损失的函数
    input_batch, target_batch = input_batch.to(device), target_batch.to(device)  # 将输入和目标批次转移到设备上
    logits = model(input_batch)  # 模型计算logits
    loss = torch.nn.functional.cross_entropy(  # 计算交叉熵损失
        logits.flatten(0, 1), target_batch.flatten()  # 展平logits和目标批次
    )
    return loss  # 返回损失

We can now use this calc_loss_batch utility function, which computes the loss for a single batch, to implement the following calc_loss_loader function that computes the loss over all the batches sampled by a given data loader:

我们现在可以使用这个calc_loss_batch实用函数，它计算单个批次的损失，来实现以下calc_loss_loader函数，该函数计算由给定数据加载器采样的所有批次的损失：

def calc_loss_loader(data_loader, model, device, num_batches=None):  # 定义计算加载器损失的函数
    total_loss = 0.  # 初始化总损失为0
    if len(data_loader) == 0:  # 如果加载器为空
        return float("nan")  # 返回NaN
    elif num_batches is None:  # 如果未指定批次数
        num_batches = len(data_loader)  # 使用加载器中的批次数
    else:
        num_batches = min(num_batches, len(data_loader))  # 限制批次数为加载器中的批次数
    for i, (input_batch, target_batch) in enumerate(data_loader):  # 遍历加载器中的批次
        if i < num_batches:  # 如果未达到指定批次数
            loss = calc_loss_batch(input_batch, target_batch, model, device)  # 计算批次损失
            total_loss += loss.item()  # 累加损失
        else:
            break  # 超过指定批次数则退出
    return total_loss / num_batches  # 返回平均损失

By default, the calc_loss_batch function iterates over all batches in a given data loader, accumulates the loss in the total_loss variable, and then computes and averages the loss over the total number of batches. Alternatively, we can specify a smaller number of batches via num_batches to speed up the evaluation during model training.

默认情况下，calc_loss_batch函数遍历给定数据加载器中的所有批次，将损失累加到total_loss变量中，然后计算并平均所有批次的损失。或者，我们可以通过num_batches指定较小的批次数，以加快模型训练期间的评估。

Let’s now see this calc_loss_batch function in action, applying it to the training and validation set loaders:

现在让我们来看一下这个calc_loss_batch函数的实际应用，将其应用于训练和验证集加载器：

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # 设置设备为GPU或CPU
model.to(device)  # 将模型移动到设备上
with torch.no_grad():  # 禁用梯度计算以提高效率
    train_loss = calc_loss_loader(train_loader, model, device)  # 计算训练集损失
    val_loss = calc_loss_loader(val_loader, model, device)  # 计算验证集损失
print("Training loss:", train_loss)  # 打印训练集损失
print("Validation loss:", val_loss)  # 打印验证集损失

The resulting loss values are as follows:

结果损失值如下：

Training loss: 10.98758347829183
Validation loss: 10.98110580444336

The loss values are relatively high because the model has not yet been trained. For comparison, the loss approaches 0 if the model learns to generate the next tokens as they appear in the training and validation sets.

损失值相对较高，因为模型尚未训练。相比之下，如果模型学会生成训练和验证集中出现的下一个词元，损失将接近0。

Now that we have a way to measure the quality of the generated text, in the next section, we train the LLM to reduce this loss so that it becomes better at generating text, as illustrated in Figure 5.10.

现在我们有了一种衡量生成文本质量的方法，在下一节中，我们训练LLM以减少这种损失，从而更好地生成文本，如图5.10所示。
在这里插入图片描述

Figure 5.10 We have recapped the text generation process and implemented basic model evaluation techniques to compute the training and validation set losses. Next, we will go to the training functions and pretrain the LLM.

图5.10 我们回顾了文本生成过程，并实现了基本的模型评估技术来计算训练和验证集损失。接下来，我们将进行训练函数并预训练LLM。

As shown in Figure 5.10, the next section focuses on pretraining the LLM. After model training, we implement alternative text generation strategies and save and load pretrained model weights.

如图5.10所示，下一节将重点介绍LLM的预训练。模型训练后，我们将实现替代文本生成策略并保存和加载预训练模型权重。

5.2 Training an LLM

5.2 训练LLM

In this section, we finally implement the code for pretraining the LLM, our GPTModel. For this, we focus on a straightforward training loop, as illustrated in Figure 5.11, to keep the code concise and readable. However, interested readers can learn about more advanced techniques, including learning rate warmup, cosine annealing, and gradient clipping, in Appendix D, Adding Bells and Whistles to the Training Loop.

在本节中，我们最终实现了预训练LLM（我们的GPTModel）的代码。为此，我们专注于一个简单明了的训练循环，如图5.11所示，以保持代码简洁易读。然而，感兴趣的读者可以在附录D“为训练循环添加花哨功能”中了解更多高级技术，包括学习率预热、余弦退火和梯度裁剪。

在这里插入图片描述

Figure 5.11 A typical training loop for training deep neural networks in PyTorch consists of several steps, iterating over the batches in the training set for several epochs. In each loop, we calculate the loss for each training set batch to determine loss gradients, which we use to update the model weights so that the training set loss is minimized.

图5.11 PyTorch中训练深度神经网络的典型训练循环包括几个步骤，在几个时期内迭代训练集中的批次。在每个循环中，我们计算每个训练集批次的损失以确定损失梯度，使用这些梯度来更新模型权重，从而使训练集损失最小化。

The flowchart in Figure 5.11 depicts a typical PyTorch neural network training workflow, which we use for training an LLM. It outlines eight steps, starting with iterating over each epoch, processing batches, resetting and calculating gradients, updating weights, and concluding with monitoring steps like printing losses and generating text samples. If you are relatively new to training deep neural networks with PyTorch and any of these steps are unfamiliar, consider reading sections A.5 to A.8 in Appendix A, Introduction to PyTorch.

图5.11中的流程图描述了一个典型的PyTorch神经网络训练工作流，我们用它来训练LLM。它概述了八个步骤，从迭代每个周期、处理批次、重置和计算梯度、更新权重，到以打印损失和生成文本示例等监控步骤结束。如果你对使用PyTorch训练深度神经网络还不太熟悉，并且对这些步骤中的任何一个不熟悉，请考虑阅读附录A中的第A.5到A.8节“PyTorch简介”。

In code, we can implement this training flow via the following train_model_simple function:

在代码中，我们可以通过以下train_model_simple函数实现这个训练流程：

Listing 5.3 The main function for pretraining LLMs

5.3 用于预训练LLMs的主要函数

def train_model_simple(model, train_loader, val_loader, optimizer, device, num_epochs,  # 定义train_model_simple函数
                       eval_freq, eval_iter, start_context, tokenizer):
    train_losses, val_losses, track_tokens_seen = [], [], []  # 初始化列表以跟踪损失和看到的词元
    tokens_seen, global_step = 0, -1  # 初始化词元计数和全局步数

    for epoch in range(num_epochs):  # 开始主要训练循环
        model.train()  # 设置模型为训练模式
        for input_batch, target_batch in train_loader:  # 遍历训练数据
            optimizer.zero_grad()  # 重置前一批次迭代的梯度
            loss = calc_loss_batch(input_batch, target_batch, model, device)  # 计算批次损失
            loss.backward()  # 计算损失梯度
            optimizer.step()  # 使用损失梯度更新模型权重
            tokens_seen += input_batch.numel()  # 更新词元计数
            global_step += 1  # 增加全局步数

            if global_step % eval_freq == 0:  # 可选的评估步骤
                train_loss, val_loss = evaluate_model(  # 评估模型性能
                    model, train_loader, val_loader, device, eval_iter)
                train_losses.append(train_loss)  # 添加训练损失到列表
                val_losses.append(val_loss)  # 添加验证损失到列表
                track_tokens_seen.append(tokens_seen)  # 记录看到的词元数
                print(f"Ep {epoch+1} (Step {global_step:06d}): "  # 打印当前训练信息
                      f"Train loss {train_loss:.3f}, Val loss {val_loss:.3f}")

    generate_and_print_sample(  # 生成并打印样本
        model, tokenizer, device, start_context
    )
    return train_losses, val_losses, track_tokens_seen  # 返回训练和验证损失及词元计数

Note that the train_model_simple function we just created uses two functions we have not defined yet: evaluate_model and generate_and_print_sample.

请注意，我们刚刚创建的train_model_simple函数使用了两个尚未定义的函数：evaluate_model和generate_and_print_sample。

The evaluate_model function corresponds to step 7 in Figure 5.11. It prints the training and validation set losses after each model update so we can evaluate whether the training improves the model.

evaluate_model函数对应于图5.11中的步骤7。它在每次模型更新后打印训练和验证集的损失，以便我们评估训练是否改进了模型。

More specifically, the evaluate_model function calculates the loss over the training and validation set while ensuring the model is in evaluation mode with gradient tracking and dropout disabled when calculating the loss over the training and validation sets:

更具体地说，evaluate_model函数在计算训练和验证集的损失时确保模型处于评估模式，禁用梯度跟踪和dropout：

def evaluate_model(model, train_loader, val_loader, device, eval_iter):  # 定义evaluate_model函数
    model.eval()  # 设置模型为评估模式
    with torch.no_grad():  # 禁用梯度跟踪
        train_loss = calc_loss_loader(train_loader, model, device,  # 计算训练集损失
                                      num_batches=eval_iter)
        val_loss = calc_loss_loader(val_loader, model, device,  # 计算验证集损失
                                    num_batches=eval_iter)
    model.train()  # 设置模型为训练模式
    return train_loss, val_loss  # 返回训练和验证损失

Similar to evaluate_model, the generate_and_print_sample function is a convenience function that we use to track whether the model improves during the training. In particular, the generate_and_print_sample function takes a text snippet (start_context) as input, converts it into token IDs, and feeds it to the LLM to generate a text sample using the generate_text_simple function we used earlier:

与evaluate_model类似，generate_and_print_sample函数是一个方便的函数，我们用它来跟踪模型在训练期间是否改进。特别是，generate_and_print_sample函数将一个文本片段（start_context）作为输入，将其转换为词元ID，并将其输入LLM以使用我们之前使用的generate_text_simple函数生成一个文本样本：

def generate_and_print_sample(model, tokenizer, device, start_context):  # 定义generate_and_print_sample函数
    model.eval()  # 设置模型为评估模式
    context_size = model.pos_emb.weight.shape[0]  # 获取上下文大小
    encoded = text_to_token_ids(start_context, tokenizer).to(device)  # 将文本转换为词元ID并移动到设备
    with torch.no_grad():  # 禁用梯度跟踪
        token_ids = generate_text_simple(  # 生成文本词元ID
            model=model, idx=encoded,
            max_new_tokens=50, context_size=context_size
        )
    decoded_text = token_ids_to_text(token_ids, tokenizer)  # 将词元ID转换为文本
    print(decoded_text.replace("\n", " "))  # 打印生成的文本，以紧凑格式显示
    model.train()  # 设置模型为训练模式

While the evaluate_model function gives us a numeric estimate of the model’s training progress, this generate_and_print_sample test function provides a concrete text example generated by the model to judge its capabilities during training.

虽然evaluate_model函数给出了模型训练进度的数字估计，但这个generate_and_print_sample测试函数提供了一个由模型生成的具体文本示例，以判断其在训练期间的能力。

`
AdamW

Adam optimizers are a popular choice for training deep neural networks. However, in our training loop, we opt for the AdamW optimizer. AdamW is a variant of Adam that improves the weight decay approach, which aims to minimize model complexity and prevent overfitting by penalizing larger weights. This adjustment allows AdamW to achieve more effective regularization and better generalization and is thus frequently used in the training of LLMs.

Adam优化器是训练深度神经网络的常用选择。然而，在我们的训练循环中，我们选择了AdamW优化器。AdamW是Adam的一种变体，它改进了权重衰减方法，旨在通过惩罚较大的权重来最小化模型复杂性并防止过拟合。这一调整使得AdamW能够实现更有效的正则化和更好的泛化，因此在LLMs的训练中经常使用。

Let’s see this all in action by training a GPTModel instance for 10 epochs using an AdamW optimizer and the train_model_simple function we defined earlier.

让我们通过使用AdamW优化器和之前定义的train_model_simple函数训练一个GPTModel实例10个周期来实际演示这一切。

torch.manual_seed(123)  # 设置随机种子
model = GPTModel(GPT_CONFIG_124M)  # 初始化模型
model.to(device)  # 将模型移动到设备
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)  # 使用AdamW优化器
num_epochs = 10  # 训练周期数
train_losses, val_losses, tokens_seen = train_model_simple(  # 调用train_model_simple函数
    model, train_loader, val_loader, optimizer, device,
    num_epochs=num_epochs, eval_freq=5, eval_iter=1,
    start_context="Every effort moves you", tokenizer=tokenizer
)

Executing the train_model_simple function starts the training process, which takes about 5 minutes on a MacBook Air or a similar laptop to complete. The output printed during this execution is as follows:

执行train_model_simple函数开始训练过程，在MacBook Air或类似的笔记本电脑上大约需要5分钟完成。在此过程中打印的输出如下：

Ep 1 (Step 000000): Train loss 9.781, Val loss 9.933
Ep 1 (Step 000005): Train loss 8.111, Val loss 8.339
Every effort moves you,,,,,,,,,,,,.                                     
Ep 2 (Step 000010): Train loss 6.661, Val loss 7.048
Ep 2 (Step 000015): Train loss 5.961, Val loss 6.616
Every effort moves you, and, and, and, and, and, and, and, and, and, and, and, and, and, and, and, and, and, and, and, and, and, and,, and, and,
[...] Results are truncated to save space
Ep 9 (Step 000080): Train loss 0.541, Val loss 6.393
Every effort moves you?"  "Yes--quite insensible to the irony. She wanted him vindicated--and by me!"  He laughed again, and threw back the window-curtains, I had the donkey. "There were days when I
Ep 10 (Step 000085): Train loss 0.391, Val loss 6.452
Every effort moves you know," was one of the axioms he laid down across the Sevres and silver of an exquisitely appointed luncheon-table, when, on a later day, I had again run over from Monte Carlo; and Mrs. Gis

As we can see, based on the results printed during the training, the training loss improves drastically, starting with a value of 9.558 and converging to 0.762. The language skills of the model have improved quite a lot. In the beginning, the model is only able to append commas to the start context (“Every effort moves you,”) or repeat the word “and”. At the end of the training, it can generate grammatically correct text.

正如我们所见，根据训练期间打印的结果，训练损失显著改善，从9.558开始收敛到0.762。模型的语言技能大大提高。起初，模型只能在起始上下文（“Every effort moves you,”）后附加逗号或重复“and”这个词。在训练结束时，它可以生成语法正确的文本。

Similar to the training set loss, we can see that the validation loss starts high (9.856) and decreases during the training. However, it never becomes as small as the training set loss and remains at 6.372 after the 10th epoch.

与训练集损失相似，我们可以看到验证损失开始时很高（9.856），并在训练过程中逐渐减少。然而，它从未变得像训练集损失那样小，并在第10周期后保持在6.372。

Before discussing the validation loss in more detail, let’s create a simple plot that shows the training and validation set losses side by side:

在更详细地讨论验证损失之前，让我们创建一个简单的图表，显示训练和验证集损失的并排对比：

import matplotlib.pyplot as plt  # 导入matplotlib.pyplot库作为plt

def plot_losses(epochs_seen, tokens_seen, train_losses, val_losses):  # 定义plot_losses函数
    fig, ax1 = plt.subplots(figsize=(5, 3))  # 创建一个图形和一个子图，图形大小为5x3
    ax1.plot(epochs_seen, train_losses, label="Training loss")  # 在第一个子图上绘制训练损失
    ax1.plot(epochs_seen, val_losses, linestyle="-.", label="Validation loss")  # 在第一个子图上绘制验证损失，使用点线样式
    ax1.set_xlabel("Epochs")  # 设置x轴标签为“Epochs”
    ax1.set_ylabel("Loss")  # 设置y轴标签为“Loss”
    ax1.legend(loc="upper right")  # 设置图例位置为右上角
    ax2 = ax1.twinx()  # 创建共享同一y轴的第二个x轴  #A
    ax2.plot(tokens_seen, train_losses, alpha=0)  # 对齐刻度的隐形图  #B
    ax2.set_xlabel("Tokens seen")  # 设置第二个x轴标签为“Tokens seen”
    fig.tight_layout()  # 自动调整子图参数以填充整个图形区域
    plt.show()  # 显示图形

epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))  # 创建一个从0到num_epochs的线性张量，长度为训练损失的长度
plot_losses(epochs_tensor, tokens_seen, train_losses, val_losses)  # 调用plot_losses函数，传入epochs_tensor, tokens_seen, train_losses, val_losses

The resulting training and validation loss plot is shown in Figure 5.12.

生成的训练和验证损失图如图5.12所示。

在这里插入图片描述

Figure 5.12 At the beginning of the training, we observe that both the training and validation set losses sharply decrease, which is a sign that the model is learning. However, the training set loss continues to decrease past the second epoch, whereas the validation loss stagnates. This is a sign that the model is still learning, but it’s overfitting to the training set past epoch 2.

图5.12 在训练开始时，我们观察到训练集和验证集的损失都急剧下降，这是模型正在学习的迹象。然而，训练集损失在第二个周期后继续下降，而验证集损失停滞不前。这表明模型仍在学习，但在第二个周期后过拟合了训练集。

As Figure 5.12 shows, both the training and validation losses start to improve for the first epoch. However, the losses start to diverge past the second epoch. This divergence and the fact that the validation loss is much larger than the training loss indicate that the model is overfitting to the training data. We can confirm that the model memorizes the training data verbatim by searching for the generated text snippets, such as “quite insensible to the irony” in the “The Verdict” text file.

如图5.12所示，训练和验证损失在第一个周期开始改善。然而，损失在第二个周期后开始分化。这种分化以及验证损失远大于训练损失的事实表明模型正在对训练数据过拟合。我们可以通过搜索生成的文本片段来确认模型逐字记住了训练数据，例如“在‘The Verdic’文本文件中‘quite insensible to the iron’”。

This memorization is expected since we are working with a very, very small training dataset and training the model for multiple epochs. Usually, it’s common to train a model on a much, much larger dataset for only one epoch.

这种记忆是可以预料的，因为我们正在使用非常非常小的训练数据集，并且训练模型多个周期。通常，常见的是在一个非常非常大的数据集上训练模型只有一个周期。

As mentioned earlier, interested readers can try to train the model on 60,000 public domain books from Project Gutenberg, where this overfitting does not occur; see appendix B for details.

如前所述，感兴趣的读者可以尝试在Project Gutenberg的60,000本公共领域图书上训练模型，在这种情况下不会发生过拟合；详见附录B。

In the upcoming section, as shown in Figure 5.13, we explore sampling methods employed by LLMs to mitigate memorization effects, resulting in more novel generated text.

在接下来的部分中，如图5.13所示，我们将探讨LLMs采用的采样方法，以减轻记忆效应，从而生成更为新颖的文本。

在这里插入图片描述

Figure 5.13 Our model can generate coherent text after implementing the training function. However, it often memorizes passages from the training set verbatim. The following section covers strategies to generate more diverse output texts.

图5.13 我们的模型在实现训练功能后可以生成连贯的文本。然而，它经常逐字记住训练集中的段落。以下部分介绍了生成更多样化输出文本的策略。

As illustrated in Figure 5.13, the next section will cover text generation strategies for LLM to reduce training data memorization and increase the originality of the LLM-generated text before we cover weight loading and saving and loading pretrained weights from OpenAI’s GPT model.

如图5.13所示，在我们介绍权重加载和保存以及从OpenAI的GPT模型加载预训练权重之前，下一部分将介绍用于LLM的文本生成策略，以减少训练数据的记忆并增加LLM生成文本的原创性。

5.3 Decoding strategies to control randomness

5.3 控制随机性的解码策略

In this section, we will cover text generation strategies (also called decoding strategies) to generate more original text. First, we briefly revisit the generate_text_simple function from the previous chapter that we used inside the generate_and_print_sample earlier in this chapter. Then, we will cover two techniques, temperature scaling, and top-k sampling, to improve this function.

在本节中，我们将介绍文本生成策略（也称为解码策略），以生成更原创的文本。首先，我们将简要回顾上一章中使用的generate_text_simple函数，该函数在本章前面使用的generate_and_print_sample函数中。然后，我们将介绍两种技术，温度缩放和top-k采样，以改进此功能。

We begin by transferring the model back from the GPU to the CPU since inference with a relatively small model does not require a GPU. Also, after training, we put the model into evaluation mode to turn off random components such as dropout:

我们首先将模型从GPU传回CPU，因为相对较小的模型推理不需要GPU。此外，在训练后，我们将模型置于评估模式，以关闭随机组件，如dropout：

model.to("cpu")  # 将模型移至CPU
model.eval()  # 设置模型为评估模式

Next, we plug the GPTModel instance (model) into the generate_text_simple function, which uses the LLM to generate one token at a time:

接下来，我们将GPTModel实例（model）插入generate_text_simple函数，该函数使用LLM一次生成一个词元：

tokenizer = tiktoken.get_encoding("gpt2")  # 获取词元编码器
token_ids = generate_text_simple(  # 调用generate_text_simple函数生成词元
    model=model,
    idx=text_to_token_ids("Every effort moves you", tokenizer),  # 将文本转换为词元ID
    max_new_tokens=25,  # 最大生成词元数为25
    context_size=GPT_CONFIG_124M["context_length"]  # 上下文大小为GPT_CONFIG_124M的context_length
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))  # 打印生成的文本

The generated text is as follows:

生成的文本如下：

Output text:

输出文本：

Every effort moves you know," was one of the axioms he laid down across the Sevres and silver of an exquisitely appointed luncheon-table, when, on a later day, I had again run

你知道，每一次努力都是他在塞夫勒瓷器和精美布置的午餐桌上提出的格言之一。后来一天，我又从蒙特卡洛赶回来

As explained earlier in section 5.1.2, the generated token is selected at each generation step corresponding to the largest probability score among all tokens in the vocabulary.

如前面5.1.2节所述，在每次生成步骤中选择词元时，生成的词元对应于词汇表中所有词元的最大概率分数。

The following subsections introduce two concepts to control the randomness and diversity of the generated text: temperature scaling and top-k sampling.

以下小节介绍了两个控制生成文本随机性和多样性的概念：温度缩放和top-k采样。

5.3.1 Temperature scaling

5.3.1 温度缩放

This section introduces temperature scaling, a technique that adds a probabilistic selection process to the next-token generation task.

本节介绍了温度缩放，这是一种为下一个词元生成任务添加概率选择过程的技术。

Previously, inside the generate_text_simple function, we always sampled the token with the highest probability as the next token using torch.argmax, also known as greedy decoding. To generate text with more variety, we can replace the argmax with a function that samples from a probability distribution (here, the probability scores the LLM generates for each vocabulary entry at each token generation step).

以前，在generate_text_simple函数中，我们总是使用torch.argmax选择下一个词元，也称为=贪婪解码==。为了生成更多样化的文本，我们可以将argmax替换为从概率分布中采样的函数（在这里，概率分数是LLM在每个词元生成步骤中为每个词汇条目生成的）。

To illustrate the probabilistic sampling with a concrete example, let’s briefly discuss the next-token generation process using a very small vocabulary for illustration purposes:

为了用具体示例说明概率采样，我们简要讨论使用非常小的词汇表进行说明的下一个词元生成过程：

vocab = {  # 定义词汇表
    "closer": 0,
    "every": 1,
    "effort": 2,
    "forward": 3,
    "inches": 4,
    "moves": 5,
    "pizza": 6,
    "toward": 7,
    "you": 8,
}
inverse_vocab = {v: k for k, v in vocab.items()}  # 定义反向词汇表

Next, assume the LLM is given the start context “every effort moves you” and generates the following next-token logits:

接下来，假设LLM给定起始上下文“every effort moves you”并生成以下下一个词元的logits：

next_token_logits = torch.tensor(  # 定义下一个词元的logits
    [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]
)

As discussed in the previous chapter, inside the generate_text_simple, we convert the logits into probabilities via the softmax function and obtain the token ID corresponding to the generated token via the argmax function, which we can then map back into text via the inverse vocabulary:

如前一章所述，在generate_text_simple内部，我们通过softmax函数将logits转换为概率，并通过argmax函数获得对应于生成词元的词元ID，然后我们可以通过反向词汇表将其映射回文本：

probas = torch.softmax(next_token_logits, dim=0)  # 将logits转换为概率
next_token_id = torch.argmax(probas).item()  # 获取概率最大的词元ID
print(inverse_vocab[next_token_id])  # 打印生成的词元

Since the largest logit value, and correspondingly the largest softmax probability score, is in the fourth position (index position 3 since Python uses 0-indexing), the generated word is to implement a probabilistic sampling process, we can now replace the argmax with the multinomial function in PyTorch:

由于最大的logit值及相应的softmax概率分数在第四位（索引位置3，因为Python使用0索引），生成的单词是为了实现概率采样过程，我们现在可以用PyTorch中的multinomial函数替换argmax：

torch.manual_seed(123)  # 设置随机种子
next_token_id = torch.multinomial(probas, num_samples=1).item()  # 使用多项式分布采样
print(inverse_vocab[next_token_id])  # 打印生成的词元

The printed output is “forward” just like before. What happened? The multinomial function samples the next token proportional to its probability score. In other words, “forward” is still the most likely token and will be selected by multinomial most of the time but not all the time. To illustrate this, let’s implement a function that repeats this sampling 1000 times:

打印输出仍然是"forward"。发生了什么？多项式函数根据概率分数按比例采样下一个词元。换句话说，“forward”仍然是最可能的词元，并且在大多数情况下会被多项式选择，但不是每次都选择。为了说明这一点，让我们实现一个函数，重复这个采样1000次：

def print_sampled_tokens(probas):  # 定义print_sampled_tokens函数
    torch.manual_seed(123)  # 设置随机种子
    sample = [torch.multinomial(probas, num_samples=1).item() for i in range(1_000)]  # 进行1000次采样
    sampled_ids = torch.bincount(torch.tensor(sample))  # 计算每个词元的出现频率
    for i, freq in enumerate(sampled_ids):  # 遍历词元频率
        print(f"{freq} x {inverse_vocab[i]}")  # 打印词元及其出现频率
print_sampled_tokens(probas)  # 调用print_sampled_tokens函数

The sampling output is as follows:

采样输出如下：

73 x closer
0 x every
0 x effort
582 x forward
2 x inches
0 x moves
0 x pizza
343 x toward

As we can see based on the output, the word “forward” is sampled most of the time (582 out of 1000 times), but other tokens such as “closer”, “inches”, and “toward” will also be sampled some of the time. This means that if we replaced the argmax function with the multinomial function inside the generate_and_print_sample function, the LLM would sometimes generate texts such as “every effort moves you toward”, “every effort moves you inches”, and “every effort moves you closer” instead of “every effort moves you forward”.

正如我们根据输出所见，单词“forward”在大多数情况下被采样（1000次中有582次），但其他词元如“closer”、“inches”和“toward”也会在某些情况下被采样。这意味着如果我们在generate_and_print_sample函数中将argmax函数替换为多项式函数，LLM有时会生成类似“every effort moves you toward”、“every effort moves you inches”和“every effort moves you closer”的文本，而不是“every effort moves you forward”。

We can further control the distribution and selection process via a concept called temperature scaling, where temperature scaling is just a fancy description for dividing the logits by a number greater than 0:

我们可以通过一种称为温度缩放的概念进一步控制分布和选择过程，温度缩放只是将logits除以大于0的数字的花哨描述：

def softmax_with_temperature(logits, temperature):  # 定义带温度的softmax函数
    scaled_logits = logits / temperature  # 对logits进行温度缩放
    return torch.softmax(scaled_logits, dim=0)  # 返回softmax结果
temperatures = [1, 0.1, 5]  #A 定义温度列表
scaled_probas = [softmax_with_temperature(next_token_logits, T) for T in temperatures]  # 对每个温度计算缩放后的概率
x = torch.arange(len(vocab))  # 定义x轴刻度
bar_width = 0.15  # 定义柱宽
fig, ax = plt.subplots(figsize=(5, 3))  # 创建图形和子图
for i, T in enumerate(temperatures):  # 遍历温度列表
    rects = ax.bar(x + i * bar_width, scaled_probas[i],  # 绘制柱状图
                   bar_width, label=f'Temperature = {T}')
ax.set_ylabel('Probability')  # 设置y轴标签为“Probability”
ax.set_xticks(x)  # 设置x轴刻度
ax.set_xticklabels(vocab.keys(), rotation=90)  # 设置x轴刻度标签
ax.legend()  # 显示图例
plt.tight_layout()  # 调整子图布局
plt.show()  # 显示图形

temperatures greater than 1 result in more uniformly distributed token probabilities, and temperatures smaller than 1 will result in more confident (sharper or more peaky) distributions. Let’s illustrate this by plotting the original probabilities alongside probabilities scaled with d 在这里插入图片描述

Figure 5.14 A temperature of 1 represents the unscaled probability scores for each token in the vocabulary. Decreasing the temperature to 0.1 sharpens the distribution, so the most likely token (here “forward”) will have an even higher probability score. Vice versa, increasing the temperature to 5 makes the distribution more uniform.

图5.14 温度为1表示词汇表中每个词元的未缩放概率分数。将温度降低到0.1会使分布变得更尖锐，因此最可能的词元（此处为“forward”）将具有更高的概率分数。反之，温度增加到5使分布更加均匀。

A temperature of 1 divides the logits by 1 before passing them to the softmax function to compute the probability scores. In other words, using a temperature of 1 is the same as not using any temperature scaling. In this case, the tokens are selected with a probability equal to the original softmax probability scores via the multinomial sampling function in PyTorch.

温度为1时，将logits除以1后传递给softmax函数以计算概率分数。换句话说，使用温度1与不使用任何温度缩放相同。在这种情况下，通过PyTorch中的多项式采样函数选择词元的概率等于原始的softmax概率分数。

Also, as we can see in Figure 5.14, applying very small temperatures, such as 0.1, will result in sharper distributions such that the behavior of the multinomial function selects the most likely token (here: “forward”) almost 100% of the time, approaching the behavior of the argmax function. Vice versa, a temperature of 5 results in a more uniform distribution where other tokens are selected more often. This can add more variety to the generated texts but also more often results in nonsensical text. For example, using the temperature of 5 results in texts such as “every effort moves you pizza” about 4% of the time.

此外，正如我们在图5.14中看到的那样，应用非常小的温度（例如0.1）将导致更尖锐的分布，使得多项式函数的行为几乎100%地选择最可能的词元（此处为：“forward”），接近argmax函数的行为。反之，温度为5导致分布更加均匀，其他词元被选择的频率更高。这可以为生成的文本增加更多的多样性，但也更频繁地导致无意义的文本。例如，使用温度5大约4%的情况下会生成“every effort moves you pizza”这样的文本。

EXERCISE 5.1

练习 5.1

Use the print_sampled_tokens function to print the sampling frequencies of the softmax probabilities scaled with the temperatures shown in Figure 5.13. How often is the word “pizza” sampled in each case? Can you think of a faster and more accurate way to determine how often the word “pizza” is sampled?

使用print_sampled_tokens函数打印图5.13中显示的温度缩放的softmax概率的采样频率。在每种情况下“pizza”这个词被采样的频率是多少？你能想到一种更快更准确的方法来确定“pizza”这个词被采样的频率吗？

5.3.2 Top-k sampling

5.3.2 Top-k采样

In the previous section, we implemented a probabilistic sampling approach coupled with temperature scaling to increase the diversity of the outputs. We saw that higher temperature values result in more uniformly distributed next-token probabilities, which result in more diverse outputs as it reduces the likelihood of the model repeatedly selecting the most probable token. This method allows for exploring less likely but potentially more interesting and creative paths in the generation process. However, One downside of this approach is that it sometimes leads to grammatically incorrect or completely nonsensical outputs such as “every effort moves you pizza”.

在上一节中，我们实现了一种概率采样方法，结合温度缩放以增加输出的多样性。我们看到较高的温度值导致下一词元概率更均匀分布，这导致更多样化的输出，因为它减少了模型重复选择最可能词元的可能性。这种方法允许在生成过程中探索不太可能但可能更有趣和创造性的路径。然而，这种方法的一个缺点是有时会导致语法错误或完全无意义的输出，例如“every effort moves you pizza”。

In this section, we introduce another concept called top-k sampling, which, when combined with probabilistic sampling and temperature scaling, can improve the text generation results.

在本节中，我们介绍了另一种称为top-k采样的概念，当它与概率采样和温度缩放结合时，可以改进文本生成结果。

In top-k sampling, we can restrict the sampled tokens to the top-k most likely tokens and exclude all other tokens from the selection process by masking their probability scores, as illustrated in Figure 5.15.

在top-k采样中，我们可以将采样词元限制为top-k最可能的词元，并通过掩蔽其概率分数将所有其他词元排除在选择过程之外，如图5.15所示。

在这里插入图片描述

Figure 5.15 Using top-k sampling with k=3, we focus on the 3 tokens associated with the highest logits and mask out all other tokens with negative infinity (-inf) before applying the softmax function. This results in a probability distribution with a probability value 0 assigned to all non-top-k tokens.

图5.15 使用k=3的top-k采样，我们关注与最高logits关联的3个词元，并在应用softmax函数之前将所有其他词元掩蔽为负无穷大（-inf）。这导致一个概率值为0的概率分布分配给所有非top-k词元。

The approach outlined in Figure 5.15 replaces all non-selected logits with negative infinity value (-inf), such that when computing the softmax values, the probability scores of the non-top-k tokens are 0, and the remaining probabilities sum up to 1. (Careful readers may remember this masking trick from the causal attention module we implemented in chapter 3 in section 3.5.1 Applying a causal attention mask.)

图5.15中概述的方法将所有未选择的logits替换为负无穷大值（-inf），这样在计算softmax值时，非top-k词元的概率分数为0，其余概率总和为1。（仔细的读者可能会记得这个掩蔽技巧来自我们在第3章第3.5.1节应用因果注意力掩蔽中实现的因果注意力模块。）

In code, we can implement the top-k procedure outlined in Figure 5.15 as follows, starting with the selection of the tokens with the largest logit values:

在代码中，我们可以实现图5.15中概述的top-k过程，首先选择具有最大logit值的词元：

top_k = 3  # 设置top-k为3
top_logits, top_pos = torch.topk(next_token_logits, top_k)  # 选择具有最大logit值的词元
print("Top logits:", top_logits)  # 打印最大logit值
print("Top positions:", top_pos)  # 打印最大logit值的位置

Top logits: tensor([6.7500, 6.2800, 4.5100])
#最大logit值：tensor([6.7500, 6.2800, 4.5100])
Top positions: tensor([3, 7, 0])
#最大logit值的位置：tensor([3, 7, 0])

Subsequently, we apply PyTorch’s where function to set the logit values of tokens that are below the lowest logit value within our top-3 selection to negative infinity (-inf).

随后，我们应用PyTorch的where函数将低于我们top-3选择的最低logit值的词元的logit值设置为负无穷大（-inf）。

new_logits = torch.where(  # 使用where函数更新logits
    condition=next_token_logits < top_logits[-1],  #A 确定小于top 3最小值的logits
    input=torch.tensor(float('-inf')),  #B 将这些较低的logits赋值为-inf
    other=next_token_logits  #C 其他词元保留原始logits
)
print(new_logits)  # 打印更新后的logits

The resulting logits for the next token in the 9-token vocabulary are as follows:

9个词元词汇表中下一个词元的结果logits如下：

tensor([4.5100, -inf, -inf, 6.7500, -inf, -inf, -inf, 6.2800, -inf])

Lastly, let’s apply the softmax function to turn these into next-token probabilities:

最后，让我们应用softmax函数将这些转换为下一个词元概率：

topk_probas = torch.softmax(new_logits, dim=0)  # 对更新后的logits应用softmax函数
print(topk_probas)  # 打印top-k概率

As we can see, the result of this top-3 approach are 3 non-zero probability scores:

正如我们所见，这种top-3方法的结果是3个非零概率分数：

tensor([0.0615, 0.0000, 0.0000, 0.5775, 0.0000, 0.0000, 0.0000, 0.3610, 0.0000])

We can now apply the temperature scaling and multinomial function for probabilistic sampling introduced in the previous section to select the next token among these 3 non-zero probability scores to generate the next token. We do this in the next section by modifying the text generation function.

现在，我们可以应用前一节中介绍的温度缩放和多项式函数进行概率采样，在这3个非零概率分数中选择下一个词元以生成下一个词元。我们在下一节通过修改文本生成函数来实现这一点。

5.3.3 Modifying the text generation function

5.3.3 修改文本生成函数

The previous two subsections introduced two concepts to increase the diversity of LLM-generated text: temperature sampling and top-k sampling. In this section, we combine and add these concepts to modify the generate_simple function we used to generate text via the LLM earlier, creating a new generate function:

前两节介绍了两种增加LLM生成文本多样性的概念：温度采样和top-k采样。在本节中，我们结合并添加这些概念以修改我们之前用于通过LLM生成文本的generate_simple函数，创建一个新的generate函数：

Listing 5.4 A modified text generation function with more diversity

5.4 一个具有更多多样性的修改文本生成函数

def generate(model, idx, max_new_tokens, context_size,  # 定义生成函数
             temperature=1.0, top_k=None, eos_id=None):
    for _ in range(max_new_tokens):  #A 遍历最大新词元数
        idx_cond = idx[-context_size:]  # 获取上下文
        with torch.no_grad():  # 禁用梯度计算
            logits = model(idx_cond)  # 获取模型的logits
            logits = logits[:, -1, :]  # 只关注最后一个时间步的logits
        if top_k is not None:  #B 在新部分中，使用top-k采样
            top_logits, _ = torch.topk(logits, top_k)
            min_val = top_logits[:, -1]
            logits = torch.where(
                logits < min_val,  #A 确定小于top 3最小值的logits
                torch.tensor(float('-inf')).to(logits.device),  #B 将这些较低的logits赋值为-inf
                logits  #C 其他词元保留原始logits
            )

        if temperature > 0.0:  #C 这是我们应用温度缩放的新部分
            logits = logits / temperature  # 对logits进行温度缩放
            probs = torch.softmax(logits, dim=-1)  # 应用softmax函数
            idx_next = torch.multinomial(probs, num_samples=1)  # 使用多项式函数进行采样
        else:
            idx_next = torch.argmax(logits, dim=-1, keepdim=True)  #D 在禁用温度缩放时执行贪婪选择
        if idx_next == eos_id:  #E 如果遇到序列结束词元且指定了eos_id，则提前终止生成
            break
        idx = torch.cat((idx, idx_next), dim=1)  # 将新词元添加到索引序列中
    return idx  # 返回索引序列

Let’s now see this new generate function in action:

现在让我们看看这个新的generate函数的实际效果：

torch.manual_seed(123)  # 设置随机种子
token_ids = generate(
    model=model,
    idx=text_to_token_ids("Every effort moves you", tokenizer).to(device),  # 将文本转换为词元ID并移动到设备
    max_new_tokens=15,  # 最大新词元数为15
    context_size=GPT_CONFIG_124M["context_length"],  # 上下文大小
    top_k=25,  # top-k值为25
    temperature=1.4  # 温度值为1.4
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))  # 打印生成的文本

The generated text is as follows:

生成的文本如下：

Output text:

输出文本：

Every effort moves you stand to work on surprise, a one of us had gone with random-

每一次努力都在推动你站起来工作，惊讶的是，我们中的一个人已经随机走了

As we can see, the generated text is very different from the one we previously generated via the generate_simple function at the beginning of section 5.3 (“Every effort moves you know,” was one of the axioms he laid…!"), which was a memorized passage from the training set.

正如我们所见，生成的文本与我们在5.3节开始时通过generate_simple函数生成的文本（“你知道，每一次努力都是他提出的格言之一…！”）非常不同，那是训练集中的记忆段落。

EXERCISE 5.2

练习 5.2

Play around with different temperatures and top-k settings. Based on your observations, can you think of applications where lower temperature and top-k settings are desired? Vice versa, can you think of applications where higher temperature and top-k settings are preferred? (It’s recommended to also revisit this exercise at the end of the chapter after loading the pretrained weights from OpenAI.)

尝试不同的温度和top-k设置。根据您的观察，您能想到哪些应用场景需要较低的温度和top-k设置吗？反之，您能想到哪些应用场景更适合较高的温度和top-k设置吗？（建议在本章结束后加载OpenAI的预训练权重后，重新审视这个练习。）

EXERCISE 5.3

练习 5.3

What are the different combinations of settings for the generate function to force deterministic behavior, that is, disabling the random sampling such that it always produces the same outputs similar to the generate_simple function?

生成函数的不同设置组合是什么，以强制确定性行为，即禁用随机采样，使其总是生成与generate_simple函数相似的相同输出？

So far, we covered how to pretrain LLMs and use them to generate text. The last two sections of this chapter will discuss how we save and load the trained LLM and how we load pretrained weights from OpenAI.

到目前为止，我们介绍了如何预训练LLM并使用它们生成文本。本章的最后两部分将讨论如何保存和加载训练好的LLM以及如何从OpenAI加载预训练权重。

5.4 Loading and saving model weights in PyTorch

5.4 在PyTorch中加载和保存模型权重

In this chapter, we have discussed how to numerically evaluate the training progress and pretrain an LLM from scratch. Even though both the LLM and dataset were relatively small, this exercise showed that pretraining LLMs is computationally expensive. Thus, it is important to be able to save the LLM so that we don’t have to rerun the training every time we want to use it in a new session.

在本章中，我们讨论了如何数值评估训练进度并从头开始预训练LLM。尽管LLM和数据集都相对较小，但此练习表明预训练LLM计算成本高。因此，能够保存LLM非常重要，这样每次在新会话中使用它时就不必重新运行训练。

As illustrated in the chapter overview in Figure 5.16, we cover how to save and load a pretrained model in this section. Then, in the upcoming section, we will load a more capable pretrained GPT model from OpenAI into our GPTModel instance.

如图5.16中的章节概述所示，我们在本节介绍如何保存和加载预训练模型。然后，在接下来的部分中，我们将从OpenAI加载一个更强大的预训练GPT模型到我们的GPTModel实例中。

在这里插入图片描述

Figure 5.16 After training and inspecting the model, it is often helpful to save the model so that we can use or continue training it later, which is the topic of this section before we load the pretrained model weights from OpenAI in the final section of this chapter.

图5.16 在训练和检查模型之后，通常保存模型是有帮助的，这样我们可以在以后使用或继续训练它，这是本节的主题，在本章最后一节我们将从OpenAI加载预训练模型权重。

Fortunately, saving a PyTorch model is relatively straightforward. The recommended way is to save a model’s so-called state_dict, a dictionary mapping each layer to its parameters, using the torch.save function as follows:

幸运的是，保存PyTorch模型相对简单。推荐的方法是使用torch.save函数保存模型的state_dict，一个将每一层映射到其参数的字典，如下所示：

torch.save(model.state_dict(), "model.pth")  # 保存模型权重到model.pth文件

In the preceding code, “model.pth” is the filename where the state_dict is saved. The .pth extension is a convention for PyTorch files, though we could technically use any file extension.

在上面的代码中，“model.pth”是保存state_dict的文件名。尽管从技术上讲，我们可以使用任何文件扩展名，但.pth扩展名是PyTorch文件的惯例。

Then, after saving the model weights via the state_dict, we can load the model weights into a new GPTModel model instance as follows:

然后，在通过state_dict保存模型权重后，我们可以将模型权重加载到新的GPTModel模型实例中，如下所示：

model = GPTModel(GPT_CONFIG_124M)  # 初始化新模型实例
model.load_state_dict(torch.load("model.pth"))  # 加载保存的模型权重
model.eval()  # 设置模型为评估模式

As discussed in chapter 4, dropout helps prevent the model from overfitting to the training data by randomly “dropping out” of a layer’s neurons during training. However, during inference, we don’t want to randomly drop out any of the information the network has learned. Using model.eval() switches the model to evaluation mode for inference, disabling the dropout layers of the model.

正如第4章所讨论的，dropout通过在训练期间随机“丢弃”一层的神经元来帮助防止模型过拟合训练数据。然而，在推理过程中，我们不希望随机丢弃网络学到的任何信息。使用model.eval()将模型切换到评估模式进行推理，禁用模型的dropout层。

If we plan to continue pretraining a model later, for example, using the train_model_simple function we defined earlier in this chapter, saving the optimizer state is also recommended.

如果我们计划稍后继续预训练模型，例如，使用本章前面定义的train_model_simple函数，建议也保存优化器状态。

Adaptive optimizers such as AdamW store additional parameters for each model weight. AdamW uses historical data to adjust learning rates for each model parameter dynamically. Without it, the optimizer resets, and the model may learn suboptimally or even fail to converge properly, which means that it will lose the ability to generate coherent text. Using torch.save, we can save both the model and optimizer state_dict contents as follows:

自适应优化器如AdamW为每个模型权重存储额外的参数。AdamW使用历史数据动态调整每个模型参数的学习率。没有它，优化器会重置，模型可能会次优学习甚至无法正确收敛，这意味着它会失去生成连贯文本的能力。使用torch.save，我们可以同时保存模型和优化器的state_dict内容，如下所示：

torch.save({  # 保存模型和优化器的状态字典
    "model_state_dict": model.state_dict(),  # 模型状态字典
    "optimizer_state_dict": optimizer.state_dict(),  # 优化器状态字典
}, "model_and_optimizer.pth")  # 保存到model_and_optimizer.pth文件

Then, we can restore the model and optimizer states as follows by first loading the saved data via torch.load and then using the load_state_dict method:

然后，我们可以通过先使用torch.load加载保存的数据，再使用load_state_dict方法恢复模型和优化器状态，如下所示：

checkpoint = torch.load("model_and_optimizer.pth")  # 加载保存的检查点
model = GPTModel(GPT_CONFIG_124M)  # 初始化新模型实例
model.load_state_dict(checkpoint["model_state_dict"])  # 加载模型状态字典
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.1)  # 初始化优化器
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])  # 加载优化器状态字典
model.train()  # 设置模型为训练模式

EXERCISE 5.4

练习 5.4

After saving the weights, load the model and optimizer in a new Python session or Jupyter notebook file and continue pretraining it for 1 more epoch using the train_model_simple function.

保存权重后，在新的Python会话或Jupyter笔记本文件中加载模型和优化器，并使用train_model_simple函数继续预训练1个周期。

5.5 Loading pretrained weights from OpenAI

5.5 从OpenAI加载预训练权重

Previously, for educational purposes, we trained a small GPT-2 model using a limited dataset comprising a short-story book. This approach allowed us to focus on the fundamentals without the need for extensive time and computational resources.

之前，为了教育目的，我们使用包含短篇小说书的有限数据集训练了一个小型GPT-2模型。这种方法使我们能够专注于基础知识，而无需大量时间和计算资源。

Fortunately, OpenAI openly shared the weights of their GPT-2 models, thus eliminating the need to invest tens to hundreds of thousands of dollars in retraining the model on a large corpus ourselves.

幸运的是，OpenAI公开分享了他们的GPT-2模型的权重，因此无需自己在大型语料库上重新训练模型，节省了成千上万美元的投资。

In the remainder of this section, we load these weights into our GPTModel class and use the model for text generation. Here, weights refer to the weight parameters that are stored in the .weight attributes of PyTorch’s Linear and Embedding layers, for example. We accessed them earlier via model.parameters() when training the model.

在本节的其余部分，我们将这些权重加载到我们的GPTModel类中，并使用该模型进行文本生成。在这里，权重是指存储在PyTorch的Linear和Embedding层中的.weight属性中的权重参数。例如，我们在训练模型时通过model.parameters()访问它们。

In the next chapters, we will reuse these pretrained weights to finetune the model for a text classification task and follow instructions similar to ChatGPT.

在接下来的章节中，我们将重复使用这些预训练权重来微调模型以完成文本分类任务，并遵循类似于ChatGPT的指令。

Note that OpenAI originally saved the GPT-2 weights via TensorFlow, which we have to install to load the weights in Python. Moreover, the following code will use a progress bar tool called tqdm to track the download process, which we also have to install.

请注意，OpenAI最初通过TensorFlow保存了GPT-2权重，我们必须安装它才能在Python中加载权重。此外，以下代码将使用一个名为tqdm的进度条工具来跟踪下载过程，我们也必须安装它。

You can install these libraries by executing the following command in your terminal:

你可以通过在终端中执行以下命令来安装这些库：

pip install tensorflow>=2.15.0 tqdm>=4.66  # 安装TensorFlow和tqdm库

The download code is relatively long, mostly boilerplate, and not very interesting. Hence, instead of devoting precious space in this chapter to discussing Python code for fetching files from the internet, we download the gpt_download.py Python module directly from this chapter’s online repository:

下载代码相对较长，主要是样板代码，不是很有趣。因此，本章不讨论用于从互联网上获取文件的Python代码，而是直接从本章的在线存储库下载gpt_download.py Python模块：

import urllib.request  # 导入urllib.request库
url = (  # 定义下载URL
    "https://raw.githubusercontent.com/rasbt/"
    "LLMs-from-scratch/main/ch05/"
    "01_main-chapter-code/gpt_download.py"
)
filename = url.split('/')[-1]  # 获取文件名
urllib.request.urlretrieve(url, filename)  # 下载文件

Next, after downloading this file to the local directory of your Python session, readers are encouraged to briefly inspect the contents of this file to ensure that it was saved correctly and contains valid Python code.

接下来，将此文件下载到Python会话的本地目录后，建议读者简要检查此文件的内容，以确保其正确保存并包含有效的Python代码。

We can now import the download_and_load_gpt2 function from the gpt_download.py file as follows, which will load the GPT-2 architecture settings (settings) and weight parameters (params) into our Python session:

我们现在可以从gpt_download.py文件中导入download_and_load_gpt2函数，如下所示，该函数将GPT-2架构设置（settings）和权重参数（params）加载到我们的Python会话中：

from gpt_download import download_and_load_gpt2  # 从gpt_download导入download_and_load_gpt2函数
settings, params = download_and_load_gpt2(model_size="124M", models_dir="gpt2")  # 下载并加载GPT-2设置和参数

Executing the preceding code downloads the following 7 files associated with the 124M parameter GPT-2 model:

执行上述代码会下载与124M参数GPT-2模型相关的以下7个文件：

checkpoint: 100% |██████████████████████████████████████████████████████████████████| 77.0/77.0 [00:00<00:00, 63.9kiB/s]
encoder.json: 100% |████████████████████████████████████████████████████████████████| 1.04M/1.04M [00:00<00:00, 2.20MiB/s]
hparams.json: 100% |████████████████████████████████████████████████████████████████| 90.0/90.0 [00:00<00:00, 78.3kiB/s]
model.ckpt.data-00000-of-00001: 100% |██████████████████████████████████████████████| 498M/498M [01:09<00:00, 7.16MiB/s]
model.ckpt.index: 100% |████████████████████████████████████████████████████████████| 5.21k/5.21k [00:00<00:00, 3.24MiB/s]
model.ckpt.meta: 100% |████████████████████████████████████████████████████████████| 471k/471k [00:00<00:00, 2.46MiB/s]
vocab.bpe: 100% |██████████████████████████████████████████████████████████████████| 456k/456k [00:00<00:00, 1.70MiB/s]

UPDATED DOWNLOAD INSTRUCTIONS

更新下载说明

If the download code does not work for you, it could be due to intermittent internet connection, server issues, or changes in how OpenAI shares the weights of the open-source GPT-2 model. In this case, please visit this chapter’s online code repository at https://github.com/rasbt/LLMs-from-scratch for alternative and updated instructions, and please reach out via the Manning Forum for further questions.

如果下载代码对你不起作用，可能是由于间歇性的互联网连接、服务器问题或OpenAI共享开源GPT-2模型权重的方式发生了变化。在这种情况下，请访问本章的在线代码库https://github.com/rasbt/LLMs-from-scratch获取替代和更新的说明，并通过Manning论坛提出进一步的问题。

After the execution of the previous code has been completed, let’s inspect the contents of settings and params:

在执行前面的代码后，让我们检查settings和params的内容：

print("Settings:", settings)  # 打印设置
print("Parameter dictionary keys:", params.keys())  # 打印参数字典键

The contents are as follows:

内容如下：

Settings: {'n_vocab': 50257, 'n_ctx': 1024, 'n_embd': 768, 'n_head': 12, 'n_layer': 12}
Parameter dictionary keys: dict_keys(['blocks', 'b', 'g', 'wpe', 'wte'])

Both settings and params are Python dictionaries. The settings dictionary stores the LLM architecture settings similarly to our manually defined GPT_CONFIG_124M settings. The params dictionary contains the actual weight tensors. Note that we only printed the dictionary keys because printing the weight contents would take up too much screen space, however, we can inspect these weight tensors by printing the whole dictionary via print(params) or by selecting individual tensors via the respective dictionary keys, for example, the embedding layer weights:

settings和params都是Python字典。settings字典存储LLM架构设置，类似于我们手动定义的GPT_CONFIG_124M设置。params字典包含实际的权重张量。请注意，我们只打印了字典键，因为打印权重内容会占用太多屏幕空间，但我们可以通过print(params)打印整个字典或通过相应的字典键选择单个张量来检查这些权重张量，例如，嵌入层权重：

print(params["wte"])  # 打印词元嵌入权重张量
print("Token embedding weight tensor dimensions:", params["wte"].shape)  # 打印词元嵌入权重张量的维度

The weights of the token embedding layer are as follows:

词元嵌入层的权重如下：

[[ 0.1101301 ... 0.1363697 0.0156028 0.04531523]
 [ 0.04034033 ... 0.0865453 0.00253983 0.04318958]
 [-0.12746179 ... 0.08991534 -0.12972379 -0.08785918]
 ...
 [-0.04453601 ... 0.10435229 0.09783269 -0.06952604]
 [ 0.1860082 ... -0.09625227 0.07847701 -0.02245961]
 [ 0.05135201 ... 0.00704835 0.15519823 0.12067825]]

Token embedding weight tensor dimensions: (50257, 768)

词元嵌入权重张量的维度： (50257, 768)

We downloaded and loaded the weights of the smallest GPT-2 model via the download_and_load_gpt2(model_size=“124M”, …) setting. However, note that OpenAI also shares the weights of larger models: “355M”, “774M”, and “1558M”. The overall architecture of these differently-sized GPT models is the same, as illustrated in Figure 5.17.

我们通过download_and_load_gpt2(model_size=“124M”, …)设置下载并加载了最小的GPT-2模型的权重。但是，请注意，OpenAI还分享了更大模型的权重：“355M”、“774M”和“1558M”。这些不同大小的GPT模型的总体架构是相同的，如图5.17所示。

在这里插入图片描述

Figure 5.17 GPT-2 LLMs come in several different model sizes, ranging from 124 million to 1,558 million parameters. The core architecture is the same, with the only difference being the embedding sizes and the number of times individual components like the attention heads and transformer blocks are repeated.

图5.17 GPT-2 LLM有几种不同的模型大小，从1.24亿到15.58亿个参数不等。核心架构相同，唯一的区别是嵌入大小和注意力头和transformer blocks 等单个组件重复的次数。

As illustrated in Figure 5.17, the overall architecture of the differently-sized GPT-2 models remains the same, except that different architectural elements are repeated different numbers of times, and the embedding size differs. The remaining code in this chapter is also compatible with these larger models.

如图5.17所示，不同大小的GPT-2模型的总体架构保持不变，唯一的区别是不同的架构元素重复的次数不同，嵌入大小也不同。本章中的其余代码也与这些更大的模型兼容。

After loading the GPT-2 model weights into Python, we still need to transfer them from the settings and params dictionaries into our GPTModel instance.

将GPT-2模型权重加载到Python后，我们仍需要将它们从settings和params字典转移到我们的GPTModel实例中。

First, we create a dictionary that lists the differences between the different GPT model sizes, as explained in Figure 5.17:

首先，我们创建一个字典，列出了不同GPT模型大小之间的差异，如图5.17所示：

model_configs = {  # 定义模型配置字典
    "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

Suppose we are interested in loading the smallest model, “gpt2-small (124M)”. We can use the corresponding settings from the model_configs table to update our full-length GPT_CONFIG_124M we defined and used earlier throughout the chapter as follows:

假设我们有兴趣加载最小的模型“gpt2-small (124M)”。我们可以使用model_configs表中的相应设置来更新我们在本章中定义并使用的完整GPT_CONFIG_124M，如下所示：

model_name = "gpt2-small (124M)"  # 选择模型名称
NEW_CONFIG = GPT_CONFIG_124M.copy()  # 复制原始配置
NEW_CONFIG.update(model_configs[model_name])  # 更新配置为选定模型的配置

Careful readers may remember that we used a 256-token length earlier, but the original GPT-2 models from OpenAI were trained with a 1,024-token length, so we have to update the NEW_CONFIG accordingly:

细心的读者可能记得我们之前使用了256词元长度，但OpenAI的原始GPT-2模型是用1024词元长度训练的，所以我们必须相应地更新NEW_CONFIG：

NEW_CONFIG.update({"context_length": 1024})  # 更新上下文长度为1024

Also, OpenAI used bias vectors in the multi-head attention module’s linear layers to implement the query, key, and value matrix computations. Bias vectors are not commonly used in LLMs anymore as they don’t improve the modeling performance and are thus unnecessary. However, since we are working with pretrained weights, we need to match the settings for consistency and enable these bias vectors:

此外，OpenAI在多头注意力模块的线性层中使用了偏置向量来实现查询、键和值矩阵计算。偏置向量在LLM中不再常用，因为它们不会改善建模性能，因此是不必要的。然而，由于我们使用的是预训练权重，我们需要匹配设置以保持一致性并启用这些偏置向量：

NEW_CONFIG.update({"qkv_bias": True})  # 启用偏置向量

We can now use the updated NEW_CONFIG dictionary to initialize a new GPTModel instance:

我们现在可以使用更新后的NEW_CONFIG字典来初始化新的GPTModel实例：

gpt = GPTModel(NEW_CONFIG)  # 使用更新的配置初始化GPT模型
gpt.eval()  # 设置模型为评估模式

By default, the GPTModel instance is initialized with random weights for pretraining. The last step to using OpenAI’s model weights is to override these random weights with the weights we loaded into the params dictionary.

默认情况下，GPTModel实例使用随机权重初始化以进行预训练。使用OpenAI模型权重的最后一步是用我们加载到params字典中的权重覆盖这些随机权重。

For this, we will first define a small assign utility function that checks whether two tensors or arrays (left and right) have the same dimensions or shape and returns the right tensor as trainable PyTorch parameters:

为此，我们将首先定义一个小的assign实用函数，该函数检查两个张量或数组（left和right）是否具有相同的维度或形状，并返回右张量作为可训练的PyTorch参数：

def assign(left, right):  # 定义assign函数
    if left.shape != right.shape:  # 如果形状不匹配
        raise ValueError(f"Shape mismatch. Left: {left.shape}, Right: {right.shape}")  # 抛出错误
    return torch.nn.Parameter(torch.tensor(right))  # 返回右张量作为可训练参数

Next, we define a load_weights_into_gpt function that loads the weights from the params dictionary into a GPTModel instance gpt:

接下来，我们定义一个load_weights_into_gpt函数，该函数将权重从params字典加载到GPTModel实例gpt中：

Listing 5.5 Loading OpenAI weights into our GPT model code
5.5 将OpenAI权重加载到我们的GPT模型代码中

import numpy as np  # 导入numpy库

def load_weights_into_gpt(gpt, params):  # 定义load_weights_into_gpt函数
    gpt.pos_emb.weight = assign(gpt.pos_emb.weight, params["wpe"])  #A 分配位置嵌入权重
    gpt.tok_emb.weight = assign(gpt.tok_emb.weight, params["wte"])  #A 分配词元嵌入权重

    for b in range(len(params["blocks"])):  #B 遍历块
        q_w, k_w, v_w = np.split(  #C 分割权重
            params["blocks"][b]["attn"]["c_attn"]["w"], 3, axis=1)
        gpt.trf_blocks[b].attn.W_query.weight = assign(
            gpt.trf_blocks[b].attn.W_query.weight, q_w.T)
        gpt.trf_blocks[b].attn.W_key.weight = assign(
            gpt.trf_blocks[b].attn.W_key.weight, k_w.T)
        gpt.trf_blocks[b].attn.W_value.weight = assign(
            gpt.trf_blocks[b].attn.W_value.weight, v_w.T)

        q_b, k_b, v_b = np.split(
            params["blocks"][b]["attn"]["c_attn"]["b"], 3, axis=1)
        gpt.trf_blocks[b].attn.W_query.bias = assign(
            gpt.trf_blocks[b].attn.W_query.bias, q_b)
        gpt.trf_blocks[b].attn.W_key.bias = assign(
            gpt.trf_blocks[b].attn.W_key.bias, k_b)
        gpt.trf_blocks[b].attn.W_value.bias = assign(
            gpt.trf_blocks[b].attn.W_value.bias, v_b)

        gpt.trf_blocks[b].attn.out_proj.weight = assign(
            gpt.trf_blocks[b].attn.out_proj.weight,
            params["blocks"][b]["attn"]["c_proj"]["w"].T)
        gpt.trf_blocks[b].attn.out_proj.bias = assign(
            gpt.trf_blocks[b].attn.out_proj.bias,
            params["blocks"][b]["attn"]["c_proj"]["b"])

        gpt.trf_blocks[b].ff.layers[0].weight = assign(
            gpt.trf_blocks[b].ff.layers[0].weight, params["blocks"][b]["mlp"]["c_fc"]["w"].T)  # 分配MLP层权重
        gpt.trf_blocks[b].ff.layers[0].bias = assign(  # 分配MLP层偏置
            gpt.trf_blocks[b].ff.layers[0].bias, 
            params["blocks"][b]["mlp"]["c_fc"]["b"])
        gpt.trf_blocks[b].ff.layers[2].weight = assign(  # 分配MLP层权重
            gpt.trf_blocks[b].ff.layers[2].weight,
            params["blocks"][b]["mlp"]["c_proj"]["w"].T)
        gpt.trf_blocks[b].ff.layers[2].bias = assign(  # 分配MLP层偏置
            gpt.trf_blocks[b].ff.layers[2].bias,
            params["blocks"][b]["mlp"]["c_proj"]["b"])

        gpt.trf_blocks[b].norm1.scale = assign(  # 分配第一个LayerNorm层权重
            gpt.trf_blocks[b].norm1.scale,
            params["blocks"][b]["ln_1"]["g"])
        gpt.trf_blocks[b].norm1.shift = assign(  # 分配第一个LayerNorm层偏置
            gpt.trf_blocks[b].norm1.shift,
            params["blocks"][b]["ln_1"]["b"])
        gpt.trf_blocks[b].norm2.scale = assign(  # 分配第二个LayerNorm层权重
            gpt.trf_blocks[b].norm2.scale,
            params["blocks"][b]["ln_2"]["g"])
        gpt.trf_blocks[b].norm2.shift = assign(  # 分配第二个LayerNorm层偏置
            gpt.trf_blocks[b].norm2.shift,
            params["blocks"][b]["ln_2"]["b"])

    gpt.final_norm.scale = assign(gpt.final_norm.scale, params["g"])  # 分配最终LayerNorm层权重
    gpt.final_norm.shift = assign(gpt.final_norm.shift, params["b"])  # 分配最终LayerNorm层偏置
    gpt.out_head.weight = assign(gpt.out_head.weight, params["wte"])  #D 将输出层权重与词元嵌入层权重绑定

In the load_weights_into_gpt function, we carefully match the weights from OpenAI’s implementation with our GPTModel implementation. To pick a specific example, OpenAI stored the weight tensor for the output projection layer for the first transformer block as params[“blocks”][0][“attn”][“c_proj”][“w”]. In our implementation, this weight tensor corresponds to gpt.trf_blocks[b].attn.out_proj.weight, where gpt is a GPTModel instance.

在load_weights_into_gpt函数中，我们将OpenAI的实现中的权重与我们的GPTModel实现中的权重仔细匹配。举一个具体的例子，OpenAI将第一个transformer 块的输出投影层的权重张量存储为params[“blocks”][0][“attn”][“c_proj”][“w”]。在我们的实现中，该权重张量对应于gpt.trf_blocks[b].attn.out_proj.weight，其中gpt是一个GPTModel实例。

Developing the load_weights_into_gpt function took a lot of guesswork since OpenAI used a slightly different naming convention from ours. However, the assign function would alert us if we try to match two tensors with different dimensions. Also, if we made a mistake in this function, we would notice this as the resulting GPT model would be unable to produce coherent text.

开发load_weights_into_gpt函数花费了很多猜测工作，因为OpenAI使用的命名约定与我们的略有不同。但是，如果我们尝试匹配两个具有不同维度的张量，assign函数会提醒我们。此外，如果我们在此函数中犯了错误，我们会注意到这一点，因为生成的GPT模型将无法生成连贯的文本。

Let’s not try the load_weights_into_gpt out in practice and load the OpenAI model weights into our GPTModel instance gpt:

现在我们在实际中尝试load_weights_into_gpt函数，并将OpenAI模型权重加载到我们的GPTModel实例gpt中：

load_weights_into_gpt(gpt, params)  # 加载权重到GPT模型
gpt.to(device)  # 将模型移动到设备

If the model is loaded correctly, we can now use it to generate new text using our previous generate function:

如果模型加载正确，我们现在可以使用它通过之前的generate函数生成新文本：

torch.manual_seed(123)  # 设置随机种子
token_ids = generate(  # 生成新文本
    model=gpt,
    idx=text_to_token_ids("Every effort moves you", tokenizer).to(device),
    max_new_tokens=25,
    context_size=NEW_CONFIG["context_length"],
    top_k=50,
    temperature=1.5
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))  # 打印输出文本

The resulting text is as follows:

生成的文本如下：

Output text:

输出文本：

Every effort moves you toward finding an ideal new way to practice something! What makes us want to be on top of that?

每次努力都推动你找到一种理想的新方法来练习某些东西！是什么让我们想要在这上面做到最好？

We can be confident that we loaded the model weights correctly because the model can produce coherent text. A tiny mistake in this process would cause the model to fail.

我们可以确信我们正确加载了模型权重，因为模型可以生成连贯的文本。这个过程中的一个小错误都会导致模型失败。

In the following chapters, we will work further with this pretrained model and fine-tune it to classify text and follow instructions.

在接下来的章节中，我们将进一步使用这个预训练模型，并微调它以进行文本分类和遵循指令。

EXERCISE 5.5

练习 5.5

Calculate the training and validation set losses of the GPTModel with the pretrained weights from OpenAI on the “The Verdict” dataset.

计算使用OpenAI预训练权重的GPTModel在"The Verdict"数据集上的训练和验证集损失。

EXERCISE 5.6

练习 5.6

Readers are encouraged to experiment with GPT-2 models of different sizes, for example, the largest 1558M parameter model and compare the generated text to the 124M model we loaded in this chapter.

鼓励读者尝试不同大小的GPT-2模型，例如，最大的1558M参数模型，并将生成的文本与我们在本章加载的124M模型进行比较。

5.6 Summary

5.6 总结

When LLMs generate text, they output one token at a time.
当LLM生成文本时，它们一次输出一个词元。
By default, the next token is generated by converting the model outputs into probability scores and selecting the token from the vocabulary that corresponds to the highest probability score, which is known as “greedy decoding.”
默认情况下，下一个词元是通过将模型输出转换为概率分数并从词汇表中选择对应于最高概率分数的词元生成的，这被称为“贪婪解码”。
Using probabilistic sampling and temperature scaling, we can influence the diversity and coherence of the generated text.
使用概率采样和温度缩放，我们可以影响生成文本的多样性和连贯性。
Training and validation set losses can be used to gauge the quality of text generated by LLM during training.
训练和验证集的损失可用于评估LLM在训练期间生成的文本质量。
Pretraining an LLM involves changing its weights to minimize the training loss.
预训练LLM涉及更改其权重以最小化训练损失。
The training loop for LLMs itself is a standard procedure in deep learning, using a conventional cross entropy loss and AdamW optimizer.
LLM的训练循环本身是深度学习中的标准程序，使用传统的交叉熵损失和AdamW优化器。
Pretraining an LLM on a large text corpus is time- and resource-intensive so we can load openly available weights from OpenAI as an alternative to pretraining the model on a large dataset ourselves.
在大型文本语料库上预训练LLM是耗时且资源密集的，因此我们可以加载OpenAI公开可用的权重，作为我们自己在大型数据集上预训练模型的替代方案。