Build a Large Language Model (From Scratch), Ch5: GPT-4o translation with line-by-line code comments

news2025/1/11 2:26:06


  • Pretraining on Unlabeled Data
    • This chapter covers
    • 5.1 Evaluating generative text models
      • 5.1.1 Using GPT to generate text
      • 5.1.2 Calculating the text generation loss
      • 5.1.3 Calculating the training and validation set losses
    • 5.2 Training an LLM
    • 5.3 Decoding strategies to control randomness
      • 5.3.1 Temperature scaling
      • 5.3.2 Top-k sampling
      • 5.3.3 Modifying the text generation function
    • 5.4 Loading and saving model weights in PyTorch
    • 5.5 Loading pretrained weights from OpenAI
    • 5.6 Summary

Pretraining on Unlabeled Data


This chapter covers


  • Computing the training and validation set losses to assess the quality of LLM-generated text during training
  • Implementing a training function and pretraining the LLM
  • Saving and loading model weights to continue training an LLM
  • Loading pretrained weights from OpenAI

In the previous chapters, we implemented the data sampling and the attention mechanism and coded the LLM architecture. The core focus of this chapter is to implement a training function and pretrain the LLM, as illustrated in Figure 5.1.



Figure 5.1 A mental model of the three main stages of coding an LLM, pretraining the LLM on a general text dataset and finetuning it on a labeled dataset. This chapter focuses on pretraining the LLM, which includes implementing the training code, evaluating the performance, and saving and loading model weights.


As illustrated in Figure 5.1, we will also learn about basic model evaluation techniques to measure the quality of the generated text, which is a requirement for optimizing the LLM during the training process. Moreover, we will discuss how to load pretrained weights, giving our LLM a solid starting point for finetuning in the upcoming chapters.




In the context of LLMs and other deep learning models, weights refer to the trainable parameters that the learning process adjusts. These weights are also known as weight parameters or simply parameters. In frameworks like PyTorch, these weights are stored in linear layers, for example, which we used to implement the multi-head attention module in chapter 3 and the GPTModel in chapter 4. After initializing a layer (new_layer = torch.nn.Linear(…)), we can access its weights through the .weight attribute, new_layer.weight. Additionally, for convenience, PyTorch allows direct access to all of a model’s trainable parameters, including weights and biases, through the method model.parameters(), which we will use later when implementing the model training.


5.1 Evaluating generative text models


We begin this chapter by setting up the LLM for text generation based on code from the previous chapter and discuss basic ways to evaluate the quality of the generated text in this section. The content we cover in this section and the remainder of this chapter is outlined in Figure 5.2.



Figure 5.2 An overview of the topics covered in this chapter. We begin by recapping the text generation from the previous chapter and implementing basic model evaluation techniques that we can use during the pretraining stage.


As shown in Figure 5.2, the next subsection recaps the text generation we set up at the end of the previous chapter before we dive into the text evaluation and calculation of the training and validation losses in the subsequent subsections.


5.1.1 Using GPT to generate text


In this section, we set up the LLM and briefly recap the text generation process we implemented in chapter 4. We begin by initializing the GPT model that we will evaluate and train in this chapter, using the GPTModel class and GPT_CONFIG_124M dictionary from chapter 4:


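import torch                        # import the torch library
from chapter04 import GPTModel      # import GPTModel from chapter 4

GPT_CONFIG_124M = {                 # GPT configuration dictionary
    "vocab_size": 50257,            # vocabulary size
    "context_length": 256,          #A context length shortened from 1,024 to 256 tokens
    "emb_dim": 768,                 # embedding dimension
    "n_heads": 12,                  # number of attention heads
    "n_layers": 12,                 # number of layers
    "drop_rate": 0.1,               #B setting dropout to 0 is possible and common
    "qkv_bias": False               # query-key-value bias
}

torch.manual_seed(123)              # set the random seed
model = GPTModel(GPT_CONFIG_124M)   # initialize the model with the configuration
model.eval()                        # put the model in evaluation mode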

Considering the GPT_CONFIG_124M dictionary, the only adjustment we have made compared to the previous chapter is reducing the context length (context_length) to 256 tokens. This modification reduces the computational demands of training the model, making it possible to carry out the training on a standard laptop computer.


Originally, the GPT-2 model with 124 million parameters was configured to handle up to 1,024 tokens. After the training process, at the end of this chapter, we will update the context size setting and load pretrained weights to work with a model configured for a 1,024-token context length.


Using the GPTModel instance, we adopt the generate_text_simple function introduced in the previous chapter and introduce two handy functions, text_to_token_ids and token_ids_to_text. These functions facilitate the conversion between text and token representations, a technique we will utilize throughout this chapter. To provide a clearer understanding, Figure 5.3 illustrates this process before we dive into the code.


Figure 5.3 Generating text involves encoding text into token IDs that the LLM processes into logit vectors. The logit vectors are then converted back into token IDs, detokenized into a text representation.


Figure 5.3 illustrates a three-step text generation process using a GPT model. First, the tokenizer converts input text into a series of token IDs, as discussed in chapter 2. Second, the model receives these token IDs and generates corresponding logits, which are vectors representing the probability distribution for each token in the vocabulary, as discussed in chapter 4. Third, these logits are converted back into token IDs, which the tokenizer decodes into human-readable text, completing the cycle from textual input to textual output.


In code, we implement the text generation process as follows:


import tiktoken                    # import the tiktoken library
from chapter04 import generate_text_simple # import generate_text_simple from chapter 4

def text_to_token_ids(text, tokenizer):   # convert text into a tensor of token IDs
    encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'}) # encode the text, allowing the special token
    encoded_tensor = torch.tensor(encoded).unsqueeze(0) # add the batch dimension
    return encoded_tensor             # return the encoded tensor

def token_ids_to_text(token_ids, tokenizer): # convert a tensor of token IDs back into text
    flat = token_ids.squeeze(0)       # remove the batch dimension
    return tokenizer.decode(flat.tolist()) # decode into text

start_context = "Every effort moves you" # set the start context
tokenizer = tiktoken.get_encoding("gpt2") # get the GPT-2 tokenizer encoding

token_ids = generate_text_simple(   # generate token IDs with generate_text_simple
    model=model,                    # the model
    idx=text_to_token_ids(start_context, tokenizer), # start context converted to token IDs
    max_new_tokens=10,              # maximum number of new tokens
    context_size=GPT_CONFIG_124M["context_length"]  # context length
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer)) # print the generated text

Using the preceding code, the model generates the following text:


Output text:
Every effort moves you reintingetic wasn refres RexMeCHicular stren

Based on the output, it’s clear the model isn’t yet producing coherent text because it hasn’t undergone training. To define what makes text “coherent” or “high quality,” we have to implement a numerical method to evaluate the generated content. This approach will enable us to monitor and enhance the model’s performance throughout its training process.


The following section introduces how we calculate a loss metric for the generated outputs. This loss serves as an indicator of training progress and success. Furthermore, in subsequent chapters on finetuning LLMs, we will review additional methodologies for assessing model quality.


5.1.2 Calculating the text generation loss


This section explores techniques for numerically assessing text quality generated during training by calculating a so-called text generation loss. We go over this topic step-by-step with a practical example to make the concepts clear and applicable, beginning with a short recap of how the data is loaded from chapter 2 and how the text is generated via the generate_text_simple function from chapter 4.



Figure 5.4 illustrates the overall flow from input text to LLM-generated text using a five-step procedure.


Figure 5.4 For each of the 3 input tokens, shown on the left, we compute a vector containing probability scores corresponding to each token in the vocabulary. The index position of the highest probability score in each vector represents the most likely next token ID. These token IDs associated with the highest probability scores are selected and mapped back into a text that represents the text generated by the model.


The text generation process in Figure 5.4 outlines what the generate_text_simple function from chapter 4 does internally. We need to perform these same initial steps before we can compute a loss that measures the generated text quality later in this section.
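To make these steps concrete without loading the full model, here is a minimal plain-Python sketch of the same flow on a toy 7-token vocabulary. The vocabulary entries and the logit values are made up for illustration (they are assumptions, not outputs of the GPTModel), and the softmax helper is written by hand rather than taken from PyTorch.

```python
import math

# Hypothetical 7-token vocabulary (illustrative only, not the GPT-2 vocabulary)
vocab = ["a", "effort", "every", "forward", "moves", "you", "zoo"]

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logit vectors, one per input token position
logits_per_position = [
    [0.1, 2.0, 0.3, 0.2, 0.1, 0.0, 0.4],     # highest score at index 1
    [0.2, 0.1, 0.0, 0.3, 1.8, 0.5, 0.1],     # highest score at index 4
    [0.0, 0.3, 0.1, 0.2, 0.4, 2.2, 0.6],     # highest score at index 5
]

# Logits -> probabilities
probas = [softmax(row) for row in logits_per_position]

# Probabilities -> most likely token ID per position (argmax)
token_ids = [max(range(len(p)), key=p.__getitem__) for p in probas]

# Token IDs -> text
text = " ".join(vocab[i] for i in token_ids)
print(token_ids)
print(text)
```

An untrained model would produce essentially arbitrary logits here; the values above were chosen so that the argmax lands on a readable continuation.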


Figure 5.4 outlines the text generation process with a small 7-token vocabulary to fit this image on a single page. However, our GPTModel works with a much larger vocabulary consisting of 50,257 tokens; hence, the token IDs in the following code will range from 0 to 50,256 rather than 0 to 6.


Also, Figure 5.4 only shows a single text example (“every effort moves”) for simplicity. In the following hands-on code example that implements the steps in Figure 5.4, we will work with two input examples (“every effort moves” and “I really like”) as inputs for the GPT model:


inputs = torch.tensor([[16833, 3626, 6100],   # ["every effort moves",
                       [   40, 1107,  588]])  #  "I really like"]
# Matching these inputs, `targets` contains the token IDs we want the model to produce
targets = torch.tensor([[ 3626, 6100,   345],   # [" effort moves you",
                        [ 1107,  588, 11311]])  #  " really like chocolate"]

Note that the targets are the inputs but shifted one position forward, a concept we covered in chapter 2 during the implementation of the data loader. This shifting strategy is crucial for teaching the model to predict the next token in a sequence.
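The shift can be sketched in plain Python with the token IDs of "every effort moves you" from the example above. The variable names are illustrative; the data loader from chapter 2 produces the same input/target relationship in tensor form.

```python
# Token IDs for "every effort moves you", taken from the example above
token_ids = [16833, 3626, 6100, 345]

inputs = token_ids[:-1]    # everything except the last token
targets = token_ids[1:]    # the same sequence shifted by one position

# At each position, the model sees the context so far and should predict the target
for i in range(len(inputs)):
    context = inputs[: i + 1]
    print(context, "->", targets[i])
```

Each target is simply the token that follows the corresponding input position, so a single sequence yields one next-token prediction task per position.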


Next, we feed the inputs into the model to calculate logit vectors for the two input examples, each comprising three tokens, and apply the softmax function to transform these logit values into probability scores, which corresponds to step 2 in Figure 5.4:


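with torch.no_grad():                   # disable gradient tracking since we are not training yet
    logits = model(inputs)              # feed the inputs into the model to compute the logits
    probas = torch.softmax(logits, dim=-1)  # probability of each token in the vocabulary
print(probas.shape)                     # print the shape of the probability tensor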

The resulting tensor dimension of the probability score (probas) tensor is as follows:


torch.Size([2, 3, 50257])

The first number, 2, corresponds to the two examples (rows) in the inputs, also known as batch size. The second number, 3, corresponds to the number of tokens in each input (row). Finally, the last number corresponds to the embedding dimensionality, which is determined by the vocabulary size, as discussed in previous chapters.


Following the conversion from logits to probabilities via the softmax function, the generate_text_simple function from chapter 4 then converts the resulting probability scores back into text, as illustrated in steps 3-5 in Figure 5.4.


We can implement steps 3 and 4 by applying the argmax function to the probability scores to obtain the corresponding token IDs:


token_ids = torch.argmax(probas, dim=-1, keepdim=True)  # apply argmax to obtain the token IDs
print("Token IDs:\n", token_ids)  # print the token IDs

Given that we have 2 input batches, each containing 3 tokens, applying the argmax function to the probability scores (step 3 in Figure 5.4) yields 2 sets of outputs, each with 3 predicted token IDs:


Token IDs:
tensor([[[16657],  # first batch
         [  339], 
        [[49906],  # second batch

Finally, step 5 converts the token IDs back into text:


print(f"Targets batch 1: {token_ids_to_text(targets[0], tokenizer)}")  # print the target text of the first batch
print(f"Outputs batch 1: {token_ids_to_text(token_ids[0].flatten(), tokenizer)}")  # print the output text of the first batch

When we decode these tokens, we find that these output tokens are quite different from the target tokens we want the model to generate:


Targets batch 1:  effort moves you
Outputs batch 1:  Armed NeNetflix

The model produces random text that is different from the target text because it has not been trained yet. We now get to the part where we evaluate the performance of the model’s generated text numerically via a so-called loss, as illustrated in Figure 5.5. Not only is this useful for measuring the quality of the generated text, but it’s also a building block for implementing the training function later, which we use to update the model’s weights to improve the generated text.


Figure 5.5 We now implement the text evaluation function in the remainder of this section. In the next section, we apply this evaluation function to the entire dataset we use for model training.


The model training aims to increase the softmax probability in the index positions corresponding to the correct target token IDs, as illustrated in Figure 5.6. This softmax probability is also used in the evaluation metric we are implementing in the remainder of this section to numerically assess the model’s generated outputs: the higher the probability in the correct positions, the better.



Figure 5.6 Before training, the model produces random next-token probability vectors. The goal of model training is to ensure that the probability values corresponding to the highlighted target token IDs are maximized.


Remember that Figure 5.6 displays the softmax probabilities for a compact 7-token vocabulary to fit everything into a single figure. This implies that the starting random values will hover around 1/7, which equals approximately 0.14.


However, the vocabulary we are using for our GPT-2 model has 50,257 tokens, so most of the initial probabilities will hover around 0.00002 (1/50,257).
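As a quick sanity check on these numbers, the following sketch computes the uniform probability for both vocabulary sizes, along with the negative log of that probability. The latter, roughly 10.8 for 50,257 tokens, is a handy reference point for what the loss of an untrained model should look like; treating the untrained model as a uniform guesser is a simplifying assumption.

```python
import math

baselines = {}
for vocab_size in (7, 50257):
    p_uniform = 1 / vocab_size              # probability of each token under a uniform guess
    loss_uniform = -math.log(p_uniform)     # negative log probability of the target token
    baselines[vocab_size] = (p_uniform, loss_uniform)
    print(f"vocab={vocab_size}: p={p_uniform:.5f}, -log(p)={loss_uniform:.4f}")
```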


For each of the two input texts, we can print the initial softmax probability scores corresponding to the target tokens via the following code:


text_idx = 0  # index of the first text
target_probas_1 = probas[text_idx, [0, 1, 2], targets[text_idx]]  # softmax probabilities of the target tokens
print("Text 1:", target_probas_1)  # print the probabilities for text 1

text_idx = 1  # index of the second text
target_probas_2 = probas[text_idx, [0, 1, 2], targets[text_idx]]  # softmax probabilities of the target tokens
print("Text 2:", target_probas_2)  # print the probabilities for text 2

The 3 target token ID probabilities for each batch are as follows:


Text 1: tensor([7.4541e-05, 3.1061e-05, 1.1563e-05])
Text 2: tensor([1.0337e-05, 5.6776e-05, 4.7559e-06])

The goal of training an LLM is to maximize these values, aiming to get them as close to a probability of 1 as possible. This way, we ensure the LLM consistently picks the target token—essentially the next word in the sentence—as the next token it generates.




How do we maximize the softmax probability values corresponding to the target tokens? The big picture is that we update the model weights so that the model outputs higher values for the respective token IDs we want to generate. The weight update is done via a process called backpropagation, a standard technique for training deep neural networks (see sections A.3 to A.7 in Appendix A for more details about backpropagation and model training).


Backpropagation requires a loss function, which calculates the difference between the model’s predicted output (here, the probabilities corresponding to the target token IDs) and the actual desired output. This loss function measures how far off the model’s predictions are from the target values.


In the remainder of this section, we calculate the loss for the probability scores of the two example batches, target_probas_1 and target_probas_2. The main steps are illustrated in Figure 5.7.



Figure 5.7 Calculating the loss involves several steps. Steps 1 to 3 calculate the token probabilities corresponding to the target tensors. These probabilities are then transformed via a logarithm and averaged in steps 4-6.


Since we already applied steps 1-3 listed in Figure 5.7 to obtain target_probas_1 and target_probas_2, we proceed with step 4, applying the logarithm to the probability scores:


log_probas = torch.log(torch.cat((target_probas_1, target_probas_2)))  # concatenate both texts and compute the log probabilities
print(log_probas)  # print the log probabilities

This results in the following values:


tensor([-9.5042, -10.3796, -11.3677, -11.4798, -9.7764, -12.2561])

Working with logarithms of probability scores is more manageable in mathematical optimization than handling the scores directly. This topic is outside the scope of this book, but I’ve detailed it further in a lecture, which is linked in the reference section in appendix B.


Next, we combine these log probabilities into a single score by computing the average (step 5 in Figure 5.7):


avg_log_probas = torch.mean(log_probas)  # compute the average log probability
print(avg_log_probas)  # print the average log probability

The resulting average log probability score is as follows:


tensor(-10.7940)

The goal is to get the average log probability as close to 0 as possible by updating the model’s weights as part of the training process, which we will implement later in section 5.2.


However, in deep learning, the common practice isn’t to push the average log probability up to 0 but rather to bring the negative average log probability down to 0. The negative average log probability is simply the average log probability multiplied by -1, which corresponds to step 6 in Figure 5.7:


neg_avg_log_probas = avg_log_probas * -1  # compute the negative average log probability
print(neg_avg_log_probas)  # print the negative average log probability

This prints tensor(10.7940).


Turning the negative value, -10.7940, into 10.7940 in this way yields what is known in deep learning as the cross entropy loss.
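Steps 4 to 6 can also be retraced without PyTorch. The sketch below starts from the six target-token probabilities printed earlier (rounded to five significant digits, so the result only matches to about three decimals) and uses Python's math module in place of torch.log and torch.mean.

```python
import math

# Target-token probabilities for text 1 and text 2, copied from the printed output above
target_probas = [7.4541e-05, 3.1061e-05, 1.1563e-05,
                 1.0337e-05, 5.6776e-05, 4.7559e-06]

log_probas = [math.log(p) for p in target_probas]      # step 4: log probabilities
avg_log_proba = sum(log_probas) / len(log_probas)      # step 5: average
neg_avg_log_proba = -avg_log_proba                     # step 6: negate

print(neg_avg_log_proba)
```

The printed value agrees with the 10.7940 reported above up to rounding of the input probabilities.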


PyTorch comes in handy here, as it already has a built-in cross_entropy function that takes care of all these 6 steps in Figure 5.7 for us.




At its core, the cross entropy loss is a popular measure in machine learning and deep learning that measures the difference between two probability distributions—typically, the true distribution of labels (here, tokens in a dataset) and the predicted distribution from a model (for instance, the token probabilities generated by an LLM).


In the context of machine learning and specifically in frameworks like PyTorch, the cross_entropy function computes this measure for discrete outcomes, which is similar to the negative average log probability of the target tokens given the model’s generated token probabilities, making the terms cross entropy and negative average log probability related and often used interchangeably in practice.
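The relationship between the two terms can be checked on a small example. The following plain-Python sketch computes the loss in two ways: by the manual route (softmax, pick the target probability, log, average, negate) and directly from the logits via the log-sum-exp form. The logits and targets are made-up values, and both helper functions are illustrative stand-ins rather than PyTorch's implementation.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits_rows, targets):
    """Mean negative log probability of the targets, computed directly from logits."""
    total = 0.0
    for logits, target in zip(logits_rows, targets):
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_sum_exp - logits[target]   # equals -log(softmax(logits)[target])
    return total / len(targets)

# Made-up logits for two positions over a 4-token vocabulary, with made-up targets
logits_rows = [[2.0, 0.5, -1.0, 0.1],
               [0.3, 0.2, 1.5, -0.7]]
targets = [0, 2]

# Manual route: softmax -> target probability -> log -> average -> negate
manual = -sum(math.log(softmax(row)[t])
              for row, t in zip(logits_rows, targets)) / len(targets)

# Direct route from the logits
direct = cross_entropy(logits_rows, targets)

print(manual, direct)
```

Working from logits directly avoids computing very small probabilities before taking the logarithm, which is one reason a fused function is convenient in practice.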


Before we apply the cross entropy function, let’s briefly recall the shape of the logits and target tensors:


print("Logits shape:", logits.shape)  # print the shape of the logits
print("Targets shape:", targets.shape)  # print the shape of the targets

The resulting shapes are as follows:


Logits shape: torch.Size([2, 3, 50257])
Targets shape: torch.Size([2, 3])

As we can see, the logits tensor has three dimensions: batch size, number of tokens, and vocabulary size. The targets tensor has two dimensions: batch size and number of tokens.


For the cross_entropy function in PyTorch, we want to flatten these tensors by combining them over the batch dimension:


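logits_flat = logits.flatten(0, 1)  # merge the batch and token dimensions of the logits
targets_flat = targets.flatten()  # flatten the targets
print("Flattened logits:", logits_flat.shape)  # print the shape of the flattened logits
print("Flattened targets:", targets_flat.shape)  # print the shape of the flattened targets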

The resulting tensor dimensions are as follows:


Flattened logits: torch.Size([6, 50257])
Flattened targets: torch.Size([6])

Remember that the targets are the token IDs we want the LLM to generate, and the logits contain the unscaled model outputs before they enter the softmax function to obtain the probability scores.


Previously, we applied the softmax function, selected the probability scores corresponding to the target IDs, and computed the negative average log probabilities. PyTorch’s cross_entropy function will take care of all these steps for us:


loss = torch.nn.functional.cross_entropy(logits_flat, targets_flat)  # compute the cross entropy loss
print(loss)  # print the loss

The resulting loss is the same as the one we obtained previously when applying the individual steps shown in Figure 5.7 manually:


tensor(10.7940)



Perplexity is a measure often used alongside cross entropy loss to evaluate the performance of models in tasks like language modeling. It can provide a more interpretable way to understand the uncertainty of a model in predicting the next token in a sequence.


Perplexity measures how well the probability distribution predicted by the model matches the actual distribution of the words in the dataset. Similar to the loss, a lower perplexity indicates that the model predictions are closer to the actual distribution.


Perplexity can be calculated as perplexity = torch.exp(loss), which returns tensor(48725.8203) when applied to the previously calculated loss.


Perplexity is often considered more interpretable than the raw loss value because it signifies the effective vocabulary size about which the model is uncertain at each step. In the given example, this would translate to the model being unsure about which among 48,725 words or tokens in the vocabulary to generate as the next token.
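The arithmetic behind this interpretation is short enough to check by hand. The sketch below exponentiates the rounded loss of 10.7940 with Python's math module (so the result differs slightly from the tensor value above, which was computed from the unrounded loss) and compares it with the perplexity of a uniform guess over the whole vocabulary.

```python
import math

loss = 10.7940                       # rounded cross entropy loss from above
perplexity = math.exp(loss)          # perplexity is the exponential of the loss
print(perplexity)

vocab_size = 50257
uniform_loss = math.log(vocab_size)  # loss of a uniform guess over the vocabulary
uniform_ppl = math.exp(uniform_loss) # perplexity of a uniform guess equals the vocabulary size
print(uniform_ppl)
```

The untrained model's perplexity sits just below the vocabulary size, which is consistent with a model that is barely better than guessing uniformly.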


In this section, we calculated the loss for two small text inputs for illustration purposes. In the next section, we apply the loss computation to the entire training and validation sets.


5.1.3 Calculating the training and validation set losses


In this section, we first prepare the training and validation datasets that we will use to train the LLM later in this chapter. Then, we calculate the cross entropy for the training and validation sets, as illustrated in Figure 5.8, which is an important component of the model training process.



Figure 5.8 After computing the cross entropy loss in the previous section, we now apply this loss computation to the entire text dataset that we will use for model training.


To compute the loss on the training and validation datasets as illustrated in Figure 5.8, we use a very small text dataset, the “The Verdict” short story by Edith Wharton, which we have already worked with in chapter 2. By selecting a text from the public domain, we circumvent any concerns related to usage rights. Additionally, the reason why we use such a small dataset is that it allows for the execution of code examples on a standard laptop computer in a matter of minutes, even without a high-end GPU, which is particularly advantageous for educational purposes.


Interested readers can also use the supplementary code of this book to prepare a larger-scale dataset consisting of more than 60,000 public domain books from Project Gutenberg and train an LLM on these (see appendix D for details).




To put the scale of our project into perspective, consider the training of the 7 billion parameter Llama 2 model, a relatively popular openly available LLM. This model required 184,320 GPU hours on expensive A100 GPUs, processing 2 trillion tokens. At the time of writing, running an 8xA100 cloud server on AWS costs around $30 per hour. A rough estimate puts the total training cost of such an LLM at around $690,000 (calculated as 184,320 hours divided by 8, then multiplied by $30).
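The estimate can be reproduced in a few lines. The figures are the ones quoted above; the variable names are illustrative.

```python
gpu_hours = 184_320            # total A100 GPU hours quoted above
gpus_per_server = 8            # an 8xA100 cloud server
cost_per_server_hour = 30      # approximate hourly cost in US dollars

server_hours = gpu_hours / gpus_per_server          # hours on an 8-GPU server
total_cost = server_hours * cost_per_server_hour    # rough total training cost

print(f"{server_hours:,.0f} server hours, about ${total_cost:,.0f}")
```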


The following code loads the “The Verdict” short story we used in chapter 2:


file_path = "the-verdict.txt"  # path to the text file
with open(file_path, "r", encoding="utf-8") as file:  # open the file in read mode
    text_data = file.read()  # read the file contents

After loading the dataset, we can check the number of characters and tokens in the dataset:


total_characters = len(text_data)  # total number of characters
total_tokens = len(tokenizer.encode(text_data))  # total number of tokens
print("Characters:", total_characters)  # print the character count
print("Tokens:", total_tokens)  # print the token count

The output is as follows:


Characters: 20479
Tokens: 5145

With just 5,145 tokens, the text might seem too small to train an LLM, but as mentioned earlier, it’s for educational purposes so that we can run the code in minutes instead of weeks. Plus, we will be loading pretrained weights from OpenAI into our GPTModel code at the end of this chapter.


Next, we divide the dataset into a training and a validation set and use the data loaders from chapter 2 to prepare the batches for LLM training. This process is visualized in Figure 5.9.


An example using stride equal to the context length (here: 4) as shown below:


Figure 5.9 When preparing the data loaders, we split the input text into training and validation set portions. Then, we tokenize the text (only shown for the training set portion for simplicity) and divide the tokenized text into chunks of a user-specified length (here 6). Finally, we shuffle the rows and organize the chunked text into batches (here, batch size 2), which we can use for model training.


For visualization purposes, Figure 5.9 uses a max_length=6 due to spatial constraints. However, for the actual data loaders we are implementing, we set the max_length equal to the 256-token context length that the LLM supports so that the LLM sees longer texts during training.
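The chunking step can be sketched in plain Python. The helper make_chunks below is hypothetical and only loosely modeled on the sliding-window idea from chapter 2; it is not the create_dataloader_v1 implementation, and it works on a made-up list of token IDs rather than the tokenized short story.

```python
def make_chunks(token_ids, max_length, stride):
    """Return (input, target) pairs; hypothetical helper for illustration only."""
    pairs = []
    # Stop early enough that every target chunk still has max_length tokens
    for start in range(0, len(token_ids) - max_length, stride):
        input_chunk = token_ids[start : start + max_length]
        target_chunk = token_ids[start + 1 : start + max_length + 1]  # shifted by one
        pairs.append((input_chunk, target_chunk))
    return pairs

token_ids = list(range(20))                    # stand-in for real token IDs
pairs = make_chunks(token_ids, max_length=6, stride=6)

for input_chunk, target_chunk in pairs:
    print(input_chunk, "->", target_chunk)
```

With the stride equal to max_length, the input chunks do not overlap, so each token appears in at most one input chunk per pass over the data.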




We are training the model with training data presented in similarly-sized chunks for simplicity and efficiency. However, in practice, it can also be beneficial to train an LLM with variable-length inputs to help the LLM to better generalize across different types of inputs when it is being used.


To implement the data splitting and loading visualized in Figure 5.9, we first define a train_ratio to use 90% of the data for training and the remaining 10% as validation data for model evaluation during training:


train_ratio = 0.90  # fraction of the data used for training
split_idx = int(train_ratio * len(text_data))  # index at which to split the text
train_data = text_data[:split_idx]  # training portion
val_data = text_data[split_idx:]  # validation portion

Using the train_data and val_data subsets, we can now create the respective data loader reusing the create_dataloader_v1 code from chapter 2:


from chapter02 import create_dataloader_v1  # import create_dataloader_v1 from chapter 2
torch.manual_seed(123)  # set the random seed

train_loader = create_dataloader_v1(
    train_data,  # training data
    batch_size=2,  # batch size
    max_length=GPT_CONFIG_124M["context_length"],  # chunk length
    stride=GPT_CONFIG_124M["context_length"],  # stride between chunks
    drop_last=True,  # drop the last incomplete batch
    shuffle=True,  # shuffle the training data
    num_workers=0  # number of worker processes
)

val_loader = create_dataloader_v1(
    val_data,  # validation data
    batch_size=2,  # batch size
    max_length=GPT_CONFIG_124M["context_length"],  # chunk length
    stride=GPT_CONFIG_124M["context_length"],  # stride between chunks
    drop_last=False,  # keep the last incomplete batch
    shuffle=False,  # do not shuffle the validation data
    num_workers=0  # number of worker processes
)

We used a relatively small batch size in the preceding code to reduce the computational resource demand because we were working with a very small dataset. In practice, training LLMs with batch sizes of 1,024 or larger is not uncommon.


As an optional check, we can iterate through the data loaders to ensure that they were created correctly:


print("Train loader:")  # header for the training loader
for x, y in train_loader:
    print(x.shape, y.shape)  # print the shapes of each batch

print("\nValidation loader:")  # header for the validation loader
for x, y in val_loader:
    print(x.shape, y.shape)  # print the shapes of each batch

We should see the following outputs:


Train loader:
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])

Validation loader:
torch.Size([2, 256]) torch.Size([2, 256])

Based on the preceding code output, we have 9 training set batches with 2 samples and 256 tokens each. Since we allocated only 10% of the data for validation, there is only one validation batch consisting of 2 input examples.


As expected, the input data (x) and target data (y) have the same shape (the batch size times the number of tokens in each batch) since the targets are the inputs shifted by one position, as discussed in chapter 2.


Next, we implement a utility function to calculate the cross entropy loss of a given batch returned via the training and validation loader:


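def calc_loss_batch(input_batch, target_batch, model, device):  # compute the loss of a single batch
    # Move the input and target batches to the same device as the model
    input_batch, target_batch = input_batch.to(device), target_batch.to(device)
    logits = model(input_batch)  # compute the logits
    loss = torch.nn.functional.cross_entropy(  # compute the cross entropy loss
        logits.flatten(0, 1), target_batch.flatten()  # flatten the logits and targets
    )
    return loss  # return the loss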

We can now use this calc_loss_batch utility function, which computes the loss for a single batch, to implement the following calc_loss_loader function that computes the loss over all the batches sampled by a given data loader:


def calc_loss_loader(data_loader, model, device, num_batches=None):  # compute the average loss over a data loader
    total_loss = 0.  # running sum of batch losses
    if len(data_loader) == 0:  # empty data loader
        return float("nan")  # nothing to average
    elif num_batches is None:  # no batch limit given
        num_batches = len(data_loader)  # use all batches in the loader
    else:
        num_batches = min(num_batches, len(data_loader))  # cap at the number of batches available
    for i, (input_batch, target_batch) in enumerate(data_loader):  # iterate over the batches
        if i < num_batches:  # still within the requested number of batches
            loss = calc_loss_batch(input_batch, target_batch, model, device)  # loss of this batch
            total_loss += loss.item()  # accumulate the loss
        else:
            break  # stop once the requested number of batches is reached
    return total_loss / num_batches  # return the average loss

By default, the calc_loss_loader function iterates over all batches in a given data loader, accumulates the loss in the total_loss variable, and then averages the loss over the total number of batches. Alternatively, we can specify a smaller number of batches via num_batches to speed up the evaluation during model training.


Let’s now see this calc_loss_loader function in action, applying it to the training and validation set loaders:


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # use the GPU if available, otherwise the CPU
model.to(device)  # move the model to the device
with torch.no_grad():  # disable gradient tracking for efficiency
    train_loss = calc_loss_loader(train_loader, model, device)  # training set loss
    val_loss = calc_loss_loader(val_loader, model, device)  # validation set loss
print("Training loss:", train_loss)  # print the training loss
print("Validation loss:", val_loss)  # print the validation loss

The resulting loss values are as follows:


Training loss: 10.98758347829183
Validation loss: 10.98110580444336

The loss values are relatively high because the model has not yet been trained. For comparison, the loss approaches 0 if the model learns to generate the next tokens as they appear in the training and validation sets.
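
As a rough sanity check on these numbers, the following sketch computes the cross entropy of a model that guesses uniformly over the vocabulary, which is the natural logarithm of the vocabulary size. The vocabulary size of 50,257 is an assumption here (the size of the GPT-2 BPE vocabulary), and the perplexity is simply the exponential of the loss.

```python
import math

vocab_size = 50_257                  # assumption: GPT-2 BPE vocabulary size
uniform_loss = math.log(vocab_size)  # cross entropy of a uniform guess over all tokens
print(f"Uniform-guess loss: {uniform_loss:.3f}")

untrained_loss = 10.988              # training loss reported above, rounded
perplexity = math.exp(untrained_loss)
print(f"Perplexity of the untrained model: {perplexity:,.0f}")
print("Worse than uniform guessing:", untrained_loss > uniform_loss)
```

A freshly initialized model is therefore about as uncertain as picking a token at random from the whole vocabulary, which is why losses near 11 are expected before training.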


Now that we have a way to measure the quality of the generated text, in the next section, we train the LLM to reduce this loss so that it becomes better at generating text, as illustrated in Figure 5.10.


Figure 5.10 We have recapped the text generation process and implemented basic model evaluation techniques to compute the training and validation set losses. Next, we will go to the training functions and pretrain the LLM.


As shown in Figure 5.10, the next section focuses on pretraining the LLM. After model training, we implement alternative text generation strategies and save and load pretrained model weights.


5.2 Training an LLM


In this section, we finally implement the code for pretraining the LLM, our GPTModel. For this, we focus on a straightforward training loop, as illustrated in Figure 5.11, to keep the code concise and readable. However, interested readers can learn about more advanced techniques, including learning rate warmup, cosine annealing, and gradient clipping, in Appendix D, Adding Bells and Whistles to the Training Loop.



Figure 5.11 A typical training loop for training deep neural networks in PyTorch consists of several steps, iterating over the batches in the training set for several epochs. In each loop, we calculate the loss for each training set batch to determine loss gradients, which we use to update the model weights so that the training set loss is minimized.


The flowchart in Figure 5.11 depicts a typical PyTorch neural network training workflow, which we use for training an LLM. It outlines eight steps, starting with iterating over each epoch, processing batches, resetting and calculating gradients, updating weights, and concluding with monitoring steps like printing losses and generating text samples. If you are relatively new to training deep neural networks with PyTorch and any of these steps are unfamiliar, consider reading sections A.5 to A.8 in Appendix A, Introduction to PyTorch.


In code, we can implement this training flow via the following train_model_simple function:


Listing 5.3 The main function for pretraining LLMs


def train_model_simple(model, train_loader, val_loader, optimizer, device, num_epochs,
                       eval_freq, eval_iter, start_context, tokenizer):
    train_losses, val_losses, track_tokens_seen = [], [], []  # Lists to track losses and tokens seen
    tokens_seen, global_step = 0, -1  # Token counter and global step counter

    for epoch in range(num_epochs):  # Main training loop
        model.train()  # Training mode (enables dropout)
        for input_batch, target_batch in train_loader:  # Iterate over the training batches
            optimizer.zero_grad()  # Reset the gradients from the previous batch iteration
            loss = calc_loss_batch(input_batch, target_batch, model, device)  # Loss for this batch
            loss.backward()  # Compute the loss gradients
            optimizer.step()  # Update the model weights using the loss gradients
            tokens_seen += input_batch.numel()  # Count the tokens processed so far
            global_step += 1

            if global_step % eval_freq == 0:  # Optional evaluation step
                train_loss, val_loss = evaluate_model(
                    model, train_loader, val_loader, device, eval_iter)
                train_losses.append(train_loss)
                val_losses.append(val_loss)
                track_tokens_seen.append(tokens_seen)
                print(f"Ep {epoch+1} (Step {global_step:06d}): "
                      f"Train loss {train_loss:.3f}, Val loss {val_loss:.3f}")

        generate_and_print_sample(  # Print a sample text after each epoch
            model, tokenizer, device, start_context
        )
    return train_losses, val_losses, track_tokens_seen

Note that the train_model_simple function we just created uses two functions we have not defined yet: evaluate_model and generate_and_print_sample.


The evaluate_model function corresponds to step 7 in Figure 5.11. It computes the training and validation set losses every eval_freq steps, and train_model_simple prints them, so we can evaluate whether the training improves the model.


More specifically, the evaluate_model function calculates the loss over the training and validation sets while the model is in evaluation mode, with gradient tracking and dropout disabled:


def evaluate_model(model, train_loader, val_loader, device, eval_iter):
    model.eval()  # Evaluation mode: dropout is disabled
    with torch.no_grad():  # No gradient tracking is needed for evaluation
        train_loss = calc_loss_loader(train_loader, model, device, num_batches=eval_iter)  # Training set loss
        val_loss = calc_loss_loader(val_loader, model, device, num_batches=eval_iter)  # Validation set loss
    model.train()  # Switch back to training mode
    return train_loss, val_loss

Similar to evaluate_model, the generate_and_print_sample function is a convenience function that we use to track whether the model improves during the training. In particular, the generate_and_print_sample function takes a text snippet (start_context) as input, converts it into token IDs, and feeds it to the LLM to generate a text sample using the generate_text_simple function we used earlier:


def generate_and_print_sample(model, tokenizer, device, start_context):
    model.eval()  # Evaluation mode
    context_size = model.pos_emb.weight.shape[0]  # Context length, read from the positional embedding
    encoded = text_to_token_ids(start_context, tokenizer).to(device)  # Encode the prompt and move it to the device
    with torch.no_grad():  # Disable gradient tracking
        token_ids = generate_text_simple(
            model=model, idx=encoded,
            max_new_tokens=50, context_size=context_size
        )
    decoded_text = token_ids_to_text(token_ids, tokenizer)  # Decode the token IDs back into text
    print(decoded_text.replace("\n", " "))  # Compact print format
    model.train()  # Switch back to training mode

While the evaluate_model function gives us a numeric estimate of the model’s training progress, this generate_and_print_sample test function provides a concrete text example generated by the model to judge its capabilities during training.
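
To see how global_step and eval_freq interact, the following sketch replays only the counting logic of the training loop and lists the steps at which an evaluation would be triggered. The helper evaluation_schedule is hypothetical, and the value of 9 batches per epoch is an assumption chosen for illustration; it happens to match the step numbers in the training log shown later in this section.

```python
def evaluation_schedule(num_epochs, batches_per_epoch, eval_freq):
    """Return the (epoch, global_step) pairs at which the loop would evaluate."""
    schedule = []
    global_step = -1  # same starting value as in train_model_simple
    for epoch in range(num_epochs):
        for _ in range(batches_per_epoch):
            global_step += 1
            if global_step % eval_freq == 0:
                schedule.append((epoch + 1, global_step))
    return schedule

# Assumption: 9 training batches per epoch (illustrative value)
schedule = evaluation_schedule(num_epochs=10, batches_per_epoch=9, eval_freq=5)
print("First evaluations:", schedule[:4])
print("Last evaluations: ", schedule[-2:])
print("Number of evaluations:", len(schedule))
```

Because global_step starts at -1 and is incremented before the check, the very first batch already triggers an evaluation at step 0.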



Adam optimizers are a popular choice for training deep neural networks. However, in our training loop, we opt for the AdamW optimizer. AdamW is a variant of Adam that improves the weight decay approach, which aims to minimize model complexity and prevent overfitting by penalizing larger weights. This adjustment allows AdamW to achieve more effective regularization and better generalization and is thus frequently used in the training of LLMs.
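
The difference between the two weight decay approaches can be sketched for a single scalar weight. The function below is a simplified, hypothetical first optimizer step written for illustration, not PyTorch's implementation. With a zero loss gradient, the decoupled (AdamW-style) variant shrinks the weight by lr * weight_decay * weight, whereas folding an L2 penalty into the gradient lets Adam's adaptive normalization rescale the penalty to a step of roughly lr.

```python
import math

def first_step(p, g, lr, wd, decoupled, beta1=0.9, beta2=0.999, eps=1e-8):
    """One optimizer step (t = 1) for a scalar weight p with loss gradient g."""
    if decoupled:
        p = p - lr * wd * p   # AdamW style: decay the weight directly
    else:
        g = g + wd * p        # Adam with L2 penalty: fold the decay into the gradient
    m = (1 - beta1) * g       # first-moment estimate, starting from 0
    v = (1 - beta2) * g * g   # second-moment estimate, starting from 0
    m_hat = m / (1 - beta1)   # bias correction for t = 1
    v_hat = v / (1 - beta2)
    return p - lr * m_hat / (math.sqrt(v_hat) + eps)

lr, wd = 0.0004, 0.1          # same values as in this chapter's training call
p_adamw = first_step(1.0, 0.0, lr, wd, decoupled=True)
p_adam_l2 = first_step(1.0, 0.0, lr, wd, decoupled=False)
print(f"AdamW-style decay:   {p_adamw:.6f}")
print(f"Adam with L2 decay:  {p_adam_l2:.6f}")
```

In the decoupled variant, the strength of the regularization depends only on the learning rate, the weight decay coefficient, and the weight itself, independent of the gradient statistics.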


Let’s see this all in action by training a GPTModel instance for 10 epochs using an AdamW optimizer and the train_model_simple function we defined earlier.


torch.manual_seed(123)  # Set the random seed for reproducibility
model = GPTModel(GPT_CONFIG_124M)  # Initialize the model
model.to(device)  # Move the model to the device
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)  # AdamW optimizer
num_epochs = 10  # Number of training epochs
train_losses, val_losses, tokens_seen = train_model_simple(
    model, train_loader, val_loader, optimizer, device,
    num_epochs=num_epochs, eval_freq=5, eval_iter=1,
    start_context="Every effort moves you", tokenizer=tokenizer
)

Executing the train_model_simple function starts the training process, which takes about 5 minutes on a MacBook Air or a similar laptop to complete. The output printed during this execution is as follows:


Ep 1 (Step 000000): Train loss 9.781, Val loss 9.933
Ep 1 (Step 000005): Train loss 8.111, Val loss 8.339
Every effort moves you,,,,,,,,,,,,.                                     
Ep 2 (Step 000010): Train loss 6.661, Val loss 7.048
Ep 2 (Step 000015): Train loss 5.961, Val loss 6.616
Every effort moves you, and, and, and, and, and, and, and, and, and, and, and, and, and, and, and, and, and, and, and, and, and, and,, and, and,
[...] Results are truncated to save space
Ep 9 (Step 000080): Train loss 0.541, Val loss 6.393
Every effort moves you?"  "Yes--quite insensible to the irony. She wanted him vindicated--and by me!"  He laughed again, and threw back the window-curtains, I had the donkey. "There were days when I
Ep 10 (Step 000085): Train loss 0.391, Val loss 6.452
Every effort moves you know," was one of the axioms he laid down across the Sevres and silver of an exquisitely appointed luncheon-table, when, on a later day, I had again run over from Monte Carlo; and Mrs. Gis

As we can see from the results printed during training, the training loss improves drastically, starting at 9.781 and dropping to 0.391. The language skills of the model have improved considerably. In the beginning, the model is only able to append commas to the start context (“Every effort moves you,”) or repeat the word “and”. At the end of the training, it can generate grammatically correct text.

Similar to the training set loss, the validation loss starts high (9.933) and decreases during training. However, it never becomes as small as the training set loss and remains at 6.452 after the 10th epoch.


Before discussing the validation loss in more detail, let’s create a simple plot that shows the training and validation set losses side by side:


import matplotlib.pyplot as plt

def plot_losses(epochs_seen, tokens_seen, train_losses, val_losses):
    fig, ax1 = plt.subplots(figsize=(5, 3))  # Figure with a single 5x3-inch axes
    ax1.plot(epochs_seen, train_losses, label="Training loss")  # Training loss over epochs
    ax1.plot(epochs_seen, val_losses, linestyle="-.", label="Validation loss")  # Validation loss (dash-dot line)
    ax1.set_xlabel("Epochs")
    ax1.set_ylabel("Loss")
    ax1.legend(loc="upper right")
    ax2 = ax1.twiny()  # Second x-axis that shares the same y-axis
    ax2.plot(tokens_seen, train_losses, alpha=0)  # Invisible plot to align the ticks
    ax2.set_xlabel("Tokens seen")
    fig.tight_layout()  # Adjust the layout so that labels are not clipped
    plt.show()  # Display the figure

epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))  # Evenly spaced epoch values, one per recorded loss
plot_losses(epochs_tensor, tokens_seen, train_losses, val_losses)

The resulting training and validation loss plot is shown in Figure 5.12.



Figure 5.12 At the beginning of the training, we observe that both the training and validation set losses sharply decrease, which is a sign that the model is learning. However, the training set loss continues to decrease past the second epoch, whereas the validation loss stagnates. This is a sign that the model is still learning, but it’s overfitting to the training set past epoch 2.


As Figure 5.12 shows, both the training and validation losses improve during the first epoch. However, the losses start to diverge past the second epoch. This divergence and the fact that the validation loss is much larger than the training loss indicate that the model is overfitting to the training data. We can confirm that the model memorizes the training data verbatim by searching for generated text snippets, such as “quite insensible to the irony”, in the “The Verdict” text file.

This memorization is expected since we are working with a very small training dataset and training the model for multiple epochs. It is more common to train a model on a much larger dataset for only one epoch.


As mentioned earlier, interested readers can try to train the model on 60,000 public domain books from Project Gutenberg, where this overfitting does not occur; see appendix B for details.


In the upcoming section, as shown in Figure 5.13, we explore sampling methods employed by LLMs to mitigate memorization effects, resulting in more novel generated text.



Figure 5.13 Our model can generate coherent text after implementing the training function. However, it often memorizes passages from the training set verbatim. The following section covers strategies to generate more diverse output texts.


As illustrated in Figure 5.13, the next section covers text generation strategies for LLMs that reduce training data memorization and increase the originality of the generated text. After that, we cover saving and loading weights and loading pretrained weights from OpenAI’s GPT model.


5.3 Decoding strategies to control randomness

In this section, we cover text generation strategies (also called decoding strategies) to generate more original text. First, we briefly revisit the generate_text_simple function from the previous chapter, which we used inside generate_and_print_sample earlier in this chapter. Then, we cover two techniques, temperature scaling and top-k sampling, to improve this function.


We begin by transferring the model back from the GPU to the CPU since inference with a relatively small model does not require a GPU. Also, after training, we put the model into evaluation mode to turn off random components such as dropout:

model.to("cpu")  # Move the model to the CPU
model.eval()  # Switch to evaluation mode

Next, we plug the GPTModel instance (model) into the generate_text_simple function, which uses the LLM to generate one token at a time:


tokenizer = tiktoken.get_encoding("gpt2")  # GPT-2 BPE tokenizer
token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids("Every effort moves you", tokenizer),  # Encode the prompt as token IDs
    max_new_tokens=25,  # Generate 25 new tokens
    context_size=GPT_CONFIG_124M["context_length"]  # Supported context length
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))  # Decode and print the result

The generated text is as follows:


Output text:


Every effort moves you know," was one of the axioms he laid down across the Sevres and silver of an exquisitely appointed luncheon-table, when, on a later day, I had again run


As explained in section 5.1.2, at each generation step the generated token is the one with the largest probability score among all tokens in the vocabulary.


The following subsections introduce two concepts to control the randomness and diversity of the generated text: temperature scaling and top-k sampling.


5.3.1 Temperature scaling


This section introduces temperature scaling, a technique that adds a probabilistic selection process to the next-token generation task.


Previously, inside the generate_text_simple function, we always selected the token with the highest probability as the next token using torch.argmax, which is also known as greedy decoding. To generate text with more variety, we can replace the argmax with a function that samples from a probability distribution (here, the probability scores the LLM generates for each vocabulary entry at each token generation step).


To illustrate probabilistic sampling with a concrete example, let’s briefly discuss the next-token generation process using a very small vocabulary:


vocab = {  # Toy vocabulary mapping tokens to token IDs
    "closer": 0,
    "every": 1,
    "effort": 2,
    "forward": 3,
    "inches": 4,
    "moves": 5,
    "pizza": 6,
    "toward": 7,
    "you": 8,
}
inverse_vocab = {v: k for k, v in vocab.items()}  # Map token IDs back to tokens

Next, assume the LLM is given the start context “every effort moves you” and generates the following next-token logits:

next_token_logits = torch.tensor(  # Example next-token logits, one per vocabulary entry
    [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]
)

As discussed in the previous chapter, inside the generate_text_simple, we convert the logits into probabilities via the softmax function and obtain the token ID corresponding to the generated token via the argmax function, which we can then map back into text via the inverse vocabulary:


probas = torch.softmax(next_token_logits, dim=0)  # Convert the logits to probabilities
next_token_id = torch.argmax(probas).item()  # ID of the most probable token
print(inverse_vocab[next_token_id])  # Print the generated token

Since the largest logit value, and correspondingly the largest softmax probability score, is in the fourth position (index position 3 since Python uses 0-indexing), the generated word is “forward”.


To implement a probabilistic sampling process, we can now replace the argmax with the multinomial function in PyTorch:


torch.manual_seed(123)  # Set the random seed for reproducibility
next_token_id = torch.multinomial(probas, num_samples=1).item()  # Sample a token ID according to the probabilities
print(inverse_vocab[next_token_id])  # Print the generated token

The printed output is “forward” just like before. What happened? The multinomial function samples the next token proportional to its probability score. In other words, “forward” is still the most likely token and will be selected by multinomial most of the time but not all the time. To illustrate this, let’s implement a function that repeats this sampling 1000 times:


def print_sampled_tokens(probas):
    torch.manual_seed(123)  # Fix the seed for reproducibility
    sample = [torch.multinomial(probas, num_samples=1).item() for i in range(1_000)]  # Draw 1,000 samples
    sampled_ids = torch.bincount(torch.tensor(sample))  # Count how often each token ID was drawn
    for i, freq in enumerate(sampled_ids):
        print(f"{freq} x {inverse_vocab[i]}")  # Print the frequency next to the token

print_sampled_tokens(probas)

The sampling output is as follows:


73 x closer
0 x every
0 x effort
582 x forward
2 x inches
0 x moves
0 x pizza
343 x toward

As we can see from the output, the word “forward” is sampled most of the time (582 out of 1000 times), but other tokens such as “closer”, “inches”, and “toward” are also sampled some of the time. This means that if we replaced the argmax function with the multinomial function inside the generate_text_simple function, the LLM would sometimes generate texts such as “every effort moves you toward”, “every effort moves you inches”, and “every effort moves you closer” instead of “every effort moves you forward”.
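
One way to picture what sampling "proportional to the probability score" means is inverse-CDF sampling: draw a uniform random number and walk through the cumulative probabilities until it is exceeded. The sketch below uses only the Python standard library; sample_index is a hypothetical helper, and PyTorch's multinomial function may be implemented differently, so the exact counts will not match the output above.

```python
import math
import random
from collections import Counter

vocab = ["closer", "every", "effort", "forward", "inches", "moves", "pizza", "toward", "you"]
next_token_logits = [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]

# Softmax in plain Python
exps = [math.exp(v - max(next_token_logits)) for v in next_token_logits]
probas = [e / sum(exps) for e in exps]

def sample_index(probas, rng):
    """Draw one index with probability proportional to probas."""
    u = rng.random()                 # uniform number in [0, 1)
    cumulative = 0.0
    for i, p in enumerate(probas):
        cumulative += p
        if u < cumulative:
            return i
    return len(probas) - 1           # guard against floating-point rounding

rng = random.Random(123)
counts = Counter(vocab[sample_index(probas, rng)] for _ in range(1000))
for token, p in zip(vocab, probas):
    print(f"{token:8s} probability {p:.4f}, sampled {counts[token]} times")
```

Tokens with a larger probability cover a wider slice of the interval from 0 to 1, so the uniform draw lands in their slice more often.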

We can further control the distribution and selection process via a concept called temperature scaling, where temperature scaling is just a fancy description for dividing the logits by a number greater than 0:


def softmax_with_temperature(logits, temperature):
    scaled_logits = logits / temperature  # Divide the logits by the temperature
    return torch.softmax(scaled_logits, dim=0)  # Convert the scaled logits to probabilities

Temperatures greater than 1 result in more uniformly distributed token probabilities, and temperatures smaller than 1 result in more confident (sharper or more peaked) distributions. Let’s illustrate this by plotting the original probabilities alongside probabilities scaled with different temperature values:

temperatures = [1, 0.1, 5]  # Original, lower, and higher temperature
scaled_probas = [softmax_with_temperature(next_token_logits, T) for T in temperatures]  # Probabilities per temperature
x = torch.arange(len(vocab))  # Bar positions on the x-axis
bar_width = 0.15
fig, ax = plt.subplots(figsize=(5, 3))
for i, T in enumerate(temperatures):  # One group of bars per temperature
    rects = ax.bar(x + i * bar_width, scaled_probas[i],
                   bar_width, label=f'Temperature = {T}')
ax.set_ylabel('Probability')
ax.set_xticks(x)
ax.set_xticklabels(vocab.keys(), rotation=90)
ax.legend()
plt.tight_layout()
plt.show()  # Display the figure

Figure 5.14 A temperature of 1 represents the unscaled probability scores for each token in the vocabulary. Decreasing the temperature to 0.1 sharpens the distribution, so the most likely token (here “forward”) will have an even higher probability score. Vice versa, increasing the temperature to 5 makes the distribution more uniform.


A temperature of 1 divides the logits by 1 before passing them to the softmax function to compute the probability scores. In other words, using a temperature of 1 is the same as not using any temperature scaling. In this case, the tokens are selected with a probability equal to the original softmax probability scores via the multinomial sampling function in PyTorch.


Also, as we can see in Figure 5.14, applying very small temperatures, such as 0.1, results in sharper distributions, such that the multinomial function selects the most likely token (here, “forward”) almost 100% of the time, approaching the behavior of the argmax function. Vice versa, a temperature of 5 results in a more uniform distribution where other tokens are selected more often. This can add more variety to the generated texts but also more often results in nonsensical text. For example, using a temperature of 5 results in texts such as “every effort moves you pizza” about 4% of the time.
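
The effect of the temperature can also be checked numerically without plotting. The following sketch reimplements the temperature-scaled softmax in plain Python for the toy logits used in this section; the list-based softmax_with_temperature shown here is an illustrative stand-in for the PyTorch version.

```python
import math

vocab = ["closer", "every", "effort", "forward", "inches", "moves", "pizza", "toward", "you"]
next_token_logits = [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]

def softmax_with_temperature(logits, temperature):
    scaled = [v / temperature for v in logits]  # temperature scaling
    m = max(scaled)                             # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

for T in (1, 0.1, 5):
    probas = dict(zip(vocab, softmax_with_temperature(next_token_logits, T)))
    print(f"T={T}: forward={probas['forward']:.4f}, pizza={probas['pizza']:.4f}")
```

At a temperature of 0.1, “forward” receives more than 99% of the probability mass, while at a temperature of 5 its share drops to roughly a quarter and “pizza” rises to about 4%.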


Exercise 5.1

Use the print_sampled_tokens function to print the sampling frequencies of the softmax probabilities scaled with the temperatures shown in Figure 5.14. How often is the word “pizza” sampled in each case? Can you think of a faster and more accurate way to determine how often the word “pizza” is sampled?


5.3.2 Top-k sampling

In the previous section, we implemented a probabilistic sampling approach coupled with temperature scaling to increase the diversity of the outputs. We saw that higher temperature values result in more uniformly distributed next-token probabilities, which produce more diverse outputs because they reduce the likelihood of the model repeatedly selecting the most probable token. This method allows for exploring less likely but potentially more interesting and creative paths in the generation process. However, one downside of this approach is that it sometimes leads to grammatically incorrect or completely nonsensical outputs such as “every effort moves you pizza”.

In this section, we introduce another concept called top-k sampling, which, when combined with probabilistic sampling and temperature scaling, can improve the text generation results.


In top-k sampling, we can restrict the sampled tokens to the top-k most likely tokens and exclude all other tokens from the selection process by masking their probability scores, as illustrated in Figure 5.15.



Figure 5.15 Using top-k sampling with k=3, we focus on the 3 tokens associated with the highest logits and mask out all other tokens with negative infinity (-inf) before applying the softmax function. This results in a probability distribution with a probability value 0 assigned to all non-top-k tokens.

The approach outlined in Figure 5.15 replaces all non-selected logits with negative infinity (-inf), such that when computing the softmax values, the probability scores of the non-top-k tokens are 0, and the remaining probabilities sum up to 1. (Careful readers may remember this masking trick from the causal attention module we implemented in chapter 3, in section 3.5.1, Applying a causal attention mask.)


In code, we can implement the top-k procedure outlined in Figure 5.15 as follows, starting with the selection of the tokens with the largest logit values:


top_k = 3  # Keep the 3 most likely tokens
top_logits, top_pos = torch.topk(next_token_logits, top_k)  # Largest logits and their positions
print("Top logits:", top_logits)
print("Top positions:", top_pos)

The logit values and token IDs of the top 3 tokens, in descending order, are as follows:

Top logits: tensor([6.7500, 6.2800, 4.5100])
Top positions: tensor([3, 7, 0])

Subsequently, we apply PyTorch’s where function to set the logit values of tokens that are below the lowest logit value within our top-3 selection to negative infinity (-inf).


new_logits = torch.where(
    condition=next_token_logits < top_logits[-1],  # Logits below the smallest of the top 3
    input=torch.tensor(float('-inf')),  # Replace those logits with -inf
    other=next_token_logits  # Keep the original logits for all other tokens
)
print(new_logits)

The resulting logits for the next token in the 9-token vocabulary are as follows:


tensor([4.5100, -inf, -inf, 6.7500, -inf, -inf, -inf, 6.2800, -inf])

Lastly, let’s apply the softmax function to turn these into next-token probabilities:


topk_probas = torch.softmax(new_logits, dim=0)  # Apply softmax to the masked logits
print(topk_probas)

As we can see, the result of this top-3 approach is 3 non-zero probability scores:


tensor([0.0615, 0.0000, 0.0000, 0.5775, 0.0000, 0.0000, 0.0000, 0.3610, 0.0000])

We can now apply the temperature scaling and the multinomial function for probabilistic sampling introduced in the previous section to select the next token among these 3 non-zero probability scores. We do this in the next section by modifying the text generation function.
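
Before moving to the full model, the combination of top-k filtering, temperature scaling, and sampling can be tried on the toy logits. This is a standard-library sketch under the assumption that ties at the threshold are kept; top_k_filter and softmax are hypothetical helper names, and random.choices stands in for the multinomial function.

```python
import math
import random
from collections import Counter

vocab = ["closer", "every", "effort", "forward", "inches", "moves", "pizza", "toward", "you"]
next_token_logits = [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]

def top_k_filter(logits, k):
    """Replace every logit below the k-th largest value with -inf."""
    threshold = sorted(logits, reverse=True)[k - 1]
    return [v if v >= threshold else float("-inf") for v in logits]

def softmax(logits, temperature=1.0):
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

masked = top_k_filter(next_token_logits, k=3)
topk_probas = softmax(masked)                   # no temperature scaling
hot_probas = softmax(masked, temperature=5.0)   # flatter, but still only 3 candidates

rng = random.Random(123)
counts = Counter(rng.choices(vocab, weights=hot_probas, k=1000))
print([round(p, 4) for p in topk_probas])
print(sorted(counts.items()))
```

Even at a temperature of 5, “pizza” can no longer be sampled because its probability is exactly 0 after masking; only “closer”, “forward”, and “toward” remain as candidates.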


5.3.3 Modifying the text generation function

The previous two subsections introduced two concepts to increase the diversity of LLM-generated text: temperature scaling and top-k sampling. In this section, we combine these concepts to modify the generate_text_simple function we used to generate text via the LLM earlier, creating a new generate function:


Listing 5.4 A modified text generation function with more diversity


def generate(model, idx, max_new_tokens, context_size,
             temperature=1.0, top_k=None, eos_id=None):
    for _ in range(max_new_tokens):  # Generate one token per iteration
        idx_cond = idx[:, -context_size:]  # Crop the context to the supported length
        with torch.no_grad():  # Disable gradient tracking
            logits = model(idx_cond)  # Forward pass
            logits = logits[:, -1, :]  # Keep only the logits of the last time step
        if top_k is not None:  # New: top-k filtering
            top_logits, _ = torch.topk(logits, top_k)
            min_val = top_logits[:, -1]  # Smallest logit among the top k
            logits = torch.where(
                logits < min_val,  # Logits below the top-k threshold
                torch.tensor(float('-inf')).to(logits.device),  # are replaced with -inf
                logits  # All other logits are kept
            )

        if temperature > 0.0:  # New: temperature scaling followed by sampling
            logits = logits / temperature
            probs = torch.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
        else:  # Greedy selection when temperature scaling is disabled
            idx_next = torch.argmax(logits, dim=-1, keepdim=True)
        if idx_next == eos_id:  # Stop early if the end-of-sequence token is generated and eos_id is given
            break
        idx = torch.cat((idx, idx_next), dim=1)  # Append the new token to the running sequence
    return idx

Let’s now see this new generate function in action:


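torch.manual_seed(123)  # Set the random seed for reproducibility
token_ids = generate(
    model=model,
    idx=text_to_token_ids("Every effort moves you", tokenizer),  # Encode the prompt (the model is on the CPU)
    max_new_tokens=15,  # Generate up to 15 new tokens
    context_size=GPT_CONFIG_124M["context_length"],  # Supported context length
    top_k=25,  # Sample only from the 25 most likely tokens
    temperature=1.4  # Temperature above 1 for more variety
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))  # Decode and print the result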

The generated text is as follows:


Output text:


Every effort moves you stand to work on surprise, a one of us had gone with random-


As we can see, the generated text is very different from the one we previously generated via the generate_text_simple function at the beginning of section 5.3 (“Every effort moves you know," was one of the axioms he laid…”), which was a memorized passage from the training set.



Exercise 5.2

Play around with different temperatures and top-k settings. Based on your observations, can you think of applications where lower temperature and top-k settings are desired? Vice versa, can you think of applications where higher temperature and top-k settings are preferred? (It’s recommended to also revisit this exercise at the end of the chapter after loading the pretrained weights from OpenAI.)



Exercise 5.3

What are the different combinations of settings for the generate function to force deterministic behavior, that is, disabling the random sampling such that it always produces the same outputs, similar to the generate_text_simple function?


So far, we covered how to pretrain LLMs and use them to generate text. The last two sections of this chapter will discuss how we save and load the trained LLM and how we load pretrained weights from OpenAI.


5.4 Loading and saving model weights in PyTorch


In this chapter, we have discussed how to numerically evaluate the training progress and pretrain an LLM from scratch. Even though both the LLM and dataset were relatively small, this exercise showed that pretraining LLMs is computationally expensive. Thus, it is important to be able to save the LLM so that we don’t have to rerun the training every time we want to use it in a new session.


As illustrated in the chapter overview in Figure 5.16, we cover how to save and load a pretrained model in this section. Then, in the upcoming section, we will load a more capable pretrained GPT model from OpenAI into our GPTModel instance.



Figure 5.16 After training and inspecting the model, it is often helpful to save the model so that we can use or continue training it later, which is the topic of this section before we load the pretrained model weights from OpenAI in the final section of this chapter.


Fortunately, saving a PyTorch model is relatively straightforward. The recommended way is to save a model’s so-called state_dict, a dictionary mapping each layer to its parameters, using the torch.save function as follows:

torch.save(model.state_dict(), "model.pth")  # Save the model weights to the file model.pth

In the preceding code, “model.pth” is the filename where the state_dict is saved. The .pth extension is a convention for PyTorch files, though we could technically use any file extension.


Then, after saving the model weights via the state_dict, we can load the model weights into a new GPTModel model instance as follows:


model = GPTModel(GPT_CONFIG_124M)  # Initialize a new model instance
model.load_state_dict(torch.load("model.pth"))  # Load the saved model weights
model.eval()  # Switch to evaluation mode

As discussed in chapter 4, dropout helps prevent the model from overfitting to the training data by randomly “dropping out” some of a layer’s neurons during training. However, during inference, we don’t want to randomly drop out any of the information the network has learned. Using model.eval() switches the model to evaluation mode for inference, disabling the dropout layers of the model.


If we plan to continue pretraining a model later, for example, using the train_model_simple function we defined earlier in this chapter, saving the optimizer state is also recommended.


Adaptive optimizers such as AdamW store additional parameters for each model weight. AdamW uses historical data to adjust learning rates for each model parameter dynamically. Without this stored state, the optimizer resets, and the model may learn suboptimally or even fail to converge properly, which means that it will lose the ability to generate coherent text. Using torch.save, we can save both the model and optimizer state_dict contents as follows:

torch.save({  # save the model and optimizer state dictionaries
    "model_state_dict": model.state_dict(),  # model state dictionary
    "optimizer_state_dict": optimizer.state_dict(),  # optimizer state dictionary
}, "model_and_optimizer.pth")  # write to the file model_and_optimizer.pth

Then, we can restore the model and optimizer states as follows by first loading the saved data via torch.load and then using the load_state_dict method:


checkpoint = torch.load("model_and_optimizer.pth")  # load the saved checkpoint
model = GPTModel(GPT_CONFIG_124M)  # initialize a new model instance
model.load_state_dict(checkpoint["model_state_dict"])  # load the model state dictionary
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.1)  # initialize the optimizer
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])  # load the optimizer state dictionary
model.train()  # switch the model to training mode
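To see what the optimizer state actually contains, the following sketch runs one training step on a small stand-in layer, saves a checkpoint, and restores it. The linear layer, the random batch, and the temporary file are assumptions for illustration; the same pattern should carry over to GPTModel.

```python
import os
import tempfile

import torch

torch.manual_seed(123)

tiny_model = torch.nn.Linear(4, 2)  # hypothetical stand-in for GPTModel
tiny_optimizer = torch.optim.AdamW(
    tiny_model.parameters(), lr=5e-4, weight_decay=0.1
)

# One update step so that AdamW populates its per-parameter statistics
loss = tiny_model(torch.randn(8, 4)).pow(2).mean()
loss.backward()
tiny_optimizer.step()
tiny_optimizer.zero_grad()

# Per-parameter state: running averages of the gradient and squared gradient
state_keys = sorted(tiny_optimizer.state_dict()["state"][0].keys())
print("Optimizer state for parameter 0:", state_keys)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "checkpoint.pth")
    torch.save({
        "model_state_dict": tiny_model.state_dict(),
        "optimizer_state_dict": tiny_optimizer.state_dict(),
    }, path)
    checkpoint = torch.load(path)

new_model = torch.nn.Linear(4, 2)
new_model.load_state_dict(checkpoint["model_state_dict"])
new_optimizer = torch.optim.AdamW(
    new_model.parameters(), lr=5e-4, weight_decay=0.1
)
new_optimizer.load_state_dict(checkpoint["optimizer_state_dict"])

old_avg = tiny_optimizer.state_dict()["state"][0]["exp_avg"]
new_avg = new_optimizer.state_dict()["state"][0]["exp_avg"]
moments_match = torch.equal(old_avg, new_avg)
print("Moment estimates restored:", moments_match)
```

The entries exp_avg and exp_avg_sq are the "historical data" mentioned above. A freshly created optimizer has none of them until its first step, which is why restoring them matters when training resumes.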


Exercise 5.4

After saving the weights, load the model and optimizer in a new Python session or Jupyter notebook file and continue pretraining it for 1 more epoch using the train_model_simple function.


5.5 Loading pretrained weights from OpenAI


Previously, for educational purposes, we trained a small GPT-2 model using a limited dataset comprising a short-story book. This approach allowed us to focus on the fundamentals without the need for extensive time and computational resources.


Fortunately, OpenAI openly shared the weights of their GPT-2 models, thus eliminating the need to invest tens to hundreds of thousands of dollars in retraining the model on a large corpus ourselves.


In the remainder of this section, we load these weights into our GPTModel class and use the model for text generation. Here, weights refer to the weight parameters that are stored in the .weight attributes of PyTorch’s Linear and Embedding layers, for example. We accessed them earlier via model.parameters() when training the model.


In the next chapters, we will reuse these pretrained weights to finetune the model for a text classification task and to follow instructions, similar to ChatGPT.


Note that OpenAI originally saved the GPT-2 weights via TensorFlow, which we have to install to load the weights in Python. Moreover, the following code will use a progress bar tool called tqdm to track the download process, which we also have to install.


You can install these libraries by executing the following command in your terminal:


pip install "tensorflow>=2.15.0" "tqdm>=4.66"  # install TensorFlow and tqdm (quoted so the shell does not treat > as a redirect)

The download code is relatively long, mostly boilerplate, and not very interesting. Hence, instead of devoting precious space in this chapter to discussing Python code for fetching files from the internet, we download the Python module directly from this chapter’s online repository:


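import urllib.request  # import the urllib.request module
url = (  # download URL of the module (the address itself is missing from this copy of the text)
    "..."
)
filename = url.split('/')[-1]  # take the file name from the end of the URL
urllib.request.urlretrieve(url, filename)  # download the file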

Next, after downloading this file to the local directory of your Python session, readers are encouraged to briefly inspect the contents of this file to ensure that it was saved correctly and contains valid Python code.
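One lightweight way to perform this check programmatically is to parse the file with Python's built-in ast module, which verifies the syntax without executing anything. The helper names is_valid_python and list_top_level_functions below are hypothetical and only serve as a sketch; a manual look at the file remains worthwhile.

```python
import ast
from pathlib import Path


def is_valid_python(path):
    """Return True if the file at `path` parses as Python source code."""
    source = Path(path).read_text(encoding="utf-8")
    try:
        ast.parse(source, filename=str(path))
    except SyntaxError:
        return False
    return True


def list_top_level_functions(path):
    """Return the names of the functions defined at the top level of the file."""
    tree = ast.parse(Path(path).read_text(encoding="utf-8"))
    return [node.name for node in tree.body if isinstance(node, ast.FunctionDef)]


# Only run the check if the downloaded file is present in the working directory
if Path("gpt_download.py").exists():
    print("Valid Python:", is_valid_python("gpt_download.py"))
    print("Functions:", list_top_level_functions("gpt_download.py"))
```

If the download succeeded, we would expect download_and_load_gpt2 to appear among the listed function names. An HTML error page saved under the same file name would instead fail the syntax check.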


We can now import the download_and_load_gpt2 function from the file as follows, which will load the GPT-2 architecture settings (settings) and weight parameters (params) into our Python session:


from gpt_download import download_and_load_gpt2  # import the download_and_load_gpt2 function from gpt_download
settings, params = download_and_load_gpt2(model_size="124M", models_dir="gpt2")  # download and load the GPT-2 settings and parameters

Executing the preceding code downloads the following 7 files associated with the 124M parameter GPT-2 model:


checkpoint: 100% |██████████████████████████████████████████████████████████████████| 77.0/77.0 [00:00<00:00, 63.9kiB/s]
encoder.json: 100% |████████████████████████████████████████████████████████████████| 1.04M/1.04M [00:00<00:00, 2.20MiB/s]
hparams.json: 100% |████████████████████████████████████████████████████████████████| 90.0/90.0 [00:00<00:00, 78.3kiB/s]
100% |██████████████████████████████████████████████| 498M/498M [01:09<00:00, 7.16MiB/s]
model.ckpt.index: 100% |████████████████████████████████████████████████████████████| 5.21k/5.21k [00:00<00:00, 3.24MiB/s]
model.ckpt.meta: 100% |████████████████████████████████████████████████████████████| 471k/471k [00:00<00:00, 2.46MiB/s]
vocab.bpe: 100% |██████████████████████████████████████████████████████████████████| 456k/456k [00:00<00:00, 1.70MiB/s]



If the download code does not work for you, it could be due to an intermittent internet connection, server issues, or changes in how OpenAI shares the weights of the open-source GPT-2 model. In this case, please visit this chapter’s online code repository for alternative and updated instructions, and please reach out via the Manning Forum for further questions.


After the execution of the previous code has been completed, let’s inspect the contents of settings and params:


print("Settings:", settings)  # print the architecture settings
print("Parameter dictionary keys:", params.keys())  # print the keys of the parameter dictionary

The contents are as follows:


Settings: {'n_vocab': 50257, 'n_ctx': 1024, 'n_embd': 768, 'n_head': 12, 'n_layer': 12}
Parameter dictionary keys: dict_keys(['blocks', 'b', 'g', 'wpe', 'wte'])

Both settings and params are Python dictionaries. The settings dictionary stores the LLM architecture settings similarly to our manually defined GPT_CONFIG_124M settings. The params dictionary contains the actual weight tensors. Note that we only printed the dictionary keys because printing the weight contents would take up too much screen space. However, we can inspect these weight tensors by printing the whole dictionary via print(params) or by selecting individual tensors via the respective dictionary keys, for example, the embedding layer weights:


print(params["wte"])  # print the token embedding weight tensor
print("Token embedding weight tensor dimensions:", params["wte"].shape)  # print its dimensions

The weights of the token embedding layer are as follows:


[[ 0.1101301 ... 0.1363697 0.0156028 0.04531523]
 [ 0.04034033 ... 0.0865453 0.00253983 0.04318958]
 [-0.12746179 ... 0.08991534 -0.12972379 -0.08785918]
 [-0.04453601 ... 0.10435229 0.09783269 -0.06952604]
 [ 0.1860082 ... -0.09625227 0.07847701 -0.02245961]
 [ 0.05135201 ... 0.00704835 0.15519823 0.12067825]]

Token embedding weight tensor dimensions: (50257, 768)
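Because params is a nested structure, printing only the shapes is often more readable than printing the tensors themselves. The following sketch walks nested dictionaries and lists and collects the shape of every array it finds. The helper summarize_shapes is hypothetical, and toy_params is a made-up miniature that merely mimics the nesting pattern of the real dictionary.

```python
import numpy as np


def summarize_shapes(obj, prefix="params"):
    """Recursively collect (path, shape) pairs from nested dicts and lists."""
    rows = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            rows.extend(summarize_shapes(value, f'{prefix}["{key}"]'))
    elif isinstance(obj, (list, tuple)):
        for i, value in enumerate(obj):
            rows.extend(summarize_shapes(value, f"{prefix}[{i}]"))
    else:
        rows.append((prefix, np.shape(obj)))
    return rows


# Toy stand-in with made-up shapes; the real dictionary holds the GPT-2 arrays
toy_params = {
    "wte": np.zeros((10, 4)),
    "wpe": np.zeros((6, 4)),
    "blocks": [
        {"attn": {"c_attn": {"w": np.zeros((4, 12)), "b": np.zeros(12)}}},
    ],
}

rows = summarize_shapes(toy_params)
for path, shape in rows:
    print(path, shape)
```

Calling summarize_shapes(params) on the downloaded weights should produce one line per array, which is a convenient reference when matching OpenAI's names to our own layers later on.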


We downloaded and loaded the weights of the smallest GPT-2 model via the download_and_load_gpt2(model_size="124M", …) setting. However, note that OpenAI also shares the weights of larger models: "355M", "774M", and "1558M". The overall architecture of these differently-sized GPT models is the same, as illustrated in Figure 5.17.



Figure 5.17 GPT-2 LLMs come in several different model sizes, ranging from 124 million to 1,558 million parameters. The core architecture is the same, with the only difference being the embedding sizes and the number of times individual components like the attention heads and transformer blocks are repeated.


As illustrated in Figure 5.17, the overall architecture of the differently-sized GPT-2 models remains the same, except that different architectural elements are repeated different numbers of times, and the embedding size differs. The remaining code in this chapter is also compatible with these larger models.


After loading the GPT-2 model weights into Python, we still need to transfer them from the settings and params dictionaries into our GPTModel instance.


First, we create a dictionary that lists the differences between the different GPT model sizes, as explained in Figure 5.17:


model_configs = {  # configurations of the different GPT-2 model sizes
    "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}
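As a sanity check on these settings, we can estimate each model's parameter count from the embedding size and the number of layers alone. The sketch below assumes the vocabulary size (50,257) and context length (1,024) from the settings printed earlier, a feed forward layer that expands to four times the embedding size, bias terms in all linear layers, and an output layer that shares its weights with the token embedding. The function count_gpt2_parameters is a hypothetical helper, not part of the chapter's code.

```python
def count_gpt2_parameters(emb_dim, n_layers, vocab_size=50257, context_length=1024):
    """Estimate the GPT-2 parameter count, counting the tied
    token-embedding/output matrix only once."""
    embeddings = vocab_size * emb_dim + context_length * emb_dim
    # Fused query/key/value projection plus the output projection (with biases)
    attention = (emb_dim * 3 * emb_dim + 3 * emb_dim) + (emb_dim * emb_dim + emb_dim)
    # Two linear layers with a 4x expansion in between (with biases)
    feed_forward = (emb_dim * 4 * emb_dim + 4 * emb_dim) + (4 * emb_dim * emb_dim + emb_dim)
    # Two layer norms per block, each with a scale and a shift vector
    layer_norms = 2 * (2 * emb_dim)
    per_block = attention + feed_forward + layer_norms
    final_norm = 2 * emb_dim
    return embeddings + n_layers * per_block + final_norm


# Embedding size and layer count per model (number of heads does not change the count)
configs = {
    "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24},
    "gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36},
    "gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48},
}

counts = {name: count_gpt2_parameters(**cfg) for name, cfg in configs.items()}
for name, total in counts.items():
    print(f"{name}: {total:,} parameters")
```

Under these assumptions, the estimates round to 124M, 355M, 774M, and 1558M, which agrees with the model names. Note that the number of attention heads does not enter the calculation because the heads only partition the existing projection matrices.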

Suppose we are interested in loading the smallest model, "gpt2-small (124M)". We can use the corresponding settings from the model_configs table to update the full-length GPT_CONFIG_124M that we defined and used earlier throughout the chapter as follows:


model_name = "gpt2-small (124M)"  # choose the model name
NEW_CONFIG = GPT_CONFIG_124M.copy()  # copy the original configuration
NEW_CONFIG.update(model_configs[model_name])  # overwrite it with the settings of the selected model

Careful readers may remember that we used a 256-token length earlier, but the original GPT-2 models from OpenAI were trained with a 1,024-token length, so we have to update the NEW_CONFIG accordingly:


NEW_CONFIG.update({"context_length": 1024})  # update the context length to 1024

Also, OpenAI used bias vectors in the multi-head attention module’s linear layers to implement the query, key, and value matrix computations. Bias vectors are not commonly used in LLMs anymore as they don’t improve the modeling performance and are thus unnecessary. However, since we are working with pretrained weights, we need to match the settings for consistency and enable these bias vectors:


NEW_CONFIG.update({"qkv_bias": True})  # enable the bias vectors
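Before building the model, it can be reassuring to confirm that the updated configuration agrees with the settings dictionary that came with the downloaded weights. The following self-contained sketch uses stand-in dictionaries with differently named variables so that it does not overwrite anything in the session. The values of base_config are assumed to match GPT_CONFIG_124M from earlier in the chapter, and key_map is a hypothetical mapping between our key names and OpenAI's.

```python
# Assumed contents of GPT_CONFIG_124M as used earlier in the chapter
base_config = {
    "vocab_size": 50257, "context_length": 256, "emb_dim": 768,
    "n_heads": 12, "n_layers": 12, "drop_rate": 0.1, "qkv_bias": False,
}
# The settings dictionary as printed above
openai_settings = {"n_vocab": 50257, "n_ctx": 1024, "n_embd": 768,
                   "n_head": 12, "n_layer": 12}
small_config = {"emb_dim": 768, "n_layers": 12, "n_heads": 12}

new_config = base_config.copy()   # shallow copy, so base_config stays untouched
new_config.update(small_config)
new_config.update({"context_length": 1024, "qkv_bias": True})

# Hypothetical mapping from our key names to OpenAI's key names
key_map = {
    "vocab_size": "n_vocab",
    "context_length": "n_ctx",
    "emb_dim": "n_embd",
    "n_heads": "n_head",
    "n_layers": "n_layer",
}
mismatches = {
    ours: (new_config[ours], openai_settings[theirs])
    for ours, theirs in key_map.items()
    if new_config[ours] != openai_settings[theirs]
}

print("Mismatches:", mismatches)
print("Original context length:", base_config["context_length"])
```

An empty mismatch dictionary indicates that the shapes of our freshly initialized layers should line up with the downloaded tensors. The last line shows that copying before updating leaves the original 256-token configuration intact.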

We can now use the updated NEW_CONFIG dictionary to initialize a new GPTModel instance:


gpt = GPTModel(NEW_CONFIG)  # initialize the GPT model with the updated configuration
gpt.eval()  # switch the model to evaluation mode

By default, the GPTModel instance is initialized with random weights for pretraining. The last step to using OpenAI’s model weights is to override these random weights with the weights we loaded into the params dictionary.


For this, we will first define a small assign utility function that checks whether two tensors or arrays (left and right) have the same dimensions or shape and returns the right tensor as trainable PyTorch parameters:


def assign(left, right):  # return right as a trainable parameter if the shapes match
    if left.shape != right.shape:  # the shapes do not match
        raise ValueError(f"Shape mismatch. Left: {left.shape}, Right: {right.shape}")  # raise an error
    return torch.nn.Parameter(torch.tensor(right))  # wrap the right tensor as a trainable parameter
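A quick way to convince ourselves that the shape guard works is to call assign once with a matching and once with a mismatching array. The sketch below repeats the function so that it runs on its own; the small linear layer and the all-ones arrays are assumptions chosen for illustration.

```python
import numpy as np
import torch


def assign(left, right):
    # Same utility as above, repeated so that this sketch is self-contained
    if left.shape != right.shape:
        raise ValueError(f"Shape mismatch. Left: {left.shape}, Right: {right.shape}")
    return torch.nn.Parameter(torch.tensor(right))


layer = torch.nn.Linear(3, 2)                    # weight has shape (2, 3)
matching = np.ones((2, 3), dtype=np.float32)
transposed = np.ones((3, 2), dtype=np.float32)   # deliberately the wrong layout

# Matching shapes: the layer's weight is replaced by the new values
layer.weight = assign(layer.weight, matching)

# Mismatching shapes: the guard raises instead of silently loading bad weights
try:
    assign(layer.weight, transposed)
    error_message = None
except ValueError as err:
    error_message = str(err)

print("Error:", error_message)
```

The second call fails because the array has the transposed layout, which is exactly the kind of mistake that is easy to make when copying weights between frameworks that store linear layers differently.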

Next, we define a load_weights_into_gpt function that loads the weights from the params dictionary into a GPTModel instance gpt:


Listing 5.5 Loading OpenAI weights into our GPT model code

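import numpy as np  # import NumPy

def load_weights_into_gpt(gpt, params):  # copy OpenAI's weights into our GPTModel instance
    gpt.pos_emb.weight = assign(gpt.pos_emb.weight, params["wpe"])  #A assign the positional embedding weights
    gpt.tok_emb.weight = assign(gpt.tok_emb.weight, params["wte"])  #A assign the token embedding weights

    for b in range(len(params["blocks"])):  #B iterate over the transformer blocks
        q_w, k_w, v_w = np.split(  #C split the fused attention weights into query, key, and value
            params["blocks"][b]["attn"]["c_attn"]["w"], 3, axis=-1)
        gpt.trf_blocks[b].attn.W_query.weight = assign(
            gpt.trf_blocks[b].attn.W_query.weight, q_w.T)
        gpt.trf_blocks[b].attn.W_key.weight = assign(
            gpt.trf_blocks[b].attn.W_key.weight, k_w.T)
        gpt.trf_blocks[b].attn.W_value.weight = assign(
            gpt.trf_blocks[b].attn.W_value.weight, v_w.T)

        q_b, k_b, v_b = np.split(  # the bias is one-dimensional, so we split along the last axis
            params["blocks"][b]["attn"]["c_attn"]["b"], 3, axis=-1)
        gpt.trf_blocks[b].attn.W_query.bias = assign(
            gpt.trf_blocks[b].attn.W_query.bias, q_b)
        gpt.trf_blocks[b].attn.W_key.bias = assign(
            gpt.trf_blocks[b].attn.W_key.bias, k_b)
        gpt.trf_blocks[b].attn.W_value.bias = assign(
            gpt.trf_blocks[b].attn.W_value.bias, v_b)

        gpt.trf_blocks[b].attn.out_proj.weight = assign(  # attention output projection weights
            gpt.trf_blocks[b].attn.out_proj.weight, params["blocks"][b]["attn"]["c_proj"]["w"].T)
        gpt.trf_blocks[b].attn.out_proj.bias = assign(  # attention output projection bias
            gpt.trf_blocks[b].attn.out_proj.bias, params["blocks"][b]["attn"]["c_proj"]["b"])

        gpt.trf_blocks[b].ff.layers[0].weight = assign(  # first MLP layer weights
            gpt.trf_blocks[b].ff.layers[0].weight, params["blocks"][b]["mlp"]["c_fc"]["w"].T)
        gpt.trf_blocks[b].ff.layers[0].bias = assign(  # first MLP layer bias
            gpt.trf_blocks[b].ff.layers[0].bias, params["blocks"][b]["mlp"]["c_fc"]["b"])
        gpt.trf_blocks[b].ff.layers[2].weight = assign(  # second MLP layer weights
            gpt.trf_blocks[b].ff.layers[2].weight, params["blocks"][b]["mlp"]["c_proj"]["w"].T)
        gpt.trf_blocks[b].ff.layers[2].bias = assign(  # second MLP layer bias
            gpt.trf_blocks[b].ff.layers[2].bias, params["blocks"][b]["mlp"]["c_proj"]["b"])

        gpt.trf_blocks[b].norm1.scale = assign(  # first LayerNorm scale
            gpt.trf_blocks[b].norm1.scale, params["blocks"][b]["ln_1"]["g"])
        gpt.trf_blocks[b].norm1.shift = assign(  # first LayerNorm shift
            gpt.trf_blocks[b].norm1.shift, params["blocks"][b]["ln_1"]["b"])
        gpt.trf_blocks[b].norm2.scale = assign(  # second LayerNorm scale
            gpt.trf_blocks[b].norm2.scale, params["blocks"][b]["ln_2"]["g"])
        gpt.trf_blocks[b].norm2.shift = assign(  # second LayerNorm shift
            gpt.trf_blocks[b].norm2.shift, params["blocks"][b]["ln_2"]["b"])

    gpt.final_norm.scale = assign(gpt.final_norm.scale, params["g"])  # final LayerNorm scale
    gpt.final_norm.shift = assign(gpt.final_norm.shift, params["b"])  # final LayerNorm shift
    gpt.out_head.weight = assign(gpt.out_head.weight, params["wte"])  #D tie the output layer weights to the token embedding weights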
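Two details of the listing deserve a closer look: the fused c_attn tensor is split into three equal parts, and every weight matrix is transposed before it is assigned. The NumPy sketch below illustrates both with small random arrays. It assumes that the checkpoint stores weights in an (inputs, outputs) layout, as the transposes in the listing imply, whereas PyTorch's Linear layer stores them as (outputs, inputs). The embedding size of 4 is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(123)
emb_dim = 4  # toy embedding size for illustration

# Fused query/key/value projection in the assumed (inputs, outputs) layout
fused_w = rng.normal(size=(emb_dim, 3 * emb_dim))
fused_b = rng.normal(size=(3 * emb_dim,))

# Splitting along the last axis works for both the 2-D weight and the 1-D bias
q_w, k_w, v_w = np.split(fused_w, 3, axis=-1)
q_b, k_b, v_b = np.split(fused_b, 3, axis=-1)

x = rng.normal(size=(2, emb_dim))  # two toy input vectors

# Reference: the fused projection, from which we take the query part
fused_out = x @ fused_w + fused_b
q_ref = fused_out[:, :emb_dim]

# A PyTorch Linear layer computes x @ W.T + b with W stored as (outputs, inputs),
# so the weight we hand to it must be the transpose of the checkpoint slice
torch_style_weight = q_w.T
q_out = x @ torch_style_weight.T + q_b

print("Shapes:", q_w.shape, q_b.shape)
print("Query outputs match:", np.allclose(q_out, q_ref))
```

Using axis=-1 rather than axis=1 matters for the bias: a one-dimensional array has no axis 1, so splitting it along that axis would raise an error.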

In the load_weights_into_gpt function, we carefully match the weights from OpenAI’s implementation with our GPTModel implementation. To pick a specific example, OpenAI stored the weight tensor for the output projection layer of the first transformer block as params["blocks"][0]["attn"]["c_proj"]["w"]. In our implementation, this weight tensor corresponds to gpt.trf_blocks[b].attn.out_proj.weight with b = 0, where gpt is a GPTModel instance.


Developing the load_weights_into_gpt function took a lot of guesswork since OpenAI used a slightly different naming convention from ours. However, the assign function would alert us if we try to match two tensors with different dimensions. Also, if we made a mistake in this function, we would notice this as the resulting GPT model would be unable to produce coherent text.


Let’s now try load_weights_into_gpt out in practice and load the OpenAI model weights into our GPTModel instance gpt:


load_weights_into_gpt(gpt, params)  # load the weights into the GPT model
gpt.to(device)  # move the model to the device

If the model is loaded correctly, we can now use it to generate new text using our previous generate function:


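torch.manual_seed(123)  # set the random seed
token_ids = generate(  # generate new text
    model=gpt,
    idx=text_to_token_ids("Every effort moves you", tokenizer).to(device),
    max_new_tokens=25,
    context_size=NEW_CONFIG["context_length"],
    top_k=50,
    temperature=1.5
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))  # print the output text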

The resulting text is as follows:


Output text:


Every effort moves you toward finding an ideal new way to practice something! What makes us want to be on top of that?


We can be confident that we loaded the model weights correctly because the model can produce coherent text. A tiny mistake in this process would cause the model to fail.


In the following chapters, we will work further with this pretrained model and fine-tune it to classify text and follow instructions.



Exercise 5.5

Calculate the training and validation set losses of the GPTModel with the pretrained weights from OpenAI on the “The Verdict” dataset.



Exercise 5.6

Readers are encouraged to experiment with GPT-2 models of different sizes, for example, the largest 1558M-parameter model, and to compare the generated text to that of the 124M model we loaded in this chapter.


5.6 Summary


  • When LLMs generate text, they output one token at a time.
  • By default, the next token is generated by converting the model outputs into probability scores and selecting the token from the vocabulary that corresponds to the highest probability score, which is known as “greedy decoding.”
  • Using probabilistic sampling and temperature scaling, we can influence the diversity and coherence of the generated text.
  • Training and validation set losses can be used to gauge the quality of text generated by the LLM during training.
  • Pretraining an LLM involves changing its weights to minimize the training loss.
  • The training loop for LLMs itself is a standard procedure in deep learning, using a conventional cross entropy loss and AdamW optimizer.
  • Pretraining an LLM on a large text corpus is time- and resource-intensive, so we can load openly available weights from OpenAI as an alternative to pretraining the model on a large dataset ourselves.




