[LLM Inference] A Detailed Walkthrough of the Forward Inference Process in Large Language Models

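Table of Contents

Preparation
- Environment setup
- Downloading the model
- Qwen2-7B model architecture
- Configuring launch.json in VS Code
Debugging the forward pass in depth
- Predicting the first next_token
- Predicting the second next_token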

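To understand exactly how a large language model performs forward inference, this article takes Qwen2-7B-Instruct as an example and steps through the official inference example in a debugger.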

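Preparation

Environment setup

Pay particular attention to the version of the transformers library. This article uses transformers==4.44.2, and the source line numbers quoted later refer to that version.

Downloading the model

Download Qwen2-7B-Instruct:

# Install the huggingface-cli tool
pip install -U "huggingface_hub[cli]"

# Switch to a Hugging Face mirror
export HF_ENDPOINT=https://hf-mirror.com

# Download Qwen2-7B-Instruct to a local directory:
huggingface-cli download --resume-download Qwen/Qwen2-7B-Instruct --repo-type model --local-dir /data/models/Qwen/Qwen2-7B-Instruct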

Qwen2-7B model architecture

After loading the model, printing the structure of its Qwen2Model backbone gives:

Qwen2Model(
  (embed_tokens): Embedding(152064, 3584)
  (layers): ModuleList(
    (0-27): 28 x Qwen2DecoderLayer(
      (self_attn): Qwen2SdpaAttention(
        (q_proj): Linear(in_features=3584, out_features=3584, bias=True)
        (k_proj): Linear(in_features=3584, out_features=512, bias=True)
        (v_proj): Linear(in_features=3584, out_features=512, bias=True)
        (o_proj): Linear(in_features=3584, out_features=3584, bias=False)
        (rotary_emb): Qwen2RotaryEmbedding()
      )
      (mlp): Qwen2MLP(
        (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
        (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
        (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
        (act_fn): SiLU()
      )
      (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
      (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
    )
  )
  (norm): Qwen2RMSNorm((3584,), eps=1e-06)
)

The config.json file of Qwen2-7B-Instruct:

{
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 3584,
  "initializer_range": 0.02,
  "intermediate_size": 18944,
  "max_position_embeddings": 32768,
  "max_window_layers": 28,
  "model_type": "qwen2",
  "num_attention_heads": 28,
  "num_hidden_layers": 28,
  "num_key_value_heads": 4,
  "rms_norm_eps": 1e-06,
  "rope_theta": 1000000.0,
  "sliding_window": 131072,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.41.2",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 152064
}

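A few key points stand out:

- There are 28 hidden layers. With hidden_size=3584 and num_attention_heads=28, each attention head has dimension 3584/28=128. Because num_key_value_heads=4, every 7 query heads share one K/V head, so each layer has only 4 distinct K/V heads, and the k_proj and v_proj outputs have width 4×128=512.
- The maximum input length is 32768 (max_position_embeddings).
- The vocabulary size is 152064.
- The KV cache is enabled by default (use_cache=true).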
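The head arithmetic above can be checked with a few lines of Python. This is only a sketch built from the config.json values; the KV cache size estimate additionally assumes that keys and values are stored in bfloat16 (2 bytes per value), which the config suggests but the article does not measure.

```python
# Values copied from the config.json shown above.
hidden_size = 3584
num_attention_heads = 28
num_key_value_heads = 4
num_hidden_layers = 28
max_position_embeddings = 32768

head_dim = hidden_size // num_attention_heads                      # per-head dimension
kv_proj_out = num_key_value_heads * head_dim                       # width of k_proj / v_proj
queries_per_kv_head = num_attention_heads // num_key_value_heads   # query heads sharing one K/V head

# Assumption: the cache holds K and V in bfloat16, i.e. 2 bytes per value.
bytes_per_value = 2
kv_bytes_per_token = 2 * num_hidden_layers * num_key_value_heads * head_dim * bytes_per_value
kv_bytes_full_context = kv_bytes_per_token * max_position_embeddings

print(head_dim, kv_proj_out, queries_per_kv_head)          # 128 512 7
print(kv_bytes_per_token / 1024, "KiB per token")          # 56.0 KiB per token
print(kv_bytes_full_context / 1024**3, "GiB at 32768 tokens")  # 1.75 GiB at 32768 tokens
```

Under these assumptions, sharing K/V across 7 query heads makes the cache 7 times smaller than it would be with 28 separate K/V heads.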

That covers what we need to know about the model. Next, configure VS Code:

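Configuring launch.json in VS Code

This setup borrows from yuanzhoulvpi's work, which offers an elegant way to debug the code. In VS Code, create a launch.json file in the project directory (VS Code reads it from the .vscode folder) and paste in the following: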

{
  // Use IntelliSense to learn about possible attributes.
  // Hover to view descriptions of existing attributes.
  // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
  "version": "0.2.0",
  "configurations": [
    {
      "name": "sh_file_debug",
      "type": "debugpy",
      "request": "attach",
      "connect": {
        "host": "localhost",
        "port": 9501
      },
      "justMyCode": false
    }
  ]
}

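Note that justMyCode must be set to false; otherwise the debugger will not step into the library files installed in the environment.

Then add the following code at the top of the main script: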

import debugpy
try:
    # VS Code's default attach port is 5678; this example listens on 9501 to match launch.json. The host defaults to 127.0.0.1 when not specified.
    debugpy.listen(("localhost", 9501))
    print("Waiting for debugger attach")
    debugpy.wait_for_client()
except Exception as e:
    pass

The complete code, saved as main.py, is as follows:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Code source: https://github.com/yuanzhoulvpi2017/vscode_debug_transformers
import debugpy
try:
    # VS Code's default attach port is 5678; this example listens on 9501 to match launch.json. The host defaults to 127.0.0.1 when not specified.
    debugpy.listen(("localhost", 9501))
    print("Waiting for debugger attach")
    debugpy.wait_for_client()
except Exception as e:
    pass


model_name = "/root/autodl-tmp/renruilong/models/Qwen/Qwen2-7B-Instruct"
device = "cuda" # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Give me a short introduction to large language model."

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
) 
# '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nGive me a short introduction to large language model.<|im_end|>\n<|im_start|>assistant\n'

model_inputs = tokenizer([text], return_tensors="pt").to(device)
'''
{'input_ids': tensor([[151644,   8948,    198,   2610,    525,    264,  10950,  17847,     13,
         151645,    198, 151644,    872,    198,  35127,    752,    264,   2805,
          16800,    311,   3460,   4128,   1614,     13, 151645,    198, 151644,
          77091,    198]], device='cuda:0'), 
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1]], device='cuda:0')}
'''
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)

generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(response)

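Debugging the forward pass in depth

First, open three files:

- The script main.py from above
- ~/transformers/models/qwen2/modeling_qwen2.py in the installed transformers library, focusing on the def forward function at line 1057
- ~/transformers/generation/utils.py in the installed transformers library, focusing on the def _sample function at line 2888

Predicting the first next_token

First, set a breakpoint in utils.py.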

Then run main.py. When the terminal prints Waiting for debugger attach, click the debug button in VS Code to attach the debugger.

Wait for the model to load; execution then stops at the breakpoint.

Next, step to the line outputs = self(**model_inputs, return_dict=True) and pause there to inspect the arguments about to be passed to the model's forward function.

We can see that:

- The input token sequence has shape [1, 29].
- The KV cache is still empty.
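To make the role of the empty cache concrete, here is a toy sketch of how the cache length typically evolves when use_cache is enabled. It is not transformers code: forward_step is a hypothetical stand-in that only tracks how many positions are computed and how long the cache becomes, under the usual assumption that later steps feed only the newly generated token.

```python
def forward_step(new_token_ids, kv_cache_len):
    """Toy stand-in for one forward call (hypothetical helper, not a transformers API).

    Returns (positions computed in this call, cache length after the call).
    """
    computed = len(new_token_ids)
    return computed, kv_cache_len + computed


prompt_ids = list(range(29))  # placeholder ids; only the length (29) matters here
cache_len = 0                 # the cache is empty before the first forward pass

# First forward pass: all 29 prompt positions are computed and cached.
computed, cache_len = forward_step(prompt_ids, cache_len)
first = (computed, cache_len)

# Second forward pass: only the newly generated token is computed.
computed, cache_len = forward_step([0], cache_len)
second = (computed, cache_len)

print(first, second)  # (29, 29) (1, 30)
```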

Next, step into the forward function in modeling_qwen2.py, run to the line loss = None, and pause to inspect the outputs.

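We can see that:

- In the first forward pass, the hidden states have shape [1, 29, 3584], matching the input shape [1, 29] in batch size and sequence length. Of these 29 feature vectors, the last one is the feature from which the first next_token is predicted.
- After lm_head, the shape becomes [1, 29, 152064]: each position now holds one logit per vocabulary index.

Back in utils.py, run to the line if do_sample: and pause to inspect the variables.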

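We can see that:

- The model output at the last position is taken as next_token_logits.
- At this point next_token_logits still holds the raw values: the warper has not been applied and no token has been sampled yet.

Continuing, logits_warper(input_ids, next_token_scores) applies the temperature, top_p, and top_k settings: temperature rescales the scores, while top_k and top_p select the candidate tokens and set the scores of all other tokens to -inf, so that their probabilities become 0 after softmax.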
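The filtering step can be sketched in plain Python as follows. This is a simplified illustration of temperature scaling plus top-k and top-p filtering, not the transformers implementation; warp_logits and the example values are hypothetical, and the library's handling of ties and thresholds may differ in detail.

```python
import math


def warp_logits(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Simplified sketch: scale by temperature, then mask non-candidates to -inf."""
    scores = [x / temperature for x in logits]

    # Top-k: keep the k highest scores (ties at the threshold are also kept).
    if top_k > 0:
        kth = sorted(scores, reverse=True)[min(top_k, len(scores)) - 1]
        scores = [s if s >= kth else -math.inf for s in scores]

    # Top-p: keep the smallest set of top tokens whose cumulative probability reaches top_p.
    if top_p < 1.0:
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        peak = scores[order[0]]
        exps = [math.exp(s - peak) for s in scores]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        keep, cumulative = set(), 0.0
        for i in order:
            keep.add(i)
            cumulative += exps[i] / total
            if cumulative >= top_p:
                break
        scores = [s if i in keep else -math.inf for i, s in enumerate(scores)]

    return scores


result = warp_logits([2.0, 1.0, 0.5, -1.0], temperature=1.0, top_k=3, top_p=0.8)
print(result)  # [2.0, 1.0, -inf, -inf]
```

In this toy run, top_k=3 removes the lowest score, and top_p=0.8 then keeps only the two most likely tokens, whose combined probability is about 0.86.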

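Next, softmax is applied to the scores and one of the candidates is sampled as the final next_token. Only indices whose next_token_scores were not set to -inf can be chosen. Here there are two candidate tokens, 32 and 34253; their scores are tensor([35.8929, 34.4643], device='cuda:0'), and the probabilities after softmax are tensor(0.8067, device='cuda:0') and tensor(0.1933, device='cuda:0').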
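The two probabilities can be reproduced by hand. The sketch below recomputes the softmax over the two surviving scores quoted above and then draws one token; the seeded random.Random is an illustrative stand-in for the sampling done inside the library, so the token it draws is not necessarily the one from the debugging session.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


candidate_ids = [32, 34253]             # the two tokens that survived filtering
candidate_scores = [35.8929, 34.4643]   # their scores, as quoted in the text

probs = softmax(candidate_scores)
print([round(p, 4) for p in probs])  # [0.8067, 0.1933]

# Illustrative sampling step: draw one candidate according to its probability.
rng = random.Random(0)
next_token = rng.choices(candidate_ids, weights=probs, k=1)[0]
print(next_token)
```

Tokens masked to -inf contribute exp(-inf) = 0 to the softmax sum, which is why only the two candidates need to be considered here.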
