Understanding Large Models Through the FasterTransformer Source Code (2.4): Code Walkthrough 05 - The ContextDecoder Forward Pass, Part 01

Preface

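The ContextDecoder is the component layer that handles the input portion of a request. It processes all of the input ids, computes attention (producing the KV cache along the way), and computes the FFN. Once the whole input has been processed, the first token of the output can be generated.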

0. Buffers and helper functions of the ContextDecoder

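The file src/fastertransformer/models/multi_gpu_gpt/ParallelGptContextDecoder.cc contains the functions and overall structure of the GPT ContextDecoder. The initialize function, at lines 25 to 85, gives a first look at how the ContextDecoder is put together: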

template<typename T>
void ParallelGptContextDecoder<T>::initialize()
{
    FT_LOG_DEBUG(__PRETTY_FUNCTION__);
    self_attention_layer_ = new TensorParallelGptContextAttentionLayer<T>(max_batch_size_,
                                                                          max_seq_len_,
                                                                          head_num_,
                                                                          size_per_head_,
                                                                          tensor_para_,
                                                                          stream_,
                                                                          cublas_wrapper_,
                                                                          allocator_,
                                                                          true,
                                                                          is_free_buffer_after_forward_,
                                                                          is_qk_buf_float_,
                                                                          sparse_,
                                                                          int8_mode_,
                                                                          custom_all_reduce_comm_,
                                                                          enable_custom_all_reduce_);

    bool use_gated_activation = activation_type_ == ActivationType::GeGLU || activation_type_ == ActivationType::ReGLU;
    size_t max_inter_size     = has_adapters_ ? std::max(inter_size_, adapter_inter_size_) : inter_size_;
    if (activation_type_ == ActivationType::Gelu || activation_type_ == ActivationType::GeGLU) {
        ffn_layer_ = new TensorParallelGeluFfnLayer<T>(max_batch_size_,
                                                       max_seq_len_,
                                                       head_num_,
                                                       size_per_head_,
                                                       expert_num_,  // expert_num
                                                       max_inter_size,
                                                       tensor_para_,
                                                       stream_,
                                                       cublas_wrapper_,
                                                       allocator_,
                                                       true,
                                                       is_free_buffer_after_forward_,
                                                       sparse_,
                                                       int8_mode_,
                                                       use_gated_activation,
                                                       custom_all_reduce_comm_,
                                                       enable_custom_all_reduce_);
    }
    else if (activation_type_ == ActivationType::Relu || activation_type_ == ActivationType::ReGLU) {
        ffn_layer_ = new TensorParallelReluFfnLayer<T>(max_batch_size_,
                                                       max_seq_len_,
                                                       head_num_,
                                                       size_per_head_,
                                                       expert_num_,  // expert_num
                                                       max_inter_size,
                                                       tensor_para_,
                                                       stream_,
                                                       cublas_wrapper_,
                                                       allocator_,
                                                       true,
                                                       is_free_buffer_after_forward_,
                                                       sparse_,
                                                       int8_mode_,
                                                       use_gated_activation,
                                                       custom_all_reduce_comm_,
                                                       enable_custom_all_reduce_);
    }
}

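The ContextDecoder consists mainly of one attention layer and one FFN layer. The attention layer computes the attention scores, while the FFN layer uses matrix multiplications to project the hidden state up to a higher dimension and back down again, applying the activation in the higher dimension. Because the model configuration may call for different activation functions, initialize keeps a separate FFN variant for each of them.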
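To make the "project up, activate, project down" description concrete, here is a minimal CPU sketch for a single token. It is only an illustration under stated assumptions: the names `Activation`, `activate`, `matvec`, and `ffn_forward` are hypothetical and are not FasterTransformer's API, biases are omitted, and the gated form (used for GeGLU and ReGLU) is assumed to be act(W_up x) multiplied elementwise by (W_gate x).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical names for illustration only; not FasterTransformer's API.
enum class Activation { Relu, Gelu };

float activate(float x, Activation act)
{
    if (act == Activation::Relu) {
        return x > 0.0f ? x : 0.0f;
    }
    // tanh approximation of GELU
    const float k = 0.7978845608f;  // sqrt(2 / pi)
    return 0.5f * x * (1.0f + std::tanh(k * (x + 0.044715f * x * x * x)));
}

// y = W * x, with W stored row-major as rows x cols.
std::vector<float> matvec(const std::vector<float>& w,
                          const std::vector<float>& x,
                          std::size_t               rows,
                          std::size_t               cols)
{
    assert(w.size() == rows * cols && x.size() == cols);
    std::vector<float> y(rows, 0.0f);
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            y[r] += w[r * cols + c] * x[c];
        }
    }
    return y;
}

// One token through the FFN: up-project to `inter`, activate, down-project.
// If w_gate is non-null, the gated form act(W_up x) * (W_gate x) is used
// (assumption: this is what "gated activation" means here; biases omitted).
std::vector<float> ffn_forward(const std::vector<float>& x,
                               const std::vector<float>& w_up,
                               const std::vector<float>* w_gate,
                               const std::vector<float>& w_down,
                               std::size_t               inter,
                               Activation                act)
{
    const std::size_t  hidden = x.size();
    std::vector<float> h      = matvec(w_up, x, inter, hidden);
    for (float& v : h) {
        v = activate(v, act);
    }
    if (w_gate != nullptr) {
        const std::vector<float> g = matvec(*w_gate, x, inter, hidden);
        for (std::size_t i = 0; i < inter; ++i) {
            h[i] *= g[i];
        }
    }
    return matvec(w_down, h, hidden, inter);
}
```

Passing `nullptr` for `w_gate` gives the plain Relu or Gelu FFN; passing a gate matrix gives the ReGLU or GeGLU style, which mirrors how `use_gated_activation` is derived from `activation_type_` in initialize.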

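Lines 93 to 147 allocate the buffers used by the ContextDecoder. The purpose of some buffers is clear from their names: decoder_normed_input stores the normalized input, and normed_self_attn_output stores the normalized output of the attention module. Lines 149 to 183 are the freeBuffer function, which releases the buffers allocated above.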

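Lines 185 to 212 contain a set of functions for checking layer ids. Why is this needed? We previously introduced the PP architecture, i.e. Pipeline Parallel, which assigns the layers of one complete model to different machine nodes. (Suppose we have an 80-layer llama2-70b: we could deploy four GPU machines and give each of them 20 layers, so that a model too large for a single card can run across several.) The functions at lines 185 to 212 determine which layers the current node actually has to run.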
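The layer-id checks described above can be sketched as a small standalone struct. This is a simplified illustration, not the source itself: `PipelineLayout` is a hypothetical wrapper, the method names are assumptions modeled on the helpers being discussed, and layers are assumed to be split into equal contiguous chunks of ceil(num_layer / world_size).

```cpp
#include <cassert>
#include <cmath>

// Hypothetical standalone struct; the real class keeps these values as members.
struct PipelineLayout {
    int num_layer;   // total number of decoder layers in the model
    int world_size;  // number of pipeline stages
    int rank;        // index of this stage, 0-based

    // Layers per stage, rounded up so every layer is owned by some stage.
    int localNumLayer() const
    {
        assert(world_size > 0);
        return static_cast<int>(std::ceil(num_layer * 1.0f / world_size));
    }

    // True if layer l is one of the layers this stage has to run.
    bool isValidLayerParallelId(int l) const
    {
        const int n = localNumLayer();
        return l >= 0 && l < num_layer && l >= n * rank && l < n * (rank + 1);
    }

    // True if layer l is the first layer owned by this stage.
    bool isFirstLayerParallelId(int l) const
    {
        return l < num_layer && l == localNumLayer() * rank;
    }

    // True if layer l is the last layer owned by this stage.
    bool isLastLayerParallelId(int l) const
    {
        return l < num_layer && l == localNumLayer() * (rank + 1) - 1;
    }

    int getFirstLayerParallelId() const
    {
        return localNumLayer() * rank;
    }
};
```

With the 80-layer, 4-machine example from the text, `PipelineLayout{80, 4, 1}` owns layers 20 through 39: the layer loop can skip every other layer id, and the first and last owned layers are where communication with the neighboring stages takes place.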

Lines 215 to 300 are the class's constructors and destructor, which are not covered in detail here.

1. The forward pass: shared context

The actual forward inference starts at line 303.

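First, all the inputs and outputs needed for the computation are already packed as tensors in output_tensors and input_tensors. Lines 327 to 349 pull out the buffers that are needed: the decoder input, the mask input, the input lengths, the output buffer, the KV cache, and so on.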

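One technique worth noting appears at lines 344 and 358: shared context. The idea is simple. In some chat models, user inputs often begin with a fixed prefix, so recomputing attention over that shared prefix for every input is redundant work. The inputs can instead be managed in a way similar to a prefix tree: when an input with a shared prefix arrives, only the leaf part of the tree is computed, and the trunk reuses what has already been computed.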
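The bookkeeping behind this idea can be sketched for the simplest case. This is only an assumption-laden illustration: it treats two batch entries as "shared" when their whole prompts are identical, and `CompactMapping` and `build_compact_mapping` are hypothetical names, not functions from the source. The result is a list of the unique entries that still need computing plus a map telling every batch entry which unique result to reuse.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical helper types; not taken from the FasterTransformer source.
struct CompactMapping {
    std::vector<int> compact_idx;       // compact slot -> first batch entry with that prompt
    std::vector<int> batch_to_compact;  // batch entry  -> compact slot whose result it reuses
};

// Assumption: two entries share context only if their token ids are identical.
CompactMapping build_compact_mapping(const std::vector<std::vector<int>>& prompts)
{
    CompactMapping                  m;
    std::map<std::vector<int>, int> seen;  // prompt -> compact slot
    for (std::size_t b = 0; b < prompts.size(); ++b) {
        auto it = seen.find(prompts[b]);
        if (it == seen.end()) {
            // First time this prompt appears: it gets its own compact slot.
            const int slot = static_cast<int>(m.compact_idx.size());
            seen.emplace(prompts[b], slot);
            m.compact_idx.push_back(static_cast<int>(b));
            m.batch_to_compact.push_back(slot);
        }
        else {
            // Duplicate prompt: reuse the slot computed for the earlier entry.
            m.batch_to_compact.push_back(it->second);
        }
    }
    assert(m.batch_to_compact.size() == prompts.size());
    return m;
}
```

Only `compact_idx.size()` sequences then go through the expensive attention and FFN computation, and the results are scattered back to the full batch through `batch_to_compact`.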

Line 359 launches a kernel that handles the shared prefix. Its implementation is at lines 736 to 770 of gpt_kernels.cu:

template<typename T>
__global__ void compact_inputs(T*         compact_input,
                               T*         compact_attention_mask,
                               int*       compact_input_lengths,
                               const T*   decoder_input,
                               const T*   decoder_mask,
                               const int* input_lengths,
                               const int* compact_idx,
                               size_t     compact_size,
                               size_t     seq_len,
                               size_t     hidden_dimension)
{
    const int global_idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (global_idx < compact_size * seq_len * hidden_dimension) {
        const int h_id     = global_idx % hidden_dimension;
        const int seq_id   = (global_idx / hidden_dimension) % seq_len;
        const int batch_id = global_idx / (hidden_dimension * seq_len);

        compact_input[global_idx] = decoder_input[(compact_idx[batch_id] * seq_len + seq_id) * hidden_dimension + h_id];
    }

    if (global_idx < compact_size * seq_len * seq_len) {
        const int seq1_id  = global_idx % seq_len;
        const int seq2_id  = (global_idx / seq_len) % seq_len;
        const int batch_id = global_idx / (seq_len * seq_len);

        compact_attention_mask[global_idx] =
            decoder_mask[(compact_idx[batch_id] * seq_len + seq2_id) * seq_len + seq1_id];
    }

    if (global_idx < compact_size) {
        compact_input_lengths[global_idx] = input_lengths[compact_idx[global_idx]];
    }
}

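As the code shows, for each slot of the compact batch the kernel copies the input, attention mask, and input length of the original batch entry selected by compact_idx, so that only the entries that still need to be computed end up in the compact buffers. It is a pure IO kernel: it moves data and performs no arithmetic.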
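The index arithmetic of the kernel above can be checked against a plain CPU version. The following is a reference sketch only, assuming float data and `std::vector` storage; `compact_inputs_cpu` is a hypothetical name, and the three flat-index branches of the kernel are rewritten as one loop per compact slot.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for the gather performed by compact_inputs (float only).
// Hypothetical helper for illustration; not part of FasterTransformer.
void compact_inputs_cpu(std::vector<float>&       compact_input,
                        std::vector<float>&       compact_attention_mask,
                        std::vector<int>&         compact_input_lengths,
                        const std::vector<float>& decoder_input,
                        const std::vector<float>& decoder_mask,
                        const std::vector<int>&   input_lengths,
                        const std::vector<int>&   compact_idx,
                        std::size_t               seq_len,
                        std::size_t               hidden_dimension)
{
    const std::size_t compact_size = compact_idx.size();
    const std::size_t in_stride    = seq_len * hidden_dimension;  // elements per input entry
    const std::size_t mask_stride  = seq_len * seq_len;           // elements per mask entry
    assert(decoder_input.size() % in_stride == 0);
    assert(decoder_mask.size() % mask_stride == 0);

    compact_input.resize(compact_size * in_stride);
    compact_attention_mask.resize(compact_size * mask_stride);
    compact_input_lengths.resize(compact_size);

    for (std::size_t c = 0; c < compact_size; ++c) {
        // Original batch entry that compact slot c is copied from.
        const std::size_t src = static_cast<std::size_t>(compact_idx[c]);
        for (std::size_t i = 0; i < in_stride; ++i) {
            compact_input[c * in_stride + i] = decoder_input[src * in_stride + i];
        }
        for (std::size_t i = 0; i < mask_stride; ++i) {
            compact_attention_mask[c * mask_stride + i] = decoder_mask[src * mask_stride + i];
        }
        compact_input_lengths[c] = input_lengths[src];
    }
}
```

The GPU kernel flattens the same three copies into one thread index space, which is why a single launch has to cover the largest of the three element counts.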

2. The forward pass: preparation before the attention computation

Back in the ContextDecoder, we can simplify things by assuming there is no prefix tree and continue reading the forward function.

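After a series of computations that set up the KV cache and attention parameters, line 406 enters an outer loop over ite. (If the batch is too large and the maximum batch that can be processed at once is limited, the batch has to be split and handled over several iterations.) At line 409, if padding is present, its effect has to be accounted for, because the attention computation is tightly tied to token positions. Once the padded positions have been handled, the loop over all layers starts at line 421. Lines 428 to 455 set up the buffers differently depending on whether the current layer is the first or the last layer. At line 457, if the current node is not the first node of the PP pipeline, it also has to receive the previous node's result through NCCL communication. If the model is additionally split with TP, the slices received by the TP ranks must be reassembled with an all-gather across those ranks.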
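The padding handling mentioned above can be illustrated with a small CPU sketch. It is a minimal example under stated assumptions, not the kernel from the source: `PaddingInfo` and `compute_padding_info` are hypothetical names, and the two outputs are modeled on the `padding_offset` and `cu_seqlens` tensors that are later passed to the attention layer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper types for illustration; not FasterTransformer's API.
struct PaddingInfo {
    std::vector<int> padding_offset;  // per valid token: number of pad slots before it
    std::vector<int> cu_seqlens;      // batch_size + 1 cumulative sequence lengths
    std::size_t      token_num = 0;   // total number of valid (non-pad) tokens
};

// Every sequence is assumed to be padded on the right up to max_seq_len.
PaddingInfo compute_padding_info(const std::vector<int>& input_lengths, int max_seq_len)
{
    PaddingInfo info;
    int         total      = 0;  // valid tokens seen so far
    int         cum_offset = 0;  // pad slots seen so far
    for (int len : input_lengths) {
        assert(len >= 0 && len <= max_seq_len);
        info.cu_seqlens.push_back(total);
        for (int j = 0; j < len; ++j) {
            info.padding_offset.push_back(cum_offset);
        }
        cum_offset += max_seq_len - len;
        total += len;
    }
    info.cu_seqlens.push_back(total);
    info.token_num = static_cast<std::size_t>(total);
    return info;
}
```

For a valid token at packed position i, `i + padding_offset[i]` is its position in the padded `[batch, max_seq_len]` layout, so tokens can be packed before the heavy computation and restored afterwards.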

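Line 496 is where the input parameters of the attention layer are actually assembled: the required input and mask, the attention type, the layer_id used for debugging information, and so on. At line 523, if ALiBi is configured, the ALiBi parameters (linear_bias_slopes) are also inserted into the inputs.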

TensorMap self_attention_input_tensors{
                {"input_query",
                 Tensor{MEMORY_GPU,
                        activation_in_type,
                        {h_token_num, hidden_units_},
                        layernorm_type_ == LayerNormType::pre_layernorm ? decoder_normed_input_ : decoder_input}},
                {"attention_mask",
                 Tensor{MEMORY_GPU,
                        data_type,
                        {local_batch_size, 1, seq_len, seq_len},
                        attention_ptr + local_batch_size * ite * seq_len * seq_len}},
                {"attention_type", Tensor{MEMORY_CPU, TYPE_VOID, {1}, &attention_type}},
                {"is_final_layer", Tensor{MEMORY_CPU, TYPE_BOOL, {1}, &is_final}},
                {"layer_id", Tensor{MEMORY_CPU, TYPE_INT32, {(size_t)1}, &l}}};

            if (is_unpadded_mha) {
                self_attention_input_tensors.insert("padding_offset",
                                                    Tensor{MEMORY_GPU, TYPE_INT32, {h_token_num}, padding_offset_});
                self_attention_input_tensors.insert(
                    "cu_seqlens", Tensor{MEMORY_GPU, TYPE_INT32, {size_t(local_batch_size + 1)}, cu_seqlens_});
            }

            /* if (dynamic_quant_) { */
            /*     self_attention_input_tensors.insert("attention_query_dynamic_scale", */
            /*         Tensor{MEMORY_GPU, TYPE_FP32, {h_token_num}, attention_query_dynamic_scale_}); */
            /* } */

            if (input_tensors->isExist("linear_bias_slopes")) {
                self_attention_input_tensors.insert("linear_bias_slopes", input_tensors->at("linear_bias_slopes"));
            }
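The pointer arithmetic used for attention_mask in the tensor map above, `attention_ptr + local_batch_size * ite * seq_len * seq_len`, can be sketched in isolation. The helpers below are hypothetical and assume that the full batch is an exact multiple of `local_batch_size` and that the mask is stored contiguously as `[batch, 1, seq_len, seq_len]`.

```cpp
#include <cassert>
#include <cstddef>

// Element offset of micro-batch `ite` inside a contiguous
// [batch, 1, seq_len, seq_len] attention mask. Hypothetical helper.
std::size_t attention_mask_offset(std::size_t local_batch_size, std::size_t ite, std::size_t seq_len)
{
    // Each batch entry owns seq_len * seq_len mask elements, and each
    // iteration consumes local_batch_size entries.
    return local_batch_size * ite * seq_len * seq_len;
}

// Number of iterations of the outer ite loop, assuming the batch divides evenly.
std::size_t iteration_count(std::size_t batch_size, std::size_t local_batch_size)
{
    assert(local_batch_size > 0 && batch_size % local_batch_size == 0);
    return batch_size / local_batch_size;
}
```

For example, with a batch of 8 split into micro-batches of 2 and a sequence length of 4, there are 4 iterations, and iteration 3 reads its mask starting 96 elements into the buffer.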

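At line 539, very few outputs actually need to be collected: the main output, which is used next for add_bias_norm, and the KV cache produced by the attention computation. After that, the attention layer is called directly to run the forward computation.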

TensorMap self_attention_output_tensors{
                {"hidden_features",
                 Tensor{MEMORY_GPU, activation_out_type, {h_token_num, hidden_units_}, self_attn_output_}},
                {"key_cache", Tensor{MEMORY_GPU, data_type, self_k_cache_size, k_cache_ptr}},
                {"value_cache", Tensor{MEMORY_GPU, data_type, self_v_cache_size, v_cache_ptr}}};

            self_attention_layer_->forward(
                &self_attention_output_tensors, &self_attention_input_tensors, &layer_weight->self_attention_weights);