8.6. tensorRT Advanced (3) Wrapper Series: The Final Wrapper Form and Issues to Consider

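Preface

I went through Mr. Du's course "tensorRT High-Performance Deployment from Scratch" once before, but I took no notes and have forgotten much of it. This time I am working through it again and taking notes along the way.

This lesson covers tensorRT Advanced: the final wrapper form and the issues to consider.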

The course outline is shown in the mind map below.

[Image: mind map of the course outline]

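1. The Final Wrapper

In this section we study the final form of the tensorRT wrapper.

Let's go straight to the example. First, we look at the shortcomings of the yolov5 wrapper from the previous lesson: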

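1. Preprocessing uses the CPU version of warpAffine

2. Postprocessing uses the CPU version of decode

3. commit produces jobs far faster than the consumer processes them. This rate gap lets the queue pile up and memory usage grow until the program can no longer run for long periods, so a mechanism that caps the queue length is needed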

Let's see what the final wrapper looks like, starting with the file structure. Under the src folder the layout is:

src:.
├─app_yolo
└─tensorRT
    ├─builder
    ├─common
    ├─infer
    ├─onnx
    ├─onnxplugin
    │  └─plugins
    └─onnx_parser

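The builder folder under tensorRT wraps trt_builder, i.e. the tensorRT model compilation process. Compared with the earlier builder, this version is more complete. Let's first look at how the compile function has changed. Its definition is: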

bool compile(
    Mode mode,
    unsigned int maxBatchSize,
    const ModelSource& source,
    const CompileOutput& saveto,
    const std::vector<InputDims> inputsDimsSetup = {},
    Int8Process int8process = nullptr,
    const std::string& int8ImageDirectory = "",
    const std::string& int8EntropyCalibratorFile = "",
    const size_t maxWorkspaceSize = 1ul << 30                // 1ul << 30 = 1GB
);

First, the model source in compile is now a ModelSource type, defined as follows:

enum class ModelSourceType : int{
    OnnX,
    OnnXData
};

class ModelSource {
    public:
    ModelSource() = default;
    ModelSource(const std::string& onnxmodel);
    ModelSource(const char* onnxmodel);
    ModelSourceType type() const;
    std::string onnxmodel() const;
    std::string descript() const;
    const void* onnx_data() const;
    size_t onnx_data_size() const;

    static ModelSource onnx(const std::string& file){
        ModelSource output;
        output.onnxmodel_  = file;
        output.type_       = ModelSourceType::OnnX;
        return output;
    }

    static ModelSource onnx_data(const void* ptr, size_t size){
        ModelSource output;
        output.onnx_data_      = ptr;
        output.onnx_data_size_ = size;
        output.type_           = ModelSourceType::OnnXData;
        return output;
    }

    private:
    std::string onnxmodel_;
    const void* onnx_data_ = nullptr;
    size_t onnx_data_size_ = 0;
    ModelSourceType type_;
};

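ModelSource lets the model come either from an onnx file or from onnx data held in memory. Why have ModelSource at all? Mainly to make it easy for users to customize where the model comes from: models are not limited to onnx and may also come in formats such as caffe or uff, and users can extend ModelSource to accept those formats as input.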
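As a rough illustration of why a tagged source type is convenient, the sketch below shows how a builder could branch on the source kind. `ModelSourceSketch` and `describe_source` are hypothetical stand-ins written for this note, not part of the wrapper; only the `ModelSourceType` enum mirrors the definition above.

```cpp
#include <cstddef>
#include <string>

// Mirrors the enum shown above.
enum class ModelSourceType : int { OnnX, OnnXData };

// Hypothetical, trimmed-down stand-in for ModelSource (illustration only).
struct ModelSourceSketch {
    ModelSourceType type = ModelSourceType::OnnX;
    std::string onnxmodel;           // used when type == OnnX
    const void* onnx_data = nullptr; // used when type == OnnXData
    size_t onnx_data_size = 0;
};

// Hypothetical helper: report which parsing path a builder would take.
// Supporting another format would mean adding an enum value and a case here.
std::string describe_source(const ModelSourceSketch& source) {
    switch (source.type) {
        case ModelSourceType::OnnX:
            return "parse from file: " + source.onnxmodel;
        case ModelSourceType::OnnXData:
            return "parse from memory: " + std::to_string(source.onnx_data_size) + " bytes";
    }
    return "unknown source";
}
```

The caller only ever hands over one object, so adding a new input format does not change the signature of compile.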

Next, the output is now a CompileOutput type, defined as follows:

enum class CompileOutputType : int{
    File,
    Memory
};

class CompileOutput{
public:
    CompileOutput(CompileOutputType type = CompileOutputType::Memory);
    CompileOutput(const std::string& file);
    CompileOutput(const char* file);
    void set_data(const std::vector<uint8_t>& data);
    void set_data(std::vector<uint8_t>&& data);

    const std::vector<uint8_t>& data() const{return data_;};
    CompileOutputType type() const{return type_;}
    std::string file() const{return file_;}

private:
    CompileOutputType type_ = CompileOutputType::Memory;
    std::vector<uint8_t> data_;
    std::string file_;
};

CompileOutput gives you two output targets: a file or memory.
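A minimal sketch of how the two targets might be handled once the engine has been serialized. `OutputSketch` and `store_engine` are hypothetical names used only for illustration; the real wrapper may organize this differently.

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

enum class OutputKind { File, Memory };

// Hypothetical stand-in for CompileOutput (illustration only).
struct OutputSketch {
    OutputKind kind = OutputKind::Memory;
    std::string file;           // target path when kind == File
    std::vector<uint8_t> data;  // filled when kind == Memory
};

// Hypothetical helper: store serialized engine bytes in the chosen target.
bool store_engine(OutputSketch& out, const std::vector<uint8_t>& bytes) {
    if (out.kind == OutputKind::Memory) {
        out.data = bytes;  // keep the engine in memory for immediate use
        return true;
    }
    std::ofstream stream(out.file, std::ios::binary);
    if (!stream) return false;  // could not open the target file
    stream.write(reinterpret_cast<const char*>(bytes.data()),
                 static_cast<std::streamsize>(bytes.size()));
    return stream.good();
}
```

The memory target is handy when the engine is built and loaded in the same process, since it avoids a round trip through the file system.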

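Third, compile takes an inputsDimsSetup parameter, which lets you change the model's batch dimension at compile time. Suppose the onnx model you exported has an input shape of 1x3x640x640, i.e. a static batch, but you want the compiled engine to use -1x3x640x640, i.e. a dynamic batch. You only need to pass that shape through this parameter when compiling.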
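In TensorRT, an input with a dynamic dimension also needs an optimization profile giving minimum, optimal and maximum shapes. The sketch below derives those shapes from a static shape and a maximum batch size. `BatchProfile` and `make_batch_profile` are hypothetical, and the choice of min batch 1 with opt equal to max is an assumption made for illustration, not something the text above states.

```cpp
#include <vector>

// Hypothetical container for the shapes a dynamic-batch build needs.
struct BatchProfile {
    std::vector<int> network_dims;  // shape declared on the network input (-1 = dynamic)
    std::vector<int> min_dims;      // smallest shape the engine must accept
    std::vector<int> opt_dims;      // shape the engine is tuned for
    std::vector<int> max_dims;      // largest shape the engine must accept
};

// Hypothetical helper: turn a static shape such as 1x3x640x640 into a
// dynamic-batch setup. Assumption: min batch = 1, opt batch = max batch.
BatchProfile make_batch_profile(std::vector<int> dims, int max_batch) {
    BatchProfile profile;
    if (dims.empty()) return profile;
    dims[0] = -1;
    profile.network_dims = dims;
    dims[0] = 1;
    profile.min_dims = dims;
    dims[0] = max_batch;
    profile.opt_dims = dims;
    profile.max_dims = dims;
    return profile;
}
```

For example, `make_batch_profile({1, 3, 640, 640}, 16)` yields a network shape of -1x3x640x640 with batch sizes from 1 to 16 allowed at run time.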

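Fourth is the int8process function, which shows that compile also supports int8 builds. Compared with the earlier builder wrapper, this one is somewhat more complex and also more complete. That covers what builder provides. Next we look at common: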

common:.
    ├─cuda_tools.cpp
    ├─cuda_tools.hpp
    ├─ilogger.cpp
    ├─ilogger.hpp
    ├─infer_controller.hpp
    ├─json.cpp
    ├─json.hpp
    ├─monopoly_allocator.hpp
    ├─preprocess_kernel.cu
    ├─preprocess_kernel.cuh
    ├─trt_tensor.cpp
    └─trt_tensor.hpp

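First is cuda_tools, the small CUDA utilities from before such as check runtime and check kernel. Next is ilogger, which provides many commonly used helper functions and effectively serves as a utils class. Then comes infer_controller, a wrapper around the consumer: since much of that code is repeated from model to model, the commonly used parts are wrapped up here.

Next is json, a third-party library for parsing json files. monopoly_allocator is an exclusive (monopoly) allocator. The core problem it solves is the missing upper bound on the queue: previously we imposed the limit with condition_variable, and now we impose it through the monopoly allocator, which also solves the tensor reuse problem.

Further down is preprocess_kernel, the preprocessing kernels, and finally trt_tensor, which is the same as the earlier wrapper and consists of two parts, MixMemory and tensor.

That covers common. Now for the infer part: infer is the RAII + interface-pattern wrapper we wrote before.

Further down is the onnx folder, which holds the few cpp files the onnx parser depends on. They are generated from onnx.proto.

Then there is onnx_parser, the parser code, followed by onnxplugin, which factors out ONNXEasyPlugin and also provides a few simple plugin examples that are worth reading.

That is the whole tensorRT wrapper.

Now let's look at the yolo part: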

app_yolo:.
    ├─object_detector.hpp
    ├─yolo_decode.cu
    ├─yolo.cpp
    └─yolo.hpp

Let's start with yolo.hpp. Because V5, X and V3 share the same postprocessing, a single set of code can support all three models:

enum class Type : int{
    V5 = 0,
    X  = 1,
    V3 = V5
};

Next, NMS comes in a CPU version for testing and a GPU version for actual use:

enum class NMSMethod : int{
    CPU = 0,         // General, for estimate mAP
    FastGPU = 1      // Fast NMS with a small loss of accuracy in corner cases
};
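For reference, a plain CPU NMS can be sketched as below. This is a generic greedy NMS written for this note, not the wrapper's implementation, and the `Box` fields and the per-class suppression rule are assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical box type for this sketch.
struct Box {
    float left, top, right, bottom, confidence;
    int label;
};

// Intersection over union of two axis-aligned boxes.
float iou(const Box& a, const Box& b) {
    float cross_left   = std::max(a.left, b.left);
    float cross_top    = std::max(a.top, b.top);
    float cross_right  = std::min(a.right, b.right);
    float cross_bottom = std::min(a.bottom, b.bottom);
    float cross_area = std::max(0.0f, cross_right - cross_left) *
                       std::max(0.0f, cross_bottom - cross_top);
    if (cross_area == 0.0f) return 0.0f;
    float area_a = (a.right - a.left) * (a.bottom - a.top);
    float area_b = (b.right - b.left) * (b.bottom - b.top);
    return cross_area / (area_a + area_b - cross_area);
}

// Greedy NMS: visit boxes by descending confidence and drop any later box
// of the same class whose IoU with a kept box reaches the threshold.
std::vector<Box> cpu_nms(std::vector<Box> boxes, float threshold) {
    std::sort(boxes.begin(), boxes.end(),
              [](const Box& a, const Box& b) { return a.confidence > b.confidence; });
    std::vector<bool> removed(boxes.size(), false);
    std::vector<Box> kept;
    for (std::size_t i = 0; i < boxes.size(); ++i) {
        if (removed[i]) continue;
        kept.push_back(boxes[i]);
        for (std::size_t j = i + 1; j < boxes.size(); ++j) {
            if (removed[j] || boxes[j].label != boxes[i].label) continue;
            if (iou(boxes[i], boxes[j]) >= threshold) removed[j] = true;
        }
    }
    return kept;
}
```

A sequential version like this is easy to reason about, which is why a CPU path is useful as a reference when measuring mAP.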

Next is the Infer wrapper, an interface class:

class Infer{
public:
    virtual shared_future<BoxArray> commit(const cv::Mat& image) = 0;
    virtual vector<shared_future<BoxArray>> commits(const vector<cv::Mat>& images) = 0;
};

shared_ptr<Infer> create_infer(
    const string& engine_file, Type type, int gpuid,
    float confidence_threshold=0.25f, float nms_threshold=0.5f,
    NMSMethod nms_method = NMSMethod::FastGPU, int max_objects = 1024,
    bool use_multi_preprocess_stream = false
);

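create_infer now additionally takes the model Type (V3, V5 or X), the choice of nms method, and the max_objects setting.

That is everything in yolo.hpp.

Now let's look at what differs in yolo.cpp. There is a new ControllerImpl, an instantiation of a class template. It is the InferController wrapper of the consumer model mentioned earlier, and it saves us from writing a lot of repetitive threading code, such as starting the thread, stopping the thread, adding jobs to the queue, and fetching jobs from the queue.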

using ControllerImpl = InferController
    <
        Mat,                    // input
        BoxArray,               // output
        tuple<string, int>,     // start param
        AffineMatrix            // additional
    >;
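To make the idea concrete, here is a much reduced sketch of such a consumer wrapper. `ControllerSketch` is a hypothetical class written for this note: it has only an input and an output type parameter and processes one job at a time, whereas the real InferController also carries start parameters and additional per-job data and is not reproduced here.

```cpp
#include <condition_variable>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical, simplified consumer wrapper (illustration only).
template <class Input, class Output>
class ControllerSketch {
public:
    using Worker = std::function<Output(const Input&)>;

    explicit ControllerSketch(Worker fn) : fn_(std::move(fn)) {
        thread_ = std::thread(&ControllerSketch::run, this);  // start the consumer
    }

    ~ControllerSketch() {
        {
            std::lock_guard<std::mutex> guard(lock_);
            running_ = false;
        }
        cond_.notify_all();
        if (thread_.joinable()) thread_.join();  // pending jobs are drained first
    }

    // Producer side: enqueue a job and hand back a future for its result.
    std::shared_future<Output> commit(const Input& input) {
        Job job{input, std::make_shared<std::promise<Output>>()};
        std::shared_future<Output> result = job.pro->get_future().share();
        {
            std::lock_guard<std::mutex> guard(lock_);
            jobs_.push(std::move(job));
        }
        cond_.notify_one();
        return result;
    }

private:
    struct Job {
        Input input;
        std::shared_ptr<std::promise<Output>> pro;
    };

    // Consumer side: wait for jobs, process them, fulfil their promises.
    void run() {
        for (;;) {
            Job job;
            {
                std::unique_lock<std::mutex> guard(lock_);
                cond_.wait(guard, [this] { return !running_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // stopped and nothing left to do
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job.pro->set_value(fn_(job.input));
        }
    }

    Worker fn_;
    std::mutex lock_;
    std::condition_variable cond_;
    std::queue<Job> jobs_;
    bool running_ = true;
    std::thread thread_;
};
```

With a wrapper of this kind, a model-specific class only has to supply the per-job work, while thread start and stop, queueing and result delivery live in one place.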

Then, in the preprocess function, the difference is that we request a tensor from tensor_allocator_:

job.mono_tensor = tensor_allocator_->query();
if(job.mono_tensor == nullptr){
    INFOE("Tensor allocator query failed.");
    return false;
}

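So why request the tensor from an allocator? Consider the problems we run into in practice:

1. Tensors are poorly reused. Every call to preprocess allocates a new tensor, and the worker frees it after use, which performs badly

2. Preprocessed data is pushed into the queue, so the queue can accumulate a large number of tensors (commit is frequent and infer is slow, so a backlog builds up easily). The backlog drives GPU memory usage up, making the system unstable and unable to run for long periods

We solve both problems by letting tensor_allocator_ manage the tensors:

A single tensor_allocator_ manages all tensors, and anyone who needs a tensor requests it from tensor_allocator_. A fixed number of tensors (for example 10) is allocated up front. On a request, if a free tensor is available it is handed out, and if none is free the caller waits. When a user is done with a tensor, it should notify tensor_allocator_ that the tensor is free again and can be handed out.

This takes care of tensor reuse, and it makes callers wait when more tensors are requested than can be processed, which is equivalent to capping the queue length.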
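The allocation scheme described above can be sketched as follows. `MonopolySketch` is a hypothetical, simplified pool written for this note; the timeout parameter on `query` and the `release` method taking the item are assumptions chosen to match the nullptr check in the preprocess snippet, not a copy of monopoly_allocator.hpp.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical fixed-size pool with exclusive ownership (illustration only).
template <class T>
class MonopolySketch {
public:
    // Pre-allocate a fixed number of items; nothing is allocated afterwards.
    explicit MonopolySketch(std::size_t count) {
        for (std::size_t i = 0; i < count; ++i)
            slots_.push_back(Slot{std::make_shared<T>(), false});
    }

    // Borrow a free item. Waits up to timeout_ms for one to become free
    // and returns nullptr if none does.
    std::shared_ptr<T> query(int timeout_ms = 10000) {
        std::unique_lock<std::mutex> guard(lock_);
        Slot* found = nullptr;
        bool ok = cond_.wait_for(guard, std::chrono::milliseconds(timeout_ms), [&] {
            for (auto& slot : slots_) {
                if (!slot.busy) {
                    found = &slot;
                    return true;
                }
            }
            return false;
        });
        if (!ok) return nullptr;
        found->busy = true;
        return found->item;
    }

    // Give up ownership only: the memory stays allocated for the next user.
    void release(const std::shared_ptr<T>& item) {
        {
            std::lock_guard<std::mutex> guard(lock_);
            for (auto& slot : slots_)
                if (slot.item == item) slot.busy = false;
        }
        cond_.notify_one();
    }

private:
    struct Slot {
        std::shared_ptr<T> item;
        bool busy;
    };

    std::mutex lock_;
    std::condition_variable cond_;
    std::vector<Slot> slots_;
};
```

Because a producer blocks in `query` once every item is borrowed, the pool size acts as the upper bound on outstanding jobs, and the same buffers are recycled instead of being allocated and freed per image.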

Finally, let's look at the implementation in worker:

for(int ibatch = 0; ibatch < infer_batch_size; ++ibatch){

    auto& job                 = fetch_jobs[ibatch];
    float* image_based_output = output->gpu<float>(ibatch);
    float* output_array_ptr   = output_array_device.gpu<float>(ibatch);
    auto affine_matrix        = affin_matrix_device.gpu<float>(ibatch);
    checkCudaRuntime(cudaMemsetAsync(output_array_ptr, 0, sizeof(int), stream_));
    decode_kernel_invoker(image_based_output, output->size(1), num_classes, confidence_threshold_, affine_matrix, output_array_ptr, MAX_IMAGE_BBOX, stream_);

    if(nms_method_ == NMSMethod::FastGPU){
        nms_kernel_invoker(output_array_ptr, nms_threshold_, MAX_IMAGE_BBOX, stream_);
    }
}

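How is inference carried out? First we get input, which is the engine's input tensor, then we take each job's mono_tensor and copy the data at its gpu address into input. Once the copy is done, mono_tensor can be told to release. Note that release only gives up ownership and does not free the memory (it tells tensor_allocator_ that we no longer need the tensor and that it can be handed out again). This step is what caps the queue length.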

Then inference runs. After inference the result is decoded, and the decoded boxes are put into the promise.
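The last step, turning the decoded array into boxes on the host, might look like the sketch below. The buffer layout is an assumption made for illustration (a leading count followed by seven floats per box: left, top, right, bottom, confidence, label, keep flag); the text above does not specify it, and `Box` and `collect_boxes` are hypothetical names.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical box type for this sketch.
struct Box {
    float left, top, right, bottom, confidence;
    int label;
};

// Assumed layout: [count, box0, box1, ...], each box holding 7 floats:
// left, top, right, bottom, confidence, label, keepflag.
constexpr int kBoxElements = 7;

// Hypothetical helper: read the decoded array (already copied to host memory)
// and keep only the boxes that NMS did not suppress.
std::vector<Box> collect_boxes(const float* parray, int max_boxes) {
    int count = std::min(max_boxes, static_cast<int>(parray[0]));
    std::vector<Box> boxes;
    for (int i = 0; i < count; ++i) {
        const float* pbox = parray + 1 + i * kBoxElements;
        if (static_cast<int>(pbox[6]) != 1) continue;  // suppressed by NMS
        boxes.push_back(Box{pbox[0], pbox[1], pbox[2], pbox[3], pbox[4],
                            static_cast<int>(pbox[5])});
    }
    return boxes;
}
```

The resulting vector is what would be passed to the job's promise, so the caller that holds the matching future receives the final boxes.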

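That is everything in the final wrapper, which is in fact the content of the tensorRT_Pro repo 😂. It achieves high performance while staying stable and reliable.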

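Summary

In this lesson we studied the final form of the tensorRT wrapper, which amounts to walking through the design ideas behind tensorRT_Pro. This covered the builder and infer wrappers as well as common utilities such as cuda_tools, ilogger, json, trt_tensor and preprocess_kernel. For the yolo wrapper, we used the consumer model wrapped by InferController and used tensor_allocator_ to solve both tensor reuse and the queue length limit. The finer details still call for further study of the code.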