Learning from BEVDet: How to Generate TensorRT Engines and Write the Accompanying C++ Code


0. Introduction

For deep learning work, accelerating a model and embedding it in C++ is well worth doing, since a trained .pt file is fairly inefficient to run as-is. So here we take BEVDet as an example to show how to generate a TensorRT engine and run accelerated inference with it. I recently received an invitation to try UCloud, which happens to meet my GPU needs for large-model and autonomous-driving work: Compshare, the GPU compute cloud platform under UCloud. They offer cost-effective 4090 GPUs, billed hourly at 2.08 yuan per card (or 1.36 yuan per hour with a monthly pass), with 200 GB of free disk space. That covers my needs for now, and features such as accelerated access and dedicated IPs make it quicker to get a project up and running.

After using the platform you can also write a blog post about it and receive a 500 yuan credit, which easily covers personal GPU needs.

The environment setup was already covered in 《如何使用共享GPU平台搭建LLAMA3环境(LLaMA-Factory)》, and custom setups, whether LibTorch or CUDA, were covered in 《Ubuntu20.04安装LibTorch并完成高斯溅射环境搭建》. In this section we look at how to run the TensorRT-based BEVDet project on the platform.

1. Installing BEVDet-ROS-TensorRT

Note the environment the project officially lists. Here we try a newer stack: ubuntu-20.04, CUDA-12.1, cuDNN-8.6.0, TensorRT-8.6, yaml-cpp, Eigen3, libjpeg.
The point worth attention here is installing TensorRT. Different TensorRT versions support different GPU compute capabilities, and you can look up your GPU's compute capability on the NVIDIA site: CUDA GPUs - Compute Capability | NVIDIA Developer
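
For a quick check, with PyTorch installed you can also query the compute capability directly (a minimal sketch, assuming a CUDA-capable device is visible):

import torch
# prints a (major, minor) tuple, e.g. (8, 9) for an RTX 4090
print(torch.cuda.get_device_capability(0))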

Next, open the TensorRT documentation and check which TensorRT versions support your compute capability.


Find the release notes for your version in the TensorRT release history, or look it up in the NVIDIA TensorRT Installation Guide, then download the desired TensorRT version from the official TensorRT download page:


wget https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/secure/8.6.1/tars/TensorRT-8.6.1.6.Linux.x86_64-gnu.cuda-12.0.tar.gz

# Extract
tar -zxvf TensorRT-8.6.1.6.Linux.x86_64-gnu.cuda-12.0.tar.gz
# Configure environment variables
sudo gedit ~/.bashrc # vim works too
 
# Append the following two lines, adjusted to your actual extraction path
export LD_LIBRARY_PATH=/home/xxx/TensorRT-8.6.1.6/lib:$LD_LIBRARY_PATH
export LIBRARY_PATH=/home/xxx/TensorRT-8.6.1.6/lib:$LIBRARY_PATH
# To keep other software from failing to find the TensorRT libraries, you can also copy the libs and
# headers into the system paths. We skip that here since this setup is temporary, but BEVFormer does it:
# https://github.com/DerryHub/BEVFormer_tensorrt/blob/main/TensorRT/install.sh
# sudo cp -r ./lib/* /usr/lib
# sudo cp -r ./include/* /usr/include
# Reload the environment after saving
source ~/.bashrc

cd TensorRT-8.6.1.6/python
# Install the wheel matching your Python version (python3.9 here)
pip install tensorrt-8.6.1.6-cp39-none-linux_x86_64.whl 
 
# Install dependencies
cd TensorRT-8.6.1.6/graphsurgeon
pip install graphsurgeon-0.4.5-py2.py3-none-any.whl

After installation, verify that TensorRT is visible. Note that the tar installation above is not registered with dpkg, so dpkg -l | grep TensorRT only reports something for a Debian-package install; for the tar install, query the Python binding instead:

python3 -c "import tensorrt; print(tensorrt.__version__)"

This prints the installed TensorRT version.

Next we install the remaining dependencies (the fishros script below is a one-click ROS installer):

wget http://fishros.com/install -O fishros && . fishros
sudo apt-get update
sudo apt-get install libyaml-cpp-dev
sudo apt-get install libeigen3-dev
sudo apt-get install libjpeg-dev
sudo apt install cmake
sudo apt install build-essential
sudo apt-get install ros-noetic-jsk-recognition-msgs

Then clone and build the code:

mkdir -p catkin_ws/src
cd catkin_ws/src
git clone https://github.com/lovelyyoshino/BEVDet-ROS-TensorRT.git
cd .. && catkin_make

roslaunch bevdet bevdet_node.launch

At this point we need to point the project's build configuration at our TensorRT version and install path.
[Figure: final detection result]
UCloud also supports VNC connections, but I tried many approaches and could not connect through VNC Viewer; hopefully support improves in a future update.

sudo vim /etc/apt/sources.list
# Add: deb http://archive.ubuntu.com/ubuntu/ bionic universe
sudo apt-get install -y vnc4server
sudo apt-get install -y x-window-system-core gdm3 ubuntu-desktop

sudo apt-get install -y gnome-panel gnome-settings-daemon metacity nautilus gnome-terminal

Follow this article for the configuration: https://blog.csdn.net/UnameUsay/article/details/137691939, then use ps -ef | grep vnc to find the port number.


You can see the service started normally on port 5901, but we still cannot connect; tracetcp xxx.xxx.xxx.xxx:5901 shows the port is unreachable. After asking support, I learned that VNC over the public internet is currently not supported, so after a reboot just use the platform's built-in VNC. Since port 5901 is not opened externally, if you really want VNC Viewer you can forward the port through an SSH tunnel, e.g. ssh -L 5901:localhost:5901 user@server.

2. The pt to ONNX to TensorRT Pipeline

2.1 pt to ONNX

This part of the code is fairly simple. In the converter.py file, we define a PyTorchToONNXConverter class that encapsulates the conversion logic.

# pytorch_to_onnx/converter.py
import torch
import torch.onnx
from typing import Dict, Sequence


class PyTorchToONNXConverter:
    def __init__(self,
                 model: torch.nn.Module,
                 input_shapes: Dict[str, Sequence[int]],
                 output_file: str,
                 opset_version: int = 11,
                 dynamic_axes: Dict[str, Dict[int, str]] = None):
        """
        Initialize the converter with the given model and parameters.

        :param model: The PyTorch model to be converted.
        :param input_shapes: The shape of the input tensor as a dictionary.
        :param output_file: The path to save the converted ONNX model.
        :param opset_version: The ONNX opset version to use. Default is 11.
        :param dynamic_axes: Dictionary to specify dynamic axes. Default is None.
        """
        self.model = model
        self.input_shapes = input_shapes
        self.output_file = output_file
        self.opset_version = opset_version
        self.dynamic_axes = dynamic_axes if dynamic_axes is not None else {'input': {0: 'batch_size'}, 'output': {0: 'batch_size'}}  # note: these keys must match the exported input/output names

    def convert(self):
        """
        Convert the PyTorch model to ONNX format and save it to the specified output file.
        """
        self.model.eval()
        dummy_inputs = {name: torch.randn(*shape) for name, shape in self.input_shapes.items()}
        torch.onnx.export(
            self.model,
            tuple(dummy_inputs.values()),
            self.output_file,
            opset_version=self.opset_version,
            do_constant_folding=True,
            input_names=list(self.input_shapes.keys()),
            output_names=['output'],
            dynamic_axes=self.dynamic_axes
        )
        print(f"Model has been converted to ONNX and saved at {self.output_file}")

In the utils.py file, we define some helper functions, such as command-line argument parsing and config-file loading.

# pytorch_to_onnx/utils.py
import argparse
import yaml


def parse_args():
    parser = argparse.ArgumentParser(description='Export ONNX Model')
    parser.add_argument('config', help='yaml config file path')
    parser.add_argument('model_path', help='path to PyTorch model file')
    parser.add_argument('output_file', help='path to save the ONNX model')
    args = parser.parse_args()
    return args

def load_yaml_config(config_path):
    with open(config_path, 'r', encoding='utf-8') as f:
        return yaml.load(f, Loader=yaml.Loader)

In the scripts/convert.py file, we use the tools above to perform the PyTorch-to-ONNX conversion.

# scripts/convert.py
import torch
from pytorch_to_onnx import PyTorchToONNXConverter
from pytorch_to_onnx.utils import parse_args, load_yaml_config

if __name__ == '__main__':
    args = parse_args()
    
    # Load the config file
    yaml_cfg = load_yaml_config(args.config)
    
    # Load the model (expects a fully pickled nn.Module, not just a state_dict)
    model = torch.load(args.model_path)
    
    input_shapes = yaml_cfg['input_shapes']
    
    # Create the converter instance
    converter = PyTorchToONNXConverter(
        model=model,
        input_shapes=input_shapes,
        output_file=args.output_file,
        opset_version=yaml_cfg.get('opset_version', 11),
        dynamic_axes=yaml_cfg.get('dynamic_axes', None)
    )
    
    # Run the conversion
    converter.convert()
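
To make the pieces concrete, here is a minimal usage sketch that drives the converter directly, without a YAML config. The toy model, input name, shapes, and file name are all made up for illustration:

# sketch: direct use of PyTorchToONNXConverter with a hypothetical toy model
import torch
from pytorch_to_onnx import PyTorchToONNXConverter

model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3, padding=1), torch.nn.ReLU())
converter = PyTorchToONNXConverter(
    model=model,
    input_shapes={'images': [1, 3, 256, 704]},  # illustrative name/shape
    output_file='toy_model.onnx',
    dynamic_axes={'images': {0: 'batch_size'}, 'output': {0: 'batch_size'}},
)
converter.convert()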

2.2 ONNX to TensorRT

In the converter.py file, we define an ONNXToTRTConverter class that encapsulates the conversion logic.

# onnx_to_trt/converter.py
import os
from typing import Dict, Sequence, Union

import onnx
import tensorrt as trt


class ONNXToTRTConverter:
    def __init__(self,
                 onnx_model: Union[str, onnx.ModelProto],
                 output_file_prefix: str,
                 input_shapes: Dict[str, Dict[str, Sequence[int]]],
                 max_workspace_size: int = 1 << 30,
                 fp16_mode: bool = False,
                 device_id: int = 0,
                 log_level: trt.Logger.Severity = trt.Logger.ERROR,
                 tf32: bool = True):
        """
        Initialize the converter with the given model and parameters.

        :param onnx_model: The path to the ONNX model file or an ONNX model protobuf.
        :param output_file_prefix: The prefix for the output TensorRT engine file.
        :param input_shapes: A dictionary specifying the input shapes.
        :param max_workspace_size: The maximum workspace size for TensorRT. Default is 1GB.
        :param fp16_mode: Whether to enable FP16 mode. Default is False.
        :param device_id: The GPU device ID to use. Default is 0.
        :param log_level: The logging level for TensorRT. Default is trt.Logger.ERROR.
        :param tf32: Whether to enable TF32 mode. Default is True.
        """
        self.onnx_model = onnx_model
        self.output_file_prefix = output_file_prefix
        self.input_shapes = input_shapes
        self.max_workspace_size = max_workspace_size
        self.fp16_mode = fp16_mode
        self.device_id = device_id
        self.log_level = log_level
        self.tf32 = tf32

    def convert(self) -> trt.IHostMemory:
        """Convert the ONNX model to a serialized TensorRT engine."""
        os.environ['CUDA_DEVICE'] = str(self.device_id)
        
        # create builder and network
        logger = trt.Logger(self.log_level)
        builder = trt.Builder(logger)
        EXPLICIT_BATCH = 1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
        network = builder.create_network(EXPLICIT_BATCH)

        # parse onnx
        parser = trt.OnnxParser(network, logger)

        if isinstance(self.onnx_model, str):
            self.onnx_model = onnx.load(self.onnx_model)

        if not parser.parse(self.onnx_model.SerializeToString()):
            error_msgs = ''
            for error in range(parser.num_errors):
                error_msgs += f'{parser.get_error(error)}\n'
            raise RuntimeError(f'Failed to parse ONNX model, {error_msgs}')

        config = builder.create_builder_config()    
        config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, self.max_workspace_size)
        
        profile = builder.create_optimization_profile()

        for input_name, param in self.input_shapes.items():
            min_shape = param['min_shape']
            opt_shape = param['opt_shape']
            max_shape = param['max_shape']
            profile.set_shape(input_name, min_shape, opt_shape, max_shape)
        config.add_optimization_profile(profile)

        if self.fp16_mode:
            config.set_flag(trt.BuilderFlag.FP16)
            
        if not self.tf32:
            config.clear_flag(trt.BuilderFlag.TF32)

        # build the serialized engine (build_serialized_network returns an IHostMemory blob, not an ICudaEngine)
        engine = builder.build_serialized_network(network, config)
        assert engine is not None, 'Failed to create TensorRT engine'

        with open(self.output_file_prefix + '.engine', mode='wb') as f:
            f.write(bytearray(engine))
        print(f"TensorRT engine has been saved to {self.output_file_prefix}.engine")
        return engine

In the utils.py file, we define some helper functions, such as command-line argument parsing and a file-name replacement utility.

# onnx_to_trt/utils.py
import argparse
from ruamel import yaml

def parse_args():
    parser = argparse.ArgumentParser(description='Export Engine Model')
    parser.add_argument('config', help='yaml config file path')
    parser.add_argument('img_encoder_onnx', help='path to img_encoder onnx file')
    parser.add_argument('bev_encoder_onnx', help='path to bev_encoder onnx file')
    parser.add_argument(
        '--postfix', default='', help='postfix of the save file name')
    # note: argparse's type=bool treats any non-empty string as True
    parser.add_argument(
        '--tf32', type=bool, default=True, help='default to turn on the tf32')
    parser.add_argument(
        '--fp16', type=bool, default=False, help='float16')
    parser.add_argument(
        '--gpu-id',
        type=int,
        default=0,
        help='Which gpu to be used'
    )
    args = parser.parse_args()
    return args

def load_yaml_config(config_path):
    with open(config_path, 'r', encoding='utf-8') as f:
        return yaml.load(f, Loader=yaml.Loader)

def replace_file_name(path, new_name=None):
    assert new_name is not None
    path_parts = path.split('/')[:-1]
    file = '/'.join(path_parts) + '/' + new_name
    return file

In the scripts/convert.py file, we use the tools above to perform the ONNX-to-TensorRT conversion.

# scripts/convert.py
import onnx
import numpy as np
from onnx_to_trt import ONNXToTRTConverter
from onnx_to_trt.utils import parse_args, load_yaml_config, replace_file_name

if __name__ == '__main__':
    args = parse_args()
        
    img_stage_onnx = onnx.load(args.img_encoder_onnx)
    try:
        onnx.checker.check_model(img_stage_onnx)
    except Exception:
        print(f'{args.img_encoder_onnx} ONNX Model Incorrect')
        assert 0
    else:
        print(f'{args.img_encoder_onnx} ONNX Model Correct')

    bev_stage_onnx = onnx.load(args.bev_encoder_onnx)
    try:
        onnx.checker.check_model(bev_stage_onnx)
    except Exception:
        print(f'{args.bev_encoder_onnx} ONNX Model Incorrect')
        assert 0
    else:
        print(f'{args.bev_encoder_onnx} ONNX Model Correct')

    yaml_cfg = load_yaml_config(args.config)
    use_depth = yaml_cfg['use_depth']
    img_H, img_W = yaml_cfg['data_config']['input_size']
    downsample_factor = yaml_cfg['model']['down_sample']
    
    feat_h, feat_w = img_H // downsample_factor, img_W // downsample_factor
    
    D = len(np.arange(*yaml_cfg['grid_config']['depth']))  
    bev_h = len(np.arange(*yaml_cfg['grid_config']['x']))
    bev_w = len(np.arange(*yaml_cfg['grid_config']['y']))
    
    bev_inchannels = (yaml_cfg['adj_num'] + 1) * yaml_cfg['model']['bevpool_channels']

    img_shape = [6, 3, img_H, img_W]
    img_input_shape = dict(
        images=dict(min_shape=img_shape, opt_shape=img_shape, max_shape=img_shape)
    )
    if use_depth:
        img_input_shape['rot'] = dict(min_shape=[1, 6, 3, 3], opt_shape=[1, 6, 3, 3], max_shape=[1, 6, 3, 3])
        img_input_shape['trans'] = dict(min_shape=[1, 6, 3], opt_shape=[1, 6, 3], max_shape=[1, 6, 3])
        img_input_shape['intrin'] = dict(min_shape=[1, 6, 3, 3], opt_shape=[1, 6, 3, 3], max_shape=[1, 6, 3, 3])
        img_input_shape['post_rot'] = dict(min_shape=[1, 6, 3, 3], opt_shape=[1, 6, 3, 3], max_shape=[1, 6, 3, 3])
        img_input_shape['post_trans'] = dict(min_shape=[1, 6, 3], opt_shape=[1, 6, 3], max_shape=[1, 6, 3])
        img_input_shape['bda'] = dict(min_shape=[1, 3, 3], opt_shape=[1, 3, 3], max_shape=[1, 3, 3])

    bev_shape = [1, bev_inchannels, bev_h, bev_w]
    bev_input_shape = dict(
        BEV_feat=dict(min_shape=bev_shape, opt_shape=bev_shape, max_shape=bev_shape)
    )
    
    img_engine_file = replace_file_name(args.img_encoder_onnx, f'img_stage{args.postfix}')
    bev_engine_file = replace_file_name(args.bev_encoder_onnx, f'bev_stage{args.postfix}')

    img_converter = ONNXToTRTConverter(
        onnx_model=args.img_encoder_onnx,
        output_file_prefix=img_engine_file,
        input_shapes=img_input_shape,
        device_id=args.gpu_id,
        max_workspace_size=1 << 32,
        fp16_mode=args.fp16,
        tf32=args.tf32
    )
    img_converter.convert()

    bev_converter = ONNXToTRTConverter(
        onnx_model=args.bev_encoder_onnx,
        output_file_prefix=bev_engine_file,
        input_shapes=bev_input_shape,
        device_id=args.gpu_id,
        max_workspace_size=1 << 32,
        fp16_mode=args.fp16,
        tf32=args.tf32
    )
    bev_converter.convert()
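
Once both engines are written, a quick sanity check is to deserialize them with the TensorRT Python API and list the binding names and shapes. This is a minimal sketch against the TensorRT 8.x API; the engine file name is an assumption, so use whatever path replace_file_name produced for you:

# sketch: list the bindings of a built engine (TensorRT 8.x API)
import tensorrt as trt

logger = trt.Logger(trt.Logger.ERROR)
with open('img_stage.engine', 'rb') as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

for i in range(engine.num_bindings):
    kind = 'input' if engine.binding_is_input(i) else 'output'
    print(engine.get_binding_name(i), kind, engine.get_binding_shape(i))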

3. Calling the engine Files from C++

This part is essentially a declare-then-infer flow:

    // Model config file, number of images, camera intrinsics, cam2ego rotation and translation, engine files
    bevdet_ = std::make_shared<BEVDet>(model_config_, img_N_, sampleData_.param.cams_intrin, 
                sampleData_.param.cams2ego_rot, sampleData_.param.cams2ego_trans, 
                                                    imgstage_file_, bevstage_file_);
    
    
    // GPU allocation: reserve space on the device for 6 images, sizeof(uchar) = 1 byte per element,
    // with imgs_dev_ pointing at that GPU memory
    CHECK_CUDA(cudaMalloc((void**)&imgs_dev_, img_N_ * 3 * img_w_ * img_h_ * sizeof(uchar)));


// --------------------------- inference flow
    // Copy imgs_data from the CPU to imgs_dev on the GPU
    // std::vector<std::vector<char>> imgs_data, with channel conversion applied
    decode_cpu(imgs_data, imgs_dev_, img_w_, img_h_);

    // uchar *imgs_dev: the data now lives on the GPU
    sampleData_.imgs_dev = imgs_dev_;

    std::vector<Box> ego_boxes;
    ego_boxes.clear();
    float time = 0.f;
    // Run inference: image data, boxes, elapsed time
    bevdet_->DoInfer(sampleData_, ego_boxes, time);
    
    // std::vector<Box> lidar_boxes;
    jsk_recognition_msgs::BoundingBoxArrayPtr lidar_boxes(new jsk_recognition_msgs::BoundingBoxArray);
    
    lidar_boxes->boxes.clear();
    // Transform the boxes from the ego frame to the lidar frame
    Egobox2Lidarbox(ego_boxes, lidar_boxes, sampleData_.param.lidar2ego_rot, 
                                            sampleData_.param.lidar2ego_trans);

Now let's walk through the main code, which covers the preprocessing steps and the memory-allocation work. The approach follows the author's LCH1238/BEVDet project.

#include <iostream>
#include <cstdio>
#include <fstream>
#include <chrono>

#include <thrust/sort.h>
#include <thrust/functional.h>

#include "bevdet.h"
#include "bevpool.h"
#include "grid_sampler.cuh"

using std::chrono::duration;
using std::chrono::high_resolution_clock;

/**
 * @brief Constructor: initializes the BEVDet object, loads the config file,
 *        initializes the view transformer and engines, and allocates device memory
 * @param  config_file      path to the config file
 * @param  n_img            number of images
 * @param  _cams_intrin     list of camera intrinsic matrices
 * @param  _cams2ego_rot    list of camera-to-ego rotations
 * @param  _cams2ego_trans  list of camera-to-ego translations
 * @param  imgstage_file    engine file for the image stage
 * @param  bevstage_file    engine file for the BEV stage
 */
BEVDet::BEVDet(const std::string &config_file, int n_img,               
                        std::vector<Eigen::Matrix3f> _cams_intrin, 
                        std::vector<Eigen::Quaternion<float>> _cams2ego_rot, 
                        std::vector<Eigen::Translation3f> _cams2ego_trans,
                        const std::string &imgstage_file, 
                        const std::string &bevstage_file) : 
                        cams_intrin(_cams_intrin), 
                        cams2ego_rot(_cams2ego_rot), 
                        cams2ego_trans(_cams2ego_trans){
    // Initialize parameters
    InitParams(config_file);
    
    if(n_img != N_img)// check that the number of input images matches the config
    {
        printf("BEVDet need %d images, but given %d images!\n", N_img, n_img);
    }
    auto start = high_resolution_clock::now();
    
    // Initialize the view transformer
    InitViewTransformer();
    auto end = high_resolution_clock::now();
    duration<float> t = end - start;
    printf("InitViewTransformer cost time : %.4lf ms\n", t.count() * 1000);

    InitEngine(imgstage_file, bevstage_file); // initialize the engines from imgstage_file and bevstage_file
    MallocDeviceMemory();// allocate device memory
}

/**
///@brief Convert the current camera intrinsics/extrinsics and copy them into GPU memory
///@param  curr_cams2ego_rot   current camera-to-ego rotations
///@param  curr_cams2ego_trans current camera-to-ego translations
///@param  cur_cams_intrin     current camera intrinsic matrices
*/
void BEVDet::InitDepth(const std::vector<Eigen::Quaternion<float>> &curr_cams2ego_rot,
                       const std::vector<Eigen::Translation3f> &curr_cams2ego_trans,
                       const std::vector<Eigen::Matrix3f> &cur_cams_intrin){
    // Allocate host memory for the rotation, translation, intrinsic, post-rotation,
    // post-translation, and bda matrices
    float* rot_host = new float[N_img * 3 * 3];
    float* trans_host = new float[N_img * 3];
    float* intrin_host = new float[N_img * 3 * 3];
    float* post_rot_host = new float[N_img * 3 * 3];
    float* post_trans_host = new float[N_img * 3];
    float* bda_host = new float[3 * 3];

    // Pack the rotations and intrinsics into flat arrays
    for(int i = 0; i < N_img; i++)
    {
        for(int j = 0; j < 3; j++)
        {
            for(int k = 0; k < 3; k++)
            {
                rot_host[i * 9 + j * 3 + k] = curr_cams2ego_rot[i].matrix()(j, k);// flatten the rotation matrix into the array
                intrin_host[i * 9 + j * 3 + k] = cur_cams_intrin[i](j, k);
                post_rot_host[i * 9 + j * 3 + k] = post_rot(j, k);
            }
            trans_host[i * 3 + j] = curr_cams2ego_trans[i].translation()(j);
            post_trans_host[i * 3 + j] = post_trans.translation()(j);
        }
    }

    for(int i = 0; i < 3; i++){
        for(int j = 0; j < 3; j++){
            if(i == j){
                bda_host[i * 3 + j] = 1.0;// bda (BEV data augmentation) matrix, identity when no augmentation is applied
            }
            else{
                bda_host[i * 3 + j] = 0.0;
            }
        }
    }


    CHECK_CUDA(cudaMemcpy(imgstage_buffer[imgbuffer_map["rot"]], rot_host, 
                                N_img * 3 * 3 * sizeof(float), cudaMemcpyHostToDevice));// copy the rotation matrices to GPU memory
    CHECK_CUDA(cudaMemcpy(imgstage_buffer[imgbuffer_map["trans"]], trans_host, 
                                N_img * 3 * sizeof(float), cudaMemcpyHostToDevice));
    CHECK_CUDA(cudaMemcpy(imgstage_buffer[imgbuffer_map["intrin"]], intrin_host, 
                                N_img * 3 * 3 * sizeof(float), cudaMemcpyHostToDevice));
    CHECK_CUDA(cudaMemcpy(imgstage_buffer[imgbuffer_map["post_rot"]], post_rot_host, 
                                N_img * 3 * 3 * sizeof(float), cudaMemcpyHostToDevice));
    CHECK_CUDA(cudaMemcpy(imgstage_buffer[imgbuffer_map["post_trans"]], post_trans_host, 
                                N_img * 3 * sizeof(float), cudaMemcpyHostToDevice));
    CHECK_CUDA(cudaMemcpy(imgstage_buffer[imgbuffer_map["bda"]], bda_host, 
                                3 * 3 * sizeof(float), cudaMemcpyHostToDevice));

    delete[] rot_host;// free the host buffers
    delete[] trans_host;
    delete[] intrin_host;
    delete[] post_rot_host;
    delete[] post_trans_host;
    delete[] bda_host;
}
/**
///@brief Load parameters from the config file and initialize the related members
///@param  config_file      path to the config file
 */
void BEVDet::InitParams(const std::string &config_file)
{
    YAML::Node model_config = YAML::LoadFile(config_file);
    N_img = model_config["data_config"]["Ncams"].as<int>();// number of cameras
    src_img_h = model_config["data_config"]["src_size"][0].as<int>();// source image height
    src_img_w = model_config["data_config"]["src_size"][1].as<int>();// source image width
    input_img_h = model_config["data_config"]["input_size"][0].as<int>();// network input height
    input_img_w = model_config["data_config"]["input_size"][1].as<int>();// network input width
    crop_h = model_config["data_config"]["crop"][0].as<int>();// crop height
    crop_w = model_config["data_config"]["crop"][1].as<int>();// crop width
    mean.x = model_config["mean"][0].as<float>();// normalization mean
    mean.y = model_config["mean"][1].as<float>();
    mean.z = model_config["mean"][2].as<float>();
    std.x = model_config["std"][0].as<float>();// normalization std
    std.y = model_config["std"][1].as<float>();
    std.z = model_config["std"][2].as<float>();
    down_sample = model_config["model"]["down_sample"].as<int>();// downsampling factor
    depth_start = model_config["grid_config"]["depth"][0].as<float>();// depth range start
    depth_end = model_config["grid_config"]["depth"][1].as<float>();
    depth_step = model_config["grid_config"]["depth"][2].as<float>();
    x_start = model_config["grid_config"]["x"][0].as<float>();// BEV grid configuration
    x_end = model_config["grid_config"]["x"][1].as<float>();
    x_step = model_config["grid_config"]["x"][2].as<float>();
    y_start = model_config["grid_config"]["y"][0].as<float>();
    y_end = model_config["grid_config"]["y"][1].as<float>();
    y_step = model_config["grid_config"]["y"][2].as<float>();
    z_start = model_config["grid_config"]["z"][0].as<float>();
    z_end = model_config["grid_config"]["z"][1].as<float>();
    z_step = model_config["grid_config"]["z"][2].as<float>();
    bevpool_channel = model_config["model"]["bevpool_channels"].as<int>();// number of bevpool channels
    nms_pre_maxnum = model_config["test_cfg"]["max_per_img"].as<int>();// max number of boxes kept before NMS
    nms_post_maxnum = model_config["test_cfg"]["post_max_size"].as<int>();// max number of boxes kept after NMS
    score_thresh = model_config["test_cfg"]["score_threshold"].as<float>();// score threshold
    nms_overlap_thresh = model_config["test_cfg"]["nms_thr"][0].as<float>();// NMS overlap threshold
    use_depth = model_config["use_depth"].as<bool>();// whether to use the depth inputs
    use_adj = model_config["use_adj"].as<bool>();// whether to use adjacent frames
    
    if(model_config["sampling"].as<std::string>() == "bicubic"){// sampling mode
        pre_sample = Sampler::bicubic;
    }
    else{
        pre_sample = Sampler::nearest;
    }

    std::vector<std::vector<float>> nms_factor_temp = model_config["test_cfg"]
                            ["nms_rescale_factor"].as<std::vector<std::vector<float>>>();// per-task NMS rescale factors
    nms_rescale_factor.clear();
    for(auto task_factors : nms_factor_temp){
        for(float factor : task_factors){
            nms_rescale_factor.push_back(factor);// flatten the per-task factors, e.g. [1.0, 0.7, 0.7, 0.4, 0.55, 1.1, 1.0, 1.0, 1.5, 3.5]; each value rescales boxes for one class during NMS
        }
    }
    
    std::vector<std::vector<std::string>> class_name_pre_task;
    class_num = 0;
    YAML::Node tasks = model_config["model"]["tasks"];// read the per-task class information
    class_num_pre_task = std::vector<int>();
    for(auto it : tasks){
        int num = it["num_class"].as<int>();
        class_num_pre_task.push_back(num);
        class_num += num;
        class_name_pre_task.push_back(it["class_names"].as<std::vector<std::string>>());// [car, truck, construction_vehicle, bus, trailer, barrier, motorcycle, bicycle, pedestrian, traffic_cone]
    }

    YAML::Node common_head_channel = model_config["model"]["common_head"]["channels"];// output head channels, e.g. [2, 1, 3, 2, 2]
    YAML::Node common_head_name = model_config["model"]["common_head"]["names"];// output head names, e.g. [reg, height, dim, rot, vel]
    for(size_t i = 0; i< common_head_channel.size(); i++){
        out_num_task_head[common_head_name[i].as<std::string>()] = 
                                                        common_head_channel[i].as<int>();
    }

    resize_radio = (float)input_img_w / src_img_w;
    feat_h = input_img_h / down_sample;
    feat_w = input_img_w / down_sample;
    depth_num = (depth_end - depth_start) / depth_step;
    xgrid_num = (x_end - x_start) / x_step;
    ygrid_num = (y_end - y_start) / y_step;
    zgrid_num = (z_end - z_start) / z_step;
    bev_h = ygrid_num;
    bev_w = xgrid_num;

    // Initialize the rotation and translation applied during image preprocessing
    post_rot << resize_radio, 0, 0,
                0, resize_radio, 0,
                0, 0, 1;
    post_trans.translation() << -crop_w, -crop_h, 0;

    // Initialize adjacent-frame handling
    adj_num = 0;
    if(use_adj){
        adj_num = model_config["adj_num"].as<int>();
        adj_frame_ptr.reset(new adjFrame(adj_num, bev_h * bev_w, bevpool_channel));
    }


    postprocess_ptr.reset(new PostprocessGPU(class_num, score_thresh, nms_overlap_thresh,
                                            nms_pre_maxnum, nms_post_maxnum, down_sample,
                                            bev_h, bev_w, x_step, y_step, x_start,
                                            y_start, class_num_pre_task, nms_rescale_factor));// initialize the PostprocessGPU object

}

/**
///@brief Allocate device (GPU) memory for the image data and the stage-network bindings
 */
void BEVDet::MallocDeviceMemory(){
    // Allocate device memory for the raw source images, src_imgs_dev
    CHECK_CUDA(cudaMalloc((void**)&src_imgs_dev, 
                                N_img * 3 * src_img_h * src_img_w * sizeof(uchar)));

    imgstage_buffer = (void**)new void*[imgstage_engine->getNbBindings()];// one device buffer per image-stage binding
    for(int i = 0; i < imgstage_engine->getNbBindings(); i++){
        nvinfer1::Dims32 dim = imgstage_context->getBindingDimensions(i);// get the binding dimensions
        int size = 1;
        for(int j = 0; j < dim.nbDims; j++){
            size *= dim.d[j];// multiply the dimensions together
        }
        size *= dataTypeToSize(imgstage_engine->getBindingDataType(i));// account for the element size of the data type
        CHECK_CUDA(cudaMalloc(&imgstage_buffer[i], size));
    }

    std::cout << "img num binding : " << imgstage_engine->getNbBindings() << std::endl;

    bevstage_buffer = (void**)new void*[bevstage_engine->getNbBindings()];// one device buffer per BEV-stage binding
    for(int i = 0; i < bevstage_engine->getNbBindings(); i++){
        nvinfer1::Dims32 dim = bevstage_context->getBindingDimensions(i);// get the binding dimensions
        int size = 1;
        for(int j = 0; j < dim.nbDims; j++){
            size *= dim.d[j];
        }
        size *= dataTypeToSize(bevstage_engine->getBindingDataType(i));// account for the element size of the data type
        CHECK_CUDA(cudaMalloc(&bevstage_buffer[i], size));// allocate with cudaMalloc
    }

    return;
}

/**
///@brief Initialize the view transformer: precompute the mapping from camera frustum points to BEV grid cells
 */
void BEVDet::InitViewTransformer(){
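    // Precompute the LSS view-transform lookup tables on the host. Roughly:
    //   1) build one frustum point per (camera, depth bin, feature pixel);
    //   2) undo the image-space augmentation (post_rot / post_trans), lift the
    //      pixel onto its camera ray, and transform it into the ego frame;
    //   3) voxelize into BEV grid indices and keep only in-range points;
    //   4) sort the kept points by BEV cell index so that bev_pool_v2 can later
    //      accumulate each cell as one contiguous interval
    //      (interval_starts_dev / interval_lengths_dev).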

    int num_points = N_img * depth_num * feat_h * feat_w;
    Eigen::Vector3f* frustum = new Eigen::Vector3f[num_points];// allocate the frustum point array

    for(int i = 0; i < N_img; i++){
        for(int d_ = 0; d_ < depth_num; d_++){
            for(int h_ = 0; h_ < feat_h; h_++){
                for(int w_ = 0; w_ < feat_w; w_++){
                    int offset = i * depth_num * feat_h * feat_w + d_ * feat_h * feat_w
                                                                 + h_ * feat_w + w_;
                    (frustum + offset)->x() = (float)w_ * (input_img_w - 1) / (feat_w - 1);
                    (frustum + offset)->y() = (float)h_ * (input_img_h - 1) / (feat_h - 1);
                    (frustum + offset)->z() = (float)d_ * depth_step + depth_start;

                    // eliminate post transformation
                    *(frustum + offset) -= post_trans.translation();
                    *(frustum + offset) = post_rot.inverse() * *(frustum + offset);
                    // 
                    (frustum + offset)->x() *= (frustum + offset)->z();
                    (frustum + offset)->y() *= (frustum + offset)->z();
                    // img to ego -> rot -> trans
                    *(frustum + offset) = cams2ego_rot[i] * cams_intrin[i].inverse()
                                    * *(frustum + offset) + cams2ego_trans[i].translation();// transform the frustum point into the ego frame

                    // voxelization: convert to BEV grid indices
                    *(frustum + offset) -= Eigen::Vector3f(x_start, y_start, z_start);
                    (frustum + offset)->x() = (int)((frustum + offset)->x() / x_step);
                    (frustum + offset)->y() = (int)((frustum + offset)->y() / y_step);
                    (frustum + offset)->z() = (int)((frustum + offset)->z() / z_step);
                }
            }
        }
    }

    int* _ranks_depth = new int[num_points];
    int* _ranks_feat = new int[num_points];

    for(int i = 0; i < num_points; i++){
        _ranks_depth[i] = i;
    }
    for(int i = 0; i < N_img; i++){
        for(int d_ = 0; d_ < depth_num; d_++){
            for(int u = 0; u < feat_h * feat_w; u++){
                int offset = i * (depth_num * feat_h * feat_w) + d_ * (feat_h * feat_w) + u;
                _ranks_feat[offset] = i * feat_h * feat_w + u;
            }
        }
    }

    std::vector<int> kept;
    for(int i = 0; i < num_points; i++){
        if((int)(frustum + i)->x() >= 0 && (int)(frustum + i)->x() < xgrid_num &&
           (int)(frustum + i)->y() >= 0 && (int)(frustum + i)->y() < ygrid_num &&
           (int)(frustum + i)->z() >= 0 && (int)(frustum + i)->z() < zgrid_num){
            kept.push_back(i);
        }
    }

    valid_feat_num = kept.size();
    int* ranks_depth_host = new int[valid_feat_num];
    int* ranks_feat_host = new int[valid_feat_num];
    int* ranks_bev_host = new int[valid_feat_num];
    int* order = new int[valid_feat_num];

    for(int i = 0; i < valid_feat_num; i++){
        Eigen::Vector3f &p = frustum[kept[i]];
        ranks_bev_host[i] = (int)p.z() * xgrid_num * ygrid_num + 
                            (int)p.y() * xgrid_num + (int)p.x();
        order[i] = i;
    }

    thrust::sort_by_key(ranks_bev_host, ranks_bev_host + valid_feat_num, order);
    for(int i = 0; i < valid_feat_num; i++){
        ranks_depth_host[i] = _ranks_depth[kept[order[i]]];
        ranks_feat_host[i] = _ranks_feat[kept[order[i]]];
    }

    delete[] _ranks_depth;
    delete[] _ranks_feat;
    delete[] frustum;
    delete[] order;

    std::vector<int> interval_starts_host;
    std::vector<int> interval_lengths_host;

    interval_starts_host.push_back(0);
    int len = 1;
    for(int i = 1; i < valid_feat_num; i++){
        if(ranks_bev_host[i] != ranks_bev_host[i - 1]){
            interval_starts_host.push_back(i);
            interval_lengths_host.push_back(len);
            len=1;
        }
        else{
            len++;
        }
    }
    
    interval_lengths_host.push_back(len);
    unique_bev_num = interval_lengths_host.size();

    CHECK_CUDA(cudaMalloc((void**)&ranks_bev_dev, valid_feat_num * sizeof(int)));
    CHECK_CUDA(cudaMalloc((void**)&ranks_depth_dev, valid_feat_num * sizeof(int)));
    CHECK_CUDA(cudaMalloc((void**)&ranks_feat_dev, valid_feat_num * sizeof(int)));
    CHECK_CUDA(cudaMalloc((void**)&interval_starts_dev, unique_bev_num * sizeof(int)));
    CHECK_CUDA(cudaMalloc((void**)&interval_lengths_dev, unique_bev_num * sizeof(int)));

    CHECK_CUDA(cudaMemcpy(ranks_bev_dev, ranks_bev_host, valid_feat_num * sizeof(int), 
                                                                    cudaMemcpyHostToDevice));
    CHECK_CUDA(cudaMemcpy(ranks_depth_dev, ranks_depth_host, valid_feat_num * sizeof(int), 
                                                                    cudaMemcpyHostToDevice));
    CHECK_CUDA(cudaMemcpy(ranks_feat_dev, ranks_feat_host, valid_feat_num * sizeof(int), 
                                                                    cudaMemcpyHostToDevice));
    CHECK_CUDA(cudaMemcpy(interval_starts_dev, interval_starts_host.data(), 
                                        unique_bev_num * sizeof(int), cudaMemcpyHostToDevice));
    CHECK_CUDA(cudaMemcpy(interval_lengths_dev, interval_lengths_host.data(), 
                                        unique_bev_num * sizeof(int), cudaMemcpyHostToDevice));

    // printf("Num_points : %d\n", num_points);
    // printf("valid_feat_num : %d\n", valid_feat_num);
    // printf("unique_bev_num : %d\n", unique_bev_num);
    // printf("valid rate : %.3lf\n", (float)valid_feat_num / num_points);

    delete[] ranks_bev_host;
    delete[] ranks_depth_host;
    delete[] ranks_feat_host;
}

/**
///@brief Print dimension info
///@param  dim              the dimensions to print
 */
void print_dim(nvinfer1::Dims dim){
    for(auto i = 0; i < dim.nbDims; i++){
        printf("%d%c", dim.d[i], i == dim.nbDims - 1 ? '\n' : ' ');
    }
}

/**
///@brief Initialize the engines: deserialize the TensorRT engines from file and create execution contexts
///@param  imgstage_file    engine file for the image stage
///@param  bevstage_file    engine file for the BEV stage
///@return int 
 */
int BEVDet::InitEngine(const std::string &imgstage_file, const std::string &bevstage_file){
    if(DeserializeTRTEngine(imgstage_file, &imgstage_engine))// deserialize the TensorRT engines
    {
        return EXIT_FAILURE;
    }
    if(DeserializeTRTEngine(bevstage_file, &bevstage_engine)){
        return EXIT_FAILURE;
    }
    if(imgstage_engine == nullptr || bevstage_engine == nullptr){
        std::cerr << "Failed to deserialize engine file!" << std::endl;
        return EXIT_FAILURE;
    }
    imgstage_context = imgstage_engine->createExecutionContext();// create the execution contexts
    bevstage_context = bevstage_engine->createExecutionContext();

    if (imgstage_context == nullptr || bevstage_context == nullptr) {
        std::cerr << "Failed to create TensorRT Execution Context!" << std::endl;
        return EXIT_FAILURE;
    }

    // Set the binding dimensions and initialize the binding-name maps
    imgstage_context->setBindingDimensions(0, 
                            nvinfer1::Dims32{4, {N_img, 3, input_img_h, input_img_w}});
    bevstage_context->setBindingDimensions(0,
            nvinfer1::Dims32{4, {1, bevpool_channel * (adj_num + 1), bev_h, bev_w}});
    imgbuffer_map.clear();
    for(auto i = 0; i < imgstage_engine->getNbBindings(); i++){
        auto dim = imgstage_context->getBindingDimensions(i);
        auto name = imgstage_engine->getBindingName(i);
        imgbuffer_map[name] = i;
        std::cout << name << " : ";
        print_dim(dim);

    }
    std::cout << std::endl;

    bevbuffer_map.clear();
    for(auto i = 0; i < bevstage_engine->getNbBindings(); i++){
        auto dim = bevstage_context->getBindingDimensions(i);
        auto name = bevstage_engine->getBindingName(i);
        bevbuffer_map[name] = i;
        std::cout << name << " : ";
        print_dim(dim);
    }    
    
    return EXIT_SUCCESS;
}

/**
///@brief Deserialize a TensorRT engine file and create the corresponding engine object
///@param  engine_file      path to the engine file
///@param  engine_ptr       receives the deserialized TensorRT engine
///@return int 
 */
int BEVDet::DeserializeTRTEngine(const std::string &engine_file, 
                                                    nvinfer1::ICudaEngine **engine_ptr){
    int verbosity = static_cast<int>(nvinfer1::ILogger::Severity::kWARNING);
    std::stringstream engine_stream;
    engine_stream.seekg(0, engine_stream.beg);

    // Read the engine data from file (binary mode: the engine is a binary blob)
    std::ifstream file(engine_file, std::ios::binary);
    engine_stream << file.rdbuf();
    file.close();

    nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(g_logger);// create the TensorRT runtime
    if (runtime == nullptr) 
    {
        // std::string msg("Failed to build runtime parser!");
        // g_logger.log(nvinfer1::ILogger::Severity::kERROR, msg.c_str());
        // std::cout << "" << "Failed to build runtime parser!" << std::endl;
        std::cout << "\033[1;31m" << "\nFailed to build runtime parser!\n" << "\033[0m" << std::endl;
        return EXIT_FAILURE;
    }
    engine_stream.seekg(0, std::ios::end);
    const int engine_size = engine_stream.tellg();

    engine_stream.seekg(0, std::ios::beg); 
    void* engine_str = malloc(engine_size);
    engine_stream.read((char*)engine_str, engine_size);
    
    nvinfer1::ICudaEngine *engine = runtime->deserializeCudaEngine(engine_str, engine_size, NULL);// deserialize the engine data with the TensorRT runtime
    if (engine == nullptr) 
    {
        // std::string msg("Failed to build engine parser!");
        // g_logger.log(nvinfer1::ILogger::Severity::kERROR, msg.c_str());

        std::cout << "\033[1;31m" << "\nFailed to build engine parser!\n" << "\033[0m" << std::endl;

        return EXIT_FAILURE;
    }
    *engine_ptr = engine;
    for (int bi = 0; bi < engine->getNbBindings(); bi++) {
        if (engine->bindingIsInput(bi) == true){// print the engine binding info as a sanity check
            printf("Binding %d (%s): Input. \n", bi, engine->getBindingName(bi));
        }
        else{
            printf("Binding %d (%s): Output. \n", bi, engine->getBindingName(bi));
        }
    }
    return EXIT_SUCCESS;
}

/**
///@brief Fetch the adjacent frames' features and align them to the current frame
///@param  curr_scene_token identifier of the current scene
///@param  ego2global_rot   ego-to-global rotation
///@param  ego2global_trans ego-to-global translation
///@param  bev_buffer       buffer holding the BEV features
 */
void BEVDet::GetAdjFrameFeature(const std::string &curr_scene_token, 
                         const Eigen::Quaternion<float> &ego2global_rot,
                         const Eigen::Translation3f &ego2global_trans,
                         float* bev_buffer) {
    /* bev_buffer : 720 x 128 x 128
    */
    bool reset = false;
    if(adj_frame_ptr->buffer_num == 0 || adj_frame_ptr->lastScenesToken() != curr_scene_token){// check whether the adjacent-frame buffer needs a reset
        adj_frame_ptr->reset();
        for(int i = 0; i < adj_num; i++){
            adj_frame_ptr->saveFrameBuffer(bev_buffer, curr_scene_token, ego2global_rot,
                                                                        ego2global_trans);// seed the buffer with the current frame's BEV features
        }
        reset = true;
    }

    for(int i = 0; i < adj_num; i++){
        const float* adj_buffer = adj_frame_ptr->getFrameBuffer(i);

        Eigen::Quaternion<float> adj_ego2global_rot;
        Eigen::Translation3f adj_ego2global_trans;
        adj_frame_ptr->getEgo2Global(i, adj_ego2global_rot, adj_ego2global_trans);// fetch the adjacent frame's ego-to-global transform by index

        cudaStream_t stream;
        CHECK_CUDA(cudaStreamCreate(&stream));
        AlignBEVFeature(ego2global_rot, adj_ego2global_rot, ego2global_trans,
                        adj_ego2global_trans, adj_buffer, 
                        bev_buffer + (i + 1) * bev_w * bev_h * bevpool_channel, stream);// align the adjacent frame's BEV features to the current frame
        CHECK_CUDA(cudaDeviceSynchronize());
        CHECK_CUDA(cudaStreamDestroy(stream));
    }


    if(!reset){
        adj_frame_ptr->saveFrameBuffer(bev_buffer, curr_scene_token, 
                                                    ego2global_rot, ego2global_trans);
    }
}

/**
///@brief Align BEV features so that adjacent frames are expressed in the same global frame
///@param  curr_ego2global_rot   ego-to-global rotation of the current frame
///@param  adj_ego2global_rot    ego-to-global rotation of the adjacent frame
///@param  curr_ego2global_trans ego-to-global translation of the current frame
///@param  adj_ego2global_trans  ego-to-global translation of the adjacent frame
///@param  input_bev        input BEV features
///@param  output_bev       aligned output BEV features
///@param  stream           CUDA stream for asynchronous operation
 */
void BEVDet::AlignBEVFeature(const Eigen::Quaternion<float> &curr_ego2global_rot,
                             const Eigen::Quaternion<float> &adj_ego2global_rot,
                             const Eigen::Translation3f &curr_ego2global_trans,
                             const Eigen::Translation3f &adj_ego2global_trans,
                             const float* input_bev,
                             float* output_bev,
                             cudaStream_t stream){
    Eigen::Matrix4f curr_e2g_transform;
    Eigen::Matrix4f adj_e2g_transform;

    for(int i = 0; i < 3; i++){
        for(int j = 0; j < 3; j++){
            curr_e2g_transform(i, j) = curr_ego2global_rot.matrix()(i, j);
            adj_e2g_transform(i, j) = adj_ego2global_rot.matrix()(i, j);
        }
    }
    for(int i = 0; i < 3; i++){
        curr_e2g_transform(i, 3) = curr_ego2global_trans.vector()(i);
        adj_e2g_transform(i, 3) = adj_ego2global_trans.vector()(i);

        curr_e2g_transform(3, i) = 0.0;
        adj_e2g_transform(3, i) = 0.0;
    }
    curr_e2g_transform(3, 3) = 1.0;
    adj_e2g_transform(3, 3) = 1.0;

    Eigen::Matrix4f currEgo2adjEgo = adj_e2g_transform.inverse() * curr_e2g_transform;// transform from the current ego frame to the adjacent ego frame
    Eigen::Matrix3f currEgo2adjEgo_2d;
    for(int i = 0; i < 2; i++){
        for(int j = 0; j < 2; j++){
            currEgo2adjEgo_2d(i, j) = currEgo2adjEgo(i, j);
        }
    }
    currEgo2adjEgo_2d(2, 0) = 0.0;
    currEgo2adjEgo_2d(2, 1) = 0.0;
    currEgo2adjEgo_2d(2, 2) = 1.0;
    currEgo2adjEgo_2d(0, 2) = currEgo2adjEgo(0, 3);
    currEgo2adjEgo_2d(1, 2) = currEgo2adjEgo(1, 3);

    Eigen::Matrix3f gridbev2egobev;
    gridbev2egobev(0, 0) = x_step;
    gridbev2egobev(1, 1) = y_step;
    gridbev2egobev(0, 2) = x_start;
    gridbev2egobev(1, 2) = y_start;
    gridbev2egobev(2, 2) = 1.0;

    gridbev2egobev(0, 1) = 0.0;
    gridbev2egobev(1, 0) = 0.0;
    gridbev2egobev(2, 0) = 0.0;
    gridbev2egobev(2, 1) = 0.0;

    Eigen::Matrix3f currgrid2adjgrid = gridbev2egobev.inverse() * currEgo2adjEgo_2d * gridbev2egobev;// grid-to-grid transform from the current frame to the adjacent frame


    float* grid_dev;
    float* transform_dev;
    CHECK_CUDA(cudaMalloc((void**)&grid_dev, bev_h * bev_w * 2 * sizeof(float)));
    CHECK_CUDA(cudaMalloc((void**)&transform_dev, 9 * sizeof(float)));


    CHECK_CUDA(cudaMemcpy(transform_dev, Eigen::Matrix3f(currgrid2adjgrid.transpose()).data(), 
                                                        9 * sizeof(float), cudaMemcpyHostToDevice));// copy the transform to device memory

    compute_sample_grid_cuda(grid_dev, transform_dev, bev_w, bev_h, stream);// compute the sampling grid


    int output_dim[4] = {1, bevpool_channel, bev_w, bev_h};
    int input_dim[4] = {1, bevpool_channel, bev_w, bev_h};
    int grid_dim[4] = {1, bev_w, bev_h, 2};
    

    grid_sample(output_bev, input_bev, grid_dev, output_dim, input_dim, grid_dim, 4,
                GridSamplerInterpolation::Bilinear, GridSamplerPadding::Zeros, true, stream);// apply the transform to the BEV features, producing the aligned output
    CHECK_CUDA(cudaFree(grid_dev));
    CHECK_CUDA(cudaFree(transform_dev));
}



/**
///@brief Run BEV detection inference: image preprocessing, forward passes, feature alignment,
///       and postprocessing; outputs the detections and the elapsed time
///@param  cam_data         struct holding the camera data and parameters
///@param  out_detections   container that receives the detections
///@param  cost_time        receives the total inference time
///@param  idx              inference index, used for debug printing
///@return int 
 */
int BEVDet::DoInfer(const camsData& cam_data, std::vector<Box> &out_detections, float &cost_time,
                                                                                    int idx){
    
    printf("-------------------%d-------------------\n", idx + 1);

    printf("scenes_token : %s, timestamp : %lld\n", cam_data.param.scene_token.data(), 
                                cam_data.param.timestamp);

    auto pre_start = high_resolution_clock::now();
    // [STEP 1] : preprocess image, including resize, crop and normalize

    // Copy the 6 camera images onto the GPU at src_imgs_dev
    CHECK_CUDA(cudaMemcpy(src_imgs_dev, cam_data.imgs_dev, 
        N_img * src_img_h * src_img_w * 3 * sizeof(uchar), cudaMemcpyDeviceToDevice));
    
    // Preprocess: resize, crop and normalize
    preprocess(src_imgs_dev, (float*)imgstage_buffer[imgbuffer_map["images"]], N_img, src_img_h, src_img_w,
        input_img_h, input_img_w, resize_radio, resize_radio, crop_h, crop_w, mean, std, pre_sample);

    // Copy the current calibration (rot/trans/intrinsics/bda) into the image-stage bindings
    InitDepth(cam_data.param.cams2ego_rot, cam_data.param.cams2ego_trans, cam_data.param.cams_intrin);

    CHECK_CUDA(cudaDeviceSynchronize());

    // Preprocessing time
    auto pre_end = high_resolution_clock::now();

    // [STEP 2] : image stage network forward
    cudaStream_t stream;
    CHECK_CUDA(cudaStreamCreate(&stream));
    
    if(!imgstage_context->enqueueV2(imgstage_buffer, stream, nullptr))
    {
        printf("Image stage forward failing!\n");
    }

    CHECK_CUDA(cudaDeviceSynchronize());
    auto imgstage_end = high_resolution_clock::now();


    // [STEP 3] : bev pool
    // size_t id1 = use_depth ? 7 : 1;
    // size_t id2 = use_depth ? 8 : 2;

    bev_pool_v2(bevpool_channel, unique_bev_num, bev_h * bev_w,
                (float*)imgstage_buffer[imgbuffer_map["depth"]], 
                (float*)imgstage_buffer[imgbuffer_map["images_feat"]], 
                ranks_depth_dev, ranks_feat_dev, ranks_bev_dev,
                interval_starts_dev, interval_lengths_dev,
                (float*)bevstage_buffer[bevbuffer_map["BEV_feat"]]
                );// pool the lifted image features into the BEV grid
    
    CHECK_CUDA(cudaDeviceSynchronize());
    auto bevpool_end = high_resolution_clock::now();


    // [STEP 4] : align BEV feature

    if(use_adj){
        GetAdjFrameFeature(cam_data.param.scene_token, cam_data.param.ego2global_rot, 
                        cam_data.param.ego2global_trans, (float*)bevstage_buffer[bevbuffer_map["BEV_feat"]]);// if adjacent frames are used, align their features
        CHECK_CUDA(cudaDeviceSynchronize());
    }
    auto align_feat_end = high_resolution_clock::now();


    // [STEP 5] : BEV stage network forward
    if(!bevstage_context->enqueueV2(bevstage_buffer, stream, nullptr)){
        printf("BEV stage forward failing!\n");
    }

    CHECK_CUDA(cudaDeviceSynchronize());
    auto bevstage_end = high_resolution_clock::now();


    // [STEP 6] : postprocess the BEV-stage outputs into boxes

    postprocess_ptr->DoPostprocess(bevstage_buffer, out_detections);
    CHECK_CUDA(cudaDeviceSynchronize());
    auto post_end = high_resolution_clock::now();

    // release stream
    CHECK_CUDA(cudaStreamDestroy(stream));

    duration<double> pre_t = pre_end - pre_start;
    duration<double> imgstage_t = imgstage_end - pre_end;
    duration<double> bevpool_t = bevpool_end - imgstage_end;
    duration<double> align_t = duration<double>(0);
    duration<double> bevstage_t;
    if(use_adj)
    {
        align_t = align_feat_end - bevpool_end;
        bevstage_t = bevstage_end - align_feat_end;
    }
    else{
        bevstage_t = bevstage_end - bevpool_end;
    }
    duration<double> post_t = post_end - bevstage_end;

    printf("[Preprocess   ] cost time: %5.3lf ms\n", pre_t.count() * 1000);
    printf("[Image stage  ] cost time: %5.3lf ms\n", imgstage_t.count() * 1000);
    printf("[BEV pool     ] cost time: %5.3lf ms\n", bevpool_t.count() * 1000);
    if(use_adj){
        printf("[Align Feature] cost time: %5.3lf ms\n", align_t.count() * 1000);
    }
    printf("[BEV stage    ] cost time: %5.3lf ms\n", bevstage_t.count() * 1000);
    printf("[Postprocess  ] cost time: %5.3lf ms\n", post_t.count() * 1000);

    duration<double> sum_time = post_end - pre_start;
    cost_time = sum_time.count() * 1000;
    printf("[Infer total  ] cost time: %5.3lf ms\n", cost_time);

    printf("Detect %ld objects\n", out_detections.size());
    return EXIT_SUCCESS;
}

BEVDet::~BEVDet(){
    CHECK_CUDA(cudaFree(ranks_bev_dev));
    CHECK_CUDA(cudaFree(ranks_depth_dev));
    CHECK_CUDA(cudaFree(ranks_feat_dev));
    CHECK_CUDA(cudaFree(interval_starts_dev));
    CHECK_CUDA(cudaFree(interval_lengths_dev));

    CHECK_CUDA(cudaFree(src_imgs_dev));

    for(int i = 0; i < imgstage_engine->getNbBindings(); i++){
        CHECK_CUDA(cudaFree(imgstage_buffer[i]));
    }
    delete[] imgstage_buffer;

    for(int i = 0; i < bevstage_engine->getNbBindings(); i++){
        CHECK_CUDA(cudaFree(bevstage_buffer[i]));
    }
    delete[] bevstage_buffer;

    imgstage_context->destroy();
    bevstage_context->destroy();

    imgstage_engine->destroy();
    bevstage_engine->destroy();

}


// Map a TensorRT DataType to its element size in bytes (defaults to 4 for unknown types)
__inline__ size_t dataTypeToSize(nvinfer1::DataType dataType)
{
    switch ((int)dataType)
    {
    case int(nvinfer1::DataType::kFLOAT):
        return 4;
    case int(nvinfer1::DataType::kHALF):
        return 2;
    case int(nvinfer1::DataType::kINT8):
        return 1;
    case int(nvinfer1::DataType::kINT32):
        return 4;
    case int(nvinfer1::DataType::kBOOL):
        return 1;
    default:
        return 4;
    }
}

Stripping the code down, these are essentially the modules involved:

#include <iostream>
#include <fstream>
#include <vector>
#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <yaml-cpp/yaml.h>

using namespace nvinfer1;

class Logger : public ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING)
            std::cout << msg << std::endl;
    }
} gLogger;

// Initialize parameters
YAML::Node init_params(const std::string& config_file) {
    return YAML::LoadFile(config_file);
}

// Load the engine
ICudaEngine* load_engine(const std::string& engine_file) {
    std::ifstream file(engine_file, std::ios::binary);
    if (!file.good()) {
        std::cerr << "Error opening engine file: " << engine_file << std::endl;
        return nullptr;
    }

    file.seekg(0, std::ios::end);
    size_t size = file.tellg();
    file.seekg(0, std::ios::beg);
    std::vector<char> buffer(size);
    file.read(buffer.data(), size);
    file.close();

    IRuntime* runtime = createInferRuntime(gLogger);
    return runtime->deserializeCudaEngine(buffer.data(), size, nullptr);
}

// Allocate device buffers for every binding
void allocate_buffers(ICudaEngine* engine, void**& buffers, cudaStream_t& stream) {
    int nbBindings = engine->getNbBindings();
    buffers = new void*[nbBindings];
    for (int i = 0; i < nbBindings; ++i) {
        Dims dims = engine->getBindingDimensions(i);
        size_t size = 1;
        for (int j = 0; j < dims.nbDims; ++j) {
            size *= dims.d[j];
        }
        // assumes every binding holds float data
        cudaMalloc(&buffers[i], size * sizeof(float));
    }
    cudaStreamCreate(&stream);
}

// Run inference
void infer(ICudaEngine* engine, void** buffers, cudaStream_t stream) {
    IExecutionContext* context = engine->createExecutionContext();
    context->enqueueV2(buffers, stream, nullptr);
    cudaStreamSynchronize(stream);
    context->destroy();
}

// Main
int main() {
    YAML::Node config = init_params("config.yaml");
    ICudaEngine* engine = load_engine("model.trt");
    if (!engine) {
        std::cerr << "Failed to load engine!" << std::endl;
        return -1;
    }

    void** buffers;
    cudaStream_t stream;
    allocate_buffers(engine, buffers, stream);

    infer(engine, buffers, stream);

    for (int i = 0; i < engine->getNbBindings(); ++i) {
        cudaFree(buffers[i]);
    }
    delete[] buffers;
    cudaStreamDestroy(stream);
    engine->destroy();

    return 0;
}

4. References

https://gitcode.csdn.net/65ed75361a836825ed799d1d.html?dp_token=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpZCI6MzA0LCJleHAiOjE3MjI0ODI3ODAsImlhdCI6MTcyMTg3Nzk4MCwidXNlcm5hbWUiOiJsb3ZlbHlfeW9zaGlubyJ9.fN2NLfgAtDManErj8rB1ctFMlV-62ZSnseIyDV3mq7Q

https://blog.csdn.net/sinat_41886501/article/details/129091397

https://blog.csdn.net/Supremelv/article/details/138153736

https://blog.csdn.net/weixin_43863869/article/details/128571567

