Table of Contents
- 1. TensorRT installation
- 1.1 CUDA/cuDNN and virtual environment setup
- 1.2 Installing the TensorRT version that matches your CUDA version
- 2. Model conversion
- 2.1 pth to ONNX
- 2.2 ONNX to engine
- 3. TensorRT deployment
- TensorRT inference (Python API)
- TensorRT inference (C++ API)
- Possible issues
- References
1. TensorRT installation
1.1 CUDA/cuDNN and virtual environment setup
CUDA download link: https://developer.nvidia.com/cuda-toolkit-archive
cuDNN download link: https://docs.nvidia.com/deeplearning/cudnn/latest/installation/windows.html
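After installing CUDA and cuDNN (and creating the virtual environment), a quick way to confirm that both are visible from Python is to query them through PyTorch. A minimal sketch, assuming PyTorch is already installed in the environment:
import torch

# Confirm the CUDA driver/toolkit is usable from this environment
print("CUDA available:", torch.cuda.is_available())
print("CUDA version PyTorch was built with:", torch.version.cuda)

# Confirm the cuDNN version PyTorch links against
print("cuDNN version:", torch.backends.cudnn.version())

# Show the detected GPU, if any
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))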
1.2 Installing the TensorRT version that matches your CUDA version
TensorRT download link: https://developer.nvidia.com/tensorrt/download
TensorRT Installation Guide
After downloading, extract the archive and add the lib folder to your PATH environment variable.
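The TensorRT archive also ships Python wheels under its python subfolder, which can be installed into the virtual environment with pip. A minimal sketch to verify the installation (the wheel file name below is only an example and depends on your TensorRT and Python versions):
# e.g. pip install TensorRT-8.5.1.7\python\tensorrt-8.5.1.7-cp38-none-win_amd64.whl
import tensorrt as trt

print("TensorRT version:", trt.__version__)

# Creating a Builder fails if the DLLs in TensorRT's lib folder are not on PATH
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
print("Platform has fast FP16:", builder.platform_has_fast_fp16)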
2. Model conversion
2.1 pth to ONNX
First install the onnx module for Python: pip install onnx
import torch

# Load the trained PyTorch model (placeholder: replace with your own model definition / checkpoint)
model = torch.load("model.pth", map_location="cpu")
model.eval()

# Dummy input in the NCHW layout the model expects (the size here is just an example)
x = torch.randn(1, 3, 256, 256)

input_name = 'input'
output_name = 'output'
torch.onnx.export(model,                    # model being run
                  x,                        # model input
                  "model.onnx",             # where to save the model (can be a file or file-like object)
                  opset_version=11,         # the ONNX opset version to export the model to
                  input_names=[input_name],     # the model's input names
                  output_names=[output_name],   # the model's output names
                  dynamic_axes={
                      input_name: {0: 'batch_size', 2: 'in_width', 3: 'in_height'},
                      output_name: {0: 'batch_size', 2: 'out_width', 3: 'out_height'}
                  })
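Before converting, it can be worth sanity-checking the exported file with the onnx package. A minimal sketch (model.onnx is the file exported above):
import onnx

# Load and structurally validate the exported model
model_onnx = onnx.load("model.onnx")
onnx.checker.check_model(model_onnx)

# The opset version must be supported by the TensorRT ONNX parser
print("ONNX opset:", model_onnx.opset_import[0].version)

# Input/output names must match the names used later for binding
print("Inputs:", [i.name for i in model_onnx.graph.input])
print("Outputs:", [o.name for o in model_onnx.graph.output])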
2.2 ONNX to engine
Note: TensorRT's ONNX parser is built against specific ONNX/PyTorch exporter versions; a version mismatch may cause errors during model conversion.
1. Using the command-line tool
This mainly means calling the trtexec executable in the bin folder.
trtexec.exe --onnx=model.onnx --saveEngine=model.engine --workspace=6000
# Build an engine with a static batch size
./trtexec --onnx=<onnx_file> \ # specify the ONNX model file
--explicitBatch \ # use an explicit batch size when building the engine (default = implicit)
--saveEngine=<tensorRT_engine_file> \ # output engine file
--workspace=<size_in_megabytes> \ # workspace size in MB (default = 16 MB)
--fp16 # enable FP16 precision in addition to FP32 (default = disabled)
# Build an engine with a dynamic batch size
./trtexec --onnx=<onnx_file> \ # specify the ONNX model file
--minShapes=input:<shape_of_min_batch> \ # minimum NCHW shape
--optShapes=input:<shape_of_opt_batch> \ # optimal input shape; usually the same as maxShapes
--maxShapes=input:<shape_of_max_batch> \ # maximum input shape
--workspace=<size_in_megabytes> \ # workspace size in MB (default = 16 MB)
--saveEngine=<engine_file> \ # output engine file
--fp16 # enable FP16 precision in addition to FP32 (default = disabled)
# Smaller images allow a larger batch size, e.g. 8x3x416x416
/home/zxl/TensorRT-7.2.3.4/bin/trtexec --onnx=yolov4_-1_3_416_416_dynamic.onnx \
--minShapes=input:1x3x416x416 \
--optShapes=input:8x3x416x416 \
--maxShapes=input:8x3x416x416 \
--workspace=4096 \
--saveEngine=yolov4_-1_3_416_416_dynamic_b8_fp16.engine \
--fp16
# Changed to 4x3x608x608 because of insufficient GPU memory
/home/zxl/TensorRT-7.2.3.4/bin/trtexec --onnx=yolov4_-1_3_608_608_dynamic.onnx \
--minShapes=input:1x3x608x608 \
--optShapes=input:4x3x608x608 \
--maxShapes=input:4x3x608x608 \
--workspace=4096 \
--saveEngine=yolov4_-1_3_608_608_dynamic_b4_fp16.engine \
--fp16
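If several models need to be converted, the same trtexec calls can be scripted. A minimal sketch, assuming trtexec is on PATH and the ONNX input tensor is named input (the paths and shapes below are placeholders):
import subprocess

def build_engine_cli(onnx_path, engine_path, chw="3x416x416", max_batch=8):
    """Call trtexec to build an FP16 engine with a dynamic batch dimension."""
    cmd = [
        "trtexec",
        f"--onnx={onnx_path}",
        f"--saveEngine={engine_path}",
        f"--minShapes=input:1x{chw}",
        f"--optShapes=input:{max_batch}x{chw}",
        f"--maxShapes=input:{max_batch}x{chw}",
        "--workspace=4096",
        "--fp16",
    ]
    subprocess.run(cmd, check=True)  # raises CalledProcessError if trtexec fails

build_engine_cli("yolov4_-1_3_416_416_dynamic.onnx",
                 "yolov4_-1_3_416_416_dynamic_b8_fp16.engine")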
In addition, you can run trtexec.exe --help to see what each trtexec option means:
D:\Work\cuda_gpu\sdk\TensorRT-8.5.1.7\bin>trtexec.exe --help
&&&& RUNNING TensorRT.trtexec [TensorRT v8501] # trtexec.exe --help
=== Model Options ===
--uff=<file> UFF model
--onnx=<file> ONNX model
--model=<file> Caffe model (default = no model, random weights used)
--deploy=<file> Caffe prototxt file
--output=<name>[,<name>]* Output names (it can be specified multiple times); at least one output is required for UFF and Caffe
--uffInput=<name>,X,Y,Z Input blob name and its dimensions (X,Y,Z=C,H,W), it can be specified multiple times; at least one is required for UFF models
--uffNHWC Set if inputs are in the NHWC layout instead of NCHW (use X,Y,Z=H,W,C order in --uffInput)
=== Build Options ===
--maxBatch Set max batch size and build an implicit batch engine (default = same size as --batch)
This option should not be used when the input model is ONNX or when dynamic shapes are provided.
--minShapes=spec Build with dynamic shapes using a profile with the min shapes provided
--optShapes=spec Build with dynamic shapes using a profile with the opt shapes provided
--maxShapes=spec Build with dynamic shapes using a profile with the max shapes provided
--minShapesCalib=spec Calibrate with dynamic shapes using a profile with the min shapes provided
--optShapesCalib=spec Calibrate with dynamic shapes using a profile with the opt shapes provided
--maxShapesCalib=spec Calibrate with dynamic shapes using a profile with the max shapes provided
Note: All three of min, opt and max shapes must be supplied.
However, if only opt shapes is supplied then it will be expanded so
that min shapes and max shapes are set to the same values as opt shapes.
Input names can be wrapped with escaped single quotes (ex: \'Input:0\').
Example input shapes spec: input0:1x3x256x256,input1:1x3x128x128
Each input shape is supplied as a key-value pair where key is the input name and
value is the dimensions (including the batch dimension) to be used for that input.
Each key-value pair has the key and value separated using a colon (:).
Multiple input shapes can be provided via comma-separated key-value pairs.
--inputIOFormats=spec Type and format of each of the input tensors (default = all inputs in fp32:chw)
See --outputIOFormats help for the grammar of type and format list.
Note: If this option is specified, please set comma-separated types and formats for all
inputs following the same order as network inputs ID (even if only one input
needs specifying IO format) or set the type and format once for broadcasting.
--outputIOFormats=spec Type and format of each of the output tensors (default = all outputs in fp32:chw)
Note: If this option is specified, please set comma-separated types and formats for all
outputs following the same order as network outputs ID (even if only one output
needs specifying IO format) or set the type and format once for broadcasting.
IO Formats: spec ::= IOfmt[","spec]
IOfmt ::= type:fmt
type ::= "fp32"|"fp16"|"int32"|"int8"
fmt ::= ("chw"|"chw2"|"chw4"|"hwc8"|"chw16"|"chw32"|"dhwc8"|
"cdhw32"|"hwc"|"dla_linear"|"dla_hwc4")["+"fmt]
--workspace=N Set workspace size in MiB.
--memPoolSize=poolspec Specify the size constraints of the designated memory pool(s) in MiB.
Note: Also accepts decimal sizes, e.g. 0.25MiB. Will be rounded down to the nearest integer bytes.
Pool constraint: poolspec ::= poolfmt[","poolspec]
poolfmt ::= pool:sizeInMiB
pool ::= "workspace"|"dlaSRAM"|"dlaLocalDRAM"|"dlaGlobalDRAM"
--profilingVerbosity=mode Specify profiling verbosity. mode ::= layer_names_only|detailed|none (default = layer_names_only)
--minTiming=M Set the minimum number of iterations used in kernel selection (default = 1)
--avgTiming=M Set the number of times averaged in each iteration for kernel selection (default = 8)
--refit Mark the engine as refittable. This will allow the inspection of refittable layers
and weights within the engine.
--sparsity=spec Control sparsity (default = disabled).
Sparsity: spec ::= "disable", "enable", "force"
Note: Description about each of these options is as below
disable = do not enable sparse tactics in the builder (this is the default)
enable = enable sparse tactics in the builder (but these tactics will only be
considered if the weights have the right sparsity pattern)
force = enable sparse tactics in the builder and force-overwrite the weights to have
a sparsity pattern (even if you loaded a model yourself)
--noTF32 Disable tf32 precision (default is to enable tf32, in addition to fp32)
--fp16 Enable fp16 precision, in addition to fp32 (default = disabled)
--int8 Enable int8 precision, in addition to fp32 (default = disabled)
--best Enable all precisions to achieve the best performance (default = disabled)
--directIO Avoid reformatting at network boundaries. (default = disabled)
--precisionConstraints=spec Control precision constraint setting. (default = none)
Precision Constaints: spec ::= "none" | "obey" | "prefer"
none = no constraints
prefer = meet precision constraints set by --layerPrecisions/--layerOutputTypes if possible
obey = meet precision constraints set by --layerPrecisions/--layerOutputTypes or fail
otherwise
--layerPrecisions=spec Control per-layer precision constraints. Effective only when precisionConstraints is set to
"obey" or "prefer". (default = none)
The specs are read left-to-right, and later ones override earlier ones. "*" can be used as a
layerName to specify the default precision for all the unspecified layers.
Per-layer precision spec ::= layerPrecision[","spec]
layerPrecision ::= layerName":"precision
precision ::= "fp32"|"fp16"|"int32"|"int8"
--layerOutputTypes=spec Control per-layer output type constraints. Effective only when precisionConstraints is set to
"obey" or "prefer". (default = none)
The specs are read left-to-right, and later ones override earlier ones. "*" can be used as a
layerName to specify the default precision for all the unspecified layers. If a layer has more than
one output, then multiple types separated by "+" can be provided for this layer.
Per-layer output type spec ::= layerOutputTypes[","spec]
layerOutputTypes ::= layerName":"type
type ::= "fp32"|"fp16"|"int32"|"int8"["+"type]
--calib=<file> Read INT8 calibration cache file
--safe Enable build safety certified engine
--consistency Perform consistency checking on safety certified engine
--restricted Enable safety scope checking with kSAFETY_SCOPE build flag
--saveEngine=<file> Save the serialized engine
--loadEngine=<file> Load a serialized engine
--tacticSources=tactics Specify the tactics to be used by adding (+) or removing (-) tactics from the default
tactic sources (default = all available tactics).
Note: Currently only cuDNN, cuBLAS, cuBLAS-LT, and edge mask convolutions are listed as optional
tactics.
Tactic Sources: tactics ::= [","tactic]
tactic ::= (+|-)lib
lib ::= "CUBLAS"|"CUBLAS_LT"|"CUDNN"|"EDGE_MASK_CONVOLUTIONS"
|"JIT_CONVOLUTIONS"
For example, to disable cudnn and enable cublas: --tacticSources=-CUDNN,+CUBLAS
--noBuilderCache Disable timing cache in builder (default is to enable timing cache)
--heuristic Enable tactic selection heuristic in builder (default is to disable the heuristic)
--timingCacheFile=<file> Save/load the serialized global timing cache
--preview=features Specify preview feature to be used by adding (+) or removing (-) preview features from the default
Preview Features: features ::= [","feature]
feature ::= (+|-)flag
flag ::= "fasterDynamicShapes0805"
|"disableExternalTacticSourcesForCore0805"
=== Inference Options ===
--batch=N Set batch size for implicit batch engines (default = 1)
This option should not be used when the engine is built from an ONNX model or when dynamic
shapes are provided when the engine is built.
--shapes=spec Set input shapes for dynamic shapes inference inputs.
Note: Input names can be wrapped with escaped single quotes (ex: \'Input:0\').
Example input shapes spec: input0:1x3x256x256, input1:1x3x128x128
Each input shape is supplied as a key-value pair where key is the input name and
value is the dimensions (including the batch dimension) to be used for that input.
Each key-value pair has the key and value separated using a colon (:).
Multiple input shapes can be provided via comma-separated key-value pairs.
--loadInputs=spec Load input values from files (default = generate random inputs). Input names can be wrapped with single quotes (ex: 'Input:0')
Input values spec ::= Ival[","spec]
Ival ::= name":"file
--iterations=N Run at least N inference iterations (default = 10)
--warmUp=N Run for N milliseconds to warmup before measuring performance (default = 200)
--duration=N Run performance measurements for at least N seconds wallclock time (default = 3)
--sleepTime=N Delay inference start with a gap of N milliseconds between launch and compute (default = 0)
--idleTime=N Sleep N milliseconds between two continuous iterations(default = 0)
--streams=N Instantiate N engines to use concurrently (default = 1)
--exposeDMA Serialize DMA transfers to and from device (default = disabled).
--noDataTransfers Disable DMA transfers to and from device (default = enabled).
--useManagedMemory Use managed memory instead of separate host and device allocations (default = disabled).
--useSpinWait Actively synchronize on GPU events. This option may decrease synchronization time but increase CPU usage and power (default = disabled)
--threads Enable multithreading to drive engines with independent threads or speed up refitting (default = disabled)
--useCudaGraph Use CUDA graph to capture engine execution and then launch inference (default = disabled).
This flag may be ignored if the graph capture fails.
--timeDeserialize Time the amount of time it takes to deserialize the network and exit.
--timeRefit Time the amount of time it takes to refit the engine before inference.
--separateProfileRun Do not attach the profiler in the benchmark run; if profiling is enabled, a second profile run will be executed (default = disabled)
--buildOnly Exit after the engine has been built and skip inference perf measurement (default = disabled)
--persistentCacheRatio Set the persistentCacheLimit in ratio, 0.5 represent half of max persistent L2 size (default = 0)
=== Build and Inference Batch Options ===
When using implicit batch, the max batch size of the engine, if not given,
is set to the inference batch size;
when using explicit batch, if shapes are specified only for inference, they
will be used also as min/opt/max in the build profile; if shapes are
specified only for the build, the opt shapes will be used also for inference;
if both are specified, they must be compatible; and if explicit batch is
enabled but neither is specified, the model must provide complete static
dimensions, including batch size, for all inputs
Using ONNX models automatically forces explicit batch.
=== Reporting Options ===
--verbose Use verbose logging (default = false)
--avgRuns=N Report performance measurements averaged over N consecutive iterations (default = 10)
--percentile=P1,P2,P3,... Report performance for the P1,P2,P3,... percentages (0<=P_i<=100, 0 representing max perf, and 100 representing min perf; (default = 90,95,99%)
--dumpRefit Print the refittable layers and weights from a refittable engine
--dumpOutput Print the output tensor(s) of the last inference iteration (default = disabled)
--dumpProfile Print profile information per layer (default = disabled)
--dumpLayerInfo Print layer information of the engine to console (default = disabled)
--exportTimes=<file> Write the timing results in a json file (default = disabled)
--exportOutput=<file> Write the output tensors to a json file (default = disabled)
--exportProfile=<file> Write the profile information per layer in a json file (default = disabled)
--exportLayerInfo=<file> Write the layer information of the engine in a json file (default = disabled)
=== System Options ===
--device=N Select cuda device N (default = 0)
--useDLACore=N Select DLA core N for layers that support DLA (default = none)
--allowGPUFallback When DLA is enabled, allow GPU fallback for unsupported layers (default = disabled)
--plugins Plugin library (.so) to load (can be specified multiple times)
=== Help ===
--help, -h Print this message
2. Using the TensorRT API to build the TensorRT engine
import tensorrt as trt

def generate_engine(onnx_path, engine_path):
    # 1. Create a TensorRT logger
    logger = trt.Logger(trt.Logger.WARNING)
    # Initialize the built-in plugins
    trt.init_libnvinfer_plugins(logger, namespace="")
    # 2. Create a builder from the logger
    builder = trt.Builder(logger)
    # 3. Create a builder config that controls how TensorRT optimizes the model
    config = builder.create_builder_config()
    # Set the workspace memory limit
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 20)  # 1 MiB
    # Set the precision; INT8 would additionally require calibration
    config.set_flag(trt.BuilderFlag.FP16)
    # 4. Create a network. EXPLICIT_BATCH makes the batch dimension explicit (required for ONNX)
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    # Create the ONNX parser
    parser = trt.OnnxParser(network, logger)
    # Parse the ONNX model and populate the network
    success = parser.parse_from_file(onnx_path)
    # Report parsing errors
    for idx in range(parser.num_errors):
        print(parser.get_error(idx))
    if not success:
        pass  # Error handling code here
    # 5. Serialize the engine, i.e. build the trt engine model
    serialized_engine = builder.build_serialized_network(network, config)
    # Save the serialized engine for later use. The engine is not portable:
    # it depends on the TensorRT version and the GPU it was built on.
    with open(engine_path, "wb") as f:
        f.write(serialized_engine)
    # 6. To run inference later, deserialize the engine with the runtime interface:
    # runtime = trt.Runtime(logger)
    # engine = runtime.deserialize_cuda_engine(serialized_engine)
    # with open("sample.engine", "rb") as f:
    #     serialized_engine = f.read()
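A minimal usage sketch (the file names are just examples):
if __name__ == "__main__":
    # Build an FP16 engine from the ONNX model exported in section 2.1
    generate_engine("model.onnx", "model.engine")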
After completing the above steps, you will have a model file converted to the TensorRT format (e.g. model.engine). This file can then be used for TensorRT inference and deployment.
3. TensorRT deployment
TensorRT deployment offers two APIs, Python and C++; you can use whichever fits your project.
TensorRT inference (Python API)
Once the TensorRT environment is installed, you can try converting pretrained weights and wrapping them for deployment by running the following code:
import numpy as np
import torch
import tensorrt as trt
from collections import OrderedDict, namedtuple

def infer(img_data, engine_path, device=torch.device('cuda:0')):
    # 1. Logger
    logger = trt.Logger(trt.Logger.INFO)
    # 2. Use the runtime to load the TensorRT engine
    runtime = trt.Runtime(logger)
    trt.init_libnvinfer_plugins(logger, '')  # initialize TensorRT plugins
    with open(engine_path, "rb") as f:
        serialized_engine = f.read()
    engine = runtime.deserialize_cuda_engine(serialized_engine)
    # 3. Bind the inputs and outputs
    bindings = OrderedDict()
    Binding = namedtuple('Binding', ('name', 'dtype', 'shape', 'data', 'ptr'))
    fp16 = False
    for index in range(engine.num_bindings):
        name = engine.get_binding_name(index)
        dtype = trt.nptype(engine.get_binding_dtype(index))
        shape = tuple(engine.get_binding_shape(index))
        data = torch.from_numpy(np.empty(shape, dtype=np.dtype(dtype))).to(device)
        # Tensor.data_ptr() returns the address of the tensor's first element as an int
        bindings[name] = Binding(name, dtype, shape, data, int(data.data_ptr()))
        if engine.binding_is_input(index) and dtype == np.float16:
            fp16 = True
    # Record the device pointer of every input/output
    binding_addrs = OrderedDict((n, d.ptr) for n, d in bindings.items())
    # 4. Load the data, bind it, and run inference
    context = engine.create_execution_context()
    binding_addrs['images'] = int(img_data.data_ptr())
    context.execute_v2(list(binding_addrs.values()))
    # 5. Fetch the results (by the input/output names set when exporting the ONNX model)
    nums = bindings['num'].data[0]
    boxes = bindings['boxes'].data[0]
    scores = bindings['scores'].data[0]
    classes = bindings['classes'].data[0]
    return nums, boxes, scores, classes
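A minimal usage sketch, assuming a YOLO-style engine with static shapes, an input tensor named images, and outputs named num/boxes/scores/classes (the file names and the 416x416 size are placeholders):
import cv2
import numpy as np
import torch

# Preprocess one image into a contiguous float32 NCHW tensor on the GPU
img = cv2.imread("test.jpg")
img = cv2.resize(img, (416, 416))
img = img[:, :, ::-1].transpose(2, 0, 1)                   # BGR -> RGB, HWC -> CHW
img = np.ascontiguousarray(img, dtype=np.float32) / 255.0
img_data = torch.from_numpy(img).unsqueeze(0).to(torch.device('cuda:0'))

nums, boxes, scores, classes = infer(img_data, "model_b1_fp16.engine")
print(nums, boxes.shape, scores.shape, classes.shape)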
TensorRT inference (C++ API)
In the project's property pages, open the Linker settings.
Under Linker -> Input, add the following to Additional Dependencies:
cudnn.lib
cublas.lib
cudart.lib
nvinfer.lib
nvparsers.lib
nvonnxparser.lib
nvinfer_plugin.lib
opencv_world460d.lib
Full inference code:
#include <cassert>
#include <cfloat>
#include <fstream>
#include <iostream>
#include <memory>
#include <sstream>
#include <cuda_runtime_api.h>
#include "NvInfer.h"
#include "NvOnnxParser.h"
#include "logger.h"
using sample::gLogError;
using sample::gLogInfo;
using namespace nvinfer1;
// The logger controls which log levels are printed
// TRTLogger inherits from nvinfer1::ILogger
class TRTLogger : public nvinfer1::ILogger
{
void log(Severity severity, const char *msg) noexcept override
{
// Suppress INFO-level messages
if (severity != Severity::kINFO)
std::cout << msg << std::endl;
}
} gLogger;
int ReadEngineData(const char* enginePath, char *&engineData)
{
// Read the engine file
std::ifstream engineFile(enginePath, std::ios::binary);
if (engineFile.fail())
{
std::cerr << "Failed to open file!" << std::endl;
return -1;
}
engineFile.seekg(0, std::ifstream::end);
auto fsize = engineFile.tellg();
engineFile.seekg(0, std::ifstream::beg);
if (nullptr == engineData)
{
engineData = new char[fsize];
}
engineFile.read(engineData, fsize);
engineFile.close();
return fsize;
}
size_t getMemorySize(nvinfer1::Dims32 input_dims, int typeSize)
{
size_t psize = input_dims.d[0] * input_dims.d[1] * input_dims.d[2] * input_dims.d[3] * typeSize;
return psize;
}
bool inferDemo(float* input_buffer, int* tensorSize)
{
int batchsize = tensorSize[0];
int channel = tensorSize[1];
int width = tensorSize[2];
int height = tensorSize[3];
size_t dataSize = width * height*channel*batchsize;
// Read the engine file
const char* enginePath = "net_model.engine";
char* engineData = nullptr;
int fsize = ReadEngineData(enginePath, engineData);
printf("fsize=%d\n", fsize);
// Create the runtime & load the engine
// TRTLogger glogger; // could be used instead of sample::gLogger.getTRTLogger()
std::unique_ptr<nvinfer1::IRuntime> runtime{ nvinfer1::createInferRuntime(sample::gLogger.getTRTLogger()) };
std::unique_ptr<nvinfer1::ICudaEngine> mEngine(runtime->deserializeCudaEngine(engineData, fsize));
assert(mEngine.get() != nullptr);
delete[] engineData; // the serialized engine data is no longer needed after deserialization
engineData = nullptr;
// Create the execution context
std::unique_ptr<nvinfer1::IExecutionContext> context(mEngine->createExecutionContext());
const char* name0 = mEngine->getBindingName(0);
const char* name1 = mEngine->getBindingName(1);
const char* name2 = mEngine->getBindingName(2);
const char* name3 = mEngine->getBindingName(3);
printf("name0=%s\nname1=%s\nname2=%s\nname3=%s\n", name0, name1, name2, name3);
// Get the input size
auto input_idx = mEngine->getBindingIndex("input");
if (input_idx == -1)
{
return false;
}
assert(mEngine->getBindingDataType(input_idx) == nvinfer1::DataType::kFLOAT);
auto input_dims = context->getBindingDimensions(input_idx);
context->setBindingDimensions(input_idx, input_dims);
auto input_size = getMemorySize(input_dims, sizeof(float_t));
// Get the output sizes; device memory must be allocated for every output
auto output1_idx = mEngine->getBindingIndex("output1");
if (output1_idx == -1)
{
return false;
}
assert(mEngine->getBindingDataType(output1_idx) == nvinfer1::DataType::kFLOAT);
auto output1_dims = context->getBindingDimensions(output1_idx);
auto output1_size = getMemorySize(output1_dims, sizeof(float_t));
auto output2_idx = mEngine->getBindingIndex("output2");
if (output2_idx == -1)
{
return false;
}
assert(mEngine->getBindingDataType(output2_idx) == nvinfer1::DataType::kFLOAT);
auto output2_dims = context->getBindingDimensions(output2_idx);
auto output2_size = getMemorySize(output2_dims, sizeof(float_t));
auto output3_idx = mEngine->getBindingIndex("output3");
if (output3_idx == -1)
{
return false;
}
assert(mEngine->getBindingDataType(output3_idx) == nvinfer1::DataType::kFLOAT);
auto output3_dims = context->getBindingDimensions(output3_idx);
auto output3_size = getMemorySize(output3_dims, sizeof(float_t));
// Prepare for inference
// Allocate CUDA memory
void* input_mem{ nullptr };
if (cudaMalloc(&input_mem, input_size) != cudaSuccess)
{
gLogError << "ERROR: input cuda memory allocation failed, size = " << input_size << " bytes" << std::endl;
return false;
}
void* output1_mem{ nullptr };
if (cudaMalloc(&output1_mem, output1_size) != cudaSuccess)
{
gLogError << "ERROR: output cuda memory allocation failed, size = " << output1_size << " bytes" << std::endl;
return false;
}
void* output2_mem{ nullptr };
if (cudaMalloc(&output2_mem, output2_size) != cudaSuccess)
{
gLogError << "ERROR: output cuda memory allocation failed, size = " << output2_size << " bytes" << std::endl;
return false;
}
void* output3_mem{ nullptr };
if (cudaMalloc(&output3_mem, output3_size) != cudaSuccess)
{
gLogError << "ERROR: output cuda memory allocation failed, size = " << output3_size << " bytes" << std::endl;
return false;
}
// Copy the input data to the device
cudaMemcpy(input_mem, input_buffer, input_size, cudaMemcpyHostToDevice); // cudaMemcpyHostToDevice: host to device, i.e. RAM to GPU memory
// Bind the input/output device memory and pass it to inference together
void* bindings[4];
bindings[input_idx] = input_mem;
bindings[output1_idx] = output1_mem;
bindings[output2_idx] = output2_mem;
bindings[output3_idx] = output3_mem;
// Run inference
bool status = context->executeV2(bindings);
if (!status)
{
gLogError << "ERROR: inference failed" << std::endl;
cudaFree(input_mem);
cudaFree(output1_mem);
cudaFree(output2_mem);
cudaFree(output3_mem);
return 0;
}
// Retrieve the results
float* output3_buffer = new float[output3_size / sizeof(float)];
cudaMemcpy(output3_buffer, output3_mem, output3_size, cudaMemcpyDeviceToHost);
// Free the CUDA memory
cudaFree(input_mem);
cudaFree(output1_mem);
cudaFree(output2_mem);
cudaFree(output3_mem);
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
gLogError << "ERROR: failed to free CUDA memory: " << cudaGetErrorString(err) << std::endl;
return false;
}
// save the results
delete[] output3_buffer;
output3_buffer = nullptr;
return true;
}
int main()
{
int batchsize = 1;
int channel = 3;
int width = 256;
int height = 256;
size_t dataSize = width * height*channel*batchsize;
int tensorSize[4] = { batchsize, channel, width, height };
float* input_buffer = new float[dataSize];
for (int i = 0; i < dataSize; i++)
input_buffer[i] = 0.1;
inferDemo(input_buffer, tensorSize);
delete[] input_buffer;
input_buffer = nullptr;
system("pause");
return 0;
}
Possible issues
cuDNN library errors during TRT conversion:
Install the correct cuDNN version, and also copy the DLLs from TensorRT's lib folder into CUDA's bin folder.
Missing zlibwapi.dll during TRT conversion:
Could not locate zlibwapi.dll. Please make sure it is in your library path!
Solutions found online include downloading it from the NVIDIA website (though it seems to have been taken down since 2023) or building it from source.
zlib source: https://github.com/madler/zlib
In my case, I fixed it by searching my machine for an existing zlibwapi.dll (found under the installed PyTorch path, DingTalk, and Origin) and copying it into CUDA's bin folder.
Size errors during TensorRT inference:
If the model was exported with dynamic shapes, you need to set the input shape yourself before running inference.
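A minimal Python sketch of setting the input shape on a dynamic-shape engine before inference (the engine file name, binding index, and shape below are placeholders):
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
runtime = trt.Runtime(logger)
with open("model_dynamic.engine", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# For a dynamic-shape engine the actual input shape must be set before execute_v2
context.set_binding_shape(0, (1, 3, 416, 416))  # binding index 0 = input tensor
assert context.all_binding_shapes_specified
print("input shape:", tuple(context.get_binding_shape(0)))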
References
TensorRT basics (1): implementing the model inference pipeline
Basic ONNX operations
[Model deployment] Installing and using TensorRT (important; the Python deployment described there worked)
Basic steps for deploying a model with TensorRT (C++) (somewhat useful as a reference)
TensorRT optimized deployment (1): TensorRT and ONNX basics
TensorRT fundamentals and applications [C++ deep learning deployment (10)] (worth consulting for C++ deployment)
A detailed walkthrough of deploying YOLOv5 with TensorRT on Windows: the full pipeline and details
A simple tutorial on deploying ONNX models with TensorRT C++ on Windows (the C++ deployment described there worked)
TensorRT inference in Python
TensorRT in practice: building and deploying Python inference models (2) (converting models with the Python TensorRT API)
[TensorRT] Converting to an engine with the trtexec tool (ONNX-to-engine commands)
Converting ONNX to an engine with the trtexec tool