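Running a YOLOv5 ONNX model with the OpenCV C++ dnn module

Preface

A recent project required me to deploy a YOLOv5 object detection model on a board with very limited compute (CPU-only inference). Python runs slowly on the board and uses too much memory, so loading and running the trained pt model with torch.hub.load was not a workable option. This article therefore describes an alternative: use the dnn module of OpenCV C++ to read the trained model in ONNX format and run it purely on the CPU.
Environment used in this article: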
- OpenCV 4.5
- pytorch 2.4.1+cu121
- python 3.10.2

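1 The usual approach: `torch.hub.load`

1-1 API overview

torch.hub.load is a PyTorch utility for loading pretrained models or model definitions. It covers two common cases:
1. Loading a pretrained model: trained weights can be loaded directly from a GitHub repository, which is useful for quick prototyping and experiments.
2. Loading a model definition: the model structure can be loaded without pretrained weights, which is useful for custom training.
The basic syntax of torch.hub.load is: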
```
torch.hub.load(repo_or_dir, model, *args, **kwargs)
```
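- repo_or_dir: a GitHub repository name or a local directory path. For a GitHub repository the format is usually 'user/repo'.
- model: the model name, i.e. an entry point defined in the repository's hubconf.py.
- *args and **kwargs: arguments passed on to that entry point.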

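1-2 Model inference

I have a trained YOLOv5 detection model file, best.pt. We load it with torch.hub.load, parse the results, and draw them on the original frame with OpenCV.
The complete code comes first; the details follow.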

import cv2
import torch
import time
import platform
import pathlib
plt = platform.system()
if plt != 'Windows':
  pathlib.WindowsPath = pathlib.PosixPath

model_path = "/home/zhlucifer/yolov5/runs/train/exp9/weights/best.pt"
device = torch.device('cpu')


model = torch.hub.load('./', 'custom', model_path, source='local')

print(torch.__version__)

video_path = "/home/zhlucifer/yolov5/3.mp4"
cap = cv2.VideoCapture(video_path)
class_names=['right_sign','left_sign']
while cap.isOpened():
    ret, frame = cap.read()
    if ret:
        results = model(frame)

        detections = results.xyxy[0]
        for *xyxy, conf, cls in detections:
            if conf.item()<0.6:
                continue
            label = class_names[int(cls)]
            x1, y1, x2, y2 = int(xyxy[0]), int(xyxy[1]), int(xyxy[2]), int(xyxy[3])
            cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
            conf_str = "{:.2f}".format(conf.item())
            cv2.putText(frame, label+conf_str, (x1, y1 - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.9, (0, 255, 0), 2)
        cv2.imshow('frame', frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break


    else:
        break

cap.release()
cv2.destroyAllWindows()

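Calling model(frame) runs object detection on the frame. The YOLOv5 hub model returns a results object rather than a plain dictionary. Its xyxy attribute is a list with one tensor per input image, and each row of a tensor holds (x_min, y_min, x_max, y_max, confidence, class ID). We take element [0] because each loop iteration passes in a single image.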

results = model(frame)
detections = results.xyxy[0]

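We then unpack each detection: *xyxy holds the box coordinates, conf is the confidence, and cls is the class ID. Detections below the confidence threshold are skipped, and the rest are drawn on the frame.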

for *xyxy, conf, cls in detections:
	if conf.item()<0.6:
		continue
	label = class_names[int(cls)]
	x1, y1, x2, y2 = int(xyxy[0]), int(xyxy[1]), int(xyxy[2]), int(xyxy[3])

	cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)

	conf_str = "{:.2f}".format(conf.item())

	cv2.putText(frame, label+conf_str, (x1, y1 - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.9, (0, 255, 0), 2)

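1-3 Results and problems

As mentioned in the preface, the board is short on compute and has only a CPU. Python runs slowly there, and import torch and torch.hub.load add considerable loading time of their own, so we look for another way to run the model.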

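2 Exporting the ONNX model

We will load the model with cv::dnn::readNetFromONNX from OpenCV's dnn module, so the trained pt file (or another model format) must first be converted to ONNX.
Anyone who has worked with YOLOv5 will know export.py, the model conversion script that sits in the same directory as train.py and detect.py.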

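2-1 Exporting to ONNX with export.py

Before running export.py, a few of its arguments need to be set. The argument definitions can be found in export.py itself.
The ones we need are:
- --data: path to the dataset configuration file, which contains the dataset paths, class names, and so on.
- --weights: path to the trained weights file.
- --imgsz: input image size in pixels, usually a multiple of 32 such as 640 or 512.
- --device: device used for the export, e.g. 'cpu', '0' (first GPU) or '0,1,2,3' (multiple GPUs).
- --opset: ONNX opset version; different versions support different operators. (Be sure to use 12 here, otherwise you may run into problems.)
- --include: formats to export, e.g. onnx, torchscript, coreml.
Using my project as an example: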

python export.py --data "F:/2024bicycle/yolov5/data/bicycle2024_LR_sign.yaml" --weights "F:/2024bicycle/yolov5/runs/train/exp9/weights/best.pt" --imgsz 640 640 --device cpu --opset 12 --include onnx

A successful run prints an export success message together with the path of the generated .onnx file.

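2-2 Verifying the ONNX model

To check the export, we run the converted onnx file through the same torch.hub.load script as before. From this step on I moved to the board itself, so the following change is needed for the code to work on the Linux board: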

import platform
import pathlib
plt = platform.system()
if plt != 'Windows':
    pathlib.WindowsPath = pathlib.PosixPath

We also add the following before the frame is passed to the model:
```
frame=cv2.resize(frame,(640,640))
```
- The ONNX model was exported with a fixed 640*640 input, so the frame has to be resized to match.
We also need to install the package that provides the ONNX runtime:

pip install onnxruntime

The results are the same as before, which indicates that the ONNX conversion worked.

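3 Loading the ONNX model with the dnn module

3-1 API overview

cv::dnn::readNetFromONNX is a function in OpenCV's deep learning module (cv::dnn) that loads a neural network from a file in ONNX (Open Neural Network Exchange) format. ONNX is an open format for representing deep learning models so that they can be moved between frameworks.
Function signature:
```
cv::dnn::Net cv::dnn::readNetFromONNX(const String& onnxFile);
```
- onnxFile: path to the ONNX model file, absolute or relative.
- Unlike the generic cv::dnn::readNet, there is no separate config argument: an ONNX file already contains both the network structure and the weights.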

3-2 First attempt and troubleshooting

Let's try loading the onnx model:

#include <iostream>
#include <opencv2/opencv.hpp>
#include <opencv2/dnn.hpp>


int main() {
    cv::dnn::Net net;
    try
    {
       net =  cv::dnn::readNetFromONNX("/home/zhlucifer/yolov5/runs/train/exp9/weights/best.onnx");

    }
    catch(const std::exception& e)
    {
        std::cerr << e.what() << '\n';
    }
    
    
    if (net.empty()) {
        std::cerr << "model load failed!" << std::endl;
        return -1;
    }else{
        std::cout << "model load succeeded!" << std::endl;
    }

   
    return 0;
}

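Running the code produces an error.
Don't panic; read the message. It mentions Identity. In an ONNX graph an Identity node simply passes its input through unchanged; such nodes are used, for example, to mark a step or to keep the data flow in the graph connected. The error shows that this layer is what breaks model loading, so it needs to be removed.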

3-3 `onnx-simplifier`

onnx-simplifier is an open-source tool for simplifying ONNX models. It optimizes the graph so that the model is more efficient and easier to understand and deploy, while keeping the predictions unchanged.
Usage takes only a couple of commands, where input.onnx is the original model file and output.onnx is the simplified one:

pip install onnx-simplifier  # install onnx-simplifier
python -m onnxsim input.onnx output.onnx  # simplify the model

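Running the simplifier checks every layer in the model and removes those that have no effect on the output, such as Identity.
After rerunning the loading code with the simplified file, the model loads correctly.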

4 Inference with the dnn module

4-1 Overview

As before, the complete code comes first, followed by a step-by-step explanation.

#include <iostream>
#include <opencv2/opencv.hpp>
#include <opencv2/dnn.hpp>

using namespace cv;
using namespace cv::dnn;

std::vector<std::string> classes = {
    "right_sign", "left_sign"
 
};

int main() {
    Net net =  cv::dnn::readNetFromONNX("/home/zhlucifer/OpencvProject/src/yolo-cpp/models/simply_best.onnx");

    if (net.empty()) {
        std::cerr << "Error: Could not load the neural network." << std::endl;
        return -1;
    }

    VideoCapture cap("/home/zhlucifer/OpencvProject/src/yolo-cpp/videos/1.mp4");
    if (!cap.isOpened()) {
        std::cerr << "Error: Could not open the video file." << std::endl;
        return -1;
    }


    const int input_width = 640;
    const int input_height = 640;
    // Scale factors from the 640x640 network input back to the frame (hard-coded for a 640x480 video)
    float x_factor=640/640.0f;
    float y_factor=480/640.0f;

    namedWindow("Object Detection", WINDOW_NORMAL);

    Mat frame, blob;
    while (cap.read(frame)) {
        resize(frame, blob, Size(input_width, input_height));


        cv::Mat blob = cv::dnn::blobFromImage(frame, 1 / 255.0, cv::Size(640, 640), cv::Scalar(0, 0, 0), true, false);
        net.setInput(blob);
        cv::Mat preds =net.forward();
        std::cout << "rows: "<< preds.size[1]<< " data: " << preds.size[2] << std::endl;
        cv::Mat det_output(preds.size[1], preds.size[2], CV_32F, preds.ptr<float>());
        //In a typical YOLO output, the format is [x_center, y_center, width, height, object_confidence, class_score1, class_score2, ..., class_scoreN] for each bounding box. 
        for (int i = 0; i < det_output.rows; i++) {
            float confidence = det_output.at<float>(i, 4);
            cv::Mat class_scores = det_output.row(i).colRange(5, 5 + classes.size());

         
            Point class_id_point;
            double max_class_score;
            minMaxLoc(class_scores, NULL, &max_class_score, NULL, &class_id_point);
            int class_id = class_id_point.x; 

           float final_confidence = confidence * max_class_score;

            std::cout << "Final confidence: " << final_confidence << std::endl;

            if (final_confidence < 0.45) {
                continue;
            }
         
            float cx = det_output.at<float>(i, 0);
            float cy = det_output.at<float>(i, 1);
            float ow = det_output.at<float>(i, 2);
            float oh = det_output.at<float>(i, 3);
            int x = static_cast<int>((cx - 0.5 * ow) * x_factor);
            int y = static_cast<int>((cy - 0.5 * oh) * y_factor);
            int width = static_cast<int>(ow * x_factor);
            int height = static_cast<int>(oh * y_factor);
            cv::rectangle(frame, cv::Point(x, y), cv::Point(x + width, y + height), cv::Scalar(0, 0, 255), 2, 8);
            putText(frame,classes.at(class_id) +std::to_string(final_confidence), Point(x,y - 10), FONT_HERSHEY_SIMPLEX, 0.9, Scalar(0, 255, 0), 2);
               
                }
                imshow("Object Detection", frame);

                if (waitKey(1) == 'q') {
                    break;
                }
    }

    cap.release();
    destroyAllWindows();

    return 0;
}

4-2 Loading the model and reading the video

This part of the code loads the model and opens the video.

#include <iostream>
#include <opencv2/opencv.hpp>
#include <opencv2/dnn.hpp>

using namespace cv;
using namespace cv::dnn;

std::vector<std::string> classes = {
    "right_sign", "left_sign"
 
};

int main() {
    Net net =  cv::dnn::readNetFromONNX("/home/zhlucifer/OpencvProject/src/yolo-cpp/models/simply_best.onnx");

    if (net.empty()) {
        std::cerr << "Error: Could not load the neural network." << std::endl;
        return -1;
    }

    VideoCapture cap("/home/zhlucifer/OpencvProject/src/yolo-cpp/videos/1.mp4");
    if (!cap.isOpened()) {
        std::cerr << "Error: Could not open the video file." << std::endl;
        return -1;
    }


    const int input_width = 640;
    const int input_height = 640;
    // Scale factors from the 640x640 network input back to the frame (hard-coded for a 640x480 video)
    float x_factor=640/640.0f;
    float y_factor=480/640.0f;

    namedWindow("Object Detection", WINDOW_NORMAL);

    Mat frame, blob;

4-3 Model input and preprocessing

The network expects its input as a blob, so each frame has to be converted first.

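4-3-1 Blob

In computer vision and deep learning, a "blob" (originally short for Binary Large Object, though that is not the meaning here) is a general term for image data that has been preprocessed and packed into the layout a neural network expects as input.
A blob typically has these properties:
1. Multi-dimensional array: usually 4 or 5 dimensions, holding image data that has been normalized and possibly preprocessed in other ways (scaling, cropping, flipping, and so on).
2. Normalization: pixel values are usually normalized to a fixed range such as [0, 1] or [-1, 1], which helps the network train and converge.
3. Channel order: a color image has three channels (red, green, blue), but their order may need to be changed to match the network (for example, BGR to RGB).
4. Batching: several images are often processed at once, so a blob has an extra dimension for the number of images in the batch.
Blob dimensions are usually written as [batch size, channels, height, width] for still images, or [batch size, time steps, channels, height, width] for video or sequence data.
- Example blob:
  - Batch size: 4
  - Channels: 3 (color image)
  - Height: 224
  - Width: 224
- The dimensions of this blob are [4, 3, 224, 224].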

resize(frame, blob, Size(input_width, input_height));
cv::Mat blob = cv::dnn::blobFromImage(frame, 1 / 255.0, cv::Size(640, 640), cv::Scalar(0, 0, 0), true, false);

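blobFromImage converts an image into the input format the network needs (a blob).
frame is the original frame. blobFromImage resizes to the requested size itself, so the preceding resize call is redundant: its result is never used, because the blob declared on the next line shadows it.
1 / 255.0 is the scale factor that normalizes pixel values from [0, 255] to [0, 1].
cv::Size(640, 640) is the target size, matching the input size the model was exported with.
cv::Scalar(0, 0, 0) is the mean to subtract; with all zeros, no mean subtraction is performed.
true sets swapRB, which swaps the R and B channels: OpenCV stores images as BGR, while the model expects RGB.
false sets crop: the image is not cropped but resized directly to the target size, so the original aspect ratio is not preserved.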
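
To make the layout change concrete, here is a minimal sketch in plain C++ of the core of what the parameters above describe: interleaved BGR bytes become a planar float tensor, with scaling and an optional R/B swap. The function name hwcToNchw is hypothetical, resizing and mean subtraction are left out, and this is only an illustration, not OpenCV's implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not an OpenCV function): converts an interleaved
// 8-bit image in HWC layout (B,G,R,B,G,R,...) into a planar float tensor in
// 1 x 3 x H x W layout, multiplying by `scale` and optionally swapping R and B.
// Assumption: the image already has the network input size (no resize here).
std::vector<float> hwcToNchw(const std::vector<std::uint8_t>& bgr,
                             int height, int width,
                             float scale, bool swap_rb) {
    const int channels = 3;
    std::vector<float> blob(static_cast<std::size_t>(channels) * height * width);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            for (int c = 0; c < channels; ++c) {
                // With swap_rb, output plane 0 reads source channel 2 (R), and so on.
                const int src_c = swap_rb ? (channels - 1 - c) : c;
                const std::size_t src =
                    (static_cast<std::size_t>(y) * width + x) * channels + src_c;
                const std::size_t dst =
                    (static_cast<std::size_t>(c) * height + y) * width + x;
                blob[dst] = bgr[src] * scale;
            }
        }
    }
    return blob;
}
```

Calling hwcToNchw(pixels, 640, 640, 1 / 255.0f, true) on a 640x640 BGR buffer would give the same kind of [1, 3, 640, 640] RGB tensor in [0, 1] that the article passes to setInput.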

4-3-2 Model input

net.setInput(blob);
cv::Mat preds =net.forward();

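setInput sets the network's input blob.
blob is the preprocessed image data produced by blobFromImage.
forward runs the forward pass and computes the output.
preds is the result of the forward pass and contains the model's predictions for the input image.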

4-4 Inference and parsing the model output

4-4-1 Model output

Here is the code that wraps the model output:

cv::Mat det_output(preds.size[1], preds.size[2], CV_32F, preds.ptr<float>());

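preds.size[1]: the number of rows of the output matrix. For a YOLO model this is the number of candidate bounding boxes.
preds.size[2]: the number of columns. Each row describes one bounding box, and the columns hold its fields: position, size, object confidence, and class scores.
CV_32F: the OpenCV type constant for 32-bit (single-precision) floating point.
preds.ptr<float>(): returns a pointer to the first element of the output data. The output is float data, hence <float>. The new Mat header reuses this memory without copying.
We need to iterate over every box, check its confidence, and compute the final score; wrapping the data in a 2D cv::Mat makes this easier.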
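
As a rough sanity check on the printed "rows" and "data" values, the sketch below estimates how many candidate boxes to expect and shows how a (row, column) pair maps into the flat output buffer. It assumes a standard YOLOv5 head with 3 anchors per grid cell and strides of 8, 16 and 32; a customized model may differ. Both function names are made up for this example.

```cpp
#include <cstddef>

// Assumption: standard YOLOv5 detection head with 3 anchors per cell on
// three feature maps with strides 8, 16 and 32.
int candidateBoxCount(int input_width, int input_height) {
    const int strides[] = {8, 16, 32};
    const int anchors_per_cell = 3;
    int total = 0;
    for (int s : strides) {
        total += anchors_per_cell * (input_width / s) * (input_height / s);
    }
    return total;
}

// Row-major offset of field `col` of box `row` in the flat output buffer,
// where `cols` is 5 + number of classes. This is the same element that
// det_output.at<float>(row, col) reads.
std::size_t fieldOffset(int row, int col, int cols) {
    return static_cast<std::size_t>(row) * cols + col;
}
```

Under these assumptions a 640x640 input gives 3 * (80*80 + 40*40 + 20*20) = 25200 rows, and with two classes each row has 7 columns.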

4-4-2 Parsing the output

Here is the parsing code:

   for (int i = 0; i < det_output.rows; i++) {
            float confidence = det_output.at<float>(i, 4);
            cv::Mat class_scores = det_output.row(i).colRange(5, 5 + classes.size());

         
            Point class_id_point;
            double max_class_score;
            minMaxLoc(class_scores, NULL, &max_class_score, NULL, &class_id_point);
            int class_id = class_id_point.x; 

           float final_confidence = confidence * max_class_score;

            std::cout << "Final confidence: " << final_confidence << std::endl;

            if (final_confidence < 0.45) {
                continue;
            }
         
            float cx = det_output.at<float>(i, 0);
            float cy = det_output.at<float>(i, 1);
            float ow = det_output.at<float>(i, 2);
            float oh = det_output.at<float>(i, 3);
            int x = static_cast<int>((cx - 0.5 * ow) * x_factor);
            int y = static_cast<int>((cy - 0.5 * oh) * y_factor);
            int width = static_cast<int>(ow * x_factor);
            int height = static_cast<int>(oh * y_factor);
            cv::rectangle(frame, cv::Point(x, y), cv::Point(x + width, y + height), cv::Scalar(0, 0, 255), 2, 8);
            putText(frame,classes.at(class_id) +std::to_string(final_confidence), Point(x,y - 10), FONT_HERSHEY_SIMPLEX, 0.9, Scalar(0, 255, 0), 2);
               
                }

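We iterate over the rows of det_output; each row is one predicted bounding box. A standard YOLOv5 output row has the layout below, so every value can be read by its column index: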

[x_center, y_center, width, height, object_confidence, class_score1, class_score2, ..., class_scoreN]

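float confidence = det_output.at<float>(i, 4); reads the object confidence of the box.
cv::Mat class_scores = det_output.row(i).colRange(5, 5 + classes.size()); class_scores is a new cv::Mat header covering all class scores of box i; classes.size() is the number of classes.
Next we find the highest class score: minMaxLoc returns the maximum of class_scores and its location. max_class_score is that score and class_id is the index of the class that has it, which is the predicted class for this box.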

Point class_id_point;
double max_class_score;
minMaxLoc(class_scores, NULL, &max_class_score, NULL, &class_id_point);
int class_id = class_id_point.x;

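Final confidence: the product of the object confidence and the highest class score. It expresses how confident the model is that the box contains that particular class.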

float final_confidence = confidence * max_class_score;
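
The same scoring step can be written without OpenCV, which may help when checking the logic by hand. This is a sketch under the row layout given above; RowScore and scoreRow are hypothetical names, not part of the article's program.

```cpp
#include <cstddef>
#include <vector>

// Result of scoring one output row.
struct RowScore {
    int class_id;      // index of the best class
    float confidence;  // object confidence * best class score
};

// Hypothetical helper: `row` holds
// [cx, cy, w, h, object_confidence, class_score1, ..., class_scoreN].
// Assumption: num_classes >= 1 and row.size() == 5 + num_classes.
RowScore scoreRow(const std::vector<float>& row, std::size_t num_classes) {
    const std::size_t first_class = 5;
    std::size_t best = 0;
    for (std::size_t c = 1; c < num_classes; ++c) {
        if (row[first_class + c] > row[first_class + best]) {
            best = c;
        }
    }
    return {static_cast<int>(best), row[4] * row[first_class + best]};
}
```

For a row with object confidence 0.5 and class scores 0.25 and 0.75, this picks class 1 with a final confidence of 0.375, which would then be dropped by the 0.45 threshold.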

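4-4-3 Drawing

We read the box geometry: cx and cy are the center coordinates, ow and oh the width and height. x_factor and y_factor then scale these values from network input coordinates to the original frame, and the center is converted to the top-left corner.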

float cx = det_output.at<float>(i, 0);
float cy = det_output.at<float>(i, 1);
float ow = det_output.at<float>(i, 2);
float oh = det_output.at<float>(i, 3);
int x = static_cast<int>((cx - 0.5 * ow) * x_factor);
int y = static_cast<int>((cy - 0.5 * oh) * y_factor);
int width = static_cast<int>(ow * x_factor);
int height = static_cast<int>(oh * y_factor);
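
The scale factors in the program are hard-coded for one video size. One way to avoid that, sketched below, is to derive them from the actual frame size (frame.cols and frame.rows in OpenCV). BoxRect and toFrameRect are hypothetical names, and the sketch assumes the frame was stretched directly to the network input size, as blobFromImage does here with crop set to false.

```cpp
// Plain stand-in for cv::Rect so the sketch compiles without OpenCV.
struct BoxRect {
    int x, y, width, height;
};

// Hypothetical helper: converts a center-format box in network input
// coordinates to a top-left-format box in frame coordinates.
// Assumption: the frame was resized to the input size without letterboxing.
BoxRect toFrameRect(float cx, float cy, float w, float h,
                    int frame_width, int frame_height,
                    int input_width, int input_height) {
    const float x_factor = static_cast<float>(frame_width) / input_width;
    const float y_factor = static_cast<float>(frame_height) / input_height;
    return {
        static_cast<int>((cx - 0.5f * w) * x_factor),
        static_cast<int>((cy - 0.5f * h) * y_factor),
        static_cast<int>(w * x_factor),
        static_cast<int>(h * y_factor)
    };
}
```

With a 640x480 frame and a 640x640 input this reproduces the factors used in the article (1.0 and 0.75).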

Then draw the label:

putText(frame,classes.at(class_id) +std::to_string(final_confidence), Point(x,y - 10), FONT_HERSHEY_SIMPLEX, 0.9, Scalar(0, 255, 0), 2);

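4-5 Results and analysis

The YOLO model loads correctly, and boxes, classes and confidences are displayed, but each target is surrounded by many overlapping boxes. Non-maximum suppression (NMS) is needed to clean this up.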

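5 NMS post-processing

5-1 Introduction

Non-maximum suppression (NMS) is an algorithm commonly used in object detection to filter a set of overlapping bounding boxes and keep only those with the highest confidence. It removes the redundant boxes in the model output so that each object ends up with a single box.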
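
To show the idea behind the algorithm, here is a minimal greedy NMS in plain C++. It is a sketch for illustration only, not the implementation inside OpenCV; Box, iou and greedyNms are names chosen for this example, and like the rest of this article it ignores class labels when comparing boxes.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Axis-aligned box given by its top-left corner and size.
struct Box {
    float x, y, width, height;
};

// Intersection over union of two boxes (0 when they do not overlap).
float iou(const Box& a, const Box& b) {
    const float x1 = std::max(a.x, b.x);
    const float y1 = std::max(a.y, b.y);
    const float x2 = std::min(a.x + a.width, b.x + b.width);
    const float y2 = std::min(a.y + a.height, b.y + b.height);
    const float inter = std::max(0.0f, x2 - x1) * std::max(0.0f, y2 - y1);
    const float uni = a.width * a.height + b.width * b.height - inter;
    return uni > 0.0f ? inter / uni : 0.0f;
}

// Greedy NMS: visit boxes from highest to lowest score and keep a box only
// if it does not overlap an already kept box by more than iou_threshold.
// Returns the indices of the kept boxes, best score first.
std::vector<int> greedyNms(const std::vector<Box>& boxes,
                           const std::vector<float>& scores,
                           float iou_threshold) {
    std::vector<int> order(boxes.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return scores[a] > scores[b]; });

    std::vector<int> keep;
    for (int idx : order) {
        bool suppressed = false;
        for (int kept : keep) {
            if (iou(boxes[idx], boxes[kept]) > iou_threshold) {
                suppressed = true;
                break;
            }
        }
        if (!suppressed) {
            keep.push_back(idx);
        }
    }
    return keep;
}
```

Two boxes of size 10x10 offset by one pixel have an IoU of 81/119, about 0.68, so with a threshold of 0.5 only the higher-scoring one survives.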

5-2 Implementation

5-2-1 Setting up NMS

std::vector<int> class_ids;
std::vector<float> confidences;
std::vector<Rect> boxes;
std::vector<int> indices;

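std::vector<int> class_ids; stores the class ID of each detected box.
std::vector<float> confidences; stores the confidence score of each detected box, i.e. how confident the model is that the object is really there.
std::vector<Rect> boxes; stores the position and size of each detected box. Each Rect holds the x and y coordinates of the top-left corner plus the width and height.
std::vector<int> indices; receives the indices of the boxes that survive NMS.
Then, inside the loop, we store each box instead of drawing it: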

for (int i = 0; i < det_output.rows; i++) {
    // ... parsing code from section 4-4-2 unchanged ...

    //cv::rectangle(frame, cv::Point(x, y), cv::Point(x + width, y + height), cv::Scalar(0, 0, 255), 2, 8);
    //putText(frame,classes.at(class_id) +std::to_string(final_confidence), Point(x,y - 10), FONT_HERSHEY_SIMPLEX, 0.9, Scalar(0, 255, 0), 2);
    boxes.push_back(Rect(x, y, width, height));
    confidences.push_back(final_confidence);
    class_ids.push_back(class_id);
}

Calling NMS


dnn::NMSBoxes(boxes, confidences, 0.4, 0.5, indices);

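boxes: positions and sizes of all detected boxes.
confidences: the confidence associated with each box.
0.4: the score threshold; boxes scoring below it are discarded before NMS. The loop above already drops everything below 0.45, so this value has no further effect here.
0.5: the NMS threshold, i.e. the IoU (intersection over union) above which a box overlapping a higher-scoring box is suppressed.
indices: after the call, contains the indices of the boxes that were kept.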

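5-2-2 Applying NMS

NMSBoxes processes the box list using the two thresholds and removes boxes that overlap heavily with a higher-confidence box. We then draw only the boxes whose indices were kept.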

// Draw the final bounding boxes
for (size_t i = 0; i < indices.size(); ++i) {
    int idx = indices[i];
    Rect box = boxes[idx];
    cv::rectangle(frame, cv::Point(box.x, box.y), cv::Point(box.x + box.width, box.y + box.height), cv::Scalar(0, 0, 255), 2, 8);

    std::string label = classes[class_ids[idx]] + ": " + std::to_string(confidences[idx]);
    putText(frame, label.c_str(), Point(box.x, box.y - 10), FONT_HERSHEY_SIMPLEX, 0.9, Scalar(0, 255, 0), 2);
}

The result: the redundant boxes are gone.

6 Complete code

The complete program:

#include <iostream>
#include <opencv2/opencv.hpp>
#include <opencv2/dnn.hpp>

using namespace cv;
using namespace cv::dnn;

std::vector<std::string> classes = {
    "right_sign", "left_sign"
 
};

int main() {
    Net net =  cv::dnn::readNetFromONNX("/home/zhlucifer/OpencvProject/src/yolo-cpp/models/simply_best.onnx");

    if (net.empty()) {
        std::cerr << "Error: Could not load the neural network." << std::endl;
        return -1;
    }

    VideoCapture cap("/home/zhlucifer/OpencvProject/src/yolo-cpp/videos/1.mp4");
    if (!cap.isOpened()) {
        std::cerr << "Error: Could not open the video file." << std::endl;
        return -1;
    }



    const int input_width = 640;
    const int input_height = 640;
    // Scale factors from the 640x640 network input back to the frame (hard-coded for a 640x480 video)
    float x_factor=640/640.0f;
    float y_factor=480/640.0f;

    namedWindow("Object Detection", WINDOW_NORMAL);

    Mat frame, blob;
    while (cap.read(frame)) {
        resize(frame, blob, Size(input_width, input_height));


        cv::Mat blob = cv::dnn::blobFromImage(frame, 1 / 255.0, cv::Size(640, 640), cv::Scalar(0, 0, 0), true, false);
        net.setInput(blob);
        cv::Mat preds =net.forward();
        std::vector<int> class_ids;
        std::vector<float> confidences;
        std::vector<Rect> boxes;
        std::cout << "rows: "<< preds.size[1]<< " data: " << preds.size[2] << std::endl;
        cv::Mat det_output(preds.size[1], preds.size[2], CV_32F, preds.ptr<float>());
        //In a typical YOLO output, the format is [x_center, y_center, width, height, object_confidence, class_score1, class_score2, ..., class_scoreN] for each bounding box. 
        for (int i = 0; i < det_output.rows; i++) {
            float confidence = det_output.at<float>(i, 4);
            cv::Mat class_scores = det_output.row(i).colRange(5, 5 + classes.size());

         
            Point class_id_point;
            double max_class_score;
            minMaxLoc(class_scores, NULL, &max_class_score, NULL, &class_id_point);
            int class_id = class_id_point.x; 

           float final_confidence = confidence * max_class_score;

            std::cout << "Final confidence: " << final_confidence << std::endl;

            if (final_confidence < 0.45) {
                continue;
            }
         
            float cx = det_output.at<float>(i, 0);
            float cy = det_output.at<float>(i, 1);
            float ow = det_output.at<float>(i, 2);
            float oh = det_output.at<float>(i, 3);
            int x = static_cast<int>((cx - 0.5 * ow) * x_factor);
            int y = static_cast<int>((cy - 0.5 * oh) * y_factor);
            int width = static_cast<int>(ow * x_factor);
            int height = static_cast<int>(oh * y_factor);
            //cv::rectangle(frame, cv::Point(x, y), cv::Point(x + width, y + height), cv::Scalar(0, 0, 255), 2, 8);
            //putText(frame,classes.at(class_id) +std::to_string(final_confidence), Point(x,y - 10), FONT_HERSHEY_SIMPLEX, 0.9, Scalar(0, 255, 0), 2);
                boxes.push_back(Rect(x, y, width, height));
                confidences.push_back(final_confidence);
                class_ids.push_back(class_id);
                }
    
            std::vector<int> indices;
            dnn::NMSBoxes(boxes, confidences, 0.4, 0.5, indices); 

            // Draw the final bounding boxes
            for (size_t i = 0; i < indices.size(); ++i) {
                int idx = indices[i];
                Rect box = boxes[idx];
                cv::rectangle(frame, cv::Point(box.x, box.y), cv::Point(box.x + box.width, box.y + box.height), cv::Scalar(0, 0, 255), 2, 8);
           
           
            std::string label = classes[class_ids[idx]] + ": " + std::to_string(confidences[idx]);
            putText(frame, label.c_str(), Point(box.x, box.y - 10), FONT_HERSHEY_SIMPLEX, 0.9, Scalar(0, 255, 0), 2);
                }
            



                imshow("Object Detection", frame);

                if (waitKey(1) == 'q') {
                    break;
                }
    }

    cap.release();
    destroyAllWindows();

    return 0;
}

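7 Summary

This article showed how to run a pt/onnx model with the basic torch.hub.load approach, how to convert pt to onnx with export.py, and how to load a YOLOv5 onnx model with cv::dnn, run inference, parse the output, and clean up the results with NMS (cv::dnn::NMSBoxes) as post-processing.
If you spot any mistakes, please point them out. Thank you for your support!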