This article is translated and compiled from Deep Learning with OpenCV DNN Module: A Definitive Guide, for future study and engineering work.

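§00 Introduction

The field of machine vision was established in the late 1960s. Image classification and object detection are among the oldest problems in computer vision, and researchers have worked on them for decades. With neural networks and deep learning, computers now recognize and understand images with very high accuracy in some domains, in certain cases even surpassing humans. The DNN module in OpenCV is an excellent starting point for learning neural networks and deep learning. Because OpenCV's algorithms are optimized for CPUs, it is easy to get started even without a powerful GPU.

We hope this post is a good starting point for your study of deep learning.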

▲ Figure 1 Example images of deep-learning-based image classification and object detection with the OpenCV DNN module

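Besides the theory, we also provide hands-on experience with OpenCV DNN. We will go through the details of classification and object detection on images and real-time video.

Let's now dive into solving computer vision problems with deep learning using the OpenCV DNN module.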

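§01 The OpenCV DNN module

OpenCV is one of the most widely used computer vision libraries, and it can also run deep learning inference. Best of all, it can load models from different frameworks, which lets us carry out a range of deep learning tasks. Support for loading models from different frameworks has been available since OpenCV 3.3. Even so, many new users are unaware of this feature and miss out on an interesting learning opportunity.

1.1 Why choose the OpenCV DNN module?

The OpenCV DNN module only supports deep learning inference on images and videos; it does not support fine-tuning or training. Nevertheless, it is a very quick starting point for beginners who want to get into deep-learning-based computer vision and experiment with the algorithms.

One of the DNN module's greatest strengths is that it is highly optimized for Intel processors. When running inference on real-time video for object detection and image segmentation, we can get good frame rates (FPS). Running a model that was pre-trained in a particular framework through the DNN module often gives a higher FPS than running it in that framework. As an example, we can compare image classification speed across frameworks.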

▲ Figure 1.1.1 Image classification speed (average over three images)

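The chart above compares image classification inference speed across three frameworks.

These results are the inference times for the DenseNet121 model. Surprisingly, OpenCV is much faster than the original TensorFlow implementation and only slightly slower than PyTorch. In practice, TensorFlow inference takes about 1 second, while OpenCV takes less than 200 ms.

The comparison was measured with the latest versions available when the original post was written:

PyTorch: 1.8.0
OpenCV: 4.5.1
TensorFlow: 2.4

All tests were run on Google Colab with an Intel Xeon 2.3 GHz processor.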
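
The averaging behind such a comparison can be sketched as follows. This is a minimal illustration rather than the benchmark script used for the chart: the helper name average_inference_ms, the warm-up run, and the dummy workload are all assumptions made for the example.

```python
import time

def average_inference_ms(run_once, n_runs=3, warmup=1):
    """Return the mean wall-clock time of run_once() in milliseconds."""
    for _ in range(warmup):
        run_once()  # warm-up call, not included in the timing
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        run_once()
        timings.append((time.perf_counter() - start) * 1000.0)
    return sum(timings) / len(timings)

# Stand-in workload (hypothetical). With a real network this would be a
# function that calls model.setInput(blob) followed by model.forward().
def dummy_forward():
    return sum(i * i for i in range(10_000))

mean_ms = average_inference_ms(dummy_forward, n_runs=3)
print(f"mean time over 3 runs: {mean_ms:.3f} ms")
```

Timing only the forward pass, and discarding a first warm-up run, keeps one-off costs such as model loading out of the average.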

The same holds for object detection.

▲ Figure 1.1.2 Frame rate comparison for object detection in video

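This chart compares the object detection speed of different frameworks on a CPU.

It shows object detection speed on video for Tiny YOLOv4, run in the original Darknet framework and in OpenCV. The benchmark was measured on a laptop CPU with a 2.6 GHz clock. On the same video, the OpenCV DNN module reaches 35 FPS, whereas Darknet compiled with OpenMP and AVX reaches only 15 FPS. Without OpenMP and AVX, Darknet runs Tiny YOLOv4 at only 3 FPS. Given that both cases use the same original Darknet Tiny YOLOv4 model, the difference is huge.

The chart shows how practical and capable the OpenCV DNN module is on a CPU. Its fast CPU inference makes it a strong tool for deploying models on edge devices with limited compute. ARM-based edge devices are a good example, as the following chart shows: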

▲ Figure 1.1.3 Speed comparison of different models under different frameworks on the Raspberry Pi 3B

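The chart above compares the speed of different frameworks and models on the Raspberry Pi 3B. The differences are striking. For SqueezeNet and MobileNet, OpenCV is faster than every other framework. For GoogLeNet, OpenCV comes second, with TensorFlow the fastest. For the model labelled Network in the chart, OpenCV has the lowest FPS on the Raspberry Pi.

These charts show how fast the optimized OpenCV is at neural network inference, and they are a good reason to learn the OpenCV DNN module.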

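1.2 Deep learning tasks supported by OpenCV DNN

We know that the OpenCV DNN module can run deep-learning-based computer vision inference on images and videos. Let's look at the tasks it supports. Interestingly, it covers most of the deep learning and computer vision tasks we can think of, as the following list shows:
1. Image classification;
2. Object detection;
3. Image segmentation;
4. Text detection and recognition;
5. Pose estimation;
6. Depth estimation;
7. Face verification and detection;
8. Person re-identification.

This list covers many practical deep learning use cases. More details are available on the OpenCV repository's Wiki page.

Importantly, the choice of model depends on the system hardware and its compute capability (discussed later). For each task, we can choose anything from a state-of-the-art but computationally heavy model down to a simple model that runs on edge (embedded) devices.

Clearly, a single post cannot discuss every model for the tasks listed above. So we will discuss object detection and human pose estimation in detail to get a first idea of how to select models in the OpenCV DNN module.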

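1.2.1 Models supported by OpenCV DNN

To support the tasks above, we need many pre-trained models, and there are indeed several state-of-the-art (SOTA) models to choose from. The table below lists deep learning models by task.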

Image Classification	Object Detection	Image Segmentation	Text detection and recognition	Human Pose estimation	Person and face detection
Alexnet	MobileNet SSD	DeepLab	Easy OCR	Open Pose	Open Face
GoogLeNet	VGG SSD	UNet	CRNN	Alpha Pose	Torchreid
VGG	Faster R-CNN	FCN			Mobile FaceNet
ResNet	EfficientDet				OpenCV FaceDetector
SqueezeNet
DenseNet
ShuffleNet
EfficientNet

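The table lists many models but does not cover everything the OpenCV DNN module supports; many more models are not shown. As noted above, a single post cannot list and discuss them all. The table is only meant to give a practical idea of how to choose among deep learning models for different computer vision tasks.

1.3 Deep learning frameworks supported by OpenCV DNN

Looking at all the models above, a question comes to mind: can a single framework supply all of them? In fact, it cannot.

OpenCV DNN supports many popular deep learning frameworks. They are listed below.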

1.3.1 Caffe

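To use a pre-trained Caffe model with OpenCV DNN, we need two things. One is the model.caffemodel file, which contains the pre-trained weights. The other is the model architecture file, which has a .prototxt extension. It is a plain-text file with a JSON-like structure that contains the definitions of all the network layers. To see what this file looks like, visit this link.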

1.3.2 TensorFlow

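To use a pre-trained TensorFlow model, we need two files: the model weights file and a protobuf text file with the model configuration. The weights file has a .pb extension; it is a protobuf file that stores all the pre-trained weights. If you have used TensorFlow before, you will know that the .pb file is the model checkpoint obtained after saving the model and freezing the weights. The model configuration is stored in a protobuf text file with a .pbtxt extension.

Note: In newer versions of TensorFlow, the weights file may no longer be in .pb format. If you use your own saved models, they may be in .ckpt or .h5 format, and some intermediate steps are needed before they can be used with OpenCV DNN. In that case, it is best to convert the model to ONNX format and then to .pb, which helps ensure that the results are as expected.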

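1.3.3 Torch and PyTorch

To load a Torch model, we need the file containing the pre-trained weights. This file usually has a .t7 or .net extension. Models from recent PyTorch versions have a .pth extension. For those, converting to ONNX first is the best approach. After conversion, the network can be loaded directly through OpenCV DNN's ONNX support.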

1.3.4 Darknet

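OpenCV DNN also supports the well-known Darknet framework. If you have used the official YOLO models, which are built on Darknet, you will be familiar with it.

Typically, a Darknet model is loaded from a file with a .weights extension, and its network configuration file has a .cfg extension.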

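1.3.5 Converting to ONNX format

Models from Keras or PyTorch can be converted to ONNX format with software tools. OpenCV DNN cannot directly use models from the Keras or PyTorch frameworks. Such models are usually converted to ONNX (Open Neural Network Exchange) format, so that they can be used as they are or converted further to models for other frameworks such as TensorFlow or PyTorch.

To load an ONNX model with OpenCV DNN, we need the weights file with a .onnx extension.

See the official OpenCV documentation for more on the different frameworks, their weights files, and their configuration files.

The list above most likely covers all the well-known deep learning frameworks. For the complete list of frameworks supported by OpenCV DNN, visit the official Wiki page.

All the models seen here have been tested with the OpenCV DNN module. In theory, any model from the frameworks above should work with the DNN module; we only need to find the right weights file and the corresponding network architecture file. The coding part of this tutorial will make this clear.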
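
The file conventions from sections 1.3.1 to 1.3.5 can be summarized in a small lookup. The sketch below is only an illustration: the LOADERS table and the describe_model_file helper are names made up for this example, and the "config" column lists the typical extension rather than a strict requirement. The loader names are the framework-specific functions of cv2.dnn.

```python
from pathlib import Path

# Weights-file extension -> (framework, cv2.dnn loader, typical config extension)
LOADERS = {
    ".caffemodel": ("Caffe", "readNetFromCaffe", ".prototxt"),
    ".pb": ("TensorFlow", "readNetFromTensorflow", ".pbtxt"),
    ".t7": ("Torch", "readNetFromTorch", None),
    ".net": ("Torch", "readNetFromTorch", None),
    ".weights": ("Darknet", "readNetFromDarknet", ".cfg"),
    ".onnx": ("ONNX", "readNetFromONNX", None),
}

def describe_model_file(weights_path):
    """Return the framework, loader name, and config extension for a weights file."""
    ext = Path(weights_path).suffix.lower()
    if ext not in LOADERS:
        raise ValueError(f"unrecognized weights extension: {ext!r}")
    framework, loader, config_ext = LOADERS[ext]
    return {"framework": framework, "loader": loader, "config_ext": config_ext}

for name in ("DenseNet_121.caffemodel", "frozen_inference_graph.pb", "model.onnx"):
    print(name, "->", describe_model_file(name))
```

A .pth checkpoint is deliberately absent from the table: as described in 1.3.3, it is converted to .onnx first.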

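That is enough theory. Let's move on to the coding part of the post. First we walk through a complete image classification example with OpenCV DNN, and then object detection with the DNN module.

§02 Image classification

This section is a complete guide to image classification with the OpenCV DNN module.

2.1 The network model

In this section we classify images with the OpenCV DNN module. Every step is explained, so by the end of the program you will understand the whole process.

We use a neural network model from the Caffe framework that has been pre-trained on the well-known ImageNet dataset. Specifically, we use the DenseNet121 network for classification. Its advantage is that it was trained on the 1000 classes of ImageNet, so whatever object we want to classify is very likely already covered by the model. This gives us a wide range of images to choose from.

We will classify the following image.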

▲ Figure 2.1 Sample image to be classified with the OpenCV DNN module

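2.2 Implementation steps

In short, we classify an image in the following steps.
1. Read the class names from a text file on disk and extract the labels we need;
2. Load the pre-trained neural network from disk;
3. Read the image and convert it to the input format the network expects;
4. Forward the input through the network to get the outputs.

Let's go through each step in code.

2.2.1 Import the modules and load the class names text file

For Python, we import the OpenCV and NumPy modules. For C++, we include the OpenCV and OpenCV DNN library headers.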

Python

import cv2
import numpy as np

C++

#include <iostream>
#include <fstream>
#include <opencv2/opencv.hpp>
#include <opencv2/dnn.hpp>
#include <opencv2/dnn/all_layers.hpp>
 
using namespace std;
using namespace cv;
using namespace dnn;

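Remember that the DenseNet121 model we use was pre-trained on the 1000 ImageNet classes. We need to load these 1000 class names into memory so they are easy to access. Such class names are usually stored in a text file. One such file is classification_classes_ILSVRC2012.txt, which contains all the class names in the following format: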

tench, Tinca tinca
goldfish, Carassius auratus
great white shark, white shark, man-eater, man-eating shark, Carcharodon carcharias
tiger shark, Galeocerdo cuvieri
hammerhead, hammerhead shark
...

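Each line contains all the labels, or class names, for one class of image. For example, the first line contains tench and Tinca tinca: two names for the same fish. Similarly, the second line has two names for the goldfish. Usually the first name is the most common one, recognized by most people.

Below we show how to read the text file into memory and extract the first name on each line to use as the classification label.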

Python

# read the ImageNet class names
with open('../../input/classification_classes_ILSVRC2012.txt', 'r') as f:
   image_net_names = f.read().split('\n')
# final class names (just the first of the many ImageNet names for one class)
class_names = [name.split(',')[0] for name in image_net_names]

C++

std::vector<std::string> class_names;
ifstream ifs(string("../../input/classification_classes_ILSVRC2012.txt").c_str());
string line;
while (getline(ifs, line))
{
    // keep only the first name on each line, matching the Python version
    class_names.push_back(line.substr(0, line.find(',')));
}

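First, we open the text file containing the class names in read mode and split its content on newlines. All the classes are now stored in the image_net_names list in the following format.

['tench, Tinca tinca', 'goldfish, Carassius auratus', 'great white shark, white shark, man-eater, man-eating shark, Carcharodon carcharias', ...]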

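However, we only need the first name on each line. That is what the second statement does. Each element of image_net_names is split using the comma (,) as the delimiter, and only the first element is kept. These names are stored in the class_names list, which now looks like this:

['tench', 'goldfish', 'great white shark', 'tiger shark', 'hammerhead', ...]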
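
The parsing step can be tried without the real file. The following sketch works on an in-memory sample of three lines; the first_names helper is a hypothetical name introduced here. Unlike f.read().split('\n'), it skips blank lines, so a trailing newline at the end of the file does not produce an empty class name.

```python
sample_text = (
    "tench, Tinca tinca\n"
    "goldfish, Carassius auratus\n"
    "great white shark, white shark, man-eater, man-eating shark, Carcharodon carcharias\n"
)

def first_names(text):
    """Return the first comma-separated name of every non-empty line."""
    names = []
    for line in text.splitlines():
        line = line.strip()
        if line:  # skip blank lines such as the one after a trailing newline
            names.append(line.split(",")[0].strip())
    return names

class_names_demo = first_names(sample_text)
print(class_names_demo)  # ['tench', 'goldfish', 'great white shark']
```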

 
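2.2.2 Load the pre-trained DenseNet121 model from disk

As mentioned above, we use a DenseNet121 network pre-trained in the Caffe framework. We need the model weights file (.caffemodel) and the model configuration file (.prototxt).

Let's look at the code and then explain the model loading part.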

Python

# load the neural network model
model = cv2.dnn.readNet(model='../../input/DenseNet_121.caffemodel',
    config='../../input/DenseNet_121.prototxt', framework='Caffe')

C++

// load the neural network model
   auto model = readNet("../../input/DenseNet_121.prototxt",
                       "../../input/DenseNet_121.caffemodel",
                       "Caffe");

As you can see, we use the readNet() function from OpenCV DNN, which takes the following arguments:

model: This is the path to the pre-trained weights file. In our case, it is the pre-trained Caffe model.
config: This is the path to the model configuration file and it is the Caffe model's .prototxt file in this case.
framework: Finally, we need to provide the framework name that we are loading the models from. For us, it is the Caffe framework.

Besides readNet(), the DNN module provides functions for loading models from specific frameworks. These do not need the framework argument:

readNetFromCaffe(): This is used to load pre-trained Caffe models and accepts two arguments. They are the path to the prototxt file and the path to the Caffe model file.
readNetFromTensorflow(): We can use this function to directly load the TensorFlow pre-trained models. This also accepts two arguments. One is the path to the frozen model graph and the other is the path to the model architecture protobuf text file.
readNetFromTorch(): We can use this to load Torch models, typically .t7 or .net files serialized with Torch's torch.save() function. We need to provide the model path as the argument. PyTorch .pth checkpoints are better converted to ONNX first, as described in section 1.3.3.
readNetFromDarknet(): This is used to load the models trained using the Darknet framework. We need to provide two arguments here as well. One is the path to the model weights and the other is the path to the model configuration file.
readNetFromONNX(): We can use this to load ONNX models and we only need to provide the path to the ONNX model file.

In this post we use readNet() to load the pre-trained models, including for object detection later on.

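2.2.3 Read the image and convert it to the network input format

We read the image with OpenCV's imread() function. Note that there are some details to take care of. A pre-trained model loaded with the DNN module cannot take the raw image data directly; the image has to be preprocessed first.

Let's write the code first; the technical details will then be easier to discuss.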

Python

# load the image from disk
image = cv2.imread('../../input/image_1.jpg')
# create blob from image
blob = cv2.dnn.blobFromImage(image=image, scalefactor=0.01, size=(224, 224), mean=(104, 117, 123))

C++

// load the image from disk
Mat image = imread("../../input/image_1.jpg");
// create blob from image
Mat blob = blobFromImage(image, 0.01, Size(224, 224), Scalar(104, 117, 123));

When reading the image, we assume it is in the input directory two levels above the current directory. The next step is very important: the blobFromImage() function converts the input image into the format the model expects. Its arguments are:

image: This is the input image that we just read above using the imread() function.
scalefactor: This value scales the image by the provided value. It has a default value of 1, which means that no scaling is performed.
size: This is the size that the image will be resized to. We have provided the size as 224×224, as most classification models trained on the ImageNet dataset expect this size.
mean: The mean argument is pretty important. These are the mean values that are subtracted from the image's color channels. This normalizes the input and makes it less sensitive to different illumination levels.

One more thing to mention: deep learning models expect their input in batches, but here we have a single image. The blob returned by blobFromImage() therefore has shape [1, 3, 224, 224]: the function adds a batch dimension on top of the three dimensions of the color image. This is the input format the neural network expects.
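
To make the blob layout concrete, here is a rough NumPy re-creation of the arithmetic. It is a sketch under stated assumptions, not OpenCV's implementation: manual_blob is a hypothetical helper, the image is assumed to be already resized to 224×224 (resizing is left out), and no channel swap is applied. It subtracts the mean, multiplies by the scale factor, and reorders height × width × channels into batch × channels × height × width.

```python
import numpy as np

def manual_blob(image_bgr, scalefactor=0.01, mean=(104, 117, 123)):
    """image_bgr: H x W x 3 uint8 array, already at the network input size."""
    img = image_bgr.astype(np.float32)
    img -= np.array(mean, dtype=np.float32)  # per-channel mean subtraction
    img *= scalefactor                       # scaling applied after the subtraction
    chw = img.transpose(2, 0, 1)             # HWC -> CHW
    return chw[np.newaxis, ...]              # add the batch dimension -> NCHW

fake_image = np.full((224, 224, 3), 128, dtype=np.uint8)  # synthetic grey image
blob_demo = manual_blob(fake_image)
print(blob_demo.shape)  # (1, 3, 224, 224)
```

The leading 1 is the batch size; stacking several preprocessed images along that axis would give a larger batch.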

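2.2.4 Forward the input through the model

Now that the input is ready, we can make predictions, which means propagating the input image forward through the layers of the network.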

Python

# set the input blob for the neural network
model.setInput(blob)
# forward pass image blob through the model
outputs = model.forward()

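Prediction takes two steps:

First, we set the blob as the input of the network;
Next, we call forward() to propagate the input through the model, which returns all the outputs.

The code above carries out both steps.

outputs holds all the outputs of the network. Some post-processing is still needed before we can get the correct class label.

At this point, outputs is an array of shape (1, 1000, 1, 1), from which it is awkward to extract the class label directly. The following code therefore reshapes outputs so that we can find the correct class label and map the label ID to a class name.

Python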

final_outputs = outputs[0]
# make all the outputs 1D
final_outputs = final_outputs.reshape(1000, 1)
# get the class label
label_id = np.argmax(final_outputs)
# convert the output scores to softmax probabilities
probs = np.exp(final_outputs) / np.sum(np.exp(final_outputs))
# get the final highest probability
final_prob = np.max(probs) * 100.
# map the max confidence to the class label names
out_name = class_names[label_id]
out_text = f"{out_name}, {final_prob:.3f}"
# put the class name text on top of the image
cv2.putText(image, out_text, (25, 50), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
cv2.imshow('Image', image)
cv2.waitKey(0)
cv2.imwrite('result_image.jpg', image)

C++

// set the input blob for the neural network
model.setInput(blob);
// forward pass the image blob through the model
Mat outputs = model.forward(); 
Point classIdPoint;
double final_prob;
minMaxLoc(outputs.reshape(1, 1), 0, &final_prob, 0, &classIdPoint);
int label_id = classIdPoint.x; 
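// Print predicted class. Note: final_prob is the raw maximum score here,
// not a softmax probability as in the Python version.
string out_text = format("%s, %.3f", (class_names[label_id].c_str()), final_prob);
// put the class name text on top of the image
putText(image, out_text, Point(25, 50), FONT_HERSHEY_SIMPLEX, 1, Scalar(0, 255, 0), 2);
imshow("Image", image);
waitKey(0); // keep the window open until a key is pressed
imwrite("result_image.jpg", image);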

After the reshape, outputs has shape (1000, 1): a matrix of 1000 rows and 1 column, where each row holds the score for one class. Example values:

[[-1.44623446e+00]
[-6.37421310e-01]
 [-1.04836571e+00]
 [-8.40160131e-01]
…
]

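We take the row with the highest score and store its index in label_id. Note that these scores are not actual probabilities. We need softmax to obtain the probability for the highest-scoring label.

In the Python code above, the scores are converted to softmax probabilities with: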

np.exp(final_outputs) / np.sum(np.exp(final_outputs))

 
Then we multiply the highest probability by 100 to get the prediction score as a percentage.
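
The softmax step can also be written so that it stays stable for large scores and reports several top classes. This is an optional variation, not the code used above: subtracting the maximum score before exponentiating leaves the result unchanged but avoids overflow. The scores and labels below are made up for the demonstration, and softmax and top_k are helper names introduced here.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a flattened score array."""
    scores = np.asarray(scores, dtype=np.float64).ravel()
    exp = np.exp(scores - scores.max())  # shift by the maximum to avoid overflow
    return exp / exp.sum()

def top_k(scores, labels, k=3):
    """Return the k most probable (label, percentage) pairs, best first."""
    probs = softmax(scores)
    order = np.argsort(probs)[::-1][:k]
    return [(labels[i], float(probs[i]) * 100.0) for i in order]

demo_scores = [-1.446, -0.637, -1.048, -0.840, 5.0]  # made-up scores
demo_labels = ["tench", "goldfish", "great white shark", "tiger shark", "hammerhead"]
for name, pct in top_k(demo_scores, demo_labels):
    print(f"{name}: {pct:.2f}%")
```

With the real network, final_outputs and class_names from the code above would take the place of the demo arrays.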

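The last step is to annotate the image with the predicted class name and its percentage. We then display the image and save it to disk.

After running the code, we get the following output.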

▲ Figure 2.2.1 Output of the DenseNet121 network: class name tiger, prediction score 91.030

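The result shows that DenseNet121 predicts the class tiger for the input image with a probability of about 91%, which is a very good classification result.

In this section we saw how to classify an image with the OpenCV DNN module using the DenseNet121 network. Going through each step gave us a good understanding of how the DNN module works.

Next, we use the OpenCV DNN module for object detection.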

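§03 Object detection

With the OpenCV DNN module, object detection with deep learning and computer vision is fairly easy. As with classification, we load the image and a suitable model, then forward the input through the model to get predictions. Only the preprocessing and the way the results are drawn differ slightly. The rest of this post shows all the implementation details.

Let's start with object detection in images.

3.1 Object detection with OpenCV DNN

As with image classification, we use a pre-trained model. It has been trained on the MS COCO dataset, the current benchmark dataset for deep-learning-based object detection.

MS COCO covers about 80 classes of everyday objects, from people to cars to toothbrushes. We load all the MS COCO class labels from a text file for use in object detection.

We use the following image for object detection.
▲ Figure 3.1 Image used for object detection. It contains many objects, including people, bicycles, and motorcycles, which makes it a good test of the model's performance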

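We use the MobileNet SSD (Single Shot Detector) model, trained on the MS COCO dataset with the TensorFlow deep learning framework. SSD models are generally faster than other object detection models, and the MobileNet backbone keeps the computational cost low. This makes it a good model to start with when learning object detection with OpenCV DNN.

Let's look at the code.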

Python

import cv2
import numpy as np

C++

#include <iostream>
#include <fstream>
#include <opencv2/opencv.hpp>
#include <opencv2/dnn.hpp>
#include <opencv2/dnn/all_layers.hpp>
 
using namespace std;
using namespace cv;
using namespace dnn;

In the Python code, we import the cv2 and numpy modules;
In C++, we include the OpenCV and OpenCV DNN library headers.

Python

# load the COCO class names
with open('object_detection_classes_coco.txt', 'r') as f:
   class_names = f.read().split('\n')
 
# get a different color array for each of the classes
COLORS = np.random.uniform(0, 255, size=(len(class_names), 3))

C++

std::vector<std::string> class_names;
   ifstream ifs(string("../../../input/object_detection_classes_coco.txt").c_str());
string line;
while (getline(ifs, line))
{
    class_names.push_back(line);
}

Next we read the object_detection_classes_coco.txt file, which contains all the class names, one per line, and store them in the class_names list.

The class_names list looks like this:

['person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck', 'boat', 'traffic light', … 'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush', '']

 
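We also create the COLORS array, which holds one tuple of three random values per class, each tuple representing a random color. They are used later to draw the bounding box for each class. Using a different color for each class makes it easier to tell the classes apart in the final result.

3.1.1 Load the MobileNet SSD model and prepare the input

We load the MobileNet SSD model with readNet(), the function we used before.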

Python

# load the DNN model
model = cv2.dnn.readNet(model='frozen_inference_graph.pb', 
               config='ssd_mobilenet_v2_coco_2018_03_29.pbtxt.txt',framework='TensorFlow')

C++

// load the neural network model
auto model = readNet("../../../input/frozen_inference_graph.pb",
"../../../input/ssd_mobilenet_v2_coco_2018_03_29.pbtxt.txt", "TensorFlow");

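In the code above:

The model argument takes the path to the frozen inference graph file, the pre-trained model containing the network weights;
The config argument takes the path to the model configuration file, a protobuf text file containing the network configuration;
Finally, framework='TensorFlow' states the framework in which the model was trained.

Next, we read the image from disk and convert it into the network's input format.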

Python

# read the image from disk
image = cv2.imread('../../input/image_2.jpg')
image_height, image_width, _ = image.shape
# create blob from image
blob = cv2.dnn.blobFromImage(image=image, size=(300, 300), mean=(104, 117, 123), swapRB=True)
# set the blob to the model
model.setInput(blob)
# forward pass through the model to carry out the detection
output = model.forward()

C++

// read the image from disk
Mat image = imread("../../../input/image_2.jpg");
int image_height = image.rows;
int image_width = image.cols;
//create blob from image
Mat blob = blobFromImage(image, 1.0, Size(300, 300), Scalar(127.5, 127.5, 127.5),true, false);
//set the blob as the network input
model.setInput(blob);
//forward pass through the model to carry out the detection
Mat output = model.forward();
Mat detectionMat(output.size[2], output.size[3], CV_32F, output.ptr<float>());

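For object detection, we pass slightly different arguments to blobFromImage():

The input size is 300×300, which is the input size SSD models generally expect across frameworks; the same holds for TensorFlow.
We use the swapRB argument to swap color channels. OpenCV reads color images in BGR format, but the object detection network expects RGB input, so swapRB swaps the R and B channels to produce RGB.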
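
What the channel swap means for the pixel data can be shown on a tiny synthetic image. This sketch uses plain NumPy slicing to reverse the channel axis; it illustrates the effect of the swap and is not how blobFromImage() performs it internally.

```python
import numpy as np

# A 2x2 image that is pure blue in OpenCV's BGR layout (channel 0 = blue)
bgr_image = np.zeros((2, 2, 3), dtype=np.uint8)
bgr_image[..., 0] = 255

# Reversing the last axis turns BGR into RGB
rgb_image = bgr_image[..., ::-1]

print(bgr_image[0, 0].tolist())  # [255, 0, 0]  (B, G, R)
print(rgb_image[0, 0].tolist())  # [0, 0, 255]  (R, G, B)
```

The pixel is still blue; only the position of the blue value within each pixel has changed.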

3.1.2 The network output

We set the blob as the input of the MobileNet SSD network and run prediction with the forward() function.

The output has the following structure:

[[[[0.00000000e+00 1.00000000e+00 9.72869813e-01 2.06566155e-02 1.11088693e-01 2.40461200e-01 7.53399074e-01]]]]

 
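Each detection is a row of seven values:

Index 1 holds the class label, from 1 to 80.
Index 2 holds the confidence score. It is not a probability, but it indicates how confident the model is about the object's class.
The last four values are box coordinates normalized to the range 0 to 1. The first two are the x, y position of the box's top-left corner, and the last two are the x, y position of its bottom-right corner. The Python code below stores the last two in variables named box_width and box_height but passes them to cv2.rectangle() as the opposite corner; the C++ code subtracts the top-left corner from them to get an actual width and height.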
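
The mapping from one detection row to a pixel box can be written out explicitly. The sketch below uses the sample row shown above and a hypothetical image size of 1000×500 pixels; detection_to_box is a helper name introduced for this example, and it assumes the corner-based layout of the row.

```python
def detection_to_box(row, image_width, image_height):
    """Convert one 7-value detection row to a pixel box with width and height."""
    _, class_id, confidence, x_min, y_min, x_max, y_max = row
    left = int(x_min * image_width)
    top = int(y_min * image_height)
    right = int(x_max * image_width)
    bottom = int(y_max * image_height)
    return {
        "class_id": int(class_id),
        "confidence": float(confidence),
        "left": left,
        "top": top,
        "width": right - left,    # actual box width in pixels
        "height": bottom - top,   # actual box height in pixels
    }

sample_row = [0.0, 1.0, 0.972869813, 0.0206566155, 0.111088693, 0.2404612, 0.753399074]
box = detection_to_box(sample_row, image_width=1000, image_height=500)  # assumed size
print(box)
```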

3.1.3 Loop over the detections and draw the boxes

Next we loop over all the detections in output and draw a bounding box for each. The code below shows the loop.

Python

# loop over each of the detection
for detection in output[0, 0, :, :]:
   # extract the confidence of the detection
   confidence = detection[2]
   # draw bounding boxes only if the detection confidence is above...
   # ... a certain threshold, else skip
   if confidence > .4:
       # get the class id
       class_id = detection[1]
       # map the class id to the class
       class_name = class_names[int(class_id)-1]
       color = COLORS[int(class_id)]
       # get the bounding box coordinates
       box_x = detection[3] * image_width
       box_y = detection[4] * image_height
       # get the bottom-right corner in pixels (named width/height, but used as corner coordinates)
       box_width = detection[5] * image_width
       box_height = detection[6] * image_height
       # draw a rectangle around each detected object
       cv2.rectangle(image, (int(box_x), int(box_y)), (int(box_width), int(box_height)), color, thickness=2)
       # put the class name text on the detected object
       cv2.putText(image, class_name, (int(box_x), int(box_y - 5)), cv2.FONT_HERSHEY_SIMPLEX, 1, color, 2)
 
cv2.imshow('image', image)
cv2.imwrite('image_result.jpg', image)
cv2.waitKey(0)
cv2.destroyAllWindows()

C++

for (int i = 0; i < detectionMat.rows; i++){
       int class_id = detectionMat.at<float>(i, 1);
       float confidence = detectionMat.at<float>(i, 2);
      
       // Check if the detection is of good quality
       if (confidence > 0.4){
           int box_x = static_cast<int>(detectionMat.at<float>(i, 3) * image.cols);
           int box_y = static_cast<int>(detectionMat.at<float>(i, 4) * image.rows);
           int box_width = static_cast<int>(detectionMat.at<float>(i, 5) * image.cols - box_x);
           int box_height = static_cast<int>(detectionMat.at<float>(i, 6) * image.rows - box_y);
           rectangle(image, Point(box_x, box_y), Point(box_x+box_width, box_y+box_height), Scalar(255,255,255), 2);
           putText(image, class_names[class_id-1].c_str(), Point(box_x, box_y-5), FONT_HERSHEY_SIMPLEX, 0.5, Scalar(0,255,255), 1);
       }
   }   
 
   imshow("image", image);
   imwrite("image_result.jpg", image);
   waitKey(0);
   destroyAllWindows();

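Inside the for loop, we first extract the confidence of the current detection from index 2;
An if statement then checks whether the confidence is above a threshold; only detections with a confidence above 0.4 are processed further;
We get the class ID and map it to the MS COCO class name, then pick the color for that class, which is used to draw the bounding box and the class name on top of it;
We extract the box coordinates and scale them to the image size to get the rectangle in pixels;
Finally, we draw the rectangle on the image and write the class name above it.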
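
The confidence check can also be done for all detections at once with a NumPy mask instead of an if statement inside the loop. This is an alternative sketch, not the code used in this post; keep_confident is a hypothetical helper and fake_output is a hand-made array with the same (1, 1, N, 7) layout as the network output.

```python
import numpy as np

def keep_confident(output, threshold=0.4):
    """Return only the detection rows whose confidence exceeds the threshold."""
    detections = output[0, 0]                # shape (N, 7)
    mask = detections[:, 2] > threshold      # column 2 holds the confidence
    return detections[mask]

fake_output = np.array([[[[0, 1, 0.97, 0.02, 0.11, 0.24, 0.75],
                          [0, 2, 0.30, 0.50, 0.50, 0.60, 0.70],
                          [0, 3, 0.55, 0.10, 0.20, 0.30, 0.40]]]], dtype=np.float32)

kept = keep_confident(fake_output)
print(kept[:, 1].astype(int).tolist())  # class ids of the surviving detections: [1, 3]
```

The drawing loop then only has to iterate over kept, which can be convenient when the threshold is tuned interactively.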

Those are all the steps for object detection with OpenCV DNN. Running the code gives the following result:

▲ Figure 3.1.2 Object detection with MobileNet SSD. The model detects almost all the objects in the image, although some of the detections are wrong

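The result above looks quite good: the model detects almost all the visible objects in the image. However, there are some wrong predictions. For example, MobileNet SSD labels the bicycle on the right as a motorcycle. Because MobileNet SSD trades accuracy for speed, it tends to make more mistakes of this kind in real-world use.

That covers object detection in images. To learn more about the OpenCV DNN module, we now move on to object detection in videos.

3.2 Object detection in videos

The code for detecting objects in videos is the same as for images, with only a few changes.

Part of the code is identical to the image detection code, so let's complete that part first.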

Python

import cv2
import time
import numpy as np
 
# load the COCO class names
with open('object_detection_classes_coco.txt', 'r') as f:
   class_names = f.read().split('\n')
 
# get a different color array for each of the classes
COLORS = np.random.uniform(0, 255, size=(len(class_names), 3))
 
# load the DNN model
model = cv2.dnn.readNet(model='frozen_inference_graph.pb',
                        config='ssd_mobilenet_v2_coco_2018_03_29.pbtxt.txt', framework='TensorFlow')
 
# capture the video
cap = cv2.VideoCapture('../../input/video_1.mp4')
# get the video frames' width and height for proper saving of videos
frame_width = int(cap.get(3))
frame_height = int(cap.get(4))
# create the `VideoWriter()` object
out = cv2.VideoWriter('video_result.mp4', cv2.VideoWriter_fourcc(*'mp4v'), 30, (frame_width, frame_height))

C++

#include <iostream>
#include <fstream>
#include <opencv2/opencv.hpp>
#include <opencv2/dnn.hpp>
#include <opencv2/dnn/all_layers.hpp>
 
using namespace std;
using namespace cv;
using namespace dnn;
 
int main(int, char**) {
   std::vector<std::string> class_names;
   ifstream ifs(string("../../../input/object_detection_classes_coco.txt").c_str());
   string line;
   while (getline(ifs, line))
   {
       class_names.push_back(line);
   } 
  
   // load the neural network model
   auto model = readNet("../../../input/frozen_inference_graph.pb",
"../../../input/ssd_mobilenet_v2_coco_2018_03_29.pbtxt.txt","TensorFlow");
 
   // capture the video
   VideoCapture cap("../../../input/video_1.mp4");
   // get the video frames' width and height for proper saving of videos
   int frame_width = static_cast<int>(cap.get(3));
   int frame_height = static_cast<int>(cap.get(4));
   // create the `VideoWriter()` object
   VideoWriter out("video_result.avi", VideoWriter::fourcc('M', 'J', 'P', 'G'), 30, Size(frame_width, frame_height));

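As you can see, most of the code is the same. We use the same MS COCO class file and the same MobileNet SSD model.

Instead of reading an image, we create a VideoCapture() object to capture the video frames. We also create a VideoWriter() object to save the output frames.

3.2.1 Detect objects in each video frame

So far we have prepared the video and the MobileNet SSD model. The next step is to run detection on every frame, treating each frame as an image.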

Python

# detect objects in each frame of the video
while cap.isOpened():
   ret, frame = cap.read()
   if ret:
       image = frame
       image_height, image_width, _ = image.shape
       # create blob from image
       blob = cv2.dnn.blobFromImage(image=image, size=(300, 300), mean=(104, 117, 123), swapRB=True)
       # start time to calculate FPS
       start = time.time()
       model.setInput(blob)
       output = model.forward()       
       # end time after detection
       end = time.time()
       # calculate the FPS for current frame detection
       fps = 1 / (end-start)
       # loop over each of the detections
       for detection in output[0, 0, :, :]:
           # extract the confidence of the detection
           confidence = detection[2]
           # draw bounding boxes only if the detection confidence is above...
           # ... a certain threshold, else skip
           if confidence > .4:
               # get the class id
               class_id = detection[1]
               # map the class id to the class
               class_name = class_names[int(class_id)-1]
               color = COLORS[int(class_id)]
               # get the bounding box coordinates
               box_x = detection[3] * image_width
               box_y = detection[4] * image_height
               # get the bottom-right corner in pixels (named width/height, but used as corner coordinates)
               box_width = detection[5] * image_width
               box_height = detection[6] * image_height
               # draw a rectangle around each detected object
               cv2.rectangle(image, (int(box_x), int(box_y)), (int(box_width), int(box_height)), color, thickness=2)
               # put the class name text on the detected object
               cv2.putText(image, class_name, (int(box_x), int(box_y - 5)), cv2.FONT_HERSHEY_SIMPLEX, 1, color, 2)
      
       # put the FPS text on top of the frame (once per frame, outside the detection loop)
       cv2.putText(image, f"{fps:.2f} FPS", (20, 30), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
       cv2.imshow('image', image)
       out.write(image)
       if cv2.waitKey(10) & 0xFF == ord('q'):
           break
   else:
       break
 
cap.release()
out.release()
cv2.destroyAllWindows()

C++

while (cap.isOpened()) {
       Mat image;
       bool isSuccess = cap.read(image);
       if (! isSuccess) break;
      
       int image_height = image.rows;
       int image_width = image.cols;
       //create blob from image
       Mat blob = blobFromImage(image, 1.0, Size(300, 300), Scalar(127.5, 127.5, 127.5),
                               true, false);
       //set the blob as the network input
       model.setInput(blob);
       //forward pass through the model to carry out the detection
       Mat output = model.forward();
      
       Mat detectionMat(output.size[2], output.size[3], CV_32F, output.ptr<float>());
      
       for (int i = 0; i < detectionMat.rows; i++){
           int class_id = detectionMat.at<float>(i, 1);
           float confidence = detectionMat.at<float>(i, 2);
 
           // Check if the detection is of good quality
           if (confidence > 0.4){
               int box_x = static_cast<int>(detectionMat.at<float>(i, 3) * image.cols);
               int box_y = static_cast<int>(detectionMat.at<float>(i, 4) * image.rows);
               int box_width = static_cast<int>(detectionMat.at<float>(i, 5) * image.cols - box_x);
               int box_height = static_cast<int>(detectionMat.at<float>(i, 6) * image.rows - box_y);
               rectangle(image, Point(box_x, box_y), Point(box_x+box_width, box_y+box_height), Scalar(255,255,255), 2);
               putText(image, class_names[class_id-1].c_str(), Point(box_x, box_y-5), FONT_HERSHEY_SIMPLEX, 0.5, Scalar(0,255,255), 1);
           }
       }
      
       imshow("image", image);
       out.write(image);
       int k = waitKey(10);
       if (k == 113){
           break;
       }
   }
 
cap.release();
destroyAllWindows();
}

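In the code above, the model detects objects in every frame until the last one. Some points to note:

We store the system time before running detection on a frame and again after detection finishes;
The two timestamps let us compute the FPS (frames per second), which is stored in fps;
In the last part of the code, we write the fps value at the top of the current frame, which shows what speed to expect from MobileNet SSD running on OpenCV DNN;
Finally, we show each processed frame on screen and save it to disk.
Running the code above gives the following output: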

▲ Figure 3.2.1 Object detection result on video

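On an 8th-generation i7 laptop CPU, we get about 33 FPS. Given the number of detections, that is a good result. The model detects all the pedestrians, the vehicles, and even the traffic lights in the scene. It does struggle with some small objects such as handbags and backpacks. The 33 FPS comes from a trade-off: speed is gained at the cost of some accuracy, especially on small objects.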
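
A per-frame value such as 1 / (end - start) jumps around from frame to frame, so a figure like "about 33 FPS" is easier to read off a running average. The sketch below is one possible way to smooth it; FpsMeter is a hypothetical helper, not part of OpenCV, and the durations fed to it here are made-up numbers.

```python
from collections import deque

class FpsMeter:
    """Running FPS estimate over the most recent frame durations."""

    def __init__(self, window=30):
        self.durations = deque(maxlen=window)  # old entries drop out automatically

    def add(self, seconds):
        self.durations.append(seconds)

    @property
    def fps(self):
        total = sum(self.durations)
        # guard against a zero total, which would otherwise divide by zero
        return len(self.durations) / total if total > 0 else 0.0

meter = FpsMeter(window=3)
for seconds in (0.030, 0.030, 0.040, 0.020):  # made-up inference times
    meter.add(seconds)
print(f"{meter.fps:.1f} FPS")
```

In the video loop, meter.add(end - start) would replace the direct division, and meter.fps would be drawn on the frame.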

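3.3 Inference on the GPU

All the classification and detection networks can also run on a GPU. For that, the OpenCV DNN module has to be built from source with GPU support.

On Ubuntu, see LearnOpenCV.com for how to build OpenCV with GPU support.
On Windows, see this link for how to build OpenCV with GPU support.

To run inference on the GPU, we only need small changes to the C++ and Python code.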

Python:
model.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
model.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

C++:
model.setPreferableBackend(DNN_BACKEND_CUDA);
model.setPreferableTarget(DNN_TARGET_CUDA);

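Add the two lines above right after loading the neural network model. The first line makes the network use the CUDA backend, provided OpenCV was built with CUDA GPU support.

The second line makes all the network computation run on the GPU instead of the CPU. With a CUDA-enabled GPU, we get a higher FPS for object detection than on the CPU, and inference on single images is also much faster.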
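
When the same script has to run on machines with and without a CUDA build, the choice can be made at run time. The sketch below is one possible approach under stated assumptions: configure_device is a hypothetical helper that receives the cv2 module as an argument, it relies on cv2.cuda.getCudaEnabledDeviceCount() reporting 0 when no CUDA device or CUDA build is present, and FakeNet and fake_cv2 are stand-ins so that the example runs without OpenCV installed.

```python
from types import SimpleNamespace

def configure_device(net, cv2_module, prefer_gpu=True):
    """Select CUDA if a CUDA device is reported, otherwise the default CPU path."""
    has_cuda = False
    if prefer_gpu:
        try:
            has_cuda = cv2_module.cuda.getCudaEnabledDeviceCount() > 0
        except Exception:
            has_cuda = False  # treat any failure as "no usable GPU"
    dnn = cv2_module.dnn
    if has_cuda:
        net.setPreferableBackend(dnn.DNN_BACKEND_CUDA)
        net.setPreferableTarget(dnn.DNN_TARGET_CUDA)
        return "cuda"
    net.setPreferableBackend(dnn.DNN_BACKEND_OPENCV)
    net.setPreferableTarget(dnn.DNN_TARGET_CPU)
    return "cpu"

class FakeNet:
    """Stand-in for a cv2.dnn network that records the requested settings."""
    def __init__(self):
        self.calls = []
    def setPreferableBackend(self, backend):
        self.calls.append(("backend", backend))
    def setPreferableTarget(self, target):
        self.calls.append(("target", target))

def fake_cv2(device_count):
    """Stand-in for the cv2 module, reporting the given number of CUDA devices."""
    return SimpleNamespace(
        cuda=SimpleNamespace(getCudaEnabledDeviceCount=lambda: device_count),
        dnn=SimpleNamespace(DNN_BACKEND_CUDA="BACKEND_CUDA", DNN_TARGET_CUDA="TARGET_CUDA",
                            DNN_BACKEND_OPENCV="BACKEND_OPENCV", DNN_TARGET_CPU="TARGET_CPU"),
    )

print(configure_device(FakeNet(), fake_cv2(0)))  # cpu
print(configure_device(FakeNet(), fake_cv2(1)))  # cuda
```

With a real installation, the call would be configure_device(model, cv2) right after readNet().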

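※ Summary ※

We introduced the OpenCV DNN module and discussed why to choose it. We compared performance through charts, and looked at the deep learning tasks, models, and frameworks that OpenCV DNN supports.

With hands-on code, we used the OpenCV DNN module for image classification and object detection, including object detection in videos.

4.1 Key takeaways

1. Artificial neural networks and deep learning have greatly improved how accurately computers understand and recognize images, in some cases even beyond human level.
2. The OpenCV DNN module:
* is a good choice for model inference, especially on Intel processors;
* is easy to install;
* offers many ready-to-use models that handle most common problems;
* cannot train networks, but provides strong support for deployment on edge devices.

We hope you enjoyed this post and now have a basic understanding of the OpenCV DNN module. Share your experience in the comments below.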
