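1. Recap and overview of this post

2. Code walkthrough

2.1 The FasterRCNNBase class

2.1.1 forward (forward pass)

2.2 The FasterRCNN class

2.2.1 Definition of roi_heads

2.3 The TwoMLPHead class (faster_rcnn_framework.py)

2.4 The FastRCNNPredictor class

2.5 The RoIHeads class (roi_head.py)

2.5.1 The initializer __init__

2.5.2 forward (forward pass)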
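1. Recap and overview of this post

In the previous posts we built our own dataset (Dataset), wrote the preprocessing module (which packs the images into a batch of identically sized tensors), turned the images into feature maps with the backbone network, and computed the RPN loss with the RPN. This post covers the RoIAlign, TwoMLPHead, and FastRCNNPredictor parts.

2. Code walkthrough

2.1 The FasterRCNNBase class

2.1.1 forward (forward pass)

    # Note: the input images all have different sizes. Preprocessing later places them into tensors of one common size to form a batch.
    # Forward pass. images: the input images, of type List[Tensor]
    # (images and targets are annotated in our Word notes)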
    def forward(self, images, targets=None):
        # type: (List[Tensor], Optional[List[Dict[str, Tensor]]]) -> Tuple[Dict[str, Tensor], List[Dict[str, Tensor]]]
        """
        Arguments:
            images (list[Tensor]): images to be processed
            targets (list[Dict[Tensor]]): ground-truth boxes present in the image (optional)

        Returns:
            result (list[BoxList] or dict[Tensor]): the output from the model.
                During training, it returns a dict[Tensor] which contains the losses.
                During testing, it returns list[BoxList] contains additional fields
                like `scores`, `labels` and `mask` (for Mask R-CNN models).

        """
        # In training mode targets are mandatory; raise if they are missing
        if self.training and targets is None:
            raise ValueError("In training mode, targets should be passed")

        # Check the annotated boxes for errors
        if self.training:
            assert targets is not None
            for target in targets:         # validate the "boxes" entry of each target
                boxes = target["boxes"]
                # boxes must be a torch.Tensor
                if isinstance(boxes, torch.Tensor):
                    # boxes has shape [N, 4]: N is the number of objects in the image
                    # (one box per object) and 4 is (xmin, ymin, xmax, ymax).
                    # Anything that is not 2-D with a last dimension of 4 is rejected.
                    if len(boxes.shape) != 2 or boxes.shape[-1] != 4:
                        raise ValueError("Expected target boxes to be a tensor "
                                         "of shape [N, 4], got {:}.".format(
                                          boxes.shape))
                else:
                    raise ValueError("Expected target boxes to be of type "
                                     "Tensor, got {:}.".format(type(boxes)))

 
        # Record each image's original size: a list of (height, width) tuples
        original_image_sizes = torch.jit.annotate(List[Tuple[int, int]], [])

        for img in images:
            # Take the last two dimensions; PyTorch images are laid out as [channel, height, width]
            val = img.shape[-2:]
            assert len(val) == 2  # guard against a 1-D input
            original_image_sizes.append((val[0], val[1]))
        # original_image_sizes = [img.shape[-2:] for img in images]

        # GeneralizedRCNNTransform: step 2 of the flow diagram (normalization and resizing).
        # Only after this call are images and targets a real batch. Before it, the images
        # all have different sizes and cannot be packed into one batch for the GPU.
        images, targets = self.transform(images, targets)  # preprocess the images

        # print(images.tensors.shape)
        features = self.backbone(images.tensors)  # run the batch through the backbone to get feature maps
        # The inputs above carry both images and targets; the backbone output is feature maps only.
        if isinstance(features, torch.Tensor):  # single feature level: wrap it in an ordered dict under key '0'
            features = OrderedDict([('0', features)])  # with multiple levels, features is already an ordered dict

        # Pass the feature maps and the targets to the RPN.
        # proposals: List[Tensor], one tensor of shape [num_proposals, 4] per image,
        # in absolute coordinates with (x1, y1, x2, y2) format.
        # With batch_size = 2 the list has 2 elements, each a 2000 x 4 tensor
        # (the RPN produces 2000 proposals per image).
        proposals, proposal_losses = self.rpn(images, features, targets)

        # Pass the RPN proposals and the targets to the second half of Fast R-CNN
        detections, detector_losses = self.roi_heads(features, proposals, images.image_sizes, targets)

        # Post-process the predictions (mainly mapping the boxes back to the original image scale)
        detections = self.transform.postprocess(detections, images.image_sizes, original_image_sizes)

        losses = {}
        losses.update(detector_losses)
        losses.update(proposal_losses)

        if torch.jit.is_scripting():
            if not self._has_warned:
                warnings.warn("RCNN always returns a (Losses, Detections) tuple in scripting")
                self._has_warned = True
            return losses, detections
        else:
            return self.eager_outputs(losses, detections)
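
The last step before the return, transform.postprocess, maps the predicted boxes from the resized image back to the original image. Here is a minimal sketch of that rescaling in plain Python rather than tensors. The helper name rescale_box is made up for illustration and is not part of the project code; we assume the boxes are simply scaled by the height and width ratios.

```python
def rescale_box(box, resized_size, original_size):
    """Map (x1, y1, x2, y2) from the resized image back to the original image.

    Sizes are (height, width) tuples, the same order stored in original_image_sizes.
    """
    ratio_h = original_size[0] / resized_size[0]
    ratio_w = original_size[1] / resized_size[1]
    x1, y1, x2, y2 = box
    # x coordinates scale with the width ratio, y coordinates with the height ratio
    return (x1 * ratio_w, y1 * ratio_h, x2 * ratio_w, y2 * ratio_h)


# An image of 400 x 600 (h x w) that preprocessing resized to 800 x 1200.
print(rescale_box((100.0, 200.0, 300.0, 400.0), (800, 1200), (400, 600)))
```

This is why forward records original_image_sizes before calling self.transform: once the images are resized, the original sizes can no longer be recovered from the batch.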

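We have already covered the data preprocessing:

images, targets = self.transform(images, targets)  # preprocess the images

The packed batch is fed to the backbone to produce the feature maps:

features = self.backbone(images.tensors)  # run the batch through the backbone to get feature maps

The images, features, and annotations are then sent to the RPN, which returns the proposals and the RPN loss.

proposals, proposal_losses = self.rpn(images, features, targets)

Note that proposals is a list with one element per image, and each element is a tensor of shape $[num\_proposals, 4]$.

Now for the topic of this post, starting with roi_heads:

detections, detector_losses = self.roi_heads(features, proposals, images.image_sizes, targets)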

2.2 The FasterRCNN class

2.2.1 Definition of roi_heads

        # Combine roi pooling, box_head, and box_predictor
        roi_heads = RoIHeads(
            # box
            box_roi_pool, box_head, box_predictor,
            box_fg_iou_thresh, box_bg_iou_thresh,  # 0.5  0.5
            box_batch_size_per_image, box_positive_fraction,  # 512  0.25  (how many proposals per image are sampled for the Fast R-CNN loss)
            bbox_reg_weights,
            box_score_thresh, box_nms_thresh, box_detections_per_img)  # 0.05  0.5  100
Many arguments are passed in. The first is box_roi_pool, whose definition appears earlier in the same code:
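        if box_roi_pool is None:
            box_roi_pool = MultiScaleRoIAlign(
                featmap_names=['0', '1', '2', '3'],  # feature levels on which roi pooling is performed
                output_size=[7, 7],
                sampling_ratio=2)
It is simply a MultiScaleRoIAlign, which corresponds to the RoIPooling block in the figure above. RoIAlign is used here because it is more accurate than RoIPooling (better localization).

We will not go into the internals, since that would take us into the torchvision source. The purpose of the RoIAlign layer is to take each proposal's region on the feature levels listed in featmap_names and resample it with bilinear interpolation into a feature matrix of size output_size. The number of channels is unchanged.

That is, $proposal\,(channel \times h \times w) \rightarrow channel \times 7 \times 7$ feature matrix.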
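
To make the fixed-size output concrete, here is a heavily simplified sketch of RoIAlign for a single channel, written in plain Python. It assumes the box is already given in feature-map coordinates and ignores the level assignment and spatial scaling that MultiScaleRoIAlign performs; the function names are our own and this is not the torchvision implementation.

```python
import math


def bilinear(feat, y, x):
    """Bilinearly interpolate the 2-D list `feat` at the continuous point (y, x)."""
    h, w = len(feat), len(feat[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    ly, lx = y - y0, x - x0
    top = feat[y0][x0] * (1 - lx) + feat[y0][x1] * lx
    bottom = feat[y1][x0] * (1 - lx) + feat[y1][x1] * lx
    return top * (1 - ly) + bottom * ly


def roi_align_single(feat, box, output_size=7, sampling_ratio=2):
    """Resample the region box = (x1, y1, x2, y2) of `feat` to output_size x output_size."""
    x1, y1, x2, y2 = box
    bin_w = (x2 - x1) / output_size
    bin_h = (y2 - y1) / output_size
    out = []
    for i in range(output_size):
        row = []
        for j in range(output_size):
            total = 0.0
            # sampling_ratio x sampling_ratio evenly spaced sample points per bin
            for sy in range(sampling_ratio):
                for sx in range(sampling_ratio):
                    y = y1 + (i + (sy + 0.5) / sampling_ratio) * bin_h
                    x = x1 + (j + (sx + 0.5) / sampling_ratio) * bin_w
                    total += bilinear(feat, y, x)
            row.append(total / sampling_ratio ** 2)  # average the samples of the bin
        out.append(row)
    return out


feature_map = [[float(x) for x in range(30)] for _ in range(20)]  # 20 x 30, value = column index
pooled = roi_align_single(feature_map, (0.0, 0.0, 14.0, 14.0))
print(len(pooled), len(pooled[0]))
```

Whatever the size of the box, the result is always 7 x 7, which is what lets the following fully connected layers work with a fixed input size.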

The second argument is box_head, whose definition also appears earlier in the same code.
        # The flatten + two fully connected layers that follow roi pooling in Fast R-CNN.
        # First argument: the number of nodes after flattening. RoIAlign turns every proposal
        # into a fixed-shape feature matrix with out_channels channels and output_size = [7, 7],
        # so flattening gives out_channels * 7 * 7 nodes. Second argument: the number of nodes of FC1.
        if box_head is None:
            resolution = box_roi_pool.output_size[0]  # 7 by default
            representation_size = 1024
            box_head = TwoMLPHead(
                out_channels * resolution ** 2,
                representation_size
            )
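This corresponds to the Two MLPHead block in the flow diagram: flatten, then pass through two fully connected layers in sequence.

Two arguments are passed in.

Argument 1: the number of nodes after flattening. After RoIAlign every proposal becomes a feature matrix of fixed shape whose channel count is out_channels.

This is the channel count of the backbone output; the backbone output and the RoIAlign output have the same number of channels.
# channels of the prediction feature levels
out_channels = backbone.out_channels
The feature matrix has a height and a width of 7 (output_size=[7, 7]), so the number of nodes is out_channels * resolution ** 2.

Argument 2: the number of nodes in FC1 and FC2, which is 1024.

The TwoMLPHead class is covered in section 2.3.

Its output $x$ is a tensor of shape $1024\times1024$.

The first 1024 is the number of proposals in the batch (two images with 512 proposals each); the second 1024 is the number of nodes in the second fully connected layer, FC2.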
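
The shape bookkeeping above can be checked with a few lines of arithmetic. This is only a sketch: the variable names are ours, and the numbers (batch size 2, 512 proposals per image, 256 channels) are the values used in this post.

```python
batch_size = 2             # images per batch in the training script
proposals_per_image = 512  # proposals kept per image during training
out_channels = 256         # backbone.out_channels
resolution = 7             # box_roi_pool.output_size[0]
representation_size = 1024

num_rois = batch_size * proposals_per_image
flattened = out_channels * resolution ** 2  # first argument of TwoMLPHead

roi_shape = (num_rois, out_channels, resolution, resolution)  # output of RoIAlign
fc6_input_shape = (num_rois, flattened)                       # after flatten(start_dim=1)
head_output_shape = (num_rois, representation_size)           # after fc6 and fc7

print(roi_shape, fc6_input_shape, head_output_shape)
```

Note that the first dimension depends on the batch size and the sampling settings, while the second is fixed by the network definition.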

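The third argument is box_predictor, which is simply a FastRCNNPredictor.
        # The prediction layers on top of the box_head output:
        # two fully connected layers in parallel, one predicting each proposal's class scores
        # and the other predicting each proposal's box regression parameters
        if box_predictor is None:
            representation_size = 1024
            box_predictor = FastRCNNPredictor(
                representation_size,
                num_classes)  # 21 including background
The FastRCNNPredictor class is covered in section 2.4. It attaches two parallel fully connected layers to the Two MLPHead output: one predicts the class scores of each proposal (num_classes = 21), the other predicts the box regression parameters of each proposal.

The fourth group is box_fg_iou_thresh, box_bg_iou_thresh (0.5, 0.5). When matching positive and negative samples, a proposal whose IoU with a ground-truth box is above box_fg_iou_thresh counts as a positive sample, and one whose IoU is below box_bg_iou_thresh counts as a negative sample.

The fifth group is box_batch_size_per_image, box_positive_fraction (512, 0.25): how many proposals are sampled from each image to compute the Fast R-CNN loss, and what fraction of them should be positive.

The sixth group is box_score_thresh, box_nms_thresh, box_detections_per_img (0.05, 0.5, 100): the thresholds applied during post-processing.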

2.3 The TwoMLPHead class (faster_rcnn_framework.py)

class TwoMLPHead(nn.Module):
    """
    Standard heads for FPN-based models

    Arguments:
        in_channels (int): number of input channels
        representation_size (int): size of the intermediate representation
    """

    # Define the two fully connected layers
    def __init__(self, in_channels, representation_size):
        super(TwoMLPHead, self).__init__()
        # in_channels is the flattened size of one RoI feature: out_channels * 7 * 7
        self.fc6 = nn.Linear(in_channels, representation_size)
        self.fc7 = nn.Linear(representation_size, representation_size)

    def forward(self, x):
        # Input: [1024, 256, 7, 7]. 1024 is the number of proposals in the batch
        # (batch size 2, 512 sampled per image from the RPN's 2000); 256 is the channel count.
        x = x.flatten(start_dim=1)
        # After flattening: [1024, 12544]
        x = F.relu(self.fc6(x))
        x = F.relu(self.fc7(x))

        return x
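The initializer defines two fully connected layers.

The first takes the flattened feature matrix as input (256 * 7 * 7 = 12544 values) and outputs 1024 values.

The second takes 1024 values as input and outputs 1024 values.

In the forward pass:

$x$ is the feature matrix that RoIAlign produced for each proposal, with shape [1024, 256, 7, 7].

1024 is the total number of proposals in the batch. The training script uses a batch size of 2, and during training only 512 of the 2000 proposals from the RPN are kept per image, so two images give 1024 proposals. Each proposal's feature has 256 channels (the output channels of the backbone), a height of 7, and a width of 7.

Passing through the two fully connected layers in turn yields the output $x$.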

2.4 The FastRCNNPredictor class

# Fully connected prediction layers; the input is the 1024-d output of TwoMLPHead
class FastRCNNPredictor(nn.Module):
    """
    Standard classification + bounding box regression layers
    for Fast R-CNN.

    Arguments:
        in_channels (int): number of input channels
        num_classes (int): number of output classes (including background)
    """

    def __init__(self, in_channels, num_classes):
        super(FastRCNNPredictor, self).__init__()
        self.cls_score = nn.Linear(in_channels, num_classes)
        self.bbox_pred = nn.Linear(in_channels, num_classes * 4)

    def forward(self, x):
        # Not triggered by default (the input is already 2-D)
        if x.dim() == 4:
            assert list(x.shape[2:]) == [1, 1]
        x = x.flatten(start_dim=1)
        # scores: [1024, 21]
        scores = self.cls_score(x)
        # bbox_deltas: [1024, 84]
        bbox_deltas = self.bbox_pred(x)

        return scores, bbox_deltas
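The initializer defines two fully connected layers, which predict the class scores and the box regression parameters.

The first has 1024 input channels and 21 output channels.

The second has 1024 input channels and 84 output channels (21 classes * 4 parameters).

The input $x$ has shape [1024, 1024]: the first 1024 is the number of proposals in the batch (two images with 512 proposals each), and the second 1024 is the number of nodes in the second fully connected layer, FC2.

$x$ is passed to each of the two predictors to obtain the predictions.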

2.5 The RoIHeads class (roi_head.py)

2.5.1 The initializer __init__

    def __init__(self,
                 box_roi_pool,   # Multi-scale RoIAlign pooling
                 box_head,       # TwoMLPHead
                 box_predictor,  # FastRCNNPredictor
                 # Faster R-CNN training
                 fg_iou_thresh, bg_iou_thresh,  # default: 0.5, 0.5
                 batch_size_per_image, positive_fraction,  # default: 512, 0.25
                 bbox_reg_weights,  # None
                 # Faster R-CNN inference
                 score_thresh,        # default: 0.05
                 nms_thresh,          # default: 0.5
                 detection_per_img):  # default: 100
        super(RoIHeads, self).__init__()

        # Function used to compute IoU
        self.box_similarity = box_ops.box_iou
        # assign ground-truth boxes for each proposal
        # (splits the proposals into positive and negative samples)
        self.proposal_matcher = det_utils.Matcher(
            fg_iou_thresh,  # default: 0.5
            bg_iou_thresh,  # default: 0.5
            allow_low_quality_matches=False)

        # Sample from the positive and negative samples
        self.fg_bg_sampler = det_utils.BalancedPositiveNegativeSampler(
            batch_size_per_image,  # default: 512
            positive_fraction)     # default: 0.25

        if bbox_reg_weights is None:
            bbox_reg_weights = (10., 10., 5., 5.)
        self.box_coder = det_utils.BoxCoder(bbox_reg_weights)

        self.box_roi_pool = box_roi_pool    # Multi-scale RoIAlign pooling
        self.box_head = box_head            # TwoMLPHead
        self.box_predictor = box_predictor  # FastRCNNPredictor

        self.score_thresh = score_thresh  # default: 0.05
        self.nms_thresh = nms_thresh      # default: 0.5
        self.detection_per_img = detection_per_img  # default: 100

This simply stores the arguments from section 2.2.1 as members of the class.
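
One of those members, box_coder, is built from bbox_reg_weights, which defaults to (10., 10., 5., 5.). As a rough illustration of what such weights do, here is the standard Faster R-CNN box parameterization in plain Python. We are assuming that the weights act as multipliers on the four regression targets; encode_box is our own name, not a function from det_utils.

```python
import math


def encode_box(proposal, gt, weights=(10.0, 10.0, 5.0, 5.0)):
    """Regression targets (dx, dy, dw, dh) that move `proposal` onto `gt`.

    Both boxes are (x1, y1, x2, y2) with positive width and height.
    """
    wx, wy, ww, wh = weights
    pw, ph = proposal[2] - proposal[0], proposal[3] - proposal[1]
    pcx, pcy = proposal[0] + 0.5 * pw, proposal[1] + 0.5 * ph
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]
    gcx, gcy = gt[0] + 0.5 * gw, gt[1] + 0.5 * gh
    dx = wx * (gcx - pcx) / pw    # center shift, relative to the proposal width
    dy = wy * (gcy - pcy) / ph    # center shift, relative to the proposal height
    dw = ww * math.log(gw / pw)   # log of the width ratio
    dh = wh * math.log(gh / ph)   # log of the height ratio
    return dx, dy, dw, dh


# The ground truth is the proposal shifted right by one pixel.
print(encode_box((0.0, 0.0, 10.0, 10.0), (1.0, 0.0, 11.0, 10.0)))
```

Under this assumption, a shift of one tenth of the box width becomes a target of 1.0, so the weights mainly rescale the targets to a range that is convenient for the regression loss.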

        self.proposal_matcher = det_utils.Matcher(
            fg_iou_thresh,  # default: 0.5
            bg_iou_thresh,  # default: 0.5
            allow_low_quality_matches=False)

This one splits the proposals into positive and negative samples.
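
The idea behind this matching can be sketched in plain Python as follows. The function names, the -1 and -2 markers, and the exact handling of the boundary (>= for the foreground threshold) are our own assumptions for illustration; the real Matcher works on an IoU tensor.

```python
def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes with positive area."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


BACKGROUND, IGNORED = -1, -2  # markers chosen for this sketch


def match_proposals(proposals, gt_boxes, fg_thresh=0.5, bg_thresh=0.5):
    """For each proposal return the index of its best gt box, BACKGROUND, or IGNORED.

    gt_boxes must not be empty.
    """
    matches = []
    for p in proposals:
        ious = [box_iou(p, g) for g in gt_boxes]
        best = max(range(len(ious)), key=ious.__getitem__)
        if ious[best] >= fg_thresh:
            matches.append(best)        # positive sample, matched to gt `best`
        elif ious[best] < bg_thresh:
            matches.append(BACKGROUND)  # negative sample
        else:
            matches.append(IGNORED)     # only possible when bg_thresh < fg_thresh
    return matches


gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
props = [(0, 0, 10, 9), (21, 21, 30, 30), (50, 50, 60, 60), (0, 0, 10, 4)]
print(match_proposals(props, gt))
```

With both thresholds at 0.5, as in the defaults here, every proposal ends up either positive or negative and nothing is ignored.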

        self.fg_bg_sampler = det_utils.BalancedPositiveNegativeSampler(
            batch_size_per_image,  # default: 512
            positive_fraction)     # default: 0.25

This one samples from the positive and negative samples.
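
A small sketch of what such balanced sampling looks like, in plain Python. The function name and the input convention (a value >= 0 for a positive, -1 for a negative) are assumptions made for this example; we also assume that negatives fill whatever part of the 512 the positives do not use.

```python
import random


def sample_fg_bg(matches, batch_size_per_image=512, positive_fraction=0.25, rng=None):
    """Return (positive_indices, negative_indices) sampled from one image.

    matches[i] >= 0 marks proposal i as positive, -1 marks it as negative.
    """
    rng = rng or random.Random()
    positives = [i for i, m in enumerate(matches) if m >= 0]
    negatives = [i for i, m in enumerate(matches) if m == -1]
    # at most 512 * 0.25 = 128 positives, fewer if the image does not have that many
    num_pos = min(int(batch_size_per_image * positive_fraction), len(positives))
    # negatives fill the rest of the 512 slots
    num_neg = min(batch_size_per_image - num_pos, len(negatives))
    return rng.sample(positives, num_pos), rng.sample(negatives, num_neg)


matches = [0] * 50 + [-1] * 1950  # 2000 proposals, only 50 of them positive
pos, neg = sample_fg_bg(matches, rng=random.Random(0))
print(len(pos), len(neg))
```

When an image has few positives, the batch is therefore dominated by negatives; the 0.25 is an upper bound on the positive share, not a guarantee.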

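2.5.2 forward (forward pass)

    # Arguments: features are the feature maps, proposals are the box coordinates, image_shapes are the image sizes after preprocessing, targets are the ground-truth annotations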
    def forward(self,
                features,       # type: Dict[str, Tensor]
                proposals,      # type: List[Tensor]
                image_shapes,   # type: List[Tuple[int, int]]
                targets=None    # type: Optional[List[Dict[str, Tensor]]]
                ):
        # type: (...) -> Tuple[List[Dict[str, Tensor]], Dict[str, Tensor]]
        """
        Arguments:
            features (List[Tensor])
            proposals (List[Tensor[N, 4]])
            image_shapes (List[Tuple[H, W]])
            targets (List[Dict])
        """

        # Check that the targets have the right dtypes
        if targets is not None:
            for t in targets:
                floating_point_types = (torch.float, torch.double, torch.half)
                assert t["boxes"].dtype in floating_point_types, "target boxes must of float type"
                assert t["labels"].dtype == torch.int64, "target labels must of int64 type"

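        if self.training:
            # Split into positive and negative samples and collect each sample's gt label and box regression targets.
            # The RPN outputs 2000 proposals per image, but training only needs 512 sampled from them.
            proposals, labels, regression_targets = self.select_training_samples(proposals, targets)
        # Outside training the RPN provides 1000 proposals (rpn_post_nms_top_n_test=1000)
        else:
            labels = None
            regression_targets = None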

        # Pass the sampled proposals through the Multi-scale RoIAlign pooling layer
        # box_features_shape: [num_proposals, channel, height, width]
        # box_roi_pool is the RoIAlign layer; it brings every proposal to the size we specified.
        # features holds five feature levels because we predict on multiple levels.
        # box_features: [1024, 256, 7, 7] (two images, 512 proposals each; RoIAlign turns
        # every proposal into a 256 x 7 x 7 feature matrix)
        box_features = self.box_roi_pool(features, proposals, image_shapes)

        # The two fully connected layers after roi pooling: TwoMLPHead
        # box_features_shape: [num_proposals, representation_size], i.e. [1024, 1024]
        box_features = self.box_head(box_features)

        # Predict the classes and the box regression parameters: [1024, 21] and [1024, 84]
        class_logits, box_regression = self.box_predictor(box_features)

        # An empty list and an empty dict
        result = torch.jit.annotate(List[Dict[str, torch.Tensor]], [])
        losses = {}
        # In training mode, compute the Fast R-CNN losses
        if self.training:
            assert labels is not None and regression_targets is not None
            loss_classifier, loss_box_reg = fastrcnn_loss(
                class_logits, box_regression, labels, regression_targets)
            losses = {
                "loss_classifier": loss_classifier,
                "loss_box_reg": loss_box_reg
            }
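        # In validation mode, post-process the predictions.
        # No positive/negative matching or sampling happens here: all RPN proposals
        # are used for prediction, and the RPN provides only 1000 of them.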
        else:
            boxes, scores, labels = self.postprocess_detections(class_logits, box_regression, proposals, image_shapes)
            num_images = len(boxes)
            for i in range(num_images):
                result.append(
                    {
                        "boxes": boxes[i],
                        "labels": labels[i],
                        "scores": scores[i],
                    }
                )

        return result, losses

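The arguments:

@features: the feature maps produced by the backbone

@proposals: the proposals generated by the RPN

@image_shapes: the image shapes after preprocessing, i.e. the height and width of each image after aspect-preserving resizing. This is not the size of the packed batch!

@targets: the ground-truth annotations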

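        if self.training:
            # Split into positive and negative samples and collect each sample's gt label and box regression targets.
            # The RPN outputs 2000 proposals per image, but training only needs 512 sampled from them.
            proposals, labels, regression_targets = self.select_training_samples(proposals, targets)
        # Outside training the RPN provides 1000 proposals (rpn_post_nms_top_n_test=1000)
        else:
            labels = None
            regression_targets = None

In training mode we call select_training_samples to pick the samples we use. Recall that the RPN outputs 2000 proposals per image, but training only needs 512 of them, so they are sampled further. Outside training mode (validation mode), the RPN generates only 1000 proposals.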

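        # Pass the sampled proposals through the Multi-scale RoIAlign pooling layer
        # box_features_shape: [num_proposals, channel, height, width]
        # box_roi_pool is the RoIAlign layer; it brings every proposal to the size we specified.
        # features holds five feature levels because we predict on multiple levels.
        # box_features: [1024, 256, 7, 7] (two images, 512 proposals each; RoIAlign turns
        # every proposal into a 256 x 7 x 7 feature matrix)
        box_features = self.box_roi_pool(features, proposals, image_shapes)

We pass features, proposals, and image_shapes to box_roi_pool. box_roi_pool is the RoIAlign layer, which brings every proposal to the specified size.

features is the set of feature maps produced by the backbone (five feature levels with the FPN structure).

proposals has been filtered so that only 512 proposals remain for each image.

image_shapes holds the size of each image after resizing.

The resulting box_features has shape [1024, 256, 7, 7].

1024 corresponds to two images with 512 proposals each. RoIAlign turns every proposal into a feature matrix of size 256*7*7.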

        # The two fully connected layers after roi pooling: TwoMLPHead
        # box_features_shape: [num_proposals, representation_size], i.e. [1024, 1024]
        box_features = self.box_head(box_features)

box_head corresponds to the Two MLPHead block in the figure.

box_features is now 1024*1024.

We then pass box_features to box_predictor, which corresponds to the FastRCNNPredictor block in the figure.

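        # Predict the classes and the box regression parameters: [1024, 21] and [1024, 84]
        class_logits, box_regression = self.box_predictor(box_features)

For every proposal, a score is predicted for each of the 21 classes.

For every proposal, four box regression parameters are predicted for each of the 21 classes.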
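
To see how the 84 values of one proposal relate to the 21 classes, here is a small plain-Python sketch. It assumes the values are laid out as 21 consecutive groups of 4, which is what viewing the [1024, 84] tensor as [1024, 21, 4] amounts to; deltas_for_class is a name made up for this example.

```python
num_classes = 21  # 20 object classes plus background


def deltas_for_class(bbox_row, class_index):
    """Pick the 4 regression values of one class from a flat row of num_classes * 4 values."""
    assert len(bbox_row) == num_classes * 4
    start = class_index * 4
    return bbox_row[start:start + 4]


row = [float(i) for i in range(num_classes * 4)]  # stand-in for one row of box_regression
print(deltas_for_class(row, 0))
print(deltas_for_class(row, 20))
```

So each proposal carries one candidate box adjustment per class, and the class that is finally predicted decides which group of four is used.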

We define an empty list and an empty dict:

        # An empty list and an empty dict
        result = torch.jit.annotate(List[Dict[str, torch.Tensor]], [])
        losses = {}

In training mode, the Fast R-CNN losses are computed.

        if self.training:
            assert labels is not None and regression_targets is not None
            loss_classifier, loss_box_reg = fastrcnn_loss(
                class_logits, box_regression, labels, regression_targets)
            losses = {
                "loss_classifier": loss_classifier,
                "loss_box_reg": loss_box_reg
            }

In validation mode, the predictions are post-processed:

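        # In validation mode, post-process the predictions.
        # No positive/negative matching or sampling happens here: all RPN proposals
        # are used for prediction, and the RPN provides only 1000 of them.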
        else:
            boxes, scores, labels = self.postprocess_detections(class_logits, box_regression, proposals, image_shapes)
            num_images = len(boxes)
            for i in range(num_images):
                result.append(
                    {
                        "boxes": boxes[i],
                        "labels": labels[i],
                        "scores": scores[i],
                    }
                )

That is, low-score detections are filtered out, NMS is applied, and so on. The next post covers this part!
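
As a preview, the two steps named above can be sketched in plain Python with the thresholds from section 2.2.1 (0.05, 0.5, 100). This is a simplified, class-agnostic version written for illustration; the function names are ours and it is not the postprocess_detections code, which the next post walks through.

```python
def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes with positive area."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def filter_and_nms(boxes, scores, score_thresh=0.05, nms_thresh=0.5, detections_per_img=100):
    """Drop low-score boxes, run greedy NMS, and keep the top detections."""
    candidates = [(s, b) for s, b in zip(scores, boxes) if s > score_thresh]
    candidates.sort(key=lambda t: t[0], reverse=True)  # highest score first
    kept = []
    for score, box in candidates:
        # keep a box only if it does not overlap too much with a better box already kept
        if all(box_iou(box, kb) <= nms_thresh for _, kb in kept):
            kept.append((score, box))
    return kept[:detections_per_img]


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.01]
for score, box in filter_and_nms(boxes, scores):
    print(score, box)
```

In this example the second box is suppressed because it overlaps the first one too much, and the last box is dropped for its low score.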