6 - [Faster R-CNN Code Walkthrough] Proposals and Filter Proposals

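1. Introduction
2. Data recap
3. Computing proposal coordinates
4. Filtering proposals and related processing
- 1) Keep the 2000 proposals with the highest predicted scores
- 2) Map the scores to the range 0 to 1
- 3) Clip the parts of proposals that fall outside the image
- 4) Remove proposals whose width or height is smaller than 1
- 5) Remove proposals whose probability is below the threshold
- 6) Remove redundant proposals with NMS
- 7) Finally, keep the top k proposals
5. Code
- 1. Computing proposal coordinates
- 2. Filtering proposals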

Goal: apply the predicted bbox regression parameters to the corresponding anchors to obtain the coordinates of the predicted bboxes.

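1. Introduction

In the previous post, [Faster R-CNN] AnchorGenerator Code Walkthrough, we obtained the coordinates of the anchors in the original image.
In the post before that, [Faster R-CNN] RPN Head Code Walkthrough, we obtained the predicted probability (objectness) of each bounding box and the predicted bounding box regression parameters. These regression parameters are not the absolute position of the predicted bounding box in the original image; they are offsets relative to the position of each anchor.

So we need to combine the absolute position of each anchor in the original image with the bounding box regression parameters to compute the absolute position of each predicted bounding box in the original image. We need these absolute coordinates so that the predicted bounding boxes can be compared with the ground truth bounding boxes when computing the loss.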

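2. Data recap

raw anchors: anchors are created from the hyperparameters $size = (32, 64, 128, 256, 512)$ and $aspect\_ratios=(0.5, 1.0, 2.0)$, which gives $5 * 3 = 15$ anchors. (Every pixel of the feature map produces these 15 anchors.)
anchors mapped onto the original image: every pixel of the feature map produces 15 anchors. Each feature-map pixel is mapped back to the original image, and the resulting points are the anchor centers. Mapping the anchors back to the original image as well gives the absolute coordinates of the anchors in the original image.

Total anchors per image = $15 * height * width$, where height and width are the height and width of the feature map.
predicted bounding box regression parameters: produced by the RPN Head network with shape=(batch_size, 60, height, width). We reshape them to shape= $(batch\_size*15*height*width, 4)$ for the later computation.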
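
To make the bookkeeping above concrete, here is a small arithmetic sketch. The batch size and feature-map size are hypothetical values chosen only for illustration; the sizes and aspect ratios are the ones listed above.

```python
# Hypothetical batch and feature-map size, chosen only for illustration.
sizes = (32, 64, 128, 256, 512)
aspect_ratios = (0.5, 1.0, 2.0)
batch_size, height, width = 2, 25, 38

anchors_per_location = len(sizes) * len(aspect_ratios)     # 5 * 3 = 15
anchors_per_image = anchors_per_location * height * width  # 15 * 25 * 38
regression_channels = anchors_per_location * 4             # 4 parameters per anchor

# Shape of the regression output before and after flattening.
raw_shape = (batch_size, regression_channels, height, width)
flat_shape = (batch_size * anchors_per_image, 4)

print(anchors_per_image)      # 14250
print(raw_shape, flat_shape)  # (2, 60, 25, 38) (28500, 4)
```

Both shapes hold the same number of values, which is why the reshape is possible: one row of 4 regression parameters per anchor.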

3. Computing proposal coordinates

Goal: compute the coordinates of the predicted boxes in the original image.
Method: start from the anchor coordinates in the original image and add the offsets of the predicted box. The offsets are computed from the predicted bounding box regression parameters.

1) Convert the anchor coordinates from (xmin, ymin, xmax, ymax) to (ctr_x, ctr_y, widths, heights)

anchor width: anchor_width = xmax - xmin
anchor height: anchor_height = ymax - ymin
anchor center x: anchor_ctr_x = xmin + 0.5 * anchor_width
anchor center y: anchor_ctr_y = ymin + 0.5 * anchor_height


anchor_widths = anchors[:, 2] - anchors[:, 0]  # anchor widths
anchor_heights = anchors[:, 3] - anchors[:, 1]  # anchor heights
ctr_x = anchors[:, 0] + 0.5 * anchor_widths  # anchor center x coordinates
ctr_y = anchors[:, 1] + 0.5 * anchor_heights  # anchor center y coordinates
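
The same conversion for a single box, as a minimal plain-Python sketch. The helper name to_center_form and the example box are ours, introduced only for illustration.

```python
def to_center_form(box):
    """Convert (xmin, ymin, xmax, ymax) to (ctr_x, ctr_y, width, height)."""
    xmin, ymin, xmax, ymax = box
    width = xmax - xmin
    height = ymax - ymin
    ctr_x = xmin + 0.5 * width    # the center is half a width to the right of xmin
    ctr_y = ymin + 0.5 * height   # and half a height below ymin
    return ctr_x, ctr_y, width, height


print(to_center_form((10.0, 20.0, 50.0, 100.0)))  # (30.0, 60.0, 40.0, 80.0)
```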

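2) The predicted regression parameters pred_box_regression have shape $(batch\_size*15*height*width, 4)$, where $batch\_size*15*height*width$ is the number of anchors and 4 is the number of coordinate regression parameters per anchor. We split these 4 values and call them dx, dy, dw, dh.

dw and dh are clamped to a maximum of log(1000./16) ≈ 4.135, so that overly large values are not passed to torch.exp() later on.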

# self.bbox_xform_clip = math.log(1000. / 16)
dw = torch.clamp(dw, max=self.bbox_xform_clip)
dh = torch.clamp(dh, max=self.bbox_xform_clip)
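
A quick way to see what the clamp buys us: since exp(log(1000/16)) = 62.5, a predicted box can never be more than 62.5 times wider or taller than its anchor. The sketch below uses plain Python instead of torch; clamp_max and the anchor width of 512 are hypothetical, for illustration only.

```python
import math

bbox_xform_clip = math.log(1000.0 / 16)  # about 4.135


def clamp_max(value, max_value):
    """Scalar stand-in for torch.clamp(value, max=max_value)."""
    return min(value, max_value)


anchor_width = 512.0  # hypothetical anchor width
for dw in (0.5, 4.0, 20.0):
    clipped = clamp_max(dw, bbox_xform_clip)
    # Without the clamp, dw = 20 would scale the anchor by exp(20), roughly 4.9e8.
    print(dw, clipped, math.exp(clipped) * anchor_width)
```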

3) Compute the offset of the predicted box relative to the anchor center

offset in x: pred_x_offset = dx * anchor_widths
offset in y: pred_y_offset = dy * anchor_heights

Adding the offsets to the anchor center coordinates gives the center of the predicted box

predicted box center x: pred_ctr_x = pred_x_offset + ctr_x
predicted box center y: pred_ctr_y = pred_y_offset + ctr_y

Width and height of the predicted box

predicted box width: pred_w = $e^{dw}$ * anchor_widths
predicted box height: pred_h = $e^{dh}$ * anchor_heights

pred_ctr_x = dx * anchor_widths[:, None] + ctr_x[:, None]
pred_ctr_y = dy * anchor_heights[:, None] + ctr_y[:, None]
pred_w = torch.exp(dw) * anchor_widths[:, None]
pred_h = torch.exp(dh) * anchor_heights[:, None]
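
For intuition, here is the same decoding applied to one anchor in plain Python. The function name decode_center_size and all numbers are ours, for illustration only: dx and dy shift the center by a fraction of the anchor size, while dw and dh scale the size through exp.

```python
import math


def decode_center_size(anchor, deltas):
    """anchor = (ctr_x, ctr_y, width, height); deltas = (dx, dy, dw, dh)."""
    ctr_x, ctr_y, anchor_w, anchor_h = anchor
    dx, dy, dw, dh = deltas
    pred_ctr_x = dx * anchor_w + ctr_x  # shift by a fraction of the anchor width
    pred_ctr_y = dy * anchor_h + ctr_y  # shift by a fraction of the anchor height
    pred_w = math.exp(dw) * anchor_w    # dw = 0 keeps the anchor width
    pred_h = math.exp(dh) * anchor_h    # dh = log(2) doubles the anchor height
    return pred_ctr_x, pred_ctr_y, pred_w, pred_h


# Anchor centered at (30, 60) with size 40 x 80, and hypothetical deltas.
print(decode_center_size((30.0, 60.0, 40.0, 80.0), (0.1, -0.05, 0.0, math.log(2))))
```

The result is a center near (34, 56), an unchanged width of 40 and a height of about 160.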

4) Finally, convert the predicted box from (ctr_x, ctr_y, width, height) to (xmin, ymin, xmax, ymax)

predicted box top-left x: pred_boxes_xmin = pred_ctr_x - 0.5 * pred_w
predicted box top-left y: pred_boxes_ymin = pred_ctr_y - 0.5 * pred_h
predicted box bottom-right x: pred_boxes_xmax = pred_ctr_x + 0.5 * pred_w
predicted box bottom-right y: pred_boxes_ymax = pred_ctr_y + 0.5 * pred_h

pred_boxes_xmin = pred_ctr_x - torch.tensor(0.5, dtype=pred_ctr_x.dtype, device=pred_w.device) * pred_w
pred_boxes_ymin = pred_ctr_y - torch.tensor(0.5, dtype=pred_ctr_y.dtype, device=pred_h.device) * pred_h
pred_boxes_xmax = pred_ctr_x + torch.tensor(0.5, dtype=pred_ctr_x.dtype, device=pred_w.device) * pred_w
pred_boxes_ymax = pred_ctr_y + torch.tensor(0.5, dtype=pred_ctr_y.dtype, device=pred_h.device) * pred_h
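
A scalar sketch of this last conversion. The helper to_corner_form and the input values are hypothetical. Notice that the resulting ymin is negative, so a decoded box can stick out of the image; this is why the clipping step in section 4 exists.

```python
def to_corner_form(ctr_x, ctr_y, width, height):
    """Convert (ctr_x, ctr_y, width, height) to (xmin, ymin, xmax, ymax)."""
    xmin = ctr_x - 0.5 * width
    ymin = ctr_y - 0.5 * height
    xmax = ctr_x + 0.5 * width
    ymax = ctr_y + 0.5 * height
    return xmin, ymin, xmax, ymax


# A predicted box centered at (34, 56) with size 40 x 160.
print(to_corner_form(34.0, 56.0, 40.0, 160.0))  # (14.0, -24.0, 54.0, 136.0)
```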

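At this point we have computed the coordinates of the proposals in the original image.

The number of proposals obtained so far is far too large, typically on the order of a hundred thousand.
Next we filter them down to 1000+ proposals (no more than 2000).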

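4. Filtering proposals and related processing

1) Keep the 2000 proposals with the highest predicted scores

Keep the 2000 proposals with the highest predicted scores, that is, the proposals most likely to contain objects.

Recall that the RPN Head network (see the figure below) gave us a score for every proposal; we keep the top 2000 of them.
[Figure: RPN Head network]
The selected proposals are stored in descending order of score, both the scores and the corresponding proposal coordinates.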

'''
objectness: predicted scores
proposals: proposal coordinates
during training, pre_nms_top_n = 2000
'''
# indices of the proposals with the top 2000 predicted scores
_, top_n_idx = objectness.topk(pre_nms_top_n, dim=1)

# predicted scores and coordinates of the top 2000 proposals
batch_idx = torch.arange(num_images, device=device).reshape(-1, 1)
objectness = objectness[batch_idx, top_n_idx]
proposals = proposals[batch_idx, top_n_idx]
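
To see what topk plus indexing does, here is a plain-Python sketch for a single image. The helper topk_indices and the toy scores and boxes are ours, for illustration only. The indices come back ordered from the highest score to the lowest, and the same indices are used to pick both the scores and the boxes, so the two stay aligned.

```python
def topk_indices(scores, k):
    """Indices of the k largest scores, highest first."""
    k = min(k, len(scores))  # never ask for more entries than exist
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Toy raw scores (logits) and the matching proposal boxes.
objectness = [0.3, 2.1, -1.5, 0.9, 1.7]
proposals = [(0, 0, 10, 10), (5, 5, 20, 20), (1, 1, 3, 3), (8, 8, 30, 30), (2, 2, 12, 12)]

top_n_idx = topk_indices(objectness, 3)
top_scores = [objectness[i] for i in top_n_idx]
top_boxes = [proposals[i] for i in top_n_idx]

print(top_n_idx)   # [1, 4, 3]
print(top_scores)  # [2.1, 1.7, 0.9]
```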


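2) Map the scores to the range 0 to 1

The scores predicted by the RPN Head are raw, unbounded values. Applying a sigmoid turns them into actual probabilities in the range 0~1.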

objectness_prob = torch.sigmoid(objectness)
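
A minimal sketch of the sigmoid on a few hypothetical raw scores, in plain Python. Because the sigmoid is monotonically increasing, it does not change the ranking, which is why it is fine to take the top 2000 on the raw scores first and convert afterwards.

```python
import math


def sigmoid(x):
    """Map a raw score to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


raw_scores = [2.1, 1.7, 0.9, -1.5]  # hypothetical values, already sorted descending
probs = [sigmoid(s) for s in raw_scores]

print([round(p, 3) for p in probs])
print(probs == sorted(probs, reverse=True))  # True: the ranking is preserved
```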

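3) Clip the parts of proposals that fall outside the image

Clip the parts of the proposals that extend beyond the original image, moving out-of-bounds coordinates onto the image border.
Here height and width are the height and width of the image after resizing and before padding.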

boxes = clip_boxes_to_image(boxes, img_shape)

def clip_boxes_to_image(boxes, size):
    boxes_x = boxes[:, 0::2]  # x1, x2
    boxes_y = boxes[:, 1::2]  # y1, y2
    height, width = size

    boxes_x = boxes_x.clamp(min=0, max=width)   # restrict x coordinates to [0, width]
    boxes_y = boxes_y.clamp(min=0, max=height)  # restrict y coordinates to [0, height]

    clipped_boxes = torch.stack((boxes_x, boxes_y), dim=2)
    return clipped_boxes.reshape(boxes.shape) # (top_k_num, 4)
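
The effect on one box, as a plain-Python sketch. The helper names and the image size are hypothetical. Note that size is given as (height, width), matching the order used above.

```python
def clamp(value, low, high):
    """Scalar stand-in for Tensor.clamp(min=low, max=high)."""
    return max(low, min(value, high))


def clip_box_to_image(box, size):
    """Clip one (xmin, ymin, xmax, ymax) box to an image of size (height, width)."""
    height, width = size
    xmin, ymin, xmax, ymax = box
    return (clamp(xmin, 0.0, width), clamp(ymin, 0.0, height),
            clamp(xmax, 0.0, width), clamp(ymax, 0.0, height))


# A box that sticks out above, below and to the right of a 100 x 50 (h x w) image.
print(clip_box_to_image((14.0, -24.0, 54.0, 136.0), (100.0, 50.0)))  # (14.0, 0.0, 50.0, 100.0)
```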

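4) Remove proposals whose width or height is smaller than 1

Here min_size=1 is a hyperparameter. A proposal is kept only if both its width and its height are at least min_size.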

keep = remove_small_boxes(boxes, self.min_size)
boxes, scores = boxes[keep], scores[keep]

def remove_small_boxes(boxes, min_size):
    ws = boxes[:, 2] - boxes[:, 0]  # box widths
    hs = boxes[:, 3] - boxes[:, 1]  # box heights
    keep = torch.logical_and(torch.ge(ws, min_size), torch.ge(hs, min_size))
    keep = torch.where(keep)[0]
    return keep
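
The keep rule on a few toy boxes, sketched in plain Python. The helper keep_large_enough and the boxes are ours, for illustration only. A box that is too thin in either direction is dropped, while a box of exactly min_size in both directions is kept because the comparison is >=.

```python
def keep_large_enough(boxes, min_size):
    """Indices of boxes whose width and height are both >= min_size."""
    keep = []
    for i, (xmin, ymin, xmax, ymax) in enumerate(boxes):
        if (xmax - xmin) >= min_size and (ymax - ymin) >= min_size:
            keep.append(i)
    return keep


boxes = [
    (0.0, 0.0, 10.0, 10.0),  # 10 x 10: kept
    (5.0, 5.0, 5.5, 20.0),   # width 0.5: removed
    (3.0, 3.0, 4.0, 4.0),    # exactly 1 x 1: kept
    (7.0, 7.0, 30.0, 7.2),   # height 0.2: removed
]
print(keep_large_enough(boxes, 1.0))  # [0, 2]
```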

5) Remove proposals whose probability is below the threshold

The threshold is set to 0.0 here.

keep = torch.where(torch.ge(scores, self.score_thresh))[0]  
boxes, scores = boxes[keep], scores[keep]
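
One consequence worth spelling out: the scores are sigmoid outputs, so they are never negative, and a threshold of 0.0 combined with >= keeps every proposal. The step only starts to filter once the threshold is raised. A small sketch, where keep_by_score, the scores and the value 0.05 are hypothetical:

```python
def keep_by_score(scores, score_thresh):
    """Indices of scores that are >= score_thresh."""
    return [i for i, s in enumerate(scores) if s >= score_thresh]


scores = [0.89, 0.85, 0.02, 0.0001]  # hypothetical probabilities
print(keep_by_score(scores, 0.0))    # [0, 1, 2, 3]: nothing is removed
print(keep_by_score(scores, 0.05))   # [0, 1]
```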

6) Remove redundant proposals with NMS

The IoU threshold used is 0.7.

keep = batched_nms(boxes, scores, self.nms_thresh)

def batched_nms(boxes, scores, iou_threshold):
    if boxes.numel() == 0:
        return torch.empty((0,), dtype=torch.int64, device=boxes.device)
    keep = torch.ops.torchvision.nms(boxes, scores, iou_threshold)
    return keep
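
torch.ops.torchvision.nms is a compiled operator, so its logic is not visible in the Python code. The following is a plain-Python sketch of greedy NMS, written only to illustrate the idea and not the actual torchvision implementation; the names iou and greedy_nms and the toy boxes are ours. Boxes are visited from the highest score to the lowest, and a box is dropped if its IoU with an already kept box is greater than the threshold.

```python
def iou(a, b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold):
    """Indices of the kept boxes, sorted by decreasing score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0.0, 0.0, 10.0, 10.0), (1.0, 1.0, 11.0, 11.0),
         (20.0, 20.0, 30.0, 30.0), (0.0, 0.0, 10.0, 9.0)]
scores = [0.8, 0.9, 0.7, 0.6]
print(greedy_nms(boxes, scores, 0.7))  # [1, 0, 2]
```

In this toy example box 3 overlaps box 0 with an IoU of 0.9 and is suppressed, while boxes 0 and 1 overlap with an IoU of about 0.68, just under the threshold, so both survive.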

7) Finally, keep the top k proposals

In the first step we kept the top 2000 proposals (pre_nms_top_n=2000).
In the last step, the hyperparameter post_nms_top_n limits how many proposals remain after NMS; here we simply set post_nms_top_n = pre_nms_top_n = 2000.
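
Since NMS returns the kept indices sorted by decreasing score, taking the first post_nms_top_n of them is enough to keep the best ones. A tiny sketch with hypothetical indices; keep_top_n is our own helper name:

```python
def keep_top_n(keep, post_nms_top_n):
    """Keep at most post_nms_top_n indices; the input is already sorted by score."""
    return keep[:post_nms_top_n]


keep_after_nms = [1, 0, 2, 7, 5]  # hypothetical NMS output, best first
print(keep_top_n(keep_after_nms, 3))     # [1, 0, 2]
print(keep_top_n(keep_after_nms, 2000))  # [1, 0, 2, 7, 5]: fewer than 2000 left, nothing is cut
```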

5. Code

1. Computing proposal coordinates

import torch


def decode(self, pred_box_regression, anchors):
    # concatenate the anchors of all batch_size images, shape (batch_size * Anchor * Height * Width, 4)
    anchors = torch.cat(anchors, dim=0)

    # cast the anchors to the dtype of the regression parameters before applying the parameters to them
    anchors = anchors.to(pred_box_regression.dtype)

    # convert the anchors from (xmin, ymin, xmax, ymax) to (ctr_x, ctr_y, anchor_widths, anchor_heights)
    anchor_widths = anchors[:, 2] - anchors[:, 0]  # anchor/proposal widths
    anchor_heights = anchors[:, 3] - anchors[:, 1]  # anchor/proposal heights
    ctr_x = anchors[:, 0] + 0.5 * anchor_widths  # anchor/proposal center x coordinates
    ctr_y = anchors[:, 1] + 0.5 * anchor_heights  # anchor/proposal center y coordinates

    # split out the four regression parameters of the predicted box
    dx, dy, dw, dh = pred_box_regression.split([1, 1, 1, 1], dim=1)

    # limit max value, prevent sending too large values into torch.exp()
    # self.bbox_xform_clip = math.log(1000. / 16) = 4.135
    dw = torch.clamp(dw, max=self.bbox_xform_clip)
    dh = torch.clamp(dh, max=self.bbox_xform_clip)

    # starting from the anchor, compute the predicted box: pred_ctr_x, pred_ctr_y, pred_w, pred_h
    pred_ctr_x = dx * anchor_widths[:, None] + ctr_x[:, None]
    pred_ctr_y = dy * anchor_heights[:, None] + ctr_y[:, None]
    pred_w = torch.exp(dw) * anchor_widths[:, None]
    pred_h = torch.exp(dh) * anchor_heights[:, None]

    # convert the predicted box from (pred_ctr_x, pred_ctr_y, pred_w, pred_h) to (xmin, ymin, xmax, ymax)
    pred_boxes_xmin = pred_ctr_x - torch.tensor(0.5, dtype=pred_ctr_x.dtype, device=pred_w.device) * pred_w
    pred_boxes_ymin = pred_ctr_y - torch.tensor(0.5, dtype=pred_ctr_y.dtype, device=pred_h.device) * pred_h
    pred_boxes_xmax = pred_ctr_x + torch.tensor(0.5, dtype=pred_ctr_x.dtype, device=pred_w.device) * pred_w
    pred_boxes_ymax = pred_ctr_y + torch.tensor(0.5, dtype=pred_ctr_y.dtype, device=pred_h.device) * pred_h

    pred_boxes = torch.stack((pred_boxes_xmin, pred_boxes_ymin, pred_boxes_xmax, pred_boxes_ymax), dim=2).flatten(1)
    return pred_boxes

2. Filtering proposals

import torch
import torchvision  # importing torchvision registers torch.ops.torchvision.nms


def filter_proposals(self, proposals, objectness, image_shapes):
    # type: (Tensor, Tensor, List[Tuple[int, int]]) -> Tuple[List[Tensor], List[Tensor]]

    num_images = proposals.shape[0]
    device = proposals.device

    # do not backprop through objectness
    objectness = objectness.detach()
    objectness = objectness.reshape(num_images, -1)

    # select top_n boxes before applying nms
    if self.training:
        pre_nms_top_n = self.pre_nms_top_n['training']   # pre_nms_top_n = 2000
    else:
        pre_nms_top_n = self.pre_nms_top_n['testing']   # pre_nms_top_n = 1000

    # topk raises an error if asked for more entries than there are anchors
    pre_nms_top_n = min(pre_nms_top_n, objectness.shape[1])

    # indices of the proposals with the top pre_nms_top_n predicted scores
    _, top_n_idx = objectness.topk(pre_nms_top_n, dim=1)

    # predicted scores and coordinates of the selected proposals
    batch_idx = torch.arange(num_images, device=device).reshape(-1, 1)
    objectness = objectness[batch_idx, top_n_idx]
    proposals = proposals[batch_idx, top_n_idx]

    objectness_prob = torch.sigmoid(objectness)

    final_boxes = []
    final_scores = []
    # iterate over the predictions of each image
    for boxes, scores, img_shape in zip(proposals, objectness_prob, image_shapes):
        # clip the proposals so that out-of-bounds coordinates lie on the image border
        boxes = clip_boxes_to_image(boxes, img_shape)

        # indices of the boxes whose width and height are both at least min_size
        keep = remove_small_boxes(boxes, self.min_size)
        boxes, scores = boxes[keep], scores[keep]

        # remove low-probability boxes, see https://github.com/pytorch/vision/pull/3205
        keep = torch.where(torch.ge(scores, self.score_thresh))[0]  # ge: >=
        boxes, scores = boxes[keep], scores[keep]

        # NMS (non-maximum suppression), applied to all boxes of the image at once
        keep = batched_nms(boxes, scores, self.nms_thresh)

        # keep only topk scoring predictions  # self.post_nms_top_n={'training': 2000, 'testing': 1000}
        if self.training:
            keep = keep[: self.post_nms_top_n['training']]
        else:
            keep = keep[: self.post_nms_top_n['testing']]
        boxes, scores = boxes[keep], scores[keep]

        final_boxes.append(boxes)
        final_scores.append(scores)
    return final_boxes, final_scores


def clip_boxes_to_image(boxes, size):
    boxes_x = boxes[:, 0::2]  # x1, x2
    boxes_y = boxes[:, 1::2]  # y1, y2
    height, width = size

    boxes_x = boxes_x.clamp(min=0, max=width)   # restrict x coordinates to [0, width]
    boxes_y = boxes_y.clamp(min=0, max=height)  # restrict y coordinates to [0, height]

    clipped_boxes = torch.stack((boxes_x, boxes_y), dim=2)
    return clipped_boxes.reshape(boxes.shape)


def remove_small_boxes(boxes, min_size):
    ws = boxes[:, 2] - boxes[:, 0]  # box widths
    hs = boxes[:, 3] - boxes[:, 1]  # box heights
    keep = torch.logical_and(torch.ge(ws, min_size), torch.ge(hs, min_size))
    keep = torch.where(keep)[0]
    return keep


def batched_nms(boxes, scores, iou_threshold):
    if boxes.numel() == 0:
        return torch.empty((0,), dtype=torch.int64, device=boxes.device)
    keep = torch.ops.torchvision.nms(boxes, scores, iou_threshold)
    return keep