[Faster R-CNN] A Close Reading of the AnchorGenerator Code

1. Anchor sizes and aspect_ratios
2. Computing anchors centered at (0, 0)
3. Mapping anchors onto the original image
4. Full code

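What anchors are for: anchors are an auxiliary construct. Combined with the bounding box regression output (computed by the RPN Head, covered in the previous lesson), they yield the coordinates of the predicted proposal boxes.
** My understanding: the bounding box regression output encodes a position relative to an anchor, in the form of coefficients.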
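
To make the "relative position as coefficients" idea concrete, here is a minimal sketch of how an anchor and four regression values could be combined into a box. It assumes the parameterization from the Faster R-CNN paper (center offsets scaled by the anchor's width and height, log-space size changes); the function name `decode_box` is hypothetical, and real implementations may add weights or clamping that are not shown here.

```python
import math


def decode_box(anchor, deltas):
    """Apply (dx, dy, dw, dh) regression values to one anchor (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = anchor
    dx, dy, dw, dh = deltas

    # Anchor width, height and center
    wa, ha = xmax - xmin, ymax - ymin
    cxa, cya = xmin + 0.5 * wa, ymin + 0.5 * ha

    # Center offsets are relative to the anchor size; size changes are in log space
    cx = cxa + dx * wa
    cy = cya + dy * ha
    w = wa * math.exp(dw)
    h = ha * math.exp(dh)

    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)


anchor = (-16.0, -16.0, 16.0, 16.0)
unchanged = decode_box(anchor, (0.0, 0.0, 0.0, 0.0))   # zero deltas reproduce the anchor
shifted = decode_box(anchor, (0.25, 0.0, 0.0, 0.0))    # center moves right by 0.25 * 32 = 8
print(unchanged, shifted)
```

With all-zero regression values the predicted box is the anchor itself, which is why the anchors act as the reference frame for the RPN Head's output.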

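1. Anchor sizes and aspect_ratios

anchor size: the scale of an anchor; a typical setting is anchor_sizes = (32, 64, 128, 256, 512)
anchor aspect_ratios: the height-to-width ratio of an anchor (height / width); a typical setting is (0.5, 1.0, 2.0), which the authors describe as empirical values.

Each size is paired with the three aspect ratios, producing 3 anchors.

For example, with size=32 and aspect_ratios=(0.5, 1.0, 2.0), the following 3 anchors are created:

When the aspect ratio is 1:1, size=32 is the side length of the anchor, i.e. height = width = 32
When the aspect ratio is 0.5, height / width = 0.5 and the anchor area is $32^2$
When the aspect ratio is 2, height / width = 2 and the anchor area is $32^2$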


Because there are 5 sizes (32, 64, 128, 256, 512) and each size generates 3 anchors with different aspect_ratios, a total of 15 anchors of different shapes is generated.

2. Computing anchors centered at (0, 0)

We again use size=32 and aspect_ratios=(0.5, 1.0, 2.0) as the example:

When aspect_ratios=0.5, let the height be $x$ and the width be $2x$.
The anchor area is $x * 2x = 32^2$; solving gives $x = \frac{32}{\sqrt{2}}$
So $anchor\_height = x = \frac{32}{\sqrt{2}}$ and $anchor\_width = 2x = 32 \sqrt{2}$
When aspect_ratios=1, $anchor\_height = 32$ and $anchor\_width = 32$
When aspect_ratios=2, $anchor\_width = x = \frac{32}{\sqrt{2}}$ and $anchor\_height = 2x = 32 \sqrt{2}$

For convenience, we define h_ratios and w_ratios, where h_ratios is the square root of aspect_ratios and w_ratios is its reciprocal: $h\_ratios = [\frac{1}{\sqrt{2}}, 1, \sqrt{2}], \quad w\_ratios = [\sqrt{2}, 1, \frac{1}{\sqrt{2}}]$

so that:
$anchor\_width = w\_ratios * scales = (\sqrt{2}, 1, \frac{1}{\sqrt{2}}) * 32 = [45.2548, 32.0000, 22.6274]$
$anchor\_height = h\_ratios * scales = (\frac{1}{\sqrt{2}}, 1, \sqrt{2}) * 32 = [22.6274, 32.0000, 45.2548]$
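
The widths and heights above can be checked with a few lines of plain Python. This is only a sketch using the standard library rather than torch; the variable names are chosen for illustration. It also confirms that every anchor keeps the area $32^2 = 1024$ and the requested height / width ratio.

```python
import math

size = 32
aspect_ratios = (0.5, 1.0, 2.0)  # height / width

# height scales with sqrt(ratio), width with 1 / sqrt(ratio), so the area stays size ** 2
h_ratios = [math.sqrt(r) for r in aspect_ratios]
w_ratios = [1.0 / h for h in h_ratios]

widths = [w * size for w in w_ratios]
heights = [h * size for h in h_ratios]

for r, w, h in zip(aspect_ratios, widths, heights):
    print(f"ratio={r}: width={w:.4f}, height={h:.4f}, area={w * h:.1f}, h/w={h / w:.2f}")
```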

Finally, taking the anchor center as [0, 0], we compute the top-left and bottom-right corners of the 3 anchors:

Anchor with size=32, aspect_ratios=0.5:
(xmin, ymin, xmax, ymax) = (-ws, -hs, ws, hs) / 2 = [-22.6274, -11.3137, 22.6274, 11.3137]
Anchor with size=32, aspect_ratios=1:
(xmin, ymin, xmax, ymax) = (-ws, -hs, ws, hs) / 2 = [-16.0000, -16.0000, 16.0000, 16.0000]
Anchor with size=32, aspect_ratios=2:
(xmin, ymin, xmax, ymax) = (-ws, -hs, ws, hs) / 2 = [-11.3137, -22.6274, 11.3137, 22.6274]

Then the coordinates are rounded to the nearest integer, giving:
Anchor with size=32, aspect_ratios=0.5: (xmin, ymin, xmax, ymax) = [-23., -11., 23., 11.]
Anchor with size=32, aspect_ratios=1: (xmin, ymin, xmax, ymax) = [-16., -16., 16., 16.]
Anchor with size=32, aspect_ratios=2: (xmin, ymin, xmax, ymax) = [-11., -23., 11., 23.]

The code is as follows:

def generate_anchors(self, scales, aspect_ratios, dtype, device):
    scales = torch.as_tensor(scales, dtype=dtype, device=device)
    aspect_ratios = torch.as_tensor(aspect_ratios, dtype=dtype, device=device)
    h_ratios = torch.sqrt(aspect_ratios)
    w_ratios = 1.0 / h_ratios

    ws = (w_ratios[:, None] * scales[None, :]).view(-1)
    hs = (h_ratios[:, None] * scales[None, :]).view(-1)

    base_anchors = torch.stack([-ws, -hs, ws, hs], dim=1) / 2
    return base_anchors.round()  # round to the nearest integer

In the same way, we can compute the coordinates of the other 12 anchors, for sizes 64, 128, 256, and 512.

All 15 anchors, as (xmin, ymin, xmax, ymax):
[ -23., -11., 23., 11.],
[ -16., -16., 16., 16.],
[ -11., -23., 11., 23.],
[ -45., -23., 45., 23.],
[ -32., -32., 32., 32.],
[ -23., -45., 23., 45.],
[ -91., -45., 91., 45.],
[ -64., -64., 64., 64.],
[ -45., -91., 45., 91.],
[-181., -91., 181., 91.],
[-128., -128., 128., 128.],
[ -91., -181., 91., 181.],
[-362., -181., 362., 181.],
[-256., -256., 256., 256.],
[-181., -362., 181., 362.]
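
As a cross-check, the 15 rows above can be reproduced without torch. The following is a minimal sketch in plain Python; `base_anchors` is a hypothetical helper name, and it loops over sizes first and aspect ratios second, which matches the row order of the list.

```python
import math


def base_anchors(sizes, aspect_ratios):
    """Return (xmin, ymin, xmax, ymax) anchors centered at (0, 0), rounded to integers."""
    anchors = []
    for size in sizes:
        for ratio in aspect_ratios:
            h = size * math.sqrt(ratio)   # height grows with sqrt(ratio)
            w = size / math.sqrt(ratio)   # width shrinks by the same factor
            anchors.append([round(-w / 2), round(-h / 2), round(w / 2), round(h / 2)])
    return anchors


anchors = base_anchors((32, 64, 128, 256, 512), (0.5, 1.0, 2.0))
for a in anchors:
    print(a)
```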

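So far the anchors exist only in their own coordinate frame, centered at (0, 0). Next we map them onto the original image.

3. Mapping anchors onto the original image

The approach can be understood as follows:

Every pixel of the feature map gets these 15 anchors of different shapes. For example, if the feature map has size (3, 4), every pixel has these 15 anchors, so there are $3 * 4 * 15 = 180$ anchors in total.
Every pixel of the feature map is mapped onto the original image according to the scaling factors (one for height, one for width), and the anchors are mapped along with it.
- Here "original image" means the image size inside the batch, after resize_and_padding. We keep calling it the original image below.
- The height and width scaling factors are determined by the backbone; the RPN Head does not change the spatial size.
Finally, we obtain the coordinates of each anchor on the original image, in the form (xmin, ymin, xmax, ymax).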

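Next, a worked example with the details:
Suppose the original image has size (10, 12) and the feature map has size (3, 4), both given as (height, width).
The scaling factor (called the stride here) is: stride = (10//3, 12//4) = (3, 3)

The feature map's height and width coordinates are mapped onto the original image by the stride:
y-axis coordinates (height, 3 rows): $(0, 1, 2) \times 3 = (0, 3, 6)$
x-axis coordinates (width, 4 columns): $(0, 1, 2, 3) \times 3 = (0, 3, 6, 9)$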
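
The stride and the mapped coordinates for this example can be computed directly. This is a small sketch in plain Python, assuming both sizes are given as (height, width), as in the grid_anchors code in this post.

```python
image_size = (10, 12)        # (height, width) of the padded image in the batch
feature_map_size = (3, 4)    # (height, width) of the feature map

# Integer division, as in the example: (10 // 3, 12 // 4) = (3, 3)
stride_h = image_size[0] // feature_map_size[0]
stride_w = image_size[1] // feature_map_size[1]

# One shift per feature-map row (y) and per feature-map column (x)
shifts_y = [row * stride_h for row in range(feature_map_size[0])]
shifts_x = [col * stride_w for col in range(feature_map_size[1])]

print(stride_h, stride_w)
print(shifts_y)
print(shifts_x)
```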

Then torch.meshgrid() replicates the x and y coordinates for every grid cell, and the results are flattened.

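torch.stack([shift_x, shift_y, shift_x, shift_y], dim=1) then stacks them. Each row is the absolute position of an anchor center on the original image, written twice so that it can be added to (xmin, ymin, xmax, ymax).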
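
The effect of the meshgrid, flatten, and stack steps can be imitated with a nested list comprehension. This sketch uses plain Python lists instead of tensors and assumes the row-major order produced by indexing='ij' with y as the first axis, so x varies fastest.

```python
shifts_y = [0, 3, 6]       # mapped row positions (height direction)
shifts_x = [0, 3, 6, 9]    # mapped column positions (width direction)

# y is the outer loop and x the inner loop, matching the flattened 'ij' meshgrid
shifts = [[x, y, x, y] for y in shifts_y for x in shifts_x]

for s in shifts:
    print(s)
```

This gives 12 rows, one per feature-map pixel, starting with [0, 0, 0, 0] and [3, 0, 3, 0] and ending with [9, 6, 9, 6].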


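Adding the relative coordinates computed earlier (centered at (0, 0)) then gives the absolute coordinates of the anchors on the original image.
That is, each of the 12 center positions above is added to the 15 relative anchor coordinates computed earlier, giving 12 * 15 = 180 anchors in total: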

        [ -23.,  -11.,   23.,   11.]
        [ -16.,  -16.,   16.,   16.]
        [ -11.,  -23.,   11.,   23.]
        [ -45.,  -23.,   45.,   23.]
        [ -32.,  -32.,   32.,   32.]
        [ -23.,  -45.,   23.,   45.]
        [ -91.,  -45.,   91.,   45.]
        [ -64.,  -64.,   64.,   64.]
        [ -45.,  -91.,   45.,   91.]
        [-181.,  -91.,  181.,   91.]
        [-128., -128.,  128.,  128.]
        [ -91., -181.,   91.,  181.]
        [-362., -181.,  362.,  181.]
        [-256., -256.,  256.,  256.]
        [-181., -362.,  181.,  362.]
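
The broadcast addition can be spelled out as two nested loops. The sketch below uses plain Python lists with the 12 centers and 15 base anchors from this example; the outer loop runs over centers and the inner loop over base anchors, which is the order that results from adding shapes (12, 1, 4) and (1, 15, 4) and reshaping to (-1, 4).

```python
cell_anchors = [
    [-23, -11, 23, 11], [-16, -16, 16, 16], [-11, -23, 11, 23],
    [-45, -23, 45, 23], [-32, -32, 32, 32], [-23, -45, 23, 45],
    [-91, -45, 91, 45], [-64, -64, 64, 64], [-45, -91, 45, 91],
    [-181, -91, 181, 91], [-128, -128, 128, 128], [-91, -181, 91, 181],
    [-362, -181, 362, 181], [-256, -256, 256, 256], [-181, -362, 181, 362],
]

# 12 anchor centers on the 10 x 12 image, as [x, y, x, y]
shifts = [[x, y, x, y] for y in (0, 3, 6) for x in (0, 3, 6, 9)]

# Every center receives all 15 base anchors
anchors = [
    [s + a for s, a in zip(shift, anchor)]
    for shift in shifts
    for anchor in cell_anchors
]

print(len(anchors))
print(anchors[0], anchors[15], anchors[-1])
```

Most of these boxes extend beyond the 10 x 12 image because the example image is much smaller than the anchor sizes; the example is only meant to show the arithmetic.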

The code for this part:

def grid_anchors(self, feature_map_size, strides):
    cell_anchors = self.cell_anchors

    grid_height, grid_width = feature_map_size
    stride_height, stride_width = strides
    device = cell_anchors[0].device

    shifts_x = torch.arange(0, grid_width, dtype=torch.float32, device=device) * stride_width
    shifts_y = torch.arange(0, grid_height, dtype=torch.float32, device=device) * stride_height

    shift_y, shift_x = torch.meshgrid([shifts_y, shifts_x], indexing='ij')
    shift_x = shift_x.reshape(-1)
    shift_y = shift_y.reshape(-1)

    shifts = torch.stack([shift_x, shift_y, shift_x, shift_y], dim=1)
    shifts_anchor = shifts.view(-1, 1, 4) + cell_anchors[0].view(1, -1, 4)
    return shifts_anchor.reshape(-1, 4)  # Tensor of shape (all_num_anchors, 4)

4. Full code

import torch


class AnchorsGenerator(torch.nn.Module):
    def __init__(self, sizes, aspect_ratios):
        # anchor_sizes = ((32,), (64,), (128,), (256,), (512,))
        # aspect_ratios = ((0.5, 1.0, 2.0),) * len(anchor_sizes)

        super(AnchorsGenerator, self).__init__()
        self.sizes = sizes
        self.aspect_ratios = aspect_ratios
        self.cell_anchors = None
        self._cache = {}

    def forward(self, image_list, feature_maps):
        feature_map_size = feature_maps.shape[-2:]
        image_size = image_list.tensors.shape[-2:]
        dtype, device = feature_maps.dtype, feature_maps.device

        strides = [torch.tensor(image_size[0] // feature_map_size[0], dtype=torch.int64, device=device),
                   torch.tensor(image_size[1] // feature_map_size[1], dtype=torch.int64, device=device)]

        cell_anchors = [
            self.generate_anchors(sizes, aspect_ratios, dtype, device)
            for sizes, aspect_ratios in zip(self.sizes, self.aspect_ratios)
        ]
        self.cell_anchors = [torch.concat(cell_anchors, dim=0)]

        anchors_over_all_feature_maps = self.grid_anchors(feature_map_size, strides)
        anchors = [anchors_over_all_feature_maps for i in range(feature_maps.shape[0])]
        return anchors

    def generate_anchors(self, scales, aspect_ratios, dtype, device):
        # # type: (List[int], List[float], torch.dtype, torch.device) -> Tensor
        scales = torch.as_tensor(scales, dtype=dtype, device=device)
        aspect_ratios = torch.as_tensor(aspect_ratios, dtype=dtype, device=device)
        h_ratios = torch.sqrt(aspect_ratios)
        w_ratios = 1.0 / h_ratios

        ws = (w_ratios[:, None] * scales[None, :]).view(-1)
        hs = (h_ratios[:, None] * scales[None, :]).view(-1)

        base_anchors = torch.stack([-ws, -hs, ws, hs], dim=1) / 2
        return base_anchors.round()  # round to the nearest integer

    def grid_anchors(self, feature_map_size, strides):
        # # type: (torch.Size([int, int]), List[Tensor, Tensor]) -> Tensor
        cell_anchors = self.cell_anchors

        grid_height, grid_width = feature_map_size
        stride_height, stride_width = strides
        device = cell_anchors[0].device

        shifts_x = torch.arange(0, grid_width, dtype=torch.float32, device=device) * stride_width
        shifts_y = torch.arange(0, grid_height, dtype=torch.float32, device=device) * stride_height

        shift_y, shift_x = torch.meshgrid([shifts_y, shifts_x], indexing='ij')
        shift_x = shift_x.reshape(-1)
        shift_y = shift_y.reshape(-1)

        shifts = torch.stack([shift_x, shift_y, shift_x, shift_y], dim=1)
        shifts_anchor = shifts.view(-1, 1, 4) + cell_anchors[0].view(1, -1, 4)
        return shifts_anchor.reshape(-1, 4)  # Tensor of shape (all_num_anchors, 4)