[MIT-BEVFusion Code Walkthrough] Part 2: The LiDAR Encoder

Contents

1. Voxelization
2. backbone
  - 2.1 Introduction to sparse convolution
  - 2.2 SparseEncoder
    - (1) Inputs, outputs, and parameters
    - (2) Pipeline

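Other articles in the BEVFusion series:

[Paper Reading] ICRA 2023 | BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation
MIT-BEVFusion training environment setup and troubleshooting notes
[MIT-BEVFusion Code Walkthrough] Part 1: Overall structure and config parameters
[MIT-BEVFusion Code Walkthrough] Part 2: The LiDAR Encoder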

The encoder has two branches, LiDAR and camera. This article covers the LiDAR encoder, which consists of two parts: voxelize and backbone. The backbone used here is SparseEncoder.

1. Voxelization

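In train.py, the model is built with build_model, which uses the registry to create a BEVFusion instance according to the type field. Here we focus on the LiDAR encoder.

LiDAR voxelization comes in two flavors, hard and dynamic; we use hard voxelization here. Once the voxelization module is created, the LiDAR encoders are assembled from voxelize and backbone.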

if encoders.get("lidar") is not None:
    if encoders["lidar"]["voxelize"].get("max_num_points", -1) > 0:
        # hard voxelization
        voxelize_module = Voxelization(**encoders["lidar"]["voxelize"])
    else:
        # dynamic voxelization
        voxelize_module = DynamicScatter(**encoders["lidar"]["voxelize"])
    # build the LiDAR encoders from voxelize and backbone
    self.encoders["lidar"] = nn.ModuleDict(
        {
            "voxelize": voxelize_module,
            "backbone": build_backbone(encoders["lidar"]["backbone"]),
        }
    )
    self.voxelize_reduce = encoders["lidar"].get("voxelize_reduce", True)

Let's first look at the voxelization parameters. They should be easy to follow if you are familiar with LiDAR voxelization.

# maximum number of points kept in a single voxel
max_num_points: 10
# voxel size [x, y, z], in meters
voxel_size: [0.075, 0.075, 0.2]
# point cloud range [x_min, y_min, z_min, x_max, y_max, z_max], in meters
point_cloud_range: [-54.0, -54.0, -5.0, 54.0, 54.0, 3.0]
# maximum number of voxels for (training, testing)
max_voxels: [120000, 160000]
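To make these numbers concrete, here is a small sketch that derives the voxel grid resolution from point_cloud_range and voxel_size. The helper name grid_size is hypothetical and is not part of the BEVFusion code base; only the config values come from the article.

```python
# Config values copied from the voxelization parameters above.
voxel_size = [0.075, 0.075, 0.2]
point_cloud_range = [-54.0, -54.0, -5.0, 54.0, 54.0, 3.0]


def grid_size(pc_range, vsize):
    """Number of voxels along x, y, z (hypothetical helper for illustration)."""
    mins, maxs = pc_range[:3], pc_range[3:]
    # round() guards against floating-point error in the division
    return [round((hi - lo) / v) for lo, hi, v in zip(mins, maxs, vsize)]


grid = grid_size(point_cloud_range, voxel_size)
total_cells = grid[0] * grid[1] * grid[2]
print(grid)         # [1440, 1440, 40]
print(total_cells)  # 82944000
```

With max_voxels set to 120000 for training, fewer than 0.2% of the roughly 83 million grid cells can ever be occupied, which is why the backbone described later works on sparse data.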

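LiDAR voxelization is done by hard_voxelize, which is implemented in C++. We skip its internals and only look at how it is called.

For one frame, the input points has shape [236137, 5]. The first dimension is the number of points and the second holds the per-point attributes [x, y, z, intensity, timestamp_diff], where timestamp_diff is the timestamp difference.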

class _Voxelization(Function):
    @staticmethod
    def forward(
        ctx, points, voxel_size, coors_range, max_points=35, max_voxels=20000, deterministic=True
    ):
        # choose the voxelization method
        if max_points == -1 or max_voxels == -1:
            coors = points.new_zeros(size=(points.size(0), 3), dtype=torch.int)
            dynamic_voxelize(points, coors, voxel_size, coors_range, 3)
            return coors
        else:
            # voxels.shape = [120000, 10, 5]
            voxels = points.new_zeros(size=(max_voxels, max_points, points.size(1)))
            # coors.shape = [120000, 3]
            coors = points.new_zeros(size=(max_voxels, 3), dtype=torch.int)
            # num_points_per_voxel.shape = [120000]
            num_points_per_voxel = points.new_zeros(size=(max_voxels,), dtype=torch.int)
            # deterministic=True
            voxel_num = hard_voxelize(
                points,
                voxels,
                coors,
                num_points_per_voxel,
                voxel_size,
                coors_range,
                max_points,
                max_voxels,
                3,
                deterministic,
            )
            # select the valid voxels
            voxels_out = voxels[:voxel_num]
            coors_out = coors[:voxel_num]
            num_points_per_voxel_out = num_points_per_voxel[:voxel_num]
            return voxels_out, coors_out, num_points_per_voxel_out
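Since the C++ op is not covered here, the following pure-Python sketch illustrates what hard voxelization does conceptually: at most max_points points are kept per voxel and at most max_voxels voxels per frame, and everything else is dropped. This is an illustration under my own assumptions (function name hard_voxelize_py, coordinates stored in (x, y, z) order, points processed in input order), not a port of the actual hard_voxelize implementation.

```python
import math


def hard_voxelize_py(points, voxel_size, coors_range, max_points, max_voxels):
    """Illustrative hard voxelization in pure Python (not the real C++ op)."""
    grid = [round((coors_range[i + 3] - coors_range[i]) / voxel_size[i]) for i in range(3)]
    voxel_index = {}  # (ix, iy, iz) -> row in `voxels`
    voxels, coors, num_points = [], [], []
    for p in points:
        idx = tuple(math.floor((p[i] - coors_range[i]) / voxel_size[i]) for i in range(3))
        if any(c < 0 or c >= g for c, g in zip(idx, grid)):
            continue  # point lies outside the configured range
        row = voxel_index.get(idx)
        if row is None:
            if len(voxels) >= max_voxels:
                continue  # voxel budget exhausted, drop the point
            row = len(voxels)
            voxel_index[idx] = row
            voxels.append([])
            coors.append(idx)
            num_points.append(0)
        if num_points[row] < max_points:
            voxels[row].append(list(p))
            num_points[row] += 1
    # zero-pad every voxel to max_points rows, like the preallocated buffer
    n_feat = len(points[0]) if points else 0
    for v in voxels:
        v.extend([[0.0] * n_feat for _ in range(max_points - len(v))])
    return voxels, coors, num_points


pts = [
    (0.5, 0.5, 0.5, 1.0, 0.0),  # voxel (0, 0, 0)
    (0.6, 0.4, 0.2, 2.0, 0.0),  # voxel (0, 0, 0), now full
    (0.1, 0.9, 0.3, 3.0, 0.0),  # dropped: voxel already holds max_points
    (2.5, 1.5, 0.5, 4.0, 0.0),  # voxel (2, 1, 0)
    (3.5, 3.5, 1.5, 5.0, 0.0),  # dropped: max_voxels reached
    (5.0, 0.0, 0.0, 6.0, 0.0),  # dropped: outside the range
]
voxels, coors, num_points = hard_voxelize_py(
    pts, voxel_size=[1.0, 1.0, 1.0], coors_range=[0, 0, 0, 4, 4, 2],
    max_points=2, max_voxels=2,
)
print(coors)       # [(0, 0, 0), (2, 1, 0)]
print(num_points)  # [2, 1]
```

The three return values play the same roles as voxels_out, coors_out, and num_points_per_voxel_out in the excerpt above.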

2. backbone

2.1 Introduction to sparse convolution

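Sparse convolution is widely used in 3D tasks such as 3D point cloud segmentation. Point cloud data is sparse, so running a standard dense convolution over it wastes most of the computation on empty space. Likewise, in 2D tasks where only a subset of pixels needs processing, sparse convolution speeds up the model. In essence, it builds a hash table of the active positions and stores results only for those positions.

Mathematically, sparse convolution is the same operation as ordinary convolution. The difference lies in how the data is stored and how the computation is carried out, which makes it much more efficient on sparse point clouds. (SubMConv3d differs slightly: an output is computed only where the kernel center covers an active site.) The one extra argument is indice_key, which lets layers that share the same indices reuse the precomputed rulebook and hash table and so avoid redundant work.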

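The backbone that the registry builds from type is SparseEncoder. Two 3D sparse convolutions are commonly used, SparseConv3d (sparse convolution) and SubMConv3d (submanifold convolution):

Sparse convolution: regular output definition. As in ordinary convolution, an output site is computed whenever the kernel covers at least one active input site. Paper: SECOND: Sparsely Embedded Convolutional Detection
Submanifold convolution: submanifold output definition. An output is computed only when the kernel center covers an active input site. Paper: 3D Semantic Segmentation with Submanifold Sparse Convolutional Networks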
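The two output definitions can be compared on a tiny 2D example. The sketch below is a simplification under my own assumptions (stride 1, a 3x3 kernel, grid boundaries ignored, hypothetical function names) and does not use spconv. It only shows which output sites become active.

```python
from itertools import product


def regular_output_sites(active, kernel=3):
    """Regular definition: every site whose kernel window touches an active input."""
    r = kernel // 2
    offsets = list(product(range(-r, r + 1), repeat=2))
    return {(x + dx, y + dy) for (x, y) in active for (dx, dy) in offsets}


def submanifold_output_sites(active, kernel=3):
    """Submanifold definition: only sites where the kernel center is active."""
    return set(active)


active = {(2, 2), (5, 5)}
reg = regular_output_sites(active)
sub = submanifold_output_sites(active)
print(len(reg), len(sub))  # 18 2
```

The regular definition dilates the active set with every layer, while the submanifold definition keeps it unchanged, which is why stacks of SubMConv3d layers preserve sparsity.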

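Sparse convolution is efficient because it computes the convolution only for the non-zero elements (pixels or voxels) rather than for every element. Instead of a sliding window, it performs all the atomic operations listed in the rulebook.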
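As a rough illustration of the rulebook idea, the sketch below builds one for a 2D submanifold convolution and then applies it. It assumes a single feature channel and scalar kernel weights to stay short; real implementations gather features, run a matrix multiply per kernel offset, and scatter-add the results on the GPU. The names build_rulebook and apply_rulebook are hypothetical.

```python
from itertools import product


def build_rulebook(in_coords, out_coords, offsets):
    """Map each kernel offset to a list of (input_row, output_row) pairs."""
    in_hash = {c: i for i, c in enumerate(in_coords)}  # coordinate -> row (the hash table)
    rulebook = {off: [] for off in offsets}
    for j, (ox, oy) in enumerate(out_coords):
        for (dx, dy) in offsets:
            i = in_hash.get((ox + dx, oy + dy))
            if i is not None:
                rulebook[(dx, dy)].append((i, j))
    return rulebook


def apply_rulebook(rulebook, in_feats, weights, n_out):
    """Execute every atomic multiply-add listed in the rulebook."""
    out = [0.0] * n_out
    for off, pairs in rulebook.items():
        w = weights[off]
        for i, j in pairs:
            out[j] += w * in_feats[i]
    return out


offsets = list(product((-1, 0, 1), repeat=2))
coords = [(0, 0), (0, 1), (3, 3)]        # active sites (submanifold: outputs = inputs)
feats = [1.0, 2.0, 4.0]
weights = {off: 1.0 for off in offsets}  # all-ones kernel for easy checking

rulebook = build_rulebook(coords, coords, offsets)
n_ops = sum(len(pairs) for pairs in rulebook.values())
out = apply_rulebook(rulebook, feats, weights, len(coords))
print(n_ops)  # 5
print(out)    # [3.0, 3.0, 4.0]
```

Only 5 multiply-adds are executed here. Because the rulebook depends only on the coordinates, layers that share the same indices can reuse it, which is the purpose of indice_key.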

2.2 SparseEncoder

The backbone of the LiDAR branch in BEVFusion is the SparseEncoder class, which is built on spconv's sparse convolution layers.

(1) Inputs, outputs, and parameters

Parameters

order: ("conv", "norm", "act"), the order of the sub-layers in a sparse convolution module. make_sparse_convmodule creates the sub-layers in this order. The three main modules, self.conv_input, self.encoder_layers, and self.conv_out, are all built with make_sparse_convmodule.
block_type: either conv_module (the default) or basicblock. BEVFusion uses basicblock.
conv_type: the kind of convolution, either SubMConv3d or SparseConv3d.
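The effect of order can be sketched as follows. This is not the real make_sparse_convmodule; plan_sparse_convmodule is a hypothetical stand-in that only returns layer names, assuming the norm layer is BatchNorm1d and the activation is ReLU, as in the module printouts later in this article.

```python
def plan_sparse_convmodule(conv_type, out_channels, order=("conv", "norm", "act")):
    """Hypothetical helper: list the sub-layers in the sequence given by `order`."""
    plan = []
    for step in order:
        if step == "conv":
            plan.append(conv_type)
        elif step == "norm":
            plan.append("BatchNorm1d(%d)" % out_channels)
        elif step == "act":
            plan.append("ReLU")
        else:
            raise ValueError("unknown step: %r" % step)
    return plan


print(plan_sparse_convmodule("SubMConv3d", 16))
# ['SubMConv3d', 'BatchNorm1d(16)', 'ReLU']
```

The result lists the same three sub-layers, in the same order, as the conv_input module.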

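Input

Features: [N, num_channels]
Indices: [N, (batch_idx, x, y, z)], where batch_idx is the index of the sample within the batch and x, y, z are indices in the voxel grid.

In the code, the inputs look like this: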

# voxel features after voxelization; each row is [x, y, z, i, ts]
voxel_features.shape = [nums, 5]
# coordinates of each voxel, [batch_idx, x, y, z]
# batch_idx is the index of the sample in the batch; x, y, z are the voxel indices.
coors.shape = [nums, 4]
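The step from the voxelization output ([M, max_points, 5] voxels and [M, 3] coordinates) to these two inputs can be sketched in pure Python. The sketch assumes that voxelize_reduce=True averages the valid points of each voxel and that the batch index is prepended to every coordinate. That is my reading of the flag and of the shapes above, not something this article states, and both function names are hypothetical.

```python
def reduce_voxels(voxels, num_points):
    """Average the valid points of each voxel: [M, max_points, F] -> [M, F]."""
    feats = []
    for v, n in zip(voxels, num_points):
        n_feat = len(v[0])
        # padded rows are all zeros, so summing every row and dividing by the
        # number of valid points gives the mean over the valid points only
        feats.append([sum(row[k] for row in v) / n for k in range(n_feat)])
    return feats


def add_batch_index(coors, batch_idx):
    """Prepend the sample index: (x, y, z) -> (batch_idx, x, y, z)."""
    return [(batch_idx, *c) for c in coors]


voxels = [
    [[1.0, 2.0, 0.0, 0.5, 0.0], [3.0, 4.0, 2.0, 1.5, 0.0]],  # 2 valid points
    [[2.0, 2.0, 2.0, 1.0, 0.1], [0.0, 0.0, 0.0, 0.0, 0.0]],  # 1 valid point + padding
]
num_points = [2, 1]
coors = [(0, 0, 0), (2, 1, 0)]

voxel_features = reduce_voxels(voxels, num_points)
batched_coors = add_batch_index(coors, batch_idx=0)
print(voxel_features)  # [[2.0, 3.0, 1.0, 1.0, 0.0], [2.0, 2.0, 2.0, 1.0, 0.1]]
print(batched_coors)   # [(0, 0, 0, 0), (0, 2, 1, 0)]
```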

Output

The final output is the BEV feature map of shape (N, C * D, H, W).

# Output feature map size, here with N = 4
# N: batch size
# C: channels, D: depth
# H, W: feature map height and width
# (N, C, D, H, W) = (4, 128, 2, 180, 180)
spatial_features.shape = (N, C * D, H, W) = [4, 256, 180, 180]
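Folding the depth axis into the channel axis can be illustrated with NumPy. This is a minimal sketch that assumes a dense tensor already laid out as (N, C, D, H, W); it uses N = 1 instead of 4 to keep memory small and is not the SparseEncoder code itself.

```python
import numpy as np

N, C, D, H, W = 1, 128, 2, 180, 180  # the article uses N = 4
dense = np.zeros((N, C, D, H, W), dtype=np.float32)
dense[0, 5, 1, 10, 20] = 1.0          # mark one (channel, depth) cell

# merge the channel and depth axes into a single BEV channel axis
bev = dense.reshape(N, C * D, H, W)
print(bev.shape)                      # (1, 256, 180, 180)
print(bev[0, 5 * D + 1, 10, 20])      # 1.0
```

Channel c at depth d ends up at BEV channel c * D + d, so no information is discarded; the height dimension is simply stacked into the channels.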

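(2) Pipeline

The SparseEncoder pipeline is shown in the figure below; encoder_layers here uses basicblock.

[Figure: SparseEncoder pipeline]

The detailed structures of conv_input, encoder_layers, and conv_out are listed below.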

conv_input

SparseSequential(
  (0): SubMConv3d()
  (1): BatchNorm1d(16, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
  (2): ReLU(inplace=True)
)

encoder_layers

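encoder_layers is built from self.encoder_channels = [[16, 16, 32], [32, 32, 64], [64, 64, 128], [128, 128]], giving 4 layers in total. Each of the first three layers consists of two SparseBasicBlock modules followed by a SparseConv3d module; the last layer has only the two SparseBasicBlock modules. The full structure is shown below.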

SparseSequential(
  (encoder_layer1): SparseSequential(
    (0): SparseBasicBlock(
      (conv1): SubMConv3d()
      (bn1): BatchNorm1d(16, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
      (conv2): SubMConv3d()
      (bn2): BatchNorm1d(16, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
    )
    (1): SparseBasicBlock(
      (conv1): SubMConv3d()
      (bn1): BatchNorm1d(16, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
      (conv2): SubMConv3d()
      (bn2): BatchNorm1d(16, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
    )
    (2): SparseSequential(
      (0): SparseConv3d()
      (1): BatchNorm1d(32, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
      (2): ReLU(inplace=True)
    )
  )
  (encoder_layer2): SparseSequential(
    (0): SparseBasicBlock(
      (conv1): SubMConv3d()
      (bn1): BatchNorm1d(32, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
      (conv2): SubMConv3d()
      (bn2): BatchNorm1d(32, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
    )
    (1): SparseBasicBlock(
      (conv1): SubMConv3d()
      (bn1): BatchNorm1d(32, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
      (conv2): SubMConv3d()
      (bn2): BatchNorm1d(32, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
    )
    (2): SparseSequential(
      (0): SparseConv3d()
      (1): BatchNorm1d(64, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
      (2): ReLU(inplace=True)
    )
  )
  (encoder_layer3): SparseSequential(
    (0): SparseBasicBlock(
      (conv1): SubMConv3d()
      (bn1): BatchNorm1d(64, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
      (conv2): SubMConv3d()
      (bn2): BatchNorm1d(64, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
    )
    (1): SparseBasicBlock(
      (conv1): SubMConv3d()
      (bn1): BatchNorm1d(64, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
      (conv2): SubMConv3d()
      (bn2): BatchNorm1d(64, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
    )
    (2): SparseSequential(
      (0): SparseConv3d()
      (1): BatchNorm1d(128, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
      (2): ReLU(inplace=True)
    )
  )
  (encoder_layer4): SparseSequential(
    (0): SparseBasicBlock(
      (conv1): SubMConv3d()
      (bn1): BatchNorm1d(128, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
      (conv2): SubMConv3d()
      (bn2): BatchNorm1d(128, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
    )
    (1): SparseBasicBlock(
      (conv1): SubMConv3d()
      (bn1): BatchNorm1d(128, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
      (conv2): SubMConv3d()
      (bn2): BatchNorm1d(128, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
    )
  )
)
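The printed structure above can be reproduced from encoder_channels with a simple rule: every entry is a SparseBasicBlock except the last entry of each non-final layer, which is a SparseConv3d. The sketch below encodes that rule and then checks the BEV size. The rule is my reading of the printout, the helper names are hypothetical, and the kernel size 3, stride 2, and padding 1 along x and y are assumptions chosen because they are consistent with H = W = 180.

```python
encoder_channels = [[16, 16, 32], [32, 32, 64], [64, 64, 128], [128, 128]]


def describe_stages(channels):
    """List (module_name, out_channels) per layer, following the printout."""
    stages = []
    last = len(channels) - 1
    for i, stage in enumerate(channels):
        names = []
        for j, c in enumerate(stage):
            is_downsample = (i != last) and (j == len(stage) - 1)
            names.append(("SparseConv3d" if is_downsample else "SparseBasicBlock", c))
        stages.append(names)
    return stages


def conv_out_size(n, kernel=3, stride=2, padding=1):
    """Standard convolution output size formula (parameters are assumptions)."""
    return (n + 2 * padding - kernel) // stride + 1


stages = describe_stages(encoder_channels)
n_down = sum(1 for s in stages for name, _ in s if name == "SparseConv3d")

size = 1440  # x/y grid resolution: (54 - (-54)) / 0.075
for _ in range(n_down):
    size = conv_out_size(size)
print(n_down, size)  # 3 180
```

Three stride-2 downsampling steps reduce the 1440 x 1440 grid by a factor of 8, which matches the 180 x 180 output feature map.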

conv_out

SparseSequential(
  (0): SparseConv3d()
  (1): BatchNorm1d(128, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
  (2): ReLU(inplace=True)
)