pixelNeRF Code Reading

2025/2/23 23:01:38

Input: the images have a resolution of 300×400; each scene has 49 training images.

SB: scene batch, the number of scenes per training step; 4
NV: number of input views per scene, i.e. the number of images; 49

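Each ray is first sampled at 64 coarse points. One training step uses 4 scenes, and 128 rays are randomly selected per scene (drawn from all images of that scene), so the ray tensor has shape (4, 128, 8).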

## Reshape the rays to (512, 8)
rays = rays.reshape(-1, 8)  # (SB * B, 8)

## Sample 64 points along each ray
z_coarse = self.sample_coarse(rays)  # (B, Kc)
coarse_composite = self.composite(
   model, rays, z_coarse, coarse=True, sb=superbatch_size,
)
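The reshape and coarse sampling steps above can be sketched as follows. This is a minimal illustration, not the repository's code: NumPy stands in for PyTorch so the snippet runs on its own, `sample_coarse_sketch` is a hypothetical name, and both the ray layout (origin, direction, near, far) and the stratified sampling scheme are assumptions about what `sample_coarse` does.

```python
import numpy as np

def sample_coarse_sketch(rays, n_coarse=64, rng=None):
    """Stratified depth sampling between each ray's near and far bounds.

    rays: (B, 8) array, ASSUMED layout [origin(3), direction(3), near(1), far(1)].
    Returns depth values of shape (B, n_coarse), increasing along each ray.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    near, far = rays[:, 6:7], rays[:, 7:8]             # (B, 1) each
    step = 1.0 / n_coarse
    # Left edge of each of the n_coarse equal-width bins in [0, 1)
    bins = np.linspace(0.0, 1.0 - step, n_coarse)      # (n_coarse,)
    # Jitter every sample uniformly inside its own bin
    t = bins[None, :] + rng.random((rays.shape[0], n_coarse)) * step
    # Interpolate linearly between near and far
    return near * (1.0 - t) + far * t                  # (B, n_coarse)

SB, B = 4, 128                                         # scenes, rays per scene
rays = np.zeros((SB, B, 8))
rays[..., 5] = 1.0                                     # direction along +z (illustrative)
rays[..., 6] = 0.5                                     # near bound (illustrative)
rays[..., 7] = 2.0                                     # far bound (illustrative)

rays = rays.reshape(-1, 8)                             # (SB * B, 8) = (512, 8)
z_coarse = sample_coarse_sketch(rays)                  # (512, 64)
print(rays.shape, z_coarse.shape)
```

Jittering inside fixed bins keeps the samples ordered along the ray while still covering the whole near-to-far interval over many iterations.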

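Inside the composite function, the z values are used to generate the sample points. points has shape (4, 8192, 3): there are 4 scenes, and each scene has 8192 sample points (128 rays × 64 samples per ray).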
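To make the shape bookkeeping concrete, here is a small sketch of how depth values could be turned into 3D sample points. It assumes each ray stores its origin in columns 0 to 2 and its direction in columns 3 to 5, and it uses NumPy in place of PyTorch; the depth values are placeholders.

```python
import numpy as np

SB, B, K = 4, 128, 64                                  # scenes, rays per scene, samples per ray
rays = np.zeros((SB * B, 8))                           # ASSUMED layout: origin, direction, near, far
rays[:, 5] = 1.0                                       # unit direction along +z (illustrative)
z_samp = np.tile(np.linspace(0.5, 2.0, K), (SB * B, 1))  # (512, 64) placeholder depths

origins = rays[:, None, :3]                            # (512, 1, 3)
dirs = rays[:, None, 3:6]                              # (512, 1, 3)
# point = origin + depth * direction, broadcast over the K samples
points = origins + z_samp[..., None] * dirs            # (512, 64, 3)
# Group the points by scene: 128 rays * 64 samples = 8192 points per scene
points = points.reshape(SB, B * K, 3)                  # (4, 8192, 3)
print(points.shape)
```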

The forward function in model.py is where the pixelNeRF method from the paper is actually implemented.

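The input is the coordinates of the sample points in the world frame. Each point is transformed from the world frame into the camera frame of every input view (NS views per scene in the code) and is then projected onto that view's image. The transform is: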

     xyz = repeat_interleave(xyz, NS)  # (SB*NS, B, 3)
     xyz_rot = torch.matmul(self.poses[:, None, :3, :3], xyz.unsqueeze(-1))[
         ..., 0
     ]
     xyz = xyz_rot + self.poses[:, None, :3, 3]
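The transform above, followed by a pinhole projection, can be sketched in NumPy as below. This is an illustration under assumptions: `world_to_camera` and `project` are hypothetical helper names, the poses are taken to be world-to-camera matrices, and the sign convention (camera looking down the negative z axis) as well as the focal length and principal point values are assumed for the example.

```python
import numpy as np

def world_to_camera(xyz, poses):
    """xyz: (N, B, 3) world points; poses: (N, 4, 4) world-to-camera matrices."""
    R = poses[:, None, :3, :3]                         # (N, 1, 3, 3) rotations
    t = poses[:, None, :3, 3]                          # (N, 1, 3) translations
    xyz_rot = np.matmul(R, xyz[..., None])[..., 0]     # rotate each point: (N, B, 3)
    return xyz_rot + t                                 # then translate

def project(xyz_cam, focal, c):
    """Pinhole projection; ASSUMES the camera looks down -z."""
    uv = -xyz_cam[..., :2] / xyz_cam[..., 2:]          # perspective divide
    return uv * focal + c                              # shift to pixel coordinates

SB, NS, B = 1, 2, 3                                    # scenes, views per scene, points
xyz_world = np.array([[[0.0, 0.0, -2.0],
                       [1.0, 0.0, -2.0],
                       [0.0, 1.0, -4.0]]])             # (SB, B, 3)
poses = np.tile(np.eye(4), (SB * NS, 1, 1))            # view 0: identity pose
poses[1, :3, 3] = [0.0, 0.0, -1.0]                     # view 1: shifted along z

# Copy each scene's points once per view, like repeat_interleave in the original
xyz = np.repeat(xyz_world, NS, axis=0)                 # (SB * NS, B, 3)
xyz_cam = world_to_camera(xyz, poses)
uv = project(xyz_cam, focal=100.0, c=np.array([200.0, 150.0]))
print(uv.shape)
```

The same world point lands at different pixel coordinates in each view, which is what lets every view contribute its own feature for that point.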

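Because the world-frame points come from rays generated across all of the images, rather than from the rays of one particular image, these sample points can be projected onto the 2D images to query features.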

     latent = self.encoder.index(
         uv, None, self.image_shape
     )  # (SB * NS, latent, B)
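Querying a feature map at non-integer pixel coordinates requires interpolation. The sketch below shows plain bilinear interpolation in NumPy for a single view. It is an assumption about the kind of lookup `encoder.index` performs, not its actual implementation, and `bilinear_index` is a hypothetical name.

```python
import numpy as np

def bilinear_index(feat, uv):
    """Bilinearly sample a feature map at continuous pixel coordinates.

    feat: (C, H, W) feature map of one view.
    uv:   (B, 2) coordinates as (x, y); integer values hit pixel centers.
    Returns (C, B) features. Coordinates are clamped to the image border.
    """
    C, H, W = feat.shape
    x = np.clip(uv[:, 0], 0, W - 1)
    y = np.clip(uv[:, 1], 0, H - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)                     # right neighbor, clamped
    y1 = np.minimum(y0 + 1, H - 1)                     # lower neighbor, clamped
    wx = x - x0                                        # horizontal fraction
    wy = y - y0                                        # vertical fraction
    top = feat[:, y0, x0] * (1 - wx) + feat[:, y0, x1] * wx
    bottom = feat[:, y1, x0] * (1 - wx) + feat[:, y1, x1] * wx
    return top * (1 - wy) + bottom * wy                # (C, B)

feat = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)   # (C=2, H=3, W=4)
uv = np.array([[0.0, 0.0], [1.5, 0.5], [3.0, 2.0]])         # (B=3, 2)
latent = bilinear_index(feat, uv)                            # (2, 3)
print(latent.shape)
```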

Once the queried features are available, the network regresses density and color from them.

Appendix: https://openaccess.thecvf.com/content/CVPR2021/supplemental/Yu_pixelNeRF_Neural_Radiance_CVPR_2021_supplemental.pdf

def forward(self, zx, combine_inner_dims=(1,), combine_index=None, dim_size=None):
        """
        :param zx (..., d_latent + d_in)
        :param combine_inner_dims Combining dimensions for use with multiview inputs.
        Tensor will be reshaped to (-1, combine_inner_dims, ...) and reduced using combine_type
        on dim 1, at combine_layer
        """
        with profiler.record_function("resnetfc_infer"):
            assert zx.size(-1) == self.d_latent + self.d_in
            if self.d_latent > 0:
                z = zx[..., : self.d_latent]
                x = zx[..., self.d_latent :]
            else:
                x = zx
            if self.d_in > 0:
                x = self.lin_in(x)  ## linear layer on the positional encoding; input: 42, output: 512
            else:
                x = torch.zeros(self.d_hidden, device=zx.device)

            ## combine_layer = 3; in the Appendix the image feature is injected into the first 3 ResBlocks (ResBlock 3×)
            for blkid in range(self.n_blocks):
                if blkid == self.combine_layer:
                    ## A scene may have multiple views; after 3 ResNet blocks, the features of the different views are averaged
                    ## https://openaccess.thecvf.com/content/CVPR2021/supplemental/Yu_pixelNeRF_Neural_Radiance_CVPR_2021_supplemental.pdf
                    x = util.combine_interleaved(
                        x, combine_inner_dims, self.combine_type
                    )

                if self.d_latent > 0 and blkid < self.combine_layer:
                    tz = self.lin_z[blkid](z)  ## linear layer applied to the image feature
                    if self.use_spade:
                        sz = self.scale_z[blkid](z)
                        x = sz * x + tz
                    else:
                        x = x + tz

                x = self.blocks[blkid](x)
            out = self.lin_out(self.activation(x))
            return out

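The overall idea is that the image feature and the feature obtained from positional encoding are combined by an MLP. The combination is repeated 3 times, and the image feature is injected each time. If more than one image of a scene takes part in training, combine_interleaved averages the features across the views. The network then regresses rgb and density.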
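The flow described above can be sketched end to end with random weights. This is a toy illustration in NumPy, not the repository's ResnetFC: the helper names, the small layer sizes, the choice of 5 blocks, and the exact form of the residual block are all assumptions made for the example. Only the order of operations (inject the image feature in the first 3 blocks, then average over views, then continue) follows the forward function shown earlier.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(d_in, d_out):
    """Random affine map returned as a closure (stand-in for a linear layer)."""
    W = rng.normal(scale=0.1, size=(d_in, d_out))
    b = np.zeros(d_out)
    return lambda x: x @ W + b

def res_block(d):
    """ASSUMED residual block form: x + fc1(relu(fc0(relu(x))))."""
    fc0, fc1 = linear(d, d), linear(d, d)
    relu = lambda v: np.maximum(v, 0.0)
    return lambda x: x + fc1(relu(fc0(relu(x))))

def combine_interleaved_sketch(x, num_views, combine_type="average"):
    """Reduce per-view features (SB * NS, B, D) to per-scene features (SB, B, D)."""
    if num_views == 1:
        return x
    x = x.reshape(-1, num_views, *x.shape[1:])         # (SB, NS, B, D)
    if combine_type == "average":
        return x.mean(axis=1)
    if combine_type == "max":
        return x.max(axis=1)
    raise NotImplementedError(combine_type)

SB, NS, B = 2, 3, 5                                    # scenes, views, points per scene
d_in, d_latent, d_hidden, d_out = 42, 16, 32, 4        # toy sizes, except d_in = 42
n_blocks, combine_layer = 5, 3                         # n_blocks is an assumption

lin_in = linear(d_in, d_hidden)
lin_z = [linear(d_latent, d_hidden) for _ in range(combine_layer)]
blocks = [res_block(d_hidden) for _ in range(n_blocks)]
lin_out = linear(d_hidden, d_out)

z = rng.normal(size=(SB * NS, B, d_latent))            # image features per view
pe = rng.normal(size=(SB * NS, B, d_in))               # positional encodings per view

x = lin_in(pe)
for blkid in range(n_blocks):
    if blkid == combine_layer:
        x = combine_interleaved_sketch(x, NS)          # (SB, B, d_hidden)
    if blkid < combine_layer:
        x = x + lin_z[blkid](z)                        # inject the image feature
    x = blocks[blkid](x)
out = lin_out(np.maximum(x, 0.0))                      # (SB, B, 4): rgb + density
print(out.shape)
```

Averaging after the third block means the early blocks process each view independently, while the later blocks see a single view-independent feature per point.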