Understanding the NeRF Paper and Code

news2025/7/16 1:21:25

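Generative AI (AIGC) has developed rapidly in recent years, and new generative models keep appearing, but I think generative AI grounded in physical meaning will be the truly revolutionary technology. So I have been using my spare time to read up on 3D reconstruction, and this post records my understanding of NeRF.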

I. Understanding the paper

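First, what are NeRF's input and output?

Anyone working in CV will assume the input must be an image. In fact the images are only used to supervise the loss. The input to the network is a 5D vector: the spatial position xyz plus the viewing angles theta and phi.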

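(1) The position xyz and angles theta and phi of what?

Of particles. NeRF assumes that the particles in space emit light themselves, so an image is rendered from these self-luminous particles.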

(2) What does the network look like?

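(3) How are the particles obtained?

Each pixel position is converted to normalized camera coordinates and then to a shared world coordinate system, which gives one ray per pixel. Along each ray, 64 points are sampled uniformly in the depth range 2 to 6, and these points are the particles. The camera-to-world transform, i.e. the pose, is obtained with SfM (Structure from Motion); the classic tool is COLMAP.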
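The sampling step can be sketched in a few lines of NumPy. This is only an illustration under the numbers quoted above (near 2, far 6, 64 samples); the function name `sample_along_rays` and the two example rays are made up, and the random jitter that the paper applies to the samples during training is left out.

```python
import numpy as np

def sample_along_rays(origins, dirs, near=2.0, far=6.0, n_samples=64):
    """Return depths (n_samples,) and points (n_rays, n_samples, 3)."""
    z_vals = np.linspace(near, far, n_samples)  # evenly spaced depths
    # point = origin + depth * direction, broadcast over rays and samples
    points = origins[:, None, :] + z_vals[None, :, None] * dirs[:, None, :]
    return z_vals, points

# Two hypothetical rays that start at the world origin and look roughly down -z.
origins = np.zeros((2, 3))
dirs = np.array([[0.0, 0.0, -1.0],
                 [0.1, 0.0, -1.0]])
z_vals, points = sample_along_rays(origins, dirs)
print(points.shape)             # (2, 64, 3)
print(z_vals[1] - z_vals[0])    # spacing (6 - 2) / 63
```

Each of the 2 x 64 points, together with its ray direction, is one "particle" that would be fed to the network.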

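(4) How does xyz become 63-dimensional?

The input xyz is lifted to higher frequencies with Fourier features (sines and cosines at increasing frequencies), much like the positional encoding in the Transformer. This helps the network tell apart small differences between inputs, which matters for spatial reconstruction: two adjacent pixels can correspond to points that are far apart in space. The paper uses 10 frequency bands, so the dimension is 10*2*3+3=63, where 2 stands for sin and cos, 3 is the number of xyz components, and the trailing +3 is the raw xyz kept alongside the encoding.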
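The dimension count is easy to verify with a small sketch. The function name `positional_encoding` is hypothetical, and the frequencies are assumed to be 1, 2, 4, ..., 2^(L-1); the paper writes an extra factor of pi inside the sines and cosines, which is omitted here.

```python
import numpy as np

def positional_encoding(x, num_freqs):
    """Map (..., D) inputs to (..., D + 2 * num_freqs * D) Fourier features."""
    features = [x]  # keep the raw input alongside the encoding
    for k in range(num_freqs):
        freq = 2.0 ** k  # assumed frequency schedule: 1, 2, 4, ...
        features.append(np.sin(freq * x))
        features.append(np.cos(freq * x))
    return np.concatenate(features, axis=-1)

xyz = np.random.rand(5, 3)  # five made-up 3D points
encoded = positional_encoding(xyz, num_freqs=10)
print(encoded.shape)  # (5, 63)
```

The same function applied to a 3D vector with `num_freqs=4` gives 3*2*4+3=27 dimensions.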

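(5) Why is the original input fed in again at the fifth layer?

The skip connection re-injects the encoded position so that deeper layers keep more accurate spatial information.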

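(6) What is the output sigma?

It is the volume density: the probability density that a ray is blocked by a particle at the corresponding sample point.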

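(7) Why is the viewing direction 3-dimensional, and how does it become 27-dimensional?

The network actually takes the direction as a 3D unit vector rather than as the two angles. It is encoded with 4 frequency bands, so 3*2*4+3=27.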

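(8) Why is the viewing direction only fed in near the end?

The density depends only on position, while the color depends on the viewing direction as well.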
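Points (5) to (8) can be summarized in a small PyTorch sketch. It assumes the layer sizes of the original paper (8 hidden layers of width 256 and a 128-wide color branch); the class name `TinyNeRF` and its argument names are invented for illustration and do not come from any particular repository.

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Sketch of the NeRF MLP: density from position, color from position + view."""

    def __init__(self, pos_dim=63, dir_dim=27, width=256, depth=8, skip_at=4):
        super().__init__()
        self.skip_at = skip_at  # index 4 is the fifth layer
        layers, in_dim = [], pos_dim
        for i in range(depth):
            layers.append(nn.Linear(in_dim, width))
            # the layer after the skip sees [hidden features, encoded position]
            in_dim = width + pos_dim if i == skip_at else width
        self.layers = nn.ModuleList(layers)
        self.sigma_head = nn.Linear(width, 1)      # density uses position only
        self.feature = nn.Linear(width, width)
        self.color_hidden = nn.Linear(width + dir_dim, width // 2)
        self.rgb_head = nn.Linear(width // 2, 3)

    def forward(self, pos_enc, dir_enc):
        h = pos_enc
        for i, layer in enumerate(self.layers):
            h = torch.relu(layer(h))
            if i == self.skip_at:
                h = torch.cat([h, pos_enc], dim=-1)  # re-inject the position
        sigma = torch.relu(self.sigma_head(h))       # non-negative density
        # the viewing direction only enters here, just before the color output
        h = torch.cat([self.feature(h), dir_enc], dim=-1)
        rgb = torch.sigmoid(self.rgb_head(torch.relu(self.color_hidden(h))))
        return rgb, sigma

model = TinyNeRF()
rgb, sigma = model(torch.rand(10, 63), torch.rand(10, 27))
print(rgb.shape, sigma.shape)  # torch.Size([10, 3]) torch.Size([10, 1])
```

Because `sigma` is computed before the direction is concatenated, the density cannot change with the viewpoint, which is exactly the constraint described in (8).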

(9) How is the final image rendered?

With volume rendering.

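(10) What is the core of NeRF?

The preprocessing that builds the network input, the volume rendering used as postprocessing, and hierarchical sampling.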

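(11) What does hierarchical sampling mean?

NeRF uses a coarse model and a fine model. The coarse model takes the uniform, evenly spaced samples described in (3). The fine model uses the weights output by the coarse model and draws 128 additional particles by inverse transform sampling, so that the samples concentrate on regions of space that are not empty. The 128+64=192 particles together are the input to the fine model.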
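Inverse transform sampling can be sketched as follows. This is a simplified version under stated assumptions: the coarse depths are used directly as bin edges and the weights are a made-up bump around depth 4, whereas real implementations typically build the bins from the midpoints of the coarse samples. The function name `sample_pdf` is chosen for illustration.

```python
import numpy as np

def sample_pdf(bin_edges, weights, n_samples, rng):
    """Draw depths from a piecewise-constant PDF over the given bins.

    bin_edges: (n_bins + 1,) increasing depths; weights: (n_bins,) non-negative.
    """
    pdf = (weights + 1e-5) / np.sum(weights + 1e-5)  # epsilon avoids empty bins
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])    # rises from 0 to 1
    u = rng.uniform(0.0, 1.0, size=n_samples)        # uniform draws in [0, 1)
    idx = np.searchsorted(cdf, u, side="right") - 1  # bin that contains each u
    idx = np.clip(idx, 0, len(weights) - 1)
    frac = (u - cdf[idx]) / (cdf[idx + 1] - cdf[idx])  # position inside the bin
    return bin_edges[idx] + frac * (bin_edges[idx + 1] - bin_edges[idx])

rng = np.random.default_rng(0)
coarse_z = np.linspace(2.0, 6.0, 64)                 # 64 coarse depths
centers = 0.5 * (coarse_z[1:] + coarse_z[:-1])
weights = np.exp(-0.5 * ((centers - 4.0) / 0.1) ** 2)  # hypothetical coarse weights
fine_z = sample_pdf(coarse_z, weights, 128, rng)     # 128 extra depths
all_z = np.sort(np.concatenate([coarse_z, fine_z]))  # 64 + 128 = 192 depths
print(all_z.shape)  # (192,)
```

Almost all of the 128 new depths land near depth 4, where the weights are large, so the fine model spends its capacity on the occupied part of the ray.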

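(12) Any other details?

To save GPU memory, the rays of a whole image are not pushed through the network in one go. Instead, the rays of 1024 randomly sampled pixels are fed to the network at a time.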
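A minimal sketch of this batching, assuming a 100 x 100 image whose rays and pixel colors are already flattened to one row per pixel. All array contents here are random stand-ins, and the variable names are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
height, width, batch_size = 100, 100, 1024  # hypothetical image size

ray_origins = np.zeros((height * width, 3))
ray_dirs = rng.normal(size=(height * width, 3))   # stand-in ray directions
pixels = rng.uniform(size=(height * width, 3))    # stand-in ground-truth RGB

# Pick 1024 distinct pixel indices; the same indices select rays and targets,
# so each predicted color is compared with the pixel that its ray came from.
select = rng.choice(height * width, size=batch_size, replace=False)
batch_origins = ray_origins[select]
batch_dirs = ray_dirs[select]
batch_rgb = pixels[select]
print(batch_dirs.shape, batch_rgb.shape)  # (1024, 3) (1024, 3)
```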

II. Understanding the code

The code below computes the origins and directions of the input rays.

    def initialize(self):

        warange = torch.arange(self.width,  dtype=torch.float32, device=self.device)
        harange = torch.arange(self.height, dtype=torch.float32, device=self.device)
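        # y, x both have shape (height, width); explicit indexing avoids the deprecation warning
        y, x = torch.meshgrid(harange, warange, indexing="ij")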

        self.transformed_x = (x - self.width * 0.5) / self.focal
        self.transformed_y = (y - self.height * 0.5) / self.focal  # normalized camera coordinates

        # pre center crop
        self.precrop_index = torch.arange(self.width * self.height).view(self.height, self.width)

        dH = int(self.height // 2 * self.precrop_frac)
        dW = int(self.width  // 2 * self.precrop_frac)
        self.precrop_index = self.precrop_index[
            self.height // 2 - dH:self.height // 2 + dH, 
            self.width  // 2 - dW:self.width  // 2 + dW
        ].reshape(-1)

        # torch.as_tensor places the data on self.device; the legacy FloatTensor constructor cannot target CUDA
        poses = torch.as_tensor(self.poses, dtype=torch.float32, device=self.device)
        all_ray_dirs, all_ray_origins = [], []

        for i in range(len(self.images)):
            ray_dirs, ray_origins = self.make_rays(self.transformed_x, self.transformed_y, poses[i])
            all_ray_dirs.append(ray_dirs)
            all_ray_origins.append(ray_origins)

        self.all_ray_dirs    = torch.stack(all_ray_dirs, dim=0)
        self.all_ray_origins = torch.stack(all_ray_origins, dim=0)
        # flatten every image to (height * width, 3) so that pixels line up with rays
        self.images          = torch.as_tensor(self.images, dtype=torch.float32, device=self.device).view(self.num_image, -1, 3)


    def make_rays(self, x, y, pose):

        # directions: (height, width, 3), e.g. 100 x 100 x 3
        # image y points down and the camera looks along -z, hence -y and -z
        directions    = torch.stack([x, -y, -torch.ones_like(x)], dim=-1)
        camera_matrix = pose[:3, :3]
        
        # rotate into world coordinates: (height * width, 3), e.g. 10000 x 3
        ray_dirs = directions.reshape(-1, 3) @ camera_matrix.T
        ray_origin = pose[:3, 3].view(1, 3).repeat(len(ray_dirs), 1)
        return ray_dirs, ray_origin  # direction (stands in for theta, phi) and origin (xyz) of every ray

The code below implements the volume rendering.

import torch


def predict_to_rgb(sigma, rgb, z_vals, raydirs, white_background=False):

    device         = sigma.device
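    delta_prefix   = z_vals[..., 1:] - z_vals[..., :-1]  # spacing between adjacent samples, e.g. (6 - 2) / 63 = 0.0635
    delta_addition = torch.full((z_vals.size(0), 1), 1e10, device=device)  # the last interval is treated as infinite
    delta = torch.cat([delta_prefix, delta_addition], dim=-1)
    delta = delta * torch.norm(raydirs[..., None, :], dim=-1)  # ray directions are not normalized, so scale to world distance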

    alpha    = 1.0 - torch.exp(-sigma * delta)  # opacity of each segment: 1 - exp(-sigma * delta)
    exp_term = 1.0 - alpha
    epsilon  = 1e-10
    exp_addition = torch.ones(exp_term.size(0), 1, device=device)
    exp_term = torch.cat([exp_addition, exp_term + epsilon], dim=-1)
    transmittance = torch.cumprod(exp_term, dim=-1)[..., :-1]  # T_i = product of (1 - alpha_j) for j < i, with T_0 = 1

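    weights       = alpha * transmittance                         # contribution of each sample to the pixel
    rgb           = torch.sum(weights[..., None] * rgb, dim=-2)   # expected color along the ray
    depth         = torch.sum(weights * z_vals, dim=-1)           # expected depth along the ray
    acc_map       = torch.sum(weights, -1)                        # accumulated opacity

    if white_background:
        # whatever light is not absorbed along the ray comes from a white background
        rgb       = rgb + (1.0 - acc_map[..., None])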
    return rgb, depth, acc_map, weights
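To see what the weights look like, here is a small NumPy re-computation on one made-up ray. It mirrors the alpha and transmittance steps of `predict_to_rgb` but leaves out the epsilon term and the scaling by the ray direction norm; the helper name `composite_weights` and the density values are invented for the example.

```python
import numpy as np

def composite_weights(sigma, z_vals):
    """Weights alpha_i * T_i for a single ray."""
    delta = np.append(z_vals[1:] - z_vals[:-1], 1e10)  # last interval is "infinite"
    alpha = 1.0 - np.exp(-sigma * delta)               # opacity of each segment
    # T_i is the product of (1 - alpha_j) over all earlier samples j < i
    transmittance = np.cumprod(np.append(1.0, 1.0 - alpha))[:-1]
    return alpha * transmittance

z_vals = np.array([2.0, 3.0, 4.0, 5.0])
sigma = np.array([0.0, 0.0, 50.0, 50.0])  # empty space, then a dense surface at depth 4
weights = composite_weights(sigma, z_vals)
depth = np.sum(weights * z_vals)
print(weights, depth)
```

Nearly all of the weight falls on the first dense sample at depth 4. The sample behind it contributes almost nothing because the transmittance has already dropped to about zero, and the expected depth comes out at about 4.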

The volume rendering formula:

Continuous form:
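For reference, the continuous integral can be written in the notation of the NeRF paper, followed by the discretization that `predict_to_rgb` computes. Here r(t) = o + t d is the ray, t_n and t_f are the near and far bounds (2 and 6 above), and the symbols alpha_i, T_i and delta_i correspond to `alpha`, `transmittance` and `delta` in the code. The mapping to the code variables is my reading, not something stated in the paper.

```latex
% Continuous form: expected color of the ray r(t) = o + t d
C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt,
\qquad
T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right)

% Discrete form over N samples with spacing \delta_i = t_{i+1} - t_i
\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i\,\alpha_i\,\mathbf{c}_i,
\qquad
\alpha_i = 1 - \exp(-\sigma_i \delta_i),
\qquad
T_i = \exp\!\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right) = \prod_{j=1}^{i-1} (1 - \alpha_j)
```

T(t) is the probability that the ray reaches depth t without being blocked, which is why the density sigma in (6) is described as a probability density of being stopped.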