[3D AIGC] Reconstructing 3D Scenes as 3D Gaussians with a Latent Diffusion Model (LDM)

Title: Sampling 3D Gaussian Scenes in Seconds with Latent Diffusion Models
Source: University of Glasgow; University of Edinburgh
Link: https://arxiv.org/abs/2406.13099

Contents

Abstract
1. Introduction
2. Related Work
- 2.1 Dense reconstruction
- 2.2 Sparse reconstruction
- 2.3 Generative models
3. Method
- 3.1 Autoencoder
- 3.2 Denoiser
4. Experiments
Summary

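Abstract

This paper proposes a latent diffusion model over 3D scenes that is trained using only 2D image data. We first design (1) an autoencoder that maps multi-view images to 3D Gaussian splats while building a compressed latent representation of those splats. We then (2) train a multi-view diffusion model on the latent space to learn an efficient generative model. The pipeline requires neither object masks nor depth, and is suitable for complex scenes with arbitrary camera positions. We conduct careful experiments on two large-scale datasets of complex real-world scenes, MVImgNet and RealEstate10K. Our method generates 3D scenes in as little as 0.2 seconds, either from scratch, from a single input view, or from sparse input views. It produces diverse, high-quality results while running an order of magnitude faster than non-latent diffusion models and earlier NeRF-based generative models.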

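1. Introduction

Learning generative models that capture the distribution of the 3D world around us is a compelling yet challenging problem. Beyond the broader goal of building intelligent agents that understand their environment, such models are useful for many practical tasks. In games and visual effects, they enable effortless creation of 3D assets, which is currently notoriously difficult, slow, and expensive. In computer vision, they enable 3D reconstruction of real scenes from a single image, with the generative model synthesizing plausible 3D detail even in regions that are not visible in the image, unlike classical 3D reconstruction methods [35,64].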

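Large-scale datasets of images, text, and video [83,29,7] have made it possible to learn impressive generative models of these modalities [78,73,41]. However, no large-scale dataset of photorealistic 3D scenes exists yet. Existing 3D datasets are either large but composed mainly of isolated objects (not full scenes), often with unrealistic textures [118,22,12]; or they are photorealistic environments (captured with 3D scanners) but too small to learn a generative model from [21,4]. In contrast, large-scale multi-view image datasets are readily available [128,135,74].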

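We would therefore like to learn a 3D generative model directly from datasets of multi-view images rather than from 3D data. A naive strategy is to apply a standard 3D reconstruction technique to every scene in such a dataset, then train a 3D generative model directly on the resulting reconstructions [65,129]. However, this is computationally expensive. It also poses a challenging learning task for the generative model: because the scenes are reconstructed independently, we do not obtain a smooth, shared representation space (for example, similar scenes may have very different weights when represented as NeRFs [64]). This makes it hard to learn a prior that generalizes rather than simply memorizing individual scenes. These limitations have motivated a line of work that learns 3D generative models directly from images [3,96,84,38]. Unfortunately, the most recent of these methods are very slow to sample from, since they require an expensive volume rendering operation after every step of the diffusion process [96,2,102].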

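2. Related Work

2.1 Dense reconstruction

Many scene representations and corresponding inference methods have been proposed, including surface representations (e.g. meshes, distance fields [70]), point clouds [82,86,93], light fields [20,32], and volumetric representations (e.g. radiance fields [120] and voxels [87,91]). Current state-of-the-art methods use neural radiance fields (NeRFs) [64,5], which parameterize the radiance field implicitly and can be readily optimized by gradient descent on an image reconstruction loss. However, NeRFs require expensive volume rendering involving a large number of MLP queries, and despite recent efforts to reduce their size [71,28,2,66,55,14,121,28], both training and rendering remain slow. More recently, Gaussian Splatting [48] was introduced as an alternative that allows real-time rendering and fast training, with quality approaching that of state-of-the-art NeRFs. Our work also uses this efficient representation, but instead of fitting a single scene we build a generative model that learns to sample scenes from a distribution (e.g. conditioned on a class label or a sparse set of images).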

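2.2 Sparse reconstruction

The methods above can reconstruct scenes from dense image sets (e.g. >50 images), but in practice parts of a scene are often not observed from multiple images and must be inferred. To address this, a line of work trains models to reconstruct 3D scenes from far fewer views (e.g. <10). Most approaches project 2D image features into 3D space, fuse them, and apply NeRF rendering. Several recent concurrent methods [13,17,139,130,124,133,114,88,97] tackle sparse-view reconstruction using splats as the 3D representation; others aim to predict novel views directly, without an explicit 3D representation [53,79,25]. However, these methods are not probabilistic: they do not represent uncertainty over unobserved parts of the scene (e.g. the back of an object). As a result, they output a single averaged solution rather than sampling one of the many plausible 3D representations.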

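For example, MVSplat [17] and pixelSplat [13] train a network to map context images to 3D splats, but cannot generate 3D scenes unconditionally or sample diverse completions of occluded regions. LatentSplat [114] does construct a posterior distribution over splats and then samples from it, but it imposes a mean-field (independent) posterior on the splat parameters themselves. It therefore cannot capture complex posterior dependencies in scene geometry, and so cannot sample coherent shapes for unobserved regions of the scene. Note that almost all of the above methods are designed for two or more input views. Only SplatterImage [97] can predict 3D splats from a single image, and even this method is limited to masked, object-centric scenes. Although some methods incorporate information from generative models to make uncertain regions more plausible [89,77,138,63,19,69], they do not learn to sample from the true posterior distribution over scenes, which is what our method aims to learn.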

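2.3 Generative models

Various families of generative models [50,31,106,47,94,42] have been proposed to learn complex distributions from training data. Following the success of generative models across modalities including language [107,73], sound [105], and images [78,41], there is now growing interest in sampling 3D content. A straightforward approach is to create a large-scale 3D dataset [12,22] and train a generative model directly on it [62,104,16,134,45,54,18,65,6,111,49,90,34,46,33]. However, unlike other modalities, large-scale, highly realistic 3D scene datasets are challenging to create, and most 3D representations lack shared structure (because each data point is created independently; e.g. meshes have different topologies and point clouds have different numbers of points), which makes learning a prior difficult. Although recent work uses auto-decoding [6] or optimal transport [129] to share structure across representations, these methods are limited to small or object-centric datasets.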

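To avoid the difficulty of learning a smooth prior over independently reconstructed scenes, various methods learn 3D-aware generative models directly from images. One simple and elegant approach [26,117,7,11,103,58,101,127,113,51,100,59,57,44,30,98] generates multi-view images conditioned on camera poses without any explicit 3D representation, then applies classical 3D reconstruction to the generated images. However, it inherits the limitations of classical methods, chiefly the need to generate a large number of consistent images (>50) and the lack of a prior during 3D reconstruction.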

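To sample 3D representations directly, some works learn 3D-aware generative models of 2D images: they keep the mathematical formulation of an image generative model but build a prior into the network architecture that forces the model to output images via an explicit 3D representation. Pioneering methods were based on VAEs [52,36,38,1,37] and GANs [84,10,92,23,68,67,132,24], while current state-of-the-art methods use 3D-aware denoising diffusion models [3,46,96,102,122,2,43,85,9]. Unlike score distillation methods [72,112,99,123,136,125,119,137,56,109], which exhibit mode-seeking behavior and do not truly sample from a distribution, 3D-aware diffusion models can sample 3D scenes from the true posterior. However, existing works represent scenes with radiance fields and are therefore limited by slow training, sampling, and rendering. In contrast, our method samples a 3D scene in well under a second (0.2 s), whereas recent methods such as [96] take far longer (up to 51 s); we can also render the sampled 3D assets in real time.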

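3. Method

Our goal is to build a model that supports both conditional and unconditional generation of 3D scenes. We assume access only to a training dataset of multi-view images with camera poses (which can be obtained from a phone camera or from COLMAP SfM [82]). We do not require any additional 2D/3D supervision (e.g. annotations, pre-trained models, foreground masks, depth maps), nor do we assume that camera poses are consistently aligned across the dataset.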

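We achieve this with a latent diffusion framework trained in two stages. First, a 3D-aware variational autoencoder (VAE) is trained on sets of multi-view images (Section 3.1). It encodes the multi-view images into a compact latent representation, decodes that into an explicit 3D scene represented by Gaussian splats, and then renders the scene to reconstruct the images. Second, we train a denoising diffusion model on the compact latent space learned by the autoencoder (Section 3.2). This diffusion model is trained jointly for class-conditional and image-conditional generation, and can efficiently learn a distribution over the latent space. At inference time, the generated latents are decoded back into splats by the autoencoder and rendered.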

3.1 Autoencoder

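The autoencoder takes as input $V$ views $x = \{x_v\}$ of a scene (each of $H×W$ pixels) together with their camera poses $π = \{π_v\}$, and outputs a set of splats $S$ from which the original images $x_v$ can be rendered at each $π_v$. Importantly, it passes all information about the scene through a low-dimensional latent bottleneck, yielding a compressed representation from which the splats are then decoded and rendered.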

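Encoding multi-view images. Each $x_v$ is first passed independently through three downsampling residual blocks similar to [27,78], producing a feature map of resolution $\frac{H}{8} × \frac{W}{8}$. These features are processed by a multi-view U-Net [80], which lets the different views exchange information effectively (necessary for a consistent 3D reconstruction). This U-Net is largely based on that of DDPM. To adapt it to our multi-view setting, we take inspiration from video diffusion models, in particular [8] (Align your latents: High-resolution video synthesis with latent diffusion models, CVPR 2023), and add a small cross-view ResNet after each block that combines information from all views independently for each pixel. We also modify all attention layers to attend jointly over features from all views. Apart from these parts, the remaining residual blocks process each view independently. The final convolution of the U-Net outputs, for each view, the mean and log-variance of a feature map of size $\frac{H}{8} × \frac{W}{8}$. This is the compressed latent space $\{z_v\}_{v=1}^{V}$ on which denoising is performed (see Section 3.2), so we restrict it to very few channels. Following VAEs [50,78], we assume a diagonal Gaussian posterior distribution and write $E$ for the overall encoder mapping from $x_v$ to a latent sample $z_v$; it is shown as the green box in Figure 1.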
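As a minimal sketch of the diagonal Gaussian posterior described above, the snippet below draws a latent sample from a predicted mean and log-variance with the standard reparameterization trick and computes the closed-form KL divergence to a standard normal. The array layout `(V, C, H/8, W/8)`, the function names, and the constant log-variance are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def sample_latents(mean, log_var, rng):
    """Draw z ~ N(mean, diag(exp(log_var))) via the reparameterization trick."""
    std = np.exp(0.5 * log_var)
    eps = rng.standard_normal(mean.shape)
    return mean + std * eps

def kl_to_standard_normal(mean, log_var):
    """KL( N(mean, diag(exp(log_var))) || N(0, I) ), summed over all elements."""
    return 0.5 * np.sum(np.exp(log_var) + mean**2 - 1.0 - log_var)

# Assumed layout: V=6 views, 6 latent channels, 96x96 images -> 12x12 latents.
rng = np.random.default_rng(0)
mean = rng.standard_normal((6, 6, 12, 12))
log_var = np.full_like(mean, -2.0)   # placeholder for the encoder's prediction
z = sample_latents(mean, log_var, rng)
print(z.shape, kl_to_standard_normal(mean, log_var) > 0)
```

In a real training loop the mean and log-variance would come from the final convolution of the multi-view U-Net, and the KL term would enter the loss with weight β.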

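Decoding to a 3D scene. The latents $z_v$ from the previous step are passed through three upsampling residual blocks that mirror the initial layers of $E$, producing feature maps of the same size as the original images. As in [97,13], the features of each view are mapped by a convolutional layer to the parameters of splats supported on that view's frustum. For each pixel we predict the depth, opacity, RGB color, rotation, and scale of the corresponding splat, requiring 12 channels in total; the 3D position of each splat is computed by unprojecting the predicted depth along the corresponding camera ray. The union of the $V×H×W$ splats from all images forms our scene representation $S$. This representation (called a splatter image in [97]) provides a structured way to represent splats, allowing them to be processed with standard convolutional layers rather than the permutation-invariant layers required for unstructured point clouds. We write $D$ for the mapping from $\{z_v\}_{v=1}^{V}$ to $S$, shown as the blue box in Figure 1. The scene can then be rendered to pixels $x^∗$ from arbitrary camera parameters $π^∗$ via the rendering operation $x^∗ = R(S, π^∗)$. A key benefit of having splats supported on all of the denoised views is that we can represent 3D content wherever we look. This contrasts with SplatterImage [97]: when performing single-image reconstruction, it can only parameterize splats inside (or very close to) the view frustum of the input image, whereas we can produce coherent content arbitrarily far away.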
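The per-pixel splat parameterization can be illustrated with a short NumPy sketch. It splits a 12-channel feature map into named parameters and unprojects a depth map to 3D splat centers. Several details are assumptions made for illustration only: rotation is taken to be a 4-component quaternion (which makes 1 + 1 + 3 + 4 + 3 = 12), depth is measured along the camera z-axis, pixel centers sit at half-integer coordinates, and the camera model is a simple pinhole.

```python
import numpy as np

# Assumed channel layout; only the total of 12 is stated in the text.
SPLAT_CHANNELS = {"depth": 1, "opacity": 1, "rgb": 3, "rotation": 4, "scale": 3}

def split_splat_params(feat):
    """Split a (12, H, W) feature map into named per-pixel splat parameters."""
    out, start = {}, 0
    for name, n in SPLAT_CHANNELS.items():
        out[name] = feat[start:start + n]
        start += n
    return out

def unproject_depth(depth, K, cam_to_world):
    """Back-project an (H, W) depth map to (H, W, 3) world-space points.

    K is a 3x3 pinhole intrinsics matrix, cam_to_world a 4x4 transform.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)        # homogeneous pixels
    rays_cam = pix @ np.linalg.inv(K).T                     # rays with z = 1
    pts_cam = rays_cam * depth[..., None]                   # scale by depth
    pts_h = np.concatenate([pts_cam, np.ones((H, W, 1))], axis=-1)
    return (pts_h @ cam_to_world.T)[..., :3]

# Tiny example: a 4x4 image at constant depth 2, camera at the world origin.
K = np.array([[2.0, 0.0, 2.0], [0.0, 2.0, 2.0], [0.0, 0.0, 1.0]])
centers = unproject_depth(np.full((4, 4), 2.0), K, np.eye(4))
print(centers.shape)
```

Stacking the resulting centers (and the other parameters) over all $V$ views gives the $V×H×W$ splats that make up the scene.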

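Conditioning on pose. To condition the autoencoder on the relative poses of the views, we design a new strategy based on the splat renderer. For each view, we generate a set of splats along the edges of its view frustum, and assign each view a random color. We then render the resulting splat cloud from all of the cameras. These renderings are concatenated with the inputs $x_v$ before the first residual block of the encoder. Note that without this conditioning, the autoencoder could not learn arbitrary scene scales because of the perspective depth/scale ambiguity (even if it successfully learned to triangulate the input images).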
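The geometric part of this idea can be sketched as follows: sample points along the four side edges of one camera's frustum and give that view a random color. The near and far depths, the number of points per edge, and the function name are hypothetical choices; the actual splat sizes, edge set, and rendering step are not specified here and are left out.

```python
import numpy as np

def frustum_edge_points(K, cam_to_world, H, W, near, far, n_per_edge=16):
    """Sample world-space points along the four side edges of a view frustum."""
    # Image corners in homogeneous pixel coordinates.
    corners = np.array([[0, 0, 1], [W, 0, 1], [W, H, 1], [0, H, 1]], dtype=float)
    rays = corners @ np.linalg.inv(K).T                   # (4, 3), z = 1
    depths = np.linspace(near, far, n_per_edge)           # (n,)
    pts_cam = rays[:, None, :] * depths[None, :, None]    # (4, n, 3)
    pts_h = np.concatenate([pts_cam, np.ones((4, n_per_edge, 1))], axis=-1)
    return (pts_h @ cam_to_world.T)[..., :3].reshape(-1, 3)

rng = np.random.default_rng(0)
K = np.array([[48.0, 0.0, 48.0], [0.0, 48.0, 48.0], [0.0, 0.0, 1.0]])
points = frustum_edge_points(K, np.eye(4), H=96, W=96, near=0.5, far=4.0)
color = rng.random(3)          # one random RGB color for this view
print(points.shape, color.shape)
```

Rendering the union of these colored point sets from every camera would then give each view an image of where the other cameras sit, which is the signal that gets concatenated with $x_v$.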

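Training. We assume access to $V$ input views $(x^{(in)}, π^{(in)})$ and an additional $V'$ nearby target views $(x^{(target)}, π^{(target)})$ that are not passed to $E$. We predict splats $S = D(E(x^{(in)}, π^{(in)}))$ and render them at the target views: $x^∗_{v'} = R(S, π_{v'}^{(target)})$. The network is trained as a variational autoencoder [50,75], using the sum of L2 and LPIPS [131] distances between the ground-truth and rendered images as the reconstruction loss, while regularizing the sampled latents $\{z_v\}_{v=1}^{V}$ toward a standard Gaussian prior. The loss is: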

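[Equation image missing from the source: autoencoder training loss]
where $β$ controls the weight of the KL loss. Note that, unlike methods using NeRFs, the speed of the splat rendering operation $R$ means we can apply LPIPS directly to full-image renderings.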
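Since the loss equation is not reproduced in this post, the following is a reconstruction from the verbal description only (L2 plus LPIPS reconstruction terms on the rendered target views, plus a β-weighted KL term). The averaging over target views, the symbol $q$ for the encoder posterior, and the exact form of the KL term are assumptions and may differ from the paper:

$$
\mathcal{L} = \frac{1}{V'} \sum_{v'=1}^{V'} \Big[ \big\| x^{(target)}_{v'} - x^{∗}_{v'} \big\|_2^2 + \mathrm{LPIPS}\big(x^{(target)}_{v'},\, x^{∗}_{v'}\big) \Big] + β \, \mathrm{KL}\Big( q\big(\{z_v\}_{v=1}^{V} \mid x^{(in)}, π^{(in)}\big) \,\Big\|\, N(0, I) \Big)
$$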

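Compression. Compared with treating the splats themselves as the latent space, using 6 latent channels (as in the main experiments of the paper) gives a compression factor of 128×. As shown in Section 4, this makes training and inference of the denoiser considerably more efficient.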
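The 128× figure follows from the numbers already given: each view has $H×W$ splats with 12 parameters each, while its latent has $\frac{H}{8} × \frac{W}{8}$ positions with 6 channels. A small check of that arithmetic (the function name is just for illustration):

```python
def compression_factor(splat_channels=12, latent_channels=6, downsample=8):
    """Ratio of splat parameters to latent values for one view."""
    # (H * W * splat_channels) / ((H / d) * (W / d) * latent_channels)
    return (splat_channels * downsample**2) / latent_channels

print(compression_factor())  # 128.0
```

The image size cancels out, so the factor is the same at 96×96 as at any other resolution divisible by 8.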

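3.2 Denoiser

We define a denoising diffusion model over the low-dimensional multi-view latent feature maps $\{z_v\}$ produced by the encoder $E$. Rather than a plain 2D U-Net, the denoiser uses a multi-view U-Net architecture very similar to that of the autoencoder in Section 3.1, additionally conditioned on the diffusion timestep (as in DDPM). This multi-view U-Net $\hat{v}_θ$ is conditioned on the poses $π_v$, for example by concatenating a representation of all the view frusta. The U-Net is responsible for learning the joint distribution of the latent features of all views, passing information between views via cross-view attention and convolution operations.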

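Conditional generation. We train the denoiser jointly for image-conditional and class-conditional generation (choosing one at random for each minibatch), and drop the conditioning entirely for 20% of minibatches to enable classifier-free guidance [40]. For class conditioning, we use a learnable embedding that is added to the U-Net's timestep embedding. For image conditioning (i.e. when performing 3D reconstruction), we reuse the pre-trained encoder $E$ to encode the conditioning images $x_{cond}$ and their poses $π_{cond}$. The resulting conditioning latents $E(x_{cond}, π_{cond})$ are then concatenated with the noise at the input of the denoiser.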
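A minimal sketch of the two ingredients of classifier-free guidance mentioned above: randomly dropping the condition during training, and combining conditional and unconditional predictions at sampling time. The 20% drop rate comes from the text; the function names, the use of `None` for "no condition", and the guidance scale values are assumptions for illustration.

```python
import numpy as np

def drop_condition(cond, rng, p_drop=0.2):
    """Return None (train unconditionally) with probability p_drop."""
    return None if rng.random() < p_drop else cond

def guided_prediction(v_cond, v_uncond, guidance_scale):
    """Standard classifier-free guidance: extrapolate away from the
    unconditional prediction toward the conditional one."""
    return v_uncond + guidance_scale * (v_cond - v_uncond)

rng = np.random.default_rng(0)
kept = [drop_condition("chair", rng) for _ in range(10)]  # one draw per minibatch
v = guided_prediction(np.array([1.0, 2.0]), np.array([0.0, 0.0]), guidance_scale=3.0)
print(kept.count(None), v)
```

A scale of 1 recovers the purely conditional prediction, and larger scales trade sample diversity for closer adherence to the condition.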

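Training. During training, we sample minibatches of posed views $\{(x, π)\}$, pass them through $E$, and sample from the posterior to obtain latents $z$. Based on the statistics of the first minibatch, $z$ is normalized to zero mean and unit standard deviation. We then sample a diffusion timestep $t$ and noise from the Gaussian forward process $N(z^{(t)}; α_t z, σ_t^2 I)$, where $α_t$ and $σ_t$ are given by a linear noise schedule, and optimize the denoiser parameters $θ$ by gradient descent: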

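[Equation image missing from the source: denoiser training objective]
Importantly, because the model operates in an abstract latent space, we can train the denoiser $\hat{v}_θ(z^{(t)}, t)$ to predict $v^{(t)} ≡ α_t ϵ − σ_t z$ [81]. This is more stable than the $x^{(0)}$ prediction used by other 3D-aware diffusion models [96,2,3], which constrains the denoiser itself to output rendered pixels.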
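To make the v-parameterization concrete, here is a small NumPy sketch of the forward process and the regression target. With this target the objective presumably has the usual form $E_{z,t,ϵ}\,\|\hat{v}_θ(z^{(t)}, t) − v^{(t)}\|^2$, although the exact weighting is not shown in this post. The schedule endpoints (`beta_min`, `beta_max`) are the common DDPM defaults and are an assumption here, as is the variance-preserving convention $α_t^2 + σ_t^2 = 1$.

```python
import numpy as np

def linear_schedule(num_steps=1000, beta_min=1e-4, beta_max=0.02):
    """DDPM-style linear beta schedule; returns (alpha_t, sigma_t) arrays."""
    betas = np.linspace(beta_min, beta_max, num_steps)
    alpha_bar = np.cumprod(1.0 - betas)
    return np.sqrt(alpha_bar), np.sqrt(1.0 - alpha_bar)

def noise_and_target(z, t, alphas, sigmas, rng):
    """Sample z_t ~ N(alpha_t z, sigma_t^2 I) and its v-prediction target."""
    eps = rng.standard_normal(z.shape)
    z_t = alphas[t] * z + sigmas[t] * eps
    v = alphas[t] * eps - sigmas[t] * z
    return z_t, v, eps

def z_from_v(z_t, v, t, alphas, sigmas):
    """Invert the parameterization: z = alpha_t z_t - sigma_t v."""
    return alphas[t] * z_t - sigmas[t] * v

alphas, sigmas = linear_schedule()
rng = np.random.default_rng(0)
z = rng.standard_normal((6, 6, 12, 12))      # normalized latents for one scene
z_t, v, eps = noise_and_target(z, 500, alphas, sigmas, rng)
print(np.allclose(z_from_v(z_t, v, 500, alphas, sigmas), z))
```

The inversion works because $α_t z^{(t)} − σ_t v^{(t)} = (α_t^2 + σ_t^2)\, z = z$, which is why a v-predicting denoiser can be used to recover a clean latent estimate at any step.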

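Sampling. To sample a 3D scene from the model, we first sample Gaussian noise $z^{(1000)} ∼ N(0, I)$ in the latent space and choose a set of camera poses (e.g. from a held-out validation set) as conditioning. We then use DDIM sampling [95] to obtain $z^{(0)}$, applying classifier-free guidance for class conditioning. The generated latents are decoded to a 3D scene via $S = D(z^{(0)})$, which can then be rendered efficiently with $R$ from any viewpoint $π^∗$.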
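The sampling loop can be sketched as deterministic DDIM (η = 0) with a v-predicting denoiser. This is a generic illustration rather than the authors' code: the schedule endpoints, the evenly spaced timestep selection, and the toy denoiser are all assumptions, and pose conditioning and classifier-free guidance are omitted for brevity.

```python
import numpy as np

def ddim_sample(denoiser, shape, alphas, sigmas, num_steps, rng):
    """Deterministic DDIM sampling; denoiser(z_t, t) returns the predicted v."""
    T = len(alphas)
    ts = np.linspace(T - 1, 0, num_steps).round().astype(int)
    z_t = rng.standard_normal(shape)                 # start from pure noise
    for i, t in enumerate(ts):
        v = denoiser(z_t, t)
        z0 = alphas[t] * z_t - sigmas[t] * v         # predicted clean latent
        eps = sigmas[t] * z_t + alphas[t] * v        # predicted noise
        if i + 1 < len(ts):
            t_next = ts[i + 1]
            z_t = alphas[t_next] * z0 + sigmas[t_next] * eps
        else:
            z_t = z0                                 # last step: return the estimate
    return z_t

# Assumed DDPM-style linear schedule with 1000 steps.
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)
alphas, sigmas = np.sqrt(alpha_bar), np.sqrt(1.0 - alpha_bar)

# Toy denoiser: the exact v for a "dataset" whose only latent is all zeros.
toy_denoiser = lambda z_t, t: alphas[t] * z_t / sigmas[t]

sample = ddim_sample(toy_denoiser, (6, 6, 12, 12), alphas, sigmas, 50,
                     np.random.default_rng(0))
print(np.abs(sample).max())
```

In the full pipeline the returned latents would be passed to the decoder $D$ to obtain splats, which can then be rendered from any camera.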

4. Experiments

We evaluate 3D reconstruction in the single-view and few-view settings.

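Datasets. We use two large-scale real-world datasets: MVImgNet and RealEstate10K. MVImgNet consists of in-the-wild videos of diverse objects; we use the split from the MVPNet subset, which contains 87,820 videos (typically 30 frames each) covering 180 object classes, and restrict evaluation to 5,000 scenes. RealEstate10K contains 69,893 videos showing indoor and outdoor views of houses, each typically with 50-200 frames; evaluation is again restricted to 5,000 scenes. We use the full video clips (which often contain substantial camera motion) for training and evaluation. The videos in both datasets depict full scenes, not just isolated objects. For both datasets, we center-crop each image to a square whose side equals the shorter edge, then rescale it to 96×96. During training, we randomly sample sets of 6 frames as the multi-view images. We use the camera poses provided with each dataset, but give the model only relative poses. We do not require any canonicalisation of scene orientation or scale, nor segmentation or depth.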
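Two of the preprocessing steps above are easy to sketch in NumPy: the square center crop, and converting absolute camera poses to relative ones. Expressing poses relative to the first view of each set is an assumption made for illustration (the text only says that relative poses are provided), and the 96×96 resize is left out because it would normally be done with an image library.

```python
import numpy as np

def center_crop_square(img):
    """Crop an (H, W, C) image to a centered square with side min(H, W)."""
    H, W = img.shape[:2]
    s = min(H, W)
    top, left = (H - s) // 2, (W - s) // 2
    return img[top:top + s, left:left + s]

def to_relative_poses(cam_to_world):
    """Express (V, 4, 4) camera-to-world poses relative to the first camera."""
    ref_inv = np.linalg.inv(cam_to_world[0])
    return ref_inv[None] @ cam_to_world

frame = np.zeros((120, 200, 3))
poses = np.tile(np.eye(4), (6, 1, 1))
poses[:, 0, 3] = np.arange(6)              # six cameras sliding along x
print(center_crop_square(frame).shape, to_relative_poses(poses)[0])
```

Relative poses are unchanged if the whole scene is rotated or translated, which is consistent with the statement that no canonical scene orientation is needed.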

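Baselines. GIBR [2] and ViewSet Diffusion [96] are 3D-aware diffusion models over multi-view images. They encode a set of noisy views, use them to build a radiance-field representation of the scene, and render it to give the denoised images. Like our method, they support both unconditional generation and reconstruction. However, both are relatively expensive, since they must perform volume rendering at every denoising step. RenderDiffusion [3] is a similar method that requires only single images during training. SplatterImage [97] is a recent deterministic method that reconstructs 3D from a single image, outputting splats in a single U-Net pass. It is currently the only method able to predict 3D splats directly from a single image. However, the original work only considered masked views of isolated objects, so we adapt their method to our larger, unmasked scenes. Finally, PixelNeRF [126] deterministically predicts a radiance field by unprojecting features from one or more input images; we use the in-the-wild variant from [2] (Denoising diffusion via image-based rendering).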

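Generation results. We evaluate unconditional generation (on RealEstate10K) and class-conditional generation on the full (180-class) MVImgNet dataset. Qualitative examples are shown in Figure 2, with more in the appendix. Our model can generate objects of different classes given only a label as conditioning. The generated scenes are coherent across views, including under the long camera motions in RealEstate10K. Quantitative evaluation is given in Table 1 (bottom five rows). Following the setup of [2], we compare against GIBR and other methods on the chair, table, and sofa classes of MVImgNet.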

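Single-view reconstruction. Given one image, we predict 11 other evenly spaced viewpoints and measure the accuracy of the predicted views using PSNR and LPIPS [131]. Because our method is generative (and can produce many plausible samples for a given input), we follow standard practice for stochastic prediction: we draw multiple (20) samples per scene and report the best one. Here we compare against SplatterImage on MVImgNet and RealEstate10K, as well as on the furniture subset of MVImgNet; qualitative results are shown in Figure 3. Our method reconstructs plausible shapes from a single image, for different object classes and even entire rooms. Details in the input image are preserved, while plausible content is generated in occluded regions. In addition, Figure 4 shows that, given one image of an object, our model can sample diverse (but plausible) textures and shapes for the unobserved back of the object. Quantitative results are in Table 1 (column "1-view reconstruction"). On the full MVImgNet, our method clearly outperforms SplatterImage.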
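The best-of-N protocol described above can be sketched for PSNR as follows. Images are assumed to be float arrays in [0, 1], and selecting the best sample per scene by PSNR over the whole set of predicted views is an assumption; the exact selection rule used in the paper is not stated here.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images in [0, max_val]."""
    mse = np.mean((pred - target) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val**2 / mse)

def best_of_n_psnr(samples, target):
    """samples: (N, ...) stochastic predictions for one scene; returns max PSNR."""
    return max(psnr(s, target) for s in samples)

# Toy example with N = 2 samples instead of the 20 used in the evaluation.
target = np.zeros((4, 4, 3))
samples = np.stack([np.full((4, 4, 3), 0.5), np.full((4, 4, 3), 0.1)])
print(best_of_n_psnr(samples, target))
```

For a metric where lower is better, such as LPIPS, the same idea applies with `min` in place of `max`.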

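Figure 3: Qualitative comparison of 3D reconstruction from a single image by our model (top row) and SplatterImage (bottom row). The first column shows the input (conditioning) image, the second the ground-truth image, and the third and fourth the predicted frame and depth, respectively.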

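Figure 4: Given a single input image from MVImgNet (first column), our model performs 3D reconstruction generatively and can therefore produce multiple diverse back views (columns 2 to 7). Compared with the back view produced by a deterministic model (column 8), our model's predictions are much sharper.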

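Speed. We measure the time to reconstruct a scene (fixing every method to 50 denoising steps for fairness), processing minibatches of 8 scenes on a single consumer GPU (NVIDIA RTX 3090) and reporting the average time per scene. Among the generative methods, ours is by far the fastest (0.22 s); the next fastest (ViewSet Diffusion) takes 4.3 s. GIBR is the slowest, taking 44 s to reconstruct a scene; it is limited by the need to build and re-render a NeRF at every denoising step. PixelNeRF is the fastest overall, but trades quality for speed and cannot represent a posterior over 3D scenes.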
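The timings quoted above translate into the following speed-up ratios; the snippet only divides the per-scene times reported in the text.

```python
# Per-scene reconstruction times in seconds, as quoted in the text.
ours = 0.22
baselines = {"ViewSet Diffusion": 4.3, "GIBR": 44.0}

speedups = {name: t / ours for name, t in baselines.items()}
for name, s in speedups.items():
    print(f"{name}: {s:.1f}x slower than ours")
```

This gives roughly 20× for ViewSet Diffusion and 200× for GIBR, consistent with the "order of magnitude" claim in the abstract.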

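Sparse-view 3D reconstruction. We also evaluate reconstruction from sparse (6) views, measuring PSNR and LPIPS on 6 held-out views spaced evenly between the inputs. Quantitative results are in Table 1 and qualitative examples in Figure 9 (bottom). Qualitatively, the reconstructions are very good, preserving fine details and giving plausible depth maps. Quantitatively, our method performs slightly below GIBR on the furniture subset of MVImgNet, although it runs much faster. Note that for this task we use only the autoencoder of our model, not the denoiser. These results therefore also show that the autoencoder faithfully preserves detail in 3D scenes despite compressing by 128×, supporting its use as a latent space for learning a generative prior.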

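Benefits of being generative and latent. We now evaluate several variants of our model, showing the benefit of operating in a latent space and of treating 3D reconstruction as a generative (probabilistic) task rather than a deterministic one. Results on MVImgNet are shown in Table 2. Here, Deterministic denotes a variant of our model in which, given an input image, the denoiser is replaced by a deterministic predictor of the latent variables representing the scene. This model has the same network architecture as the main model, but does not sample from the posterior over possible scenes; instead, it is expected to learn the conditional expectation in latent space. It performs markedly worse than our model on LPIPS, and slightly worse on PSNR. This is to be expected, because PSNR (based on mean squared error) penalizes over-smoothed solutions less than LPIPS does (LPIPS favors perceptually correct outputs, e.g. sharp edges, even when they are slightly misaligned).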

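Furthermore, the qualitative results in Figure 4 show that, while our method generates many sharp, plausible samples for unobserved parts of the scene, the deterministic variant produces a single blurry prediction that averages over the uncertainty. Likewise, on RealEstate10K (Figure 5), our model can generate diverse and plausible content for rooms that are not visible in the input frame, e.g. when the camera passes through a doorway.

Splats-as-latents is a natural ablation in which, instead of learning a compressed latent space, we treat the multi-view splat parameters themselves as the latent variables and learn to denoise directly over them. We train this ablation for the same time as our main model, to measure the trade-off between performance and accuracy fairly. Both generation and reconstruction are markedly worse than with our model (FID 57.3 vs. 23.1; LPIPS 0.39 vs. 0.32). Moreover, because its latent space has 128× more dimensions, this model takes 60× longer than ours to sample a 3D scene. For a fixed training compute budget, our latent approach therefore yields clear benefits in both quality and accuracy.

Finally, Non-latent is an ablation that uses a single-stage training procedure, similar to [2,96]. Here the denoiser operates directly on pixels, but incorporates our splat-based scene representation in its decoder, rendering it at every denoising step to give the clean pixels. This model performs worse still (for the same training compute budget), reaching an FID of 133.8 and an LPIPS of 0.499.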


Summary

References

[1] T. Anciukevicius, P. Fox-Roberts, E. Rosten, and P. Henderson. Unsupervised causal generative understanding of images. Advances in Neural Information Processing Systems, 35:37037–37054, 2022.
[2] T. Anciukevičius, F. Manhardt, F. Tombari, and P. Henderson. Denoising diffusion via image-based
rendering. In The Twelfth International Conference on Learning Representations, 2024.
[3] T. Anciukevičius, Z. Xu, M. Fisher, P. Henderson, H. Bilen, N. J. Mitra, and P. Guerrero. Renderdiffusion:
Image diffusion for 3d reconstruction, inpainting and generation. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), pages 12608–12618, June 2023.
[4] I. Armeni, S. Sax, A. R. Zamir, and S. Savarese. Joint 2d-3d-semantic data for indoor scene understanding.
arXiv preprint arXiv:1702.01105, 2017.
[5] J. T. Barron, B. Mildenhall, M. Tancik, P. Hedman, R. Martin-Brualla, and P. P. Srinivasan. Mip-nerf:
A multiscale representation for anti-aliasing neural radiance fields. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, pages 5855–5864, 2021.
[6] M. A. Bautista, P. Guo, S. Abnar, W. Talbott, A. Toshev, Z. Chen, L. Dinh, S. Zhai, H. Goh, D. Ulbricht,
A. Dehghan, and J. Susskind. Gaudi: A neural architect for immersive 3d scene generation. In NeurIPS, 2022.
[7] A. Blattmann, T. Dockhorn, S. Kulal, D. Mendelevitch, M. Kilian, D. Lorenz, Y. Levi, Z. English,
V. Voleti, A. Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.
arXiv preprint arXiv:2311.15127, 2023.
[8] A. Blattmann, R. Rombach, H. Ling, T. Dockhorn, S. W. Kim, S. Fidler, and K. Kreis. Align your latents:
High-resolution video synthesis with latent diffusion models. In IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), 2023.
[9] A. Cao, J. Johnson, A. Vedaldi, and D. Novotny. Lightplane: Highly-scalable components for neural 3d
fields. ArXiv, 2024.
[10] E. R. Chan, C. Z. Lin, M. A. Chan, K. Nagano, B. Pan, S. D. Mello, O. Gallo, L. Guibas, J. Tremblay,
S. Khamis, T. Karras, and G. Wetzstein. Efficient geometry-aware 3D generative adversarial networks. In
CVPR, 2022.
[11] E. R. Chan, K. Nagano, J. J. Park, M. Chan, A. W. Bergman, A. Levy, M. Aittala, S. D. Mello, T. Karras,
and G. Wetzstein. Generative novel view synthesis with 3d-aware diffusion models. In IEEE International
Conference on Computer Vision (ICCV), October 2023.
[12] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song,
H. Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
[13] D. Charatan, S. Li, A. Tagliasacchi, and V. Sitzmann. pixelsplat: 3d gaussian splats from image pairs for
scalable generalizable 3d reconstruction. arXiv preprint arXiv:2312.12337, 2023.
[14] A. Chen, Z. Xu, A. Geiger, J. Yu, and H. Su. Tensorf: Tensorial radiance fields. In Computer Vision –
ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXII,
page 333–350, Berlin, Heidelberg, 2022. Springer-Verlag.
[15] A. Chen, Z. Xu, F. Zhao, X. Zhang, F. Xiang, J. Yu, and H. Su. Mvsnerf: Fast generalizable radiance field
reconstruction from multi-view stereo. In Proceedings of the IEEE/CVF International Conference on
Computer Vision, pages 14124–14133, 2021.
[16] H. Chen, J. Gu, A. Chen, W. Tian, Z. Tu, L. Liu, and H. Su. Single-stage diffusion nerf: A unified
approach to 3d generation and reconstruction. In ICCV, 2023.
[17] Y. Chen, H. Xu, C. Zheng, B. Zhuang, M. Pollefeys, A. Geiger, T.-J. Cham, and J. Cai. Mvsplat: Efficient
3d gaussian splatting from sparse multi-view images. arXiv preprint arXiv:2403.14627, 2024.
[18] Y.-C. Cheng, H.-Y. Lee, S. Tuyakov, A. Schwing, and L. Gui. SDFusion: Multimodal 3d shape completion,
reconstruction, and generation. In CVPR, 2023.
[19] J. Chung, S. Lee, H. Nam, J. Lee, and K. M. Lee. Luciddreamer: Domain-free generation of 3d gaussian
splatting scenes. arXiv preprint arXiv:2311.13384, 2023.
[20] B. Curless and M. Levoy. A volumetric method for building complex models from range images. In
Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, pages
303–312, 1996.
[21] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner. Scannet: Richly-annotated
3d reconstructions of indoor scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017.
[22] M. Deitke, R. Liu, M. Wallingford, H. Ngo, O. Michel, A. Kusupati, A. Fan, C. Laforte, V. Voleti, S. Y.
Gadre, et al. Objaverse-xl: A universe of 10m+ 3d objects. Advances in Neural Information Processing
Systems, 36, 2024.
[23] Y. Deng, J. Yang, J. Xiang, and X. Tong. Gram: Generative radiance manifolds for 3d-aware image
generation. In IEEE Computer Vision and Pattern Recognition, 2022.
[24] T. Devries, M. Á. Bautista, N. Srivastava, G. W. Taylor, and J. M. Susskind. Unconstrained scene
generation with locally conditioned radiance fields. 2021 IEEE/CVF International Conference on
Computer Vision (ICCV), pages 14284–14293, 2021.
[25] Y. Du, C. Smith, A. Tewari, and V. Sitzmann. Learning to render novel views from wide-baseline stereo
pairs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
[26] S. A. Eslami, D. J. Rezende, F. Besse, F. Viola, A. S. Morcos, M. Garnelo, A. Ruderman, A. A. Rusu,
I. Danihelka, K. Gregor, et al. Neural scene representation and rendering. Science, 360(6394):1204–1210, 2018.
[27] P. Esser, R. Rombach, and B. Ommer. Taming transformers for high-resolution image synthesis. In
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021.
[28] S. Fridovich-Keil, G. Meanti, F. R. Warburg, B. Recht, and A. Kanazawa. K-planes: Explicit radiance
fields in space, time, and appearance. In Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pages 12479–12488, 2023.
[29] L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima,
S. Presser, and C. Leahy. The Pile: An 800gb dataset of diverse text for language modeling. arXiv
preprint arXiv:2101.00027, 2020.
[30] R. Gao, A. Holynski, P. Henzler, A. Brussee, R. Martin-Brualla, P. Srinivasan, J. T. Barron, and B. Poole.
Cat3d: Create anything in 3d with multi-view diffusion models. arXiv:2405.10314, 2024.
[31] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio.
Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
[32] S. J. Gortler, R. Grzeszczuk, R. Szeliski, and M. F. Cohen. The lumigraph. In Proceedings of the 23rd
annual conference on Computer graphics and interactive techniques, pages 43–54, 1996.
[33] J. Gu, Q. Gao, S. Zhai, B. Chen, L. Liu, and J. Susskind. Learning controllable 3d diffusion models from
single-view images. ArXiv, 2023.
[34] A. Gupta, W. Xiong, Y. Nie, I. Jones, and B. Oğuz. 3dgen: Triplane latent diffusion for textured mesh
generation. arXiv preprint arXiv:2303.05371, 2023.
[35] R. Hartley and A. Zisserman. Multiple view geometry in computer vision. Cambridge university press, 2003.
[36] P. Henderson and V. Ferrari. Learning single-image 3D reconstruction by generative modelling of shape,
pose and shading. International Journal of Computer Vision (IJCV), 2019.
[37] P. Henderson, C. H. Lampert, and B. Bickel. Unsupervised video prediction from a single frame by
estimating 3d dynamic scene structure. CoRR, abs/2106.09051, 2021.
[38] P. Henderson, V. Tsiminaki, and C. Lampert. Leveraging 2D data to learn textured 3D mesh generation.
In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[39] P. Henzler, J. Reizenstein, P. Labatut, R. Shapovalov, T. Ritschel, A. Vedaldi, and D. Novotny. Unsupervised learning of 3d object categories from videos in the wild. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pages 4700–4709, 2021.
[40] J. Ho. Classifier-free diffusion guidance. ArXiv, abs/2207.12598, 2022.
[41] J. Ho, W. Chan, C. Saharia, J. Whang, R. Gao, A. Gritsenko, D. P. Kingma, B. Poole, M. Norouzi,
D. J. Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint
arXiv:2210.02303, 2022.
[42] J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information
Processing Systems, 33:6840–6851, 2020.
[43] L. Höllein, A. Božič, N. Müller, D. Novotny, H.-Y. Tseng, C. Richardt, M. Zollhöfer, and M. Nießner.
Viewdiff: 3d-consistent image generation with text-to-image models. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, 2024.
[44] H. Hu, Z. Zhou, V. Jampani, and S. Tulsiani. Mvd-fusion: Single-view 3d via depth-consistent multi-view
generation. In CVPR, 2024.
[45] K.-H. Hui, R. Li, J. Hu, and C.-W. Fu. Neural wavelet-domain diffusion for 3d shape generation. In
SIGGRAPH Asia 2022 Conference Papers, pages 1–9, 2022.
[46] A. Karnewar, A. Vedaldi, D. Novotny, and N. Mitra. Holodiffusion: Training a 3d diffusion model using
2d images. ArXiv, 2023.
[47] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401–
4410, 2019.
[48] B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis. 3d gaussian splatting for real-time radiance field
rendering. ACM Transactions on Graphics, 42(4):1–14, 2023.
[49] S. W. Kim, B. Brown, K. Yin, K. Kreis, K. Schwarz, D. Li, R. Rombach, A. Torralba, and S. Fidler.
Neuralfield-ldm: Scene generation with hierarchical latent diffusion models. In IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), 2023.
[50] D. P. Kingma and M. Welling. Auto-encoding variational bayes. In Y. Bengio and Y. LeCun, editors,
2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16,
2014, Conference Track Proceedings, 2014.
[51] X. Kong, S. Liu, X. Lyu, M. Taher, X. Qi, and A. J. Davison. Eschernet: A generative model for scalable
view synthesis. arXiv preprint arXiv:2402.03908, 2024.
[52] A. R. Kosiorek, H. Strathmann, D. Zoran, P. Moreno, R. Schneider, S. Mokrá, and D. J. Rezende. Nerf-vae:
A geometry aware 3d scene generative model. In M. Meila and T. Zhang, editors, Proceedings of the 38th
International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139
of Proceedings of Machine Learning Research, pages 5742–5752. PMLR, 2021.
[53] J. Kulhánek, E. Derner, T. Sattler, and R. Babuška. Viewformer: Nerf-free neural rendering from few
images using transformers. In European Conference on Computer Vision (ECCV), 2022.
[54] M. Li, Y. Duan, J. Zhou, and J. Lu. Diffusion-sdf: Text-to-shape via voxelized diffusion. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[55] Z. Li, T. Müller, A. Evans, R. H. Taylor, M. Unberath, M.-Y. Liu, and C.-H. Lin. Neuralangelo: High-fidelity neural surface reconstruction. In IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), 2023.
[56] C.-H. Lin, J. Gao, L. Tang, T. Takikawa, X. Zeng, X. Huang, K. Kreis, S. Fidler, M.-Y. Liu, and T.-Y.
Lin. Magic3d: High-resolution text-to-3d content creation. In IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), 2023.
[57] M. Liu, C. Xu, H. Jin, L. Chen, M. Varma T, Z. Xu, and H. Su. One-2-3-45: Any single image to 3d mesh
in 45 seconds without per-shape optimization. Advances in Neural Information Processing Systems, 36, 2024.
[58] R. Liu, R. Wu, B. V. Hoorick, P. Tokmakov, S. Zakharov, and C. Vondrick. Zero-1-to-3: Zero-shot one
image to 3d object. ArXiv, 2023.
[59] Y. Liu, C. Lin, Z. Zeng, X. Long, L. Liu, T. Komura, and W. Wang. Syncdreamer: Generating multiview-consistent images from a single-view image. In The Twelfth International Conference on Learning
Representations, 2024.
[60] Y. Liu, S. Peng, L. Liu, Q. Wang, P. Wang, C. Theobalt, X. Zhou, and W. Wang. Neural rays for
occlusion-aware image-based rendering. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 7824–7833, 2022.
[61] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In International Conference on
Learning Representations (ICLR), 2019.
[62] S. Luo and W. Hu. Diffusion probabilistic models for 3d point cloud generation. 2021 IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2021.
[63] L. Melas-Kyriazi, C. Rupprecht, I. Laina, and A. Vedaldi. Realfusion: 360° reconstruction of any object
from a single image. In Arxiv, 2023.
[64] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng. Nerf: Representing
scenes as neural radiance fields for view synthesis. In ECCV, 2020.
[65] N. Müller, Y. Siddiqui, L. Porzi, S. R. Bulo, P. Kontschieder, and M. Nießner. Diffrf: Rendering-guided
3d radiance field diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 4328–4338, 2023.
[66] T. Müller, A. Evans, C. Schied, and A. Keller. Instant neural graphics primitives with a multiresolution
hash encoding. ACM Trans. Graph., 41(4):102:1–102:15, July 2022.
[67] T. Nguyen-Phuoc, C. Li, L. Theis, C. Richardt, and Y.-L. Yang. Hologan: Unsupervised learning of
3d representations from natural images. In Proceedings of the IEEE/CVF International Conference on
Computer Vision, pages 7588–7597, 2019.
[68] T. Nguyen-Phuoc, C. Richardt, L. Mai, Y.-L. Yang, and N. Mitra. Blockgan: Learning 3d object-aware
scene representations from unlabelled images. In Advances in Neural Information Processing Systems 33,
Nov 2020.
[69] M. Niemeyer, J. T. Barron, B. Mildenhall, M. S. M. Sajjadi, A. Geiger, and N. Radwan. Regnerf:
Regularizing neural radiance fields for view synthesis from sparse inputs. 2022 IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR), Jun 2022.
[70] J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove. Deepsdf: Learning continuous signed
distance functions for shape representation. In Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition, pages 165–174, 2019.
[71] S. Peng, M. Niemeyer, L. Mescheder, M. Pollefeys, and A. Geiger. Convolutional occupancy networks.
In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020,
Proceedings, Part III 16, pages 523–540. Springer, 2020.
[72] B. Poole, A. Jain, J. T. Barron, and B. Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. ArXiv, 2022.
[73] A. Radford and K. Narasimhan. Improving language understanding by generative pre-training. OpenAI, 2018.
[74] J. Reizenstein, R. Shapovalov, P. Henzler, L. Sbordone, P. Labatut, and D. Novotny. Common objects
in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. In International
Conference on Computer Vision, 2021.
[75] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in
deep generative models. In International conference on machine learning, pages 1278–1286. PMLR, 2014.
[76] C. Rockwell, D. F. Fouhey, and J. Johnson. Pixelsynth: Generating a 3d-consistent experience from a
single image. In ICCV, 2021.
[77] B. Roessle, N. Müller, L. Porzi, S. R. Bulò, P. Kontschieder, and M. Nießner. Ganerf: Leveraging
discriminators to optimize neural radiance fields. ACM Trans. Graph., 42(6), nov 2023.
[78] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with
latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), pages 10684–10695, June 2022.
[79] R. Rombach, P. Esser, and B. Ommer. Geometry-free view synthesis: Transformers and no 3d priors.
ArXiv, 2021.
[80] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
[81] T. Salimans and J. Ho. Progressive distillation for fast sampling of diffusion models. In International
Conference on Learning Representations (ICLR), 2022.
[82] J. L. Schönberger and J.-M. Frahm. Structure-from-motion revisited. In Conference on Computer Vision
and Pattern Recognition (CVPR), 2016.
[83] C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta,
C. Mullis, M. Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation
image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
[84] K. Schwarz, Y. Liao, M. Niemeyer, and A. Geiger. GRAF: generative radiance fields for 3d-aware image
synthesis. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural
Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020,
NeurIPS 2020, December 6-12, 2020, virtual, 2020.
[85] K. Schwarz, S. Wook Kim, J. Gao, S. Fidler, A. Geiger, and K. Kreis. Wildfusion: Learning 3d-aware
latent diffusion models in view space. In International Conference on Learning Representations (ICLR), 2024.
[86] S. M. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski. A comparison and evaluation of
multi-view stereo reconstruction algorithms. In 2006 IEEE computer society conference on computer
vision and pattern recognition (CVPR’06), volume 1, pages 519–528. IEEE, 2006.
[87] S. M. Seitz and C. R. Dyer. Photorealistic scene reconstruction by voxel coloring. International journal
of computer vision, 35:151–173, 1999.
[88] Q. Shen, X. Yi, Z. Wu, P. Zhou, H. Zhang, S. Yan, and X. Wang. Gamba: Marry gaussian splatting with
mamba for single view 3d reconstruction. ArXiv, 2024.
[89] J. Shriram, A. Trevithick, L. Liu, and R. Ramamoorthi. RealmDreamer: Text-driven 3d scene generation
with inpainting and depth diffusion. ArXiv, 2024.
[90] J. R. Shue, E. R. Chan, R. Po, Z. Ankner, J. Wu, and G. Wetzstein. 3d neural field generation using
triplane diffusion. arXiv preprint arXiv:2211.16677, 2022.
[91] V. Sitzmann, J. Thies, F. Heide, M. Nießner, G. Wetzstein, and M. Zollhöfer. Deepvoxels: Learning
persistent 3d feature embeddings. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2019.
[92] I. Skorokhodov, S. Tulyakov, Y. Wang, and P. Wonka. Epigraf: Rethinking training of 3d gans. In
S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural
Information Processing Systems, volume 35, pages 24487–24501. Curran Associates, Inc., 2022.
[93] N. Snavely, S. M. Seitz, and R. Szeliski. Photo tourism: exploring photo collections in 3d. In ACM
siggraph 2006 papers, pages 835–846. 2006.
[94] J. Sohl-Dickstein, E. A. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using
nonequilibrium thermodynamics. CoRR, abs/1503.03585, 2015.
[95] J. Song, C. Meng, and S. Ermon. Denoising diffusion implicit models. In International Conference on
Learning Representations, 2021.
[96] S. Szymanowicz, C. Rupprecht, and A. Vedaldi. Viewset diffusion: (0-)image-conditioned 3D generative
models from 2D data. In ICCV, 2023.
[97] S. Szymanowicz, C. Rupprecht, and A. Vedaldi. Splatter image: Ultra-fast single-view 3d reconstruction.
Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[98] J. Tang, Z. Chen, X. Chen, T. Wang, G. Zeng, and Z. Liu. Lgm: Large multi-view gaussian model for
high-resolution 3d content creation. arXiv preprint arXiv:2402.05054, 2024.
[99] J. Tang, J. Ren, H. Zhou, Z. Liu, and G. Zeng. Dreamgaussian: Generative gaussian splatting for efficient
3d content creation. In The Twelfth International Conference on Learning Representations, 2024.
[100] S. Tang, J. Chen, D. Wang, C. Tang, F. Zhang, Y. Fan, V. Chandra, Y. Furukawa, and R. Ranjan.
Mvdiffusion++: A dense high-resolution multi-view diffusion model for single or sparse-view 3d object
reconstruction. arXiv preprint arXiv:2402.12712, 2024.
[101] S. Tang, F. Zhang, J. Chen, P. Wang, and Y. Furukawa. MVDiffusion: Enabling holistic multi-view image
generation with correspondence-aware diffusion. In Thirty-seventh Conference on Neural Information
Processing Systems, 2023.
[102] A. Tewari, T. Yin, G. Cazenavette, S. Rezchikov, J. B. Tenenbaum, F. Durand, W. T. Freeman, and
V. Sitzmann. Diffusion with forward models: Solving stochastic inverse problems without direct
supervision. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
[103] H.-Y. Tseng, Q. Li, C. Kim, S. Alsisan, J.-B. Huang, and J. Kopf. Consistent view synthesis with
pose-guided diffusion models. ArXiv, 2023.
[104] A. Vahdat, F. Williams, Z. Gojcic, O. Litany, S. Fidler, K. Kreis, et al. Lion: Latent point diffusion models
for 3d shape generation. Advances in Neural Information Processing Systems, 35:10021–10039, 2022.
[105] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior,
and K. Kavukcuoglu. Wavenet: A generative model for raw audio. In Arxiv, 2016.
[106] A. Van Den Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. In International
conference on machine learning, pages 1747–1756. PMLR, 2016.
[107] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. u. Kaiser, and I. Polosukhin.
Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30.
Curran Associates, Inc., 2017.
[108] H. Wang, X. Du, J. Li, R. A. Yeh, and G. Shakhnarovich. Score jacobian chaining: Lifting pretrained 2d
diffusion models for 3d generation. ArXiv, 2022.
[109] H. Wang, X. Du, J. Li, R. A. Yeh, and G. Shakhnarovich. Score jacobian chaining: Lifting pretrained 2d
diffusion models for 3d generation. In Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pages 12619–12629, 2023.
[110] Q. Wang, Z. Wang, K. Genova, P. Srinivasan, H. Zhou, J. T. Barron, R. Martin-Brualla, N. Snavely, and
T. Funkhouser. Ibrnet: Learning multi-view image-based rendering. In CVPR, 2021.
[111] T. Wang, B. Zhang, T. Zhang, S. Gu, J. Bao, T. Baltrusaitis, J. Shen, D. Chen, F. Wen, Q. Chen, and
B. Guo. Rodin: A generative model for sculpting 3d digital avatars using diffusion. ArXiv, 2022.
[112] Z. Wang, C. Lu, Y. Wang, F. Bao, C. Li, H. Su, and J. Zhu. Prolificdreamer: High-fidelity and diverse
text-to-3d generation with variational score distillation. Advances in Neural Information Processing
Systems, 36, 2024.
[113] D. Watson, W. Chan, R. M. Brualla, J. Ho, A. Tagliasacchi, and M. Norouzi. Novel view synthesis with
diffusion models. In The Eleventh International Conference on Learning Representations, 2023.
[114] C. Wewer, K. Raj, E. Ilg, B. Schiele, and J. E. Lenssen. latentsplat: Autoencoding variational gaussians
for fast generalizable 3d reconstruction. ArXiv, 2024.
[115] O. Wiles, G. Gkioxari, R. Szeliski, and J. Johnson. Synsin: End-to-end view synthesis from a single
image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages
7467–7477, 2020.
[116] C.-Y. Wu, J. Johnson, J. Malik, C. Feichtenhofer, and G. Gkioxari. Multiview compressive coding for 3d
reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pages 9065–9075, 2023.
[117] R. Wu, B. Mildenhall, P. Henzler, K. Park, R. Gao, D. Watson, P. P. Srinivasan, D. Verbin, J. T. Barron,
B. Poole, and A. Holynski. Reconfusion: 3d reconstruction with diffusion priors. ArXiv, 2023.
[118] T. Wu, J. Zhang, X. Fu, Y. Wang, J. Ren, L. Pan, W. Wu, L. Yang, J. Wang, C. Qian, D. Lin, and Z. Liu.
Omniobject3d: Large-vocabulary 3d object dataset for realistic perception, reconstruction and generation.
In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[119] J. Wynn and D. Turmukhambetov. DiffusioNeRF: Regularizing Neural Radiance Fields with Denoising
Diffusion Models. In CVPR, 2023.
[120] Y. Xie, T. Takikawa, S. Saito, O. Litany, S. Yan, N. Khan, F. Tombari, J. Tompkin, V. Sitzmann, and
S. Sridhar. Neural fields in visual computing and beyond. Computer Graphics Forum, 2022.
[121] Q. Xu, Z. Xu, J. Philip, S. Bi, Z. Shu, K. Sunkavalli, and U. Neumann. Point-nerf: Point-based neural
radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pages 5438–5448, 2022.
[122] Y. Xu, H. Tan, F. Luan, S. Bi, P. Wang, J. Li, Z. Shi, K. Sunkavalli, G. Wetzstein, Z. Xu, and K. Zhang.
DMV3d: Denoising multi-view diffusion using 3d large reconstruction model. In The Twelfth International
Conference on Learning Representations, 2024.
[123] T. Yi, J. Fang, J. Wang, G. Wu, L. Xie, X. Zhang, W. Liu, Q. Tian, and X. Wang. Gaussiandreamer: Fast
generation from text to 3d gaussians by bridging 2d and 3d diffusion models. In CVPR, 2024.
[124] Y. Xu, Z. Shi, W. Yifan, H. Chen, C. Yang, S. Peng, Y. Shen, and G. Wetzstein. Grm: Large
gaussian reconstruction model for efficient 3d reconstruction and generation. ArXiv, 2024.
[125] P. Yoo, J. Guo, Y. Matsuo, and S. S. Gu. Dreamsparse: Escaping from plato’s cave with 2d frozen
diffusion model given sparse views. CoRR, 2023.
[126] A. Yu, V. Ye, M. Tancik, and A. Kanazawa. pixelnerf: Neural radiance fields from one or few images. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4578–4587, 2021.
[127] J. J. Yu, F. Forghani, K. G. Derpanis, and M. A. Brubaker. Long-term photometric consistent novel view
synthesis with diffusion models. In Proceedings of the International Conference on Computer Vision
(ICCV), 2023.
[128] X. Yu, M. Xu, Y. Zhang, H. Liu, C. Ye, Y. Wu, Z. Yan, T. Liang, G. Chen, S. Cui, and X. Han. Mvimgnet:
A large-scale dataset of multi-view images. In CVPR, 2023.
[129] B. Zhang, Y. Cheng, J. Yang, C. Wang, F. Zhao, Y. Tang, D. Chen, and B. Guo. Gaussiancube: Structuring
gaussian splatting using optimal transport for 3d generative modeling. arXiv preprint arXiv:2403.19655, 2024.
[130] K. Zhang, S. Bi, H. Tan, Y. Xiangli, N. Zhao, K. Sunkavalli, and Z. Xu. Gs-lrm: Large reconstruction
model for 3d gaussian splatting. ArXiv, 2024.
[131] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep
features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 586–595, 2018.
[132] X. Zhao, F. Ma, D. Güera, Z. Ren, A. G. Schwing, and A. Colburn. Generative multiplane images:
Making a 2d gan 3d-aware. In Proc. ECCV, 2022.
[133] S. Zheng, B. Zhou, R. Shao, B. Liu, S. Zhang, L. Nie, and Y. Liu. Gps-gaussian: Generalizable pixel-wise 3d gaussian splatting for real-time human novel view synthesis. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[134] L. Zhou, Y. Du, and J. Wu. 3d shape generation and completion through point-voxel diffusion. In
Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5826–5835, 2021.
[135] T. Zhou, R. Tucker, J. Flynn, G. Fyffe, and N. Snavely. Stereo magnification: Learning view synthesis
using multiplane images. ACM Trans. Graph. (Proc. SIGGRAPH), 37, 2018.
[136] Z. Zhou and S. Tulsiani. Sparsefusion: Distilling view-conditioned diffusion for 3d reconstruction. In
CVPR, 2023.
[137] J. Zhu and P. Zhuang. Hifa: High-fidelity text-to-3d with advanced diffusion guidance. arXiv preprint
arXiv:2305.18766, 2023.
[138] Z.-X. Zou, W. Cheng, Y.-P. Cao, S.-S. Huang, Y. Shan, and S.-H. Zhang. Sparse3d: Distilling multiview-consistent diffusion for object reconstruction from sparse views. arXiv preprint arXiv:2308.14078, 2023.
[139] Z.-X. Zou, Z. Yu, Y.-C. Guo, Y. Li, D. Liang, Y.-P. Cao, and S.-H. Zhang. Triplane meets gaussian
splatting: Fast and generalizable single-view 3d reconstruction with transformers. ArXiv, 2023.