[Paper Reading] [Skim] Three Papers on Generative Autonomous-Driving Scene Generation: Bevstreet, DisCoScene, BerfScene

news2025/3/10 21:26:22


  • 1. Street-View Image Generation from a Bird’s-Eye View Layout
    • 1.1 Problem introduction
    • 1.2 Why
    • 1.3 How
    • 1.4 My takeaway
  • 2. DisCoScene: Spatially Disentangled Generative Radiance Fields for Controllable 3D-aware Scene Synthesis
    • 2.1 What
    • 2.2 Why
    • 2.3 How
    • 2.4 My takeaway
  • 3. BerfScene: Bev-conditioned Equivariant Radiance Fields for Infinite 3D Scene Generation (follow-up to DisCoScene)
    • 3.1 What
    • 3.2 Why
    • 3.3 How
    • 3.4 My takeaway

1. Street-View Image Generation from a Bird’s-Eye View Layout

1.1 Problem introduction

As the title suggests, this paper maps a BEV (Bird’s-Eye View) layout to street-view images.


Concretely, the input (BEV) is a two-dimensional, top-down representation of a three-dimensional environment. In the BEV diagram, squares of different colors represent different objects or road features, such as vehicles, pedestrians, and lane lines. The green square is the ego vehicle, which has three front-facing cameras.

The task is to generate three street-view images that are aligned with the BEV, i.e. consistent with the relative positions of these objects.

As for the concept of “layout”, the generated images have to satisfy the following constraints:

  • Cameras with an overlapping field-of-view (FoV) must show the shared content consistently.
  • The visual styling of the scene also needs to be consistent such that all virtual views appear to be created in the same geographical area (e.g., urban vs. rural), at the same time of day, with the same weather conditions, and so on.
  • In addition to this consistency, the images must correspond to the HD map, faithfully reproducing the specified road layout, lane lines, and vehicle locations.
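
To make the overlapping-FoV constraint concrete, here is a minimal Python sketch that checks which cameras can see an object placed in the BEV. The camera names, yaw angles, the 70-degree FoV, and the ego-frame convention (x forward, y left) are all hypothetical values chosen for illustration; none of them come from the paper.

```python
import math

# Hypothetical rig: name -> (yaw in degrees from the forward axis, horizontal FoV in degrees).
CAMERAS = {
    "front_left": (55.0, 70.0),
    "front": (0.0, 70.0),
    "front_right": (-55.0, 70.0),
}

def visible_cameras(obj_xy, cameras):
    """Return the names of cameras whose horizontal FoV contains the object.

    obj_xy is (x, y) in the ego frame, with x forward and y to the left
    (an assumed convention, not taken from the paper).
    """
    bearing = math.degrees(math.atan2(obj_xy[1], obj_xy[0]))
    seen = []
    for name, (yaw_deg, fov_deg) in cameras.items():
        # Wrap the angular difference into [-180, 180).
        diff = (bearing - yaw_deg + 180.0) % 360.0 - 180.0
        if abs(diff) <= fov_deg / 2.0:
            seen.append(name)
    return seen

straight_ahead = visible_cameras((10.0, 0.0), CAMERAS)
ahead_left = visible_cameras((10.0, 6.0), CAMERAS)
behind = visible_cameras((-10.0, 0.0), CAMERAS)
```

With these made-up numbers, an object at (10, 6) lies inside both the front and the front-left FoV, so both generated images would have to depict it consistently.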

1.2 Why

According to the authors, it is the first attempt to explore the generative side of BEV perception for driving scenes.

1.3 How

  1. Methods
    In the pipeline, the BEV layout and the source images are encoded and fed to an autoregressive transformer, together with direction and camera information that helps the model understand the spatial arrangement. The output is a new set of multi-view (mv) images.

  2. Experiments
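    Three metrics are used.
    FID measures the quality and diversity of the generated images by comparing their feature statistics with those of real images (lower is better). Road mIoU and Vehicle mIoU measure how well the roads and vehicles in the generated views overlap with the BEV input, which verifies that the relative positions are respected.
    Scene editing is achieved by changing the BEV layout.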
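
The two mIoU metrics can be sketched as follows. This is a minimal illustration of intersection-over-union on binary masks, assuming the predicted masks come from some BEV segmentation model applied to the generated images; that evaluation setup is my assumption and is not spelled out in these notes. The function names are hypothetical.

```python
def iou(pred, target):
    """Intersection-over-union of two binary masks given as nested lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly, so treat that case as IoU = 1.
    return inter / union if union else 1.0

def mean_iou(pairs):
    """Average IoU over a list of (pred, target) mask pairs."""
    scores = [iou(pred, target) for pred, target in pairs]
    return sum(scores) / len(scores)

pred_mask = [[1, 1],
             [0, 0]]
target_mask = [[1, 0],
               [1, 0]]
single = iou(pred_mask, target_mask)
average = mean_iou([(pred_mask, target_mask), (target_mask, target_mask)])
```

Here the two masks share one cell out of three occupied cells, so the IoU is 1/3; averaging with a perfect match gives 2/3.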

1.4 My takeaway

  1. How can we exploit the ability of an autoregressive transformer? Why use it rather than another kind of generative model?
  2. I now understand what BEV is.
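
As a starting point for the first question, the basic autoregressive decoding loop can be sketched like this. It assumes that images are represented as sequences of discrete tokens and that the model exposes a function returning next-token scores; `toy_logits` is a made-up stand-in for the transformer, and greedy decoding is used only to keep the example deterministic (real systems usually sample).

```python
def generate_tokens(condition, next_token_logits, num_tokens):
    """Generate tokens one at a time, each conditioned on the layout and the prefix.

    condition: conditioning tokens (e.g. an encoded BEV layout).
    next_token_logits: callable (condition, prefix) -> list of scores, one per token id.
    """
    tokens = []
    for _ in range(num_tokens):
        logits = next_token_logits(condition, tokens)
        # Greedy choice: pick the highest-scoring token id.
        best = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(best)
    return tokens

def toy_logits(condition, prefix, vocab_size=8):
    """Hypothetical stand-in for a transformer: favors one token id per step."""
    target = (sum(condition) + len(prefix)) % vocab_size
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]

generated = generate_tokens([1, 2], toy_logits, num_tokens=4)
```

The point of the loop is that every new token sees both the condition and everything generated so far, which is one plausible reason the approach suits cross-view consistency.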

2. DisCoScene: Spatially Disentangled Generative Radiance Fields for Controllable 3D-aware Scene Synthesis

2.1 What

DisCoScene is an editable 3D generative model that uses object bounding boxes, without semantic annotation, as a layout prior. It allows high-quality scene synthesis and flexible user control of both the camera and the scene objects.

2.2 Why

  • Existing generative models focus on individual objects, lacking the ability to handle non-trivial scenes.

  • Some works such as GSN can only generate whole scenes, without object-level editing, because NeRF has no explicit definition of objects.

  • GIRAFFE explicitly composites object-centric radiance fields to support object-level control. Yet, it works poorly on mixed scenes due to the absence of proper spatial priors.

  • Interesting references (numbered as in the paper):

    17: LayoutTransformer: Layout generation and completion with self-attention.

    26: LayoutGAN: Generating graphic layouts with wireframe discriminators.

    58: BlockPlanner: City block generation with vectorized graph representation.

2.3 How


Bounding boxes serve as layout priors for generating the objects, which are then combined with a generated background and passed to neural rendering. In addition, an extra object discriminator is added for local discrimination, which gives better object-level supervision.
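
One common way a bounding box can act as a spatial prior in volume rendering is to sample points only where a camera ray passes through the box. The sketch below shows the standard slab test for a ray against an axis-aligned box; whether DisCoScene samples exactly this way is my assumption, and the function names are hypothetical.

```python
def ray_aabb(origin, direction, box_min, box_max):
    """Slab test: return (t_near, t_far) where the ray is inside the box, or None on a miss."""
    t_near, t_far = 0.0, float("inf")
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if abs(d) < 1e-12:
            # Ray is parallel to this pair of planes: it must start between them.
            if o < lo or o > hi:
                return None
            continue
        t0, t1 = (lo - o) / d, (hi - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return None
    return t_near, t_far

def sample_depths(t_near, t_far, n):
    """Return n evenly spaced depths (bin midpoints) between t_near and t_far."""
    step = (t_far - t_near) / n
    return [t_near + (i + 0.5) * step for i in range(n)]

hit = ray_aabb((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, -1.0, -1.0), (4.0, 1.0, 1.0))
miss = ray_aabb((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 1.0, -1.0), (4.0, 3.0, 1.0))
depths = sample_depths(hit[0], hit[1], 4)
```

Rays that miss every box would only need the background field, which is what makes the box prior cheap as well as editable: moving a box moves the region where the object is queried.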

2.4 My takeaway

  1. Is it possible to drop the manually annotated bounding boxes and instead automatically identify and regenerate the corresponding regions in a Gaussian representation?

3. BerfScene: Bev-conditioned Equivariant Radiance Fields for Infinite 3D Scene Generation (follow-up to DisCoScene)

3.1 What

By incorporating an equivariant radiance field guided by a BEV map, this method can produce large-scale, even infinite-scale, 3D scenes by synthesizing local scenes and then stitching them together with smooth consistency.

It can be understood as the superposition of patches in a BEV.
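
A toy 1D example helps show why translation equivariance makes such stitching seamless. If features depend only on a local window of the BEV, then two overlapping patches yield identical features in their overlap, so they can be joined without a seam. This is only an illustration of the property with a hand-written filter; it is not BerfScene's network, and all names and values are made up.

```python
def local_features(bev_row, kernel=(1, 2, 1)):
    """Sliding-window filter: each output depends only on a local window of the input."""
    k = len(kernel)
    return [
        sum(bev_row[i + j] * kernel[j] for j in range(k))
        for i in range(len(bev_row) - k + 1)
    ]

# A long BEV row, processed as two overlapping local patches.
bev = [0, 1, 1, 0, 2, 2, 0, 0, 3, 1, 0, 1]
patch_a, patch_b = bev[0:8], bev[4:12]

feat_a = local_features(patch_a)  # covers global positions 0..5
feat_b = local_features(patch_b)  # covers global positions 4..9

# In the overlap (global positions 4 and 5) both patches give the same features.
overlap_agrees = feat_a[4:6] == feat_b[0:2]

# Stitching the patches reproduces what processing the whole row at once would give.
stitched = feat_a + feat_b[2:]
matches_full = stitched == local_features(bev)
```

Because the overlap agrees exactly, the scene can be extended patch by patch in any direction, which is the intuition behind the "infinite" generation claim.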

3.2 Why

  1. Existing 3D object synthesis techniques cannot simply be applied to large-scale 3D scenes, since 3D scenes usually have complex spatial configurations and consist of many objects at varying scales.
  2. Previous approaches often rely on scene graphs, whose unstructured topology limits how they can be processed.
  3. DisCoScene introduces complexity in interpreting the entire scene and faces scalability challenges when using bounding boxes.
  4. BEV maps can clearly specify the composition and scales of objects but say little about their detailed visual appearance. Recent attempts like InfiniCity and SceneDreamer try to avoid the ambiguity of BEV maps, but they are inefficient.

3.3 How


To integrate the prior information provided by the BEV map into the radiance field, the authors introduce a generator U, which produces a 2D feature map conditioned on the BEV map. The generator U combines a U-Net architecture with StyleGAN blocks.

3.4 My takeaway

  1. I am still confused about how this U-Net is used and need more time to fill in the background knowledge. 🤡




