End-to-end Autonomous Driving: Challenges and Frontiers

Abstract

The autonomous driving community has witnessed a rapid growth in approaches that embrace an end-to-end algorithm framework, utilizing raw sensor input to generate vehicle motion plans, instead of concentrating on individual tasks such as detection and motion prediction. End-to-end systems, in comparison to modular pipelines, benefit from joint feature optimization for perception and planning. This field has flourished due to the availability of large-scale datasets, closed-loop evaluation, and the increasing need for autonomous driving algorithms to perform effectively in challenging scenarios. In this survey, we provide a comprehensive analysis of more than 270 papers, covering the motivation, roadmap, methodology, challenges, and future trends in end-to-end autonomous driving. We delve into several critical challenges, including multi-modality, interpretability, causal confusion, robustness, and world models, amongst others. Additionally, we discuss current advancements in foundation models and visual pre-training, as well as how to incorporate these techniques within the end-to-end driving framework. We maintain an active repository that contains up-to-date literature and open-source projects at https://github.com/OpenDriveLab/End-to-end-Autonomous-Driving.
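This passage is an overview of the end-to-end autonomous driving field: it outlines the field's research progress, challenges, and future trends. The linked GitHub repository is maintained by OpenDriveLab and is used to share literature and open-source projects in this area.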

1 INTRODUCTION

Conventional autonomous driving systems adopt a modular design strategy, wherein each functionality, such as perception, prediction, and planning, is individually developed and integrated into onboard vehicles. The planning or control module, responsible for generating steering and acceleration outputs, plays a crucial role in determining the driving experience. The most common approach for planning in modular pipelines involves using sophisticated rule-based designs, which are often ineffective in addressing the vast number of situations that occur on the road. Therefore, there is a growing trend to leverage large-scale data and to use learning-based planning as a viable alternative.
We define end-to-end autonomous driving systems as fully differentiable programs that take raw sensor data as input and produce a plan and/or low-level control actions as output. Fig. 1 (a)-(b) illustrates the difference between the classical and end-to-end formulation. The conventional approach feeds the output of each component, such as bounding boxes and vehicle trajectories, directly into subsequent units (dashed arrows). In contrast, the end-to-end paradigm propagates feature representations across components (gray solid arrow). The optimized function is set to be, for example, the planning performance, and the loss is minimized via back-propagation (red arrow). Tasks are jointly and globally optimized in this process.
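Fig. 1: Survey at a glance. (a) Pipeline and methods. We define end-to-end autonomous driving as a learning-based algorithm framework with raw sensor input and planning/control output. We examine more than 270 papers and categorize them into imitation learning (IL) and reinforcement learning (RL). (b) Benchmarking. We group popular benchmarks into closed-loop and open-loop evaluation. We cover various aspects of closed-loop simulation and the limitations of open-loop evaluation for this problem. (c) Challenges. This is the main part of our work. We list key challenges from a wide range of topics and analyze extensively why these concerns are crucial. Promising solutions to these challenges are covered as well. (d) Future trends. We discuss how the end-to-end paradigm could benefit from rapidly developing areas such as foundation models and visual pre-training. Some photos are courtesy of online resources.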
In this survey, we conduct an extensive review of this emerging topic. Fig. 1 provides an overview of our work. We begin by discussing the motivation and roadmap for end-to-end autonomous driving systems. End-to-end approaches can be broadly classified into imitation and reinforcement learning, and we give a brief review of these methodologies.
We cover datasets and benchmarks for both closed- and open-loop evaluation. We summarize a series of critical challenges, including interpretability, generalization, world models, causal confusion, etc. We conclude by discussing future trends that we think should be embraced by the community to incorporate the latest developments from data engines and large foundation models, amongst others. Note that this review is mainly orchestrated from a theoretical perspective. Engineering efforts such as version control, unit testing, data servers, data cleaning, software-hardware co-design, etc., play crucial roles in deploying the end-to-end technology. Publicly available information regarding the latest practices on these topics is limited. We invite the community towards more openness in future discussions.

1.1 Motivation of an End-to-end System

In the classical pipeline, each model serves as a standalone component and corresponds to a specific task (e.g., traffic light detection). Such a design is beneficial in terms of interpretability and ease of debugging. However, since the optimization objectives across modules are different, with detection pursuing mean average precision (mAP) while planning aims for driving safety and comfort, the entire system may not be aligned with a unified target, i.e., the ultimate planning/control task. Errors from each module, as the sequential procedure proceeds, could be compounded and result in information loss. Moreover, compared to one end-to-end neural network, a multi-task, multi-model deployment that involves multiple encoders and message transmission systems may increase the computational burden and potentially lead to sub-optimal use of compute.
In contrast to its classical counterpart, an end-to-end autonomous system offers several advantages. (a) The most apparent merit is its simplicity in combining perception, prediction, and planning into a single model that can be jointly trained. (b) The whole system, including its intermediate representations, is optimized towards the ultimate task. (c) Shared backbones increase computational efficiency. (d) Data-driven optimization has the potential to improve the system by simply scaling training resources.
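Compared with conventional autonomous driving systems, an end-to-end system offers several notable advantages:
(a) The most apparent merit is that it combines perception, prediction, and planning into a single model that can be trained jointly. This integration reduces the complexity of the interfaces between modules and simplifies development and maintenance.
(b) The whole system, including its intermediate representations, is optimized towards the ultimate task. Every part of the system works towards the shared goal of better planning or control performance.
(c) Shared backbones increase computational efficiency. Feature extraction and representation learning are shared across the network, which avoids repeated computation and makes better use of resources.
(d) Data-driven optimization has the potential to improve the system by simply scaling training resources. As data volume and compute grow, an end-to-end system can use more training data to improve its performance and generalization.
With these advantages, end-to-end autonomous driving systems aim to be more efficient, more capable, and better able to adapt to complex driving environments.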
Note that the end-to-end paradigm does not necessarily indicate one black box with only planning/control outputs. It could have intermediate representations and outputs (Fig. 1 (b)) as in classical approaches. In fact, several state-of-the-art systems [1, 2] propose a modular design but optimize all components together to achieve superior performance.

1.2 Roadmap

Fig. 2 depicts a chronological roadmap of critical achievements in end-to-end autonomous driving, where each part indicates an essential paradigm shift or performance boost. The history of end-to-end autonomous driving dates back to 1988 with ALVINN [3], where the input was two "retinas" from a camera and a laser range finder, and a simple neural network generated steering output. NVIDIA designed a prototype end-to-end CNN system, which reestablished this idea in the new era of GPU computing [8]. Notable progress has been achieved with the development of deep neural networks, both in imitation learning [15, 16] and reinforcement learning [4, 17, 18, 19]. The policy distillation paradigm proposed in LBC [5] and related approaches [20, 21, 22, 23] has significantly improved closed-loop performance by mimicking a well-behaved expert. To improve generalization in the face of the discrepancy between the expert and the learned policy, several papers [10, 24, 25] have proposed aggregating on-policy data [26] during training.
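Fig. 2: Roadmap of end-to-end autonomous driving. We present the key milestones chronologically, grouping similar works under the same theme. The representative or first work of each theme is shown in bold with an illustration, while the dates of the remaining works under the same theme may vary. We also show the score of each year's top entry on the CARLA leaderboard [13] (DS, ranging from 0 to 100) and on the recent nuPlan challenge [14] (score ranging from 0 to 1).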
A significant turning point occurred around 2021. With diverse sensor configurations available within a reasonable computational budget, attention was focused on incorporating more modalities and advanced architectures (e.g., Transformers [27]) to capture global context and representative features, as in TransFuser [6, 28] and many variants [29, 30, 31]. Combined with more insights about the simulation environment, these advanced designs resulted in a substantial performance boost on the CARLA benchmark [13]. To improve the interpretability and safety of autonomous systems, approaches [11, 32, 33] explicitly involve various auxiliary modules to better supervise the learning process or utilize attention visualization. Recent works prioritize generating safety-critical data [7, 34, 35], pre-training a foundation model or backbone curated for policy learning [12, 36, 37], and advocating a modular end-to-end planning philosophy [1, 2, 38, 39]. Meanwhile, the new and challenging CARLA v2 [13] and nuPlan [14] benchmarks have been introduced to facilitate research into this area.

1.3 Comparison to Related Surveys

We would like to clarify the difference between our survey and previous related surveys [40, 41, 42, 43, 44, 45, 46, 47, 48]. Some prior surveys [40, 41, 42, 43] cover content similar to ours in the sense of an end-to-end system. However, they do not cover new benchmarks and approaches that arose with the significant recent transition in the field, and place a minor emphasis on frontiers and challenges. The others focus on specific topics in this domain, such as imitation learning [44, 45, 46] or reinforcement learning [47, 48]. In contrast, our survey provides up-to-date information on the latest developments in this field, covering a wide span of topics and providing in-depth discussions of critical challenges.

1.4 Contributions

To summarize, this survey has three key contributions: (a) We provide a comprehensive analysis of end-to-end autonomous driving for the first time, including high-level motivation, methodologies, benchmarks, and more. Instead of optimizing a single block, we advocate for a philosophy to design the algorithm framework as a whole, with the ultimate target of achieving safe and comfortable driving. (b) We extensively investigate the critical challenges that concurrent approaches face. Out of the more than 250 papers surveyed, we summarize major aspects and provide in-depth analysis, including topics on generalizability, language-guided learning, causal confusion, etc. (c) We cover the broader impact of how to embrace large foundation models and data engines. We believe that this line of research and the large scale of high-quality data it provides could significantly advance this field. To facilitate future research, we maintain an active repository updated with new literature and open-source projects.
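Through its comprehensive and in-depth analysis, this survey offers researchers in end-to-end autonomous driving a useful resource. By emphasizing a holistic design philosophy, an analysis of challenges, and future trends, it both summarizes current progress and points to directions for future research. The actively maintained repository also gives the community a continuously updated resource that supports the sharing of knowledge.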

2 METHODS

This section reviews fundamental principles behind most existing end-to-end self-driving approaches. Sec. 2.1 discusses methods using imitation learning and provides details on the two most popular sub-categories, namely behavior cloning and inverse optimal control. Sec. 2.2 summarizes methods that follow the reinforcement learning paradigm.

2.1 Imitation Learning

Imitation learning (IL), also referred to as learning from demonstrations, trains an agent to learn the policy by imitating the behavior of an expert. IL requires a dataset $D = \{\xi_i\}$ containing trajectories collected under the expert’s policy $\pi_\beta$, where each trajectory is a sequence of state-action pairs. The goal of IL is to learn an agent policy $\pi$ that matches $\pi_\beta$.
The policy $\pi$ can output planned trajectories or control signals. Early works usually adopt control outputs, due to the ease of collection. However, predicting controls at different steps could lead to discontinuous maneuvers, and the network inherently specializes to the vehicle dynamics, which hinders generalization to other vehicles. Another line of work predicts waypoints, which cover a relatively longer time horizon. However, converting the trajectory that the vehicle should track into control signals requires an additional controller, which is non-trivial and involves vehicle models and control algorithms. Since no clear performance gap has been observed between these two paradigms, we do not differentiate them explicitly in this survey. An interesting and more in-depth discussion can be found in [22].
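To make the extra controller step concrete, here is a minimal sketch of how predicted waypoints might be turned into control signals. It swaps in a textbook pure-pursuit steering law on a kinematic bicycle model plus a proportional speed controller; the survey does not prescribe this controller, and the function names, the wheelbase, the gain, and the ego-frame convention (x forward, y left) are all assumptions made for illustration.

```python
import math
from typing import List, Tuple


def pure_pursuit_steer(wx: float, wy: float, wheelbase: float = 2.7) -> float:
    """Steering angle (rad) that arcs a kinematic bicycle through the waypoint.

    The waypoint is given in the ego frame (assumed: x forward, y left).
    """
    lookahead = math.hypot(wx, wy)
    if lookahead == 0.0:
        return 0.0
    alpha = math.atan2(wy, wx)  # heading error towards the waypoint
    # Pure-pursuit law: delta = atan(2 * L * sin(alpha) / lookahead)
    return math.atan2(2.0 * wheelbase * math.sin(alpha), lookahead)


def speed_command(target_speed: float, current_speed: float, kp: float = 0.5) -> float:
    """Proportional controller that returns throttle/brake clipped to [-1, 1]."""
    return max(-1.0, min(1.0, kp * (target_speed - current_speed)))


def waypoints_to_control(
    waypoints: List[Tuple[float, float]], current_speed: float, dt: float = 0.5
) -> Tuple[float, float]:
    """Map predicted waypoints (spaced dt seconds apart) to (steer, accel)."""
    wx, wy = waypoints[0]
    # Desired speed implied by the distance to the first waypoint (hypothetical choice).
    target_speed = math.hypot(wx, wy) / dt
    return pure_pursuit_steer(wx, wy), speed_command(target_speed, current_speed)


steer, accel = waypoints_to_control([(5.0, 0.0)], current_speed=10.0)
```

Even in this toy form, the controller needs a vehicle parameter (the wheelbase) and tuned gains, which is the dependence on vehicle models and control algorithms that the paragraph above refers to.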
One widely used category of IL is behavior cloning (BC) [49], which reduces the problem to supervised learning. Inverse Optimal Control (IOC), also known as Inverse Reinforcement Learning (IRL) [50], is another type of IL method that utilizes expert demonstrations to learn a reward function. We elaborate on these two categories below.
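**Imitation learning (IL)** trains an agent by observing expert behavior. Its two widely used categories are summarized below:

Behavior Cloning (BC):
- Definition: BC reduces the imitation learning problem to supervised learning. It trains a policy to imitate the expert's decisions, so that the agent outputs actions similar to the expert's in a given state.
- Method: The agent is trained to reproduce the expert's behavior in each state, typically by using state-action pairs as input-output pairs for a neural network or another machine learning model.
- Strengths: BC is simple and intuitive, easy to implement, and learns quickly from expert data.
- Weaknesses: It can overfit, relying too heavily on the training data and generalizing poorly to unseen situations. BC also involves no exploration of the environment, so it may fail to learn behaviors that the demonstrations do not cover.
Inverse Optimal Control (IOC), also known as Inverse Reinforcement Learning (IRL):
- Definition: IOC infers the underlying optimization objective, or reward function, by analyzing the expert's behavior. It assumes that the expert acts to maximize some unknown reward function.
- Method: By observing the expert's decisions, IOC searches for a reward function that explains them. This reward function can then be used to train an agent that imitates the expert.
- Strengths: IOC offers insight into the motivation behind the expert's decisions and can handle more complex tasks in which a direct mapping from state to action is not intuitive.
- Weaknesses: IOC may need a large amount of expert data to infer the reward function accurately, and for some tasks finding a suitable reward function is very challenging.

Each method has its own strengths and weaknesses, and the choice depends on the application scenario and the expert data available. In practice, researchers pick the imitation learning method that best fits the requirements of the task and the characteristics of the data.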

2.1.1 Behavior Cloning

In BC, matching the agent’s policy with the expert’s is accomplished by minimizing the planning loss as supervised learning over the collected dataset: $\mathbb{E}_{(s,a)}\, \ell(\pi_\theta(s), a)$. Here, $\ell(\pi_\theta(s), a)$ represents a loss function that measures the distance between the agent action and the expert action.
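The BC objective can be illustrated with a deliberately tiny example. The sketch below assumes a one-dimensional state (a lateral offset), a scalar linear policy $\pi_\theta(s) = \theta s$, a squared-error loss for $\ell$, and a made-up expert gain; none of these choices come from the survey, and real systems use deep networks and image or LiDAR inputs instead.

```python
from typing import List, Tuple

# Toy expert dataset of (state, action) pairs. The state is a lateral offset
# from the lane centre and the action is a steering command (both hypothetical).
EXPERT_GAIN = -0.5
dataset: List[Tuple[float, float]] = [
    (s, EXPERT_GAIN * s) for s in (-2.0, -1.0, 0.0, 1.0, 2.0)
]


def policy(theta: float, state: float) -> float:
    """Linear agent policy pi_theta(s) = theta * s."""
    return theta * state


def bc_loss(theta: float, data: List[Tuple[float, float]]) -> float:
    """Empirical mean of l(pi_theta(s), a), with l chosen as squared error."""
    return sum((policy(theta, s) - a) ** 2 for s, a in data) / len(data)


def bc_grad(theta: float, data: List[Tuple[float, float]]) -> float:
    """Derivative of bc_loss with respect to theta."""
    return sum(2.0 * (policy(theta, s) - a) * s for s, a in data) / len(data)


def fit_bc(data: List[Tuple[float, float]], lr: float = 0.1, steps: int = 200) -> float:
    """Plain gradient descent on the BC loss, starting from theta = 0."""
    theta = 0.0
    for _ in range(steps):
        theta -= lr * bc_grad(theta, data)
    return theta


theta_hat = fit_bc(dataset)
```

Because the expert in this toy is itself linear, the fitted parameter recovers the expert gain; the point is only that BC is ordinary supervised regression on expert state-action pairs.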
Early applications of BC for driving [3, 8, 51] utilized an end-to-end neural network to generate control signals from camera inputs. Further enhancements, such as multi-sensor inputs [6, 52], auxiliary tasks [16, 28], and improved expert design [21], have been proposed to enable BC-based end-to-end driving models to handle challenging urban scenarios.
BC is advantageous due to its simplicity and efficiency, as it does not require hand-crafted reward design, which is crucial for RL. However, there are some common issues. During training, it treats each state as independently and identically distributed, resulting in an important problem known as covariate shift. For general IL, several on-policy methods have been proposed to address this issue [26, 53, 54, 55]. In the context of end-to-end autonomous driving, DAgger [26] has been adopted in [5, 10, 25, 56]. Another common problem with BC is causal confusion, where the imitator exploits and relies on false correlations between certain input components and output signals. This issue has been discussed in the context of end-to-end autonomous driving in [57, 58, 59, 60]. These two challenging problems are further discussed in Sec. 4.9 and Sec. 4.8, respectively.
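BC is advantageous because it is simple and efficient: unlike reinforcement learning (RL), it does not require a hand-crafted reward function. It nevertheless has some common problems:

Covariate shift: During training, BC treats each state as independently and identically distributed, which gives rise to covariate shift. The distribution of states seen in training differs from the one the policy encounters when it runs, so performance degrades at deployment.
Addressing covariate shift: Several on-policy methods have been proposed for this problem [26, 53, 54, 55]. In end-to-end autonomous driving, DAgger [26] has been adopted in [5, 10, 25, 56]. DAgger repeatedly rolls out the learned policy, has the expert label the states that the policy visits, and aggregates these state-action pairs into the training set to reduce covariate shift.
Causal confusion: Another common problem with BC is causal confusion, where the imitator exploits and relies on spurious correlations between certain input components and the output signals. The model can then learn a wrong causal relationship rather than the actual control strategy.
Discussion of causal confusion: In end-to-end autonomous driving, causal confusion has been discussed in [57, 58, 59, 60]. A typical symptom is a model that ties actions to incidental features of the scene instead of learning how to act on the state of the environment.

These two challenging problems are discussed further in Sec. 4.9 and Sec. 4.8, respectively.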
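The data-aggregation idea behind DAgger [26] can be sketched on a toy problem. The example below is a simplified variant under stated assumptions: the state is a scalar lateral offset, the dynamics simply add the action to the state, the expert is a saturated proportional controller, the learner is a linear policy fitted by least squares, and after the first iteration the learner alone drives the rollouts (the original algorithm mixes expert and learner with a decaying probability). All names and constants are hypothetical.

```python
from typing import List, Tuple


def expert_action(x: float) -> float:
    """Hypothetical expert: proportional steering, saturated at +/-0.3."""
    return max(-0.3, min(0.3, -0.5 * x))


def rollout(theta: float, x0: float, horizon: int, use_expert: bool) -> List[float]:
    """Return the states visited when driving with the expert or with the learner."""
    states, x = [], x0
    for _ in range(horizon):
        states.append(x)
        a = expert_action(x) if use_expert else theta * x
        x = x + a  # toy dynamics: the action shifts the lateral offset
    return states


def fit(data: List[Tuple[float, float]]) -> float:
    """Least-squares fit of the linear policy a = theta * x."""
    num = sum(s * a for s, a in data)
    den = sum(s * s for s, _ in data)
    return num / den


def dagger(x0: float = 2.0, horizon: int = 10, n_iters: int = 5):
    # Iteration 0: plain behavior cloning on states visited by the expert.
    data = [(s, expert_action(s)) for s in rollout(0.0, x0, horizon, use_expert=True)]
    theta = fit(data)
    for _ in range(n_iters):
        # Roll out the learner, relabel the states it visits with expert
        # actions, aggregate them into the dataset, and refit.
        visited = rollout(theta, x0, horizon, use_expert=False)
        data += [(s, expert_action(s)) for s in visited]
        theta = fit(data)
    return theta, data


theta, data = dagger()
```

The key difference from plain BC is where the training states come from: after the first iteration they are states the learner itself reaches, so the training distribution moves towards the distribution seen at deployment.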

2.1.2 Inverse Optimal Control

Traditional IOC algorithms learn an unknown reward function $R(s, a)$ from expert demonstrations, where the expert’s reward function can be represented as a linear combination of features [50, 61, 62, 63, 64]. However, in continuous, high-dimensional autonomous driving scenarios, the definition of the reward is implicit and difficult to optimize.
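The core idea of inverse optimal control is to infer the expert's underlying reward function by observing the expert's behavior. Some key points:

Linear combination of features: In many cases, the expert's reward function is assumed to be a linear combination of features that capture key aspects of the task, such as speed, safety, and efficiency.
Implicit reward: In complex tasks such as autonomous driving, the reward function is hard to define explicitly, because many factors, such as vehicle dynamics, traffic rules, and road conditions, influence the decision process.
Difficulty of optimization: In high-dimensional spaces, the reward function may involve complex non-linear relationships, which makes it hard to learn from expert demonstrations. The reward may also behave differently in different regions of the state and action space.
Feature selection and engineering: To address these problems, researchers may need to select and engineer the features that most influence the expert's decisions and build an effective reward function from them.
Generalization and scalability: In high-dimensional, continuous driving tasks, the learned reward function needs to generalize well across driving scenarios and conditions.
Alternatives: Because learning a reward function directly from expert demonstrations is challenging, some researchers instead learn the policy directly with imitation learning, or combine it with reinforcement learning to refine the policy further.

Applying inverse optimal control to autonomous driving requires addressing these challenges in order to learn effective policies. Researchers continue to explore new methods that make reward learning more accurate and efficient.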
Generative adversarial imitation learning [65, 66, 67] is a specialized approach in IOC that designs the reward function as an adversarial objective to distinguish the expert and learned policies, similar to the concept of generative adversarial networks [68]. Recently, several works propose optimizing a cost volume or cost function with auxiliary perceptual tasks. Since a cost is an alternative representation of the reward, we classify these methods as belonging to the IOC domain. We define the cost learning framework as follows: end-to-end approaches learn a reasonable cost $c(\cdot)$ and use algorithmic trajectory samplers to select the trajectory $\tau^*$ with the minimum cost, as illustrated in Fig. 3.
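The cost learning framework can be explained further as follows:

Generative adversarial imitation learning (GAIL): GAIL trains the imitator through an adversarial process so that its behavior is as close as possible to the expert's. The generator (the imitator) produces behavior, while the discriminator, a learned classifier, tries to distinguish the generator's behavior from real expert demonstrations.
Cost function: In autonomous driving and other decision-making tasks, a cost function quantifies the penalty or disadvantage of a particular behavior. In contrast to a reward, a cost usually represents something to be avoided or minimized.
Cost volume: In some methods, the cost is defined over multiple future time steps, forming a cost volume in which cost can be accumulated over several time steps.
Trajectory sampler: An algorithmic trajectory sampler proposes candidate trajectories. In the cost learning framework, the candidate with the lowest cost, $\tau^*$, is selected as the output.
End-to-end learning: End-to-end methods learn the whole decision process directly from input to output, without hand-designed intermediate features or reward functions.

The advantage of this framework is that the cost function is learned automatically from expert data and then used to select the output trajectory, which makes imitation learning more flexible. The approach also faces challenges, such as ensuring that the learned cost function actually guides policy learning and handling the instability that can arise in adversarial training.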
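The sample-then-select step of the cost learning framework can be sketched as follows. The cost here is a hand-written stand-in for a learned $c(\cdot)$ (a lane-centre penalty plus an obstacle proximity penalty), and the sampler simply drifts linearly to a few lateral targets; both are assumptions for illustration, as are all function names and constants.

```python
from typing import Callable, List, Sequence, Tuple

Waypoint = Tuple[float, float]  # (x forward, y lateral), in metres
Trajectory = List[Waypoint]


def sample_trajectories(
    lateral_targets: Sequence[float], horizon: int = 5, step: float = 2.0
) -> List[Trajectory]:
    """Candidate trajectories that drift linearly to each lateral target."""
    return [
        [(step * (t + 1), y_end * (t + 1) / horizon) for t in range(horizon)]
        for y_end in lateral_targets
    ]


def make_cost(obstacle: Waypoint, safe_dist: float = 1.5) -> Callable[[Trajectory], float]:
    """Stand-in for a learned cost c(.): lane deviation plus obstacle proximity."""

    def cost(traj: Trajectory) -> float:
        total = 0.0
        for x, y in traj:
            total += 0.1 * y * y  # penalty for leaving the lane centre
            d = ((x - obstacle[0]) ** 2 + (y - obstacle[1]) ** 2) ** 0.5
            if d < safe_dist:
                total += 10.0 * (safe_dist - d)  # penalty for getting too close
        return total

    return cost


def select_trajectory(trajs: List[Trajectory], cost: Callable[[Trajectory], float]) -> Trajectory:
    """Return tau*, the sampled trajectory with the minimum cost."""
    return min(trajs, key=cost)


candidates = sample_trajectories([-2.0, -1.0, 0.0, 1.0, 2.0])
cost = make_cost(obstacle=(8.0, -0.5))
best = select_trajectory(candidates, cost)
```

In the methods surveyed, the cost would come from a network (for example a cost volume over BEV cells) rather than a closed-form expression, but the selection step has the same argmin structure.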
Regarding cost design, it has representations including a learned cost volume in a bird’s-eye-view (BEV) [32], joint energy calculated from other agents’ future motion [69], or a set of probabilistic semantic occupancy or freespace layers [39, 70, 71]. On the other hand, trajectories are typically sampled from a fixed expert trajectory set [1] or processed by parameter sampling with a kinematic model [32, 38, 39, 70]. Then, a max-margin loss is adopted as in classic IOC methods to encourage the expert demonstration to have a minimal cost while others have high costs.
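Cost design uses several different representations:

Learned cost volume: A cost volume represented in bird's-eye view (BEV) is a common choice [32]. It captures the environment around the vehicle and is used to evaluate the cost of different trajectories.
Joint energy: A joint energy computed from the future motion of other agents, which accounts for interactions in a multi-agent system [69].
Probabilistic semantic occupancy or freespace layers: A set of probabilistic semantic occupancy or freespace layers that provide semantic information about the environment, such as which regions are drivable and which are occupied by obstacles [39, 70, 71].

Trajectories are typically sampled in one of the following ways:

Fixed expert trajectory set: Trajectories are sampled from a fixed set of expert trajectories [1], so the system chooses among predefined expert-level trajectories.
Parameter sampling with a kinematic model: Trajectories are generated by sampling parameters and passing them through a kinematic model [32, 38, 39, 70]. This lets the system explore different trajectories within a parameter space and evaluate their costs.

For the loss function:

Max-margin loss: A max-margin loss is adopted, as in classic IOC methods. It encourages the expert demonstration to have a minimal cost while other trajectories have high costs.

The max-margin loss shapes the learned cost by enlarging the cost margin between the expert trajectory and non-expert trajectories, so that the expert-like trajectory is the one selected at inference time.
With these components, the cost learning framework gives an autonomous driving system a structured way to evaluate and select trajectories while accounting for the complexity and dynamics of the environment.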
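A max-margin loss of this kind can be written in a few lines. The sketch below assumes a hinge formulation in which each sampled trajectory should cost at least a margin more than the expert, with the margin taken to be the mean waypoint distance to the expert trajectory; the exact losses in the cited works may include additional terms, and the helper names here are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Trajectory = Sequence[Tuple[float, float]]


def trajectory_distance(a: Trajectory, b: Trajectory) -> float:
    """Mean Euclidean distance between matching waypoints (used as the margin)."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)


def max_margin_loss(
    expert_cost: float, sample_costs: List[float], margins: List[float]
) -> float:
    """Hinge loss on the worst violation of c(sample) >= c(expert) + margin."""
    violations = [
        max(0.0, expert_cost - c + m) for c, m in zip(sample_costs, margins)
    ]
    return max(violations, default=0.0)


expert = [(1.0, 0.0), (2.0, 0.0)]
samples = [[(1.0, 1.0), (2.0, 2.0)], [(1.0, 0.0), (2.0, 0.5)]]
margins = [trajectory_distance(expert, s) for s in samples]

# The second sample is close to the expert but only slightly more expensive,
# so it violates its (small) margin and produces a positive loss.
loss = max_margin_loss(expert_cost=1.0, sample_costs=[3.0, 1.1], margins=margins)
```

Scaling the margin with the distance to the expert means that trajectories far from the demonstration must be much more expensive, while near-duplicates of the expert are only lightly penalized.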
Several challenges exist with cost learning approaches. In particular, in order to generate more realistic costs, HD maps, auxiliary perception tasks, and multiple sensors are typically incorporated, which increases the difficulty of learning and constructing datasets for multi-modal multi-task frameworks. Nevertheless, the aforementioned cost learning methods significantly enhance the safety and interpretability of decisions (see Sec. 4.6), and we believe that the industry-inspired end-to-end system design is a viable approach for real-world applications.

2.2 Reinforcement Learning

Reinforcement learning (RL) [72, 73] is a field of learning by trial and error. The success of deep Q networks (DQN) [74] in achieving human-level control on the Atari benchmark [75] has popularized deep RL. DQN trains a neural network called the critic (or Q network), which takes as input the current state and an action, and predicts the discounted return of that action. The policy is then implicitly defined by selecting the action with the highest predicted return.
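The critic-plus-greedy-policy idea can be shown with tabular Q-learning, which DQN generalizes by replacing the table with a neural network. The sketch below is a toy under stated assumptions: a five-state corridor with a reward of 1 for reaching the right end, purely random exploration, and hand-picked values for the discount and learning rate. It is not DQN itself (there is no network, replay buffer, or target network).

```python
import random
from typing import Dict, Tuple

N_STATES, GOAL = 5, 4   # states 0..4 on a line; reaching state 4 ends the episode
ACTIONS = (-1, +1)      # move left / move right
GAMMA, ALPHA = 0.9, 0.5  # discount factor and learning rate (hypothetical values)


def step(state: int, action: int) -> Tuple[int, float, bool]:
    """Deterministic toy environment: returns (next_state, reward, done)."""
    nxt = min(max(state + action, 0), GOAL)
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL


def train(episodes: int = 2000, max_steps: int = 20, seed: int = 0) -> Dict[Tuple[int, int], float]:
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        s = rng.randrange(GOAL)  # random non-terminal start state
        for _ in range(max_steps):
            a = rng.choice(ACTIONS)  # random exploration; Q-learning is off-policy
            s2, r, done = step(s, a)
            # TD target: reward plus discounted value of the best next action.
            target = r if done else r + GAMMA * max(q[(s2, b)] for b in ACTIONS)
            q[(s, a)] += ALPHA * (target - q[(s, a)])
            if done:
                break
            s = s2
    return q


def greedy_action(q: Dict[Tuple[int, int], float], state: int) -> int:
    """The policy is implicit: pick the action with the highest predicted return."""
    return max(ACTIONS, key=lambda a: q[(state, a)])


q = train()
policy = [greedy_action(q, s) for s in range(GOAL)]
```

The learned table plays the role of the critic, and the policy is never stored explicitly; it is read off by taking the argmax over actions, exactly as described above.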
RL requires an environment that allows potentially unsafe actions to be executed, to collect novel data (e.g., via random actions). Additionally, RL requires significantly more data to train than IL. For this reason, modern RL methods often parallelize data collection across multiple environments [76]. Meeting these requirements in the real world presents great challenges. Therefore, almost all papers that use RL in driving have only investigated the technique in simulation. Most use different extensions of DQN. The community has not yet converged on a specific RL algorithm.
RL has successfully learned lane following on a real car on an empty street [4]. Despite this encouraging result, it must be noted that a similar task was already accomplished by IL three decades prior [3]. To date, no report has shown results for end-to-end training with RL that are competitive with IL. The reason for this failure is likely that the gradients obtained via RL are insufficient to train the deep perception architectures (e.g., ResNet) required for driving. Models used in benchmarks like Atari, where RL succeeds, are relatively shallow, consisting of only a few layers [77].
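Some key points on the current state of RL in autonomous driving:

Lane following: Learning lane following with RL on a real vehicle is an important milestone, showing that RL can learn basic driving skills in a real environment.
History of imitation learning: Imitation learning, however, has succeeded on similar tasks since the 1980s, which suggests that RL's impact on autonomous driving may be less revolutionary than initially expected.
The challenge of end-to-end training: Although RL has succeeded in some domains, it has not yet matched IL for end-to-end driving. A likely reason is that the gradients obtained through RL are not sufficient to train deep models such as deep perception networks.
The need for deep perception architectures: Autonomous driving requires sophisticated perception to understand the environment around the vehicle. This usually calls for deep architectures such as ResNet, which may be hard to train under current RL frameworks.
RL on simpler tasks: RL has succeeded on benchmarks such as Atari, which involve relatively simple visual input and decision processes. The successful models are relatively shallow, which suggests that RL may currently be better suited to less complex tasks.
Future research directions: Despite these challenges, researchers continue to explore how to improve RL algorithms so that they cope better with the complexity of deep learning and autonomous driving.
Combining and inventing algorithms: Future work may combine RL with IL or other learning methods, or develop new algorithms that improve the performance and applicability of RL in driving.

Overall, although RL has made some progress in autonomous driving, reaching performance competitive with IL still requires further research on algorithm design, model architecture, and training strategy.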
RL has been successfully applied in end-to-end driving when combined with supervised learning (SL). Implicit affordances [18, 19] pre-train the CNN encoder using SL with tasks like semantic segmentation. In the second stage, this encoder is frozen, and a shallow policy head is trained on the features from the frozen encoder with a modern version of Q-learning [78]. RL can also be used to finetune full networks that were pre-trained using IL [17, 79].
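RL has been applied successfully to end-to-end driving when combined with supervised learning (SL). The key points of this combination are:

Implicit affordances: This approach pre-trains the CNN encoder with supervised learning on tasks such as semantic segmentation [18, 19]. Pre-training helps the network learn to extract features of the environment from sensor input.
Frozen encoder: After the first-stage pre-training, the encoder's parameters are frozen and are no longer updated in subsequent training.
Policy head training: In the second stage, a shallow policy head is added on top of the frozen encoder and trained with a modern version of Q-learning [78]. The policy head makes driving decisions from the features that the encoder extracts.
Q-learning: Q-learning is a popular RL algorithm that selects actions by learning an action-value function (Q-function). Here it is used to train the policy head to act on the encoder's features.
End-to-end model: Combining the pre-trained encoder with the RL-trained policy head yields a model that maps raw sensor data directly to control commands.
Fine-tuning: RL can also be used to fine-tune full networks that were pre-trained with imitation learning [17, 79]. Fine-tuning continues training the network on the target task to improve its performance.
Combined strengths: Combining RL with SL uses supervised learning for feature extraction and reinforcement learning for decision making, which improves the performance of end-to-end driving systems.
Practical use: This combination is useful in autonomous driving because it lets the system learn a representation of the environment from labeled data and then optimize its driving policy through interaction with the environment.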

RL can also be effectively applied if the network has access to privileged simulator information [48, 80, 81]. Privileged RL agents can be used for dataset curation. Roach [21] trains an RL agent on privileged BEV semantic maps and uses the policy to automatically collect a dataset with which a downstream IL agent is trained. WoR [20] employs a Q-function and tabular dynamic programming to generate additional or improved labels for a static dataset.
A challenge in the field is to transfer the findings from simulation to the real world. In RL, the objective is expressed as reward functions, and many algorithms require them to be dense and provide feedback at each environment step. Current works typically use simple objectives, such as progress and collision avoidance. These simplistic designs potentially encourage risky behaviors [80]. Devising or learning better reward functions remains an open problem. Another direction would be to develop RL algorithms that can handle sparse rewards, enabling the optimization of relevant metrics directly. RL can be effectively combined with world models [82, 83, 84], though this presents specific challenges (See Sec. 4.3). Current RL solutions for driving rely heavily on low-dimensional representations of the scene, and this issue is further discussed in Sec. 4.2.2.
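A dense reward built from progress and collision avoidance might look like the sketch below. The field names, the penalty weight, and the discount factor are assumptions chosen for illustration, not values taken from any cited work.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StepInfo:
    progress_m: float  # distance advanced along the route during this step
    collided: bool     # whether a collision occurred during this step


def dense_reward(info: StepInfo, collision_penalty: float = 10.0) -> float:
    """Per-step reward: route progress minus a fixed penalty on collision."""
    return info.progress_m - (collision_penalty if info.collided else 0.0)


def episode_return(steps: List[StepInfo], gamma: float = 0.99) -> float:
    """Discounted sum of the dense per-step rewards."""
    return sum((gamma ** t) * dense_reward(s) for t, s in enumerate(steps))


cautious = [StepInfo(progress_m=0.5, collided=False) for _ in range(10)]
fast = [StepInfo(progress_m=1.0, collided=False) for _ in range(10)]
```

Under this reward the faster episode always scores higher as long as no collision happens to occur, regardless of how small its safety margins were, which is one way such simple objectives can encourage risky behavior.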

3 BENCHMARKING

Autonomous driving systems require a comprehensive evaluation to ensure safety. Researchers must benchmark these systems using appropriate datasets, simulators, metrics, and hardware to accomplish this. This section delineates three approaches for benchmarking end-to-end autonomous driving systems: (1) real-world evaluation, (2) online or closed-loop evaluation in simulation, and (3) offline or open-loop evaluation on driving datasets. We focus on the scalable and principled online simulation setting and summarize real-world and offline assessments for completeness.

3.1 Real-world Evaluation

Early efforts on benchmarking self-driving involved real-world evaluation. Notably, DARPA initiated a series of races to advance autonomous driving. The first event offered $1M in prize money for autonomously navigating a 240km route through the Mojave desert, which no team achieved [85]. The final series event, called the DARPA Urban Challenge, required vehicles to navigate a 96km mock-up town course, adhering to traffic laws and avoiding obstacles [86]. These races fostered important developments in autonomous driving, such as LiDAR sensors. Following this spirit, the University of Michigan established MCity [87], a large controlled real-world environment designed to facilitate testing autonomous vehicles. Besides such academic ventures, industries with the resources to deploy fleets of driverless vehicles also rely on real-world evaluation to benchmark improvements in their algorithms.

3.2 Online/Closed-loop Simulation

Conducting tests of self-driving systems in the real world is costly and risky. To address this challenge, simulation is a viable alternative [14, 88, 89, 90, 91, 92]. Simulators facilitate rapid prototyping and testing, enable the quick iteration of ideas, and provide low-cost access to diverse scenarios for unit testing. In addition, simulators offer tools for measuring performance accurately. However, their primary disadvantage is that the results obtained in a simulated environment do not necessarily generalize to the real world (Sec. 4.9.3).
Closed-loop evaluation involves building a simulated environment that closely mimics a real-world driving environment. The evaluation entails deploying the driving system in simulation and measuring its performance. The system has to navigate safely through traffic while progressing toward a designated goal location. There are four main sub-tasks involved in developing such simulators: parameter initialization, traffic simulation, sensor simulation, and vehicle dynamics simulation. We briefly describe these subtasks below, followed by a summary of currently available open-source simulators for closed-loop benchmarks.
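The closed-loop protocol described above can be illustrated with a deliberately tiny sketch. It is not any simulator's API: the 1-D world, the single lead vehicle, the `EpisodeResult` fields, and both policies are hypothetical stand-ins chosen only to show that the policy's own actions determine the states it sees next, and that success is measured by reaching the goal without colliding.

```python
from dataclasses import dataclass


@dataclass
class EpisodeResult:
    route_completion: float  # fraction of the route covered, in [0, 1]
    collided: bool
    steps: int


def run_closed_loop(policy, goal=100.0, lead_start=30.0, lead_speed=5.0,
                    dt=0.1, max_steps=2000):
    """Roll out `policy` in a toy 1-D world with one slower lead vehicle."""
    ego_pos, ego_speed, lead_pos = 0.0, 0.0, lead_start
    for step in range(1, max_steps + 1):
        gap = lead_pos - ego_pos
        accel = policy(ego_speed, gap)            # policy observes the current state
        ego_speed = max(0.0, ego_speed + accel * dt)
        ego_pos += ego_speed * dt                 # its action shapes the next state
        lead_pos += lead_speed * dt
        if ego_pos >= lead_pos:                   # ego reached the lead vehicle
            return EpisodeResult(min(ego_pos / goal, 1.0), True, step)
        if ego_pos >= goal:
            return EpisodeResult(1.0, False, step)
    return EpisodeResult(min(ego_pos / goal, 1.0), False, max_steps)


def cautious_policy(speed, gap):
    # Hypothetical rule: accelerate only while the gap exceeds a 2 s headway plus 5 m.
    return 1.0 if gap > 2.0 * speed + 5.0 else -3.0


def reckless_policy(speed, gap):
    # Hypothetical rule: always accelerate, ignoring the lead vehicle.
    return 2.0


if __name__ == "__main__":
    print(run_closed_loop(cautious_policy))
    print(run_closed_loop(reckless_policy))
```

In this toy setting the cautious policy finishes the route behind the lead vehicle, while the reckless one is terminated by a collision before reaching the goal. Real closed-loop benchmarks replace each hard-coded piece here with the four simulator sub-tasks listed above.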

3.2.1 Parameter Initialization

Simulation offers the benefit of a high degree of control over the environment, including weather, maps, 3D assets, and low-level attributes such as the arrangement of objects in a traffic scene. While powerful, the number of these parameters is substantial, resulting in a challenging design problem. Current simulators tackle this in two ways:
Procedural Generation: Traditionally, initial parameters are hand-tuned by 3D artists and engineers [88, 89, 90, 91]. This limits scalability. Recently, some of the simulation properties can be sampled from a probabilistic distribution with computer algorithms, which we refer to as procedural generation [93]. Procedural generation algorithms combine rules, heuristics, and randomization to create diverse road networks, traffic patterns, lighting conditions, and object placements [94, 95]. Due to its efficiency compared to fully manual design, it has become one of the most commonly used methods of initialization for video games and simulations. Nevertheless, the process still needs pre-defined parameters and algorithms to control generation reliability, which is time-consuming and requires a lot of expertise.
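As a minimal sketch of procedural generation, the snippet below combines randomization (sampling from hand-specified distributions) with one rule (fewer pedestrians in bad weather). The `SceneConfig` fields, the weather prior, and all numeric ranges are assumptions made for illustration and do not correspond to any particular simulator.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneConfig:
    weather: str
    sun_altitude_deg: float
    num_vehicles: int
    num_pedestrians: int


# Assumed prior over weather types (illustrative values only).
WEATHER_WEIGHTS = {"clear": 0.6, "rain": 0.25, "fog": 0.15}


def sample_scene(rng: random.Random) -> SceneConfig:
    """Draw one scene: random sampling plus a simple hand-written rule."""
    weather = rng.choices(list(WEATHER_WEIGHTS),
                          weights=list(WEATHER_WEIGHTS.values()))[0]
    # Rule (assumed): bad weather caps the number of pedestrians.
    max_pedestrians = 30 if weather == "clear" else 10
    return SceneConfig(
        weather=weather,
        sun_altitude_deg=rng.uniform(-10.0, 90.0),
        num_vehicles=rng.randint(0, 50),
        num_pedestrians=rng.randint(0, max_pedestrians),
    )


def generate_scenes(seed: int, count: int) -> list:
    """Seeded generation so that a benchmark can be reproduced exactly."""
    rng = random.Random(seed)
    return [sample_scene(rng) for _ in range(count)]


if __name__ == "__main__":
    for scene in generate_scenes(seed=0, count=3):
        print(scene)
```

Even this small example shows the limitation noted above: the distributions and rules must still be chosen by hand, which is where the required expertise goes.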
Data-Driven: Data-driven approaches for simulation initialization aim to learn the required parameters. Arguably, the simplest way is to sample from real-world driving logs [14, 92], where parameters such as road maps or traffic patterns are directly extracted from pre-recorded datasets. The advantage of log sampling is its ability to capture the natural variability present in real-world data, leading to more realistic simulation scenarios. However, it may not encompass rare situations that are critical for testing the robustness of autonomous driving systems. The initial parameters can be optimized to increase the representation of such scenarios [7, 34, 35]. Another advanced data-driven approach to initialization is generative modeling, where machine learning algorithms are utilized to learn the underlying structure and distributions of real-world data. They can then generate novel scenarios that resemble the real world but were not included in the original data [96, 97, 98, 99].

3.2.2 Traffic Simulation

Traffic simulation involves generating and positioning virtual entities in the environment with realistic motion [97, 100]. These entities often include vehicles (such as cars, motorcycles, bicycles, etc.) and pedestrians. Traffic simulators must account for the effects of speed, acceleration, braking, obstructions, and the behavior of other entities. Moreover, traffic light states must be periodically updated to simulate realistic city driving. There are two popular approaches for traffic simulation, which we describe below.
Rule-Based: Rule-based traffic simulators use predefined rules to generate the motion of traffic entities. The most prominent implementation of this concept is the Intelligent Driver Model (IDM) [101]. IDM is a car-following model that computes acceleration for each vehicle based on its current speed, the speed of the leading vehicle, and a desired safety distance. Although widely used and straightforward, this approach may be inadequate to simulate realistic motion and complex interactions in urban environments.
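For reference, the commonly used form of the IDM acceleration rule is sketched below. The formula is the standard one for this car-following model. The default parameter values (desired speed `v0`, time headway `T`, minimum gap `s0`, maximum acceleration `a_max`, comfortable deceleration `b`, exponent `delta`) are illustrative assumptions rather than values taken from the survey.

```python
import math


def idm_acceleration(v, v_lead, gap, v0=15.0, T=1.5, s0=2.0,
                     a_max=1.0, b=2.0, delta=4.0):
    """Intelligent Driver Model acceleration for one following vehicle.

    v      : current speed of the follower (m/s)
    v_lead : speed of the leading vehicle (m/s)
    gap    : bumper-to-bumper distance to the leader (m), must be > 0
    """
    closing_speed = v - v_lead
    # Desired dynamic gap: minimum gap + headway term + braking interaction term.
    s_star = s0 + max(0.0, v * T + v * closing_speed / (2.0 * math.sqrt(a_max * b)))
    free_road_term = (v / v0) ** delta      # vanishes at standstill, 1 at desired speed
    interaction_term = (s_star / gap) ** 2  # grows as the actual gap shrinks
    return a_max * (1.0 - free_road_term - interaction_term)


if __name__ == "__main__":
    print(idm_acceleration(v=0.0, v_lead=0.0, gap=1e9))   # free road, from standstill
    print(idm_acceleration(v=15.0, v_lead=0.0, gap=10.0)) # approaching a stopped car
```

The rule depends only on the follower's speed, the leader's speed, and the gap, which is why it is simple to deploy but cannot express lane changes, merging, or other multi-agent interactions.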
Data-Driven: Realistic human traffic behavior is highly interactive and complex, including lane changing, merging, sudden stopping, etc. To model such behavior, data-driven traffic simulation utilizes data collected from real-world driving. These models can capture more nuanced, realistic behavior but require significant amounts of labeled data for training. A wide variety of learning-based techniques have been proposed for this task [97, 98, 100, 102, 103, 104].

3.2.3 Sensor Simulation

Sensor simulation is crucial for evaluating end-to-end self-driving systems. This involves generating simulated raw sensor data, such as camera images or LiDAR scans, that the driving system would receive from different viewpoints in the simulator [105, 106, 107]. This process needs to take into account noise and occlusions to realistically assess the autonomous system. There are two main branches of ideas concerning sensor simulation, as described below.

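Whichever branch is used, sensor simulation generally has to address the following considerations:

Physical accuracy: Sensor simulation should replicate the physical characteristics of real sensors as closely as possible, including resolution, field of view, detection range, and behavior under different lighting and weather conditions.
Sensor noise and occlusion models: To mimic real-world sensor data, noise and occlusion effects need to be added in simulation, for example random sensor errors, image blur, and occluded targets.
Multi-modal sensor fusion: Autonomous driving systems usually rely on several sensor types to obtain a complete view of the surroundings. Sensor simulation should be able to generate data for all of these sensors together and account for how they are fused and interact.
Sensor model calibration: To keep simulated data accurate, sensor models need to be calibrated precisely so that they match the performance of the real sensors.
Data synchronization: In multi-sensor systems, data from different sensors must be synchronized so that they provide a consistent view of the simulated environment.
Consistency between simulation and reality: The ultimate goal of sensor simulation is to generate data that is as consistent with the real world as possible, so that the driving system can be tested and validated effectively in simulation.

Accurate sensor simulation allows autonomous driving systems to be tested rigorously under a wide range of environmental conditions, which improves their safety and reliability in the real world. Sensor simulation will continue to play a key role in system development and evaluation.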
Graphics-Based: Recent computer graphics simulators use 3D models of the environment, along with traffic entity models, to generate sensor data via approximations of physical rendering processes in the sensors [89, 90]. For example, this can involve occlusions, shadows, and reflections present in real-world environments while simulating camera images. However, the realism of graphics-based simulation is often subpar or comes at the cost of heavy computation, making parallelization non-trivial [108]. It is closely tied to the quality of the 3D models and the approximations used in modeling the sensors. A comprehensive survey of graphics-based rendering for driving data is provided in [109].

Data-Driven: Data-driven sensor simulation leverages real-world sensor data to create the simulation where both the ego vehicle and background traffic may move differently from the way they did in recordings [110, 111, 112]. Popular methods are Neural Radiance Fields (NeRF) [113] and 3D Gaussian Splatting [114], which can generate novel views of a scene by learning an implicit representation of the scene’s geometry and appearance. These methods can produce more realistic sensor data visually than graphics-based approaches, but they have limitations such as high rendering times or requiring independent training for each scene being reconstructed [107, 115, 116, 117, 118]. Another approach to data-driven sensor simulation is domain adaptation, which aims to minimize the gap between real and graphics-based simulated sensor data [119]. Deep learning techniques such as GANs can be employed to improve realism (Sec. 4.9.3).

3.2.4 Vehicle Dynamics Simulation

The final aspect of driving simulation pertains to ensuring that the simulated vehicle adheres to physically plausible motion. Most existing publicly available simulators use highly simplified vehicle models, such as the unicycle model [120] or the bicycle model [121]. However, in order to facilitate seamless transfer of algorithms from simulation to the real world, it is essential to incorporate more accurate physical modeling of vehicle dynamics. For instance, CARLA adopts a multi-body system approach, representing a vehicle as a collection of sprung masses on four wheels. For a comprehensive review, please refer to [122].
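To make "highly simplified vehicle model" concrete, here is a minimal sketch of a kinematic bicycle model integrated with forward Euler, using the rear axle as the reference point. The `VehicleState` container, the wheelbase, and the time step are assumptions for illustration. Tire slip, suspension, and load transfer are deliberately ignored, which is exactly what more accurate multi-body models add.

```python
import math
from dataclasses import dataclass


@dataclass
class VehicleState:
    x: float      # position (m)
    y: float      # position (m)
    yaw: float    # heading (rad)
    speed: float  # longitudinal speed (m/s)


def bicycle_step(state, accel, steer, wheelbase=2.8, dt=0.05):
    """Advance a rear-axle kinematic bicycle model by one Euler step.

    accel : longitudinal acceleration command (m/s^2)
    steer : front-wheel steering angle (rad)
    """
    x = state.x + state.speed * math.cos(state.yaw) * dt
    y = state.y + state.speed * math.sin(state.yaw) * dt
    yaw = state.yaw + state.speed / wheelbase * math.tan(steer) * dt
    speed = max(0.0, state.speed + accel * dt)  # no reversing in this sketch
    return VehicleState(x, y, yaw, speed)


if __name__ == "__main__":
    state = VehicleState(0.0, 0.0, 0.0, 10.0)
    for _ in range(20):  # one second of gentle left steering
        state = bicycle_step(state, accel=0.0, steer=0.05)
    print(state)
```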

3.2.5 Benchmarks

We give a succinct overview of the end-to-end driving benchmarks available to date in Table 1. In 2019, the original benchmark released with CARLA [90] was solved with near-perfect scores [5]. The subsequent NoCrash benchmark [123] involves training on a single CARLA town under specific weather conditions and testing generalization to another town and set of weather conditions. Instead of a single town, the Town05 benchmark [6] involves training on all available towns while withholding Town05 for testing. Similarly, the LAV benchmark trains on all towns except Town02 and Town05, which are both reserved for testing. Roach [21] uses a setting with 3 test towns, albeit all seen during training, and without the safety-critical scenarios in Town05 and LAV. Finally, the Longest6 benchmark [28] uses 6 test towns. Two online servers, the leaderboard (v1 and v2) [13], ensure fair comparisons by keeping evaluation routes confidential. Leaderboard v2 is highly challenging due to the long route length (over 8km on average, as opposed to 1-2km on v1) and a wide variety of new traffic scenarios.
The nuPlan simulator is currently accessible for evaluating end-to-end systems via the NAVSIM project [124]. Further, there are two benchmarks on which agents input maps and object properties via the data-driven parameter initialization for nuPlan (Sec. 3.2.1). Val14, proposed in [125], uses a validation split of nuPlan. The leaderboard, a submission server with the private test set, was used in the 2023 nuPlan challenge, but it is no longer public for submissions.
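With these benchmarks, researchers can test and validate their algorithms in simulation. As autonomous driving technology develops, such evaluation tools will continue to play an important role in driving progress.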

3.3 Offline/Open-loop Evaluation

Open-loop evaluation mainly assesses a system’s performance against pre-recorded expert driving behavior. This method requires evaluation datasets that include (1) sensor readings, (2) goal locations, and (3) corresponding future driving trajectories, usually obtained from human drivers. Given sensor inputs and goal locations as inputs, performance is measured by comparing the system’s predicted future trajectory against the trajectory in the driving log. Systems are evaluated based on how closely their trajectory predictions match the human ground truth, as well as auxiliary metrics such as the collision probability with other agents. The advantage of open-loop evaluation is that it is easy to implement using realistic traffic and sensor data, as it does not require a simulator. However, the key disadvantage is that it does not measure performance in the actual test distribution encountered during deployment. During testing, the driving system may deviate from the expert driving corridor, and it is essential to verify the system’s ability to recover from such drift (Sec. 4.9.2). Furthermore, the distance between the predicted and the recorded trajectories is not an ideal metric in a multi-modal scenario. For example, in the case of merging into a turning lane, both the options of merging immediately or later could be valid, but open-loop evaluation penalizes the option that was not observed in the data. Therefore, besides measuring collision probability and prediction errors, a few metrics were proposed to cover more comprehensive aspects such as traffic violations, progress, and driving comfort [125].
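Open-loop evaluation offers a fast and cost-effective way to evaluate autonomous driving systems, but it has limitations, particularly in assessing a system's adaptability and its ability to recover in complex, dynamic environments. Evaluation methods continue to evolve in order to reflect real-world performance more accurately.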
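The displacement-based comparison used in open-loop evaluation, and its weakness in multi-modal situations, can be seen in a short sketch. The function below computes the average and final displacement errors between a predicted and a logged trajectory. The two trajectories are invented for illustration: the logged driver merges late, while the prediction merges immediately into the same lane.

```python
import math


def displacement_errors(predicted, logged):
    """Return (average, final) L2 displacement error between two trajectories.

    Both arguments are equal-length sequences of (x, y) waypoints in metres,
    sampled at the same timestamps.
    """
    if not predicted or len(predicted) != len(logged):
        raise ValueError("trajectories must be non-empty and of equal length")
    distances = [math.dist(p, q) for p, q in zip(predicted, logged)]
    return sum(distances) / len(distances), distances[-1]


if __name__ == "__main__":
    # Hypothetical example: lane centres at y = 0.0 and y = 3.5.
    logged_late_merge = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0), (30.0, 3.5)]
    predicted_early_merge = [(0.0, 0.0), (10.0, 3.5), (20.0, 3.5), (30.0, 3.5)]
    ade, fde = displacement_errors(predicted_early_merge, logged_late_merge)
    print(f"average error = {ade:.2f} m, final error = {fde:.2f} m")
```

Both plans end in the same lane, so the final error is zero, yet the early merge is penalized with a non-zero average error simply because it differs from the one behavior that happened to be recorded.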
This approach requires comprehensive datasets of trajectories to draw from. The most popular datasets for this purpose include nuScenes [126], Argoverse [127], Waymo [128], and nuPlan [14]. All of these datasets comprise a large number of real-world driving traversals with varying degrees of difficulty. However, open-loop results do not provide conclusive evidence of improved driving behavior in closed-loop, due to the aforementioned drawbacks [123, 125, 129, 130]. Overall, realistic closed-loop benchmarking, if available and applicable, is recommended in future research.
As a fast evaluation method, open-loop evaluation helps to form a first impression of system performance, but it cannot fully replace closed-loop evaluation. Closed-loop evaluation provides a more complete picture and supports the development of safer and more reliable autonomous driving systems.

4 CHALLENGES

Following each topic illustrated in Fig. 1, we now walk through current challenges, related works or potential resolutions, risks, and opportunities. We start with challenges in handling different input modalities in Sec. 4.1, followed by a discussion on visual abstraction for efficient policy learning in Sec. 4.2. Further, we introduce learning paradigms such as world model learning (Sec. 4.3), multi-task frameworks (Sec. 4.4), and policy distillation (Sec. 4.5). Finally, we discuss general issues that impede safe and reliable end-to-end autonomous driving, including interpretability in Sec. 4.6, safety guarantees in Sec. 4.7, causal confusion in Sec. 4.8, and robustness in Sec. 4.9.
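These topics are summarized below:

Sec. 4.1, handling different input modalities: Autonomous driving systems need to process inputs from multiple sensors, such as radar, LiDAR, and cameras. The challenge lies in fusing these modalities effectively to obtain reliable perception and decisions.
Sec. 4.2, visual abstraction: To make policy learning efficient, visual information needs to be abstracted effectively, so that the model learns a mapping from raw pixels to useful features.
Sec. 4.3, world model learning: Learning a model of the environment to predict future states and rewards can help the driving system make better decisions.
Sec. 4.4, multi-task frameworks: Driving systems usually have to solve several tasks at once, such as perception, prediction, and planning. Multi-task learning frameworks can improve efficiency and generalization.
Sec. 4.5, policy distillation: Policy distillation improves the performance of a learned policy by imitating an expert policy, and is especially useful in imitation learning.
Sec. 4.6, interpretability: To improve transparency and trust, interpretable models are needed so that people can understand how the system reaches its decisions.
Sec. 4.7, safety guarantees: Safety is critical, and the system must make safe decisions in all situations.
Sec. 4.8, causal confusion: The system may learn incorrect causal relationships between inputs and outputs, which can lead to wrong decisions in the real world.
Sec. 4.9, robustness: Driving systems need to be robust to unforeseen situations and to potential attacks or failures.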

4.1 Dilemma over Sensing and Input Modalities

4.1.1 Sensing and Multi-sensor Fusion

Sensing: Though early work [8] successfully achieved lane following with a monocular camera, this single input modality cannot handle complex scenarios. Therefore, various sensors in Fig. 4 have been introduced for recent self-driving vehicles. Particularly, RGB images from cameras replicate how humans perceive the world, with abundant semantic details; LiDARs or stereo cameras provide accurate 3D spatial knowledge. Emerging sensors like mmWave radars and event cameras excel at capturing objects’ relative movement. Additionally, vehicle states from speedometers and IMUs, together with navigation commands, are other lines of input that guide the driving system. However, various sensors possess distinct perspectives, data distributions, and huge price gaps, thereby posing challenges in effectively designing the sensory layout and fusing them to complement each other for autonomous driving.
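The sensing setup can be elaborated as follows:

Monocular camera: The single modality used in early autonomous driving research, providing two-dimensional visual information.
Multi-modal sensing: Modern driving systems usually combine several sensors, including cameras, radar, and LiDAR, to perceive the environment more completely.
RGB images: Color images carry rich color and texture information, which helps in understanding the semantic content of a scene.
3D spatial knowledge: LiDAR and stereo cameras provide accurate distance and shape information, which is essential for obstacle detection and scene reconstruction.
Emerging sensors: mmWave radars and event cameras are strong at detecting object motion, especially in low-visibility conditions.
Vehicle state and navigation: Vehicle state such as speed, acceleration, and heading, combined with GPS and map data, provides an important basis for driving decisions.
Sensor fusion: Fusing data from different sensors to obtain more accurate and robust perception is a key technical challenge in autonomous driving.
Cost and design considerations: Sensors differ widely in cost, size, performance, and reliability, all of which must be weighed in system design.
Complementarity: Different sensors have different strengths across environments and scenarios, and designing a system that exploits each of them is an important problem.

Designing the sensing system therefore requires balancing sensor performance, cost, reliability, and complementarity to drive safely and effectively in complex environments.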
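Fig. 4: Examples of input modalities and fusion strategies. Different modalities have distinct characteristics, which makes effective sensor fusion challenging. We take point clouds and images as examples to depict various fusion strategies.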

Multi-sensor fusion has predominantly been discussed in perception-related fields, e.g., object detection [131, 132] and semantic segmentation [133, 134], and is typically categorized into three groups: early, mid, and late fusion. End-to-end autonomous driving algorithms explore similar fusion schemes. Early fusion combines sensory inputs before feeding them into shared feature extractors, where concatenation is a common way for fusion [32, 135, 136, 137, 138]. To resolve the view discrepancy, some works project point clouds on images [139] or vice versa (predicting semantic labels for LiDAR points [52, 140]). On the other hand, late fusion combines multiple results from multi-modalities. It is less discussed due to its inferior performance [6, 141]. Contrary to these methods, middle fusion achieves multi-sensor fusion within the network by separately encoding inputs and then fusing them at the feature level. Naive concatenation is also frequently adopted [15, 22, 30, 142, 143, 144, 145, 146]. Recently, works have employed Transformers [27] to model interactions among features [6, 28, 29, 147, 148]. The attention mechanism in Transformers has demonstrated great effectiveness in aggregating the context of different sensor inputs and achieving safer end-to-end driving.
Multi-sensor fusion is a key technique that lets the system combine information from different sensors for more complete and robust perception. Each fusion strategy has its own strengths and limitations, and the choice depends on the application scenario and system design considerations.
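One way to resolve the view discrepancy mentioned above is to project LiDAR points into the image plane. The sketch below uses the standard pinhole camera model and assumes the points have already been transformed into the camera frame (x right, y down, z forward) by the LiDAR-to-camera extrinsics. The function name and the intrinsic values in the example are hypothetical, and lens distortion is ignored.

```python
def project_points(points_cam, fx, fy, cx, cy, width, height):
    """Project 3-D points in the camera frame onto the image plane.

    points_cam     : iterable of (x, y, z) in metres, camera coordinates
    fx, fy, cx, cy : pinhole intrinsics in pixels
    Returns a list of (u, v, depth) for points that land inside the image.
    """
    pixels = []
    for x, y, z in points_cam:
        if z <= 0.0:          # behind the camera: cannot be observed
            continue
        u = fx * x / z + cx   # perspective division, then principal-point shift
        v = fy * y / z + cy
        if 0.0 <= u < width and 0.0 <= v < height:
            pixels.append((u, v, z))
    return pixels


if __name__ == "__main__":
    lidar_points_in_camera_frame = [
        (0.0, 0.0, 10.0),   # straight ahead
        (1.0, 0.0, 10.0),   # one metre to the right
        (0.0, 0.0, -5.0),   # behind the camera, dropped
    ]
    print(project_points(lidar_points_in_camera_frame,
                         fx=1000.0, fy=1000.0, cx=640.0, cy=360.0,
                         width=1280, height=720))
```

The resulting per-pixel depth values can then be concatenated with the RGB channels (early fusion), or the pixel locations can be used to look up image features for each LiDAR point.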
Inspired by the progress in perception, it is beneficial to model modalities in a unified space such as BEV [131, 132]. End-to-end driving also requires identifying policy-related contexts and discarding irrelevant details. We discuss perception-based representations in Sec. 4.2.1. Besides, the self-attention layer, which interconnects all tokens freely, incurs a significant computational cost and cannot guarantee useful information extraction. Advanced Transformer-based fusion mechanisms in the perception field, such as [149, 150], hold promise for application to the end-to-end driving task.

4.1.2 Language as Input

Humans drive using both visual perception and intrinsic knowledge which together form causal behaviors. In areas related to autonomous driving such as embodied AI, incorporating natural language as fine-grained knowledge and instructions to control the visuomotor agent has achieved notable progress [151, 152, 153, 154]. However, compared to robotic applications, the driving task is more straightforward without the need for task decomposition, and the outdoor environment is much more complex with highly dynamic agents but few distinctive anchors for grounding.
To incorporate linguistic knowledge into driving, a few datasets are proposed to benchmark outdoor grounding and visual language navigation tasks [155, 156, 157, 158]. HAD [159] takes human-to-vehicle advice and adds a visual grounding task. Sriram et al. [160] translate natural language instructions into high-level behaviors, while [161, 162] directly ground the texts. CLIP-MC [163] and LM-Nav [164] utilize CLIP [165] to extract both linguistic knowledge from instructions and visual features from images.
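These approaches can be summarized as follows:

Outdoor grounding and visual language navigation: These tasks tie natural language instructions to real outdoor environments and navigation behavior, requiring the system to understand language and turn it into corresponding actions in the visual scene.
HAD [159]: The HAD dataset collects human-to-vehicle advice and adds a visual grounding task, in which the system must associate language instructions with specific objects or locations in the scene.
Sriram et al. [160]: This work translates natural language instructions into high-level behaviors, for example turning an instruction such as "turn right at the next intersection" into a concrete action in the driving policy.
Direct text grounding [161, 162]: These methods associate text instructions with visual information directly, without an additional abstraction or translation step.
CLIP-MC [163] and LM-Nav [164]: Both methods use CLIP [165] to extract linguistic knowledge from instructions and visual features from images. CLIP relates natural language descriptions to visual content, giving the driving system rich contextual information.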

Recently, observing the rapid development of large language models (LLMs) [166, 167], works encode the perceived scene into tokens and feed them as prompts to LLMs for control prediction and text-based explanations [168, 169, 170]. Researchers also formulate the driving task as a question-answering problem and construct corresponding benchmarks [171, 172]. They highlight that LLMs offer opportunities to handle sophisticated instructions and generalize to different data domains, advantages that are shared with applications in robotic areas [173]. However, LLMs for on-road driving could be challenging at present, considering their long inference time, low quantitative accuracy, and instability of outputs. Potential resolutions could be employing LLMs on the cloud specifically for complex scenarios and using them solely for high-level behavior prediction.

4.2 Dependence on Visual Abstraction

End-to-end autonomous driving systems roughly have two stages: encoding the state into a latent feature representation, and then decoding the driving policy with intermediate features. In urban driving, the input state, i.e., the surrounding environment and ego state, is much more diverse and high-dimensional compared to common policy learning benchmarks such as video games [18, 174], which might lead to the misalignment between representations and necessary attention areas for policy making. Hence, it is helpful to design “good” intermediate perception representations, or first pre-train visual encoders using proxy tasks. This enables the network to extract useful information for driving effectively, thus facilitating the subsequent policy stage. Furthermore, this can improve the sample efficiency for RL methods.

4.2.1 Representation Design

Naive representations are extracted with various backbones. Classic convolutional neural networks (CNNs) still dominate, with advantages in translation equivariance and high efficiency [175]. Depth-pre-trained CNNs [176] significantly boost perception and downstream performance. In contrast, Transformer-based feature extractors [177, 178] show great scalability in perception tasks while not being widely adopted for end-to-end driving yet. For driving-specific representations, researchers introduce the concept of bird’s-eye-view (BEV), fusing different sensor modalities and temporal information within a unified 3D space [131, 132, 179, 180]. It also facilitates easy adaptation to downstream tasks [2, 30, 181]. In addition, grid-based 3D occupancy is developed to capture irregular objects and used for collision avoidance in planning [182]. Nevertheless, the dense representation brings huge computation costs compared to BEV methods.
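As a rough illustration of what a BEV grid is, the sketch below rasterizes 3-D points (for example LiDAR returns in the ego frame, x forward and y left) into a binary top-down occupancy grid. This is a hand-written geometric projection with assumed ranges and cell size, not the learned BEV encoders cited above, which predict such representations from camera or multi-sensor features.

```python
def rasterize_bev(points, x_range=(0.0, 40.0), y_range=(-20.0, 20.0), cell=0.5):
    """Rasterize ego-frame points into a binary bird's-eye-view occupancy grid.

    points : iterable of (x, y, z) in metres; height z is ignored here
    Returns grid[row][col], where rows index y and columns index x.
    """
    cols = int((x_range[1] - x_range[0]) / cell)
    rows = int((y_range[1] - y_range[0]) / cell)
    grid = [[0] * cols for _ in range(rows)]
    for x, y, _z in points:
        if x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]:
            col = int((x - x_range[0]) / cell)
            row = int((y - y_range[0]) / cell)
            grid[row][col] = 1   # mark the cell as occupied
    return grid


if __name__ == "__main__":
    grid = rasterize_bev([(10.2, -3.1, 0.5), (100.0, 0.0, 0.0)])
    occupied = [(r, c) for r, row in enumerate(grid)
                for c, value in enumerate(row) if value]
    print(len(grid), len(grid[0]), occupied)
```

Extending the grid with a height axis gives a 3D occupancy volume, which is where the extra computation cost of dense occupancy comes from: the number of cells grows with the number of height bins.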
Another unsettled problem is the representation of the map. Traditional autonomous driving relies on HD Maps. Because HD Maps are costly to obtain, online mapping methods have been devised with different formulations:

BEV segmentation [183]: segmenting the environment in the bird's-eye view to extract elements such as roads and lane markings.
Vectorized lane lines [184]: representing lane lines as vector geometry, which is convenient to process and analyze.
Centerlines and their topology [185, 186]: extracting road centerlines and their topological relations to understand the road structure.
Lane segments [187]: dividing lanes into segments, each with its own attributes and connectivity.

However, the most suitable formulation for end-to-end systems remains unvalidated.

Though various representation designs offer possibilities of how to design the subsequent decision-making process, they also place challenges as co-designing both parts is necessary for a whole framework. Besides, given the trends observed in several simple yet effective approaches with scaling up training resources [22, 28], the ultimate necessity of explicit representations such as maps is uncertain.

4.2.2 Representation Learning

Representation learning often incorporates certain inductive biases or prior information. There inevitably exist possible information bottlenecks in the learned representation, and redundant context unrelated to decisions may be removed.
Some early methods directly utilize semantic segmentation masks from off-the-shelf networks as the input representation for subsequent policy training [188, 189]. SESR [190] further encodes segmentation masks into class-disentangled representations through a VAE [191]. In [192, 193], predicted affordance indicators, such as traffic light states, offset to the lane center, and distance to the leading vehicle, are used as representations for policy learning.
Observing that using outputs such as segmentation as representations can create human-defined bottlenecks and lose useful information, some works have chosen intermediate features from pre-training tasks as effective representations for RL training [18, 19, 194, 195]. In [196], latent features in VAE are augmented by attention maps obtained from the diffused boundary of segmentation and depth maps to highlight important regions. TARP [197] utilizes data from a series of previous tasks to perform different task-related prediction tasks to acquire useful representations. In [198], the latent representation is learned by approximating the π-bisimulation metric, which is comprised of differences of rewards and outputs from the dynamics model. ACO [36] learns discriminative features by adding steering angle categorization into the contrastive learning structure. Recently, PPGeo [12] proposes to learn effective representation through motion prediction together with depth estimation in a self-supervised way on uncalibrated driving videos. ViDAR [199] utilizes the raw image-point cloud pairs and pre-trains the visual encoder with a point cloud forecasting pre-task. These works demonstrate that self-supervised representation learning from large-scale unlabeled data for policy learning is promising and worthy of future exploration.

4.3 Complexity of World Modeling for Model-based RL

Besides the ability to better abstract perceptual representations, it is essential for end-to-end models to make reasonable predictions about the future to take safe maneuvers. In this section, we mainly discuss the challenges of current model-based policy learning works, where a world model provides explicit future predictions for the policy model.
Deep RL typically suffers from high sample complexity, which is pronounced in autonomous driving. Model-based reinforcement learning (MBRL) offers a promising direction to improve sample efficiency by allowing agents to interact with the learned world model instead of the actual environment. MBRL methods employ an explicit world (environment) model, which is composed of transition dynamics and reward functions. This is particularly helpful in driving, as simulators like CARLA are relatively slow.
However, modeling the highly dynamic environment is a challenging task. To simplify the problem, Chen et al. [20] factor the transition dynamics into a non-reactive world model and a simple kinematic bicycle model. In [137], a probabilistic sequential latent model is used as the world model. To address the potential inaccuracy of the learned world model, Henaff et al. [200] train the policy network with dropout regularization to estimate the uncertainty cost. Another approach [201] uses an ensemble of multiple world models to provide uncertainty estimation, based on which imaginary rollouts could be truncated and adjusted accordingly. Motivated by Dreamer [82], ISO-Dream [202] decouples visual dynamics into controllable and uncontrollable states, and trains the policy on the disentangled states.
然而，对高度动态的环境进行建模是一个具有挑战性的任务。为了简化问题，Chen等人[20]将转移动态分解为非反应性世界模型和一个简单的运动学自行车模型。在[137]中，使用概率顺序潜在模型作为世界模型。为了解决学习到的世界模型可能存在的不准确性问题，Henaff等人[200]使用dropout正则化来训练策略网络，以估计不确定性成本。另一种方法[201]使用多个世界模型的集合来提供不确定性估计，据此可以截断并相应调整想象中的推演。受Dreamer[82]的启发，ISO-Dream[202]将视觉动态解耦为可控和不可控状态，并在解耦的状态下训练策略。
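To make the "simple kinematic bicycle model" mentioned above concrete, the following is a minimal sketch of a textbook rear-axle kinematic bicycle model integrated with forward Euler. It is not taken from [20]; the names `EgoState`, `bicycle_step`, and `rollout`, as well as the wheelbase and time step, are assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class EgoState:
    x: float    # position along the world x-axis [m]
    y: float    # position along the world y-axis [m]
    yaw: float  # heading angle [rad]
    v: float    # longitudinal speed [m/s]


def bicycle_step(s, accel, steer, wheelbase=2.7, dt=0.1):
    """One forward-Euler step of a rear-axle kinematic bicycle model."""
    x = s.x + s.v * math.cos(s.yaw) * dt
    y = s.y + s.v * math.sin(s.yaw) * dt
    yaw = s.yaw + s.v / wheelbase * math.tan(steer) * dt
    v = max(0.0, s.v + accel * dt)  # assumption: no reversing
    return EgoState(x, y, yaw, v)


def rollout(s, actions, **kw):
    """Apply a sequence of (accel, steer) actions and return all visited states."""
    states = [s]
    for accel, steer in actions:
        states.append(bicycle_step(states[-1], accel, steer, **kw))
    return states
```

Because the ego dynamics are given in closed form like this, only the behavior of the rest of the scene has to be learned, which is the simplification the factorization aims for.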
It is worth noting that learning world models in raw image space is non-trivial for autonomous driving. Important small details, such as traffic lights, would easily be missed in predicted images. To tackle this, GenAD [203] and DriveWM [204] employ the prevailing diffusion technique [205]. MILE [206] incorporates Dreamer-style world model learning in the BEV segmentation space as an auxiliary task besides imitation learning. SEM2 [136] also extends the Dreamer structure but with BEV map inputs, and uses RL for training. Besides directly using the learned world model for MBRL, DeRL [195] combines a model-free actor-critic framework with the world model, by fusing self-assessments of the action or state from both models.
值得注意的是，在自动驾驶中，直接在原始图像空间学习世界模型并非易事。重要的小细节，比如交通灯，在预测图像中很容易被遗漏。为解决这个问题，GenAD[203]和DriveWM[204]采用了流行的扩散技术[205]。MILE[206]将Dreamer风格的世界模型学习作为模仿学习的辅助任务，融入到BEV分割空间中。SEM2[136]也扩展了Dreamer结构，但是使用BEV地图作为输入，并且使用RL进行训练。除了直接使用学习到的世界模型进行MBRL外，DeRL[195]将无模型的演员-评论家框架与世界模型结合起来，通过融合来自两个模型的动作或状态的自我评估。
World model learning for end-to-end autonomous driving is an emerging and promising direction as it greatly reduces the sample complexity for RL, and understanding the world is helpful for driving. However, as the driving environment is highly complex and dynamic, further study is still needed to determine what needs to be modeled and how to model the world effectively.
世界模型学习对于端到端自动驾驶来说是一个新兴且有希望的方向，因为它极大地降低了强化学习（RL）的样本复杂性，而且理解世界对于驾驶是有帮助的。然而，由于驾驶环境极其复杂和动态，仍需进一步研究以确定需要建模的内容以及如何有效地建模这个世界。

4.4 Reliance on Multi-Task Learning

Multi-task learning (MTL) involves jointly performing several related tasks based on a shared representation through separate heads. MTL provides advantages such as computational cost reduction, the sharing of relevant domain knowledge, and the ability to exploit task relationships to improve model’s generalization ability [207]. Consequently, MTL is well-suited for end-to-end driving, where the ultimate policy prediction requires a comprehensive understanding of the environment. However, the optimal combination of auxiliary tasks and appropriate weighting of losses to achieve the best performance presents a significant challenge.
多任务学习（Multi-task Learning, MTL） 涉及基于共享表示通过独立的头同时执行多个相关任务。MTL提供了诸如计算成本降低、共享相关领域知识以及利用任务关系来提高模型泛化能力等优势[207]。因此，MTL非常适合端到端驾驶，其中最终的策略预测需要对环境进行全面理解。然而，实现最佳性能的辅助任务的最优组合以及损失的适当加权是一个重大挑战。
In contrast to common vision tasks where dense predictions are closely correlated, end-to-end driving predicts a sparse signal. The sparse supervision increases the difficulty of extracting useful information for decision-making in the encoder. For image input, auxiliary tasks such as semantic segmentation [28, 31, 139, 208, 209, 210] and depth estimation [28, 31, 208, 209, 210] are commonly adopted in end-to-end autonomous driving models. Semantic segmentation helps the model gain a high-level understanding of the scene; depth estimation enables the model to capture the 3D geometry of the environment and better estimate distances to critical objects. Besides auxiliary tasks on perspective images, 3D object detection [28, 31, 52] is also useful for LiDAR encoders. As BEV becomes a natural and popular representation for autonomous driving, tasks such as BEV segmentation are included in models [11, 23, 28, 29, 30, 31, 52, 148] that aggregate features in BEV space. Moreover, in addition to these vision tasks, [29, 208, 211] also predict visual affordances including traffic light states, distances to opposite lanes, etc. Nonetheless, constructing large-scale datasets with multiple types of aligned and high-quality annotations is non-trivial for real-world applications, which remains a great concern given current models' reliance on MTL.
与常见的视觉任务（其中密集预测密切相关）不同，端到端驾驶预测的是一个稀疏信号。稀疏监督增加了编码器提取决策制定有用信息的难度。对于图像输入，端到端自动驾驶模型通常采用辅助任务，如语义分割[28, 31, 139, 208, 209, 210]和深度估计[28, 31, 208, 209, 210]。语义分割帮助模型获得对场景的高级理解；深度估计使模型能够捕捉环境的3D几何形状，并更好地估计与关键对象的距离。除了透视图像上的辅助任务外，3D对象检测[28, 31, 52]对激光雷达编码器也很有用。随着BEV（鸟瞰图）成为自动驾驶的自然和流行的表示，像BEV分割这样的任务也被包含在模型中[11, 23, 28, 29, 30, 31, 52, 148]，这些模型在BEV空间聚合特征。此外，除了这些视觉任务外，[29, 208, 211]还预测视觉可承受性，包括交通灯状态、对面车道的距离等。尽管如此，对于现实世界的应用来说，构建具有多种类型对齐和高质量注释的大型数据集并非易事，由于当前模型对MTL的依赖，这仍然是一个重大问题。
多任务学习在自动驾驶中的应用可以提高模型的泛化能力和效率，但同时也带来了数据集构建和任务选择上的挑战。随着技术的发展，研究者需要解决这些问题，以便更好地利用MTL的优势。
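The loss-weighting problem described in this section can be illustrated with a minimal sketch. It assumes the simplest scheme, a fixed weighted sum of per-task losses computed on a shared encoder; the task names and weight values are hypothetical and would need tuning in practice.

```python
def multitask_loss(losses, weights):
    """Combine per-task losses into one training objective by a weighted sum.

    losses:  dict mapping task name -> scalar loss value
    weights: dict mapping task name -> scalar weight (hand-tuned here)
    """
    missing = set(losses) - set(weights)
    if missing:
        raise KeyError(f"no weight given for tasks: {sorted(missing)}")
    return sum(weights[name] * value for name, value in losses.items())


# Hypothetical example: planning is the main task, the others are auxiliary.
example_losses = {"planning": 0.5, "segmentation": 1.2, "depth": 0.8}
example_weights = {"planning": 1.0, "segmentation": 0.5, "depth": 0.25}
total = multitask_loss(example_losses, example_weights)
```

Every auxiliary task adds one more weight to tune, which is part of why finding the best task combination and weighting is hard.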

4.5 Inefficient Experts and Policy Distillation

As imitation learning, or its predominant sub-category, behavior cloning, is simply supervised learning that mimics expert behaviors, corresponding methods usually follow the “Teacher-Student” paradigm. There are two main challenges: (1) Teachers, such as the handcrafted expert autopilot provided by CARLA, are not perfect drivers, even though they have access to ground-truth states of surrounding agents and maps. (2) Students are supervised by the recorded output with sensor input only, requiring them to extract perceptual features and learn the policy from scratch simultaneously.
作为模仿学习，或其主要子类别行为克隆，实际上是模仿专家行为的监督学习，相应的方法通常遵循“教师-学生”范式。这里存在两个主要挑战：

教师的不完美性：例如，由CARLA提供的手工制作的专家自动驾驶仪，虽然可以访问周围代理和地图的真实状态，但并不是完美的驾驶员。这意味着专家的行为可能包含错误或次优决策，这些可能会被学习策略的学生所模仿。
学生的学习难度：学生仅通过传感器输入的记录输出进行监督，要求它们同时提取感知特征并从头开始学习策略。这要求学生不仅要理解环境，还要从基础学习如何在该环境中做出决策。

A few studies propose to divide the learning process into two stages, i.e., training a stronger teacher network and then distilling the policy to the student. In particular, Chen et al. [5, 52] first employ a privileged agent to learn how to act with access to the state of the environment, then let the sensorimotor agent (student) closely imitate the privileged agent with distillation at the output stage. More compact BEV representations as input for the privileged agent provide stronger generalization abilities and supervision than the original expert. The process is depicted in Fig. 5.
一些研究提议将学习过程分为两个阶段：首先训练一个更强大的教师网络，然后将策略蒸馏到学生网络。具体来说，Chen等人[5, 52]首先使用一个特权代理来学习如何在访问环境状态的情况下采取行动，然后让感知运动代理（学生）在输出阶段通过蒸馏紧密模仿特权代理。为特权代理提供更紧凑的BEV（鸟瞰图）表示作为输入，比原始专家提供了更强的泛化能力和监督。这个过程在图5 中被描述。
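Fig. 5: Policy distillation. (a) The privileged agent learns a robust policy with access to privileged ground-truth information. The expert is drawn with dashed lines to indicate that it is not mandatory if the privileged agent is trained via RL, in which case the agent learns through interaction with the environment. (b) The sensorimotor agent (the student) imitates the privileged agent through both feature distillation and output imitation. Feature distillation trains the student to extract, from its own sensor input, features that match those of the privileged agent; output imitation trains it to reproduce the privileged agent's decisions.
This distillation scheme lets the sensorimotor agent learn to act from limited sensor data while benefiting from the more complete environment information available to the privileged agent, which can improve the student's performance and generalization even without access to the full environment state.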

Apart from solely supervising planning results, several works also distill knowledge at the feature level. For example, FM-Net [212] employs segmentation and optical flow models as auxiliary teachers to guide feature training. SAM [213] adds an L2 feature loss between teacher and student networks, while CaT [23] aligns features in BEV. WoR [20] learns a model-based action-value function and then uses it to supervise the visuomotor policy. Roach [21] trains a stronger privileged expert with RL, eliminating the upper bound of BC. It incorporates multiple distillation targets, i.e., action distribution, values/rewards, and latent features. By leveraging the powerful RL expert, TCP [22] achieves a new state-of-the-art on the CARLA leaderboard with a single camera as visual input. DriveAdapter [181] learns a perception-only student and adapters with the feature alignment objective. The decoupled paradigm fully enjoys the teacher’s knowledge and the student’s training efficiency.
除了仅仅监督规划结果，一些研究还在特征层面进行知识蒸馏。以下是对这些方法的概述：

FM-Net [212]：使用分割和光流模型作为辅助教师来指导特征训练。这种方法通过利用辅助任务的输出来帮助学生网络学习有用的特征表示。
SAM [213]：在教师网络和学生网络之间添加L2特征损失，以确保学生网络学习到的特征与教师网络的特征保持一致。
CaT [23]：在鸟瞰图（BEV）中对齐特征，以确保学生网络能够从不同视角提取一致的特征表示。
WoR [20]：学习基于模型的行动价值函数，并将其用于监督视觉运动策略。这种方法结合了模型预测和策略学习。
Roach [21]：使用强化学习（RL）训练一个更强的特权专家，消除了行为克隆（BC）的上限。它结合了多个蒸馏目标，即动作分布、价值/奖励和潜在特征。
TCP [22]：利用强大的RL专家，在CARLA排行榜上以单一摄像头作为视觉输入达到了新的最佳状态。
DriveAdapter [181]：学习一个仅感知的学生网络和适配器，并以特征对齐为目标。解耦的范式充分利用了教师的知识以及学生训练的效率。
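The combination of output imitation and feature-level distillation described above can be sketched as a single loss. This is a simplified illustration and not the objective of any specific cited method: the L1 action term, the L2 feature term, and the weight `feat_weight` are assumptions, and plain Python lists stand in for network tensors.

```python
def l1(a, b):
    """Mean absolute difference between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def l2(a, b):
    """Mean squared difference between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(student_action, teacher_action,
                      student_feat, teacher_feat, feat_weight=0.1):
    """Output imitation (L1 on actions) plus feature distillation (L2 on features)."""
    action_term = l1(student_action, teacher_action)
    feature_term = l2(student_feat, teacher_feat)
    return action_term + feat_weight * feature_term


# Hypothetical numbers: (steer, throttle) actions and a 2-d latent feature.
loss = distillation_loss([0.1, 0.5], [0.0, 0.5], [1.0, 2.0], [1.0, 0.0])
```

In a real system the teacher outputs would be treated as fixed targets, so gradients flow only into the student.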

Though huge efforts have been devoted to designing a robust expert and transferring knowledge at various levels, the teacher-student paradigm still suffers from inefficient distillation. For instance, the privileged agent has access to ground-truth states of traffic lights, which are small objects in images and thus hard to distill corresponding features. As a result, the visuomotor agents exhibit large performance gaps compared to their privileged agents. It may also lead to causal confusion for students (see Sec. 4.8). It is worth exploring how to draw more inspiration from general distillation methods in machine learning to minimize the gap.
尽管已经投入了巨大的努力来设计一个稳健的专家并从各个层面转移知识，但教师-学生范式仍然面临着蒸馏效率低下的问题。例如，特权代理可以访问交通灯的真实状态信息，这些在图像中是小物体，因此很难蒸馏相应的特征。结果，视觉运动代理与它们的特权代理相比表现出较大的性能差距。这也可能导致学生出现因果混淆（见第4.8节）。值得探索的是如何从机器学习中的一般蒸馏方法中获得更多灵感，以最小化这一差距。

4.6 Lack of Interpretability

Interpretability plays a critical role in autonomous driving [214]. It enables engineers to better debug the system, provides performance guarantees from a societal perspective, and promotes public acceptance. Achieving interpretability for end-to-end driving models, which are often referred to as “black boxes”, is more essential and challenging.
可解释性在自动驾驶中扮演着至关重要的角色[214]。它使工程师能够更好地调试系统，从社会角度提供性能保证，并促进公众接受。对于通常被称为“黑盒”的端到端驾驶模型来说，实现可解释性更加重要和具有挑战性。
以下是对自动驾驶中可解释性重要性的进一步解释：

系统调试：可解释性允许工程师理解模型的决策过程，识别潜在的问题或错误，并进行相应的修正。
性能保证：对社会而言，可解释性提供了对自动驾驶系统行为的透明度，有助于建立对系统性能的信任。
公众接受度：如果公众能够理解自动驾驶系统的工作原理和决策依据，他们更可能接受这项技术。
端到端模型的挑战：端到端模型通常直接从输入映射到输出，缺乏中间步骤，这使得理解模型的行为变得更加困难。
“黑盒”问题：由于缺乏透明度，端到端模型有时被视为“黑盒”，这可能导致对模型预测的不信任。
可解释性技术：为了解决这个问题，研究者正在开发各种可解释性技术，如特征可视化、注意力图、局部可解释模型等。
模型透明度：提高模型的透明度可以帮助利益相关者理解模型的工作原理，包括它如何感知环境和做出决策。
伦理和社会影响：自动驾驶系统的决策过程需要符合伦理标准和社会价值观，可解释性有助于确保这一点。
政策和法规：可解释性还可以帮助制定政策和法规，确保自动驾驶系统的安全性和合规性。

实现自动驾驶系统的可解释性是一个多方面的挑战，需要结合技术解决方案和社会沟通策略。随着技术的进步和公众意识的提高，可解释性将继续是自动驾驶领域的一个重要研究方向。
Given trained models, some post-hoc X-AI (explainable AI) techniques could be applied to gain saliency maps [208, 215, 216, 217, 218]. Saliency maps highlight specific regions in the visual input on which the model primarily relies for planning. However, this approach provides limited information, and its effectiveness and validity are difficult to evaluate. Instead, we focus on end-to-end frameworks that directly enhance interpretability in their model design. We introduce each category of interpretability in Fig. 6 below.
给定训练好的模型，可以应用一些事后的X-AI（可解释的人工智能）技术来获得显著性图[208, 215, 216, 217, 218]。显著性图突出了视觉输入中模型主要依赖于规划的特定区域。然而，这种方法提供的信息有限，其有效性和有效性难以评估。相反，我们专注于在其模型设计中直接增强可解释性的端到端框架。我们在下图6提供了一个可视化的概述，展示了不同类别的可解释性方法如何在端到端自动驾驶模型中实现。
以下是对这段描述的进一步解释：

事后X-AI技术：这些技术在模型训练完成后应用，以提供模型决策过程的解释。它们通常用于识别和可视化模型预测中最重要的特征或输入部分。
显著性图：显著性图是一种可视化技术，用于展示模型在做出决策时最关注输入数据的哪些区域。
信息局限性：尽管显著性图可以提供一些洞察，但它们通常只关注模型输入的一小部分，可能无法全面解释模型的决策过程。
有效性和有效性评估：评估事后X-AI技术的有效性和有效性可能具有挑战性，因为它们可能无法捕捉到模型决策的所有方面。
端到端框架：与事后技术相比，端到端框架在模型设计阶段就考虑了可解释性，确保模型在训练过程中生成可解释的输出。
模型设计中的可解释性：端到端框架可能包括设计选择，如使用可解释的模型组件或集成特定技术来增强模型的透明度和解释能力。
可解释性类别：在端到端框架中，可解释性可以通过多种方式实现，包括但不限于模型的局部可解释性、全局可解释性或通过注意力机制等。
图6：假设图6提供了一个可视化的概述，展示了不同类别的可解释性方法如何在端到端自动驾驶模型中实现。

通过将可解释性集成到端到端模型设计中，研究者可以更好地理解和信任自动驾驶系统的行为，这对于提高系统的社会接受度和确保其安全运行至关重要。随着自动驾驶技术的发展，可解释性将继续是该领域的一个重要研究方向。
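Fig. 6: Summary of the different forms of interpretability. They help humans understand the decision-making process of end-to-end models and the reliability of their outputs.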
Attention Visualization: The attention mechanism provides a certain degree of interpretability. In [33, 208, 211, 218, 219], a learned attention weight is applied to aggregate important features from intermediate feature maps. Attention weights can also adaptively combine ROI pooled features from different object regions [220] or a fixed grid [221]. NEAT [11] iteratively aggregates features to predict attention weights and refine the aggregated feature. Recently, Transformer attention blocks are employed to better fuse different sensor inputs, and attention maps display important regions in the input for driving decisions [28, 29, 31, 147, 222]. In PlanT [223], attention layers process features from different vehicles, providing interpretable insights into the corresponding action. Similar to post-hoc saliency methods, although attention maps offer straightforward clues about models’ focus, their faithfulness and utility remain limited.
注意力可视化：注意力机制提供了一定程度的可解释性。在文献[33, 208, 211, 218, 219]中，学习到的注意力权重被应用于从中间特征图中聚合重要特征。注意力权重还可以自适应地结合来自不同对象区域的感兴趣区域（Region of Interest, ROI）池化特征[220]或固定网格[221]。NEAT[11]迭代地聚合特征以预测注意力权重并细化聚合的特征。最近，Transformer注意力块被用来更好地融合不同的传感器输入，注意力图展示了对驾驶决策至关重要的输入区域[28, 29, 31, 147, 222]。在PlanT[223]中，注意力层处理来自不同车辆的特征，为相应的动作提供可解释的洞察。与事后显著性方法类似，尽管注意力图提供了关于模型焦点的直接线索，但它们的准确性和实用性仍然有限。
Interpretable Tasks: Many IL-based works introduce interpretability by decoding the latent feature representations into other meaningful information besides policy prediction, such as semantic segmentation [2, 11, 15, 28, 29, 31, 52, 139, 163, 208, 209, 210, 224], depth estimation [15, 28, 31, 208, 209], object detection [2, 28, 31, 52], affordance predictions [29, 208, 211], motion prediction [2, 52], and gaze map estimation [225]. Although these methods provide interpretable information, most of them only treat these predictions as auxiliary tasks [11, 15, 28, 31, 139, 208, 209, 211], with no explicit impact on final driving decisions. Some [29, 52] do use these outputs for final actions, but they are incorporated solely for performing an additional safety check.
可解释任务：许多基于模仿学习（IL）的工作通过将潜在特征表示解码成除策略预测之外的其他有意义信息来引入可解释性，例如语义分割[2, 11, 15, 28, 29, 31, 52, 139, 163, 208, 209, 210, 224]、深度估计[15, 28, 31, 208, 209]、目标检测[2, 28, 31, 52]、可承受性预测[29, 208, 211]、运动预测[2, 52]和视线图估计[225]。尽管这些方法提供了可解释的信息，但它们大多数只将这些预测视为辅助任务[11, 15, 28, 31, 139, 208, 209, 211]，对最终驾驶决策没有直接的影响。有些[29, 52]确实使用这些输出作为最终动作的依据，但它们仅被集成用来执行额外的安全检查。
Rules Integration and Cost Learning: As discussed in Sec. 2.1.2, cost learning-based methods share similarities with traditional modular systems and thus exhibit a certain level of interpretability. NMP [32] and DSDNet [226] construct the cost volume in conjunction with detection and motion prediction results. P3 [39] combines predicted semantic occupancy maps with comfort and traffic rule constraints to construct the cost function. Various representations, such as probabilistic occupancy and temporal motion fields [1], emergent occupancy [71], and freespace [70], are employed to score sampled trajectories. In [38, 125, 227], human expertise and pre-defined rules including safety, comfort, traffic rules, and routes based on perception and prediction outputs are explicitly included to form the cost for trajectory scoring, demonstrating improved robustness and safety.
规则集成和成本学习：如第2.1.2节中所讨论的，基于成本学习的方法与传统的模块化系统有相似之处，因此表现出一定程度的可解释性。以下是对这些概念的进一步解释：

NMP [32] 和 DSDNet [226]：这些方法结合检测和运动预测结果构建成本体积，为不同的轨迹样本分配成本。
P3 [39]：该方法将预测的语义占用图与舒适性和交通规则约束结合起来，构建成本函数。
不同的表现方式：使用各种表示方法，如概率占用和时间运动场[1]、紧急占用[71]和自由空间[70]，来为采样的轨迹打分。
人类专业知识和预定义规则：在[38, 125, 227]中，明确包含了人类专业知识和预定义规则，包括安全性、舒适性、交通规则以及基于感知和预测输出的路线，形成用于轨迹评分的成本，展示了提高的鲁棒性和安全性。
规则集成：将人类专家知识和预定义的规则集成到自动驾驶系统中，可以提高系统的可解释性和透明度，使开发者和用户能够理解系统的决策依据。
成本学习：通过成本学习，系统可以学习如何在不同的驾驶情境下评估轨迹的成本，从而优化决策过程。
轨迹评分：基于成本的轨迹评分方法可以帮助系统选择最佳或最安全的驾驶路径。
安全性和鲁棒性：通过明确考虑安全性和舒适性规则，这些方法能够提高系统在面对复杂交通情况时的性能和安全性。
可解释性：这种方法的可解释性有助于建立对自动驾驶系统的信任，并为系统的行为提供更清晰的解释。
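A minimal sketch of cost-based trajectory scoring, under strong simplifying assumptions, is given below. Predicted occupancy is a sparse dictionary from grid cell to occupancy probability, the comfort term is the squared second difference of the waypoints (a proxy for acceleration), and the weights `w_collision` and `w_comfort` are hypothetical. None of this reproduces the cost of a specific cited method.

```python
def trajectory_cost(traj, occupancy, cell=1.0, w_collision=10.0, w_comfort=1.0):
    """Score one candidate trajectory; lower is better.

    traj:      list of (x, y) waypoints in metres
    occupancy: dict mapping (col, row) grid cell -> predicted occupancy probability
    """
    # Collision term: accumulated occupancy probability along the trajectory.
    collision = sum(
        occupancy.get((int(x // cell), int(y // cell)), 0.0) for x, y in traj
    )
    # Comfort term: squared second differences penalise abrupt motion.
    comfort = 0.0
    for (x0, y0), (x1, y1), (x2, y2) in zip(traj, traj[1:], traj[2:]):
        comfort += (x2 - 2 * x1 + x0) ** 2 + (y2 - 2 * y1 + y0) ** 2
    return w_collision * collision + w_comfort * comfort


def select_trajectory(candidates, occupancy):
    """Pick the sampled trajectory with the lowest total cost."""
    return min(candidates, key=lambda t: trajectory_cost(t, occupancy))


# Hypothetical scene: an obstacle is predicted in cell (2, 0), directly ahead.
occupancy = {(2, 0): 0.9}
straight = [(0.5, 0.5), (1.5, 0.5), (2.5, 0.5)]
swerve = [(0.5, 0.5), (1.5, 1.5), (2.5, 1.5)]
best = select_trajectory([straight, swerve], occupancy)
```

Each term of the cost can be inspected separately, which is where the interpretability of this family of methods comes from.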

Linguistic Explainability: As one aspect of interpretability is to help humans understand the system, natural language is a suitable choice for this purpose. Kim et al. [33] and Xu et al. [228] develop datasets pairing driving videos or images with descriptions and explanations, and propose end-to-end models with both control and explanation outputs. BEEF [229] fuses the predicted trajectory and the intermediate perception features to predict justifications for the decision. ADAPT [230] proposes a Transformer-based network to jointly estimate action, narration, and reasoning. Recently, [169, 171, 172] build on the progress of multi-modality and foundation models, using LLMs/VLMs to provide decision-related explanations, as discussed in Sec. 4.1.2.
语言可解释性：作为可解释性的一部分，帮助人类理解系统的目的，自然语言是这一目的的合适选择。Kim等人[33]和Xu等人[228]开发了将驾驶视频或图像与描述和解释配对的数据集，并提出了具有控制和解释输出的端到端模型。BEEF[229]融合预测的轨迹和中间感知特征来预测决策的理由。ADAPT[230]提出了一个基于Transformer的网络，联合估计动作、叙述和推理。最近，[169, 171, 172]利用多模态和基础模型的进展，使用大型语言模型（LLMs）/视觉语言模型（VLMs）提供与决策相关的解释，如第4.1.2节所讨论的。
Uncertainty Modeling: Uncertainty is a quantitative approach for interpreting the dependability of deep learning model outputs [231, 232], which can be helpful for designers and users to identify uncertain cases for improvement or necessary intervention. For deep learning, there are two types of uncertainty: aleatoric uncertainty and epistemic uncertainty. Aleatoric uncertainty is inherent to the task, while epistemic uncertainty is due to limited data or modeling capacity. In [233], authors leverage certain stochastic regularizations in the model to perform multiple forward passes as samples to measure the uncertainty. However, the requirement of multiple forward passes is not feasible in real-time scenarios. Loquercio et al. [232] and Filos et al. [234] propose capturing epistemic uncertainty with an ensemble of expert likelihood models and aggregating the results to perform safe planning. Regarding methods modeling aleatoric uncertainty, driving actions/planning and uncertainty (usually represented by variance) are explicitly predicted in [146, 235, 236]. Such methods directly model and quantify the uncertainty at the action level as a variable for the network to predict. The planner would generate the final action based on the predicted uncertainty, either choosing the action with the lowest uncertainty from multiple actions [235] or generating a weighted combination of proposed actions based on the uncertainties [146]. Currently, predicted uncertainty is mainly utilized in combination with hardcoded rules. Exploring better ways to model and utilize uncertainty for autonomous driving is necessary.
不确定性建模：不确定性是对深度学习模型输出可靠性的定量方法[231, 232]，这对于设计者和用户识别需要改进或必要干预的不确定情况很有帮助。对于深度学习，有两种类型的不确定性：随机不确定性和认知不确定性。随机不确定性是任务固有的，而认知不确定性是由于数据有限或建模能力不足造成的。在[233]中，作者利用模型中的某些随机正则化进行多次前向传递作为样本来测量不确定性。然而，多次前向传递的要求在实时场景中是不可行的。Loquercio等人[232]和Filos等人[234]提出使用专家似然模型的集合来捕捉认知不确定性，并将结果聚合以执行安全规划。关于建模随机不确定性的方法，驾驶动作/规划和不确定性（通常由方差表示）在[146, 235, 236]中被明确预测。这些方法直接在动作级别作为网络预测的变量对不确定性进行建模和量化。规划器将根据预测的不确定性生成最终动作，要么从多个动作中选择不确定性最低的动作[235]，要么根据不确定性生成建议动作的加权组合[146]。目前，预测的不确定性主要用于与硬编码规则结合使用。探索更好的方法来建模和利用自动驾驶中的不确定性是必要的。
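The two ways of using predicted action-level uncertainty mentioned above can be sketched as follows. Selecting the lowest-variance proposal follows the description of [235]; for the weighted combination, inverse-variance weighting is used here as one plausible choice, which is an assumption and not necessarily the scheme of [146].

```python
def pick_most_certain(actions, variances):
    """Return the proposed action whose predicted variance is lowest."""
    best = min(range(len(actions)), key=lambda i: variances[i])
    return actions[best]


def inverse_variance_fusion(actions, variances, eps=1e-6):
    """Blend proposed actions, trusting low-variance proposals more."""
    weights = [1.0 / (v + eps) for v in variances]
    total = sum(weights)
    return sum(w * a for w, a in zip(weights, actions)) / total


# Hypothetical steering proposals with their predicted variances.
proposals = [0.2, -0.1, 0.4]
predicted_var = [0.5, 0.1, 1.0]
chosen = pick_most_certain(proposals, predicted_var)
blended = inverse_variance_fusion(proposals, predicted_var)
```

Both functions need only a single forward pass per proposal, in contrast to sampling-based estimates that require multiple passes.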

4.7 Lack of Safety Guarantees

Ensuring safety is of utmost importance when deploying autonomous driving systems in real-world scenarios. However, the learning-based nature of end-to-end frameworks inherently lacks precise mathematical guarantees regarding safety, unlike traditional rule-based approaches [237].
确保安全在将自动驾驶系统部署到现实世界场景中时至关重要。然而，基于学习的端到端框架的本质，与基于传统规则的方法[237]不同，它在安全性方面缺乏精确的数学保证。
Nevertheless, it should be noted that modular driving stacks have already incorporated specific safety-related constraints or optimizations within their motion planning or speed prediction modules to enforce safety [238, 239, 240]. These mechanisms can potentially be adapted for integration into end-to-end models as post-process steps or safety checks, thereby providing additional safety guarantees. Furthermore, the intermediate interpretability predictions, as discussed in Sec. 4.6, such as detection and motion prediction results, can be utilized in post-processing procedures.
尽管如此，应当指出模块化驾驶堆栈已经在它们的动作规划或速度预测模块中纳入了特定的与安全相关的约束或优化，以强化安全性[238, 239, 240]。这些机制有可能被调整并集成到端到端模型中，作为后处理步骤或安全检查，从而提供额外的安全保证。此外，如第4.6节所讨论的，中间的可解释性预测，例如检测和运动预测结果，可以在后处理程序中被利用。
以下是对模块化驾驶系统和端到端模型中安全性集成的进一步解释：

模块化驾驶堆栈：这些系统通过将不同的驾驶任务分解为独立的模块来处理，每个模块可以专门设计以满足特定的安全要求。
安全相关约束：在模块化系统中，可以为每个模块设置安全边界和约束，以确保即使在极端情况下系统也不会执行危险动作。
优化安全性能：模块化方法允许开发者优化每个模块以提高安全性能，如通过调整速度预测来避免潜在的碰撞。
端到端模型的集成：端到端模型可以借鉴模块化系统的经验，将安全机制作为模型的后处理步骤或安全检查点。
后处理步骤：在端到端模型做出初步决策后，可以通过后处理步骤来评估和调整这些决策，以确保它们符合安全标准。
安全检查：在模型的输出被执行之前，可以实施安全检查，以检测任何可能的风险并采取预防措施。
可解释性预测的利用：通过使用可解释性预测，如对象检测和运动预测，可以在后处理中提供额外的安全信息。
增强安全保证：通过将这些机制集成到端到端模型中，可以增强系统的整体安全性，提供更全面的安全保证。
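As one illustration of a rule-based safety check applied as a post-processing step, the sketch below caps the planned speed so that the vehicle can still stop before the nearest detected obstacle. The constant-deceleration stopping model, the function name `safe_speed`, and the default deceleration and margin are assumptions for illustration, not a mechanism described in the cited works.

```python
import math


def safe_speed(planned_speed, gap, max_decel=6.0, margin=2.0):
    """Cap the planned speed [m/s] so braking at max_decel stops within the gap.

    gap:    distance to the nearest obstacle ahead [m], e.g. from a detection head
    margin: distance to keep free in front of the obstacle [m]
    """
    usable = max(gap - margin, 0.0)
    # Stopping distance is v^2 / (2 a), so the largest safe speed is sqrt(2 a d).
    limit = math.sqrt(2.0 * max_decel * usable)
    return min(planned_speed, limit)
```

For example, with a 5 m gap and the defaults above, a planned speed of 15 m/s is reduced to 6 m/s, while with a 100 m gap it is left unchanged.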

4.8 Causal Confusion

Driving is a task that exhibits temporal smoothness, which makes past motion a reliable predictor of the next action. However, methods trained with multiple frames can become overly reliant on this shortcut [241] and suffer from catastrophic failure during deployment. This problem is referred to as the copycat problem [57] in some works and is a manifestation of causal confusion [242], where access to more information leads to worse performance.
驾驶是一项表现出时间平滑性的任务，这使得过去的运动成为预测下一个动作的可靠指标。然而，使用多帧训练的方法可能会过度依赖这种捷径[241]，在部署期间可能会遭遇灾难性失败。这个问题在一些作品中被称为“模仿问题”[57]，并且是因果混淆[242]的一种表现，即获取更多信息导致性能变差。
以下是对模仿问题和因果混淆的进一步解释：

时间平滑性：在驾驶任务中，车辆的运动通常具有一定的连贯性和平滑性，因此过去的行为可以在一定程度上预测未来的运动。
多帧训练：一些自动驾驶系统可能会使用多帧图像序列来训练模型，以便更好地理解车辆的运动。
过度依赖：如果模型过度依赖于从过去的运动中学习到的模式，它可能会忽略其他重要的环境因素，如交通信号、行人或其他车辆的行为。
模仿问题：这个问题指的是模型在没有理解驾驶行为背后的真正原因的情况下，简单地模仿过去的运动模式。
因果混淆：因果混淆发生在模型不能正确区分哪些信息是导致特定结果的原因，哪些只是相关因素或偶然事件。
信息过载：当模型获得过多的信息时，可能会难以识别哪些是最相关的，从而导致决策错误。
性能下降：在某些情况下，访问更多的信息可能会使模型的决策过程变得复杂，反而导致性能下降。

Causal confusion in imitation learning has been a persistent challenge for nearly two decades. One of the earliest reports of this effect was made by LeCun et al. [243]. They used a single input frame for steering prediction to avoid such extrapolation. Though simplistic, this is still a preferred solution in current state-of-the-art IL methods [22, 28]. Unfortunately, using a single frame makes it hard to extract the motion of surrounding actors. Another source of causal confusion is speed measurement [16]. Fig. 7 showcases an example of a car waiting at a red light. The action of the car could correlate highly with its speed because it has waited for many frames in which the speed is zero and the action is to brake. Only when the traffic light changes from red to green does this correlation break down.
模仿学习中的因果混淆近二十年来一直是一个持续的挑战。最早关于这种效应的报告之一是由LeCun等人[243]提出的。他们使用单个输入帧进行转向预测，以避免这种外推。尽管这种方法简单，但仍然是当前最先进的模仿学习方法[22, 28]中的首选解决方案。不幸的是，使用单个帧很难提取周围参与者的运动。因果混淆的另一个来源是速度测量[16]。图7 展示了一辆在红灯前等待的汽车的例子。汽车的行为可能与其速度高度相关，因为它在许多帧中等待，速度为零，动作是刹车。只有当交通灯从红色变为绿色时，这种相关性才会破裂
There are several approaches to combat the causal confusion problem when using multiple frames. In [57], the authors attempt to remove spurious temporal correlations from the bottleneck representation by training an adversarial model that predicts the ego agent’s past action. Intuitively, the resulting min-max optimization trains the network to eliminate its past from intermediate layers. It works well in MuJoCo but does not scale to complex vision-based driving. OREO [59] maps images to discrete codes representing semantic objects and applies random dropout masks to units that share the same discrete code, which helps in confounded Atari. In end-to-end driving, ChauffeurNet [244] addresses the causal confusion issue by using the past ego-motion as intermediate BEV abstractions and dropping it out with a 50% probability during training. Wen et al. [58] propose upweighting keyframes in the training loss, i.e., frames where a decision change occurs (and which are hence not predictable by extrapolating the past). PrimeNet [60] improves performance compared to keyframes by using an ensemble, where the prediction of a single-frame model is given as additional input to a multi-frame model. Chuang et al. [245] do the same but supervise the multi-frame network with action residuals instead of actions. In addition, the problem of causal confusion can be circumvented by using only LiDAR histories (with a single frame image) and realigning point clouds into one coordinate system. This removes ego-motion while retaining information about other vehicles’ past states. This technique has been used in multiple works [1, 32, 52], though it was not motivated this way.
在模仿学习中使用多帧图像时，存在几种应对因果混淆问题的方法：

对抗模型：在文献[57]中，作者尝试通过训练一个预测自车过去动作的对抗模型，从瓶颈表示中移除偶然的时间相关性。直观地说，这种最小-最大优化训练网络从中间层消除其过去的影响。该方法在MuJoCo上表现良好，但在基于复杂视觉的驾驶上没有扩展性。
OREO方法[59]：将图像映射到表示语义对象的离散代码，并对共享相同离散代码的单元应用随机dropout掩码，这有助于解决Atari游戏中的混淆问题。
ChauffeurNet[244]：通过使用过去的自车运动作为中间BEV（鸟瞰图）抽象，并在训练期间以50%的概率dropout，来解决因果混淆问题。
关键帧加权：Wen等人[58]提出在训练损失中增加关键帧的权重，即在决策发生变化的地方（因此不能通过外推过去来预测）。
PrimeNet[60]：与关键帧相比，通过集成方法提高性能，将单帧模型的预测作为多帧模型的额外输入。
动作残差监督：Chuang等人[245]采取类似的方法，但使用动作残差而不是动作来监督多帧网络。
仅使用激光雷达历史：通过仅使用激光雷达历史（和单帧图像）并将点云重新对齐到一个坐标系统中，可以绕过因果混淆问题。这种方法去除了自车运动，同时保留了其他车辆过去状态的信息。这种技术已在多个研究中使用[1, 32, 52]，尽管它并非以此方式为动机。
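Two of the mitigations listed above, dropping the past ego-motion input with 50% probability and upweighting keyframes in the loss, can be sketched in a few lines. Both functions are simplified illustrations: zeroing the history vector, the change threshold, and the boost factor are assumptions, and the keyframe criterion in [58] is more elaborate than a fixed threshold on the action difference.

```python
import random


def maybe_drop_history(history, p_drop=0.5, rng=random):
    """With probability p_drop, replace the past ego-motion input by zeros."""
    if rng.random() < p_drop:
        return [0.0] * len(history)
    return list(history)


def keyframe_weights(actions, threshold=0.1, boost=5.0):
    """Per-frame loss weights that emphasise frames where the action changes.

    A frame counts as a keyframe when its expert action differs from the
    previous one by more than `threshold`; such frames get weight `boost`.
    """
    weights = [1.0]  # the first frame has no predecessor to compare with
    for prev, cur in zip(actions, actions[1:]):
        weights.append(boost if abs(cur - prev) > threshold else 1.0)
    return weights


# Hypothetical throttle sequence: waiting at a red light, then accelerating.
throttle = [0.0, 0.0, 0.0, 1.0, 1.0]
weights = keyframe_weights(throttle)
```

In this example only the frame where the car starts moving receives the boosted weight, which is exactly the frame that cannot be predicted by copying the previous action.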

However, these studies have used environments that are modified to simplify studying the causal confusion problem. Showing performance improvements in state-of-the-art settings as mentioned in Sec. 3.2.5 remains an open problem.
然而，这些研究使用了被修改的环境来简化研究因果混淆问题。在第3.2.5节中提到的最先进设置中展示性能提升仍然是一个未解决的问题。

4.9 Lack of Robustness

4.9.1 Long-tailed Distribution

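Dataset imbalance is one important aspect of the long-tailed distribution problem, where a few classes make up the majority of the data, as shown in Fig. 8 (a). This poses a big challenge for models to generalize to diverse environments. Various methods mitigate this issue through data processing, including over-sampling [246, 247], under-sampling [248, 249], and data augmentation [250, 251]. Besides, weighting-based approaches [252, 253] are also commonly used.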
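As a small illustration of the weighting-based family of remedies for dataset imbalance, the sketch below computes inverse-frequency class weights. This is one common heuristic and not the specific scheme of the cited works; the scenario labels are hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n / (k * count) so that rare classes count more.

    n is the number of samples and k the number of distinct classes; a
    perfectly balanced dataset yields weight 1.0 for every class.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


# Hypothetical driving log dominated by lane following.
labels = ["lane_follow"] * 8 + ["turn"] * 2
class_weights = inverse_frequency_weights(labels)
```

The same weights can be used either to scale per-sample losses or as sampling probabilities for over-sampling the rare classes.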

In the context of end-to-end autonomous driving, the long-tailed distribution issue is particularly severe. Most driving is repetitive and uninteresting, e.g., following a lane for many frames. Conversely, interesting safety-critical scenarios occur rarely but are diverse in nature, and are hard to replicate in the real world for safety reasons. To tackle this, some works rely on handcrafted scenarios [13, 100, 254, 255, 256] to generate more diverse data in simulation. LBC [5] leverages the privileged agent to create imaginary supervisions conditioned on different navigational commands. LAV [52] includes trajectories of non-ego agents for training to promote data diversity. In [257], a simulation framework is proposed to apply importance-sampling strategies to accelerate the evaluation of rare-event probabilities.
在端到端自动驾驶的背景下，长尾分布问题尤为严重。大多数驾驶行为是重复且无趣的，例如，长时间沿着一条车道行驶。相反，有趣的、关键的安全场景很少发生，但它们在性质上是多样化的，由于安全原因很难在现实世界中复制。为了解决这个问题，一些工作依赖于手工制作的场景[13, 100, 254, 255, 256]，在模拟环境中生成更多样化的数据。LBC[5]利用特权代理来创建基于不同导航命令的条件性想象监督。LAV[52]包括非自我代理的轨迹进行训练，以促进数据多样性。在[257]中，提出了一个模拟框架，应用重要性抽样策略来加速罕见事件概率的评估。
Another line of research [7, 34, 35, 258, 259, 260] generates safety-critical scenarios in a data-driven manner through adversarial attacks. In [258], Bayesian Optimization is employed to generate adversarial scenarios. Learning to collide [35] represents driving scenarios as the joint distribution over building blocks and applies policy gradient RL methods to generate risky scenarios. AdvSim [34] modifies agents’ trajectories to cause failures, while still adhering to physical plausibility. KING [7] proposes an optimization algorithm for safety-critical perturbations using gradients through differentiable kinematics models.
另一条研究线路[7, 34, 35, 258, 259, 260]通过对抗性攻击以数据驱动的方式生成关键的安全场景。在[258]中，贝叶斯优化被用来生成对抗性场景。学习碰撞[35]将驾驶场景表示为构建块上的联合分布，并应用策略梯度强化学习方法来生成危险场景。AdvSim[34]修改代理的轨迹以引起故障，同时仍然遵守物理合理性。KING[7]提出了一种使用可微运动学模型的梯度优化算法，用于关键的安全扰动。
In general, efficiently generating realistic safety-critical scenarios that cover the long-tailed distribution remains a significant challenge. While many works focus on adversarial scenarios in simulators, it is also essential to better utilize real-world data for critical scenario mining and potential adaptation to simulation. Besides, a systematic, rigorous, comprehensive, and realistic testing framework is crucial for evaluating end-to-end autonomous driving methods under these long-tailed distributed safety-critical scenarios.
总的来说，高效地生成覆盖长尾分布的真实关键安全场景仍然是一个重大挑战。虽然许多研究集中在模拟环境中的对抗性场景，但更好地利用现实世界数据进行关键场景挖掘和潜在的模拟适应也至关重要。此外，一个系统性、严谨、全面和现实测试框架对于评估端到端自动驾驶方法在这些长尾分布的关键安全场景下的表现至关重要。

4.9.2 Covariate Shift

As discussed in Sec. 2.1, one important challenge for behavior cloning is covariate shift. The state distributions from the expert’s policy and those from the trained agent’s policy differ, leading to compounding errors when the trained agent is deployed in unseen testing environments or when the reactions from other agents differ from training time. This could result in the trained agent being in a state that is outside the expert’s distribution for training, leading to severe failures. An illustration is presented in Fig. 8 (b).
正如第2.1节所讨论的，行为克隆面临的一个重要挑战是协变量偏移。专家策略的状态分布与训练代理策略的状态分布不同，这导致在未见过的测试环境中部署训练代理或当其他代理的反应与训练时不同的情况下，错误会累积。这可能导致训练代理处于专家训练分布之外的状态，从而导致严重失败。图8(b)中展示了一个示例。
DAgger (Dataset Aggregation) [26] is a common solution for this issue. DAgger is an iterative training process. The current trained policy is rolled out in each iteration to collect new data, and the expert is used to label the visited states. This enriches the dataset by adding examples of how to recover from suboptimal states that an imperfect policy might visit. The policy is then trained on the augmented dataset, and the process repeats. However, one downside of DAgger is the need for an available expert to query online.
DAgger（数据集聚合）[26]是解决这个问题的一个常见解决方案。DAgger是一个迭代训练过程。在每次迭代中，当前训练的策略被推出以收集新数据，并使用专家来标记访问过的状态。这通过添加如何从不完美策略可能访问的次优状态中恢复的例子来丰富数据集。然后，策略在扩充的数据集上进行训练，并且这个过程会重复。然而，DAgger的一个缺点是需要一个可用的专家来进行在线查询。
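The DAgger loop described above can be sketched on a toy problem. Everything in the example is an assumption made for illustration: the state is a one-dimensional lateral offset, the expert is a proportional controller, the policy is a single linear gain fitted by least squares, and the dynamics are a noisy integrator.

```python
import random


def expert(offset):
    """Hypothetical privileged expert: steer back toward the lane centre."""
    return -0.5 * offset


def fit_linear(data):
    """Least-squares fit of the gain w in action = w * state."""
    num = sum(s * a for s, a in data)
    den = sum(s * s for s, _ in data)
    return num / den if den > 0 else 0.0


def rollout(w, steps, rng):
    """Run the current policy and record the states it actually visits."""
    s, visited = rng.uniform(-1.0, 1.0), []
    for _ in range(steps):
        visited.append(s)
        s = s + w * s + rng.gauss(0.0, 0.05)  # toy dynamics with noise
    return visited


def dagger(iterations=5, steps=20, seed=0):
    rng = random.Random(seed)
    data = [(s, expert(s)) for s in (-1.0, 1.0)]  # initial expert demonstrations
    w = fit_linear(data)
    for _ in range(iterations):
        visited = rollout(w, steps, rng)             # 1. roll out the current policy
        data += [(s, expert(s)) for s in visited]    # 2. expert labels visited states
        w = fit_linear(data)                         # 3. retrain on the aggregate
    return w, len(data)
```

The key point is step 2: the labels come from the expert, but the states come from the learner's own rollouts, so the dataset covers the states the learner actually reaches. The online expert query in that step is also the cost noted above.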
For end-to-end autonomous driving, DAgger is adopted in [24] with an MPC-based expert. To reduce the cost of constantly querying the expert, SafeDAgger [25] extends the original DAgger algorithm by learning a safety policy that estimates the deviation between the current policy and the expert policy. The expert is only queried when the deviation is large. MetaDAgger [56] uses meta-learning with DAgger to aggregate data from multiple environments. LBC [5] adopts DAgger and resamples the data with higher loss more frequently. To better utilize failure or safety-related samples, DARB [10] proposes several mechanisms, including task-based, policy-based, and policy & expert-based mechanisms, to sample such critical states.
在端到端自动驾驶领域，[24]中采用了基于模型预测控制（MPC）的专家来实施DAgger。为了降低不断查询专家的成本，SafeDAgger[25]通过学习一个安全策略来扩展原始的DAgger算法，该策略估计当前策略与专家策略之间的偏差。只有在偏差较大时才查询专家。MetaDAgger[56]使用元学习与DAgger结合，从多个环境中聚合数据。LBC[5]采用DAgger并更频繁地重新采样损失更高的数据。在DARB[10]中，为了更好地利用失败或与安全相关的样本，它提出了几种机制，包括基于任务的、基于策略的以及策略和专家结合的机制，来采样这些关键状态。

4.9.3 Domain Adaptation

Domain adaptation (DA) is a type of transfer learning in which the target task is the same as the source task, but the domains differ. Here we discuss scenarios where labels are available for the source domain while there are no labels or a limited amount of labels available for the target domain.
域适应（Domain Adaptation, DA）是一种迁移学习类型，其中目标任务与源任务相同，但域（domain）不同。在这里，我们讨论的是源域有标签可用的情况，而目标域没有标签或只有有限数量的标签可用的场景。
在域适应中，模型首先在源域上进行训练，然后通过学习源域和目标域之间的差异来调整模型，以便在目标域上也能表现良好。这通常涉及到识别和减少域之间的分布差异，这些差异可能源于不同的数据收集环境、传感器特性或数据采集过程。
域适应的关键挑战包括：

数据分布差异：源域和目标域的数据分布可能显著不同，导致模型在目标域上的性能下降。
标签稀缺：目标域可能缺乏足够的标注数据，这限制了直接训练模型的能力。
特征空间对齐：需要找到一种方法来对齐源域和目标域的特征空间，以便模型能够泛化到新的域。

为了解决这些挑战，研究人员开发了多种技术，包括但不限于：

重加权方法：通过调整目标域样本的权重来减少源域和目标域之间的分布差异。
特征转换方法：学习将源域和目标域的特征映射到一个共同的特征空间。
对抗性训练：使用对抗性网络来最小化源域和目标域之间的分布差异。
自适应方法：利用少量的目标域数据来调整模型参数，以适应目标域。

域适应在自动驾驶、医疗诊断、图像识别等领域有着广泛的应用，特别是在数据标注成本高昂或难以获取的情况下。
As shown in Fig. 8 (c), domain adaptation for autonomous driving tasks encompasses several cases [261]:
• Sim-to-real: the large gap between simulators used for training and the real world used for deployment.
• Geography-to-geography: different geographic locations with varying environmental appearances.
• Weather-to-weather: changes in sensor inputs caused by weather conditions such as rain, fog, and snow.
• Day-to-night: illumination variations in visual inputs.
• Sensor-to-sensor: possible differences in sensor characteristics, e.g., resolution and relative position. Note that the aforementioned cases often overlap.
In practice, this overlap means that an autonomous driving system may need to adapt to several domain shifts at once. For example, a system may have to transfer from simulation to the real world while also handling different geographic locations and weather conditions. Such multi-dimensional domain adaptation makes the problem more complex and requires algorithms that can flexibly recognize and adapt to diverse environmental characteristics.
Typically, domain-invariant feature learning is achieved with image translators and discriminators that map images from two domains into a common latent space or into representations such as segmentation maps [262, 263]. LUSR [264] and UAIL [235] adopt a Cycle-Consistent VAE and a GAN, respectively, to project images into a latent representation composed of a domain-specific part and a domain-general part. In SESR [190], class-disentangled encodings are extracted from a semantic segmentation mask to reduce the sim-to-real gap. Domain randomization [265, 266, 267] is also a simple and effective sim-to-real technique for RL policy learning, and it has been further adapted for end-to-end autonomous driving [188, 268]. It is realized by randomizing the rendering and physical settings of the simulator during training so that they cover the variability of the real world.
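A minimal sketch of domain randomization, assuming a simulator that accepts a per-episode configuration: each training episode draws fresh rendering and physics settings. The parameter names, ranges, and the `SimConfig` type are hypothetical and do not correspond to any specific simulator API.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class SimConfig:
    """Hypothetical per-episode simulator settings."""
    sun_altitude_deg: float  # rendering: lighting
    fog_density: float       # rendering: weather
    texture_id: int          # rendering: texture set index
    tire_friction: float     # physics
    vehicle_mass_kg: float   # physics

# Assumed sampling ranges; in practice these are chosen to cover real-world variability.
FLOAT_RANGES = {
    "sun_altitude_deg": (-10.0, 90.0),
    "fog_density": (0.0, 0.8),
    "tire_friction": (0.4, 1.0),
    "vehicle_mass_kg": (1200.0, 2200.0),
}
NUM_TEXTURES = 20

def sample_config(rng: random.Random) -> SimConfig:
    """Draw one randomized configuration, uniformly within each range."""
    floats = {name: rng.uniform(lo, hi) for name, (lo, hi) in FLOAT_RANGES.items()}
    return SimConfig(texture_id=rng.randrange(NUM_TEXTURES), **floats)

def randomized_episodes(n: int, seed: int = 0) -> list:
    """Return n independent configurations, one per training episode."""
    rng = random.Random(seed)
    return [sample_config(rng) for _ in range(n)]

configs = randomized_episodes(5, seed=1)
```

The policy is then trained across all sampled configurations, with the aim that the real world looks like one more variation of the training distribution.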
Currently, the focus is on sim-to-real adaptation through source-to-target image mapping or domain-invariant feature learning. Other DA cases are handled by constructing diverse, large-scale datasets. Given that current methods mainly concentrate on the visual gap in images, and that LiDAR has become a popular input modality for driving, adaptation techniques tailored to LiDAR must also be designed. In addition, the gap in traffic agents' behavior between the simulator and the real world deserves attention. Incorporating real-world data into simulation through techniques such as NeRF [113] is another promising direction.

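• Image mapping and domain-invariant learning: mapping simulated images to real images, or learning features that are invariant across domains, reduces the visual gap between simulation and reality.
• LiDAR adaptation: because LiDAR provides data that differ from camera images, dedicated algorithms are needed to handle domain adaptation for LiDAR data.
• Traffic agent behavior gap: traffic agents in simulation may behave differently from those in the real world, which domain adaptation strategies need to take into account.
• Real-world data integration: techniques such as NeRF (neural radiance fields) can bring complex real-world scenes and dynamic elements into simulation data, improving the realism of the simulation and the generalization of trained models.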

5 FUTURE TRENDS

Considering the challenges and opportunities discussed above, we list several crucial directions for future research that may have a broad impact on this field.

5.1 Zero-shot and Few-shot Learning

Autonomous driving models will inevitably encounter real-world scenarios that lie beyond the training data distribution. This raises the question of whether we can successfully adapt the model to an unseen target domain where limited or no labeled data is available. Formalizing this task for the end-to-end driving domain and incorporating techniques from the zero-shot/few-shot learning literature are the key steps toward achieving this [269, 270].
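Some possible research directions and methods are:

• Zero-shot learning: develop models that can recognize and adapt to unseen categories, even if these categories never appear in the training data.
• Few-shot learning: study how to adapt quickly to new tasks or environments using a small amount of labeled data.
• Meta-learning: design models that learn quickly from limited experience and adapt to new tasks.
• Transfer learning: study how to transfer knowledge learned in the source domain to the target domain, especially when target-domain data is scarce.
• Generative models: use generative adversarial networks (GANs) or other generative models to simulate the data distribution of the target domain.
• Feature space alignment: study how to map source and target features into a common feature space to reduce the gap between domains.
• Adversarial training: use adversarial methods to reduce the model's sensitivity to distribution differences between source and target data.
• Data augmentation: develop new augmentation techniques that simulate the diverse situations that may occur in the target domain.
• Model robustness: improve robustness to input noise and outliers to cope with real-world uncertainty.
• Online learning: study how models can be updated online to adapt to changing environments and conditions.
• Interpretability and verifiability: make the decision process transparent to human users so that it can be verified and trusted.
• Cross-domain generalization: study how models generalize across multiple domains, rather than transferring from a single domain to another.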

5.2 Modular End-to-end Planning

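The modular end-to-end planning framework optimizes multiple modules while prioritizing the ultimate planning task, and it enjoys the advantage of interpretability, as indicated in Sec. 4.6. This framework is advocated in recent literature [2, 271], and certain industry solutions (Tesla, Wayve, etc.) have involved similar ideas. When designing these differentiable perception modules, several questions arise regarding the choice of loss functions, such as the necessity of 3D bounding boxes for object detection, whether to opt for BEV segmentation over lane topology for static scene perception, and which training strategies to use when data for some modules is limited:

• Necessity of 3D bounding boxes for object detection: 3D bounding boxes provide the size, position, and orientation of objects, which helps the vehicle understand its surroundings. However, obtaining them may require complex sensor fusion and considerable computation, so their accuracy benefit must be weighed against their cost.
• BEV segmentation versus lane topology for static scene perception: a bird's-eye view (BEV) presents the scene from above and helps the vehicle identify elements such as lane lines, whereas lane topology focuses on road geometry and drivable paths. The choice depends on the application scenario and the level of detail required.
• Training strategies with limited module data: when little data is available for some modules, strategies such as data augmentation, transfer learning, or synthetic data may be needed to improve performance and generalization.

An advantage of the modular planning framework is its flexibility and extensibility: developers can customize and optimize individual components according to application needs and resource constraints, and new modules can be integrated and optimized incrementally as requirements evolve.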
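One common way to express "optimizing multiple modules while prioritizing planning" is a weighted multi-task objective. The form below is an illustrative assumption, not a formula given in the survey: $\mathcal{M}$ denotes the set of auxiliary modules (e.g., detection, mapping, prediction), $\mathcal{L}_m$ the loss of module $m$, and $\lambda_m \ge 0$ a hypothetical weight chosen so that the planning loss $\mathcal{L}_{\text{plan}}$ dominates.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{plan}}
  + \sum_{m \in \mathcal{M}} \lambda_m \, \mathcal{L}_m
```

Under this reading, the design questions above amount to choosing which terms $\mathcal{L}_m$ to include and how to weight them.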

5.3 Data Engine

The importance of large-scale and high-quality data for autonomous driving can never be emphasized enough [272]. Establishing a data engine with an automatic labeling pipeline [273] could greatly facilitate the iterative development of both data and models. The data engine for autonomous driving, especially for modular end-to-end planning systems, needs to streamline the automatic annotation of high-quality perception labels with the aid of large perception models. It should also support hard/corner case mining, scene generation, and scene editing to facilitate the data-driven evaluations discussed in Sec. 3.2 and to promote the diversity of data and the generalization ability of models (Sec. 4.9). A data engine would enable autonomous driving models to make consistent improvements.
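The hard-case mining step of such a data engine can be sketched as selecting the scenes on which the current model performs worst. The scene identifiers, the per-scene loss values, and the `mine_hard_cases` helper are hypothetical; a real pipeline might rank scenes by planning error, disagreement with auto-labels, or other criteria.

```python
import heapq

def mine_hard_cases(scene_losses, k, min_loss=0.0):
    """Return the ids of the k scenes with the highest loss, hardest first.

    scene_losses maps a scene id to the current model's loss on that scene;
    scenes below min_loss are ignored.
    """
    candidates = ((loss, sid) for sid, loss in scene_losses.items() if loss >= min_loss)
    return [sid for _, sid in heapq.nlargest(k, candidates)]

# Toy per-scene losses from a hypothetical evaluation run.
losses = {
    "highway_01": 0.1,
    "night_rain_07": 0.9,
    "roundabout_03": 0.6,
    "parking_02": 0.2,
}
hard = mine_hard_cases(losses, k=2)
```

The selected scenes would then be prioritized for labeling, scene editing, or re-simulation in the next data iteration.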

5.4 Foundation Model

Recent advancements in foundation models in both language [166, 167, 274] and vision [273, 275, 276] have shown that large-scale data and model capacity can unleash the immense potential of AI in high-level reasoning tasks. The paradigms of finetuning [277] and prompt learning [278], as well as optimization in the form of self-supervised reconstruction [279] or contrastive pairs [165], are all applicable to the end-to-end driving domain. However, we contend that the direct adoption of LLMs for driving might be tricky. The output of an autonomous agent requires steady and accurate measurements, whereas the generative output of language models aims to behave like humans, irrespective of its accuracy. A feasible way to develop a “foundation” driving model is to train a world model that can forecast the reasonable future of the environment, whether in 2D, 3D, or latent space. To perform well on downstream tasks such as planning, the objective optimized by the model needs to be sophisticated enough to go beyond frame-level perception.

6 CONCLUSION AND OUTLOOK

In this survey, we provide an overview of fundamental methodologies and summarize various aspects of simulation and benchmarking. We thoroughly analyze the extensive literature to date and highlight a wide range of critical challenges and promising resolutions.
Outlook. The industry has dedicated considerable effort over the years to developing advanced modular systems capable of achieving autonomous driving on highways. However, these systems face significant challenges when confronted with complex scenarios, e.g., inner-city streets and intersections. Therefore, an increasing number of companies have started exploring end-to-end autonomous driving techniques specifically tailored for these environments. It is envisioned that with extensive high-quality data collection, large-scale model training, and the establishment of reliable benchmarks, the end-to-end approach will soon surpass modular stacks in terms of performance and effectiveness. In summary, end-to-end autonomous driving faces great opportunities and challenges simultaneously, with the ultimate goal of building generalist agents. In this era of emerging technologies, we hope this survey can serve as a starting point to shed new light on this domain.