Hydra-MDP: End-to-end Multimodal Planning with Multi-target Hydra-Distillation

Abstract

We propose Hydra-MDP, a novel paradigm employing multiple teachers in a teacher-student model. This approach uses knowledge distillation from both human and rule-based teachers to train the student model, which features a multi-head decoder to learn diverse trajectory candidates tailored to various evaluation metrics. With the knowledge of rule-based teachers, Hydra-MDP learns how the environment influences the planning in an end-to-end manner instead of resorting to non-differentiable post-processing. This method achieved first place in the Navsim challenge, demonstrating significant improvements in generalization across diverse driving environments and conditions. Code will be available at https://github.com/woxihuanjiangguo/Hydra-MDP

1. Introduction

End-to-end autonomous driving, which involves learning a neural planner with raw sensor inputs, is considered a promising direction to achieve full autonomy. Despite the promising progress in this field [11, 12], recent studies [4, 8, 14] have exposed multiple vulnerabilities and limitations of imitation learning (IL) methods, particularly the inherent issues in open-loop evaluation, such as the dysfunctional metrics and implicit biases [8, 14]. This is critical as it fails to guarantee safety, efficiency, comfort, and compliance with traffic rules. To address this main limitation, several works have proposed incorporating closed-loop metrics, which more effectively evaluate end-to-end autonomous driving by ensuring that the machine-learned planner meets essential criteria beyond merely mimicking human drivers.
Therefore, end-to-end planning is ideally a multi-target and multimodal task, where multi-target planning involves meeting various evaluation metrics from both open-loop and closed-loop settings. In this context, multimodal indicates the existence of multiple optimal solutions for each metric.
Existing end-to-end approaches [4, 11, 12] often try to consider closed-loop evaluation via post-processing, which is not streamlined and may result in the loss of additional information compared to a fully end-to-end pipeline. Meanwhile, rule-based planners [8, 18] struggle with imperfect perception inputs. These imperfect inputs degrade the performance of rule-based planning under both closed-loop and open-loop metrics, as they rely on predicted perception instead of ground truth (GT) labels.
To address these issues, we propose a novel end-to-end autonomous driving framework called Hydra-MDP (Multimodal Planning with Multi-target Hydra-distillation). Hydra-MDP is based on a novel teacher-student knowledge distillation (KD) architecture. The student model learns diverse trajectory candidates tailored to various evaluation metrics through KD from both human and rule-based teachers. We instantiate the multi-target Hydra-distillation with a multi-head decoder, thus effectively integrating the knowledge from specialized teachers. Hydra-MDP also features an extendable KD architecture, allowing for easy integration of additional teachers.
The student model uses environmental observations during training, while the teacher models use ground truth (GT) data. This setup allows the teacher models to generate better planning predictions, helping the student model to learn effectively. Because the student is trained on environmental observations, it becomes adept at handling realistic conditions where GT perception is not accessible during testing.
Our contributions are summarized as follows:
1. We propose a universal framework of end-to-end multimodal planning via multi-target Hydra-distillation, allowing the model to learn from both rule-based planners and human drivers in a scalable manner.
2. Our approach achieves state-of-the-art performance under the simulation-based evaluation metrics on Navsim.

2. Solution

2.1. Preliminaries

Let $O$ represent sensor observations, $\hat P$ and $P$ denote ground-truth and predicted perceptions (e.g., 3D object detection, lane detection), $\hat T$ be the expert trajectory, and $T^*$ be the predicted trajectory. $L_{im}$ represents the imitation loss. In this section we first introduce the two prevailing paradigms and then our proposed paradigm (Fig. 1):
A. Single-modal Planning + Single-target Learning. In this paradigm [11, 12, 14], the planning network directly regresses the planned trajectory from the sensor observations. Ground-truth perceptions can be used as auxiliary supervision but do not influence the planning output. Perception losses are omitted from the formula for simplicity. The whole process can be formulated as:
$L = L_{im}(T^*, \hat T) \qquad (1)$
where $L_{im}$ is usually an L2 loss. The L2 (Euclidean) loss is a common measure of the difference between predicted and ground-truth values:
$Loss=\sum_{i=1}^n(y_i-\hat y_i)^2$
where $y_i$ is the ground-truth value and $\hat y_i$ is the predicted value.
B. Multimodal Planning + Single-target Learning. This approach [1, 4] predicts multiple trajectories $\{T_i\}^k_{i=1}$, where $k$ is the number of predicted trajectories, whose similarities to the expert trajectory $\hat T$ are computed:
$L = \sum_{i} L_{im}(T_i, \hat T) \qquad (2)$
where $L_{im}$ can be the KL divergence [4] or the max-margin loss [1]. Perception outputs $P$ are explicitly used to post-process suitable trajectories via a cost function $f(T_i, P)$. The trajectory with the lowest cost is selected:
$T^* = \arg\min_{T_i} f(T_i, P) \qquad (3)$

which is a non-differentiable process based on imperfect perception $P$. Post-processing lets the planner use perception outputs to refine trajectory selection, but the cost function $f(T_i, P)$ may involve heuristic rules or hard thresholds that have no useful gradient, so it cannot be optimized with standard gradient-based methods such as gradient descent. Optimizing such a step would instead call for gradient-free search, for example genetic algorithms, simulated annealing, or Monte Carlo methods.
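As a rough illustration of the post-processing step in paradigm B, the sketch below scores each candidate with a hand-written, hard-threshold collision cost and keeps the cheapest one. The cost function, the `radius` parameter, and the toy data are all hypothetical and are not taken from any of the cited planners; the point is only that a thresholded rule like this offers no useful gradient to the planning network.

```python
import math

def collision_cost(trajectory, obstacles, radius=1.0):
    """Hypothetical rule: cost 1.0 if any waypoint is within `radius` of an obstacle, else 0.0."""
    for x, y in trajectory:
        for ox, oy in obstacles:
            if math.hypot(x - ox, y - oy) < radius:
                return 1.0
    return 0.0

def select_trajectory(candidates, obstacles):
    """Return the index of the lowest-cost candidate and the list of all costs."""
    costs = [collision_cost(t, obstacles) for t in candidates]
    best = min(range(len(candidates)), key=costs.__getitem__)
    return best, costs

# Toy scene: two candidates and one (predicted) obstacle position.
candidates = [
    [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)],  # straight ahead, passes the obstacle
    [(1.0, 0.5), (2.0, 1.5), (3.0, 3.0)],  # swerves left, stays clear
]
obstacles = [(2.0, 0.2)]
best, costs = select_trajectory(candidates, obstacles)
print(best, costs)
```

If the obstacle position comes from an imperfect detector, a small perception error can flip the cost from 0 to 1, which is the fragility the text attributes to this paradigm.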
C. Multimodal Planning + Multi-target Learning. We propose this paradigm to simultaneously predict various costs (e.g., collision cost, drivable area compliance cost) via a neural network $\tilde f$. This is performed in a teacher-student distillation manner, where the teacher has access to ground-truth perception $\hat P$ but the student relies only on sensor observations $O$. Intuitively, the teacher sees the answer key while the student sees only the exam questions: the teacher grades the student on the main scoring points of the answer key rather than demanding a word-for-word match, and the student learns from those grades. This paradigm can be formulated as:

$L = \sum_{i} L_{im}(T_i, \hat T) + L_{kd}\big(f(T_i, \hat P),\ \tilde f(T_i, O)\big) \qquad (4)$

Here, we only consider one cost function f for clarity. The trajectory with the lowest predicted cost is selected:
$T^* = \arg\min_{T_i} \tilde f(T_i, O) \qquad (5)$

We stress that this framework is not restricted by non-differentiable post-processing. It can be easily scaled in an end-to-end fashion by involving more cost functions or leveraging imitation similarity in our implementation (Sec. 2.4). In other words, the framework can absorb additional evaluation metrics, or a refined imitation objective, without changing its overall structure.

2.2. Overall Framework

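As shown in Fig. 2, Hydra-MDP consists of two networks: a Perception Network and a Trajectory Decoder. The Perception Network turns raw sensor data into a representation of the environment, and the Trajectory Decoder uses that representation to produce trajectory predictions, so the whole system can be trained end to end.
Perception Network. Our perception network builds on the official challenge baseline Transfuser [5, 6], which consists of an image backbone, a LiDAR backbone, and perception heads for 3D object detection and bird's-eye-view (BEV) segmentation. Multiple transformer layers [19] connect the features of the two backbones, extracting meaningful information from the different modalities. The final output of the perception network is a set of environmental tokens $F_{env}$, which encode rich semantic information derived from both images and LiDAR point clouds.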
Trajectory Decoder. Following Vadv2 [4], we construct a fixed planning vocabulary to discretize the continuous action space. To build the vocabulary, we first sample 700K trajectories randomly from the original nuPlan database [2]. Each trajectory $T_i$ ($i = 1, \dots, k$) consists of 40 timestamps of $(x, y, heading)$, corresponding to the desired 10 Hz frequency and a 4-second future horizon in the challenge. The planning vocabulary $V_k$ is formed as the K-means clustering centers of the 700K trajectories, where $k$ denotes the size of the vocabulary. $V_k$ is then embedded as $k$ latent queries with an MLP, sent into layers of transformer encoders [19], and added to the ego status $E$:
$V'_k = \mathrm{Transformer}(Q, K, V = \mathrm{MLP}(V_k)) + E \qquad (6)$
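The vocabulary construction described above can be sketched with a plain Lloyd's-algorithm K-means in pure Python. This is a toy illustration, not the authors' code: it clusters six synthetic straight-line trajectories of 4 waypoints each instead of 700K nuPlan trajectories of 40 waypoints, and the helper names (`straight_trajectory`, `kmeans`) are hypothetical.

```python
import random

def straight_trajectory(speed, steps=4, dt=0.1):
    """Flatten a straight-line trajectory into [x1, y1, heading1, x2, y2, heading2, ...]."""
    flat = []
    for t in range(1, steps + 1):
        flat.extend([speed * dt * t, 0.0, 0.0])
    return flat

def sq_dist(a, b):
    """Squared Euclidean distance between two flattened trajectories."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm: assign each point to its nearest center, then recompute the means."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the previous center if a cluster goes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers

# Two groups of trajectories: slow (about 1 m/s) and fast (about 5 m/s).
trajectories = [straight_trajectory(s) for s in (0.9, 1.0, 1.1, 4.9, 5.0, 5.1)]
vocabulary = sorted(kmeans(trajectories, k=2), key=lambda c: c[0])
print([round(c[0], 3) for c in vocabulary])
```

Each cluster center is itself a trajectory, so the resulting list plays the role of $V_k$ with $k = 2$. At the scale reported in the text, an optimized library implementation would be the practical choice.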

To incorporate environmental clues in $F_{env}$, transformer decoders are leveraged:
$V''_k = \mathrm{Transformer}(Q = V'_k,\ K, V = F_{env}) \qquad (7)$

Using the log-replay trajectory $\hat T$, we implement a distance-based cross-entropy loss to imitate human drivers:
$L_{im} = -\sum_{i=1}^{k} y_i \log S^{im}_i \qquad (8)$
where $S^{im}_i$ is the $i$-th softmax score of $V''_k$, and $y_i$ is the imitation target produced by L2 distances between log-replays and the vocabulary. Softmax is applied on L2 distances to produce a probability distribution:
$y_i = \frac{e^{-(\hat T - T_i)^2}}{\sum_{j=1}^{k} e^{-(\hat T - T_j)^2}} \qquad (9)$

The intuition behind this imitation target is to reward trajectory proposals that are close to human driving behaviors. The model is thereby encouraged to reproduce the decisions and driving style of human drivers.
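The distance-based imitation target and its cross-entropy loss can be sketched as follows. This is a minimal sketch under stated assumptions: trajectories are flattened coordinate lists, the target is a softmax over negative squared L2 distances to the log-replay trajectory, and the function names and toy numbers are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def imitation_target(expert, vocabulary):
    """Soft label y_i: softmax over negative squared L2 distances to the expert trajectory."""
    neg_sq_dists = [-sum((a - b) ** 2 for a, b in zip(expert, t)) for t in vocabulary]
    return softmax(neg_sq_dists)

def imitation_loss(logits, target):
    """Cross-entropy between the soft target and the softmax of the predicted logits."""
    scores = softmax(logits)
    return -sum(y * math.log(s) for y, s in zip(target, scores))

# Toy vocabulary of three one-waypoint trajectories and an expert close to the second one.
vocabulary = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]]
expert = [0.9, 0.0]
target = imitation_target(expert, vocabulary)

loss_good = imitation_loss([0.0, 5.0, 0.0], target)  # prefers the closest trajectory
loss_bad = imitation_loss([5.0, 0.0, 0.0], target)   # prefers a farther trajectory
print(target, loss_good, loss_bad)
```

Because the target is soft, vocabulary entries near the expert trajectory still receive some probability mass, so the model is not penalized as heavily for choosing a near miss as for choosing a distant trajectory.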

2.3. Multi-target Hydra-Distillation

Though the imitation target provides certain clues for the planner, it is insufficient for the model to associate the planning decision with the driving environment under the closed-loop setting, leading to failures such as collisions and leaving drivable areas [14]. Therefore, to boost the closed-loop performance of our end-to-end planner, we propose Multi-target Hydra-Distillation, a learning strategy that aligns the planner with simulation-based metrics in this challenge.
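This section looks at the multi-target distillation strategy of the Hydra-MDP framework. The strategy extracts knowledge from several teacher models and fuses it into the student model, with the aim of training an end-to-end driving model that performs well under multiple evaluation metrics.
Multi-target distillation may involve the following steps:
Teacher selection: choose or train a set of teachers, each optimized for a particular evaluation metric or driving behavior.
Knowledge extraction: extract the key knowledge of each teacher, which may include policies, behavior patterns, or decisions in specific situations.
Student training: train the student with the extracted knowledge so that it learns to make decisions across different driving scenarios.
Loss construction: build a loss function that accounts for several evaluation metrics at once, so that the student performs well on each of them.
End-to-end optimization: keep optimizing the student end to end so that it imitates human driving while also satisfying the other metrics.
Knowledge fusion: where needed, combine the knowledge of several teachers to improve the student's generalization.
The goal of multi-target Hydra-distillation is a driving model that weighs multiple driving objectives and constraints together and makes safe, efficient, and comfortable decisions across traffic environments and conditions.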
The distillation process expands the learning target through two steps: (1) running offline simulations [8] of the planning vocabulary $V_k$ for the entire training dataset; (2) introducing supervision from simulation scores for each trajectory in $V_k$ during the training process. For a given scenario, step 1 generates ground-truth simulation scores $\{\hat S^m_i \mid i = 1, \dots, k\}_{m=1}^{|M|}$ for each metric $m \in M$ and the $i$-th trajectory, where $M$ represents the set of closed-loop metrics used in the challenge. For score predictions, latent vectors $V''_k$ are processed with a set of Hydra Prediction Heads, yielding predicted scores $\{S^m_i \mid i = 1, \dots, k\}_{m=1}^{|M|}$. With a binary cross-entropy loss, we distill rule-based driving knowledge into the end-to-end planner:
$L_{kd} = -\sum_{m, i} \left[ \hat S^m_i \log S^m_i + (1 - \hat S^m_i) \log (1 - S^m_i) \right] \qquad (10)$
For a trajectory $T_i$, its distillation loss of each sub-score acts as a learned cost value in Eq. 4, measuring the violation of particular traffic rules associated with that metric. In this way the training loss itself carries the rule-based notion of compliance.
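The binary cross-entropy distillation over metrics and vocabulary entries can be sketched as below. The dictionary layout (metric name mapped to a list of $k$ scores), the clipping constant `eps`, and the toy values are assumptions made for the example.

```python
import math

def bce(pred, target, eps=1e-7):
    """Binary cross-entropy for one predicted score in (0, 1) against a teacher score in [0, 1]."""
    p = min(max(pred, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def distillation_loss(pred_scores, teacher_scores):
    """Sum the BCE over every metric m and every vocabulary trajectory i."""
    total = 0.0
    for metric, teacher in teacher_scores.items():
        for s_hat, s in zip(teacher, pred_scores[metric]):
            total += bce(s, s_hat)
    return total

# Teacher (simulation) scores for a vocabulary of two trajectories and two metrics.
teacher = {"NC": [1.0, 0.0], "DAC": [1.0, 1.0]}
good_student = {"NC": [0.9, 0.1], "DAC": [0.9, 0.9]}
bad_student = {"NC": [0.1, 0.9], "DAC": [0.1, 0.1]}

loss_good = distillation_loss(good_student, teacher)
loss_bad = distillation_loss(bad_student, teacher)
print(loss_good, loss_bad)
```

Each metric would have its own prediction head in the model; here the heads' outputs are simply given as lists so that the loss can be checked by hand.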

2.4. Inference and Post-processing

2.4.1 Inference

Given the predicted imitation scores $\{S^{im}_i \mid i = 1, \dots, k\}$ and metric sub-scores $\{S^m_i \mid i = 1, \dots, k\}_{m=1}^{|M|}$, we calculate an assembled cost measuring the likelihood of each trajectory being selected in the given scenario as follows:
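$\tilde f(T_i, O) = -\left( w_1 \log S^{im}_i + w_2 \log S^{NC}_i + w_3 \log S^{DAC}_i + w_4 \log \left( 5 S^{TTC}_i + 2 S^{C}_i + 5 S^{EP}_i \right) \right) \qquad (11)$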
where $\{w_i\}_{i=1}^{4}$ represent confidence weighting parameters to mitigate the imperfect fitting of different teachers. The optimal combination of weights is obtained via grid search; the weights typically fall within the following ranges: $0.01 \le w_1 \le 0.1$, $0.1 \le w_2, w_3 \le 1$, $1 \le w_4 \le 10$, indicating the necessity to prioritize rule-based costs over imitation. Finally, the trajectory with the lowest overall cost is chosen:
$T^* = \arg\min_{T_i} \tilde f(T_i, O) \qquad (12)$
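The inference-time selection can be sketched as follows. The sketch assumes the assembled cost is a negative weighted sum of log-scores, with TTC, C, and EP grouped under $w_4$ using the 5:2:5 weighting of the PDM score; that grouping, the `eps` constant, and the specific weights (picked from inside the ranges quoted above) are assumptions for illustration, not values confirmed by the source.

```python
import math

def assembled_cost(s_im, sub, w=(0.05, 0.5, 0.5, 5.0), eps=1e-6):
    """Negative weighted log-score; lower is better. `sub` maps metric name to a score in [0, 1]."""
    w1, w2, w3, w4 = w
    grouped = 5.0 * sub["TTC"] + 2.0 * sub["C"] + 5.0 * sub["EP"]
    return -(
        w1 * math.log(s_im + eps)
        + w2 * math.log(sub["NC"] + eps)
        + w3 * math.log(sub["DAC"] + eps)
        + w4 * math.log(grouped + eps)
    )

def select_trajectory(imitation_scores, sub_scores):
    """Return the index of the lowest-cost trajectory and the list of all costs."""
    costs = [assembled_cost(s_im, sub) for s_im, sub in zip(imitation_scores, sub_scores)]
    best = min(range(len(costs)), key=costs.__getitem__)
    return best, costs

# Trajectory 0 looks very human-like but is likely to collide; trajectory 1 is the opposite.
imitation_scores = [0.9, 0.1]
sub_scores = [
    {"NC": 0.10, "DAC": 0.95, "TTC": 0.9, "C": 1.0, "EP": 0.8},
    {"NC": 0.95, "DAC": 0.95, "TTC": 0.9, "C": 1.0, "EP": 0.8},
]
best, costs = select_trajectory(imitation_scores, sub_scores)
print(best, costs)
```

With a small $w_1$ the imitation score acts mostly as a tie-breaker, so the safer trajectory wins even though it is less human-like, which matches the stated priority of rule-based costs over imitation.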

2.4.2 Model Ensembling

We present two model ensembling techniques: Mixture of Encoders and Sub-score Ensembling. The former technique uses a linear layer to combine features from different vision encoders, while the latter calculates a weighted sum of sub-scores from independent models for trajectory selection.
Mixture of Encoders learns a linear combination of the encoders' outputs, which can make the fused feature more expressive than that of any single encoder. Sub-score Ensembling instead works at the decision level: each model's sub-scores are weighted and summed into a combined score, which helps balance the strengths and weaknesses of the individual models and can improve robustness.
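Sub-score Ensembling can be sketched as a per-metric weighted combination of the scores that independent models assign to the same vocabulary. Normalizing by the sum of the weights is an assumption added here so that the result stays in $[0, 1]$; the function name, data layout, and values are likewise hypothetical.

```python
def ensemble_sub_scores(model_scores, model_weights):
    """Combine per-metric scores from several models.

    model_scores: one dict per model, mapping metric name to a list of k scores.
    model_weights: one non-negative weight per model.
    """
    total_weight = sum(model_weights)
    combined = {}
    for metric in model_scores[0]:
        k = len(model_scores[0][metric])
        combined[metric] = [
            sum(w * scores[metric][i] for w, scores in zip(model_weights, model_scores))
            / total_weight
            for i in range(k)
        ]
    return combined

# Two models scoring the same two candidate trajectories on one metric.
model_a = {"NC": [0.2, 0.8]}
model_b = {"NC": [0.6, 0.4]}
combined = ensemble_sub_scores([model_a, model_b], [1.0, 1.0])
print(combined)
```

The combined sub-scores can then be fed to the same cost-based selection used for a single model.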

3. Experiments

3.1. Dataset and metrics

Dataset. The Navsim dataset builds on the existing OpenScene [7] dataset, a compact version of nuPlan [3] with only relevant annotations and sensor data sampled at 2 Hz. The dataset primarily focuses on scenarios involving changes in intention, where the ego vehicle’s historical data cannot be extrapolated into a future plan. The dataset provides annotated 2D high-definition maps with semantic categories and 3D bounding boxes for objects. The dataset is split into two parts: Navtrain and Navtest, which respectively contain 1192 and 136 scenarios for training/validation and testing.
Metrics. For this challenge, we evaluate our models based on the PDM score, which can be formulated as follows:
$PDM_{score} = NC \times DAC \times DDC \times \frac{5 \times TTC + 2 \times C + 5 \times EP}{5 + 2 + 5}$
where the sub-metrics NC, DAC, TTC, C, and EP correspond to No at-fault Collisions, Drivable Area Compliance, Time to Collision, Comfort, and Ego Progress:
NC (No at-fault Collisions): the planned trajectory causes no collision for which the ego vehicle is at fault.
DAC (Drivable Area Compliance): the trajectory stays inside the drivable area.
TTC (Time to Collision): the time margin between the trajectory and a potential collision.
C (Comfort): the smoothness of the trajectory and the resulting ride quality.
EP (Ego Progress): how far the ego vehicle advances along its route.
For the distillation process and subsequent results, DDC (Driving Direction Compliance) is neglected due to an implementation problem.
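The score can be sketched as a small function. The sketch assumes the PDM score multiplies the penalty terms (NC, DAC, DDC) by a 5:2:5 weighted average of TTC, C, and EP; since DDC is neglected in this work, it defaults to 1.0 here. The function and argument names are hypothetical.

```python
def pdm_score(nc, dac, ttc, comfort, ep, ddc=1.0):
    """PDM score from sub-scores in [0, 1]; `ddc` defaults to 1.0 because it is neglected here."""
    weighted_average = (5.0 * ttc + 2.0 * comfort + 5.0 * ep) / (5.0 + 2.0 + 5.0)
    return nc * dac * ddc * weighted_average

perfect = pdm_score(nc=1.0, dac=1.0, ttc=1.0, comfort=1.0, ep=1.0)
collided = pdm_score(nc=0.0, dac=1.0, ttc=1.0, comfort=1.0, ep=1.0)
slow = pdm_score(nc=1.0, dac=1.0, ttc=1.0, comfort=1.0, ep=0.4)
print(perfect, collided, slow)
```

The multiplicative terms act as hard gates: an at-fault collision or leaving the drivable area zeroes the whole score regardless of progress or comfort, which is why the planner needs explicit supervision on those metrics.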

3.2. Implementation Details

We train our models on the Navtrain split using 8 NVIDIA A100 GPUs, with a total batch size of 256 across 20 epochs. The learning rate and weight decay are set to $1 \times 10^{-4}$ and 0.0 following the official baseline. LiDAR points from 4 frames are splatted onto the BEV plane to form a density BEV feature, which is encoded using ResNet34 [10]. For images, the front-view image is concatenated with the center-cropped front-left-view and front-right-view images, yielding an input resolution of 256 × 1024 by default. ResNet34 is also applied for image feature extraction unless otherwise specified. No data or test-time augmentations are used.
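Splatting LiDAR points from several frames onto a BEV density grid can be sketched as a simple 2D histogram. The grid extent, cell size, and the `max_points` normalization cap below are hypothetical values chosen for a tiny example; the source does not state the ones used in the actual pipeline.

```python
import math

def bev_density(points, extent=2.0, cell=1.0, max_points=5):
    """Count points per BEV cell over [-extent, extent) in x and y, then scale counts to [0, 1]."""
    size = int(round(2 * extent / cell))
    grid = [[0.0] * size for _ in range(size)]
    for x, y, _z in points:
        col = math.floor((x + extent) / cell)
        row = math.floor((y + extent) / cell)
        if 0 <= row < size and 0 <= col < size:  # drop points outside the grid
            grid[row][col] += 1.0
    return [[min(count / max_points, 1.0) for count in row] for row in grid]

# Two toy frames of (x, y, z) points; the frames are merged before splatting.
frames = [
    [(0.5, 0.5, 0.1), (-1.5, -1.5, 0.0)],
    [(0.6, 0.4, 0.2), (5.0, 5.0, 0.3)],  # the last point falls outside the grid
]
points = [p for frame in frames for p in frame]
grid = bev_density(points)
print(grid)
```

The resulting grid is a single-channel image, which is what allows an image backbone such as ResNet34 to encode the LiDAR input.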

3.3. Main Results

Our results, presented in Tab. 1, show a clear advantage of Hydra-MDP over the baseline. In our exploration of different planning vocabularies [4], utilizing a larger vocabulary $V_{8192}$ yields improvements across different methods. Furthermore, non-differentiable post-processing yields smaller performance gains than our framework, while weighted confidence enhances the performance comprehensively. To ablate the effect of different learning targets, the continuous metric EP (Ego Progress) is not considered in early experiments and we attempt the distillation of the overall PDM score. Nonetheless, the irregular distribution of the PDM score incurs performance degradation, which suggests the necessity of our multi-target learning paradigm. In the final version, Hydra-MDP-$V_{8192}$-W-EP, the distillation of EP improves the corresponding metric.
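Table 1. Performance on the Navtest split. Note that the official Navsim implementation of PDM-Closed may be error-prone because its braking strategy and offset formulation are inconsistent with the nuPlan implementation [8]. All end-to-end methods use the official Transfuser [5] as their perception network. * denotes training with the distance-based imitation loss. PP: Transfuser perception is used for post-processing. PDM: the learning target is the overall PDM score. W: weighted confidence is used during inference. EP: the model is trained to fit the continuous Ego Progress metric.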

3.4. Scaling Up and Model Ensembling

Previous literature [11] suggests that larger backbones lead to only minor improvements in planning performance. Nevertheless, we further demonstrate the scalability of our model with larger backbones. Tab. 2 shows the three best-performing versions of Hydra-MDP with ViT-L [9, 20] and V2-99 [13] as the image backbone. For the final submission, we use the ensembled sub-scores of these three models for inference.
Ensembling thus offers a way to combine the strengths of models built on different backbones, and the backbones can be chosen and mixed to suit the application and its performance requirements.