ICCV2023知识蒸馏相关论文速览

Paper1 Spatial Self-Distillation for Object Detection with Inaccurate Bounding Boxes

摘要原文: Object detection via inaccurate bounding box supervision has boosted a broad interest due to the expensive high-quality annotation data or the occasional inevitability of low annotation quality (e.g. tiny objects). The previous works usually utilize multiple instance learning (MIL), which highly depends on category information, to select and refine a low-quality box. Those methods suffer from part domination, object drift and group prediction problems without exploring spatial information. In this paper, we heuristically propose a Spatial Self-Distillation based Object Detector (SSD-Det) to mine spatial information to refine the inaccurate box in a self-distillation fashion. SSD-Det utilizes a Spatial Position Self-Distillation SPSD) module to exploit spatial information and an interactive structure to combine spatial information and category information, thus constructing a high-quality proposal bag. To further improve the selection procedure, a Spatial Identity Self-Distillation (SISD) module is introduced in SSD-Det to obtain spatial confidence to help select the best proposals. Experiments on MS-COCO and VOC datasets with noisy box annotation verify our method’s effectiveness and achieve state-of-the-art performance. The code is available at https://github.com/ucas-vg/PointTinyBenchmark/tree/SSD-Det.

中文总结: 这段话主要介绍了通过不准确的边界框监督进行目标检测的方法。由于高质量标注数据昂贵，或者由于低质量标注数据的偶然不可避免性（例如微小目标），这种方法引起了广泛关注。以往的研究通常利用多实例学习（MIL）来选择和优化低质量的边界框，但这些方法存在部分支配、目标漂移和群体预测等问题，没有充分探索空间信息。因此，本文提出了一种基于空间自蒸馏的目标检测器（SSD-Det），以启发式方式挖掘空间信息，以自蒸馏的方式优化不准确的边界框。SSD-Det利用空间位置自蒸馏（SPSD）模块来利用空间信息，利用交互结构来结合空间信息和类别信息，从而构建高质量的提议包。为了进一步改进选择过程，SSD-Det引入了空间身份自蒸馏（SISD）模块，以获得空间置信度，帮助选择最佳提议。在带有嘈杂边界框标注的MS-COCO和VOC数据集上的实验证明了我们方法的有效性，并取得了最先进的性能。源代码可在https://github.com/ucas-vg/PointTinyBenchmark/tree/SSD-Det找到。

Paper2 Alleviating Catastrophic Forgetting of Incremental Object Detection via Within-Class and Between-Class Knowledge Distillation

摘要原文: Incremental object detection (IOD) task requires a model to learn continually from newly added data. However, directly fine-tuning a well-trained detection model on a new task will sharply decrease the performance on old tasks, which is known as catastrophic forgetting. Knowledge distillation, including feature distillation and response distillation, has been proven to be an effective way to alleviate catastrophic forgetting. However, previous works on feature distillation heavily rely on low-level feature information, while under-exploring the importance of high-level semantic information. In this paper, we discuss the cause of catastrophic forgetting in IOD task as destruction of semantic feature space. We propose a method that dynamically distills both semantic and feature information with consideration of both between-class discriminativeness and within-class consistency on Transformer-based detector. Between-class discriminativeness is preserved by distilling class-level semantic distance and feature distance among various categories, while within-class consistency is preserved by distilling instance-level semantic information and feature information within each category. Extensive experiments are conducted on both Pascal VOC and MS COCO benchmarks. Our method outperforms all the previous CNN-based SOTA methods under various experimental scenarios, with a remarkable mAP improvement from 36.90% to 39.80% under one-step IOD task.

中文总结: 这段话主要讨论了增量目标检测（IOD）任务需要模型持续学习新添加的数据。然而，直接在新任务上对经过良好训练的检测模型进行微调会显著降低旧任务的性能，这被称为灾难性遗忘。知识蒸馏，包括特征蒸馏和响应蒸馏，已被证明是减轻灾难性遗忘的有效方法。然而，先前关于特征蒸馏的工作在很大程度上依赖于低级特征信息，而对高级语义信息的重要性进行了不足的探索。本文讨论了IOD任务中灾难性遗忘的原因，即语义特征空间的破坏。我们提出了一种方法，在基于Transformer的检测器上动态蒸馏语义和特征信息，考虑了类间区分性和类内一致性。类间区分性通过蒸馏各类别之间的类级语义距离和特征距离来保留，而类内一致性通过蒸馏每个类别内的实例级语义信息和特征信息来保留。在Pascal VOC和MS COCO基准测试上进行了大量实验。我们的方法在各种实验场景下均优于所有先前基于CNN的最优方法，其中在一步IOD任务下，mAP的显著提升从36.90%到39.80%。

Paper3 Automated Knowledge Distillation via Monte Carlo Tree Search

摘要原文: In this paper, we present Auto-KD, the first automated search framework for optimal knowledge distillation design. Traditional distillation techniques typically require handcrafted designs by experts and extensive tuning costs for different teacher-student pairs. To address these issues, we empirically study different distillers, finding that they can be decomposed, combined, and simplified. Based on these observations, we build our uniform search space with advanced operations in transformations, distance functions, and hyperparameters components. For instance, the transformation parts are optional for global, intra-spatial, and inter-spatial operations, such as attention, mask, and multi-scale. Then, we introduce an effective search strategy based on the Monte Carlo tree search, modeling the search space as a Monte Carlo Tree (MCT) to capture the dependency among options. The MCT is updated using test loss and representation gap of student trained by candidate distillers as the reward for better exploration-exploitation balance. To accelerate the search process, we exploit offline processing without teacher inference, sparse training for student, and proxy settings based on distillation properties. In this way, our Auto-KD only needs small costs to search for optimal distillers before the distillation phase. Moreover, we expand Auto-KD for multi-layer and multi-teacher scenarios with training-free weighted factors. Our method is promising yet practical, and extensive experiments demonstrate that it generalizes well to different CNNs and Vision Transformer models and attains state-of-the-art performance across a range of vision tasks, including image classification, object detection, and semantic segmentation. Code is provided at https://github.com/lilujunai/Auto-KD.

中文总结: 在这篇论文中，我们提出了Auto-KD，这是第一个用于自动搜索最佳知识蒸馏设计的框架。传统的蒸馏技术通常需要专家手工设计，并需要针对不同的师生对进行大量调整成本。为了解决这些问题，我们对不同的蒸馏器进行了实证研究，发现它们可以进行分解、组合和简化。基于这些观察结果，我们建立了一个包含高级操作的统一搜索空间，其中包括转换、距离函数和超参数组件。例如，转换部分可以选择全局、内部空间和间隔空间操作，如注意力、掩码和多尺度。然后，我们引入了一种基于蒙特卡洛树搜索的有效搜索策略，将搜索空间建模为蒙特卡洛树（MCT），以捕捉选项之间的依赖关系。MCT根据候选蒸馏器训练的学生的测试损失和表示差异作为奖励来更新，以实现更好的探索-开发平衡。为了加速搜索过程，我们利用离线处理、学生的稀疏训练和基于蒸馏属性的代理设置。通过这种方式，我们的Auto-KD只需要在蒸馏阶段之前进行小成本的搜索以找到最佳的蒸馏器。此外，我们还将Auto-KD扩展到多层和多教师场景，使用无需训练的加权因子。我们的方法既有前景又实用，广泛的实验表明它在不同的CNN和Vision Transformer模型上具有良好的泛化性能，并在一系列视觉任务中取得了最先进的性能，包括图像分类、目标检测和语义分割。代码可在https://github.com/lilujunai/Auto-KD找到。

Paper4 Exploring Open-Vocabulary Semantic Segmentation from CLIP Vision Encoder Distillation Only

摘要原文: Semantic segmentation is a crucial task in computer vision that involves segmenting images into semantically meaningful regions at the pixel level. However, existing approaches often rely on expensive human annotations as supervision for model training, limiting their scalability to large, unlabeled datasets. To address this challenge, we present ZeroSeg, a novel method that leverages the existing pretrained vision-language (VL) model (e.g. CLIP vision encoder) to train open-vocabulary zero-shot semantic segmentation models. Although acquired extensive knowledge of visual concepts, it is non-trivial to exploit knowledge from these VL models to the task of semantic segmentation, as they are usually trained at an image level. ZeroSeg overcomes this by distilling the visual concepts learned by VL models into a set of segment tokens, each summarizing a localized region of the target image. We evaluate ZeroSeg on multiple popular segmentation benchmarks, including PASCAL VOC 2012, PASCAL Context, and COCO, in a zero-shot manner Our approach achieves state-of-the-art performance when compared to other zero-shot segmentation methods under the same training data, while also performing competitively compared to strongly supervised methods. Finally, we also demonstrated the effectiveness of ZeroSeg on open-vocabulary segmentation, through both human studies and qualitative visualizations. The code is publicly available at https://github.com/facebookresearch/ZeroSeg

中文总结: 这段话主要讨论了语义分割在计算机视觉中的重要性，以及现有方法在模型训练中通常依赖昂贵的人工标注作为监督，限制了它们在大规模未标记数据集上的可扩展性。为了解决这一挑战，作者提出了一种名为ZeroSeg的新方法，利用现有的预训练视觉-语言（VL）模型（例如CLIP视觉编码器）来训练开放词汇的零样本语义分割模型。ZeroSeg通过将VL模型学习的视觉概念提炼为一组分段标记来克服这一挑战，每个标记总结了目标图像的一个局部区域。作者在多个流行的分割基准上评估了ZeroSeg的性能，包括PASCAL VOC 2012、PASCAL Context和COCO，在零样本方式下，该方法在相同训练数据下与其他零样本分割方法相比实现了最先进的性能，同时与强监督方法相比也表现出竞争力。最后，作者还通过人类研究和定性可视化展示了ZeroSeg在开放词汇分割上的有效性。代码可在https://github.com/facebookresearch/ZeroSeg 公开获取。

Paper5 Taxonomy Adaptive Cross-Domain Adaptation in Medical Imaging via Optimization Trajectory Distillation

摘要原文: The success of automated medical image analysis depends on large-scale and expert-annotated training sets. Unsupervised domain adaptation (UDA) has been raised as a promising approach to alleviate the burden of labeled data collection. However, they generally operate under the closed-set adaptation setting assuming an identical label set between the source and target domains, which is over-restrictive in clinical practice where new classes commonly exist across datasets due to taxonomic inconsistency. While several methods have been presented to tackle both domain shifts and incoherent label sets, none of them take into account the common characteristics of the two issues and consider the learning dynamics along network training. In this work, we propose optimization trajectory distillation, a unified approach to address the two technical challenges from a new perspective. It exploits the low-rank nature of gradient space and devises a dual-stream distillation algorithm to regularize the learning dynamics of insufficiently annotated domain and classes with the external guidance obtained from reliable sources. Our approach resolves the issue of inadequate navigation along network optimization, which is the major obstacle in the taxonomy adaptive cross-domain adaptation scenario. We evaluate the proposed method extensively on several tasks towards various endpoints with clinical significance. The results demonstrate its effectiveness and improvements over previous methods.

中文总结: 这段话主要讨论了自动化医学图像分析的成功取决于大规模和专家注释的训练集。无监督域自适应（UDA）被提出作为减轻标记数据收集负担的一种有希望的方法。然而，它们通常在封闭集自适应设置下运行，假设源域和目标域之间的标签集相同，这在临床实践中是过于严格的，因为由于分类不一致，新类别通常存在于数据集之间。虽然已经提出了几种方法来解决域漂移和不一致标签集的问题，但没有一种方法考虑到这两个问题的共同特点，并考虑到网络训练过程中的学习动态。在这项工作中，我们提出了优化轨迹蒸馏，这是一种统一的方法，从新的角度解决这两个技术挑战。它利用梯度空间的低秩特性，并设计了一个双流蒸馏算法，用可靠来源获得的外部指导来规范不充分注释的域和类别的学习动态。我们的方法解决了网络优化过程中不足的导航问题，这是分类自适应跨域适应场景中的主要障碍。我们在几个具有临床意义的任务上对所提出的方法进行了广泛评估。结果表明，与先前方法相比，我们的方法具有更高的有效性和改进。

Paper6 TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance

摘要原文: In this paper, we propose a novel cross-modal distillation method, called TinyCLIP, for large-scale language-image pre-trained models. The method introduces two core techniques: affinity mimicking and weight inheritance. Affinity mimicking explores the interaction between modalities during distillation, enabling student models to mimic teachers’ behavior of learning cross-modal feature alignment in a visual-linguistic affinity space. Weight inheritance transmits the pre-trained weights from the teacher models to their student counterparts to improve distillation efficiency. Moreover, we extend the method into a multi-stage progressive distillation to mitigate the loss of informative weights during extreme compression. Comprehensive experiments demonstrate the efficacy of TinyCLIP, showing that it can reduce the size of the pre-trained CLIP ViT-B/32 by 50%, while maintaining comparable zero-shot performance. While aiming for comparable performance, distillation with weight inheritance can speed up the training by 1.4 - 7.8x compared to training from scratch. Moreover, our TinyCLIP ViT-8M/16, trained on YFCC-15M, achieves an impressive zero-shot top-1 accuracy of 41.1% on ImageNet, surpassing the original CLIP ViT-B/16 by 3.5% while utilizing only 8.9% parameters. Finally, we demonstrate the good transferability of TinyCLIP in various downstream tasks. Code and models will be open-sourced at aka.ms/tinyclip.

中文总结: 在这篇论文中，我们提出了一种新颖的跨模态蒸馏方法，称为TinyCLIP，用于大规模语言-图像预训练模型。该方法引入了两种核心技术：亲和力模仿和权重继承。亲和力模仿探索了蒸馏过程中模态之间的交互作用，使学生模型能够模仿教师在视觉-语言亲和空间中学习跨模态特征对齐的行为。权重继承将预训练模型的权重传递给它们的学生对应模型，以提高蒸馏效率。此外，我们将该方法扩展为多阶段渐进蒸馏，以减轻在极端压缩过程中丢失信息性权重的问题。全面的实验证明了TinyCLIP的有效性，显示它可以将预训练的CLIP ViT-B/32模型大小减小50％，同时保持可比的零样本性能。在追求可比性能的同时，通过权重继承进行蒸馏可以使训练速度提高1.4-7.8倍，相比从头开始训练。此外，我们的TinyCLIP ViT-8M/16，在YFCC-15M上训练，实现了在ImageNet上41.1％的零样本top-1准确率，超过原始的CLIP ViT-B/16 3.5％，同时仅利用了8.9％的参数。最后，我们展示了TinyCLIP在各种下游任务中的良好可迁移性。代码和模型将在aka.ms/tinyclip开源。

Paper7 Representation Disparity-aware Distillation for 3D Object Detection

摘要原文: In this paper, we focus on developing knowledge distillation (KD) for compact 3D detectors. We observe that off-the-shelf KD methods manifest their efficacy only when the teacher model and student counterpart share similar intermediate feature representations. This might explain why they are less effective in building extreme-compact 3D detectors where significant representation disparity arises due primarily to the intrinsic sparsity and irregularity in 3D point clouds. This paper presents a novel representation disparity-aware distillation (RDD) method to address the representation disparity issue and reduce performance gap between compact students and over-parameterized teachers. This is accomplished by building our RDD from an innovative perspective of information bottleneck (IB), which can effectively minimize the disparity of proposal region pairs from student and teacher in features and logits. Extensive experiments are performed to demonstrate the superiority of our RDD over existing KD methods. For example, our RDD increases mAP of CP-Voxel-S to 57.1% on nuScenes dataset, which even surpasses teacher performance while taking up only 42% FLOPs.

中文总结: 本文关注开发用于紧凑三维检测器的知识蒸馏（KD）。我们观察到，现成的KD方法只有在教师模型和学生模型具有相似的中间特征表示时才表现出有效性。这可能解释了为什么它们在构建极致紧凑的三维检测器时效果较差，因为由于三维点云中固有的稀疏性和不规则性，会出现显著的表示差异。本文提出了一种新颖的表示差异感知蒸馏（RDD）方法，以解决表示差异问题，并减少紧凑学生模型与过度参数化教师模型之间的性能差距。通过从信息瓶颈（IB）的创新视角构建我们的RDD，可以有效地最小化学生和教师之间特征和logits的提议区域对的差异。进行了大量实验，以展示我们的RDD相对于现有KD方法的优越性。例如，我们的RDD将CP-Voxel-S的mAP提高到了nuScenes数据集的57.1％，甚至超过了教师的性能，而仅占42％的FLOPs。

Paper8 Not All Steps are Created Equal: Selective Diffusion Distillation for Image Manipulation

摘要原文: Conditional diffusion models have demonstrated impressive performance in image manipulation tasks. The general pipeline involves adding noise to the image and then denoising it. However, this method faces a trade-off problem: adding too much noise affects the fidelity of the image while adding too little affects its editability. This largely limits their practical applicability. In this paper, we propose a novel framework, Selective Diffusion Distillation (SDD), that ensures both the fidelity and editability of images. Instead of directly editing images with a diffusion model, we train a feedforward image manipulation network under the guidance of the diffusion model. Besides, we propose an effective indicator to select the semantic-related timestep to obtain the correct semantic guidance from the diffusion model. This approach successfully avoids the dilemma caused by the diffusion process. Our extensive experiments demonstrate the advantages of our framework.

中文总结: 这段话主要内容是：条件扩散模型在图像处理任务中表现出色。一般的流程包括向图像添加噪声，然后对其进行去噪。然而，这种方法面临着一个折衷问题：添加过多的噪声会影响图像的保真度，而添加过少则会影响其可编辑性。这在很大程度上限制了它们的实际适用性。在本文中，我们提出了一种新颖的框架，选择性扩散蒸馏（SDD），以确保图像的保真度和可编辑性。我们并不直接使用扩散模型编辑图像，而是在扩散模型的指导下训练一个前馈图像处理网络。此外，我们提出了一个有效的指标来选择语义相关的时间步，以从扩散模型中获取正确的语义指导。这种方法成功地避免了扩散过程造成的困境。我们的广泛实验证明了我们框架的优势。

Paper9 Dense 2D-3D Indoor Prediction with Sound via Aligned Cross-Modal Distillation

摘要原文: Sound can convey significant information for spatial reasoning in our daily lives. To endow deep networks with such ability, we address the challenge of dense indoor prediction with sound in both 2D and 3D via cross-modal knowledge distillation. In this work, we propose a Spatial Alignment via Matching (SAM) distillation framework that elicits local correspondence between the two modalities in vision-to-audio knowledge transfer. SAM integrates audio features with visually coherent learnable spatial embeddings to resolve inconsistencies in multiple layers of a student model. Our approach does not rely on a specific input representation, allowing for flexibility in the input shapes or dimensions without performance degradation. With a newly curated benchmark named Dense Auditory Prediction of Surroundings (DAPS), we are the first to tackle dense indoor prediction of omnidirectional surroundings in both 2D and 3D with audio observations. Specifically, for audio-based depth estimation, semantic segmentation, and challenging 3D scene reconstruction, the proposed distillation framework consistently achieves state-of-the-art performance across various metrics and backbone architectures.

中文总结: 这段话主要讨论了声音在我们日常生活中传达重要信息以进行空间推理的能力。为了赋予深度网络这种能力，作者通过跨模态知识蒸馏来解决使用声音进行室内密集预测的挑战，涉及2D和3D两个维度。作者提出了一种名为SAM的空间对齐匹配蒸馏框架，通过视听知识传递中的本地对应关系来整合音频特征和视觉一致的可学习空间嵌入，以解决学生模型多个层次的不一致性。该方法不依赖于特定的输入表示，允许输入形状或维度的灵活性而不会降低性能。通过一个名为DAPS的新基准，作者首次使用音频观测来处理2D和3D中的全方位室内密集预测。具体来说，对于基于音频的深度估计、语义分割和具有挑战性的3D场景重建，所提出的蒸馏框架在各种指标和骨干架构上始终实现了最先进的性能。

Paper10 Class-relation Knowledge Distillation for Novel Class Discovery

摘要原文: We tackle the problem of novel class discovery, which aims to learn novel classes without supervision based on labeled data from known classes. A key challenge lies in transferring the knowledge in the known-class data to the learning of novel classes. Previous methods mainly focus on building a shared representation space for knowledge transfer and often ignore modeling class relations. To address this, we introduce a class relation representation for the novel classes based on the predicted class distribution of a model trained on known classes. Empirically, we find that such class relation becomes less informative during typical discovery training. To prevent such information loss, we propose a novel knowledge distillation framework, which utilizes our class-relation representation to regularize the learning of novel classes. In addition, to enable a flexible knowledge distillation scheme for each data point in novel classes, we develop a learnable weighting function for the regularization, which adaptively promotes knowledge transfer based on the semantic similarity between the novel and known classes. To validate the effectiveness and generalization of our method, we conduct extensive experiments on multiple benchmarks, including CIFAR100, Stanford Cars, CUB, and FGVC-Aircraft datasets. Our results demonstrate that the proposed method outperforms the previous state-of-the-art methods by a significant margin on almost all benchmarks.

中文总结: 这段话主要讨论了关于新类别发现的问题，即在没有监督的情况下，基于已知类别的标记数据学习新类别。其中的关键挑战在于如何将已知类别数据中的知识转移到新类别的学习中。先前的方法主要集中在构建一个共享的表示空间以进行知识转移，通常忽略了对类别关系的建模。为了解决这个问题，引入了一种基于已知类别模型预测的类别分布的类别关系表示来表示新类别。经验上发现，在典型的发现训练中，这种类别关系变得不太具信息性。为了防止这种信息丢失，提出了一种新颖的知识蒸馏框架，利用我们的类别关系表示来规范新类别的学习。此外，为了对新类别中的每个数据点实现灵活的知识蒸馏方案，开发了一个可学习的加权函数用于规范化，根据新类别与已知类别之间的语义相似性自适应地促进知识转移。为了验证我们方法的有效性和泛化性，我们在多个基准数据集上进行了大量实验，包括CIFAR100、Stanford Cars、CUB和FGVC-Aircraft数据集。我们的结果表明，所提出的方法在几乎所有基准数据集上均明显优于先前的最先进方法。

Paper11 Unsupervised 3D Perception with 2D Vision-Language Distillation for Autonomous Driving

摘要原文: Closed-set 3D perception models trained on only a pre-defined set of object categories can be inadequate for safety critical applications such as autonomous driving where new object types can be encountered after deployment. In this paper, we present a multi-modal auto labeling pipeline capable of generating amodal 3D bounding boxes and tracklets for training models on open-set categories without 3D human labels. Our pipeline exploits motion cues inherent in point cloud sequences in combination with the freely available 2D image-text pairs to identify and track all traffic participants. Compared to the recent studies in this domain, which can only provide class-agnostic auto labels limited to moving objects, our method can handle both static and moving objects in the unsupervised manner and is able to output open-vocabulary semantic labels thanks to the proposed vision-language knowledge distillation. Experiments on the Waymo Open Dataset show that our approach outperforms the prior work by significant margins on various unsupervised 3D perception tasks.

中文总结: 这篇论文介绍了一种多模态自动标注流程，能够生成非模态的3D边界框和轨迹，用于训练模型以识别自动驾驶等安全关键应用中可能遇到的新物体类型。该流程利用点云序列中固有的运动线索，结合免费提供的2D图像-文本对，识别和跟踪所有交通参与者。与该领域中最近的研究相比，这种方法不仅可以处理移动物体，还可以处理静态物体，并且能够通过提出的视觉-语言知识蒸馏输出开放词汇语义标签。在Waymo开放数据集上的实验证明，该方法在各种无监督3D感知任务上显著优于以往的工作。

Paper12 Distribution Shift Matters for Knowledge Distillation with Webly Collected Images

摘要原文: Knowledge distillation aims to learn a lightweight student network from a pre-trained teacher network. In practice, existing knowledge distillation methods are usually infeasible when the original training data is unavailable due to some privacy issues and data management considerations. Therefore, data-free knowledge distillation approaches proposed to collect training instances from the Internet. However, most of them have ignored the common distribution shift between the instances from original training data and webly collected data, affecting the reliability of the trained student network. To solve this problem, we propose a novel method dubbed “Knowledge Distillation between Different Distributions” (KD^ 3 ), which consists of three components. Specifically, we first dynamically select useful training instances from the webly collected data according to the combined predictions of teacher network and student network. Subsequently, we align both the weighted features and classifier parameters of the two networks for knowledge memorization. Meanwhile, we also build a new contrastive learning block called MixDistribution to generate perturbed data with a new distribution for instance alignment, so that the student network can further learn a distribution-invariant representation. Intensive experiments on various benchmark datasets demonstrate that our proposed KD^ 3 can outperform the state-of-the-art data-free knowledge distillation approaches.

中文总结: 知识蒸馏旨在从预训练的教师网络中学习一个轻量级的学生网络。在实践中，现有的知识蒸馏方法通常在原始训练数据由于某些隐私问题和数据管理考虑而不可用时是不可行的。因此，提出了无数据知识蒸馏方法，从互联网收集训练实例。然而，大多数方法忽视了原始训练数据和从网络收集的数据之间的常见分布偏移，影响了训练的学生网络的可靠性。为了解决这个问题，我们提出了一种名为“不同分布之间的知识蒸馏”（KD^{3）的新方法，包括三个组件。具体而言，我们首先根据教师网络和学生网络的综合预测动态选择来自网络收集数据中有用的训练实例。随后，我们对两个网络的加权特征和分类器参数进行对齐以进行知识记忆。同时，我们还构建了一个名为MixDistribution的新对比学习块，用于生成具有新分布的扰动数据以进行实例对齐，从而使学生网络能够进一步学习分布不变的表示。对各种基准数据集的大量实验表明，我们提出的KD}3方法可以胜过最先进的无数据知识蒸馏方法。

Paper13 Multi-Task Learning with Knowledge Distillation for Dense Prediction

摘要原文: While multi-task learning (MTL) has become an attractive topic, its training usually poses more difficulties than the single-task case. How to successfully apply knowledge distillation into MTL to improve training efficiency and model performance is still a challenging problem. In this paper, we introduce a new knowledge distillation procedure with an alternative match for MTL of dense prediction based on two simple design principles. First, for memory and training efficiency, we use a single strong multi-task model as a teacher during training instead of multiple teachers, as widely adopted in existing studies. Second, we employ a less sensitive Cauchy-Schwarz (CS) divergence instead of the Kullback-Leibler (KL) divergence and propose a CS distillation loss accordingly. With the less sensitive divergence, our knowledge distillation with an alternative match is applied for capturing inter-task and intra-task information between the teacher model and the student model of each task, thereby learning more “dark knowledge” for effective distillation. We conducted extensive experiments on dense prediction datasets, including NYUD-v2 and PASCAL-Context, for multiple vision tasks, such as semantic segmentation, human parts segmentation, depth estimation, surface normal estimation, and boundary detection. The results show that our proposed method decidedly improves model performance and the practical inference efficiency.

中文总结: 这段话主要讨论了多任务学习（MTL）在训练过程中通常比单任务情况更具挑战性，如何成功将知识蒸馏应用到MTL中以提高训练效率和模型性能仍然是一个具有挑战性的问题。作者介绍了一种新的知识蒸馏程序，基于两个简单的设计原则，用于基于密集预测的MTL。首先，为了内存和训练效率，作者在训练过程中使用单一强大的多任务模型作为教师，而不是像现有研究中普遍采用的多个教师。其次，作者采用了较不敏感的柯西-施瓦茨（CS）散度代替Kullback-Leibler（KL）散度，并相应地提出了CS蒸馏损失。通过使用较不敏感的散度，作者的知识蒸馏方法应用于捕获教师模型和每个任务的学生模型之间的任务间和任务内信息，从而学习更多的“暗知识”以进行有效的蒸馏。作者在包括NYUD-v2和PASCAL-Context在内的密集预测数据集上进行了大量实验，涵盖语义分割、人体部位分割、深度估计、表面法线估计和边界检测等多个视觉任务。结果表明，作者提出的方法显著提高了模型性能和实际推理效率。

Paper14 Adaptive Similarity Bootstrapping for Self-Distillation Based Representation Learning

摘要原文: Most self-supervised methods for representation learning leverage a cross-view consistency objective i.e., they maximize the representation similarity of a given image’s augmented views. Recent work NNCLR goes beyond the cross-view paradigm and uses positive pairs from different images obtained via nearest neighbor bootstrapping in a contrastive setting. We empirically show that as opposed to the contrastive learning setting which relies on negative samples, incorporating nearest neighbor bootstrapping in a self-distillation scheme can lead to a performance drop or even collapse. We scrutinize the reason for this unexpected behavior and provide a solution. We propose to adaptively bootstrap neighbors based on the estimated quality of the latent space. We report consistent improvements compared to the naive bootstrapping approach and the original baselines. Our approach leads to performance improvements for various self-distillation method/backbone combinations and standard downstream tasks. Our code is publicly available at https://github.com/tileb1/AdaSim.

中文总结: 这段话主要讨论了自监督学习中表示学习的方法。大多数自监督方法利用跨视图一致性目标，即最大化给定图像的增强视图的表示相似性。最近的工作NNCLR超越了跨视图范式，使用了通过最近邻引导获得的不同图像的正对样本，在对比设置中。我们在实证中发现，与依赖负样本的对比学习设置相反，将最近邻引导合并到自蒸馏方案中可能导致性能下降甚至崩溃。我们审查了这种意外行为的原因并提供了解决方案。我们提出根据潜在空间的估计质量自适应引导邻居。与朴素引导方法和原始基线相比，我们报告了一致的改进。我们的方法在各种自蒸馏方法/骨干组合和标准下游任务中实现了性能改进。我们的代码公开可用于https://github.com/tileb1/AdaSim。

Paper15 FerKD: Surgical Label Adaptation for Efficient Distillation

摘要原文: We present FerKD, a novel efficient knowledge distillation framework that incorporates partial soft-hard label adaptation coupled with a region-calibration mechanism. Our approach stems from the observation and intuition that standard data augmentations, such as RandomResizedCrop, tend to transform inputs into diverse conditions: easy positives, hard positives, or hard negatives. In traditional distillation frameworks, these transformed samples are utilized equally through their predictive probabilities derived from pretrained teacher models. However, merely relying on prediction values from a pretrained teacher, a common practice in prior studies, neglects the reliability of these soft label predictions. To address this, we propose a new scheme that calibrates the less-confident regions to be the context using softened hard groundtruth labels. Our approach involves the processes of hard regions mining + calibration. We demonstrate empirically that this method can dramatically improve the convergence speed and final accuracy. Additionally, we find that a consistent mixing strategy can stabilize the distributions of soft supervision, taking advantage of the soft labels. As a result, we introduce a stabilized SelfMix augmentation that weakens the variation of the mixed images and corresponding soft labels through mixing similar regions within the same image. FerKD is an intuitive and well-designed learning system that eliminates several heuristics and hyperparameters in former FKD solution. More importantly, it achieves remarkable improvement on ImageNet-1K and downstream tasks. For instance, FerKD achieves 81.2% on ImageNet-1K with ResNet-50, outperforming FKD and FunMatch by remarkable margins. Leveraging better pre-trained weights and larger architectures, our finetuned ViT-G14 even achieves 89.9%. Our code is available at https://github.com/szq0214/FKD/tree/main/FerKD.

中文总结: 本文介绍了一种新颖的高效知识蒸馏框架FerKD，该框架结合了部分软硬标签适应和区域校准机制。该方法的灵感来自于对标准数据增强方法（如RandomResizedCrop）的观察和直觉，这些方法往往将输入转换为不同的条件：易正例、难正例或难负例。在传统的蒸馏框架中，这些转换后的样本通过预训练教师模型产生的预测概率被平等利用。然而，仅依赖于预训练教师的预测值，这是以往研究中的常见做法，忽略了这些软标签预测的可靠性。为了解决这个问题，我们提出了一种新的方案，通过软化的硬标准标签来校准不太自信的区域作为上下文。我们的方法涉及硬区域挖掘+校准的过程。我们在实证上证明了这种方法可以显著提高收敛速度和最终准确性。此外，我们发现一致的混合策略可以稳定软监督的分布，充分利用软标签。因此，我们引入了一种稳定的SelfMix增强方法，通过在同一图像内混合相似区域，减弱了混合图像和相应软标签的变化。FerKD是一个直观和精心设计的学习系统，消除了以前FKD解决方案中的几个启发式和超参数。更重要的是，它在ImageNet-1K和下游任务上取得了显著的改进。例如，FerKD在ResNet-50上在ImageNet-1K上达到了81.2%，超过了FKD和FunMatch很大幅度。利用更好的预训练权重和更大的架构，我们微调的ViT-G14甚至达到了89.9%。我们的代码可以在https://github.com/szq0214/FKD/tree/main/FerKD 上找到。

Paper16 MDCS: More Diverse Experts with Consistency Self-distillation for Long-tailed Recognition

摘要原文: Recently, multi-expert methods have led to significant improvements in long-tail recognition (LTR). We summarize two aspects that need further enhancement to contribute to LTR boosting: (1) More diverse experts; (2) Lower model variance. However, the previous methods didn’t handle them well. To this end, we propose More Diverse experts with Consistency Self-distillation (MDCS) to bridge the gap left by earlier methods. Our MDCS approach consists of two core components: Diversity Loss (DL) and Consistency Self-distillation (CS). In detail, DL promotes diversity among experts by controlling their focus on different categories. To reduce the model variance, we employ KL divergence to distill the richer knowledge of weakly augmented instances for the experts’ self-distillation. In particular, we design Confident Instance Sampling (CIS) to select the correctly classified instances for CS to avoid biased/noisy knowledge. In the analysis and ablation study, we demonstrate that our method compared with previous work can effectively increase the diversity of experts, significantly reduce the variance of the model, and improve recognition accuracy. Moreover, the roles of our DL and CS are mutually reinforcing and coupled: the diversity of experts benefits from the CS, and the CS cannot achieve remarkable results without the DL. Experiments show our MDCS outperforms the state-of-the-art by 1% 2% on five popular long-tailed benchmarks, including CIFAR10-LT, CIFAR100-LT, ImageNet-LT, Places-LT, and iNaturalist 2018. The code is available at https://github.com/fistyee/MDCS.

中文总结: 最近，多专家方法在长尾识别（LTR）方面取得了显著的改进。我们总结了需要进一步加强以促进LTR提升的两个方面：（1）更多样化的专家；（2）降低模型方差。然而，先前的方法并没有很好地处理这些方面。为此，我们提出了更多样化专家与一致性自蒸馏（MDCS）方法，以填补先前方法留下的差距。我们的MDCS方法包括两个核心组成部分：多样性损失（DL）和一致性自蒸馏（CS）。具体而言，DL通过控制专家对不同类别的关注来促进专家之间的多样性。为了减少模型方差，我们采用KL散度来提炼专家自蒸馏的弱增强实例的更丰富知识。特别地，我们设计了自信实例采样（CIS）来选择正确分类的实例用于CS，以避免偏见/噪声知识。在分析和消融研究中，我们证明我们的方法相比先前的工作可以有效增加专家的多样性，显著降低模型的方差，并提高识别准确性。此外，我们的DL和CS的作用相互加强和耦合：专家的多样性受益于CS，而CS在没有DL的情况下无法取得显著的结果。实验表明，我们的MDCS在包括CIFAR10-LT、CIFAR100-LT、ImageNet-LT、Places-LT和iNaturalist 2018在内的五个热门长尾基准上优于最先进方法1%至2%。代码可在https://github.com/fistyee/MDCS获得。

Paper17 Deep Image Harmonization with Globally Guided Feature Transformation and Relation Distillation

摘要原文: Given a composite image, image harmonization aims to adjust the foreground illumination to be consistent with background. Previous methods have explored transforming foreground features to achieve competitive performance. In this work, we show that using global information to guide foreground feature transformation could achieve significant improvement. Besides, we propose to transfer the foreground-background relation from real images to composite images, which can provide intermediate supervision for the transformed encoder features. Additionally, considering the drawbacks of existing harmonization datasets, we also contribute a ccHarmony dataset which simulates the natural illumination variation. Extensive experiments on iHarmony4 and our contributed dataset demonstrate the superiority of our method. Our ccHarmony dataset is released at https://github.com/bcmi/Image-Harmonization-Dataset-ccHarmony.

中文总结: 这段话主要讲述了图像和谐化的概念和方法。图像和谐化旨在调整前景的照明，使其与背景保持一致。先前的方法探索了转换前景特征以实现竞争性性能的方式。在这项工作中，我们展示了利用全局信息来指导前景特征转换可以实现显著的改进。此外，我们提出将前景-背景关系从真实图像转移到合成图像，这可以为转换后的编码器特征提供中间监督。此外，考虑到现有和谐化数据集的缺点，我们还贡献了一个ccHarmony数据集，模拟自然照明变化。对iHarmony4和我们贡献的数据集进行了大量实验，证明了我们方法的优越性。我们的ccHarmony数据集已发布在https://github.com/bcmi/Image-Harmonization-Dataset-ccHarmony。

Paper18 Remembering Normality: Memory-guided Knowledge Distillation for Unsupervised Anomaly Detection

摘要原文: Knowledge distillation (KD) has been widely explored in unsupervised anomaly detection (AD). The student is assumed to constantly produce representations of typical patterns within trained data, named “normality”, and the representation discrepancy between the teacher and student model is identified as anomalies. However, it suffers from the “normality forgetting” issue. Trained on anomaly-free data, the student still well reconstructs anomalous representations for anomalies and is sensitive to fine patterns in normal data, which also appear in training. To mitigate this issue, we introduce a novel Memory-guided Knowledge-Distillation (MemKD) framework that adaptively modulates the normality of student features in detecting anomalies. Specifically, we first propose a normality recall memory (NR Memory) to strengthen the normality of student-generated features by recalling the stored normal information. In this sense, representations will not present anomalies and fine patterns will be well described. Subsequently, we employ a normality embedding learning strategy to promote information learning for the NR Memory. It constructs a normal exemplar set so that the NR Memory can memorize prior knowledge in anomaly-free data and later recall them from the query feature. Consequently, comprehensive experiments demonstrate that the proposed MemKD achieves promising results on five benchmarks, i.e., MVTec AD, VisA, MPDD, MVTec 3D-AD, and Eyecandies.

中文总结: 这段话主要讨论了知识蒸馏（KD）在无监督异常检测（AD）中的应用。学生模型被假设不断产生训练数据中典型模式的表示，称为“正常性”，并且通过教师和学生模型之间的表示差异来识别异常。然而，它存在“正常性遗忘”问题。尽管学生模型在无异常数据上训练，但仍能很好地重建异常表示，并对正常数据中出现的细微模式敏感，这些模式也出现在训练中。为了缓解这个问题，引入了一种新颖的记忆引导知识蒸馏（MemKD）框架，该框架可以自适应地调节学生特征的正常性以检测异常。具体来说，首先提出了一种正常性召回记忆（NR Memory），通过召回存储的正常信息来加强学生生成的特征的正常性。然后，采用正常性嵌入学习策略来促进NR Memory的信息学习，构建正常示例集，使NR Memory可以记忆无异常数据中的先前知识，并在需要时从查询特征中召回。综合实验表明，提出的MemKD在五个基准测试中取得了有希望的结果，即MVTec AD、VisA、MPDD、MVTec 3D-AD和Eyecandies。

Paper19 Bridging Cross-task Protocol Inconsistency for Distillation in Dense Object Detection

摘要原文: Knowledge distillation (KD) has shown potential for learning compact models in dense object detection. However, the commonly used softmax-based distillation ignores the absolute classification scores for individual categories. Thus, the optimum of the distillation loss does not necessarily lead to the optimal student classification scores for dense object detectors. This cross-task protocol inconsistency is critical, especially for dense object detectors, since the foreground categories are extremely imbalanced. To address the issue of protocol differences between distillation and classification, we propose a novel distillation method with cross-task consistent protocols, tailored for the dense object detection. For classification distillation, we address the cross-task protocol inconsistency problem by formulating the classification logit maps in both teacher and student models as multiple binary-classification maps and applying a binary-classification distillation loss to each map. For localization distillation, we design an IoU-based Localization Distillation Loss that is free from specific network structures and can be compared with existing localization distillation losses. Our proposed method is simple but effective, and experimental results demonstrate its superiority over existing methods.

中文总结: 这段话主要讨论了知识蒸馏（KD）在密集目标检测中学习紧凑模型的潜力。然而，常用的基于softmax的蒸馏方法忽略了各个类别的绝对分类分数。因此，蒸馏损失的最优并不一定导致密集目标检测器的最佳学生分类分数。这种跨任务协议不一致对于密集目标检测器尤为关键，因为前景类别极度不平衡。为了解决蒸馏和分类之间的协议差异问题，作者提出了一种针对密集目标检测的新型蒸馏方法，具有跨任务一致的协议。对于分类蒸馏，作者通过将教师和学生模型中的分类logit映射为多个二元分类映射，并对每个映射应用二元分类蒸馏损失来解决跨任务协议不一致问题。对于定位蒸馏，作者设计了一种基于IoU的定位蒸馏损失，该损失不受特定网络结构限制，并可与现有的定位蒸馏损失进行比较。作者提出的方法简单而有效，实验结果表明其优于现有方法。

Paper20 Multimodal Distillation for Egocentric Action Recognition

摘要原文: The focal point of egocentric video understanding is modelling hand-object interactions. Standard models, e.g. CNNs or Vision Transformers, which receive RGB frames as input perform well, however, their performance improves

further by employing additional input modalities that provide complementary cues, such as object detections, optical flow, audio, etc. The added complexity of the modality-specific modules, on the other hand, makes these models impractical for deployment. The goal of this work is to retain the performance of such a multimodal approach, while using only the RGB frames as input at inference time. We demonstrate that for egocentric action recognition on the Epic-Kitchens and the Something-Something datasets, students which are taught by multimodal teachers tend to be more accurate and better calibrated than architecturally equivalent models trained on ground truth labels in a unimodal or multimodal fashion. We further present a principled multimodal knowledge distillation framework, allowing us to deal with issues which occur when applying multimodal knowledge distillation in a naive manner. Lastly, we demonstrate the achieved reduction in computational complexity, and show that our approach maintains higher performance with the reduction of the number of input views.

中文总结: 这段话的主要内容是关于自我中心视频理解的焦点是对手-物体互动建模。标准模型，如CNN或Vision Transformers，接收RGB帧作为输入表现良好，但通过利用提供互补线索的额外输入模态（如物体检测、光流、音频等），它们的性能会进一步提高。然而，特定于模态的模块的增加复杂性使得这些模型在部署时不实用。本工作的目标是在推断时仅使用RGB帧作为输入的情况下保持这种多模态方法的性能。我们证明，在Epic-Kitchens和Something-Something数据集上的自我中心动作识别中，由多模态教师教授的学生比以单模态或多模态方式训练的等效模型更准确和更好校准。我们进一步提出了一个基于原则的多模态知识蒸馏框架，使我们能够处理在应用多模态知识蒸馏时出现的问题。最后，我们展示了在计算复杂性方面实现的减少，并表明我们的方法在减少输入视图数量的情况下保持更高的性能。

Paper21 Semi-Supervised Learning via Weight-Aware Distillation under Class Distribution Mismatch

摘要原文: Semi-Supervised Learning (SSL) under class distribution mismatch aims to tackle a challenging problem wherein unlabeled data contain lots of unknown categories unseen in the labeled ones. In such mismatch scenarios, traditional SSL suffers severe performance damage due to the harmful invasion of the instances with unknown categories into the target classifier. In this study, by strict mathematical reasoning, we reveal that the SSL error under class distribution mismatch is composed of pseudo-labeling error and invasion error, both of which jointly bound the SSL population risk. To alleviate the SSL error, we propose a robust SSL framework called Weight-Aware Distillation (WAD) that, by weights, selectively transfers knowledge beneficial to the target task from unsupervised contrastive representation to the target classifier. Specifically, WAD captures adaptive weights and high-quality pseudo-labels to target instances by exploring point mutual information (PMI) in representation space to maximize the role of unlabeled data and filter unknown categories. Theoretically, we prove that WAD has a tight upper bound of population risk under class distribution mismatch. Experimentally, extensive results demonstrate that WAD outperforms five state-of-the-art SSL approaches

and one standard baseline on two benchmark datasets, CIFAR10 and CIFAR100, and an artificial cross-dataset. The code is available at https://github.com/RUC-DWBI-ML/research/tree/main/WAD-master.

中文总结: 这段话主要内容是关于半监督学习（SSL）在类别分布不匹配情况下的应用。研究旨在解决一个具有挑战性的问题，即未标记数据中包含许多在标记数据中未见的未知类别。在这种不匹配的情况下，传统的SSL由于未知类别的实例对目标分类器的有害入侵而遭受严重的性能损害。研究通过严格的数学推理揭示了在类别分布不匹配情况下的SSL错误由伪标签错误和入侵错误组成，两者共同限制了SSL的总体风险。为了减轻SSL错误，研究提出了一种名为Weight-Aware Distillation（WAD）的稳健SSL框架，通过权重有选择地将有益于目标任务的知识从无监督对比表示转移到目标分类器。具体来说，WAD通过在表示空间中探索点互信息（PMI）来捕获自适应权重和高质量的伪标签，以最大化未标记数据的作用并过滤未知类别。理论上，研究证明了在类别分布不匹配情况下，WAD具有总体风险的严格上界。实验结果表明，WAD在两个基准数据集CIFAR10和CIFAR100以及一个人工跨数据集上优于五种最先进的SSL方法和一个标准基线。代码可在https://github.com/RUC-DWBI-ML/research/tree/main/WAD-master找到。

Paper22 DOT: A Distillation-Oriented Trainer

摘要原文: Knowledge distillation transfers knowledge from a large model to a small one via task and distillation losses. In this paper, we observe a trade-off between task and distillation losses, i.e., introducing distillation loss limits the convergence of task loss. We believe that the trade-off results from the insufficient optimization of distillation loss. The reason is: The teacher has a lower task loss than the student, and a lower distillation loss drives the student more similar to the teacher, then a better-converged task loss could be obtained. To break the trade-off, we propose the Distillation-Oriented Trainer (DOT). DOT separately considers gradients of task and distillation losses, then applies a larger momentum to distillation loss to accelerate its optimization. We empirically prove that DOT breaks the trade-off, i.e., both losses are sufficiently optimized. Extensive experiments validate the superiority of DOT. Notably, DOT achieves a +2.59% accuracy improvement on ImageNet-1k for the ResNet50-MobileNetV1 pair. Conclusively, DOT greatly benefits the student’s optimization properties in terms of loss convergence and model generalization. Code will be made publicly available.

中文总结: 这段话主要讨论了知识蒸馏（Knowledge distillation）通过任务损失和蒸馏损失将大型模型的知识转移到小型模型的过程。作者观察到任务损失和蒸馏损失之间存在一种权衡关系，即引入蒸馏损失会限制任务损失的收敛。作者认为这种权衡是由于蒸馏损失的优化不足所致。为了打破这种权衡，他们提出了面向蒸馏的训练器（Distillation-Oriented Trainer，DOT）。DOT分别考虑任务和蒸馏损失的梯度，然后对蒸馏损失应用更大的动量来加速其优化。作者通过实验证明，DOT打破了任务损失和蒸馏损失之间的权衡，使得两者都得到了充分的优化。大量实验验证了DOT的优越性。值得注意的是，DOT在ResNet50-MobileNetV1对上在ImageNet-1k数据集上实现了+2.59%的准确率提升。总的来说，DOT极大地有利于学生模型在损失收敛和模型泛化方面的优化特性。代码将公开发布。

Paper23 Bayesian Optimization Meets Self-Distillation

摘要原文: Bayesian optimization (BO) has contributed greatly to improving model performance by suggesting promising hyperparameter configurations iteratively based on observations from multiple training trials. However, only partial knowledge (i.e., the measured performances of trained models and their hyperparameter configurations) from previous trials is transferred. On the other hand, Self-Distillation (SD) only transfers partial knowledge learned by the task model itself. To fully leverage the various knowledge gained from all training trials, we propose the BOSS framework, which combines BO and SD. BOSS suggests promising hyperparameter configurations through BO and carefully selects pre-trained models from previous trials for SD, which are otherwise abandoned in the conventional BO process. BOSS achieves significantly better performance than both BO and SD in a wide range of tasks including general image classification, learning with noisy labels, semi-supervised learning, and medical image analysis tasks. Our code is available at https://github.com/sooperset/boss.

中文总结: Bayesian optimization (BO)通过基于多次训练试验观察提出有前途的超参数配置，极大地提高了模型性能。然而，仅从先前试验中转移了部分知识（即，经过训练的模型的测量性能及其超参数配置）。另一方面，自我蒸馏（SD）仅传递了任务模型本身学到的部分知识。为了充分利用所有训练试验中获得的各种知识，我们提出了BOSS框架，将BO和SD结合起来。BOSS通过BO提出有前途的超参数配置，并精心选择来自先前试验的预训练模型进行SD，这些模型在传统的BO过程中通常被放弃。在包括一般图像分类、带有噪声标签的学习、半监督学习和医学图像分析任务在内的广泛任务中，BOSS的性能明显优于BO和SD。我们的代码可在https://github.com/sooperset/boss找到。

Paper24 UniKD: Universal Knowledge Distillation for Mimicking Homogeneous or Heterogeneous Object Detectors

摘要原文: Knowledge distillation (KD) has become a standard method to boost the performance of lightweight object detectors. Most previous works are feature-based, where students mimic the features of homogeneous teacher detectors. However, distilling the knowledge from the heterogeneous teacher fails in this manner due to the serious semantic gap, which greatly limits the flexibility of KD in practical applications. Bridging this semantic gap now requires case-by-case algorithm design which is time-consuming and heavily relies on experienced adjustment. To alleviate this problem, we propose Universal Knowledge Distillation (UniKD), introducing additional decoder heads with deformable cross-attention called Adaptive Knowledge Extractor (AKE). In UniKD, AKEs are first pretrained on the teacher’s output to infuse the teacher’s content and positional knowledge into a fixed-number set of knowledge embeddings. The fixed AKEs are then attached to the student’s backbone to encourage the student to absorb the teacher’s knowledge in these knowledge embeddings. In this query-based distillation paradigm, detection-relevant information can be dynamically aggregated into a knowledge embedding set and transferred between different detectors. When the teacher model is too large for online inference, its output can be stored on disk in advance to save the computation overhead, which is more storage efficient than feature-based methods. Extensive experiments demonstrate that our UniKD can plug and play in any homogeneous or heterogeneous teacher-student pairs and significantly outperforms conventional feature-based KD.

中文总结: 这段话主要讨论了知识蒸馏（KD）已成为提升轻量级目标检测器性能的标准方法。先前的研究大多是基于特征的，其中学生模仿同质教师检测器的特征。然而，在从异构教师中蒸馏知识时，由于严重的语义差距，这种方法失败了，这严重限制了KD在实际应用中的灵活性。现在要弥合这种语义差距需要逐个算法设计，这是耗时的，并且严重依赖经验调整。为了缓解这个问题，作者提出了通用知识蒸馏（UniKD），引入了具有可变形交叉注意力的附加解码器头，称为自适应知识提取器（AKE）。在UniKD中，AKE首先在教师的输出上进行预训练，将教师的内容和位置知识注入到一组固定数量的知识嵌入中。然后，这些固定的AKE附加到学生的骨干上，以鼓励学生在这些知识嵌入中吸收教师的知识。在这种基于查询的蒸馏范式中，检测相关信息可以动态聚合到一个知识嵌入集中，并在不同的检测器之间传输。当教师模型对于在线推理来说太大时，其输出可以提前存储在磁盘上以节省计算开销，这比基于特征的方法更节省存储空间。大量实验证明，我们的UniKD可以在任何同质或异质的师生配对中即插即用，并且明显优于传统基于特征的KD方法。

Paper25 Multi-Label Knowledge Distillation

摘要原文: Existing knowledge distillation methods typically work by imparting the knowledge of output logits or intermediate feature maps from the teacher network to the student network, which is very successful in multi-class single-label learning. However, these methods can hardly be extended to the multi-label learning scenario, where each instance is associated with multiple semantic labels, because the prediction probabilities do not sum to one and feature maps of the whole example may ignore minor classes in such a scenario. In this paper, we propose a novel multi-label knowledge distillation method. On one hand, it exploits the informative semantic knowledge from the logits by dividing the multi-label learning problem into a set of binary classification problems; on the other hand, it enhances the distinctiveness of the learned feature representations by leveraging the structural information of label-wise embeddings. Experimental results on multiple benchmark datasets validate that the proposed method can avoid knowledge counteraction among labels, thus achieving superior performance against diverse comparing methods.

中文总结: 现有的知识蒸馏方法通常通过将教师网络的输出logits或中间特征映射的知识传递给学生网络来工作，这在多类单标签学习中非常成功。然而，这些方法几乎无法扩展到多标签学习场景，其中每个实例与多个语义标签相关联，因为预测概率不会总和为一，在这种情况下，整个示例的特征映射可能会忽略较小的类别。在本文中，我们提出了一种新颖的多标签知识蒸馏方法。一方面，它通过将多标签学习问题划分为一组二元分类问题，利用logits中的信息语义知识；另一方面，通过利用标签嵌入的结构信息，增强了学习到的特征表示的独特性。在多个基准数据集上的实验结果验证了所提出的方法可以避免标签之间的知识相互作用，从而在性能上优于各种比较方法。

Paper26 DistillBEV: Boosting Multi-Camera 3D Object Detection with Cross-Modal Knowledge Distillation

摘要原文: 3D perception based on the representations learned from multi-camera bird’s-eye-view (BEV) is trending as cameras are cost-effective for mass production in autonomous driving industry. However, there exists a distinct performance gap between multi-camera BEV and LiDAR based 3D object detection. One key reason is that LiDAR captures accurate depth and other geometry measurements, while it is notoriously challenging to infer such 3D information from merely image input. In this work, we propose to boost the representation learning of a multi-camera BEV based student detector by training it to imitate the features of a well-trained LiDAR based teacher detector. We propose effective balancing strategy to enforce the student to focus on learning the crucial features from the teacher, and generalize knowledge transfer to multi-scale layers with temporal fusion. We conduct extensive evaluations on multiple representative models of multi-camera BEV. Experiments reveal that our approach renders significant improvement over the student models, leading to the state-of-the-art performance on the popular benchmark nuScenes.

中文总结: 这段话主要讨论了基于多摄像头鸟瞰图（BEV）学习的3D感知正成为自动驾驶行业中的趋势，因为摄像头在大规模生产中成本效益高。然而，多摄像头BEV和基于LiDAR的3D物体检测之间存在明显的性能差距。其中一个关键原因是LiDAR可以捕捉准确的深度和其他几何测量，而仅仅从图像输入中推断这样的3D信息是极具挑战性的。作者提出通过训练多摄像头BEV学生检测器来模仿经过良好训练的LiDAR基础教师检测器的特征来增强表示学习。他们提出了有效的平衡策略，强制学生专注于从教师那里学习关键特征，并将知识传递泛化到多尺度层和时间融合。作者在多个代表性的多摄像头BEV模型上进行了广泛评估。实验证明，他们的方法显著提高了学生模型的性能，在流行的基准测试nuScenes上达到了最先进的性能水平。

Paper27 ORC: Network Group-based Knowledge Distillation using Online Role Change

摘要原文: In knowledge distillation, since a single, omnipotent teacher network cannot solve all problems, multiple teacher-based knowledge distillations have been studied recently. However, sometimes their improvements are not as good as expected because some immature teachers may transfer the false knowledge to the student. In this paper, to overcome this limitation and take the efficacy of the multiple networks, we divide the multiple networks into teacher and student groups, respectively. That is, the student group is a set of immature networks that require learning the teacher’s knowledge, while the teacher group consists of the selected networks that are capable of teaching successfully. We propose our online role change strategy where the top-ranked networks in the student group are able to promote to the teacher group at every iteration. After training the teacher group using the error samples of the student group to refine the teacher group’s knowledge, we transfer the collaborative knowledge from the teacher group to the student group successfully. We verify the superiority of the proposed method on CIFAR-10, CIFAR-100, and ImageNet which achieves high performance. We further show the generality of our method with various backbone architectures such as ResNet, WRN, VGG, Mobilenet, and Shufflenet.

中文总结: 这段话主要讨论了在知识蒸馏中，由于单个全能教师网络无法解决所有问题，最近研究了多个基于教师的知识蒸馏。然而，有时它们的改进效果并不如预期，因为一些不成熟的教师可能会向学生传递错误的知识。为了克服这一局限性并充分利用多个网络的效能，作者将多个网络分为教师组和学生组。学生组是一组需要学习教师知识的不成熟网络，而教师组由能够成功教授的选定网络组成。作者提出了在线角色变换策略，在每次迭代中，学生组中排名靠前的网络可以晋升为教师组。通过使用学生组的错误样本训练教师组来完善教师组的知识，成功地将协同知识从教师组转移到学生组。作者验证了所提出方法在CIFAR-10、CIFAR-100和ImageNet上取得了高性能。此外，作者还展示了我们方法在各种主干架构（如ResNet、WRN、VGG、Mobilenet和Shufflenet）上的普适性。

Paper28 PointDC: Unsupervised Semantic Segmentation of 3D Point Clouds via Cross-Modal Distillation and Super-Voxel Clustering

摘要原文: Semantic segmentation of point clouds usually requires exhausting efforts of human annotations, hence it attracts wide attention to a challenging topic of learning from unlabeled or weaker form of annotations. In this paper, we take the first attempt for fully unsupervised semantic segmentation of point clouds, which aims to delineate semantically meaningful objects without any form of annotations. Previous works of unsupervised pipeline on 2D images fails in this task of point clouds, due to: 1) Clustering Ambiguity caused by limited magnitude of data and imbalanced class distribution; 2) Irregularity Ambiguity caused by the irregular sparsity of point cloud. Therefore, we propose a novel framework, PointDC, which is comprised of two steps that handles the aforementioned problems respectively: Cross-Modal Distillation (CVD) and Super-Voxel Clustering (SVC). In the first stage of CVD, multi-view visual features are back-projected to the 3D space and aggregated to a unified point feature to distill the training of the point representation. In the second stage of SVC, the point features are aggregated to super-voxels and then fed to the iterative clustering process for excavating semantic classes. PointDC yields a significant improvement over the prior state-of-the-art unsupervised methods, on both the ScanNet v2 (+18.4 mIOU) and S3DIS (+11.5 mIOU) semantic segmentation benchmarks.

中文总结: 这段话主要讨论了关于点云的语义分割通常需要耗费大量人工标注的工作，因此吸引了对从未标记或较弱形式标注学习的挑战性主题的广泛关注。该论文首次尝试对点云进行完全无监督的语义分割，旨在在没有任何形式标注的情况下描绘语义上有意义的对象。先前在2D图像上的无监督流程在点云任务中失败，原因是：1) 由于数据量有限和类别分布不平衡而导致的聚类模糊性；2) 由于点云的不规则稀疏性而导致的不规则模糊性。因此，我们提出了一个新颖的框架PointDC，它由两个步骤组成，分别处理了上述问题：交叉模态蒸馏（CVD）和超体素聚类（SVC）。在CVD的第一阶段，多视图视觉特征被反投影到3D空间并聚合为统一的点特征，以蒸馏点表示的训练。在SVC的第二阶段，点特征被聚合到超体素，然后输入到迭代聚类过程中以挖掘语义类别。PointDC在ScanNet v2（+18.4 mIOU）和S3DIS（+11.5 mIOU）语义分割基准上相比先前最先进的无监督方法取得了显著改进。