2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Task names:
visual counting; few-shot counting; crowd counting; cell counting
Contents
Abstract
1. Introduction
2. Related Works
3. Few-Shot Adaptation & Matching Network
3.1. Network architecture
3.2. Training
3.3. Test-time adaptation
4. The FSC-147 Dataset
4.1. Image Collection
4.2. Image Annotation
4.3. Dataset split
4.4. Data Statistics
5. Experiments
Abstract
Existing works on visual counting primarily focus on one specific category at a time, such as people, animals, and cells. In this paper, we are interested in counting everything, that is, to count objects from any category given only a few annotated instances from that category. To this end, we pose counting as a few-shot regression task. To tackle this task, we present a novel method that takes a query image together with a few exemplar objects from the query image and predicts a density map for the presence of all objects of interest in the query image. We also present a novel adaptation strategy to adapt our network to any novel visual category at test time, using only a few exemplar objects from the novel category. We also introduce a dataset of 147 object categories containing over 6000 images that are suitable for the few-shot counting task. The images are annotated with two types of annotation, dots and bounding boxes, and they can be used for developing few-shot counting models. Experiments on this dataset show that our method outperforms several state-of-the-art object detectors and few-shot counting approaches.
(1) count objects from any category given only a few annotated instances
(2) pose counting as a few-shot regression task
1. Introduction
Humans can count objects from most of the visual object categories with ease, while current state-of-the-art computational methods [29, 48, 55] for counting can only handle a limited number of visual categories. In fact, most of the counting neural networks [4, 48] can handle a single category at a time, such as people, cars, and cells.
There are two major challenges preventing the Computer Vision community from designing systems capable of counting a large number of visual categories. First, most of the contemporary counting approaches [4, 48, 55] treat counting as a supervised regression task, requiring thousands of labeled images to learn a fully convolutional regressor that maps an input image to its corresponding density map, from which the estimated count is obtained by summing all the density values. These networks require dot annotations for millions of objects on several thousands of training images, and obtaining this type of annotation is a costly and laborious process. As a result, it is difficult to scale these contemporary counting approaches to handle a large number of visual categories. Second, there are not any large enough unconstrained counting datasets with many visual categories for the development of a general counting method. Most of the popular counting datasets [14–16, 43, 49, 55] consist of a single object category.
In this work, we address both of the above challenges. To handle the first challenge, we take a detour from the existing counting approaches which treat counting as a typical fully supervised regression task, and pose counting as a few-shot regression task, as shown in Fig. 1. In this few-shot setting, the inputs for the counting task are an image and a few examples from the same image for the object of interest, and the output is the count of object instances. The examples are provided in the form of bounding boxes around the objects of interest. In other words, our few-shot counting task deals with counting instances within an image which are similar to the exemplars from the same image. Following the convention from the few-shot classification task [9, 20, 46], the classes at test time are completely different from the ones seen during training. This makes few-shot counting very different from the typical counting task, where the training and test classes are the same. Unlike the typical counting task, where hundreds [55] or thousands [16] of labeled examples are available for training, a few-shot counting method needs to generalize to completely novel classes using only the input image and a few exemplars.
Task names: few-shot counting task, few-shot classification task
Key requirement: generalization ability
We propose a novel architecture called Few-Shot Adaptation and Matching Network (FamNet) for tackling the few-shot counting task. FamNet has two key components: 1) a feature extraction module, and 2) a density prediction module. The feature extraction module consists of a general feature extractor capable of handling a large number of visual categories. The density prediction module is designed to be agnostic to the visual category. As will be seen in our experiments, both the feature extraction and density prediction modules can already generalize to novel categories at test time. We further improve the performance of FamNet by developing a novel few-shot adaptation scheme at test time. This adaptation scheme uses the provided exemplars themselves and adapts the counting network to them with a few gradient descent updates, where the gradients are computed based on two loss functions designed to utilize the locations of the exemplars to the fullest extent. Empirically, this adaptation scheme improves the performance of FamNet.
Finally, to address the lack of a dataset for developing and evaluating the performance of few-shot counting methods, we introduce a medium-scale dataset consisting of more than 6000 images from 147 visual categories. The dataset comes with dot and bounding box annotations, and is suitable for the few-shot counting task. We name this dataset Few-Shot Counting-147 (FSC-147).
In short, the main contributions of our work are as follows. First, we pose counting as a few-shot regression task. Second, we propose a novel architecture called FamNet for handling the few-shot counting task, with a novel few-shot adaptation scheme at test time. Third, we present a novel few-shot counting dataset called FSC-147, comprising over 6000 images with 147 visual categories.
2. Related Works
In this work, we are interested in counting objects of interest in a given image with a few labeled examples from the same image. Most of the previous counting methods are for specific types of objects such as people [2, 5, 6, 23, 26, 27, 29, 32–34, 39, 42, 47, 50, 54, 55], cars [30], animals [4], cells [3, 18, 53], and fruits [31]. These methods often require training images with tens of thousands or even millions of annotated object instances. Some of these works [34] tackle the issue of costly annotation to some extent by adapting a counting network trained on a source domain to any target domain using labels for only a few informative samples from the target domain. However, even these approaches require a large amount of labeled data in the source domain.
The proposed FamNet works by exploiting the strong similarity between a query image and the provided exemplar objects in the image. To some extent, it is similar to the decade-old self-similarity work of Shechtman and Irani [41]. Also related to this idea is the recent work of Lu and Zisserman [28], who proposed a Generic Matching Network (GMN) for class-agnostic counting. GMN was pre-trained with tracking video data, and it had an explicit adaptation module to adapt the network to an image domain of interest. GMN has been shown to work well if several dozen to hundreds of examples are available for adaptation. Without adaptation, GMN does not perform very well on novel classes, as will be seen in our experiments.
Related to few-shot counting is the few-shot detection task (e.g., [8, 17]), where the objective is to learn a detector for a novel category using a few labeled examples. Few-shot counting differs from few-shot detection in two primary aspects. First, few-shot counting requires dot annotations while detection requires bounding box annotations. Second, few-shot detection methods can be affected by severe occlusion, whereas few-shot counting is tackled with a density estimation approach [22, 55], which is more robust towards occlusion than the detection-then-counting approach because the density estimation methods do not have to commit to binarized decisions at an early stage. The benefits of the density estimation approach have been empirically demonstrated in several domains, especially for crowd and cell counting.
Summary: the few-shot counting task estimates the total number of objects of a class in an image given only a few exemplars, whereas few-shot detection identifies and localizes objects of a novel class. Few-shot counting relies on dot annotations marking precise object locations, while detection relies on bounding boxes. The density estimation approach used for few-shot counting has shown advantages in handling occlusion and partial visibility.
Also related to our work is the task of few-shot image classification [9, 19, 21, 35, 40, 46]. The few-shot classification task deals with classifying images from novel categories at test time, given a few training examples from these novel test categories. The Model-Agnostic Meta-Learning (MAML) [9] based few-shot approach is relevant for our few-shot counting task; it focuses on learning parameters which can adapt to novel classes at test time by means of a few gradient descent steps. However, MAML involves computing second-order derivatives during training, which makes it expensive, even more so for the pixel-level task of density map prediction considered in our paper. Drawing inspiration from these works, we propose a novel adaptation scheme which utilizes the exemplars available at test time and performs a few steps of gradient descent in order to adapt FamNet to any novel category. Unlike MAML, our training scheme does not require higher-order gradients at training time. We compare our approach with MAML, and empirically show that it leads to better performance and is also much faster to train.
- The goal of few-shot image classification is to classify images from novel categories given only a few training examples.
- MAML is a few-shot learning method that adapts quickly to novel classes through a few gradient descent steps, but it is computationally expensive because it requires second-order derivatives.
- The authors propose a new adaptation scheme for the few-shot counting task, which uses a few exemplars at test time to adapt to novel classes via gradient descent.
- Compared with MAML, their method needs no higher-order gradients at training time, which lowers the training cost, speeds up training, and empirically yields better performance.
3. Few-Shot Adaptation & Matching Network
In this section, we describe the proposed FamNet for tackling the few-shot counting task.
Task: few-shot counting
Method: FamNet
3.1. Network architecture
Fig. 2 depicts the pipeline of FamNet. The input to the network is an image and a few exemplar bounding boxes depicting the object to be counted from the same image. The output of the network is the predicted density map, and the count for the object of interest is obtained by summing over all density values.
Notes:
- FamNet is a neural network for counting objects in images.
- Its input is an image together with a few exemplar bounding boxes from that image, marking the objects to be counted.
- From the image and the boxes, FamNet produces a predicted density map describing how densely the objects are distributed over the image.
- Summing all values of the density map gives the total count of the objects of interest. This approach copes well with occlusion and partial visibility, since it never has to commit to binarized decisions at an early stage.
Figure 2: The Few-Shot Adaptation & Matching Network takes a query image and a few bounding boxes of the objects of interest as input, and predicts a density map. The count is obtained by summing over all values of the density map. The adaptation loss is computed from the bounding box information, and the gradients from this loss are used to update the parameters of the density prediction module. The adaptation loss is used only at test time.
- The Few-Shot Adaptation & Matching Network is an architecture designed for the few-shot counting task.
- Its input is a query image and a few bounding boxes marking the objects to be counted.
- Its output is a predicted density map representing the spatial distribution of the objects in the image.
- Summing all pixel values of the density map yields the total count of the objects of interest.
- The adaptation loss (L_Adapt) is used at test time and is computed from the bounding box information; its purpose is to let the network quickly adapt to objects of a novel category.
- The gradients of the adaptation loss update the parameters of the density prediction module, improving predictions for the novel category.
- Importantly, the adaptation loss is used only at test time, not during training: the network adjusts itself at test time using the few provided exemplars.
FamNet consists of two key modules: 1) a multi-scale feature extraction module, and 2) a density prediction module. We design both of these modules so that they can handle novel categories at test time. We use an ImageNet-pretrained network [12] for the feature extraction, since such networks can handle a broad range of visual categories. The density prediction module is designed to be agnostic to the visual categories. The multi-scale feature extraction module consists of the first four blocks from a pre-trained ResNet-50 backbone [12] (the parameters of these blocks are frozen during training). We represent an image by the convolutional feature maps at the third and fourth blocks. We also obtain the multi-scale features for an exemplar by performing ROI pooling on the convolutional feature maps from the third and fourth ResNet-50 blocks.
Notes:
- FamNet is a network designed for the few-shot counting task, built from two main modules.
- Multi-scale feature extraction module: extracts features from the input image. To cover a broad range of visual categories, it uses a network pre-trained on ImageNet; specifically, the first four blocks of ResNet-50, whose parameters are frozen during training.
- Density prediction module: independent of any specific visual category, so it can generalize to unseen categories when predicting density maps.
- Feature representation: the image is represented by the convolutional feature maps of the third and fourth blocks, which capture features at different levels.
- Multi-scale exemplar features: ROI pooling on the feature maps of the third and fourth ResNet-50 blocks yields exemplar features at multiple scales, which supports the subsequent density prediction.
- FamNet can thus handle novel visual categories at test time, even those unseen during training, by combining the generalization ability of the pre-trained backbone with a category-agnostic density prediction module.
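To make the module structure concrete, here is a minimal PyTorch sketch of the multi-scale feature extraction described above. It is a sketch under assumptions, not the authors' code: torchvision's `layer3`/`layer4` are taken to be the third and fourth blocks, `roi_align` stands in for the ROI pooling, and the 7×7 output size is an illustrative choice.

```python
import torch
import torchvision
from torchvision.ops import roi_align

class MultiScaleFeatureExtractor(torch.nn.Module):
    """First four blocks of an ImageNet-pretrained ResNet-50, frozen."""

    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
        self.stem = torch.nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu,
                                        resnet.maxpool, resnet.layer1, resnet.layer2)
        self.block3, self.block4 = resnet.layer3, resnet.layer4
        for p in self.parameters():          # backbone is frozen during training
            p.requires_grad = False

    def forward(self, image, exemplar_boxes):
        # image: (1, 3, H, W); exemplar_boxes: (K, 4) as (x1, y1, x2, y2) in image coords
        f3 = self.block3(self.stem(image))   # third-block feature map
        f4 = self.block4(f3)                 # fourth-block feature map
        # ROI pooling on both maps gives multi-scale exemplar features.
        e3 = roi_align(f3, [exemplar_boxes], output_size=(7, 7),
                       spatial_scale=f3.shape[-1] / image.shape[-1])
        e4 = roi_align(f4, [exemplar_boxes], output_size=(7, 7),
                       spatial_scale=f4.shape[-1] / image.shape[-1])
        return (f3, f4), (e3, e4)
```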
To make the density prediction module agnostic to the visual categories, we do not use the features obtained from the feature extraction module directly for density prediction. Instead, we only use the correlation map between the exemplar features and image features as the input to the density prediction module. To account for the objects of interest at different scales, we scale the exemplar features to different scales, and correlate the scaled exemplar features with the image features to obtain multiple correlation maps, one for each scale. For all of our experiments, we use the scales of 0.9 and 1.1, along with the original scale. The correlation maps are concatenated and fed into the density prediction module. The density prediction module consists of five convolution blocks and three upsampling layers placed after the first, second, and third convolution layers. The last layer is a 1×1 convolution layer, which predicts the 2D density map. The size of the predicted density map is the same as the size of the input image.
Notes:
- The density prediction module produces a density map of the objects independently of any specific visual category.
- To achieve this, it does not consume the extracted features directly; its input is the correlation map between the exemplar features and the image features.
- Scaling the exemplar features to multiple scales and correlating each scaled version with the image features yields one correlation map per scale, which helps capture objects of different sizes.
- In the experiments, scales of 0.9 and 1.1 are used in addition to the original scale.
- The correlation maps from the different scales are concatenated to form the input of the density prediction module.
- The module consists of several convolution blocks and upsampling layers, which extract features from the correlation maps and gradually restore the spatial resolution.
- Finally, a 1×1 convolution layer predicts the 2D density map, whose size matches the input image; each pixel value represents the object density at that location.
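The correlate-then-predict step might look as follows in PyTorch. This is a sketch, not the authors' implementation: the channel widths are illustrative, and only the scales (0.9, 1.0, 1.1), the five convolution layers with three upsampling stages, and the final 1×1 convolution follow the description above.

```python
import torch
import torch.nn.functional as F

def correlation_maps(image_feat, exemplar_feat, scales=(0.9, 1.0, 1.1)):
    # image_feat: (1, C, H, W); exemplar_feat: (1, C, h, w) for one exemplar
    maps = []
    for s in scales:
        k = F.interpolate(exemplar_feat, scale_factor=s, mode="bilinear",
                          align_corners=False)
        # Correlate the scaled exemplar with the image: conv2d with the
        # exemplar features as the kernel gives one response map per scale.
        corr = F.conv2d(image_feat, k, padding=(k.shape[-2] // 2, k.shape[-1] // 2))
        maps.append(F.interpolate(corr, size=image_feat.shape[-2:],
                                  mode="bilinear", align_corners=False))
    return torch.cat(maps, dim=1)             # (1, num_scales, H, W)

class DensityPredictor(torch.nn.Module):
    def __init__(self, in_ch=3):
        super().__init__()
        conv = lambda ci, co: torch.nn.Sequential(
            torch.nn.Conv2d(ci, co, 3, padding=1), torch.nn.ReLU())
        self.conv1, self.conv2 = conv(in_ch, 196), conv(196, 128)
        self.conv3, self.conv4 = conv(128, 64), conv(64, 32)
        self.out = torch.nn.Conv2d(32, 1, 1)   # 1x1 conv predicts the density map
        self.up = torch.nn.Upsample(scale_factor=2, mode="bilinear",
                                    align_corners=False)

    def forward(self, corr):
        x = self.up(self.conv1(corr))          # upsampling after the first,
        x = self.up(self.conv2(x))             # second,
        x = self.up(self.conv3(x))             # and third convolution layers
        return torch.relu(self.out(self.conv4(x)))  # non-negative density values
```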
3.2. Training
We train FamNet using the training images of our dataset. Each training image contains multiple objects of interest, but only the exemplar objects are annotated with bounding boxes; the majority of the objects have only dot annotations. It is, however, difficult to train a density estimation network with a loss defined directly on the dot annotations. Most existing works on visual counting, especially crowd counting [55], convolve the dot annotation map with a Gaussian window of a fixed size, typically 15×15, to generate a smoothed target density map for training the density estimation network.
[me] 6000+ images, 147 categories, over 3000 training images; in each image only a few exemplar objects are annotated with bounding boxes, while most objects have only dot annotations.
[A]
- FamNet is trained on images from this dataset, each containing multiple objects of interest.
- In these training images, only a few exemplar objects carry bounding-box annotations; the remaining objects have only dot annotations.
- Training a density estimation network directly on the dot annotations is challenging, because the form of a dot map differs from that of a density map.
- To address this, existing visual counting methods, especially in crowd counting, convolve the dot annotation map with a fixed-size Gaussian window; the Gaussian smooths the dots into a continuous density map that can serve as the training target.
- This Gaussian smoothing turns discrete dot annotations into a smooth, continuous density map, which better guides the network to estimate object density from image features.
Our dataset consists of 147 different categories, where there is huge variation in the sizes of the objects. Therefore, to generate the target density map, we use Gaussian smoothing with adaptive window size. First, we use dot annotations to estimate the size of the objects. Given the dot annotation map, where each dot is at an approximate center of an object, we compute the distance between each dot and its nearest neighbor, and average these distances for all the dots in the image. This average distance is used as the size of the Gaussian window to generate the target density map. The standard deviation of the Gaussian is set to be a quarter of the window size.
[me] Do all the images have dot annotations?
[A] Gaussian window size and standard deviation:
- The dataset has 147 categories, with large variation in object size.
- To generate the target density map, Gaussian smoothing with an adaptive window size is used, so the smoothing matches the object sizes.
- In the dot annotation map, each dot marks the approximate center of an object; the distance from each dot to its nearest neighbor is used to estimate object size.
- The average of these nearest-neighbor distances over all dots in the image determines the size of the Gaussian window used to generate the target density map.
- The standard deviation of the Gaussian is set to a quarter of the window size, which preserves the local structure of the density map and limits how far the Gaussian for a small object spreads into its surroundings.
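A minimal sketch of this target-generation step, assuming SciPy for the nearest-neighbor search and the Gaussian filtering:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.ndimage import gaussian_filter

def target_density_map(dots, height, width):
    # dots: (N, 2) array of (x, y) object centers from the dot annotations.
    density = np.zeros((height, width), dtype=np.float32)
    for x, y in dots:
        density[int(y), int(x)] = 1.0          # dot annotation map
    # Average distance between each dot and its nearest neighbor.
    dists, _ = cKDTree(dots).query(dots, k=2)  # k=2: first neighbor is the dot itself
    window = dists[:, 1].mean()                # adaptive Gaussian window size
    sigma = window / 4.0                       # std = quarter of the window size
    return gaussian_filter(density, sigma=sigma)  # sums (approximately) to N
```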
To train FamNet, we minimize the mean squared error between the predicted density map and the ground truth density map. We use the Adam optimizer with a learning rate of 10^−5 and a batch size of 1. We resize each image to a fixed height of 384 pixels, and the width is adjusted accordingly to preserve the aspect ratio of the original image.
[A] Analysis:
- The training objective of FamNet is to reduce the difference between the predicted density map and the ground-truth density map, by minimizing the mean squared error (MSE) between the two.
- The Adam optimizer is used: a gradient-based method with adaptive per-parameter learning rates.
- The learning rate of 10^−5 is small, which helps the model converge stably during training.
- A batch size of 1 means each iteration updates the model with a single image.
- To standardize the input size, each image is resized to a height of 384 pixels, with the width scaled proportionally to preserve the original aspect ratio.
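Put together, one training iteration under the stated settings might look like the following sketch; `famnet`, `train_loader`, and the exemplar-box rescaling are assumed pieces not specified in the notes.

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(
    [p for p in famnet.parameters() if p.requires_grad], lr=1e-5)

for image, boxes, target in train_loader:      # batch size 1
    h, w = image.shape[-2:]
    new_w = int(round(w * 384 / h))            # fixed height 384, keep aspect ratio
    image = F.interpolate(image, size=(384, new_w), mode="bilinear",
                          align_corners=False)
    target = F.interpolate(target, size=(384, new_w), mode="bilinear",
                           align_corners=False) * (h * w) / (384 * new_w)
    # (rescaling by (h*w)/(384*new_w) keeps the total count unchanged;
    #  the exemplar boxes would be rescaled by the same factors, omitted here)

    pred = famnet(image, boxes)                # predicted density map
    loss = F.mse_loss(pred, target)            # MSE against the target density map
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```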
3.3. Test-time adaptation
Since the two modules of FamNet do not depend on any specific object category, the trained FamNet can already be used for counting objects from novel categories given a few exemplars. In this section, we describe a novel approach to adapt the network to the exemplars, further improving the accuracy of the estimated count. The key idea is to harness the information provided by the locations of the exemplar bounding boxes. So far, we have only used the bounding boxes of the exemplars to extract their appearance features, and we have not utilized their locations to the full extent.
[A]
- FamNet's two core modules, the multi-scale feature extraction module and the density prediction module, do not depend on specific object categories, so FamNet can count objects of novel categories given only a few exemplars.
- Even on categories never seen during training, FamNet can perform the counting task as long as a few exemplars are provided, showing good generalization.
- This section proposes a novel adaptation method whose goal is to further exploit the location information of the exemplar bounding boxes to improve counting accuracy.
- Until now, the bounding boxes have mainly been used to extract the exemplars' appearance features; their location information has not been fully utilized.
- With this adaptation method, FamNet can adjust itself more precisely at test time to objects of a novel category, especially when only a few exemplars are available, improving the accuracy of the count.
[me] Exemplar bounding boxes carry (1) appearance features and (2) location information.
Let B denote the set of provided exemplar bounding boxes. For a bounding box b ∈ B, let Z_b be the crop from the density map Z at location b. To harness the extra information provided by the locations of the bounding boxes B, we propose to consider the following two losses.
- Here the authors introduce a way to exploit the location information of the exemplar bounding boxes, starting by defining the set B of exemplar boxes.
- For each box b in B, the corresponding crop of the predicted density map Z is denoted Z_b.
- Two loss functions are then defined to make full use of the box locations; they are used to adapt the model so that it better respects the exemplar locations.
- The aim is to make the density prediction, and hence the count on novel categories, more accurate by taking the box locations into account; the design of these losses is a key contribution for the few-shot setting.
Min-Count Loss. For each exemplar bounding box b, the sum of the density values within Z_b should be at least one. This is because the predicted count is taken as the sum of predicted density values, and there is at least one object at the location specified by the bounding box b. However, we cannot assert that the sum of the density values within Z_b is exactly one, due to possible overlap between b and other nearby objects of interest. This observation leads to an inequality constraint: ||Z_b||_1 ≥ 1, where ||Z_b||_1 denotes the sum of all the values in Z_b. Given the predicted density map and the set of provided bounding boxes for the exemplars, we define the following Min-Count loss to quantify the amount of constraint violation:

L_MinCount = Σ_{b ∈ B} max(0, 1 − ||Z_b||_1).
- The Min-Count loss is used during adaptation to ensure that the predicted density map contains at least one object inside each exemplar bounding box.
- For each exemplar box b, the crop of the predicted density map Z at b is denoted Z_b; the density values in Z_b should sum to at least 1, reflecting that at least one object is present in the box.
- Because objects may overlap, the sum of the values in Z_b cannot be required to equal exactly 1; hence the inequality constraint ||Z_b||_1 ≥ 1, where ||Z_b||_1 is the sum of all values in Z_b.
- This design lets the model account for possible overlap between boxes while still ensuring each exemplar box is predicted to contain at least one object.
- By quantifying violations of this inequality constraint, the Min-Count loss guides the model toward more accurate density predictions.
Formula notes: the expression above measures, for each exemplar box, how far the predicted density sum inside the box falls short of the requirement that at least one object is present; boxes already satisfying the constraint contribute zero.
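As a sketch, the hinge on the per-box count can be written directly from the constraint ||Z_b||_1 ≥ 1; box coordinates are assumed here to be integer pixel indices.

```python
import torch

def min_count_loss(density, boxes):
    # density: (H, W) predicted density map; boxes: iterable of (x1, y1, x2, y2)
    loss = density.new_zeros(())
    for x1, y1, x2, y2 in boxes:
        z_b = density[y1:y2, x1:x2].sum()             # ||Z_b||_1 (density is non-negative)
        loss = loss + torch.clamp(1.0 - z_b, min=0.0)  # violation of ||Z_b||_1 >= 1
    return loss
```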
Perturbation Loss. Our second loss to harness the positional information provided by the exemplar bounding boxes is inspired by the success of tracking algorithms based on correlation filters [13, 44, 51]. Given the bounding box of an object to track, these algorithms learn a filter that has the highest response at the exact location of the bounding box and lower responses at perturbed locations. The correlation filter can be learned by optimizing a regression function that maps a perturbed location to a target response value, where the target response value decreases exponentially as the perturbation distance increases, usually specified by a Gaussian distribution.
- The Perturbation loss is the second loss that exploits the location information of the exemplar bounding boxes to improve counting accuracy.
- Its design is inspired by correlation-filter-based tracking algorithms, which have been highly successful in object tracking.
- In those algorithms, given the bounding box of a target, a filter is learned whose response is highest at the exact box location and drops off at perturbed locations away from it.
- The correlation filter is learned by optimizing a regression function that maps a perturbed location to a target response value; the target response decays exponentially with the perturbation distance, typically following a Gaussian.
- The model is thus trained to respond strongly at the exact target location and weakly at nearby perturbed locations, making it both sensitive and robust to the target position.
- The Perturbation loss helps the model count and localize accurately even when objects shift or are partially occluded.
In our case, the predicted density map Z is essentially the correlation response map between the exemplars and the image. Therefore, the density values around the location of an exemplar should ideally look like a Gaussian. Let G_{h×w} be the 2D Gaussian window of size h×w. We define the perturbation loss as follows:

L_Per = Σ_{b ∈ B} ||Z_b − G_{h×w}||_2^2.
Notes: the Perturbation loss is computed during test-time adaptation to push the predicted density values around each exemplar toward a Gaussian-shaped response, matching the ideal correlation output.
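A corresponding sketch of the perturbation loss; the choice of sigma = window size / 4 for G_{h×w} is an assumption carried over from the training-target convention, not stated here.

```python
import torch

def gaussian_window(h, w, device):
    # Separable 2D Gaussian of size h x w; sigma = quarter of the window size (assumed).
    ys = torch.arange(h, device=device, dtype=torch.float32) - (h - 1) / 2
    xs = torch.arange(w, device=device, dtype=torch.float32) - (w - 1) / 2
    gy = torch.exp(-ys**2 / (2 * (h / 4) ** 2))
    gx = torch.exp(-xs**2 / (2 * (w / 4) ** 2))
    return torch.outer(gy, gx)

def perturbation_loss(density, boxes):
    loss = density.new_zeros(())
    for x1, y1, x2, y2 in boxes:
        z_b = density[y1:y2, x1:x2]
        g = gaussian_window(*z_b.shape, density.device)
        loss = loss + ((z_b - g) ** 2).sum()   # ||Z_b - G_{hxw}||_2^2
    return loss
```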
The combined adaptation loss. The loss used for test-time adaptation is a weighted combination of the Min-Count loss and the Perturbation loss. The final test-time adaptation loss is given as

L_Adapt = λ1 · L_MinCount + λ2 · L_Per,

where λ1 and λ2 are scalar weights for the two terms.
- In machine learning, and particularly in few-shot adaptation settings, an adaptation loss is a loss function used to fine-tune the model at test time.
- Such a loss is typically composed of two or more losses, each targeting a different aspect of model performance.
- The Min-Count loss ensures that the model predicts at least a minimal count (at least 1) inside each exemplar box, so that no exemplar object is ignored.
- The Perturbation loss encourages a Gaussian-like density response at each exemplar location and suppresses responses in the surrounding area, improving localization precision.
- Combining the two yields a single adaptation loss used to further optimize the model parameters at test time, so that the model better fits the new data or task.
- The weighted combination allows each loss's contribution to be tuned; the weights can be chosen based on the task or via cross-validation.
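The full test-time adaptation can then be a few gradient steps on L_Adapt, updating only the density prediction module. The step count, learning rate, and weights below are illustrative placeholders; `famnet.density_predictor`, `min_count_loss`, and `perturbation_loss` refer to the sketches above.

```python
import torch

def adapt_and_count(famnet, image, boxes, steps=100, lr=1e-7,
                    lam1=1e-9, lam2=1e-4):      # placeholder hyperparameters
    optimizer = torch.optim.SGD(famnet.density_predictor.parameters(), lr=lr)
    for _ in range(steps):
        density = famnet(image, boxes)[0, 0]    # (H, W) predicted density map
        loss = lam1 * min_count_loss(density, boxes) \
             + lam2 * perturbation_loss(density, boxes)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        density = famnet(image, boxes)[0, 0]
    return density.sum().item()                 # count = sum of density values
```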
4. The FSC-147 Dataset
To train FamNet, we need a dataset suitable for the few-shot counting task, consisting of many visual categories. Unfortunately, existing counting datasets are mostly dedicated to specific object categories such as people, cars, and cells. Meanwhile, existing multi-class datasets do not contain many images that are suitable for visual counting. For example, although some images from the COCO dataset [25] contain multiple instances from the same object category, most of the images do not satisfy the conditions of our intended applications due to the small number of object instances or the huge variation in pose and appearance of the object instances in each image.
Since there was no dataset that was large and diverse enough for our purpose, we collected and annotated images ourselves. Our dataset consists of 6135 images across a diverse set of 147 object categories, from kitchen utensils and office stationery to vehicles and animals. The object count in our dataset varies widely, from 7 to 3731 objects, with an average count of 56 objects per image. In each image, each object instance is annotated with a dot at its approximate center. In addition, three object instances are selected randomly as exemplar instances; these exemplars are also annotated with axis-aligned bounding boxes. In the following subsections, we will describe how the data was collected and annotated. We will also report the detailed statistics and how the data was split into disjoint training, validation, and testing sets.
[me] So every image has dot annotations, but only three randomly chosen objects per image have bounding-box annotations?
[A]
- The dataset contains 6135 images covering 147 visual categories, including kitchen utensils, office stationery, vehicles, and animals.
- The number of objects per image varies widely, from 7 to 3731, with an average of 56 objects per image.
- In the annotations, every object instance is marked with a dot at its approximate center.
- To support training and evaluating few-shot counting methods, three object instances per image are randomly selected as exemplars and annotated with axis-aligned bounding boxes.
- The following subsections describe how the data was collected and annotated, report detailed statistics, and explain how the data was split into disjoint training, validation, and test sets.
4.1. Image Collection
To obtain the set of 6135 images for our dataset, we started with a set of candidate images obtained by keyword searches. Subsequently, we performed manual inspection to filter out images that do not satisfy our predefined conditions as described below.
Image retrieval. We started with a list of object categories, and collected 300–3000 candidate images for each category by scraping the web. We used the Flickr, Google, and Bing search engines with open-source image scrapers [7, 45]. We added adjectives such as many, multiple, lots of, and stack of in front of the category names to create the search query keywords.
[A]
- In the image retrieval stage, the authors started from a list of object categories.
- For each category, 300–3000 candidate images were collected by scraping Flickr, Google, and Bing with open-source image scrapers.
- To improve the searches, adjectives such as "many", "multiple", "lots of", and "stack of" were prepended to the category names, so that the search engines would return images containing many instances of the target object.
Manual verification and filtering. We manually inspected the candidate images and only kept the suitable ones satisfying the following criteria:
1. High image quality: The resolution should be high enough to easily differentiate between objects.
2. Large enough object count: The number of objects of interest should be at least 7. We are more interested in counting a large number of objects, since humans do not need help counting a small number of objects.
3. Appearance similarity: We selected images where object instances have somewhat similar poses, texture, and appearance.
4. No severe occlusion: In most cases, we removed candidate images where severe occlusion prevents humans from accurately counting the objects.
4.2. Image Annotation
Images in the dataset were annotated by a group of annotators using the OpenCV Image and Video Annotation Tool [1]. Two types of annotation were collected for each image, dots and bounding boxes, as illustrated in Fig. 4. For images containing multiple categories, we picked only one of the categories. Each object instance in an image was marked with a dot at its approximate center. In case of occlusion, the occluded instance was only counted and annotated if the amount of occlusion was less than 90%. For each image, we arbitrarily chose three objects as exemplar instances and we drew axis-aligned bounding boxes for those instances.
4.3. Dataset split
We divided the dataset into train, validation, and test sets such that they do not share any object category. We randomly selected 89 object categories for the train set, and 29 categories each for the validation and test sets. The train, validation, and test sets consist of 3659, 1286 and 1190 images respectively.
4.4. Data Statistics
The dataset contains a total of 6135 images. The average height and width of the images are 774 and 938 pixels, respectively. The average number of objects per image is 56, and the total number of objects is 343,818. The minimum and maximum numbers of objects for one image are 7 and 3701, respectively. The three categories with the highest number of objects per image are: Lego (303 objects/image), Brick (271), and Marker (247). The three categories with the lowest number of objects per image are: Supermarket shelf (8 objects/image), Meat Skewer (8), and Oyster (11). Fig. 3b is a histogram plot for the number of images in several ranges of object count.
Table 1: Comparison of FamNet with two simple baselines (Mean, Median) and four stronger baselines (the Feature Reweighting (FR) few-shot detector, the FSOD few-shot detector, GMN, and MAML), all adapted and trained for few-shot counting. FamNet achieves the lowest MAE (mean absolute error) and RMSE (root mean squared error) on both the validation and test sets.
Table 2: Comparison of FamNet with pre-trained object detectors, counting objects from the categories for which pre-trained detectors are available.
Table 2 compares several methods on two subsets, Val-COCO and Test-COCO (COCO is a widely used object detection dataset), reporting each method's MAE (mean absolute error) and RMSE (root mean squared error) on both.
- Faster R-CNN: a deep convolutional network architecture for object detection, known for its speed and accuracy.
- RetinaNet: an object detector built on a feature pyramid network (FPN) and focal loss, designed to improve detection of small objects.
- Mask R-CNN: extends Faster R-CNN with an instance segmentation branch, performing detection and mask prediction jointly.
- FamNet (Proposed): the method proposed in this paper, applied to the counting task.
The numbers show how each method fares under the two error metrics:
- MAE: on Val-COCO, FamNet's MAE of 39.82 is lower than those of Faster R-CNN, RetinaNet, and Mask R-CNN. On Test-COCO, FamNet's MAE of 22.76 is the lowest of all methods.
- RMSE: on Val-COCO, FamNet's RMSE of 108.13 is lower than those of Faster R-CNN and RetinaNet, but slightly higher than Mask R-CNN's. On Test-COCO, FamNet's RMSE of 45.92 is again the lowest.
Overall, FamNet outperforms the pre-trained detectors (Faster R-CNN, RetinaNet, Mask R-CNN) on both subsets; its lowest MAE and RMSE on Test-COCO in particular indicate accurate counting and good generalization to the test categories.
从表格中可以看出,FamNet 在两个数据集上的性能都优于传统的对象检测方法(Faster R-CNN、RetinaNet、Mask R-CNN)。这表明 FamNet 在对象计数任务上具有较高的准确性,尤其是在 Test-COCO 集上,FamNet 的 MAE 和 RMSE 都是最低的,显示出其在新数据集上的泛化能力。