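✅ [Literature Roundup] A Read-Through of Object Counting Papers

A blogger well worth following: Tags - Zheng Zhijie's personal website (0809zheng.github.io)

Object Counting - Zheng Zhijie's personal website (0809zheng.github.io)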

2. (CVPR 2024) DAVE – A Detect-and-Verify Paradigm for Low-Shot Counting

✔️3. (AAAI 2024) Vision Transformer Off-the-Shelf: A Surprising Baseline for Few-Shot Class-Agnostic Counting

✔️4. Learning Spatial Similarity Distribution for Few-shot Object Counting

⭐️ 5. LOCA

✔️6. Semantic Generative Augmentations for Few-Shot Counting

✔️7. CounTR: Transformer-based Generalised Visual Counting

8. Semantic Generative Augmentations for Few-Shot Counting

9. Scale-Prior Deformable Convolution for Exemplar-Guided Class-Agnostic Counting

10. Few-shot Object Counting with Similarity-Aware Feature Enhancement

paper: 2407.04619v1 (arxiv.org)

code: CountGD: Multi-Modal Open-World Counting (ox.ac.uk)

niki-amini-naieni/CountGD: Includes the code for training and testing the CountGD model from the paper CountGD: Multi-Modal Open-World Counting. (github.com)

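The goal of this paper is to improve the generality and accuracy of open-vocabulary object counting in images. To improve generality, we repurpose an open-vocabulary detection foundation model (GroundingDINO) for the counting task and extend it with modules that allow the target object to be specified by visual exemplars. In turn, these new capabilities, namely being able to specify the target object through multiple modalities (text and exemplars), lead to an improvement in counting accuracy. We make three contributions: first, we introduce COUNTGD, the first open-world counting model in which the prompt can be specified by a text description, by visual exemplars, or by both; second, we show that the model significantly improves on the state of the art on multiple counting benchmarks: when using text only, COUNTGD is comparable to or better than all previous text-only works, and when using both text and visual exemplars, it outperforms all previous models; third, we carry out a preliminary study of the different interactions between text and visual exemplar prompts, including cases where they reinforce each other and cases where one restricts the other. The code and an app for testing the model are available.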

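Figure 1: COUNTGD can use visual exemplars and a text prompt together to produce highly accurate object counts (a), and it also seamlessly supports counting with a text query only or with visual exemplars only (b). Multi-modal visual exemplar and text queries bring extra flexibility to the open-world counting task, such as using a short phrase (c) or adding extra constraints (the words "left" or "right") to select a subset of the objects (d). The examples are taken from the FSC-147 [39] and CountBench [36] test sets. Visual exemplars are shown as yellow boxes. Panel (d) shows the model's predicted confidence map, where higher colour intensity indicates higher confidence.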

In summary, we make the following three contributions: First, we introduce COUNTGD, the first open-world object counting model that accepts either text or visual exemplars or both simultaneously, in a single-stage architecture; Second, we evaluate the model on multiple standard counting benchmarks, including FSC-147 [39], CARPK [18] and CountBench [36], and show that COUNTGD significantly improves on the state-of-the-art performance by specifying the target object using both exemplars and text. It also meets or improves on the state-of-the-art for text-only approaches when trained and evaluated using text-only; Third, we investigate how the text can be used to refine the visual information provided by the exemplar, for example by filtering on color or relative position in the image, to specify a sub-set of the objects to count. In addition we make two minor improvements to the inference stage: one that addresses the problem of double counting due to self-similarity, and the other to handle the problem of a very high count.


2. (CVPR 2024) DAVE – A Detect-and-Verify Paradigm for Low-Shot Counting

paper: 2404.16622v1 (arxiv.org)

code: jerpelhan/DAVE (github.com)

Commentary:

DAVE -- A Detect-and-Verify Paradigm for Low-Shot Counting - Zheng Zhijie's personal website (0809zheng.github.io)

Abstract

✔️3. (AAAI 2024) Vision Transformer Off-the-Shelf: A Surprising Baseline for Few-Shot Class-Agnostic Counting

paper: 2305.04440v2 (arxiv.org)

code: Xu3XiWang/CACViT-AAAI24: Official implementation of Vision Transformer Off-the-Shelf: A Surprising Baseline for Few-Shot Class-Agnostic Counting (github.com)

Commentary:

Vision Transformer Off-the-Shelf: A Surprising Baseline for Few-Shot Class-Agnostic Counting - Zheng Zhijie's personal website (0809zheng.github.io)

Class-agnostic counting (CAC) aims to count objects of interest from a query image given few exemplars. This task is typically addressed by extracting the features of query image and exemplars respectively and then matching their feature similarity, leading to an extract-then-match paradigm. In this work, we show that CAC can be simplified in an extract-and-match manner, particularly using a vision transformer (ViT) where feature extraction and similarity matching are executed simultaneously within the self-attention. We reveal the rationale of such simplification from a decoupled view of the self-attention. The resulting model, termed CACViT, simplifies the CAC pipeline into a single pretrained plain ViT. Further, to compensate the loss of the scale and the order-of-magnitude information due to resizing and normalization in plain ViT, we present two effective strategies for scale and magnitude embedding. Extensive experiments on the FSC147 and the CARPK datasets show that CACViT significantly outperforms state-of-the-art CAC approaches in both effectiveness (23.60% error reduction) and generalization, which suggests CACViT provides a concise and strong baseline for CAC. Code will be available.


In a nutshell, our contributions are three-fold:

• A novel extract-and-match paradigm: we show that simultaneous feature extraction and matching can be made possible in CAC;

• CACViT: a simple and strong ViT-based baseline, sets the new state-of-the-art on the FSC-147 benchmark;

• We introduce two effective strategies to embed scale, aspect ratio, and order of magnitude information tailored to CACViT.
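The "extract-and-match" idea can be illustrated with a small NumPy sketch. It is not the CACViT implementation: it assumes a single attention head, random projection weights, and hypothetical names (`joint_self_attention`, `query_tokens`, `exemplar_tokens`). It only shows that when image tokens and exemplar tokens are concatenated, one attention matrix contains both a query-to-query block (feature extraction) and a query-to-exemplar block (matching).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_self_attention(query_tokens, exemplar_tokens, w_q, w_k, w_v):
    """Single-head self-attention over concatenated image and exemplar tokens."""
    tokens = np.concatenate([query_tokens, exemplar_tokens], axis=0)  # (Nq + Ne, D)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)           # (N, N)
    n_q = query_tokens.shape[0]
    blocks = {
        # Image patches attending to image patches: feature extraction.
        "query_to_query": attn[:n_q, :n_q],
        # Image patches attending to exemplar patches: similarity matching.
        "query_to_exemplar": attn[:n_q, n_q:],
    }
    return attn @ v, blocks

rng = np.random.default_rng(0)
d = 8                                          # toy embedding size
query_tokens = rng.normal(size=(16, d))        # 16 image patches
exemplar_tokens = rng.normal(size=(4, d))      # 4 exemplar patches
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

out, blocks = joint_self_attention(query_tokens, exemplar_tokens, w_q, w_k, w_v)
print(out.shape, blocks["query_to_exemplar"].shape)
```

Both blocks come from the same softmax row, so extraction and matching share one normalisation; this is the sense in which they happen "simultaneously" in the sketch.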


✔️4. Learning Spatial Similarity Distribution for Few-shot Object Counting

paper: 2405.11770v1 (arxiv.org)

code: CBalance/SSD: SSD: Learning Spatial Similarity Distribution for Few-shot Object Counting (github.com)

Commentary:

Learning Spatial Similarity Distribution for Few-shot Object Counting - Zheng Zhijie's personal website (0809zheng.github.io)

Few-shot object counting aims to count the number of objects in a query image that belong to the same class as the given exemplar images. Existing methods compute the similarity between the query image and exemplars in the 2D spatial domain and perform regression to obtain the counting number. However, these methods overlook the rich information about the spatial distribution of similarity on the exemplar images, leading to significant impact on matching accuracy. To address this issue, we propose a network learning Spatial Similarity Distribution (SSD) for few-shot object counting, which preserves the spatial structure of exemplar features and calculates a 4D similarity pyramid point-to-point between the query features and exemplar features, capturing the complete distribution information for each point in the 4D similarity space. We propose a Similarity Learning Module (SLM) which applies the efficient center-pivot 4D convolutions on the similarity pyramid to map different similarity distributions to distinct predicted density values, thereby obtaining accurate count. Furthermore, we also introduce a Feature Cross Enhancement (FCE) module that enhances query and exemplar features mutually to improve the accuracy of feature matching. Our approach outperforms state-of-the-art methods on multiple datasets, including FSC-147 and CARPK. Code is available at https://github.com/CBalance/SSD.
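To make the "4D similarity" concrete, here is a minimal NumPy sketch. It assumes cosine similarity at a single scale and uses hypothetical names (`similarity_4d`, `query_feat`, `exemplar_feat`); the similarity pyramid and the center-pivot 4D convolutions of the paper are not reproduced. The point is only that keeping the exemplar's spatial grid yields a tensor of shape (H, W, h, w) instead of a 2D map.

```python
import numpy as np

def similarity_4d(query_feat, exemplar_feat, eps=1e-8):
    """Cosine similarity between every query location and every exemplar location.

    query_feat:    (C, H, W) feature map of the query image
    exemplar_feat: (C, h, w) feature map of one exemplar crop
    returns:       (H, W, h, w) point-to-point similarity tensor
    """
    q = query_feat / (np.linalg.norm(query_feat, axis=0, keepdims=True) + eps)
    e = exemplar_feat / (np.linalg.norm(exemplar_feat, axis=0, keepdims=True) + eps)
    return np.einsum("cij,ckl->ijkl", q, e)

rng = np.random.default_rng(0)
query_feat = rng.normal(size=(32, 12, 16))
exemplar_feat = rng.normal(size=(32, 3, 3))

sim = similarity_4d(query_feat, exemplar_feat)
print(sim.shape)  # (12, 16, 3, 3)

# For contrast: averaging over the exemplar grid collapses the tensor to a
# 2D map and discards how similarity is distributed across the exemplar.
sim_2d = sim.mean(axis=(2, 3))
print(sim_2d.shape)  # (12, 16)
```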

Our contributions can be summarized as follows:

• We design a model based on learning the 4D spatial similarity distribution between query and exemplar features in Similarity Learning Module (SLM). This model is capable of obtaining accurate counting results after comprehensive integration of similarity distribution information among point pairs and their surroundings.

• Before calculating the similarity between query and exemplar features, we introduce a Feature Cross Enhancement (FCE) module, which enhances the interaction between them, reducing the distance between the target objects and exemplar features to achieve better matching performance.

• Extensive experiments on large-scale counting benchmarks, such as FSC-147 and CARPK, are conducted and the results demonstrate that our method outperforms the state-of-the-art approaches.

⭐️ 5.LOCA

paper: https://arxiv.org/pdf/2211.08217v2.pdf

code: djukicn/loca: LOCA - A Low-Shot Object Counting Network With Iterative Prototype Adaptation (ICCV 2023) (github.com)

Commentary: A Low-Shot Object Counting Network With Iterative Prototype Adaptation - Zheng Zhijie's personal website (0809zheng.github.io)

✅ [Paper Reading] (ICCV 2023) LOCA - CSDN blog

✔️6. Semantic Generative Augmentations for Few-Shot Counting

paper: 2311.16122v1 (arxiv.org)

code: perladoubinsky/SemAug: [WACV 2024] Official implementation of the paper: Semantic Generative Augmentations for Few-shot Counting (github.com)

With the availability of powerful text-to-image diffusion models, recent works have explored the use of synthetic data to improve image classification performances. These works show that it can effectively augment or even replace real data. In this work, we investigate how synthetic data can benefit few-shot class-agnostic counting. This requires to generate images that correspond to a given input number of objects. However, text-to-image models struggle to grasp the notion of count. We propose to rely on a double conditioning of Stable Diffusion with both a prompt and a density map in order to augment a training dataset for few-shot counting. Due to the small dataset size, the fine-tuned model tends to generate images close to the training images. We propose to enhance the diversity of synthesized images by exchanging captions between images thus creating unseen configurations of object types and spatial layout. Our experiments show that our diversified generation strategy significantly improves the counting accuracy of two recent and performing few-shot counting models on FSC147 and CARPK.

To tackle few-shot counting, we propose to synthesize unseen data with Stable Diffusion conditioned by both a textual prompt and a density map. We thus build an augmented FSC dataset that is used to train a deep counting network. The double conditioning, implemented with ControlNet [42], allows us to generate novel synthetic images with a precise control, preserving the ground truth for the counting task. It deals well with large numbers of objects, while current methods fail in such cases [19, 27]. To increase the diversity of the augmented training set, we swap image descriptions between the n available training samples, leading to n(n−1)/2 novel couples, each being the source of several possible synthetic images. However, we show that some combinations do not make sense and lead to poor quality samples. Therefore, we only select plausible pairs, resulting in improved augmentation quality. We evaluate our approach on two class-agnostic counting networks, namely SAFECount [41] and CounTR [6]. We show that it significantly improves the performances on the benchmark dataset FSC147 [28] and allow for a better generalization on the CARPK dataset [14].
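The pair count n(n−1)/2 and the "keep only plausible pairs" step can be sketched in a few lines of Python. The toy samples and the `plausible` rule below (object counts within a factor of two) are purely hypothetical placeholders; the excerpt above does not say which criterion the paper uses.

```python
from itertools import combinations
from math import comb

# Toy training samples: a caption plus the object count of the density map.
samples = [
    {"id": 0, "caption": "a tray of apples", "count": 12},
    {"id": 1, "caption": "a crate of oranges", "count": 15},
    {"id": 2, "caption": "a flock of birds in the sky", "count": 90},
    {"id": 3, "caption": "a bowl of strawberries", "count": 10},
]

def candidate_pairs(items):
    """All unordered pairs of samples whose captions could be swapped."""
    return list(combinations(items, 2))

def plausible(a, b, max_ratio=2.0):
    """Hypothetical filter: keep pairs whose object counts are comparable."""
    lo, hi = sorted((a["count"], b["count"]))
    return hi / lo <= max_ratio

pairs = candidate_pairs(samples)
kept = [(a["id"], b["id"]) for a, b in pairs if plausible(a, b)]

print(len(pairs), comb(len(samples), 2))  # 6 6, i.e. n(n-1)/2 for n = 4
print(kept)                               # [(0, 1), (0, 3), (1, 3)]
```

Each kept pair would then feed the generator with one sample's density map and the other sample's caption.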

✔️7.CounTR: Transformer-based Generalised Visual Counting

paper: 2208.13721v3 (arxiv.org)

code: Verg-Avesta/CounTR: CounTR: Transformer-based Generalised Visual Counting (github.com)

Commentary: CounTR: Transformer-based Generalised Visual Counting - Zheng Zhijie's personal website (0809zheng.github.io)

To summarise, in this paper, we make four contributions: First, we introduce an architecture for generalised visual object counting based on transformer, termed as CounTR (pronounced as counter). It exploits the attention mechanisms to explicitly capture the similarity between image patches, or with the few-shot instance “exemplars” provided by the end user; Second, we adopt a two-stage training regime (self-supervised pre-training, followed by supervised fine-tuning) and show its effectiveness for the task of visual counting; Third, we propose a simple yet scalable pipeline for synthesizing training images with a large number of instances, and demonstrate that it can significantly improve the performance on images containing a large number of object instances; Fourth, we conduct thorough ablation studies on the large-scale counting benchmark, e.g. FSC-147 [24], and demonstrate state-of-the-art performance on both zero-shot and few-shot settings, improving the previous best approach by over 18.3% on the mean absolute error of the test set.


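8. Semantic Generative Augmentations for Few-Shot Counting

paper: https://arxiv.org/pdf/2311.16122v1

code: perladoubinsky/SemAug: [WACV 2024] Official implementation of the paper: Semantic Generative Augmentations for Few-shot Counting (github.com)

Commentary:

With the arrival of powerful text-to-image diffusion models, recent studies have explored using synthetic data to improve image classification, and they show that it can effectively augment or even replace real data. This work studies how synthetic data can benefit few-shot class-agnostic counting, which requires generating images that match a given number of objects. Text-to-image models, however, struggle with the notion of count. The authors therefore condition Stable Diffusion on both a prompt and a density map in order to augment the training set for few-shot counting. Because the dataset is small, the fine-tuned model tends to generate images close to the training images, so they increase the diversity of the synthesized images by exchanging captions between images, which creates unseen combinations of object type and spatial layout. The experiments show that this diversified generation strategy significantly improves the counting accuracy of two recent, well-performing few-shot counting models on FSC147 and CARPK.

To tackle few-shot counting, the authors synthesize unseen data with Stable Diffusion conditioned on both a text prompt and a density map, and use it to build an augmented FSC dataset for training a deep counting network. The double conditioning, implemented with ControlNet [42], gives precise control over the generated images while preserving the ground truth for the counting task. It copes well with large numbers of objects, where current methods fail [19, 27]. To make the augmented training set more diverse, image descriptions are swapped between the n available training samples, giving n(n−1)/2 new pairs, each of which can produce several synthetic images. Some combinations make no sense and yield poor samples, so only plausible pairs are kept, which improves augmentation quality. The approach is evaluated on two class-agnostic counting networks, SAFECount [41] and CounTR [6]; it significantly improves performance on the FSC147 benchmark [28] and generalizes better to the CARPK dataset [14].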

9. Scale-Prior Deformable Convolution for Exemplar-Guided Class-Agnostic Counting

paper: 0313.pdf (mpg.de)

code: Elin24/SPDCN-CAC: BMVC-2022 paper "Scale-Prior Deformable Convolution for Class-Agnostic Counting"(https://bmvc2022.mpi-inf.mpg.de/313) (github.com)

Commentary:

Scale-Prior Deformable Convolution for Exemplar-Guided Class-Agnostic Counting - Zheng Zhijie's personal website (0809zheng.github.io)

Class-agnostic counting has recently emerged as a more practical counting task, which aims to predict the number and distribution of any exemplar objects, instead of counting specific categories like pedestrians or cars. However, recent methods are developed by designing suitable similarity matching rules between exemplars and query images, but ignoring the robustness of extracted features. To address this issue, we propose a scale-prior deformable convolution by integrating exemplars' information, e.g., scale, into the counting network backbone. As a result, the proposed counting network can extract semantic features of objects similar to the given exemplars and effectively filter irrelevant backgrounds. Besides, we find that traditional L2 and generalized loss are not suitable for class-agnostic counting due to the variety of object scales in different samples. Here we propose a scale-sensitive generalized loss to tackle this problem. It can adjust the cost function formulation according to the given exemplars, making the difference between prediction and ground truth more prominent. Extensive experiments show that our model obtains remarkable improvement and achieves state-of-the-art performance on a public class-agnostic counting benchmark. The source code is available at https://github.com/Elin24/SPDCN-CAC.

To summarize, the key contributions of this paper are:

• To address class-agnostic counting, we propose a scale-prior deformable network to better extract exemplar-related features, followed by a segmentation-then-counting stage to count objects.

• We propose a scale-sensitive generalized loss to make the model training adaptive to objects of different sizes, boosting the performance and generalization of trained models.

• Extensive experiments and visualizations demonstrate these two designs work well, and outstanding performance is obtained when our model is tested on benchmarks.


10. Few-shot Object Counting with Similarity-Aware Feature Enhancement

paper: 2201.08959v5 (arxiv.org)

code: zhiyuanyou/SAFECount: [WACV 2023] Few-shot Object Counting with Similarity-Aware Feature Enhancement (github.com)

Commentary:

Few-shot Object Counting with Similarity-Aware Feature Enhancement - Zheng Zhijie's personal website (0809zheng.github.io)

This work studies the problem of few-shot object counting, which counts the number of exemplar objects (i.e., described by one or several support images) occurring in the query image. The major challenge lies in that the target objects can be densely packed in the query image, making it hard to recognize every single one. To tackle the obstacle, we propose a novel learning block, equipped with a similarity comparison module and a feature enhancement module. Concretely, given a support image and a query image, we first derive a score map by comparing their projected features at every spatial position. The score maps regarding all support images are collected together and normalized across both the exemplar dimension and the spatial dimensions, producing a reliable similarity map. We then enhance the query feature with the support features by employing the developed point-wise similarities as the weighting coefficients. Such a design encourages the model to inspect the query image by focusing more on the regions akin to the support images, leading to much clearer boundaries between different objects. Extensive experiments on various benchmarks and training setups suggest that we surpass the state-of-the-art methods by a sufficiently large margin. For instance, on a recent large-scale FSC-147 dataset, we surpass the state-of-the-art method by improving the mean absolute error from 22.08 to 14.32 (35%↑). Code has been released.
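As a quick check of the "35%" figure, the snippet below recomputes the relative reduction in mean absolute error from the two numbers quoted in the abstract, and shows how MAE is computed from per-image counts. The helper names and the three toy predictions are assumptions for illustration only.

```python
import numpy as np

def mae(pred_counts, gt_counts):
    """Mean absolute error between predicted and ground-truth counts."""
    pred = np.asarray(pred_counts, dtype=float)
    gt = np.asarray(gt_counts, dtype=float)
    return float(np.mean(np.abs(pred - gt)))

def relative_reduction(old_error, new_error):
    """Fraction by which the error dropped, relative to the old error."""
    return (old_error - new_error) / old_error

# Numbers quoted in the abstract: MAE goes from 22.08 to 14.32.
print(round(100 * relative_reduction(22.08, 14.32), 1))  # 35.1

# Toy example of the metric itself (made-up counts).
print(round(mae([10, 20, 33], [12, 20, 30]), 4))  # 1.6667
```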


In this work, we propose a Similarity-Aware Feature Enhancement block for object Counting (SAFECount). As discussed above, feature is more informative while similarity better captures the support-query relationship. Our novel block adequately integrates both of the advantages by exploiting similarity as a guidance to enhance the features for regression. Intuitively, the enhanced feature not only carries the rich semantics extracted from the image, but also gets aware of which regions within the query image are similar to the exemplar object. Specifically, we come up with a similarity comparison module (SCM) and a feature enhancement module (FEM), as illustrated in Fig. 2c. On one hand, different from the naive feature comparison in Fig. 2b, our SCM learns a feature projection, then performs a comparison on the projected features to derive a score map. This design helps select from features the information that is most appropriate for object counting. After the comparison, we derive a reliable similarity map by collecting the score maps with respect to all support images (i.e., few-shot) and normalizing them along both the exemplar dimension and the spatial dimensions. On the other hand, the FEM takes the point-wise similarities as the weighting coefficients, and fuses the support features into the query feature. Such a fusion is able to make the enhanced query feature focus more on the regions akin to the exemplar object defined by support images, facilitating more precise counting.
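The SCM and FEM steps described above can be sketched roughly in NumPy. This is a simplified illustration, not the SAFECount code: each support image is assumed to be pooled to a single feature vector, the projection `proj` is random rather than learned, the two normalisations are assumed to be softmaxes whose results are multiplied, and the fusion is a plain residual addition. All names are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along one axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def safecount_block_sketch(query_feat, support_feats, proj):
    """query_feat: (C, H, W); support_feats: (K, C); proj: (C, C)."""
    c, h, w = query_feat.shape
    q = query_feat.reshape(c, h * w).T            # (HW, C)

    # SCM: compare projected features to get one score map per support image.
    q_proj = q @ proj                             # (HW, C)
    s_proj = support_feats @ proj                 # (K, C)
    scores = s_proj @ q_proj.T / np.sqrt(c)       # (K, HW)

    # Normalise along the exemplar dimension and along the spatial dimension,
    # then combine (the product is an assumption of this sketch).
    sim = softmax(scores, axis=0) * softmax(scores, axis=1)   # (K, HW)

    # FEM: point-wise similarities weight the support features, and the
    # result is fused into the query feature.
    enhancement = sim.T @ support_feats           # (HW, C)
    enhanced = (q + enhancement).T.reshape(c, h, w)
    return enhanced, sim.reshape(-1, h, w)

rng = np.random.default_rng(0)
query_feat = rng.normal(size=(8, 4, 5))
support_feats = rng.normal(size=(3, 8))           # 3 support images (3-shot)
proj = rng.normal(size=(8, 8))

enhanced, sim_map = safecount_block_sketch(query_feat, support_feats, proj)
print(enhanced.shape, sim_map.shape)  # (8, 4, 5) (3, 4, 5)
```

In the sketch, locations that resemble a support vector receive a larger share of that vector, which mirrors the idea of making the query feature "aware" of exemplar-like regions before density regression.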


Experimental results on a very recent large-scale FSC dataset, FSC-147 [24], and a car counting dataset, CARPK [10], demonstrate our substantial improvement over state-of-the-art methods. Through visualizing the intermediate similarity map and the final predicted density map, we find that our SAFECount substantially benefits from the clear boundaries learned between objects, even when they are densely packed in the query image.
