Paper1 CellTranspose: Few-Shot Domain Adaptation for Cellular Instance Segmentation
Original abstract: Automated cellular instance segmentation is a process that has been used to accelerate biological research for the past two decades, and recent advancements have produced higher quality results with less effort from the biologist. Most current endeavors focus on completely cutting the researcher out of the picture by generating highly generalized models. However, these models invariably fail when faced with novel data, distributed differently than the ones used for training. Rather than approaching the problem with methods that presume the availability of large amounts of target data and computing power for retraining, in this work we address the even greater challenge of designing an approach that requires minimal amounts of new annotated data as well as training time. We do so by designing specialized contrastive losses that leverage the few annotated samples very efficiently. A large set of results shows that 3 to 5 annotations lead to models whose accuracy: 1) significantly mitigates the covariate shift effects; 2) matches or surpasses other adaptation methods; 3) even approaches that of methods fully retrained on the target distribution. The adaptation training takes only a few minutes, paving a path towards a balance between model performance, computing requirements and expert-level annotation needs.
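The abstract's key ingredient is a set of specialized contrastive losses over the handful of annotated target samples. The exact formulations are not given above, so the snippet below is only a minimal sketch of a generic supervised contrastive loss of this family, with the temperature, shapes, and function name assumed for illustration.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features, labels, temperature=0.1):
    """features: (N, D) embeddings of annotated patches; labels: (N,) class/instance ids."""
    z = F.normalize(features, dim=1)                       # unit-norm embeddings
    sim = z @ z.t() / temperature                          # pairwise similarities
    pos_mask = labels.unsqueeze(0) == labels.unsqueeze(1)  # positives share a label
    pos_mask.fill_diagonal_(False)                         # exclude self-pairs
    logits = sim - torch.eye(len(z), device=z.device) * 1e9   # drop self from the denominator
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_counts = pos_mask.sum(1).clamp(min=1)
    return -((log_prob * pos_mask).sum(1) / pos_counts).mean()
```

Few-shot adaptation along these lines would fine-tune the pre-trained segmentation model for a few minutes with such a loss on the 3 to 5 annotated samples.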
Paper2 Towards Online Domain Adaptive Object Detection
Original abstract: Existing object detection models assume both the training and test data are sampled from the same source domain. This assumption does not hold true when these detectors are deployed in real-world applications, where they encounter new visual domains. Unsupervised Domain Adaptation (UDA) methods are generally employed to mitigate the adverse effects caused by domain shift. Existing UDA methods operate in an offline manner where the model is first adapted toward the target domain and then deployed in real-world applications. However, this offline adaptation strategy is not suitable for real-world applications as the model frequently encounters new domain shifts. Hence, it is critical to develop a feasible UDA method that generalizes to the new domain shifts encountered during deployment time in a continuous online manner. To this end, we propose a novel unified adaptation framework that adapts and improves generalization on the target domain in both offline and online settings. Specifically, we introduce MemXformer - a cross-attention transformer-based memory module where items in the memory take advantage of domain shifts and record prototypical patterns of the target distribution. Further, MemXformer produces strong positive and negative pairs to guide a novel contrastive loss, which enhances target-specific representation learning. Experiments on diverse detection benchmarks show that the proposed strategy produces state-of-the-art performance in both offline and online settings. To the best of our knowledge, this is the first work to address online and offline adaptation settings for object detection. Source code will be released after review.
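To make the memory idea concrete, here is a rough, hedged sketch of a cross-attention read from a learned bank of target prototypes. The class name, memory size, and readout rule are assumptions for illustration, not MemXformer's actual design.

```python
import torch
import torch.nn as nn

class CrossAttentionMemory(nn.Module):
    """Learned memory of prototypical target items, read via scaled dot-product attention."""
    def __init__(self, num_items=256, dim=256):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(num_items, dim))  # prototypical patterns
        self.scale = dim ** -0.5

    def forward(self, queries):            # queries: (B, N, dim) target region features
        attn = torch.softmax(queries @ self.memory.t() * self.scale, dim=-1)
        read = attn @ self.memory          # (B, N, dim) memory-enhanced features
        return read, attn                  # 'read' could serve as positives for a contrastive loss
```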
Paper3 Exploiting Instance-Based Mixed Sampling via Auxiliary Source Domain Supervision for Domain-Adaptive Action Detection
Original abstract: We propose a novel domain adaptive action detection approach and a new adaptation protocol that leverages the recent advancements in image-level unsupervised domain adaptation (UDA) techniques and handles the vagaries of instance-level video data. Self-training combined with cross-domain mixed sampling has shown remarkable performance gain for semantic segmentation in the UDA context. Motivated by this fact, we propose an approach for human action detection in videos that transfers knowledge from the source domain (annotated dataset) to the target domain (unannotated dataset) using mixed sampling and pseudo-label-based self-training. Existing UDA techniques follow the ClassMix algorithm for semantic segmentation. However, simply adopting ClassMix for action detection does not work, mainly because these are two entirely different problems, i.e., pixel-label classification vs. instance-label detection. To tackle this, we propose a novel action instance mixed sampling technique that combines information across domains based on action instances instead of action classes. Moreover, we propose a new UDA training protocol that addresses the long-tail sample distribution and domain shift problem by using supervision from an auxiliary source domain (ASD). For the ASD, we propose a new action detection dataset with dense frame-level annotations. We name our proposed framework domain-adaptive action instance mixing (DA-AIM). We demonstrate that DA-AIM consistently outperforms prior works on challenging domain adaptation benchmarks. The source code is available at https://github.com/wwwfan628/DA-AIM.
Paper4 Cross-Domain Video Anomaly Detection Without Target Domain Adaptation
Original abstract: Most cross-domain unsupervised Video Anomaly Detection (VAD) works assume that at least a few task-relevant target domain training data are available for adaptation from the source to the target domain. However, this requires laborious model-tuning by the end-user who may prefer to have a system that works “out-of-the-box”. To address such practical scenarios, we identify a novel target domain (inference-time) VAD task where no target domain training data are available. To this end, we propose a new ‘Zero-shot Cross-domain Video Anomaly Detection (zxvad)’ framework that includes a future-frame prediction generative model setup. Different from prior future-frame prediction models, our model uses a novel Normalcy Classifier module to learn the features of normal event videos by learning how such features are different “relative” to features in pseudo-abnormal examples. A novel Untrained Convolutional Neural Network based Anomaly Synthesis module crafts these pseudo-abnormal examples by adding foreign objects in normal video frames with no extra training cost. With our novel relative normalcy feature learning strategy, zxvad generalizes and learns to distinguish between normal and abnormal frames in a new target domain without adaptation during inference. Through evaluations on common datasets, we show that zxvad outperforms the state-of-the-art (SOTA), regardless of whether task-relevant (i.e., VAD) source training data are available or not. Lastly, zxvad also beats the SOTA methods in inference-time efficiency metrics including the model size, total parameters, GPU energy consumption, and GMACs.
Summary: This abstract introduces the Zero-shot Cross-domain Video Anomaly Detection (zxvad) framework for cross-domain unsupervised video anomaly detection. Traditional methods assume that at least some task-relevant target-domain training data are available for adapting from the source to the target domain, which requires laborious model tuning by the end user. To remove this requirement, the authors define a new target-domain (inference-time) VAD task in which no target-domain training data are available. Their framework is built on a future-frame prediction generative model and uses a novel Normalcy Classifier module that learns the features of normal-event videos by learning how those features differ “relative” to features of pseudo-abnormal examples, so it can distinguish normal from abnormal frames in a new target domain without adaptation. Evaluations on common datasets show that zxvad outperforms the state of the art and also beats prior methods on inference-time efficiency metrics, including model size, total parameters, GPU energy consumption, and GMACs.
Paper5 CoNMix for Source-Free Single and Multi-Target Domain Adaptation
Original abstract: This work introduces the novel task of Source-free Multi-target Domain Adaptation and proposes an adaptation framework comprising Consistency with Nuclear-Norm Maximization and MixUp knowledge distillation (CoNMix) as a solution to this problem. The main motive of this work is to solve Single and Multi-Target Domain Adaptation (SMTDA) in the source-free paradigm, which enforces a constraint where the labeled source data is not available during target adaptation due to various privacy-related restrictions on data sharing. The source-free approach leverages target pseudo labels, which can be noisy, to improve the target adaptation. We introduce consistency between label preserving augmentations and utilize pseudo label refinement methods to reduce noisy pseudo labels. Further, we propose novel MixUp Knowledge Distillation (MKD) for better generalization on multiple target domains using various source-free STDA models. We also show that the Vision Transformer (VT) backbone gives better feature representation with improved domain transferability and class discriminability. Our proposed framework achieves the state-of-the-art (SOTA) results in various paradigms of source-free STDA and MTDA settings on popular domain adaptation datasets like Office-Home, Office-Caltech, and DomainNet. Project Page: https://sites.google.com/view/conmix-vcl
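As a hedged illustration of the MixUp knowledge distillation idea (distilling several single-target adapted teachers into one multi-target student on mixed inputs), the sketch below uses a plain KL objective; the mixing coefficient, temperature, and averaging of teachers are assumptions, not CoNMix's exact recipe.

```python
import torch
import torch.nn.functional as F

def mixup_kd_loss(student, teachers, x_a, x_b, alpha=0.3, tau=2.0):
    """x_a, x_b: two batches of target images; teachers: list of adapted STDA models."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    x_mix = lam * x_a + (1 - lam) * x_b                       # MixUp in input space
    with torch.no_grad():                                     # teachers are frozen
        t_prob = torch.stack([F.softmax(t(x_mix) / tau, dim=1) for t in teachers]).mean(0)
    s_logp = F.log_softmax(student(x_mix) / tau, dim=1)
    return F.kl_div(s_logp, t_prob, reduction="batchmean") * tau * tau
```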
Paper6 Empirical Generalization Study: Unsupervised Domain Adaptation vs. Domain Generalization Methods for Semantic Segmentation in the Wild
Original abstract: For autonomous vehicles and mobile robots to safely operate in the real world, i.e., the wild, scene understanding models should perform well in the many different scenarios that can be encountered. In reality, these scenarios are not all represented in the model’s training data, leading to poor performance. To tackle this, current training strategies attempt to either exploit additional unlabeled data with unsupervised domain adaptation (UDA), or to reduce overfitting using the limited available labeled data with domain generalization (DG). However, it is not clear from current literature which of these methods allows for better generalization to unseen data from the wild. Therefore, in this work, we present an evaluation framework in which the generalization capabilities of state-of-the-art UDA and DG methods can be compared fairly. From this evaluation, we find that UDA methods, which leverage unlabeled data, outperform DG methods in terms of generalization, and can deliver similar performance on unseen data as fully-supervised training methods that require all data to be labeled. We show that semantic segmentation performance can be increased by up to 30% for a priori unknown data without using any extra labeled data.
Paper7 Domain Invariant Vision Transformer Learning for Face Anti-Spoofing
Original abstract: Existing face anti-spoofing (FAS) models have achieved high performance on specific datasets. However, for the application of real-world systems, the FAS model should generalize to data from unknown domains rather than only achieve good results on a single baseline. As vision transformer models have demonstrated astonishing performance and strong capability in learning discriminative information, we investigate applying transformers to distinguish face presentation attacks over unknown domains. In this work, we propose the Domain-invariant Vision Transformer (DiVT) for FAS, which adopts two losses to improve the generalizability of the vision transformer. First, a concentration loss is employed to learn a domain-invariant representation that aggregates the features of real face data. Second, a separation loss is utilized to unite each type of attack from different domains. The experimental results show that our proposed method achieves state-of-the-art performance on the protocols of domain-generalized FAS tasks. Compared to previous domain generalization FAS models, our proposed method is simpler but more effective.
Paper8 Learning Across Domains and Devices: Style-Driven Source-Free Domain Adaptation in Clustered Federated Learning
Original abstract: Federated Learning (FL) has recently emerged as a possible way to tackle the domain shift in real-world Semantic Segmentation (SS) without compromising the private nature of the collected data. However, most of the existing works on FL unrealistically assume labeled data in the remote clients. Here we propose a novel task (FFREEDA) in which the clients’ data is unlabeled and the server accesses a source labeled dataset for pre-training only. To solve FFREEDA, we propose LADD, which leverages the knowledge of the pre-trained model by employing self-supervision with ad-hoc regularization techniques for local training and introducing a novel federated clustered aggregation scheme based on the clients’ style. Our experiments show that our algorithm is able to efficiently tackle the new task, outperforming existing approaches. The code is available at https://github.com/Erosinho13/LADD.
Paper9 Reducing Annotation Effort by Identifying and Labeling Contextually Diverse Classes for Semantic Segmentation Under Domain Shift
Original abstract: Not available.
Summary: No abstract is available for this paper.
Paper10 How To Practice VQA on a Resource-Limited Target Domain
Original abstract: Visual question answering (VQA) is an active research area at the intersection of computer vision and natural language understanding. One major obstacle that keeps VQA models that perform well on benchmarks from being as successful in real-world applications is the lack of annotated Image-Question-Answer triplets in the task of interest. In this work, we focus on a previously overlooked perspective, which is the disparate effectiveness of transfer learning and domain adaptation methods depending on the amount of labeled/unlabeled data available. We systematically investigated the visual domain gaps and question-defined textual gaps, and compared different knowledge transfer strategies under unsupervised, self-supervised, semi-supervised and fully-supervised adaptation scenarios. We show that different methods have varied sensitivity and requirements for data amount in the target domain. We conclude by sharing the best practices from our exploration regarding transferring VQA models to resource-limited target domains.
Paper11 Select, Label, and Mix: Learning Discriminative Invariant Feature Representations for Partial Domain Adaptation
Original abstract: Partial domain adaptation, which assumes that the unknown target label space is a subset of the source label space, has attracted much attention in computer vision. Despite recent progress, existing methods often suffer from three key problems: negative transfer, lack of discriminability, and domain invariance in the latent space. To alleviate the above issues, we develop a novel ‘Select, Label, and Mix’ (SLM) framework that aims to learn discriminative invariant feature representations for partial domain adaptation. First, we present an efficient “select” module that automatically filters out the outlier source samples to avoid negative transfer while aligning distributions across both domains. Second, the “label” module iteratively trains the classifier using both the labeled source domain data and the generated pseudo-labels for the target domain to enhance the discriminability of the latent space. Finally, the “mix” module utilizes domain mixup regularization jointly with the other two modules to explore more intrinsic structures across domains, leading to a domain-invariant latent space for partial domain adaptation. Extensive experiments on several benchmark datasets demonstrate the superiority of our proposed framework over state-of-the-art methods. Project page: https://cvir.github.io/projects/slm.
Paper12 Discrete Cosin TransFormer: Image Modeling From Frequency Domain
Original abstract: In this paper, we propose the Discrete Cosin TransFormer (DCFormer), which directly learns semantics from a DCT-based frequency domain representation. We first show that transformer-based networks are able to learn semantics directly from a frequency domain representation based on the discrete cosine transform (DCT) without compromising the performance. To achieve the desired efficiency-effectiveness trade-off, we then leverage an input information compression on its frequency domain representation, which highlights the visually significant signals, inspired by JPEG compression. We explore different frequency domain down-sampling strategies and show that it is possible to preserve the semantically meaningful information by strategically dropping the high-frequency components. The proposed DCFormer is tested on various downstream tasks including image classification, object detection and instance segmentation, and achieves performance comparable to the state of the art with fewer FLOPs, and outperforms the commonly used backbone (e.g. SWIN) at similar FLOPs. Our ablation results also show that the proposed method generalizes well on different transformer backbones.
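For readers unfamiliar with the DCT front end, the following sketch shows the general idea of keeping only a low-frequency block of the 2D DCT coefficients, in the spirit of JPEG compression. It uses SciPy's DCT; the block size and the whole-image (rather than block-wise) transform are simplifying assumptions, not DCFormer's actual pipeline.

```python
import numpy as np
from scipy.fft import dctn, idctn

def dct_low_freq(img, keep=56):
    """img: (H, W) grayscale array; keep only the top-left keep x keep DCT coefficients."""
    coeffs = dctn(img, norm="ortho")          # 2D DCT-II of the whole image
    compact = coeffs[:keep, :keep].copy()     # low frequencies live in the top-left corner
    return compact                            # compact frequency-domain input for the backbone

def reconstruct(compact, out_shape):
    """Zero-pad the kept coefficients and invert the DCT, for visual inspection."""
    full = np.zeros(out_shape)
    full[:compact.shape[0], :compact.shape[1]] = compact
    return idctn(full, norm="ortho")
```

Dropping the high-frequency remainder shrinks the input the transformer has to process while, as the abstract notes, preserving most of the semantically meaningful information.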
Paper13 Intra-Source Style Augmentation for Improved Domain Generalization
Original abstract: The generalization with respect to domain shifts, as they frequently appear in applications such as autonomous driving, is one of the remaining big challenges for deep learning models. Therefore, we propose an intra-source style augmentation (ISSA) method to improve domain generalization in semantic segmentation. Our method is based on a novel masked noise encoder for StyleGAN2 inversion. The model learns to faithfully reconstruct the image, preserving its semantic layout, through noise prediction. Random masking of the estimated noise enables the style mixing capability of our model, i.e., it allows altering the global appearance without affecting the semantic layout of an image. Using the proposed masked noise encoder to randomize style and content combinations in the training set, ISSA effectively increases the diversity of training data and reduces spurious correlation. As a result, we achieve up to 12.4% mIoU improvements on driving-scene semantic segmentation under different types of data shifts, i.e., changing geographic locations, adverse weather conditions, and day to night. ISSA is model-agnostic and straightforwardly applicable with CNNs and Transformers. It is also complementary to other domain generalization techniques, e.g., it improves the recent state-of-the-art solution RobustNet by 3% mIoU in Cityscapes to Dark Zurich.
Paper14 TVT: Transferable Vision Transformer for Unsupervised Domain Adaptation
Original abstract: Unsupervised domain adaptation (UDA) aims to transfer the knowledge learnt from a labeled source domain to an unlabeled target domain. Previous work is mainly built upon convolutional neural networks (CNNs) to learn domain-invariant representations. With the recent exponential increase in applying Vision Transformer (ViT) to vision tasks, the capability of ViT in adapting cross-domain knowledge, however, remains unexplored in the literature. To fill this gap, this paper first comprehensively investigates the performance of ViT on a variety of domain adaptation tasks. Surprisingly, ViT demonstrates superior generalization ability, while the performance can be further improved by incorporating adversarial adaptation. Notwithstanding, directly using CNN-based adaptation strategies fails to take advantage of ViT’s intrinsic merits (e.g., attention mechanism and sequential image representation) which play an important role in knowledge transfer. To remedy this, we propose a unified framework, namely Transferable Vision Transformer (TVT), to fully exploit the transferability of ViT for domain adaptation. Specifically, we delicately devise a novel and effective unit, which we term the Transferability Adaption Module (TAM). By injecting learned transferabilities into attention blocks, TAM compels ViT to focus on both transferable and discriminative features. Besides, we leverage discriminative clustering to enhance feature diversity and separation, which are undermined during adversarial domain alignment. To verify its versatility, we perform extensive studies of TVT on four benchmarks, and the experimental results demonstrate that TVT attains significant improvements compared to existing state-of-the-art UDA methods.
Paper15 Center-Aware Adversarial Augmentation for Single Domain Generalization
Original abstract: Domain generalization (DG) aims to learn a model from multiple training (i.e., source) domains that can generalize well to unseen test (i.e., target) data coming from a different distribution. Single domain generalization (Single-DG) has recently emerged to tackle a more challenging, yet realistic setting, where only one source domain is available at training time. The existing Single-DG approaches are typically based on data augmentation strategies and aim to expand the span of source data by augmenting out-of-domain samples. Generally speaking, they aim to generate hard examples to confuse the classifier. While this may make the classifier robust to small perturbations, the generated samples are typically not diverse enough to mimic a large domain shift, resulting in sub-optimal generalization performance. To alleviate this, we propose a center-aware adversarial augmentation technique that expands the source distribution by altering the source samples so as to push them away from the class centers via a novel angular center loss. We conduct extensive experiments to demonstrate the effectiveness of our approach on several benchmark datasets for Single-DG and show that our method outperforms the state-of-the-art in most cases.
Paper16 Backprop Induced Feature Weighting for Adversarial Domain Adaptation With Iterative Label Distribution Alignment
Original abstract: The requirement for large labeled datasets is one of the limiting factors for training accurate deep neural networks. Unsupervised domain adaptation tackles this problem of limited training data by transferring knowledge from one domain, which has many labeled data, to a different domain for which little to no labeled data is available. One common approach is to learn domain-invariant features, for example with an adversarial approach. Previous methods often train the domain classifier and label classifier network separately, where both classification networks have little interaction with each other. In this paper, we introduce a classifier-based backprop-induced weighting of the feature space. This approach has two main advantages. Firstly, it lets the domain classifier focus on features that are important for the classification, and, secondly, it couples the classification and adversarial branch more closely. Furthermore, we introduce an iterative label distribution alignment method that employs results of previous runs to approximate a class-balanced dataloader. We conduct experiments and ablation studies on three benchmarks, Office-31, Office-Home, and DomainNet, to show the effectiveness of our proposed algorithm.
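A hedged sketch of the general "backprop-induced weighting" idea follows: weight feature channels by how strongly the label classifier's loss depends on them before feeding them to the domain classifier. The absolute-gradient channel weighting and normalization here are assumptions for illustration, not the paper's exact scheme.

```python
import torch
import torch.nn.functional as F

def backprop_feature_weights(features, class_logits, labels):
    """features: (B, C, H, W) intermediate activations that produced class_logits."""
    cls_loss = F.cross_entropy(class_logits, labels)
    grads = torch.autograd.grad(cls_loss, features, retain_graph=True)[0]
    weights = grads.abs().mean(dim=(2, 3), keepdim=True)          # per-channel importance
    weights = weights / (weights.sum(dim=1, keepdim=True) + 1e-8)
    return features * weights                                     # re-weighted input for the domain classifier
```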
Paper17 Domain Adaptive Video Semantic Segmentation via Cross-Domain Moving Object Mixing
Original abstract: A network trained for domain adaptation is prone to bias toward the easy-to-transfer classes. Since the ground truth label on the target domain is unavailable during training, the bias problem leads to skewed predictions, forgetting to predict hard-to-transfer classes. To address this problem, we propose Cross-domain Moving Object Mixing (CMOM), which cuts several objects, including hard-to-transfer classes, in the source domain video clip and pastes them into the target domain video clip. Unlike image-level domain adaptation, the temporal context should be maintained to mix moving objects in two different videos. Therefore, we design CMOM to mix with consecutive video frames, so that unrealistic movements do not occur. We additionally propose Feature Alignment with Temporal Context (FATC) to enhance target domain feature discriminability. FATC exploits the robust source domain features, which are trained with ground truth labels, to learn discriminative target domain features in an unsupervised manner by filtering unreliable predictions with temporal consensus. We demonstrate the effectiveness of the proposed approaches through extensive experiments. In particular, our model reaches an mIoU of 53.81% on the VIPER -> Cityscapes-Seq benchmark and an mIoU of 56.31% on the SYNTHIA-Seq -> Cityscapes-Seq benchmark, surpassing the state-of-the-art methods by large margins.
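A toy sketch of the cut-and-paste mixing idea is shown below: copy the pixels of chosen source classes into every frame of the target clip so that the pasted objects keep their original motion. The class selection, ignore label, and array layout are assumptions, not the paper's implementation.

```python
import numpy as np

def mix_moving_objects(src_clip, src_labels, tgt_clip, classes_to_paste):
    """src_clip/tgt_clip: (T, H, W, 3); src_labels: (T, H, W) semantic maps."""
    mixed = tgt_clip.copy()
    # 255 = assumed ignore label for the unlabeled target pixels
    mixed_labels = np.full(src_labels.shape, fill_value=255, dtype=src_labels.dtype)
    for t in range(src_clip.shape[0]):                    # same classes pasted in every frame
        mask = np.isin(src_labels[t], classes_to_paste)
        mixed[t][mask] = src_clip[t][mask]                # paste source objects into the target frame
        mixed_labels[t][mask] = src_labels[t][mask]       # carry over their ground-truth labels
    return mixed, mixed_labels
```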
Paper18 Self-Distillation for Unsupervised 3D Domain Adaptation
Original abstract: Point cloud classification is a popular task in 3D vision. However, previous works usually assume that point clouds at test time are obtained with the same procedure or sensor as those at training time. Unsupervised Domain Adaptation (UDA), instead, breaks this assumption and tries to solve the task on an unlabeled target domain, leveraging only a supervised source domain. For point cloud classification, recent UDA methods try to align features across domains via auxiliary tasks such as point cloud reconstruction, which however do not optimize the discriminative power in the target domain in feature space. In contrast, in this work, we focus on obtaining a discriminative feature space for the target domain, enforcing consistency between a point cloud and its augmented version. We then propose a novel iterative self-training methodology that exploits Graph Neural Networks in the UDA context to refine pseudo-labels. We perform extensive experiments and set the new state of the art in standard UDA benchmarks for point cloud classification. Finally, we show how our approach can be extended to more complex tasks such as part segmentation.
Paper19 ConfMix: Unsupervised Domain Adaptation for Object Detection via Confidence-Based Mixing
Original abstract: Unsupervised Domain Adaptation (UDA) for object detection aims to adapt a model trained on a source domain to detect instances from a new target domain for which annotations are not available. Different from traditional approaches, we propose ConfMix, the first method that introduces a sample mixing strategy based on region-level detection confidence for adaptive object detector learning. We mix the local region of the target sample that corresponds to the most confident pseudo detections with a source image, and apply an additional consistency loss term to gradually adapt towards the target data distribution. In order to robustly define a confidence score for a region, we exploit the confidence score per pseudo detection that accounts for both the detector-dependent confidence and the bounding box uncertainty. Moreover, we propose a novel pseudo labelling scheme that progressively filters the pseudo target detections using a confidence metric that varies from loose to strict along the training. We perform extensive experiments with three datasets, achieving state-of-the-art performance in two of them and approaching the supervised target model performance in the other. Code is available at https://github.com/giuliomattolin/ConfMix.
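The region-mixing step can be pictured with the simplified sketch below: split the target image into quadrants, score each quadrant by the pseudo-detection confidences falling inside it, and paste the most confident quadrant onto a source image. The quadrant split, the summed score, and equal image sizes are assumptions for illustration; ConfMix's actual region confidence combines detector confidence with bounding-box uncertainty.

```python
import numpy as np

def confmix(source_img, target_img, pseudo_boxes, pseudo_scores):
    """Images: (H, W, 3) with the same resolution; pseudo_boxes: (N, 4) as x1,y1,x2,y2 on the target image."""
    H, W = target_img.shape[:2]
    region_conf = np.zeros(4)
    for (x1, y1, x2, y2), s in zip(pseudo_boxes, pseudo_scores):
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        idx = int(cy >= H / 2) * 2 + int(cx >= W / 2)     # quadrant containing the box centre
        region_conf[idx] += s
    best = int(region_conf.argmax())
    ys = slice(0, H // 2) if best < 2 else slice(H // 2, H)
    xs = slice(0, W // 2) if best % 2 == 0 else slice(W // 2, W)
    mixed = source_img.copy()
    mixed[ys, xs] = target_img[ys, xs]                    # paste the most confident target region
    return mixed
```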
Paper20 SALAD: Source-Free Active Label-Agnostic Domain Adaptation for Classification, Segmentation and Detection
Original abstract: We present a novel method, SALAD, for the challenging vision task of adapting a pre-trained “source” domain network to a “target” domain, with a small budget for annotation in the “target” domain and a shift in the label space. Further, the task assumes that the source data is not available for adaptation, due to privacy concerns or otherwise. We postulate that such systems need to jointly optimize the dual task of (i) selecting a fixed number of samples from the target domain for annotation and (ii) transfer of knowledge from the pre-trained network to the target domain. To do this, SALAD consists of a novel Guided Attention Transfer Network (GATN) and an active learning function, HAL. The GATN enables feature distillation from the pre-trained network to the target network, complemented with the target samples mined by HAL using transferability and uncertainty criteria. SALAD has three key benefits: (i) it is task-agnostic, and can be applied across various visual tasks such as classification, segmentation and detection; (ii) it can handle shifts in output label space from the pre-trained source network to the target domain; (iii) it does not require access to source data for adaptation. We conduct extensive experiments across 3 visual tasks, viz. digits classification (MNIST, SVHN, VISDA), synthetic (GTA5) to real (CityScapes) image segmentation, and document layout detection (PubLayNet to DSSE). We show that our source-free approach, SALAD, results in an improvement of 0.5%-31.3% (across datasets and tasks) over prior adaptation methods that assume access to large amounts of annotated source data for adaptation.
Paper21 Learning Classifiers of Prototypes and Reciprocal Points for Universal Domain Adaptation
Original abstract: Universal Domain Adaptation aims to transfer the knowledge between datasets by handling two shifts: domain-shift and category-shift. The main challenge is correctly distinguishing the unknown target samples while adapting the distribution of known class knowledge from source to target. Most existing methods approach this problem by first training the target-adapted known classifier and then relying on a single threshold to distinguish unknown target samples. However, this simple threshold-based approach prevents the model from considering the underlying complexities existing between the known and unknown samples in the high-dimensional feature space. In this paper, we propose a new approach in which we use two sets of feature points, namely dual Classifiers for Prototypes and Reciprocals (CPR). Our key idea is to associate each prototype with corresponding known class features while pushing the reciprocals apart from these prototypes to locate them in the potential unknown feature space. The target samples are then classified as unknown if they fall near any reciprocals at test time. To successfully train our framework, we collect the partial, confident target samples that are classified as known or unknown through our proposed multi-criteria selection. We then additionally apply entropy loss regularization to them. For further adaptation, we also apply standard consistency regularization that matches the predictions of two different views of the input to make the target feature space more compact. We evaluate our proposal, CPR, on three standard benchmarks and achieve comparable or new state-of-the-art results. We also provide extensive ablation experiments to verify our main design choices in our framework.
Paper22 Domain Adaptive Object Detection for Autonomous Driving Under Foggy Weather
Original abstract: Most object detection methods for autonomous driving usually assume a consistent feature distribution between training and testing data, which is not always the case when weather conditions differ significantly. An object detection model trained under clear weather might not be effective enough in foggy weather because of the domain gap. This paper proposes a novel domain adaptive object detection framework for autonomous driving under foggy weather. Our method leverages both image-level and object-level adaptation to diminish the domain discrepancy in image style and object appearance. To further enhance the model’s capabilities under challenging samples, we also come up with a new adversarial gradient reversal layer to perform adversarial mining for the hard examples together with domain adaptation. Moreover, we propose to generate an auxiliary domain by data augmentation to enforce a new domain-level metric regularization. Experimental results on public benchmarks show the effectiveness and accuracy of the proposed method.
Paper23 Image-Free Domain Generalization via CLIP for 3D Hand Pose Estimation
Original abstract: RGB-based 3D hand pose estimation has been successful for decades thanks to large-scale databases and deep learning. However, the hand pose estimation network does not operate well for hand pose images whose characteristics are far different from the training data. This is caused by various factors such as illumination, camera angles, diverse backgrounds in the input images, etc. Many existing methods tried to solve this by supplying additional large-scale unconstrained/target domain images to augment the data space; however, collecting such large-scale images takes a lot of labor. In this paper, we present a simple image-free domain generalization approach for the hand pose estimation framework that uses only source domain data. We try to manipulate the image features of the hand pose estimation network by adding the features from text descriptions using the CLIP (Contrastive Language-Image Pre-training) model. The manipulated image features are then exploited to train the hand pose estimation network via the contrastive learning framework. In experiments with the STB and RHD datasets, our algorithm shows improved performance over the state-of-the-art domain generalization approaches.
Paper24 D2F2WOD: Learning Object Proposals for Weakly-Supervised Object Detection via Progressive Domain Adaptation
Original abstract: Weakly-supervised object detection (WSOD) models attempt to leverage image-level annotations in lieu of accurate but costly-to-obtain object localization labels. This oftentimes leads to substandard object detection and localization at inference time. To tackle this issue, we propose D2F2WOD, a Dual-Domain Fully-to-Weakly Supervised Object Detection framework that leverages synthetic data, annotated with precise object localization, to supplement a natural image target domain, where only image-level labels are available. In its warm-up domain adaptation stage, the model learns a fully-supervised object detector (FSOD) to improve the precision of the object proposals in the target domain, and at the same time learns target-domain-specific and detection-aware proposal features. In its main WSOD stage, a WSOD model is specifically tuned to the target domain. The feature extractor and the object proposal generator of the WSOD model are built upon the fine-tuned FSOD model. We test D2F2WOD on five dual-domain image benchmarks. The results show that our method results in consistently improved object detection and localization compared with state-of-the-art methods.
Paper25 Learning Style Subspaces for Controllable Unpaired Domain Translation
Original abstract: Unpaired domain-to-domain translation aims to learn inter-domain relationships between diverse modalities without relying on paired data, which can help complex structure prediction tasks such as age transformation, where it is challenging to attain paired samples. A common approach used by most current methods is to factorize the data into a domain-invariant content space and a domain-specific style space. In this work, we argue that the style space can be further decomposed into smaller subspaces. Learning these style subspaces has two-fold advantages: (i) it allows more robustness and reliability in the generation of images in unpaired domain translation; and (ii) it allows better control over, and interpolation of, the latent space, which can be helpful in complex translation tasks involving multiple domains. To achieve this decomposition, we propose a novel scalable approach to partition the latent space into style subspaces. We also propose a new evaluation metric that quantifies the controllable generation capability of domain translation methods. We compare our proposed method with several strong baselines on standard domain translation tasks such as gender translation (male-to-female and female-to-male), age transformation, reference-guided image synthesis, multi-domain image translation, and multi-attribute domain translation on the CelebA-HQ and AFHQ datasets. The proposed technique achieves state-of-the-art performance on various domain translation tasks while outperforming all the baselines on controllable generation tasks.
Paper26 Contrastive Learning of Semantic Concepts for Open-Set Cross-Domain Retrieval
Original abstract: We consider the problem of image retrieval where query images during testing belong to classes and domains both unseen during training. This requires learning a feature space that has the ability to generalize across both classes and domains. To this end, we propose the semantic contrastive concept network (SCNNet), a new learning framework that helps take a step towards class and domain generalization in a principled fashion. Unlike existing methods that rely on global object representations, SCNNet proposes to learn a set of local concept vectors to facilitate unseen-class generalization. To this end, SCNNet’s key innovations include (a) a novel trainable local concept extraction module that learns an orthonormal set of basis vectors, and (b) a scheme that computes local features for any unseen-class data as a linear combination of the learned basis set. Next, to enable unseen-domain generalization, SCNNet proposes to generate supervisory signals from an adjacent data modality, i.e., natural language, by mining freely available textual label information associated with images. SCNNet derives these signals from our novel trainable semantic ordinal distance constraints that ensure semantic consistency between pairs of images sampled from different domains. Both the proposed modules above enable end-to-end training of SCNNet, resulting in a model that helps establish state-of-the-art performance on the standard DomainNet, PACS, and Sketchy benchmark datasets, with average Prec@200 improvements of 42.6%, 6.5%, and 13.6% respectively over the most recently reported results.
Paper27 Camera Alignment and Weighted Contrastive Learning for Domain Adaptation in Video Person ReID
Original abstract: Systems for person re-identification (ReID) can achieve a high level of accuracy when trained on large fully-labeled image datasets. However, the domain shift typically associated with diverse operational capture conditions (e.g., camera viewpoints and lighting) may translate to a significant decline in performance. This paper focuses on unsupervised domain adaptation (UDA) for video-based ReID – a relevant scenario that is less explored in the literature. In this scenario, the ReID model must adapt to a complex target domain defined by a network of diverse video cameras based on tracklet information. State-of-the-art methods cluster unlabeled target data, yet domain shifts across target cameras (sub-domains) can lead to poor initialization of clustering methods that propagates noise across epochs, and the ReID model cannot accurately associate samples of the same identity. In this paper, a UDA method is introduced for video person ReID that leverages knowledge on video tracklets, and on the distribution of frames captured over target cameras, to improve the performance of CNN backbones trained using pseudo-labels. Our method relies on an adversarial approach, where a camera-discriminator network is introduced to extract discriminant camera-independent representations, facilitating the subsequent clustering. In addition, a weighted contrastive loss is proposed to leverage the confidence of clusters and mitigate the risk of incorrect identity associations. Experimental results obtained on three challenging video-based person ReID datasets – PRID2011, iLIDS-VID, and MARS – indicate that our proposed method can outperform related state-of-the-art methods. The code is available at: https://github.com/wacv23775/775.
Paper28 Semi-Supervised Domain Adaptation With Auto-Encoder via Simultaneous Learning
Original abstract: We present a new semi-supervised domain adaptation framework that combines a novel auto-encoder-based domain adaptation model with a simultaneous learning scheme, providing stable improvements over state-of-the-art domain adaptation models. Our framework holds a strong distribution matching property by training both source and target auto-encoders using a novel simultaneous learning scheme on a single graph with an optimally modified MMD loss objective function. Additionally, we design a semi-supervised classification approach by transferring the aligned domain-invariant feature spaces from the source domain to the target domain. We evaluate on three datasets and show that our framework can effectively solve both fragile convergence (adversarial) and weak distribution matching problems between source and target feature spaces (discrepancy) with a high ‘speed’ of adaptation requiring a very low number of iterations.
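For reference, the standard (unmodified) RBF-kernel Maximum Mean Discrepancy that such objectives build on can be written as follows; the paper's "optimally modified" MMD loss is not specified in the abstract, so this is only the textbook baseline with an assumed bandwidth.

```python
import torch

def rbf_mmd(x, y, sigma=1.0):
    """x: (n, d) source features, y: (m, d) target features; biased MMD^2 estimate."""
    def k(a, b):
        d2 = torch.cdist(a, b) ** 2             # pairwise squared Euclidean distances
        return torch.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()
```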
Paper29 CUDA-GHR: Controllable Unsupervised Domain Adaptation for Gaze and Head Redirection
Original abstract: The robustness of gaze and head pose estimation models is highly dependent on the amount of labeled data. Recently, generative modeling has shown excellent results in generating photo-realistic images, which can alleviate the need for annotations. However, adopting such generative models to new domains while maintaining their ability to provide fine-grained control over different image attributes, e.g., gaze and head pose directions, has been a challenging problem. This paper proposes CUDA-GHR, an unsupervised domain adaptation framework that enables fine-grained control over gaze and head pose directions while preserving the appearance-related factors of the person. Our framework simultaneously learns to adapt to new domains and disentangle visual attributes such as appearance, gaze direction, and head orientation by utilizing a label-rich source domain and an unlabeled target domain. Extensive experiments on benchmark datasets show that the proposed method can outperform state-of-the-art techniques on both quantitative and qualitative evaluations. Furthermore, we demonstrate the effectiveness of generated image-label pairs in the target domain for pretraining networks for the downstream task of gaze and head pose estimation. The source code and pre-trained models are available at https://github.com/jswati31/cuda-ghr.
Paper30 FFM: Injecting Out-of-Domain Knowledge via Factorized Frequency Modification
Original abstract: This work addresses the Single Domain Generalization (SDG) problem, and aims to generalize a model from a single source (i.e., training) domain to multiple target (i.e., test) domains with different distributions. Most of the existing SDG approaches aim at generating out-of-domain samples by either transforming the source images into different styles or optimizing adversarial noise perturbations. In this paper, we show that generating images with diverse styles can be complementary to creating hard samples when tackling the SDG task. This inspires us to propose our approach of Factorized Frequency Modification (FFM), which can fulfill the requirement of generating diverse and hard samples to tackle the problem of out-of-domain generalization. Specifically, we design a unified framework consisting of a style transformation module, an adversarial perturbation module, and a dynamic frequency selection module. We seamlessly equip the framework with iterative adversarial training, which facilitates the task model to learn discriminative features from hard and diverse augmented samples. We perform extensive experiments on four image recognition benchmark datasets, Digits-DG, CIFAR-10-C, CIFAR-100-C, and PACS, which demonstrate that our method outperforms existing state-of-the-art approaches.
Summary: This work addresses the Single Domain Generalization (SDG) problem, aiming to generalize a model from a single source (training) domain to multiple target (test) domains with different distributions. Most existing SDG methods generate out-of-domain samples either by transforming source images into different styles or by optimizing adversarial noise perturbations. The paper shows that generating images with diverse styles is complementary to creating hard samples for SDG, which motivates Factorized Frequency Modification (FFM), a method designed to produce samples that are both diverse and hard. Concretely, the authors build a unified framework consisting of a style transformation module, an adversarial perturbation module, and a dynamic frequency selection module, and equip it with iterative adversarial training so the task model learns discriminative features from hard, diverse augmented samples. Extensive experiments on four image recognition benchmarks (Digits-DG, CIFAR-10-C, CIFAR-100-C, and PACS) show that the method outperforms existing state-of-the-art approaches.
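A common building block behind frequency-domain style modification is perturbing the Fourier amplitude of an image while keeping its phase, which changes appearance but largely preserves content. The sketch below illustrates that generic operation; the band size, noise scale, and uniform perturbation are assumptions and not FFM's learned, factorized modification.

```python
import numpy as np

def amplitude_perturb(img, strength=0.3, low_freq_frac=0.1):
    """img: (H, W) array. Returns a style-shifted variant with content preserved."""
    f = np.fft.fftshift(np.fft.fft2(img))
    amp, phase = np.abs(f), np.angle(f)
    H, W = img.shape
    h, w = int(H * low_freq_frac), int(W * low_freq_frac)
    cy, cx = H // 2, W // 2
    noise = 1.0 + strength * np.random.uniform(-1, 1, size=(2 * h, 2 * w))
    amp[cy - h:cy + h, cx - w:cx + w] *= noise           # modify the low-frequency (style) band
    out = np.fft.ifft2(np.fft.ifftshift(amp * np.exp(1j * phase)))
    return np.real(out)
```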
Paper31 WHFL: Wavelet-Domain High Frequency Loss for Sketch-to-Image Translation
Original abstract: Even a rough sketch can effectively convey the descriptions of objects, as humans can imagine the original shape from the sketch. Sketch-to-photo translation is a computer vision task that enables a machine to do this imagination, taking a binary sketch image and generating plausible RGB images corresponding to the sketch. Hence, deep neural networks for this task should learn to generate a wide range of frequencies because most parts of the input (binary sketch image) are composed of DC signals. In this paper, we propose a new loss function named the Wavelet-domain High-Frequency Loss (WHFL) to overcome the limitations of previous methods that tend to have a bias toward low frequencies. The proposed method emphasizes the loss on the high frequencies by designing a new weight matrix imposing larger weights on the high bands. Unlike existing hand-crafted methods that control frequency weights using binary masks, we use a matrix with finely controlled elements according to frequency scales. The WHFL is designed in a multi-scale form, which lets the loss function focus more on the high frequencies according to decomposition levels. We use the WHFL as a complementary loss in addition to conventional ones defined in the spatial domain. Experiments show that we can improve the qualitative and quantitative results in both spatial and frequency domains. Additionally, we attempt to verify the WHFL’s high-frequency generation capability by defining a new evaluation metric named the Unsigned Euclidean Distance Field Error (UEDFE).
Summary: This abstract discusses sketch-to-photo translation, which lets a machine generate a plausible RGB image from a binary sketch. To overcome the low-frequency bias of previous methods, the authors propose a new loss function, the Wavelet-domain High-Frequency Loss (WHFL), which emphasizes the loss on high frequencies by designing a weight matrix that places larger weights on the high bands. WHFL is designed in a multi-scale form, so the loss focuses more on high frequencies according to the decomposition level, and it is used as a complement to conventional spatial-domain losses. Experiments show that WHFL improves qualitative and quantitative results in both the spatial and frequency domains. In addition, a new evaluation metric, the Unsigned Euclidean Distance Field Error (UEDFE), is defined to verify WHFL's high-frequency generation capability.
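To illustrate the flavor of a wavelet-domain high-frequency loss, the sketch below decomposes prediction and target with a multi-level 2D wavelet transform and penalizes differences in the detail subbands, weighting finer scales more heavily. The linear weighting and L1 penalty are assumptions; the paper's finely controlled weight matrix is not reproduced here.

```python
import numpy as np
import pywt

def wavelet_hf_loss(pred, target, wavelet="haar", levels=3):
    """pred/target: (H, W) arrays; returns a scalar loss."""
    cp = pywt.wavedec2(pred, wavelet, level=levels)
    ct = pywt.wavedec2(target, wavelet, level=levels)
    loss = 0.0
    # cp[0] is the coarse approximation; cp[1:] are the detail bands, ordered coarse -> fine
    for k, (dp, dt) in enumerate(zip(cp[1:], ct[1:]), start=1):
        weight = k / levels                              # heavier weight on finer (higher-frequency) bands
        for band_p, band_t in zip(dp, dt):               # (cH, cV, cD) at this level
            loss += weight * np.mean(np.abs(band_p - band_t))
    return loss
```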
Paper32 Improving Diversity With Adversarially Learned Transformations for Domain Generalization
Original abstract: To be successful in single source domain generalization (SSDG), maximizing diversity of synthesized domains has emerged as one of the most effective strategies. Recent success in SSDG comes from methods that pre-specify diversity-inducing image augmentations during training, so that they may lead to better generalization on new domains. However, naive pre-specified augmentations are not always effective, either because they cannot model large domain shifts, or because the specific choice of transforms may not cover the types of shifts commonly occurring in domain generalization. To address this issue, we present a novel framework called ALT: adversarially learned transformations, that uses an adversary neural network to model plausible, yet hard image transformations that fool the classifier. ALT learns image transformations by randomly initializing the adversary network for each batch and optimizing it for a fixed number of steps to maximize classification error. The classifier is trained by enforcing a consistency between its predictions on the clean and transformed images. With extensive empirical analysis, we find that this new form of adversarial transformations achieves both objectives of diversity and hardness simultaneously, outperforming all existing techniques on competitive benchmarks for SSDG. We also show that ALT can seamlessly work with existing diversity modules to produce highly distinct, and large transformations of the source domain, leading to state-of-the-art performance. Code: https://github.com/tejas-gokhale/ALT
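The per-batch adversarial inner loop described above can be pictured with this condensed, hedged sketch: re-initialize a small image-to-image adversary each batch and take a few gradient steps that maximize the classifier's loss, then use its output as the hard, diverse view. The network architecture, step count, and learning rate are assumptions, not ALT's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def adversarial_transform(classifier, images, labels, steps=5, lr=1e-2):
    """Returns an adversarially transformed view of `images` (assumed (B, 3, H, W))."""
    adversary = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                              nn.Conv2d(16, 3, 3, padding=1)).to(images.device)  # fresh per batch
    opt = torch.optim.SGD(adversary.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = -F.cross_entropy(classifier(adversary(images)), labels)  # maximize classification error
        loss.backward()           # note: this also populates classifier grads; zero them in the outer loop
        opt.step()
    with torch.no_grad():
        return adversary(images)  # classifier is then trained for consistency between this and the clean view
```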
Paper33 Generative Alignment of Posterior Probabilities for Source-Free Domain Adaptation
Original abstract: Existing domain adaptation literature comprises multiple techniques that align the labeled source and unlabeled target domains at different stages, and predict the target labels. In a source-free domain adaptation setting, the source data is not available for alignment. We present a source-free generative paradigm that captures the relations between the source categories and enforces them onto the unlabeled target data, thereby circumventing the need for source data without introducing any new hyper-parameters. The adaptation is performed through the adversarial alignment of the posterior probabilities of the source and target categories. The proposed approach demonstrates competitive performance against other source-free domain adaptation techniques and can also be used for source-present settings.
Paper34 Auxiliary Task-Guided CycleGAN for Black-Box Model Domain Adaptation
Original abstract: The research area of domain adaptation investigates methods that enable the transfer of existing models across different domains, e.g., addressing environmental changes or the transfer from synthetic to real data. Unsupervised domain adaptation is especially beneficial because it does not require any labeled target domain data. Usually, existing methods are targeted at specific tasks and require access or even modifications to the source model and its parameters, which is a major drawback when only a black-box model is available. Therefore, we propose a CycleGAN-based approach suitable for black-box source models to translate target domain data into the source domain on which the source model can operate. Inspired by multi-task learning, we extend CycleGAN with an additional auxiliary task that can be arbitrarily chosen to support the transfer of task-related information across domains without the need for having access to a differentiable source model or its parameters. In this work, we focus on the regression task of 2D human pose estimation and compare our results in four different domain adaptation settings to CycleGAN and RegDA, a state-of-the-art method for unsupervised domain adaptation for keypoint detection.
Paper35 Domain Adaptation Using Self-Training With Mixup for One-Stage Object Detection
Original abstract: In this paper, we present an end-to-end domain adaptation technique that effectively utilizes both feature distribution alignment and self-training for object detection. One set of methods for domain adaptation relies on feature distribution alignment and adapts models on an unlabeled target domain by learning domain-invariant representations through an adversarial loss. Although this approach is effective, it may not be adequate or may even have an adverse effect when domain shifts are large and inconsistent. Another set of methods utilizes self-training, which relies on pseudo labels to approximate the target domain distribution directly. However, it can also have a negative impact on the model performance due to erroneous pseudo labels. To overcome these two issues, we propose to generate reliable pseudo labels through feature distribution alignment and data distillation. Further, to minimize the adverse effect of incorrect pseudo labels during self-training, we employ an interpolation-based consistency regularization called mixup. While distribution alignment helps in generating more accurate pseudo labels, mixup regularization of self-training reduces the adverse effect of less accurate pseudo labels. Both approaches supplement each other and achieve effective adaptation on the target domain, which we demonstrate through extensive experiments on a one-stage object detector. Experimental results show that our approach achieves a significant performance improvement on multiple benchmark datasets.