Contrastive Learning in Image (CVPR 2023)

news2025/2/24 7:53:53

Table of Contents

1. Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive Learning (image-text matching)
  - 1. Goal
  - 2. Task type
  - 3. Approach
  - 4. Summary
2. MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining (image distillation)
  - Core idea
3. Twin Contrastive Learning with Noisy Labels (image classification)
  - 1. Goal
  - 2. Model
    - Out-Of-Distribution Label Noise Detection
    - Learning robust representations with contrastive loss
    - Align loss
    - Train & inference
4. Align and Attend: Multimodal Summarization with Dual Contrastive Losses
  - 1. Core idea
  - 2. Loss functions
Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens
  - 1. Core idea
  - 2. Loss function
MaskCon: Masked Contrastive Learning for Coarse-Labelled Dataset
Dynamic Conceptional Contrastive Learning for Generalized Category Discovery

1. Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive Learning (image-text matching)

1. Goal

Train an alignment between the patch tokens of the vision encoder and the CLS token of the text encoder. The intended result is a single model that can identify the image regions corresponding to a given text input.

2. Task type

Open vocabulary semantic segmentation, using only image-text data.

3. Approach

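**Evaluation metric:** classification accuracy. It measures the patch-level agreement between the visual and text representations in the model: high classification accuracy indicates high agreement, and vice versa.
**Why contrastive learning:** semantically similar regions in an image should have higher similarity scores, that is, they should produce similar patch representations in the vision encoder.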

Cosine similarity is used to compute the similarity between patch representations, and a binary classification function determines whether two patches share the same target label.

The loss function used for contrastive learning is usually the InfoNCE loss.
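For reference, a common image-to-text form of the InfoNCE loss is written out here. This is the generic textbook form rather than a transcription of the paper's equation: the batch size $B$ and the temperature $\tau$ are assumptions, $x_i$ and $y_i$ denote a matched image and text, and $\phi$ is the compatibility (similarity) function between them.

$$
\mathcal{L}_{\text{InfoNCE}} = -\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp\big(\phi(x_i, y_i)/\tau\big)}{\sum_{j=1}^{B}\exp\big(\phi(x_i, y_j)/\tau\big)}
$$

In image-text pretraining a symmetric text-to-image term, with the roles of $x$ and $y$ swapped in the denominator, is usually added.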

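The image is x and the text is y.
The image representation is obtained as a weighted combination of the patch-level representations.
The weights are computed from the similarity between each patch embedding and the CLS embedding of the text.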
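In the loss function above, the updated compatibility function φ is then computed from this weighted image representation and the text CLS embedding.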
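The weighting scheme described above can be sketched in plain Python. This is a minimal illustration and not the paper's code: the helper names are hypothetical, the embeddings are assumed to be L2-normalized, and using a softmax to turn the patch-to-text similarities into weights is an assumption.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def patch_aligned_score(patch_embs, text_cls):
    """Compatibility between one image (a list of patch vectors) and one text CLS vector."""
    text = normalize(text_cls)
    patches = [normalize(p) for p in patch_embs]
    # Cosine similarity between every patch and the text CLS embedding.
    sims = [dot(p, text) for p in patches]
    # Assumption: a softmax turns the similarities into pooling weights.
    weights = softmax(sims)
    # The image representation is the weighted sum of the patch embeddings.
    image_rep = [
        sum(w * p[d] for w, p in zip(weights, patches))
        for d in range(len(text))
    ]
    return dot(image_rep, text), weights


patches = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
text_cls = [1.0, 0.0]
score, weights = patch_aligned_score(patches, text_cls)
print(weights, score)
```

The patch that points in the same direction as the text embedding receives the largest weight, so it dominates the pooled image representation that enters the contrastive loss.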

4. Summary

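Patch alignment computes the similarity between each patch and the text CLS token and uses these similarities as weights to build the image representation.
The model parameters are then updated with the InfoNCE loss.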

2. MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining (image distillation)

Core idea

The core idea of masked self-distillation is to distill representation from a full image to the representation predicted from a masked image

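Panel (d) of the paper's architecture comparison figure shows the proposed model. EI (without the overbar) is the student model, which receives the masked image; EI with an overbar is the teacher model in the distillation; ET is the text encoder.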

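In the contrastive learning part, the masked image representation is matched and aligned with the image.
In the knowledge distillation part, the teacher model teaches the student model, which sees the masked image.
So the two parts work together to achieve the model distillation.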

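The guidance is quiet and gradual.
In **vision-language contrastive learning**, the image is treated as a collection of pixels and the text as a collection of tokens.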

σ denotes the temperature of the loss functions.
For the image distillation, a distillation loss is used.
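A generic teacher-student distillation step can be sketched as follows. This is an illustration under stated assumptions and not MaskCLIP's exact loss: the cross-entropy form, the restriction to masked patch positions, the exponential-moving-average (EMA) teacher update, and all function names are assumptions made for the example.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distill_loss(teacher_logits, student_logits, masked, temperature=1.0):
    """Mean cross-entropy from teacher to student over the masked positions."""
    total, count = 0.0, 0
    for t, s, is_masked in zip(teacher_logits, student_logits, masked):
        if not is_masked:
            continue  # only masked patches contribute in this sketch
        target = softmax(t, temperature)       # teacher sees the full image
        prediction = softmax(s, temperature)   # student sees the masked image
        total -= sum(q * math.log(p) for q, p in zip(target, prediction))
        count += 1
    return total / max(count, 1)


def ema_update(teacher_weights, student_weights, momentum=0.999):
    """Assumed teacher update: exponential moving average of the student."""
    return [
        momentum * t + (1.0 - momentum) * s
        for t, s in zip(teacher_weights, student_weights)
    ]


teacher = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
student_good = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
student_bad = [[0.0, 0.0, 2.0], [2.0, 0.0, 0.0]]
masked = [True, False]

loss_good = distill_loss(teacher, student_good, masked)
loss_bad = distill_loss(teacher, student_bad, masked)
new_teacher = ema_update([1.0, 2.0], [0.0, 0.0], momentum=0.9)
print(loss_good, loss_bad, new_teacher)
```

A student whose predictions match the teacher on the masked position gets the lower loss, which is the behaviour a distillation objective is meant to reward.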

The final loss combines the contrastive loss and the distillation loss.

3. Twin Contrastive Learning with Noisy Labels (image classification)

1. Goal

A novel twin contrastive learning model to learn robust representations and handle noisy labels for classification.
The approach is to model label noise detection as an out-of-distribution (OOD) problem.

2. Model

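The model consists of four parts, following the sections of the paper:

(i) Section 3.1 models the data distribution of the model predictions and representations with a spherical Gaussian mixture model (GMM); (ii) Section 3.2 detects examples with wrong labels as out-of-distribution samples; (iii) Section 3.3 performs cross-supervision by bootstrapping the true targets; and (iv) Section 3.4 learns robust representations through contrastive learning and mixup.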


Out-Of-Distribution Label Noise Detection

Our idea is that the samples with clean labels should have the same cluster indices after linking the cluster index and class label
(Samples with clean labels should cluster together, while samples with out-of-distribution labels should follow a distribution that differs from that of the clean labels.)
**Regularization loss:**

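The first term maximizes the entropy of the average prediction, which keeps the predictions from collapsing into a single class. The second term is a minimum-entropy regularizer that encourages the model to make confident predictions, as studied in earlier semi-supervised learning literature [9].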
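The two terms can be made concrete with a small sketch. It assumes the regularizer is the negative entropy of the batch-averaged prediction plus the mean per-sample entropy, with equal weight on both terms; the function names are hypothetical and the exact weighting in the paper may differ.

```python
import math


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def regularization_loss(predictions):
    """predictions: one class-probability vector per sample."""
    n = len(predictions)
    k = len(predictions[0])
    mean_pred = [sum(p[c] for p in predictions) / n for c in range(k)]
    # Term 1: minimizing the negative entropy of the mean prediction
    # maximizes that entropy, which discourages collapse to one class.
    anti_collapse = -entropy(mean_pred)
    # Term 2: minimizing the mean per-sample entropy encourages
    # confident individual predictions.
    confidence = sum(entropy(p) for p in predictions) / n
    return anti_collapse + confidence


confident_balanced = [[1.0, 0.0], [0.0, 1.0]]
collapsed = [[1.0, 0.0], [1.0, 0.0]]
uncertain = [[0.5, 0.5], [0.5, 0.5]]

for name, preds in [("confident and balanced", confident_balanced),
                    ("collapsed", collapsed),
                    ("uncertain", uncertain)]:
    print(name, regularization_loss(preds))
```

Only predictions that are both confident per sample and balanced across classes reach the minimum, which is the behaviour the two terms are meant to produce together.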

Learning robust representations with contrastive loss


align loss


train & inference


4. Align and Attend: Multimodal Summarization with Dual Contrastive Losses

1. Core idea

Two novel contrastive losses model both inter-sample and intra-sample correlations.

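First, to exploit the alignment information between modalities, the authors propose an alignment-guided self-attention module that aligns the temporal correspondence between the video and text modalities and fuses cross-modal information in a unified manner.
Second, the dual contrastive losses, which combine the inter-sample and intra-sample contrastive losses, model cross-modal correlations at different granularities.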

2. Loss functions

Classification loss
Inter-Sample Contrastive Loss: we maximize the cosine similarity of the video embedding [CLSV] and the text embedding [CLST] from B real pairs in the batch while minimizing the cosine similarity of embeddings from the B² − B incorrect pairs
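One common way to implement such an objective is a symmetric softmax cross-entropy over the B × B cosine-similarity matrix, where the diagonal holds the B real pairs and the off-diagonal entries are the B² − B incorrect pairs. The sketch below follows that recipe; the temperature value and the function names are assumptions, and the paper's exact formulation may differ.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_direction(sim):
    """Cross-entropy with the diagonal entry as the positive in each row."""
    loss = 0.0
    for i, row in enumerate(sim):
        top = max(row)  # log-sum-exp with max subtraction for stability
        log_denominator = top + math.log(sum(math.exp(s - top) for s in row))
        loss += log_denominator - row[i]
    return loss / len(sim)


def inter_sample_loss(video_cls, text_cls, temperature=0.1):
    # B x B matrix of temperature-scaled cosine similarities.
    sim = [[cosine(v, t) / temperature for t in text_cls] for v in video_cls]
    sim_t = [list(col) for col in zip(*sim)]  # text-to-video direction
    return 0.5 * (contrastive_direction(sim) + contrastive_direction(sim_t))


video = [[1.0, 0.0], [0.0, 1.0]]
text_matched = [[0.9, 0.1], [0.1, 0.9]]
text_swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_matched = inter_sample_loss(video, text_matched)
loss_swapped = inter_sample_loss(video, text_swapped)
print(loss_matched, loss_swapped)
```

When the real pairs sit on the diagonal the loss is close to zero, and it grows sharply when the texts are swapped so that the most similar embeddings belong to incorrect pairs.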

Intra-Sample Contrastive Loss
these keyframes and key-sentences should be deeply correlated with each other and share similar high-level semantic meanings. Motivated by this observation, we propose the intra-sample contrastive loss which is calculated within each video and text pair sample rather than across different sample pairs

total loss

Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens

1. Core idea

unifying the information granularities of images and texts can help generate better multimodal representations

For an image, its patch embeddings are first extracted by an image encoder. The correspondence between the FDT (Finite Discrete Tokens) and the image is then measured by max pooling the attention weights of each FDT over all patches. Finally, the FDT-based image representation is computed as the attention-weighted sum of the FDT. The FDT-based embedding of the input text is built in the same way. The encoders and FDT are trained with the InfoNCE loss to pull close the FDT-based representations of matched image-text pairs while pushing away those of unmatched pairs.
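The three steps (patch-to-FDT attention, max pooling over patches, attention-weighted sum of FDT) can be sketched in plain Python. This is an illustration only: inner products as attention scores and a softmax as the normalizer are assumptions, the function names are hypothetical, and the paper may use a different normalization.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fdt_representation(patch_embs, fdt_embs):
    """Build an FDT-based representation from patch (or token) embeddings."""
    # Step 1: attention score of every patch to every FDT (assumed inner product).
    scores = [[dot(p, t) for t in fdt_embs] for p in patch_embs]
    # Step 2: max pooling over patches gives one relevance score per FDT.
    pooled = [max(row[j] for row in scores) for j in range(len(fdt_embs))]
    # Assumption: softmax normalizes the pooled scores into weights.
    weights = softmax(pooled)
    # Step 3: the representation is the weighted sum of the FDT embeddings.
    dim = len(fdt_embs[0])
    rep = [sum(w * t[d] for w, t in zip(weights, fdt_embs)) for d in range(dim)]
    return rep, weights


patch_embs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
fdt_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_rep, fdt_weights = fdt_representation(patch_embs, fdt_embs)
print(fdt_weights, image_rep)
```

Because the text side is encoded over the same shared set of FDT, image and text representations end up in the same space at the same granularity, which is the unification the section describes.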


2. Loss function


MaskCon: Masked Contrastive Learning for Coarse-Labelled Dataset

Dynamic Conceptional Contrastive Learning for Generalized Category Discovery

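The main challenge is that the unlabeled data contains instances not only from the known categories of the labeled data but also from novel categories. As a result, traditional novel category discovery (NCD) methods cannot be applied to GCD, because they assume that the unlabeled data comes only from novel categories. One effective approach to GCD is to apply self-supervised learning to learn discriminative representations for the unlabeled data. However, this approach largely ignores the underlying relationships between instances of the same concept (e.g., class, super-class, and sub-class), which leads to inferior representation learning. The paper proposes a Dynamic Conceptional Contrastive Learning (DCCL) framework, which effectively improves clustering accuracy by alternately estimating the underlying visual conceptions and learning conceptional representations. It also designs a dynamic conception generation and update mechanism that keeps conception learning consistent and thus further facilitates the optimization of DCCL.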
