[VLM] CLIP: A Bridge Between Text and Images


GitHub: https://github.com/OpenAI/CLIP
Paper: Learning Transferable Visual Models From Natural Language Supervision

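CLIP stands for Contrastive Language-Image Pre-training. Contrastive learning is a method that focuses on learning the features shared by objects of the same class while separating the features of objects from different classes. Its core idea comes from how humans learn: we improve from positive feedback, and we also reflect on negative feedback and correct our mistakes.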

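CLIP is a neural network trained on (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet for a given image without being optimized directly for that task, similar to the zero-shot capabilities of GPT-2 and GPT-3. In addition, zero-shot CLIP matches the performance of the original ResNet-50 on ImageNet without using any of the original 1.28 million labeled examples, overcoming several major challenges in computer vision.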

Table of contents

Preface
I. The CLIP model
- 1.1 Model behavior and characteristics
- 1.2 Training data: WIT (WebImageText)
- 1.3 Model training
- 1.4 Loss function
- 1.5 Model hyperparameters
- 1.6 Zero-shot learning
II. Code walkthrough
- 2.1 Prediction demo
- 2.2 Loading the model
- 2.3 Image Preprocessing
- 2.4 Text Preprocessing: tokenization
- 2.5 model.encode_text
- 2.6 model.encode_image(image)
- 2.7 Calculating cosine similarity
- 2.8 Zero-Shot image classification
- 2.9 Zero-Shot Prediction
Limitations of CLIP
References

Preface

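Before CLIP, computer vision models were typically trained with supervised learning on a fixed set of predefined object labels, e.g. the 1,000 classes of ImageNet-1k or the 80 classes of COCO. Predefining the labels makes the data easier to collect and organize, and it makes the model easier to train. But it also simplifies the task and severely limits generalization: performance drops sharply when the model has to recognize new objects outside the predefined classes.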

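CLIP proposes taking the labels directly from text that describes the image. This greatly widens the source of supervision: any object mentioned in a description can potentially be learned, so the model is no longer restricted to a predefined label set. The authors collected a dataset of 400 million matched image-text pairs from the web and defined a simple pre-training task: predict which caption goes with which image, which can also be seen as attaching the right caption to each picture. After training, natural language can be used to steer the visual model toward predicting all kinds of categories. This breaks the traditional fixed-category classification paradigm, lets the model transfer better to downstream tasks, and gives it strong zero-shot ability.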

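CLIP consists of an image encoder (e.g. a Vision Transformer) and a text encoder (a Transformer). In principle both encoders could start from existing pre-trained models, although the paper trains both from scratch.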

[Figure 1: CLIP training overview; image not available]
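During training, the input is a pair consisting of an image and a text description. As shown in Figure 1, the input image is a dog and its description is "Pepper the aussie pup". The N descriptions pass through the text encoder to produce N feature vectors, and the N images pass through the image encoder to produce N feature vectors. The image encoder is interchangeable; the paper uses both ResNet and ViT. CLIP then performs contrastive learning on the N image feature vectors and N text feature vectors: a matched image-text pair is a positive sample and every other combination is a negative. In the resulting $N \times N$ similarity matrix, the $N$ diagonal entries are positives and the $N^2-N$ off-diagonal entries are negatives.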
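To make the positive/negative bookkeeping concrete, here is a minimal pure-Python sketch that builds the match matrix for a small batch and counts both kinds of pairs. The function names and the batch size of 4 are illustrative choices, not part of CLIP.

```python
from typing import List, Tuple


def target_matrix(n: int) -> List[List[int]]:
    """Build the n x n match matrix: 1 where image i is paired with text j, else 0."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def count_pairs(n: int) -> Tuple[int, int]:
    """Return (number of positive pairs, number of negative pairs) in a batch of n."""
    matrix = target_matrix(n)
    positives = sum(sum(row) for row in matrix)
    negatives = n * n - positives
    return positives, negatives


n = 4  # illustrative batch size, not a value from the paper
for row in target_matrix(n):
    print(row)
print(count_pairs(n))  # (4, 12): N positives on the diagonal, N^2 - N negatives
```

With larger batches the number of negatives per positive grows linearly in N, which is one reason a large batch size matters for contrastive training.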

I. The CLIP model

1.1 Model behavior and characteristics

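CLIP is a dual-encoder (two-tower) model with a Text Encoder and an Image Encoder. The training data takes the form (image, text), where in each correctly matched pair the text is an accurate description of the image. The model predicts, for every (image, text) combination, whether it matches: 1 for a matching pair, 0 otherwise.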

Text Encoder: encodes each text into a latent vector $T_i$ of shape $[1, 512]$, so $N$ texts give $[N, 512]$.
Image Encoder: encodes each image into a latent vector $I_i$ of shape $[1, 512]$, so $N$ images give $[N, 512]$.

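CLIP uses a lowercasing (case-insensitive) tokenizer, invoked through clip.tokenize(). By default the output is padded to a length of 77, so a batch of $B$ texts has shape $[B, 77]$ after tokenization. Because the Text Encoder and the Image Encoder both output a tensor of shape $[N, 512]$, the cosine similarity between every image and every text is easy to compute. CLIP can use either a ResNet or a ViT as the image backbone; the experiments show that ViT performs better than ResNet.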

1.2 Training data: WIT (WebImageText)

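Prior work mainly used three datasets: MS-COCO, Visual Genome and YFCC100M. In YFCC100M many images have automatically generated filenames such as 20160716_11397.jpg as their "title", or "descriptions" that contain camera exposure settings, and these clearly cannot serve as labels. After filtering to keep only images with natural-language titles or English descriptions, the dataset shrinks by a factor of 6 to only 15 million photos, roughly the size of ImageNet. Such data does not reflect the diversity available on the web, so results on these datasets alone would underestimate the model's potential. The authors therefore built a new dataset of 400 million (image, text) pairs collected from a variety of public sources on the Internet, called WIT (WebImageText).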

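For zero-shot evaluation, each dataset comes with two lists, classes and templates, where the string {} in a template is replaced with the corresponding class name. For the Facial Emotion Recognition 2013 dataset in particular, the authors used multiple class names for certain classes.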
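The template mechanism can be sketched in a few lines of plain Python. The two class names and two templates below are made up for illustration; the real lists are dataset-specific and much longer, and build_prompts is a hypothetical helper name.

```python
from typing import Dict, List

# Illustrative lists only; the real class names and templates differ per dataset.
classes = ["dog", "cat"]
templates = ["a photo of a {}.", "a blurry photo of a {}."]


def build_prompts(classes: List[str], templates: List[str]) -> Dict[str, List[str]]:
    """Fill every template with every class name."""
    return {name: [template.format(name) for template in templates] for name in classes}


prompts = build_prompts(classes, templates)
print(prompts["dog"])  # ['a photo of a dog.', 'a blurry photo of a dog.']
```

Each class ends up with one prompt per template, and these prompts are what gets tokenized and passed to the text encoder.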

1.3 Model training

The official CLIP repository does not provide a training script; see https://github.com/mlfoundations/open_clip for one.

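The authors trained a series of 5 ResNets and 3 Vision Transformers. For the ResNets they trained ResNet-50, ResNet-101, and three models that follow EfficientNet-style scaling and use roughly 4x, 16x and 64x the compute of a ResNet-50, denoted RN50x4, RN50x16 and RN50x64. For the Vision Transformers they trained ViT-B/32, ViT-B/16 and ViT-L/14. All models were trained for 32 epochs. The optimizer is Adam with decoupled weight decay regularization applied to all weights that are not gains or biases, and the learning rate is decayed with a cosine schedule.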
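As a rough illustration of what a cosine learning-rate schedule does, the sketch below computes the rate at a few points in training. It is a generic cosine decay to zero with no warmup; the base rate and step count are placeholder values chosen for the example, not settings confirmed by this article.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float) -> float:
    """Learning rate at `step` under a cosine decay from base_lr down to 0."""
    progress = min(max(step, 0), total_steps) / total_steps
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


BASE_LR = 5e-4       # placeholder value for illustration
TOTAL_STEPS = 1000   # placeholder value for illustration
for step in (0, 250, 500, 750, 1000):
    print(step, f"{cosine_lr(step, TOTAL_STEPS, BASE_LR):.6f}")
```

The rate falls slowly at first, fastest in the middle of training, and flattens out again near the end.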

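Initial hyperparameters were set with a combination of grid search, random search and manual tuning on the baseline ResNet-50 model trained for 1 epoch. Because of compute constraints, the hyperparameters were then adapted heuristically for the larger models.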

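The learnable temperature parameter $\tau$ was initialized to the equivalent of 0.07 and clipped so that the logits are never scaled by more than 100, which the authors found necessary to prevent training instability.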
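The relationship between the temperature 0.07, the learned parameter and the cap of 100 is easy to check numerically. The sketch below assumes the learned quantity is t = log(1 / temperature), as in the logit_scale initialization shown later in this article; applying the clip with min() on t is my own simplification of how the bound could be enforced.

```python
import math

INIT_TEMPERATURE = 0.07   # initial temperature quoted in the text
MAX_LOGIT_SCALE = 100.0   # upper bound on the logit multiplier quoted in the text

# The learned quantity is t = log(1 / temperature), so exp(t) = 1 / temperature.
t_init = math.log(1.0 / INIT_TEMPERATURE)


def logit_scale(t: float) -> float:
    """Multiplier applied to cosine similarities, clipped at MAX_LOGIT_SCALE."""
    return math.exp(min(t, math.log(MAX_LOGIT_SCALE)))


print(round(t_init, 3))               # 2.659
print(round(logit_scale(t_init), 2))  # 14.29
print(round(logit_scale(10.0), 2))    # 100.0
```

So training starts with cosine similarities multiplied by about 14.3, and however large t grows the multiplier never exceeds 100.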

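Training uses a very large minibatch size of 32,768. Mixed precision is used to speed up training and save memory. To save further memory, the authors also use gradient checkpointing, half-precision Adam statistics, and half-precision stochastically rounded text encoder weights. The computation of embedding similarities is also sharded, with each GPU computing only the subset of pairwise similarities needed for its local batch of embeddings.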

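The largest ResNet model, RN50x64, took 18 days to train on 592 V100 GPUs, while the largest Vision Transformer took 12 days on 256 V100 GPUs. ViT-L/14 was additionally pre-trained for one more epoch at a higher resolution of 336 pixels to boost performance, similar to FixRes. This model is denoted ViT-L/14@336px.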

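1.4 Loss function

CLIP uses a symmetric loss: a cross-entropy loss is computed over the similarity matrix along the rows and along the columns, and the two are averaged.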


Mathematical form and implementation of the loss:
Given a batch of $N$ image-text pairs, the loss is the average of two cross-entropy terms:

$L=\frac{1}{2N}\Biggl( \sum^N_{i=1} \biggl( -\log \frac{e^{\frac{S_{i,i}}{\tau}}}{\sum^N_{j=1}e^{\frac{S_{i,j}}{\tau}}} \biggr) + \sum^N_{i=1} \biggl(-\log \frac{e^{\frac{S_{i,i}}{\tau}}}{\sum^N_{j=1} e^{\frac{S_{j,i}} {\tau} } } \biggr) \Biggr)$

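where $S_{i,j}$ is the dot product of the L2-normalized feature vectors of image $i$ and text $j$ (i.e. their cosine similarity), and $\tau$ is the learnable temperature. The pseudocode multiplies the similarities by np.exp(t), which corresponds to $1/\tau$.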

The pseudocode from the paper:

# image_encoder - ResNet or Vision Transformer
# text_encoder - CBOW or Text Transformer
# I[n, h, w, c] - minibatch of aligned images
# T[n, l] - minibatch of aligned texts
# W_i[d_i, d_e] - learned proj of image to embed
# W_t[d_t, d_e] - learned proj of text to embed
# t - learned temperature parameter
# extract feature representations of each modality
I_f = image_encoder(I) #[n, d_i]
T_f = text_encoder(T) #[n, d_t]
# joint multimodal embedding [n, d_e]
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)
# scaled pairwise cosine similarities [n, n]
logits = np.dot(I_e, T_e.T) * np.exp(t)
# symmetric loss function
labels = np.arange(n)
# image-to-text loss; axis 0 is the image (row) dimension of logits
loss_i = cross_entropy_loss(logits, labels, axis=0)
# text-to-image loss
loss_t = cross_entropy_loss(logits, labels, axis=1)
loss = (loss_i + loss_t)/2
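The pseudocode above is not runnable on its own, so here is a small dependency-free sketch of the same symmetric loss that follows the formula for $L$ term by term. It takes already-computed embeddings instead of running encoders, and the helper names (l2_normalize, log_softmax, clip_loss) and the toy 2-dimensional embeddings are my own; a real implementation would use tensors on a GPU.

```python
import math
from typing import List

Matrix = List[List[float]]


def l2_normalize(rows: Matrix) -> Matrix:
    """Scale every row to unit Euclidean length."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row))
        out.append([v / norm for v in row])
    return out


def log_softmax(values: List[float]) -> List[float]:
    """Numerically stable log-softmax of a list of logits."""
    m = max(values)
    log_z = m + math.log(sum(math.exp(v - m) for v in values))
    return [v - log_z for v in values]


def clip_loss(image_emb: Matrix, text_emb: Matrix, temperature: float = 0.07) -> float:
    """Symmetric cross-entropy over the N x N similarity matrix."""
    img, txt = l2_normalize(image_emb), l2_normalize(text_emb)
    n = len(img)
    # logits[i][j] = cosine similarity of image i and text j, divided by the temperature
    logits = [[sum(a * b for a, b in zip(img[i], txt[j])) / temperature
               for j in range(n)] for i in range(n)]
    # image -> text: softmax over each row, the correct text for image i is column i
    loss_i = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # text -> image: softmax over each column, the correct image for text j is row j
    loss_t = -sum(log_softmax([logits[i][j] for i in range(n)])[j] for j in range(n)) / n
    return (loss_i + loss_t) / 2


aligned = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(f"aligned pairs: {aligned:.6f}, swapped pairs: {swapped:.2f}")
```

Correctly aligned pairs give a loss near zero, while swapping the texts pushes the loss up to roughly 1 / temperature, which is the behavior the contrastive objective is meant to penalize.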


1.5 Model hyperparameters

[Table: model hyperparameters; image not available]

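1.6 Zero-shot learning

Zero-shot learning usually refers to generalizing to previously unseen object categories in image classification; more broadly, it is the study of generalization to classes never seen in training.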

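CLIP is pre-trained to predict whether an image and a text snippet are paired together in its dataset. To classify with it, for each dataset CLIP uses the names of all the classes as the set of candidate texts and predicts the most probable (image, text) pair. In more detail, the image embedding and the embeddings of the candidate texts are computed by their respective encoders, the cosine similarities between these embeddings are computed, scaled by the temperature parameter, and normalized into a probability distribution with a softmax. This prediction layer is a multinomial logistic regression classifier with L2-normalized inputs, L2-normalized weights, no bias, and temperature scaling. Interpreted this way, the image encoder is the computer vision backbone that computes a feature representation of the image, and the text encoder is a hypernetwork that generates the weights of a linear classifier from the text describing the visual concept each class represents.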
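The "linear classifier generated from text" view can be sketched without any model at all. In the example below the class embeddings play the role of classifier weights; all vectors are made-up 3-dimensional stand-ins for real 512-dimensional features, the helper names are hypothetical, and the scale of 100 is the factor used by the prediction demo later in this article.

```python
import math
from typing import List


def normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_probs(image_feature: List[float],
                    class_features: List[List[float]],
                    scale: float = 100.0) -> List[float]:
    """Softmax over scaled cosine similarities: no bias, L2-normalized inputs and weights."""
    img = normalize(image_feature)
    logits = [scale * sum(a * b for a, b in zip(img, normalize(w))) for w in class_features]
    return softmax(logits)


# Made-up embeddings, for illustration only.
class_names = ["dog", "cat", "car"]
class_features = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
image_feature = [0.8, 0.3, 0.1]

probs = zero_shot_probs(image_feature, class_features)
best = class_names[probs.index(max(probs))]
print(best, [round(p, 4) for p in probs])
```

Changing the set of classes only changes class_features; nothing about the image side needs to be retrained, which is what makes the classifier zero-shot.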

II. Code walkthrough

Before walking through the code, here is a summary of the commonly used clip API:

import clip
# Returns the names of the available CLIP models.
clip.available_models()
> ['RN50', 'RN101', 'RN50x4', 'RN50x16', 'RN50x64', 'ViT-B/32', 'ViT-B/16', 'ViT-L/14', 'ViT-L/14@336px']

# Returns the model and the TorchVision transform needed by the model, specified by the model name returned by clip.available_models(). 
# It will download the model as necessary. The name argument can also be a path to a local checkpoint.
# The device to run the model can be optionally specified, and the default is to use the first CUDA device if there is any, 
# otherwise the CPU. When jit is False, a non-JIT version of the model will be loaded.
clip.load(name, device=..., jit=False)

# Returns a LongTensor containing tokenized sequences of given text input(s). This can be used as the input to the model
clip.tokenize(text: Union[str, List[str]], context_length=77)

############################################################################################################
"""The model returned by clip.load() supports the following methods:"""
#Given a batch of images, returns the image features encoded by the vision portion of the CLIP model.
model.encode_image(image: Tensor)

# Given a batch of text tokens, returns the text features encoded by the language portion of the CLIP model.
model.encode_text(text: Tensor)

# Given a batch of images and a batch of text tokens, returns two Tensors, containing the logit scores corresponding to each image and text input. 
# The values are cosine similarities between the corresponding image and text features, times 100.
model(image: Tensor, text: Tensor)

2.1 Prediction demo

import os
import clip
import torch
from torchvision.datasets import CIFAR100

# Load the model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-B/32', device)

# Download the dataset
cifar100 = CIFAR100(root=os.path.expanduser("~/.cache"), download=True, train=False)

# Prepare the inputs
image, class_id = cifar100[3637]
image_input = preprocess(image).unsqueeze(0).to(device)
text_inputs = torch.cat([clip.tokenize(f"a photo of a {c}") for c in cifar100.classes]).to(device)

# Calculate features
with torch.no_grad():
    image_features = model.encode_image(image_input)
    text_features = model.encode_text(text_inputs)

# Pick the top 5 most similar labels for the image
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
values, indices = similarity[0].topk(5)

# Print the result
print("\nTop predictions:\n")
for value, index in zip(values, indices):
    print(f"{cifar100.classes[index]:>16s}: {100 * value.item():.2f}%")

2.2 Loading the model

Loading the model weights

import clip
import numpy as np

clip.available_models()


model, preprocess = clip.load("ViT-B/32")
model.cuda().eval()
input_resolution = model.visual.input_resolution
context_length = model.context_length
vocab_size = model.vocab_size

print("Model parameters:", f"{np.sum([int(np.prod(p.shape)) for p in model.parameters()]):,}")
print("Input resolution:", input_resolution)
print("Context length:", context_length)
print("Vocab size:", vocab_size)

Model parameters: 151,277,313
Input resolution: 224
Context length: 77
Vocab size: 49408

Prediction on a single image

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad():
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)  # prints: [[0.9927937  0.00421068 0.00299572]]

model.forward(image, text) is implemented as follows:

self.logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07))

def forward(self, image, text):
    image_features = self.encode_image(image)
    text_features = self.encode_text(text)

    # normalized features
    image_features = image_features / image_features.norm(dim=1, keepdim=True)  # L2-normalize
    text_features = text_features / text_features.norm(dim=1, keepdim=True)  # L2-normalize

    # cosine similarity as logits
    logit_scale = self.logit_scale.exp()
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    # shape = [global_batch_size, global_batch_size]
    return logits_per_image, logits_per_text

2.3 Image Preprocessing

Image preprocessing is fairly standard: the image is resized and center-cropped, converted to RGB, converted to a tensor, and finally normalized.

Compose(
    Resize(size=224, interpolation=bicubic, max_size=None, antialias=None)
    CenterCrop(size=(224, 224))
    _convert_image_to_rgb
    ToTensor()
    Normalize(mean=(0.48145466, 0.4578275, 0.40821073), std=(0.26862954, 0.26130258, 0.27577711))
)
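The arithmetic behind this pipeline can be worked through by hand. The sketch below is a plain-Python approximation of the geometry and of the per-channel normalization, using the mean and std values printed above; the function names are hypothetical and the exact rounding inside torchvision may differ slightly from what is shown here.

```python
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)
TARGET = 224


def resized_size(width: int, height: int, target: int = TARGET):
    """Scale so the shorter side equals target, keeping the aspect ratio."""
    if width <= height:
        return target, int(target * height / width)
    return int(target * width / height), target


def center_crop_box(width: int, height: int, target: int = TARGET):
    """(left, top, right, bottom) of the centered target x target crop."""
    left = (width - target) // 2
    top = (height - target) // 2
    return left, top, left + target, top + target


def normalize_pixel(rgb):
    """Map an 8-bit RGB pixel to normalized channel values: (x / 255 - mean) / std."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, CLIP_MEAN, CLIP_STD))


w, h = resized_size(640, 480)
print((w, h))                 # (298, 224)
print(center_crop_box(w, h))  # (37, 0, 261, 224)
print(tuple(round(v, 3) for v in normalize_pixel((255, 255, 255))))
```

A 640x480 photo therefore loses 37 columns on each side, and a pixel whose color equals the dataset mean maps to zero in every channel.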

2.4 Text Preprocessing: tokenization

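A token can be thought of as the smallest unit of meaning. Tokenization converts the input text into a sequence of tokens, ideally so that each token carries a reasonably complete meaning of its own. For example, a simple tokenizer might split "Hello World!" into four tokens, ["Hello", " ", "World", "!"]. Each token is then mapped to a number, called its token ID, which represents the token from then on. There are many tokenization methods, such as BPE and jieba.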

CLIP uses a lowercasing (case-insensitive) BPE tokenizer, invoked through clip.tokenize(). By default the output is padded to a length of 77.

>>> clip.tokenize("Hello World!")

tensor([[49406,  3306,  1002,   256, 49407,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0]])

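Start-of-text and end-of-text markers are added here, with IDs 49406 and 49407 respectively. The three IDs between them (3306, 1002, 256) are the lowercased tokens "hello", "world" and "!", and the rest of the sequence is zero padding.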

For the input texts = ['a diagram', 'a dog', 'a cat'], the output has shape [Batch, 77] (here [3, 77]); each row again starts with 49406 and has 49407 right after its last token.
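The framing and padding step is simple enough to reproduce by hand. The sketch below takes the BPE IDs for "Hello World!" from the tensor printed above and rebuilds the full row; pad_tokens is a hypothetical helper, not part of the clip package, and it does not perform the BPE encoding itself.

```python
from typing import List

SOT_ID = 49406          # start-of-text id, as printed above
EOT_ID = 49407          # end-of-text id, as printed above
CONTEXT_LENGTH = 77     # default context length


def pad_tokens(bpe_ids: List[int], context_length: int = CONTEXT_LENGTH) -> List[int]:
    """Wrap BPE ids in start/end markers and right-pad with zeros."""
    ids = [SOT_ID] + list(bpe_ids) + [EOT_ID]
    if len(ids) > context_length:
        raise ValueError(f"input is too long for context length {context_length}")
    return ids + [0] * (context_length - len(ids))


row = pad_tokens([3306, 1002, 256])  # ids for "hello", "world", "!" from the output above
print(row[:6], len(row))  # [49406, 3306, 1002, 256, 49407, 0] 77
```

Because the row has a fixed length, texts of different lengths can be stacked directly into one [Batch, 77] tensor.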

2.5 model.encode_text

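The token IDs are first passed through the token embedding, and then the positional encoding positional_embedding is added.

positional_embedding is created with nn.Parameter as a tensor of shape [context_length, transformer_width], matching the per-sequence shape of the token embeddings. nn.Parameter is a special kind of Tensor used to define trainable parameters of a model: it marks the tensor as a model parameter so that it is automatically included in the model's parameter list and optimized during training. Its main features are:

Automatic registration: an nn.Parameter assigned as an attribute is managed by PyTorch's Module class. It therefore shows up in the list returned by model.parameters() and is updated during optimization.

Gradient computation: nn.Parameter enables gradients by default (requires_grad=True), so its gradient is computed during backpropagation and the optimizer can update it. A plain Tensor with requires_grad=True set by hand is not added to the model's parameters automatically.

Initialization: the initial value of a parameter is whatever tensor is passed to nn.Parameter. Typically this tensor is created with a tensor factory function, for example torch.empty.

torch.empty creates an uninitialized tensor of the given shape: its contents are whatever happened to be in memory, i.e. arbitrary garbage values, so they must be filled in before use (CLIP does this in its initialize_parameters() method).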

self.token_embedding = nn.Embedding(vocab_size, transformer_width)
self.positional_embedding = nn.Parameter(torch.empty(self.context_length, transformer_width))
self.text_projection = nn.Parameter(torch.empty(transformer_width, embed_dim))


def encode_text(self, text):
    x = self.token_embedding(text).type(self.dtype)  # [batch_size, n_ctx, d_model]

    x = x + self.positional_embedding.type(self.dtype)
    x = x.permute(1, 0, 2)  # NLD -> LND
    x = self.transformer(x)
    x = x.permute(1, 0, 2)  # LND -> NLD
    x = self.ln_final(x).type(self.dtype)

    # x.shape = [batch_size, n_ctx, transformer.width]
    # take features from the eot embedding (eot_token is the highest number in each sequence)
    x = x[torch.arange(x.shape[0]), text.argmax(dim=-1)] @ self.text_projection  # pick the hidden state at the EOT position: [batch_size, n_ctx, width] -> [batch_size, width], then project to [batch_size, embed_dim]

    return x

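Attentive readers may notice that x.permute(1, 0, 2) is applied twice. The reason is that the Transformer here is built on nn.MultiheadAttention, which by default expects its input in (sequence length, batch, width) order, i.e. LND. The first permute converts from NLD to LND before the Transformer and the second converts back afterwards.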
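The `text.argmax(dim=-1)` indexing in encode_text works because the end-of-text ID is the largest value in every token row, so its position is simply the index of the row maximum. A small list-based sketch of that trick follows; the second row uses placeholder token IDs (1000, 2000) that I made up, and eot_positions is a hypothetical helper.

```python
from typing import List


def eot_positions(token_batch: List[List[int]]) -> List[int]:
    """Index of the largest token id in each row, i.e. the end-of-text position."""
    return [max(range(len(row)), key=lambda i: row[i]) for row in token_batch]


batch = [
    [49406, 3306, 1002, 256, 49407] + [0] * 72,  # "Hello World!" from section 2.4
    [49406, 1000, 2000, 49407] + [0] * 73,       # placeholder ids, for illustration only
]
print(eot_positions(batch))  # [4, 3]
```

Selecting the hidden state at these positions gives one vector per text, which is then multiplied by text_projection to obtain the text feature.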

2.6 model.encode_image(image)

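encode_image uses a ViT. The convolution conv1 first splits the image into patches and embeds them, giving a 7x7 grid of patch features for ViT-B/32 at 224x224 input. The class token self.class_embedding is then prepended and the positional encoding self.positional_embedding is added, followed by self.transformer (multi-head attention). Finally the class-token output is projected by self.proj into a 512-dimensional latent vector.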

class VisionTransformer(nn.Module):
    def __init__(self, input_resolution: int, patch_size: int, width: int, layers: int, heads: int, output_dim: int):
        super().__init__()
        self.input_resolution = input_resolution
        self.output_dim = output_dim
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=width, kernel_size=patch_size, stride=patch_size, bias=False)
        scale = width ** -0.5
        self.class_embedding = nn.Parameter(scale * torch.randn(width))
        self.positional_embedding = nn.Parameter(scale * torch.randn((input_resolution // patch_size) ** 2 + 1, width))
        self.ln_pre = LayerNorm(width)
        self.transformer = Transformer(width, layers, heads)
        self.ln_post = LayerNorm(width)
        self.proj = nn.Parameter(scale * torch.randn(width, output_dim))

    def forward(self, x: torch.Tensor):
        x = self.conv1(x)  # shape = [*, width, grid, grid]
        x = x.reshape(x.shape[0], x.shape[1], -1)  # shape = [*, width, grid ** 2]
        x = x.permute(0, 2, 1)  # shape = [*, grid ** 2, width]
        x = torch.cat([self.class_embedding.to(x.dtype) + torch.zeros(x.shape[0], 1, x.shape[-1], dtype=x.dtype, device=x.device), x], dim=1)  # shape = [*, grid ** 2 + 1, width]
        x = x + self.positional_embedding.to(x.dtype)
        x = self.ln_pre(x)

        x = x.permute(1, 0, 2)  # NLD -> LND
        x = self.transformer(x)
        x = x.permute(1, 0, 2)  # LND -> NLD

        x = self.ln_post(x[:, 0, :])

        if self.proj is not None:
            x = x @ self.proj

        return x
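The shapes in the comments above follow directly from the input resolution and the patch size. As a quick check, this sketch computes the patch grid and the sequence length (patches plus one class token) for the ViT variants named in this article; vit_sequence_length is a hypothetical helper mirroring the `(input_resolution // patch_size) ** 2 + 1` expression in __init__.

```python
def vit_sequence_length(input_resolution: int, patch_size: int):
    """Return (grid side, number of tokens including the class token)."""
    grid = input_resolution // patch_size
    return grid, grid * grid + 1


print(vit_sequence_length(224, 32))  # (7, 50)   ViT-B/32
print(vit_sequence_length(224, 16))  # (14, 197) ViT-B/16
print(vit_sequence_length(336, 14))  # (24, 577) ViT-L/14@336px
```

A smaller patch size or a higher resolution lengthens the sequence, and self-attention cost grows quadratically with that length.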

2.7 Calculating cosine similarity

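import matplotlib.pyplot as plt

# image_features, text_features, descriptions, texts and original_images
# are assumed to have been computed in earlier steps.
image_features /= image_features.norm(dim=-1, keepdim=True)  # L2-normalize
text_features /= text_features.norm(dim=-1, keepdim=True)  # L2-normalize
similarity = text_features.cpu().numpy() @ image_features.cpu().numpy().T  # cosine similarity, shape [num_texts, num_images]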

count = len(descriptions)

plt.figure(figsize=(20, 14))
plt.imshow(similarity, vmin=0.1, vmax=0.3)
# plt.colorbar()
plt.yticks(range(count), texts, fontsize=18)
plt.xticks([])
for i, image in enumerate(original_images):
    plt.imshow(image, extent=(i - 0.5, i + 0.5, -1.6, -0.6), origin="lower")
for x in range(similarity.shape[1]):
    for y in range(similarity.shape[0]):
        plt.text(x, y, f"{similarity[y, x]:.2f}", ha="center", va="center", size=12)

for side in ["left", "top", "right", "bottom"]:
  plt.gca().spines[side].set_visible(False)

plt.xlim([-0.5, count - 0.5])
plt.ylim([count + 0.5, -2])

plt.title("Cosine similarity between text and image features", size=20)


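2.8 Zero-Shot image classification

import os

import matplotlib.pyplot as plt
import numpy as np
from torchvision.datasets import CIFAR100

# model, preprocess, image_features and original_images come from the previous sections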

cifar100 = CIFAR100(os.path.expanduser("~/.cache"), transform=preprocess, download=True)

text_descriptions = [f"This is a photo of a {label}" for label in cifar100.classes]
text_tokens = clip.tokenize(text_descriptions).cuda()

with torch.no_grad():
    text_features = model.encode_text(text_tokens).float()
    text_features /= text_features.norm(dim=-1, keepdim=True)

text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
top_probs, top_labels = text_probs.cpu().topk(5, dim=-1)

plt.figure(figsize=(16, 16))

for i, image in enumerate(original_images):
    plt.subplot(4, 4, 2 * i + 1)
    plt.imshow(image)
    plt.axis("off")

    plt.subplot(4, 4, 2 * i + 2)
    y = np.arange(top_probs.shape[-1])
    plt.grid()
    plt.barh(y, top_probs[i])
    plt.gca().invert_yaxis()
    plt.gca().set_axisbelow(True)
    plt.yticks(y, [cifar100.classes[index] for index in top_labels[i].numpy()])
    plt.xlabel("probability")

plt.subplots_adjust(wspace=0.5)
plt.show()


2.9 Zero-Shot Prediction

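# pip install git+https://github.com/modestyachts/ImageNetV2_pytorch
import clip
import torch
from tqdm import tqdm
from imagenetv2_pytorch import ImageNetV2Dataset

# model and preprocess come from clip.load(...) as in section 2.2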


def zeroshot_classifier(classnames, templates):
    with torch.no_grad():
        zeroshot_weights = []
        for classname in tqdm(classnames):
            texts = [template.format(classname) for template in templates] #format with class
            texts = clip.tokenize(texts).cuda() #tokenize
            class_embeddings = model.encode_text(texts) #embed with text encoder
            class_embeddings /= class_embeddings.norm(dim=-1, keepdim=True)
            class_embedding = class_embeddings.mean(dim=0)
            class_embedding /= class_embedding.norm()
            zeroshot_weights.append(class_embedding)
        zeroshot_weights = torch.stack(zeroshot_weights, dim=1).cuda()
    return zeroshot_weights


def accuracy(output, target, topk=(1,)):
    pred = output.topk(max(topk), 1, True, True)[1].t()
    correct = pred.eq(target.view(1, -1).expand_as(pred))
    return [float(correct[:k].reshape(-1).float().sum(0, keepdim=True).cpu().numpy()) for k in topk]


if __name__ == '__main__':

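    # imagenet_classes and imagenet_templates are the class-name and prompt-template
    # lists described in section 1.2; they are assumed to be defined before this point.
    zeroshot_weights = zeroshot_classifier(imagenet_classes, imagenet_templates)

    dataset = ImageNetV2Dataset(transform=preprocess)
    loader = torch.utils.data.DataLoader(dataset, batch_size=32, num_workers=2)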

    with torch.no_grad():
        top1, top5, n = 0., 0., 0.
        for i, (images, target) in enumerate(tqdm(loader)):
            images = images.cuda()
            target = target.cuda()

            # predict
            image_features = model.encode_image(images)
            image_features /= image_features.norm(dim=-1, keepdim=True)
            logits = 100. * image_features @ zeroshot_weights

            # measure accuracy
            acc1, acc5 = accuracy(logits, target, topk=(1, 5))
            top1 += acc1
            top5 += acc5
            n += images.size(0)

    top1 = (top1 / n) * 100
    top5 = (top5 / n) * 100

    print(f"Top-1 accuracy: {top1:.2f}")
    print(f"Top-5 accuracy: {top5:.2f}")
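The core of zeroshot_classifier above is prompt ensembling: normalize the embedding of each filled-in template, average them, and normalize the average again. A dependency-free sketch of just that step is shown below, with made-up 2-dimensional embeddings standing in for real text features and hypothetical helper names.

```python
import math
from typing import List


def normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def ensemble(template_embeddings: List[List[float]]) -> List[float]:
    """Average the unit-length template embeddings, then renormalize the mean."""
    unit = [normalize(e) for e in template_embeddings]
    dim = len(unit[0])
    mean = [sum(e[d] for e in unit) / len(unit) for d in range(dim)]
    return normalize(mean)


# Two made-up template embeddings for one class, with different lengths.
class_embedding = ensemble([[2.0, 0.0], [0.0, 3.0]])
print([round(x, 4) for x in class_embedding])  # [0.7071, 0.7071]
```

Normalizing before averaging means every template contributes equally regardless of its raw magnitude, and the final renormalization keeps the class vector comparable to the unit-length image features.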

Limitations of CLIP

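On many datasets zero-shot CLIP is only on par with the baseline (a linear classifier on top of ResNet-50 features), and that baseline itself performs well below the state of the art on most of these datasets.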

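CLIP also performs poorly on several fine-grained classification datasets and struggles with more abstract tasks, e.g. counting the number of objects in an image. In short, CLIP is excellent at retrieval but falls short on tasks that require logical reasoning, such as VQA (Visual Question Answering).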

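Although CLIP generalizes well, it still performs poorly when the test data is truly out of distribution relative to its training data. The training set has 400 million image-text pairs, but images from a few specific domains are very rare in it, so for CLIP data from those domains is effectively out of distribution.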

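Like other deep learning models, CLIP is not data-efficient and still needs a huge amount of data to train well. The training set has 400 million image-text pairs and training ran for 32 epochs, which is equivalent to one pass over 12.8 billion pairs. Assuming 0.01 s per image, one pass over all of that data would take about 4.06 years.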
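The estimate above is straightforward arithmetic and can be reproduced in a few lines. The 0.01 s per image figure is this article's assumption, and a year is taken as 365 days.

```python
PAIRS = 400_000_000        # image-text pairs in the training set
EPOCHS = 32                # training epochs
SECONDS_PER_IMAGE = 0.01   # assumed viewing time per image
SECONDS_PER_YEAR = 365 * 24 * 3600

total_pairs_seen = PAIRS * EPOCHS
years = total_pairs_seen * SECONDS_PER_IMAGE / SECONDS_PER_YEAR

print(f"{total_pairs_seen:,}")  # 12,800,000,000
print(round(years, 2))          # 4.06
```

At one image per second instead of one per 0.01 s, the same calculation gives roughly 406 years.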

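Although CLIP can do zero-shot classification, the candidate classes must be defined in advance. A more flexible alternative is to generate the image caption directly with a language model. How to combine the contrastive objective with a language-modeling objective is therefore one direction for improving CLIP.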

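Many complex tasks or concepts cannot be described precisely in natural language; providing the model with a few examples for the downstream task would help its judgment considerably. But CLIP was not designed for the few-shot setting, and as a result, when CLIP is given a few examples in a few-shot task, its performance can actually be worse than in the zero-shot setting.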