CLIP image-text retrieval and similarity computation

news2025/4/26 12:50:42

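CLIP is a neural network proposed by OpenAI that efficiently learns visual concepts from natural language supervision.
CLIP can be applied to any visual classification benchmark simply by providing the names of the visual categories to recognize, similar to the "zero-shot" capabilities of GPT-2 and GPT-3.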

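1. Zero-shot image classification

Classification here does not mean that CLIP predicts a label directly. Instead, you supply a set of candidate labels, and it predicts a probability for each candidate.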

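Take this image as an example:
[image: cat.png]
Give CLIP some labels: "a dog", "a cat", "a man", "a tree", "food". It predicts a probability for each label, and the label with the highest probability is the final prediction.
You will see that "a cat" scores highest.

Likewise, to match multiple images against a single prompt, use logits_per_text from the code below.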

import torch
import clip
from PIL import Image

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model, preprocess = clip.load('ViT-B/32', device=device)

image = preprocess(Image.open('cat.png')).unsqueeze(0).to(device)
text = clip.tokenize(["a dog", "a cat", "a man", "a tree", "food"]).to(device)

with torch.no_grad():
    # Score the image against each text prompt.
    # logits_per_image: scores of the image against each prompt, Tensor(1, 5)
    # logits_per_text: scores of each prompt against the image, Tensor(5, 1)
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print('Label probs:', probs)
# output: Label probs: [[3.159e-03 9.927e-01 1.589e-03 3.490e-04 2.241e-03]]
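To make the relation between logits_per_image and logits_per_text more concrete, here is a minimal sketch in plain Python that does not need CLIP at all. The logit values are made up for illustration (they are not model outputs), and `softmax`, `transpose` and `argmax` are helper names introduced here. It assumes, as the shapes in the comments above suggest, that logits_per_text is the transpose of logits_per_image.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*matrix)]

def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=lambda i: row[i])

labels = ["a dog", "a cat", "a man", "a tree", "food"]

# Hypothetical logits for 2 images x 5 labels (illustrative values only).
logits_per_image = [
    [18.0, 24.0, 17.5, 16.0, 17.8],  # image 0
    [25.0, 19.0, 18.0, 17.0, 18.5],  # image 1
]

# One image against many labels: softmax over each row.
probs_per_image = [softmax(row) for row in logits_per_image]
best_labels = [labels[argmax(row)] for row in probs_per_image]

# One label against many images: transpose, then softmax over each row.
logits_per_text = transpose(logits_per_image)  # 5 labels x 2 images
probs_per_text = [softmax(row) for row in logits_per_text]
best_image_for_cat = argmax(probs_per_text[labels.index("a cat")])

print(best_labels)
print(best_image_for_cat)
```

With real tensors the same idea would be logits_per_text.softmax(dim=-1), where each row is one prompt and each column is one image.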

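2. Searching images with a text prompt

Now let's test with images from the COCO dataset, using the val2017 split, which has about 5,000 images.
Search for the 3 images closest to the prompt "a red bus" and see what comes back.
The similarity between the image features and the text features is computed with cosine similarity.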

import os
import itertools
import torch.nn as nn

# Reuses model, preprocess and device from section 1.
data_folder = 'coco/val2017'
images = []
for root, dirs, files in os.walk(data_folder):
    for file in files:
        if file.endswith('.jpg'):
            images.append(os.path.join(root, file))

text = clip.tokenize(['a red bus']).to(device)
with torch.no_grad():
    text_features = model.encode_text(text)  # encode the prompt once
result = {}
cos = nn.CosineSimilarity(dim=0)

for img in images:
    with torch.no_grad():
        image_preprocess = preprocess(Image.open(img)).unsqueeze(0).to(device)
        image_features = model.encode_image(image_preprocess)
        sim = cos(image_features[0], text_features[0]).item()
        sim = (sim + 1) / 2  # rescale [-1, 1] --> [0, 1]
        result[img] = sim

# Sort by similarity, highest first, and keep the top 3.
sorted_value = sorted(result.items(), key=lambda x: x[1], reverse=True)
sorted_res = dict(sorted_value)
top_3 = dict(itertools.islice(sorted_res.items(), 3))
print(top_3)
#
# fig,axs =plt.subplots(1,3)
#
# i=0
# for key in top_3:
#     key_img = cv2.cvtColor(cv2.imread(key),cv2.COLOR_BGR2RGB)
#     axs[i].imshow(key_img)
#     axs[i].set_title('sim='+"{:.3f}".format(top_3[key]))
#     axs[i].axis('off')
#     i=i+1
# fig.suptitle('a red bus')
#
# plt.show()

Ranked by similarity to the prompt from high to low, the following 3 images are retrieved.

[image: top-3 results for "a red bus"]
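The code above sorts every score and then slices off the first 3. If only a few top results are needed, `heapq.nlargest` from the standard library is one alternative that avoids building a fully sorted copy. The sketch below uses a hypothetical `top_k` helper and made-up file names and scores, purely for illustration.

```python
import heapq

def top_k(scores, k=3):
    """Return the k (path, score) pairs with the highest score, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

# Hypothetical file names and similarity scores (not real results).
result = {
    "a.jpg": 0.61,
    "b.jpg": 0.64,
    "c.jpg": 0.58,
    "d.jpg": 0.63,
    "e.jpg": 0.60,
}

top_3 = top_k(result, 3)
print(top_3)
```

For 5,000 images the difference is small either way, so this is mostly a matter of readability.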

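3. Image similarity

Given two images, compute their similarity.
Here we compare the two left-hand images from the "a red bus" results above.
The similarity is the cosine similarity of the two image features, which are produced by CLIP's image encoder.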

import torch.nn as nn

# Reuses model, preprocess and device from section 1.
img1 = 'bus1.jpg'
img2 = 'bus2.jpg'
cos = nn.CosineSimilarity(dim=0)

img1_process = preprocess(Image.open(img1)).unsqueeze(0).to(device)
img2_process = preprocess(Image.open(img2)).unsqueeze(0).to(device)

with torch.no_grad():
    img1_feature = model.encode_image(img1_process)
    img2_feature = model.encode_image(img2_process)

sim = cos(img1_feature[0], img2_feature[0]).item()
sim = (sim + 1) / 2  # rescale [-1, 1] --> [0, 1]
print("similarity: ", sim)
# output: similarity:  0.844970703125
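For readers who want to see what the cosine similarity and the (sim+1)/2 rescaling actually compute, here is a small dependency-free sketch. The function names `cosine_similarity` and `rescale` are introduced here for illustration, and the short vectors stand in for real CLIP features.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rescale(sim):
    """Map a cosine similarity from [-1, 1] to [0, 1]."""
    return (sim + 1) / 2

u = [1.0, 2.0, 3.0]
same_direction = rescale(cosine_similarity(u, [2.0, 4.0, 6.0]))
orthogonal = rescale(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
opposite = rescale(cosine_similarity(u, [-1.0, -2.0, -3.0]))

print(same_direction, orthogonal, opposite)
```

Vectors pointing the same way score close to 1, orthogonal vectors score 0.5, and opposite vectors score close to 0. Because the rescaling is monotonic, it changes the reported numbers but not the ranking of the results.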

4. Searching images with an image

Using the same red bus again, let's see what it retrieves from the COCO data.

[image: bus1.jpg, the query image]

import itertools
import torch

# Reuses model, preprocess, device and the images list from the sections above.
img1 = 'bus1.jpg'
input_image = preprocess(Image.open(img1)).unsqueeze(0).to(device)
with torch.no_grad():
    input_image_features = model.encode_image(input_image)  # encode the query once

cos = torch.nn.CosineSimilarity(dim=0)  # create once, outside the loop
result = {}
for img in images:
    with torch.no_grad():
        image_preprocess = preprocess(Image.open(img)).unsqueeze(0).to(device)
        image_features = model.encode_image(image_preprocess)
        sim = cos(image_features[0], input_image_features[0]).item()
        sim = (sim + 1) / 2  # rescale [-1, 1] --> [0, 1]
        result[img] = sim

# Sort by similarity, highest first, and keep the top 3.
sorted_value = sorted(result.items(), key=lambda x: x[1], reverse=True)
sorted_res = dict(sorted_value)
top_3 = dict(itertools.islice(sorted_res.items(), 3))
print(top_3)

[image: top-3 results for the bus1.jpg query]
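The loops in sections 2 and 4 re-encode every COCO image for each query, even though an image's features do not depend on the query. One possible refinement is to compute the features once, store them as unit-length vectors, and answer each query with dot products. The sketch below shows that idea in plain Python. `FeatureIndex` and `normalize` are hypothetical names, and the 3-dimensional vectors are toy stand-ins for CLIP features, not real model outputs.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

class FeatureIndex:
    """Keeps unit-length features so a query only needs dot products."""

    def __init__(self):
        self.paths = []
        self.vectors = []

    def add(self, path, feature):
        # Normalizing at insert time means it is done once per image.
        self.paths.append(path)
        self.vectors.append(normalize(feature))

    def search(self, query, k=3):
        # For unit vectors, the dot product equals the cosine similarity.
        q = normalize(query)
        scored = [
            (path, sum(a * b for a, b in zip(vec, q)))
            for path, vec in zip(self.paths, self.vectors)
        ]
        scored.sort(key=lambda item: item[1], reverse=True)
        return scored[:k]

# Toy features (illustrative only).
index = FeatureIndex()
index.add("bus1.jpg", [0.9, 0.1, 0.0])
index.add("bus2.jpg", [0.8, 0.3, 0.1])
index.add("cat.jpg", [0.0, 0.2, 0.9])

hits = index.search([1.0, 0.0, 0.0], k=2)
print([path for path, _ in hits])
```

In a real pipeline the stored vectors would come from model.encode_image, and the same index could serve both text queries (features from model.encode_text) and image queries.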

References:
https://medium.com/@jeremy-k/unlocking-openai-clip-part-1-intro-to-zero-shot-classification-f81194f4dff7
https://medium.com/@jeremy-k/unlocking-openai-clip-part-2-image-similarity-bf0224ab5bb0
