Image-Text Feature Matching with YOLOv8 + CLIP

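This article combines the fast object detection of YOLOv8s with CLIP's image-text matching to show how deep learning can process and analyze complex multimodal data. The approach is not limited to academic research: it also applies to industrial, commercial, and everyday consumer products that need smarter human-computer interaction and information processing.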

I. Object Detection and Image Processing

1. Object Detection Model: YOLOv8s

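Object detection is one of the core problems in computer vision: identifying the objects in an image and localizing each one precisely. Since Redmon et al. introduced YOLO (You Only Look Once) in 2016, the family has been known for pairing high inference speed with strong accuracy. Its central idea is to treat detection as an end-to-end regression problem, predicting multiple bounding boxes and their class probabilities in a single forward pass. YOLOv8s is the small ("s") variant of YOLOv8, which Ultralytics released in 2023; it keeps the family's main strengths, real-time performance and high accuracy, while refining the architecture and training procedure. A deep convolutional network analyzes the whole image at once to predict object classes and positions. End-to-end training simplifies the pipeline and helps the model generalize to objects of different sizes.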

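In this article we use a pretrained YOLOv8s model for object detection. It takes the raw image as input and outputs, for every detection, the object class, a confidence score, and the bounding-box coordinates, which the later processing steps build on.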
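To make the shape of that output concrete, here is a minimal sketch that holds one detection as a plain record and keeps only the confident ones. The `Detection` class, the `keep_confident` helper, the 0.5 threshold, and the sample values are all assumptions made for illustration; the article's own code does not filter by confidence, and YOLO returns its own result objects rather than this class.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """One detected object (hypothetical container, not a YOLO type)."""
    class_id: int        # index into the class-name mapping
    confidence: float    # score in [0, 1]
    box_xyxyn: tuple     # (xmin, ymin, xmax, ymax), normalized to [0, 1]


def keep_confident(detections, min_conf=0.5):
    """Return detections with confidence >= min_conf, highest first."""
    kept = [d for d in detections if d.confidence >= min_conf]
    return sorted(kept, key=lambda d: d.confidence, reverse=True)


# Made-up sample detections, for illustration only
detections = [
    Detection(41, 0.91, (0.10, 0.20, 0.30, 0.60)),
    Detection(0, 0.35, (0.40, 0.10, 0.90, 0.95)),
    Detection(39, 0.72, (0.55, 0.50, 0.65, 0.80)),
]

for d in keep_confident(detections):
    print(d.class_id, d.confidence, d.box_xyxyn)
```

Filtering like this is optional; it simply reduces the number of crops that the later matching step has to score.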


2. Image Processing Pipeline

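The pipeline starts by loading the image with PIL (Python Imaging Library, maintained today as Pillow), a widely used Python image-processing library. The loaded image is passed to YOLOv8s for inference, which returns the class, confidence, and bounding-box coordinates of every detected object. The code then draws each bounding box on a copy of the original image together with a label showing the class and confidence, so the detection quality can be checked at a glance.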
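The drawing step relies on two small computations: scaling normalized box coordinates to pixels, and formatting the label text. The sketch below isolates them; the helper names `to_pixel_box` and `make_label` are hypothetical and chosen only for this illustration.

```python
def to_pixel_box(box_xyxyn, image_width, image_height):
    """Scale a normalized (xmin, ymin, xmax, ymax) box to pixel coordinates."""
    xmin, ymin, xmax, ymax = box_xyxyn
    return (
        xmin * image_width,
        ymin * image_height,
        xmax * image_width,
        ymax * image_height,
    )


def make_label(class_name, confidence):
    """Format the text drawn next to a box, e.g. 'cup: 0.91'."""
    return f"{class_name}: {confidence:.2f}"


# A box covering the lower-middle part of a 640x480 image
print(to_pixel_box((0.25, 0.5, 0.75, 1.0), 640, 480))
print(make_label("cup", 0.9123))
```

The x coordinates are multiplied by the image width and the y coordinates by the height, which is the same conversion the complete code performs inline.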

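The code also implements a key step: using the bounding-box coordinates, it crops every detected object out of the image and saves it as a separate file. This matters because each detected object can then be analyzed on its own, whether in further image-processing steps or in other uses such as building a machine-learning training set.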
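Before cropping, it can help to convert the box to integers and keep it inside the image. The sketch below shows one way to do that and how a crop file name can be assembled from the class, index, and confidence. `clamp_box` and `crop_filename` are hypothetical helpers; the clamping is an assumption added here and is not part of the article's code.

```python
def clamp_box(box, width, height):
    """Truncate a pixel box to integers and clamp it to the image bounds."""
    x1, y1, x2, y2 = box
    x1 = max(0, min(int(x1), width))
    y1 = max(0, min(int(y1), height))
    x2 = max(0, min(int(x2), width))
    y2 = max(0, min(int(y2), height))
    return (x1, y1, x2, y2)


def crop_filename(class_label, index, confidence):
    """Build a file name of the form <class>_<index>_<confidence>.jpg."""
    return f"{class_label}_{index}_{confidence:.2f}.jpg"


# A box that sticks out of a 640x480 image on the left and right
print(clamp_box((-3.2, 10.7, 700.4, 480.0), 640, 480))
print(crop_filename("cup", 2, 0.9123))
```

The resulting tuple can be passed directly to an image-cropping call that expects `(left, upper, right, lower)` pixel coordinates.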

Beyond illustrating how object detection works, this makes the detection results directly reusable in downstream image-analysis tasks.


II. Semantic Matching Between Images and Text

Introduction to the CLIP Model

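CLIP (Contrastive Language-Image Pre-training) is a multimodal learning framework developed by OpenAI to capture the semantic relationship between images and text. It is built on large-scale contrastive learning: an image encoder and a text encoder are trained jointly to recognize which image goes with which description, which markedly improves generalization on vision tasks. CLIP's main strength comes from this training method. It uses a very large collection of image-text pairs and optimizes the similarity between image and text representations, raising it for matching pairs and lowering it for mismatched ones. As a result, CLIP learns a broad range of visual concepts and ties them to natural language.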
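The contrastive objective can be sketched in a few lines. Given an N x N matrix whose entry (i, j) is the similarity between image i and text j, the loss is a cross-entropy over each row (image to text) and each column (text to image), with the matching pair on the diagonal as the target. This is a simplified illustration in plain Python, not CLIP's training code; the function names are hypothetical, and the fixed temperature of 0.07 is an assumption for the sketch (in CLIP the temperature is a learned parameter).

```python
import math


def softmax_cross_entropy(logits, target_index):
    """Cross-entropy of softmax(logits) against a one-hot target."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[target_index]


def clip_style_loss(similarity, temperature=0.07):
    """Symmetric contrastive loss over an N x N image-text similarity matrix."""
    n = len(similarity)
    logits = [[s / temperature for s in row] for row in similarity]
    # Each image should pick its own text (rows) ...
    image_to_text = sum(softmax_cross_entropy(logits[i], i) for i in range(n)) / n
    # ... and each text should pick its own image (columns).
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    text_to_image = sum(softmax_cross_entropy(columns[j], j) for j in range(n)) / n
    return (image_to_text + text_to_image) / 2


aligned = [[1.0, 0.0], [0.0, 1.0]]   # matching pairs are the most similar
swapped = [[0.0, 1.0], [1.0, 0.0]]   # every image prefers the wrong text
print(clip_style_loss(aligned), clip_style_loss(swapped))
```

The loss is small when the diagonal dominates and large when it does not, which is the pressure that pulls matching image and text embeddings together during training.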

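We use the pretrained ViT-B/32 version of CLIP, which pairs a Vision Transformer (ViT) image encoder with a Transformer text encoder, so it can handle both high-dimensional image data and free-form text input.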

Image and Text Preprocessing and Feature Extraction

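Matching images and text with CLIP requires appropriate preprocessing. Images go through the standard preprocessing pipeline that CLIP provides, including resizing and normalization, so that they meet the model's input specification. Text goes through CLIP's tokenizer, which splits it into tokens and maps them to integer token IDs that the model can consume.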
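The image side of that preprocessing can be approximated by hand to see what it does: resize so the shorter side is 224 pixels, crop the central 224 x 224 region, then normalize each channel with a fixed mean and standard deviation. The sketch below computes only the geometry and the per-pixel normalization in plain Python. The helper names are hypothetical, the mean and standard deviation are the values used by the CLIP repository's preprocessing, and the rounding may differ by a pixel from the library's own resize and crop.

```python
# Per-channel statistics used by CLIP's image normalization
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def resized_size(width, height, target=224):
    """Size after scaling so that the shorter side equals `target`."""
    if width <= height:
        return target, int(target * height / width)
    return int(target * width / height), target


def center_crop_box(width, height, target=224):
    """(left, top, right, bottom) of the central target x target square."""
    left = (width - target) // 2
    top = (height - target) // 2
    return (left, top, left + target, top + target)


def normalize_pixel(rgb):
    """Map an 8-bit RGB pixel to the normalized values the model sees."""
    return tuple(
        (channel / 255.0 - mean) / std
        for channel, mean, std in zip(rgb, CLIP_MEAN, CLIP_STD)
    )


w, h = resized_size(640, 480)
print((w, h), center_crop_box(w, h))
print(normalize_pixel((255, 255, 255)))
```

In practice the `preprocess` callable returned when the model is loaded does all of this, so the sketch is only meant to show what happens to a crop before it reaches the image encoder.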

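In the feature-extraction stage, the preprocessed images and text are fed into CLIP's image encoder and text encoder, respectively. Each encoder produces a high-dimensional feature vector (512 dimensions for ViT-B/32), and both kinds of vector live in the same embedding space, where they capture the semantic content of the image or text.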

Similarity Computation and Analysis

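With the image and text features in hand, the next step is to compute their semantic similarity. This is done with the dot product of the two sets of feature vectors; the result reflects how well an image and a text match semantically. Before computing it, we L2-normalize the feature vectors so that differences in vector length do not bias the result. With unit-length vectors, the dot product is exactly the cosine similarity.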
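A small numeric example shows why the normalization matters. The vectors below are made-up two-dimensional stand-ins for real features, and `l2_normalize` and `dot` are hypothetical helpers written for this sketch.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


text = [3.0, 4.0]
image_a = [6.0, 8.0]    # same direction as the text vector, twice as long
image_b = [4.0, -3.0]   # orthogonal to the text vector

# Raw dot products depend on vector length
print(dot(text, image_a), dot(text, image_b))

# After normalization the dot product is the cosine of the angle
cos_a = dot(l2_normalize(text), l2_normalize(image_a))
cos_b = dot(l2_normalize(text), l2_normalize(image_b))
print(cos_a, cos_b)
```

After normalization the score depends only on direction, so a feature vector cannot score higher just because it is longer.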

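This gives a quantitative measure of how well a given text description matches each image in a set, which is useful for multimodal data analysis, content retrieval, and automatic annotation. The results illustrate CLIP's ability to bridge vision and language and provide a basis for further multimodal work.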
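One way to use the scores is to rank the cropped images for a given text and, optionally, turn the scores into a probability-like distribution. The sketch below does both in plain Python. The file names and similarity values are invented for illustration and are not results from the article; the scale factor of 100 before the softmax follows the usage example in the CLIP repository, and applying the softmax across images rather than across texts is a choice made for this sketch.

```python
import math


def rank_images(similarities, names):
    """Pair each image name with its score and sort best match first."""
    return sorted(zip(names, similarities), key=lambda pair: pair[1], reverse=True)


def softmax(scores, scale=100.0):
    """Softmax over scaled scores, computed in a numerically stable way."""
    scaled = [scale * s for s in scores]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Invented example values: one text scored against three crops
names = ["person_0_0.93.jpg", "cup_1_0.88.jpg", "chair_2_0.64.jpg"]
similarities = [0.21, 0.27, 0.19]

ranking = rank_images(similarities, names)
probabilities = softmax(similarities)
print(ranking[0])
print(probabilities)
```

Cosine similarities from CLIP tend to sit in a narrow range, so the scaling makes the differences between candidates easier to read.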


III. Complete Code

from ultralytics import YOLO
import os
from PIL import Image, ImageDraw, ImageFont
import numpy as np
import torch
import matplotlib.pyplot as plt
import clip

# Load a pretrained YOLOv8s model
model = YOLO('yolov8s.pt')

# Load an image with PIL
original_image = Image.open('img/img1.jpg')

# Run inference on an image
results = model(original_image)

# Create a copy of the original image to draw bounding boxes
draw_image = original_image.copy()
draw = ImageDraw.Draw(draw_image)

# COCO class ID to name mapping (the 80 classes the pretrained model predicts)
class_names = {
    0: "person", 1: "bicycle", 2: "car", 3: "motorcycle", 4: "airplane",
    5: "bus", 6: "train", 7: "truck", 8: "boat", 9: "traffic light",
    10: "fire hydrant", 11: "stop sign", 12: "parking meter", 13: "bench",
    14: "bird", 15: "cat", 16: "dog", 17: "horse", 18: "sheep", 19: "cow",
    20: "elephant", 21: "bear", 22: "zebra", 23: "giraffe", 24: "backpack",
    25: "umbrella", 26: "handbag", 27: "tie", 28: "suitcase", 29: "frisbee",
    30: "skis", 31: "snowboard", 32: "sports ball", 33: "kite", 34: "baseball bat",
    35: "baseball glove", 36: "skateboard", 37: "surfboard", 38: "tennis racket",
    39: "bottle", 40: "wine glass", 41: "cup", 42: "fork", 43: "knife",
    44: "spoon", 45: "bowl", 46: "banana", 47: "apple", 48: "sandwich",
    49: "orange", 50: "broccoli", 51: "carrot", 52: "hot dog", 53: "pizza",
    54: "donut", 55: "cake", 56: "chair", 57: "couch", 58: "potted plant",
    59: "bed", 60: "dining table", 61: "toilet", 62: "tv", 63: "laptop",
    64: "mouse", 65: "remote", 66: "keyboard", 67: "cell phone", 68: "microwave",
    69: "oven", 70: "toaster", 71: "sink", 72: "refrigerator", 73: "book",
    74: "clock", 75: "vase", 76: "scissors", 77: "teddy bear", 78: "hair drier",
    79: "toothbrush"
}

# Process results list
for result in results:
    boxes = result.boxes  # Boxes object for bbox outputs
    cls = boxes.cls
    conf = boxes.conf
    xyxyn = boxes.xyxyn

    # Convert tensor to numpy array and move data to CPU
    cls_numpy = cls.cpu().numpy()
    conf_numpy = conf.cpu().numpy()
    xyxyn_numpy = xyxyn.cpu().numpy()

    # Iterate over each detection
    for i in range(len(cls_numpy)):
        # Convert normalized coordinates to pixel coordinates
        box = xyxyn_numpy[i]
        xmin = box[0] * original_image.width
        ymin = box[1] * original_image.height
        xmax = box[2] * original_image.width
        ymax = box[3] * original_image.height

        # Draw the bounding box
        draw.rectangle([(xmin, ymin), (xmax, ymax)], outline="red", width=2)

        # Get class label from cls number using the mapping
        class_label = class_names.get(int(cls_numpy[i]), 'Unknown')

        # Prepare text with class and confidence
        label = f'{class_label}: {conf_numpy[i]:.2f}'

        # Draw the class and confidence text
        draw.text((xmin, ymin), label, fill="white")

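# Save the annotated image (create the output directory first so save() does not fail)
os.makedirs('annotated_image', exist_ok=True)
draw_image.save('annotated_image/annotated_img1.png')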


# Crop a box (xmin, ymin, xmax, ymax), given in pixel coordinates, out of the image
def crop_image(original_image, box):
    x1, y1, x2, y2 = map(int, box)
    return original_image.crop((x1, y1, x2, y2))


# Create the directory for the cropped images if it does not exist yet
os.makedirs('cropped_images', exist_ok=True)

# Second pass over the results: crop each detection and save it.
# The boxes were already drawn in the loop above, so nothing is drawn here.
for result in results:
    boxes = result.boxes  # Boxes object for bbox outputs

    # Convert tensors to numpy arrays on the CPU
    cls_numpy = boxes.cls.cpu().numpy()
    conf_numpy = boxes.conf.cpu().numpy()
    xyxyn_numpy = boxes.xyxyn.cpu().numpy()

    # Iterate over each detection
    for i in range(len(cls_numpy)):
        # Convert normalized coordinates to pixel coordinates
        box = xyxyn_numpy[i]
        xmin = box[0] * original_image.width
        ymin = box[1] * original_image.height
        xmax = box[2] * original_image.width
        ymax = box[3] * original_image.height

        # Get class label from cls number using the mapping
        class_label = class_names.get(int(cls_numpy[i]), 'Unknown')

        # Crop the image around the bounding box and save it
        cropped_img = crop_image(original_image, (xmin, ymin, xmax, ymax))
        cropped_img_path = os.path.join('cropped_images', f'{class_label}_{i}_{conf_numpy[i]:.2f}.jpg')
        cropped_img.save(cropped_img_path)

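# Load the CLIP model; fall back to the CPU when no GPU is available.
# Note: this rebinds `model`, which held the YOLO detector above.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model, preprocess = clip.load("ViT-B/32", device=device)
model.eval()

# Prepare your images and texts
your_image_folder = "cropped_images"  # Change to the folder where you stored cropped images
your_texts = ["Drink water"]  # Replace with your list of texts

# Keep the file names so each similarity score can be traced back to its image
image_names = []
images = []
for filename in sorted(os.listdir(your_image_folder)):
    if filename.endswith((".png", ".jpg")):
        path = os.path.join(your_image_folder, filename)
        image = Image.open(path).convert("RGB")
        images.append(preprocess(image))
        image_names.append(filename)

if not images:
    raise SystemExit(f'No images found in {your_image_folder}')

# Batch the preprocessed images and tokenize the texts
image_input = torch.stack(images).to(device)
text_tokens = clip.tokenize(your_texts).to(device)

# Compute features
with torch.no_grad():
    image_features = model.encode_image(image_input).float()
    text_features = model.encode_text(text_tokens).float()

# L2-normalize the features so the dot product below is the cosine similarity
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)

print('image_features:', image_features)
print('text_features:', text_features)

print('image_features_shape:', image_features.shape)
print('text_features_shape:', text_features.shape)

# Calculate similarity: one row per text, one column per image
similarity = (text_features.cpu().numpy() @ image_features.cpu().numpy().T)

# Print similarity scores, first as a matrix, then per text and image
print('Similarity:', similarity)
for text, scores in zip(your_texts, similarity):
    for name, score in zip(image_names, scores):
        print(f'{text!r} vs {name}: {score:.4f}')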