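Elasticsearch: Exploring CLIP alternatives

By Jeffrey Rengifo and Tomás Murúa, Elastic

Analyzing alternatives to the CLIP model for image-to-image and text-to-image search.

In this article, we'll introduce the CLIP multimodal model, explore alternatives, and analyze their pros and cons through a practical example: a mock real estate website that lets users search for properties using a picture as a reference.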

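What is CLIP?

CLIP (Contrastive Language–Image Pre-training) is a neural network created by OpenAI and trained on pairs of images and text. It is designed to find similarities between text and images and to classify images in a "zero-shot" fashion: the model is not trained with a fixed set of labels. Instead, we supply categories it has not seen during training, and it classifies the images we give it against them.

CLIP has been the state-of-the-art model for some time, and you can read more about it in these articles:

Implementing image search
How to implement image similarity search

Over time, however, more alternatives have emerged.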

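In this article, we'll go through the pros and cons of two CLIP alternatives using the real estate example. The sections below walk through the steps in order.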

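Basic configuration: CLIP and Elasticsearch

For our example, we'll use Python to build a small project with an interactive UI. We'll install a few dependencies, such as the Python transformers library, which gives us access to some of the models we'll use.

Create a folder named /clip_comparison and follow the installation instructions here. Once that's done, install the Elasticsearch Python client, the Cohere SDK, and Streamlit:

Note: Optionally, I recommend using a Python virtual environment (venv). It's useful if you don't want to install all the dependencies system-wide.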

pip install elasticsearch==8.15.0 cohere streamlit

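Streamlit is an open-source Python framework that lets you build a UI with very little code.

We'll also create a few files to hold the code we'll use later:

app.py: UI logic.
/services/elasticsearch.py: Elasticsearch client initialization, queries, and bulk API calls to index documents.
/services/models.py: Model instances and methods for generating embeddings.
index_data.py: Script to index images from a local source.
/data: Our dataset directory.

Our application structure should look like this: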

/clip_comparison
  |--app.py
  |--index_data.py
  |--/data
  |--/venv # If you decide to use venv
  |--/services
        |-- __init__.py
        |-- models.py
        |-- elasticsearch.py

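Configuring Elasticsearch

Follow the steps below to store the sample images. We'll then search them using kNN vector queries.

Note: We could also store text documents, but for this example we'll search only across images.

Index mappings

Open Kibana Dev Tools (in Kibana: Management > Dev Tools) and create the data structure with these mappings: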

PUT clip-images
{
  "mappings": {
    "properties": {
      "image_name": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "image_embedding": {
        "type": "dense_vector",
        "dims": 768,
        "index": "true",
        "similarity": "cosine"
      },
      "image_data": {
        "type": "binary"
      }
    }
  }
}

PUT embed-images
{
  "mappings": {
    "properties": {
      "image_name": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "image_embedding": {
        "type": "dense_vector",
        "dims": 1024,
        "index": "true",
        "similarity": "cosine"
      },
      "image_data": {
        "type": "binary"
      }
    }
  }
}
PUT jina-images
{
  "mappings": {
    "properties": {
      "image_name": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "image_embedding": {
        "type": "dense_vector",
        "dims": 768,
        "index": "true",
        "similarity": "cosine"
      },
      "image_data": {
        "type": "binary"
      }
    }
  }
}

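The dense_vector field type stores the embeddings generated by the models. The binary field stores the images in base64 format.

Note: Storing images as binary in Elasticsearch is not good practice. We're doing it here only to keep the example practical. A static file store is the recommended approach.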
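As a quick illustration of what the binary field holds, here is a minimal sketch of the base64 round trip using only the standard library. The byte string is a made-up stand-in for real image data, not a file from the dataset.

```python
import base64

# Hypothetical stand-in for the raw bytes read from an image file.
raw_bytes = b"\x89PNG\r\n\x1a\n fake image payload"

# Encode to a base64 string, the form stored in the `image_data` field.
encoded = base64.b64encode(raw_bytes).decode("utf-8")

# Decode back to the original bytes, as a UI would do before rendering.
decoded = base64.b64decode(encoded)

print(decoded == raw_bytes)
```

Base64 inflates the payload by roughly a third, which is one reason a static file store is usually the better home for images.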

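Now let's look at the code. The first thing we need to do is initialize the Elasticsearch client using the Elasticsearch endpoint and an API key. Write the following code at the beginning of the file /services/elasticsearch.py: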

from elasticsearch import Elasticsearch, exceptions, helpers
ELASTIC_ENDPOINT = "https://your-elastic-endpoint.com:9243"
ELASTIC_API_KEY = "your-elasticsearch-api-key"
# Elasticsearch client
es_client = Elasticsearch(
    ELASTIC_ENDPOINT,
    api_key=ELASTIC_API_KEY,
)
# index documents using bulk api
def index_images(index_name: str, images_obj_arr: list):

    actions = [
        {
            "_index": index_name,
            "_source": {
                "image_data": obj["image_data"],
                "image_name": obj["image_name"],
                "image_embedding": obj["image_embedding"],
            },
        }
        for obj in images_obj_arr
    ]
    try:
        response = helpers.bulk(es_client, actions)
        return response
    except exceptions.ConnectionError as e:
        return e

# knn search
def knn_search(index_name: str, query_vector: list, k: int):
    query = {
        "size": 4,
        "_source": ["image_name", "image_data"],
        "query": {
            "knn": {
                "field": "image_embedding",
                "query_vector": query_vector,
                "k": k,
                "num_candidates": 100,
                "boost": 10,
            }
        },
    }
    try:
        response = es_client.search(index=index_name, body=query)
        return response
    except exceptions.ConnectionError as e:
        return e
# match all query
def get_all_query(index_name: str):
    query = {
        "size": 400,
        "_source": ["image_name", "image_data"],
        "query": {"match_all": {}},
    }
    try:
        return es_client.search(index=index_name, body=query)
    except exceptions.ConnectionError as e:
        return e
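The mappings use cosine similarity, and the UI later prints each hit's _score. Elasticsearch documents the cosine score as (1 + cosine(query, vector)) / 2, before any boost is applied. The sketch below reproduces that transformation in plain Python so the printed scores are easier to interpret; the function names are illustrative and not part of the Elasticsearch client.

```python
import math


def cosine(a, b):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def es_cosine_score(a, b):
    """Map cosine in [-1, 1] to a score in [0, 1], as documented for cosine similarity."""
    return (1 + cosine(a, b)) / 2


identical = es_cosine_score([1.0, 0.0], [1.0, 0.0])
orthogonal = es_cosine_score([1.0, 0.0], [0.0, 1.0])
opposite = es_cosine_score([1.0, 0.0], [-1.0, 0.0])
print(identical, orthogonal, opposite)
```

Identical directions map to the top of the range, unrelated vectors land in the middle, and opposite directions land at the bottom.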

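Configuring the models

To configure the models, place the model instances and their methods in this file: /services/models.py.

The Cohere Embed-3 model works as a web service, so we need an API key to use it. You can get one for free here. The trial is limited to 5 calls per minute and 1,000 calls per month.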
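With a limit of 5 calls per minute, indexing a folder of images can easily exceed the trial quota. One possible way to stay under it is a small sliding-window throttle. The sketch below is an assumption of ours, not part of the Cohere SDK: CallThrottle is a hypothetical helper, and it blocks the calling thread while waiting.

```python
import time
from collections import deque


class CallThrottle:
    """Hypothetical helper: allow at most `max_calls` per `period` seconds."""

    def __init__(self, max_calls=5, period=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock    # injectable for testing
        self.sleep = sleep    # injectable for testing
        self.calls = deque()  # timestamps of recent calls

    def wait(self):
        now = self.clock()
        # Drop timestamps that have already left the window.
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest call leaves the window.
            self.sleep(self.period - (now - self.calls[0]))
            now = self.clock()
            self.calls.popleft()
        self.calls.append(now)


throttle = CallThrottle(max_calls=5, period=60.0)
```

Calling throttle.wait() before each request to the embedding service would then pause only when the sixth call within a minute is attempted.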

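To configure the models and make the images searchable in Elasticsearch, follow these steps:

Convert the images to vectors using CLIP.
Store the image vectors in Elasticsearch.
Vectorize the image or text that we want to compare against the stored images.
Run a query that compares the entry from the previous step with the stored images and returns the most similar ones.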
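To make the four steps concrete, here is a toy, in-memory sketch of the same flow. It assumes hand-written 3-dimensional vectors instead of real model embeddings and uses an exact brute-force comparison, whereas Elasticsearch runs an approximate kNN search over the indexed vectors. The file names and vectors are invented for illustration.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Steps 1-2: pretend these vectors came from a model and are already stored.
stored = {
    "rustic_home.jpg": [0.9, 0.1, 0.0],
    "modern_loft.jpg": [0.1, 0.9, 0.2],
    "beach_house.jpg": [0.0, 0.3, 0.9],
}


def top_k(query_vector, k=2):
    """Step 4: rank stored images by similarity to the query vector."""
    ranked = sorted(
        stored,
        key=lambda name: cosine(query_vector, stored[name]),
        reverse=True,
    )
    return ranked[:k]


# Step 3: the query image or text, already converted to a vector.
query = [0.8, 0.2, 0.1]
print(top_k(query))
```

The query vector points mostly in the same direction as the first stored vector, so that image ranks first.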

# /services/models.py
# dependencies
import base64
import io
import cohere
from PIL import Image
from transformers import CLIPModel, CLIPProcessor, AutoModel
COHERE_API_KEY = "your-cohere-api-key"
## CLIP model call
clip_model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
# JinaCLip model call
jina_model = AutoModel.from_pretrained("jinaai/jina-clip-v1", trust_remote_code=True)
# Cohere client initialization
co = cohere.ClientV2(COHERE_API_KEY)

Configuring CLIP

To configure CLIP, we need to add methods to the models.py file that generate image and text embeddings.

# /services/models.py
# convert images to vector using CLIP
async def clip_image_embeddings(image: Image.Image):
    try:
        inputs = clip_processor(images=image, return_tensors="pt", padding=True)
        outputs = clip_model.get_image_features(**inputs)
        return outputs.detach().cpu().numpy().flatten().tolist()
    except Exception as e:
        print(f"Error generating embeddings: {e}")
        return None
# convert text to vector using CLIP
async def clip_text_embeddings(description: str):
    try:
        inputs = clip_processor([description], padding=True, return_tensors="pt")
        outputs = clip_model.get_text_features(**inputs)
        return outputs.detach().cpu().numpy().flatten().tolist()
    except Exception as e:
        print(f"Error generating embeddings: {e}")
        return None

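You need to declare similar methods for every model: one to generate embeddings from an image (clip_image_embeddings) and another to generate embeddings from text (clip_text_embeddings).

The outputs.detach().cpu().numpy().flatten().tolist() chain is a common way to convert PyTorch tensors into a more usable format:

.detach(): Detaches the tensor from the computation graph, since we no longer need to compute gradients.
.cpu(): Moves the tensor from the GPU to the CPU, since NumPy only supports the CPU.
.numpy(): Converts the tensor to a NumPy array.
.flatten(): Flattens it into a 1D array.
.tolist(): Converts it to a Python list.

This chain converts a multidimensional tensor into a plain list of numbers that can be used as an embedding.

Now let's look at some CLIP alternatives.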

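Competitor 1: JinaCLIP

JinaCLIP is a CLIP variant developed by Jina AI, designed specifically to improve image and text search in multimodal applications. It optimizes CLIP's performance by making image and text representations more flexible.

Compared with the original OpenAI CLIP model, JinaCLIP performs better on text-to-text, text-to-image, image-to-text, and image-to-image tasks, as shown in the table below: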

Model	Text-Text	Text-to-Image	Image-to-Text	Image-Image
jina-clip-v1	0.429	0.899	0.803	0.916
openai-clip-vit-b16	0.162	0.881	0.756	0.816
%increase vs OpenAI CLIP	165%	2%	6%	12%

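JinaCLIP's ability to improve precision across different query types makes it a good fit for tasks that require more precise and detailed analysis.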
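The last row of the table is the relative increase of each jina-clip-v1 score over the corresponding openai-clip-vit-b16 score. A short sketch of that arithmetic, using only the figures from the table:

```python
# (jina-clip-v1 score, openai-clip-vit-b16 score) per task, copied from the table.
scores = {
    "Text-Text": (0.429, 0.162),
    "Text-to-Image": (0.899, 0.881),
    "Image-to-Text": (0.803, 0.756),
    "Image-Image": (0.916, 0.816),
}


def pct_increase(new, old):
    """Relative increase of `new` over `old`, rounded to a whole percent."""
    return round((new - old) / old * 100)


increases = {task: pct_increase(jina, clip) for task, (jina, clip) in scores.items()}
print(increases)
# {'Text-Text': 165, 'Text-to-Image': 2, 'Image-to-Text': 6, 'Image-Image': 12}
```

The large text-to-text gain stands out, while the cross-modal gains are more modest.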

You can read more about JinaCLIP here.

To use JinaCLIP in our app and generate embeddings, we need to declare the following methods:

# /services/models.py
# convert images to vector using JinaClip model
async def jina_image_embeddings(image: Image.Image):
    try:
        image_embeddings = jina_model.encode_image([image])
        return image_embeddings[0].tolist()
    except Exception as e:
        print(f"Error generating embeddings: {e}")
        return None
# convert text to vector
async def jina_text_embeddings(description: str):
    try:
        text_embeddings = jina_model.encode_text(description)
        return text_embeddings.tolist()
    except Exception as e:
        print(f"Error generating embeddings: {e}")
        return None

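Competitor 2: Cohere Image Embeddings V3

Cohere has developed an image embedding model called Embed-3, which is a direct competitor to CLIP. The main difference is that Cohere focuses on representing enterprise data such as charts, product images, and design files. Embed-3 uses an advanced architecture that reduces the risk of bias toward text data, currently a weakness of other multimodal models like CLIP, so it can provide more precise results between text and images.

Below is a chart from Cohere showing the improvement from using Embed 3 over CLIP on this kind of data:

For more information, see Embed3.

As with the previous models, let's declare the methods that use Embed 3: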

# /services/models.py
# convert images to vector using Cohere Embed model
async def embed_image_embeddings(image: Image.Image):
    try:
        img_byte_arr = io.BytesIO()
        image.save(img_byte_arr, format="JPEG")
        img_byte_arr = img_byte_arr.getvalue()
        stringified_buffer = base64.b64encode(img_byte_arr).decode("utf-8")
        content_type = "image/jpeg"
        image_base64 = f"data:{content_type};base64,{stringified_buffer}"
        response = co.embed(
            model="embed-english-v3.0",
            input_type="image",
            embedding_types=["float"],
            images=[image_base64],
        )
        return response.embeddings.float_[0]
    except Exception as e:
        print(f"Error generating embeddings: {e}")
        return None
# convert text to vector
async def embed_text_embeddings(description: str):
    try:
        response = co.embed(
            texts=[description],
            model="embed-english-v3.0",
            input_type="classification",
            embedding_types=["float"],
        )
        return response.embeddings.float_[0]
    except Exception as e:
        print(f"Error generating embeddings: {e}")
        return None

With the functions ready, let's index the dataset into Elasticsearch by adding the following code to the file index_data.py:

# dependencies
import asyncio
import base64
import os
from PIL import Image
from services.elasticsearch import index_images
from services.models import (
    clip_image_embeddings,
    embed_image_embeddings,
    jina_image_embeddings,
)
# function to encode images
def encode_image_to_base64(image_path):
    with open(image_path, "rb") as img_file:
        return base64.b64encode(img_file.read()).decode("utf-8")
async def main():
    # folder with images
    folder_path = "./data"
    jina_obj_arr = []
    embed_obj_arr = []
    clip_obj_arr = []
    for filename in os.listdir(folder_path):
        img_path = os.path.join(folder_path, filename)
        print(f"Processing {filename}...")
        try:
            image_data = Image.open(img_path)
            # generating images embeddings
            clip_result, embed_result, jina_result = await asyncio.gather(
                clip_image_embeddings(image_data),
                embed_image_embeddings(image_data),
                jina_image_embeddings(image_data),
            )
            image_base64 = encode_image_to_base64(img_path)
            # building documents
            jina_obj_arr.append(
                {
                    "image_name": filename,
                    "image_embedding": jina_result,
                    "image_data": image_base64,
                }
            )
            embed_obj_arr.append(
                {
                    "image_name": filename,
                    "image_embedding": embed_result,
                    "image_data": image_base64,
                }
            )
            clip_obj_arr.append(
                {
                    "image_name": filename,
                    "image_embedding": clip_result,
                    "image_data": image_base64,
                }
            )
        except Exception as e:
            print(f"Error with {filename}: {e}")
    print("Indexing images in Elasticsearch...")
    # indexing images
    jina_count, _ = index_images("jina-images", jina_obj_arr)
    cohere_count, _ = index_images("embed-images", embed_obj_arr)
    openai_count, _ = index_images("clip-images", clip_obj_arr)
    print("Cohere count: ", cohere_count)
    print("Jina count: ", jina_count)
    print("OpenAI count: ", openai_count)
if __name__ == "__main__":
    asyncio.run(main())
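Because the embedding functions return None on failure and each index expects a fixed vector size (768 for clip-images and jina-images, 1024 for embed-images, per the mappings), it can help to validate documents before sending them to the bulk API. The helper below is a hypothetical sketch, not part of the original script:

```python
# Vector sizes taken from the index mappings defined earlier.
EXPECTED_DIMS = {"clip-images": 768, "embed-images": 1024, "jina-images": 768}


def has_valid_embedding(index_name, doc):
    """Return True if the document's embedding matches the index's `dims`."""
    embedding = doc.get("image_embedding")
    if embedding is None:  # the embedding call failed and returned None
        return False
    return len(embedding) == EXPECTED_DIMS[index_name]


def keep_valid(index_name, docs):
    """Filter out documents that the index would reject."""
    return [doc for doc in docs if has_valid_embedding(index_name, doc)]


docs = [
    {"image_name": "a.jpg", "image_embedding": [0.0] * 768},
    {"image_name": "b.jpg", "image_embedding": None},
    {"image_name": "c.jpg", "image_embedding": [0.0] * 1024},
]
print([d["image_name"] for d in keep_valid("clip-images", docs)])
```

Filtering this way keeps one failed embedding from causing the whole bulk request to report errors.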

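Index the documents with the following command:

python index_data.py

Once the dataset is indexed, we can build the UI.

Testing the UI

Creating the UI

We'll use Streamlit to build the UI and compare the three alternatives side by side.

To build the UI, we'll start by adding the imports and dependencies to the file app.py: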

# app.py
import asyncio
import base64
from io import BytesIO
import streamlit as st
from PIL import Image
from services.elasticsearch import get_all_query, knn_search
# declared functions imports
from services.models import (
    clip_image_embeddings,
    clip_text_embeddings,
    embed_image_embeddings,
    embed_text_embeddings,
    jina_image_embeddings,
    jina_text_embeddings,
)

For this example, we'll use two views: one for image search and another for viewing the image dataset:

# app.py
if "selected_view" not in st.session_state:
    st.session_state.selected_view = "Index"
def change_view(view):
    st.session_state.selected_view = view
st.sidebar.title("Menu")
if st.sidebar.button("Search image"):
    change_view("Index")
if st.sidebar.button("All images"):
    change_view("Images")

Let's add the code for the image search view:

if st.session_state.selected_view == "Index":
    # Index page
    st.title("Image Search")
    col1, col_or, col2 = st.columns([2, 1, 2])
    uploaded_image = None
    with col1:
        uploaded_image = st.file_uploader("Upload image", type=["jpg", "jpeg", "png"])
    with col_or:
        st.markdown(
            "<h3 style='text-align: center; margin-top: 50%;'>OR</h3>",
            unsafe_allow_html=True,
        )
    input_text = None
    with col2:
        st.markdown(
            "<div style='display: flex; margin-top: 3rem;  align-items: center; height: 100%; justify-content: center;'>",
            unsafe_allow_html=True,
        )
        input_text = st.text_input("Type text")
        st.markdown("</div>", unsafe_allow_html=True)
    st.write("")
    st.write("")
    search_button = st.markdown(
        """
        <style>
            .stButton>button {
                width: 50%;
                height: 50px;
                font-size: 20px;
                margin: 0 auto;
                display: block;
            }
        </style>
        """,
        unsafe_allow_html=True,
    )
    submit_button = st.button("Search")
    if uploaded_image:
        st.image(uploaded_image, caption="Uploaded Image", use_container_width=True)
    if submit_button:
        if uploaded_image or input_text:
            async def fetch_embeddings():
                data = None
                if uploaded_image:
                    image = Image.open(uploaded_image)
                    data = image
                elif input_text:
                    data = input_text
                # Getting image or text embeddings
                if uploaded_image:
                    openai_result, cohere_result, jina_result = await asyncio.gather(
                        clip_image_embeddings(data),
                        embed_image_embeddings(data),
                        jina_image_embeddings(data),
                    )
                elif input_text:
                    openai_result, cohere_result, jina_result = await asyncio.gather(
                        clip_text_embeddings(data),
                        embed_text_embeddings(data),
                        jina_text_embeddings(data),
                    )
                return openai_result, cohere_result, jina_result
            results = asyncio.run(fetch_embeddings())
            openai_result, cohere_result, jina_result = results
            if openai_result and cohere_result and jina_result:
                # calling knn query
                clip_search_results = knn_search("clip-images", openai_result, 5)
                jina_search_results = knn_search("jina-images", jina_result, 5)
                embed_search_results = knn_search("embed-images", cohere_result, 5)
                clip_search_results = clip_search_results["hits"]["hits"]
                jina_search_results = jina_search_results["hits"]["hits"]
                embed_search_results = embed_search_results["hits"]["hits"]
                st.subheader("Search Results")
                col1, spacer1, col2, spacer2, col3 = st.columns([3, 0.2, 3, 0.2, 3])
                def print_results(results):
                    for hit in results:
                        image_data = base64.b64decode(hit["_source"]["image_data"])
                        image = Image.open(BytesIO(image_data))
                        st.image(image, use_container_width=True)
                        st.write("score: ", hit["_score"])
                # printing results
                with col1:
                    st.write("CLIP")
                    print_results(clip_search_results)
                with col2:
                    st.write("JinaCLIP")
                    print_results(jina_search_results)
                with col3:
                    st.write("Cohere")
                    print_results(embed_search_results)
        else:
            st.warning("Please upload an image or type text to search.")

Now, the code for the images view:

elif st.session_state.selected_view == "Images":
    # images page
    st.header("All images")
    # getting all images
    images = get_all_query("jina-images")
    hits = images["hits"]["hits"]
    columns = st.columns(5)
    for idx, hit in enumerate(hits):
        image_data = base64.b64decode(hit["_source"]["image_data"])
        image = Image.open(BytesIO(image_data))
        with columns[idx % 5]:
            st.image(image, use_container_width=True)
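The images view places hit number idx into column idx % 5, which fills the grid row by row. A minimal sketch of that round-robin assignment without Streamlit, using made-up file names:

```python
def assign_columns(items, n_columns=5):
    """Distribute items across columns in round-robin order."""
    columns = [[] for _ in range(n_columns)]
    for idx, item in enumerate(items):
        columns[idx % n_columns].append(item)
    return columns


grid = assign_columns([f"img_{i}.jpg" for i in range(7)])
for col in grid:
    print(col)
```

With seven items, the first two columns hold two images each and the remaining three hold one.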

We'll run the application with the following command:

streamlit run app.py

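Image search with Elasticsearch - CLIP alternatives

Thanks to multimodality, we can search our image database by text (text-to-image similarity) or by image (image-to-image similarity).

Searching with the UI

To compare the three models, we'll use a scenario in which a real estate website wants to improve its search experience by letting users search with images or text. We'll discuss the results each model returns.

We'll upload an image of a "rustic home":

Here are the search results. As you can see, each model returned different results for the image we uploaded:

You can also see the results when searching for house features by text:

When searching for "modern", all three models return good results. However, JinaCLIP and Cohere show the same house in first position.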

Feature comparison

Below is a summary of the main features and pricing of the three options covered in this article:

Model	Created by	Estimated Price	Features
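CLIP	OpenAI	$0.00058 per run on Replicate (https://replicate.com/krthr/clip-embeddings)	General-purpose multimodal model for text and images; suitable for a wide range of applications without task-specific training.
JinaCLIP	Jina AI	$0.018 per 1M tokens on Jina (https://jina.ai/embeddings/)	CLIP variant optimized for multimodal applications. Improved precision when retrieving text and images.
Embed-3	Cohere	$0.10 per 1M tokens and $0.0001 per data item or image on Cohere (https://cohere.com/pricing)	Focused on enterprise data. Improved retrieval of complex visual data such as graphs and charts.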