[A Practical Tool for RAG] Getting Started with the Chroma Vector Database

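Installation and Deployment

Chroma can be installed with pip, but for production-style use we deploy it with Docker instead. The deployment reference is the Chroma Cookbook: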

https://cookbook.chromadb.dev/running/running-chroma/#docker-compose-cloned-repo

Run the following in a terminal:

docker run -d --rm --name chromadb -p 8001:8000 -v H:/Projects/chroma/volumes/index_data:/chroma/chroma -e IS_PERSISTENT=TRUE -e ANONYMIZED_TELEMETRY=TRUE chromadb/chroma:0.6.4.dev19

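Here, -v H:/Projects/chroma/volumes/index_data:/chroma/chroma maps a local directory to the container's data directory, and -p 8001:8000 publishes the container's port 8000 on host port 8001.

Open http://localhost:8001/docs in a browser (8001 is the host port from the -p mapping). If the ChromaDB API documentation page loads, ChromaDB is installed and running.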
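If no browser is at hand, the same check can be approximated from Python. The sketch below only tests whether the published TCP port accepts connections; it does not confirm that the process behind it is Chroma. The helper name port_is_open is hypothetical, and 8001 is taken from the -p 8001:8000 mapping in the docker run command.

```python
import socket

def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False

# 8001 is the host port from the `-p 8001:8000` mapping.
print("Chroma port reachable:", port_is_open("localhost", 8001))
```

A True result means something is listening on the port; loading the /docs page remains the stronger confirmation.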

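Introduction to Chroma

A vector database for efficiently managing text embeddings and similarity search

As large language models (LLMs) become widely used, vector databases have become a key tool for handling text embeddings and similarity search. Chroma is an open-source vector database designed specifically for storing and retrieving text embeddings, helping developers build LLM-based applications more efficiently. This article covers Chroma's core features, its design philosophy, and how to use it for managing text embeddings and running similarity searches.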

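What is a vector database?

A vector database is a database built for storing and retrieving high-dimensional vectors. Unlike a traditional relational database, it is optimized for the embedding representations of unstructured data such as text and images. Converting text into vector embeddings gives a program a numeric representation of the text's meaning, which enables features such as semantic search and personalized recommendation.

Vector databases play an especially important role in LLM applications. The user's input text is converted into a vector embedding, a similarity search then finds related documents in the database, and the model uses those documents to generate a tailored response. Because only the most relevant documents are passed to the model, this can improve response speed and reduce compute cost.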
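The similarity search step described above can be illustrated without any database. The following is a minimal sketch, assuming cosine similarity as the measure and hand-made 3-dimensional vectors in place of real embeddings; the function names and the toy vectors are hypothetical, and a real vector database would use an approximate index rather than this brute-force scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, vectors, k=2):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical 3-dimensional "embeddings"; real models produce hundreds of dimensions.
toy_vectors = {
    "doc_ai": [0.9, 0.1, 0.0],
    "doc_food": [0.1, 0.9, 0.1],
    "doc_travel": [0.0, 0.2, 0.9],
}

ranked = top_k([1.0, 0.0, 0.1], toy_vectors, k=2)
print([doc_id for doc_id, _ in ranked])  # ['doc_ai', 'doc_food']
```

The query vector points almost entirely along the first axis, so doc_ai ranks first by a wide margin.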

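Core features of Chroma

Chroma is an open-source vector database focused on simplifying the storage and retrieval of text embeddings. Its main features include:

Multiple storage backends: early releases of Chroma supported backends such as DuckDB (for standalone applications) and ClickHouse (for large-scale deployments); since version 0.4, persistence is built on SQLite.
Multi-language support: Chroma provides SDKs for Python and JavaScript/TypeScript, so developers can integrate it quickly.
Ease of use: Chroma's design philosophy is "simplicity first", aimed at improving developer productivity.
High performance: Chroma supports fast similarity search and also provides analysis of search results.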

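How Chroma works

The Chroma workflow consists of the following steps:

Create a collection: a collection is similar to a table in a relational database and stores documents together with their embeddings. By default, Chroma uses the all-MiniLM-L6-v2 model to convert text into embeddings, but developers can choose a different embedding model.
Add documents: add text documents and their metadata to the collection. Chroma converts the text into embeddings automatically and stores them.
Query and search: query the collection by text or by embedding, and Chroma returns the documents most similar to the query. Results can also be filtered by metadata.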
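To make the three steps concrete, here is a toy in-memory stand-in written in plain Python. It is a sketch of the workflow only, not Chroma's implementation: ToyCollection and toy_embed are hypothetical names, the "embedding" is a simple word count over a fixed vocabulary, and squared Euclidean distance is used for simplicity.

```python
import re
from collections import Counter

def toy_embed(text, vocabulary):
    """Hypothetical embedder: count how often each vocabulary word appears."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [float(counts[word]) for word in vocabulary]

class ToyCollection:
    """In-memory stand-in for a collection; not Chroma's implementation."""

    def __init__(self, vocabulary):
        self.vocabulary = vocabulary
        self.rows = {}  # id -> (document, metadata, embedding)

    def add(self, ids, documents, metadatas):
        # Step 2: embed each document and store it with its metadata.
        for doc_id, doc, meta in zip(ids, documents, metadatas):
            self.rows[doc_id] = (doc, meta, toy_embed(doc, self.vocabulary))

    def query(self, query_text, n_results=1, where=None):
        # Step 3: embed the query, filter by metadata, rank by distance.
        query_vec = toy_embed(query_text, self.vocabulary)
        hits = []
        for doc_id, (doc, meta, vec) in self.rows.items():
            # Every key in `where` must match the metadata exactly.
            if where and any(meta.get(k) != v for k, v in where.items()):
                continue
            distance = sum((q - d) ** 2 for q, d in zip(query_vec, vec))
            hits.append((distance, doc_id, doc))
        hits.sort(key=lambda hit: hit[0])
        return hits[:n_results]

# Step 1: create the collection.
collection = ToyCollection(vocabulary=["chess", "club", "student", "university"])
collection.add(
    ids=["id1", "id2"],
    documents=["The chess club meets weekly", "A student at the university"],
    metadatas=[{"source": "club info"}, {"source": "student info"}],
)
best = collection.query("chess club", n_results=1)
filtered = collection.query("chess club", n_results=1, where={"source": "student info"})
print(best[0][1], filtered[0][1])  # id1 id2
```

The unfiltered query returns id1 because its word counts match the query exactly; with the metadata filter applied, id1 is excluded before ranking and id2 is the only candidate left.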

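Chroma's design philosophy

Chroma aims to give developers a simple, efficient tool for bringing real-world knowledge, facts, and skills into LLMs. Its design principles include:

Simplicity and developer productivity: Chroma's API is concise, so developers can get started quickly and integrate it into existing applications.
Search and analysis together: besides efficient similarity search, Chroma provides analysis of search results to help developers understand their data.
High performance: Chroma aims for strong performance while remaining feature-rich.

Conditional queries are a powerful feature of ChromaDB: they let you filter query results by metadata or by document content. The conditional query tutorial later in this article explains them in detail with sample code.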

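ChromaDB Getting Started Tutorial

ChromaDB is an open-source vector database built for storing and querying vector embeddings. It is well suited to natural language processing (NLP) tasks such as text similarity search and recommendation systems. This tutorial starts from scratch and shows how to work with ChromaDB from Python.

1. Installing ChromaDB

First, install ChromaDB and the OpenAI client library. The openai package and an OpenAI API key are only needed if you use OpenAI embeddings (section 6).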

pip install chromadb openai

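2. Creating a ChromaDB client

ChromaDB supports an in-memory mode and a persistent mode. In-memory mode suits temporary data, while persistent mode saves data to disk. The legacy Settings(chroma_db_impl="duckdb+parquet", ...) configuration is rejected by Chroma 0.4 and later, so the example uses PersistentClient.

import chromadb

# In-memory client: data is lost when the process exits
# client = chromadb.Client()

# Persistent client: data is saved in the "db/" directory
client = chromadb.PersistentClient(path="db/")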

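If Chroma is deployed with Docker, connect over HTTP instead:

import chromadb

# Port 8001 is the host port published by the docker run command (-p 8001:8000).
# The auth settings are only needed if the server has token authentication enabled.
client = chromadb.HttpClient(
    host="127.0.0.1",
    port=8001,
    settings=chromadb.Settings(
        chroma_client_auth_provider="chromadb.auth.token_authn.TokenAuthClientProvider",
        chroma_client_auth_credentials="your_token",
    ),
)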

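3. Creating a collection

A collection is similar to a table in a traditional database. Create one with the create_collection method. It raises an error if a collection with the same name already exists; get_or_create_collection avoids that when a script is run more than once.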

collection = client.create_collection(name="Students")

4. Adding data to a collection

You can add text data to a collection; ChromaDB converts the text into vector embeddings automatically and stores them.

student_info = """
Alexandra Thompson, a 19-year-old computer science sophomore with a 3.7 GPA,
is a member of the programming and chess clubs who enjoys pizza, swimming, and hiking
in her free time in hopes of working at a tech company after graduating from the University of Washington.
"""

club_info = """
The university chess club provides an outlet for students to come together and enjoy playing
the classic strategy game of chess. Members of all skill levels are welcome, from beginners learning
the rules to experienced tournament players. The club typically meets a few times per week to play casual games,
participate in tournaments, analyze famous chess matches, and improve members' skills.
"""

university_info = """
The University of Washington, founded in 1861 in Seattle, is a public research university
with over 45,000 students across three campuses in Seattle, Tacoma, and Bothell.
As the flagship institution of the six public universities in Washington state,
UW encompasses over 500 buildings and 20 million square feet of space,
including one of the largest library systems in the world.
"""

# Add the documents, their metadata, and unique ids to the collection
collection.add(
    documents=[student_info, club_info, university_info],
    metadatas=[{"source": "student info"}, {"source": "club info"}, {"source": "university info"}],
    ids=["id1", "id2", "id3"]
)

5. Querying data

Use the query method to run a similarity search. ChromaDB converts the query text into a vector and returns the n_results closest documents.

results = collection.query(
    query_texts=["What is the student name?"],
    n_results=2
)

print(results)
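The printed result is a dictionary of parallel lists: keys such as ids, documents, metadatas, and distances each hold one inner list per query text. The sketch below shows one way to turn that shape into per-document rows. The helper flatten_results is hypothetical, and the sample dictionary is hand-written to mirror the structure; its texts and distances are made up, not real query output.

```python
def flatten_results(results):
    """Turn a query result dict into one list of rows per query text."""
    all_rows = []
    for ids, docs, metas, dists in zip(
        results["ids"], results["documents"], results["metadatas"], results["distances"]
    ):
        rows = [
            {"id": i, "document": d, "metadata": m, "distance": dist}
            for i, d, m, dist in zip(ids, docs, metas, dists)
        ]
        all_rows.append(rows)
    return all_rows

# Hand-written sample in the shape query() returns; the distances are made up.
sample = {
    "ids": [["id1", "id2"]],
    "documents": [["student text", "club text"]],
    "metadatas": [[{"source": "student info"}, {"source": "club info"}]],
    "distances": [[0.5, 0.9]],
}

for row in flatten_results(sample)[0]:
    print(row["id"], row["metadata"]["source"], row["distance"])
```

With a real query, flatten_results(results)[0] holds the rows for the first entry in query_texts, ordered from smallest to largest distance.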

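6. Using other embedding models

ChromaDB uses the all-MiniLM-L6-v2 model for embeddings by default. You can also use other models, such as OpenAI's text-embedding-ada-002. To have a collection use it, pass embedding_function=openai_ef when creating the collection.

import os
from chromadb.utils import embedding_functions

# Pass the key explicitly; this assumes OPENAI_API_KEY is set in the environment.
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-ada-002"
)

# Calling the embedding function directly returns one embedding per input text.
students_embeddings = openai_ef([student_info, club_info, university_info])
print(students_embeddings)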

7. Updating and deleting data

You can update or delete the data in a collection.

Updating data

collection.update(
    ids=["id1"],
    documents=["Kristiane Carina, a 19-year-old computer science sophomore with a 3.7 GPA"],
    metadatas=[{"source": "student info"}]
)

Deleting data

collection.delete(ids=["id1"])

# Query again to verify the deletion
results = collection.query(
    query_texts=["What is the student name?"],
    n_results=2
)

print(results)

8. Other operations

Listing collections

collections = client.list_collections()
print(collections)

Reading data from a collection

collection = client.get_collection(name="Students")
data = collection.peek()  # returns the first 10 items by default
print(data)

Deleting a collection

client.delete_collection(name="Students")

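9. Data persistence

To persist data, use PersistentClient. It stores data under the given path and loads it again the next time the client is created.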

client = chromadb.PersistentClient(path="./chroma_db")

10. Conditional queries

You can filter query results by metadata or by document content.

results = collection.query(
    query_texts=["What is the student name?"],
    n_results=2,
    where={"source": "student info"},
    where_document={"$contains": "computer science"}
)

print(results)

11. Updating documents

Use the update method to change a document that is already in the collection.

collection.update(
    ids=["id1"],
    documents=["Updated student info"],
    metadatas=[{"source": "updated student info"}]
)

12. Deleting documents

Use the delete method to remove documents from the collection.

collection.delete(ids=["id1"])

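Chroma Conditional Query Tutorial

1. Filtering by metadata (the `where` parameter)

Use the where parameter to filter on metadata fields. Metadata is the additional information you supply when adding data.

Supported operators

$eq: equal to
$ne: not equal to
$gt: greater than
$gte: greater than or equal to
$lt: less than
$lte: less than or equal to
$in: in a list
$nin: not in a list

Sample code

Assume the collection contains the following data: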

collection.add(
    documents=["Document about AI", "Document about food", "Document about travel"],
    metadatas=[
        {"category": "technology", "year": 2022},
        {"category": "lifestyle", "year": 2021},
        {"category": "travel", "year": 2023}
    ],
    ids=["id1", "id2", "id3"]
)

Find documents whose `category` is `technology`

results = collection.query(
    query_texts=["AI"],
    n_results=2,
    where={"category": {"$eq": "technology"}}  # metadata filter
)

print(results)

Find documents whose `year` is greater than 2021

results = collection.query(
    query_texts=["technology"],
    n_results=2,
    where={"year": {"$gt": 2021}}  # metadata filter
)

print(results)

Find documents whose `category` is in `["technology", "travel"]`

results = collection.query(
    query_texts=["AI"],
    n_results=2,
    where={"category": {"$in": ["technology", "travel"]}}  # metadata filter
)

print(results)

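2. Filtering by document content (the `where_document` parameter)

Use the where_document parameter to filter on document content. The $contains operator checks whether a document contains the given string; its counterpart $not_contains excludes documents that contain it.

Sample code

Find documents that contain `AI`

results = collection.query(
    query_texts=["technology"],
    n_results=2,
    where_document={"$contains": "AI"}  # document content filter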
)

print(results)

Find documents that contain `food`

results = collection.query(
    query_texts=["lifestyle"],
    n_results=2,
    where_document={"$contains": "food"}  # document content filter
)

print(results)

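3. Combining conditions

You can use where and where_document together for more specific queries. A document must satisfy both filters to be returned.

Sample code

Find documents whose `category` is `technology` and that contain `AI`

results = collection.query(
    query_texts=["AI"],
    n_results=2,
    where={"category": {"$eq": "technology"}},  # metadata filter
    where_document={"$contains": "AI"}  # document content filter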
)

print(results)

Find documents whose `year` is greater than 2021 and that contain `travel`

results = collection.query(
    query_texts=["travel"],
    n_results=2,
    where={"year": {"$gt": 2021}},  # metadata filter
    where_document={"$contains": "travel"}  # document content filter
)

print(results)

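4. Logical operators (`$and` and `$or`)

Use the logical operators $and and $or to combine several conditions. Each takes a list of conditions.

Sample code

Find documents whose `category` is `technology` or whose `year` is greater than 2021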

results = collection.query(
    query_texts=["AI"],
    n_results=2,
    where={
        "$or": [
            {"category": {"$eq": "technology"}},
            {"year": {"$gt": 2021}}
        ]
    }
)

print(results)

Find documents whose `category` is `technology` and whose `year` is greater than 2021

results = collection.query(
    query_texts=["AI"],
    n_results=2,
    where={
        "$and": [
            {"category": {"$eq": "technology"}},
            {"year": {"$gt": 2021}}
        ]
    }
)

print(results)
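To see which documents each filter should select, the operator semantics can be sketched in plain Python and checked against the sample metadata. This is an illustration of the intended meaning only, not Chroma's implementation: matches and FIELD_OPS are hypothetical names, and treating a missing field as a non-match is a simplifying assumption.

```python
import operator

# Field-level operators from the list in section 1, mapped to Python comparisons.
FIELD_OPS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
}

def matches(metadata, where):
    """Return True if a metadata dict satisfies a where-style filter."""
    if "$and" in where:
        return all(matches(metadata, clause) for clause in where["$and"])
    if "$or" in where:
        return any(matches(metadata, clause) for clause in where["$or"])
    for field, condition in where.items():
        if not isinstance(condition, dict):
            condition = {"$eq": condition}  # {"k": v} is shorthand for $eq
        if field not in metadata:
            return False  # simplification: a missing field never matches
        for op_name, operand in condition.items():
            if not FIELD_OPS[op_name](metadata[field], operand):
                return False
    return True

# The same metadata as the sample documents id1, id2, id3.
metadatas = {
    "id1": {"category": "technology", "year": 2022},
    "id2": {"category": "lifestyle", "year": 2021},
    "id3": {"category": "travel", "year": 2023},
}
either = {"$or": [{"category": {"$eq": "technology"}}, {"year": {"$gt": 2021}}]}
both = {"$and": [{"category": {"$eq": "technology"}}, {"year": {"$gt": 2021}}]}

print([i for i, m in metadatas.items() if matches(m, either)])  # ['id1', 'id3']
print([i for i, m in metadatas.items() if matches(m, both)])    # ['id1']
```

The $or filter admits id3 through its year alone, while the $and filter keeps only id1, the one document that satisfies both conditions.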

5. Complete conditional query example

The following complete example adds data and then runs a conditional query. It uses PersistentClient because the legacy Settings(chroma_db_impl="duckdb+parquet", ...) configuration is rejected by Chroma 0.4 and later.

import chromadb

# Create a persistent client; data is saved in the "db/" directory
client = chromadb.PersistentClient(path="db/")

# Create the collection (raises an error if "Documents" already exists)
collection = client.create_collection(name="Documents")

# Add documents with their metadata and unique ids
collection.add(
    documents=["Document about AI", "Document about food", "Document about travel"],
    metadatas=[
        {"category": "technology", "year": 2022},
        {"category": "lifestyle", "year": 2021},
        {"category": "travel", "year": 2023}
    ],
    ids=["id1", "id2", "id3"]
)

# Conditional query: category is technology and the document contains "AI"
results = collection.query(
    query_texts=["AI"],
    n_results=2,
    where={"category": {"$eq": "technology"}},
    where_document={"$contains": "AI"}
)

print("Query results:", results)