How to Optimize RAG Retrieval in Elasticsearch with DeepEval

Author: Kritin Vongthongsri, from Elastic

Learn how to use DeepEval to optimize the Elasticsearch retriever in a RAG pipeline.

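LLMs are prone to hallucination, lack domain-specific expertise, and are limited by their context windows. Retrieval-Augmented Generation (RAG) addresses these problems by giving the LLM access to relevant external context, which makes its answers more accurate.

Several RAG approaches, such as GraphRAG and AdaptiveRAG, have emerged to improve retrieval accuracy. Even so, retrieval performance can still vary with the domain and specific use case of the RAG application.

Optimizing retrieval for a specific use case means identifying the hyperparameters that yield the best quality, including the choice of embedding model, the number of top results returned (top-K), the similarity function, the reranking strategy, and more.

In practice, this means evaluating and iterating over these hyperparameters until you find the best-performing combination. In this article, we explore how to use DeepEval to optimize the Elasticsearch retriever in a RAG pipeline.

We start by installing Elasticsearch and DeepEval: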

pip install deepeval elasticsearch

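Measuring retrieval performance

To optimize the Elasticsearch retriever and benchmark each hyperparameter combination, you need a way to assess retrieval quality. Three key metrics can be used to measure retrieval performance: contextual precision, contextual recall, and contextual relevancy.

Contextual Precision:

This metric evaluates whether the chunks relevant to the input are ranked higher in the retrieval context than the irrelevant ones, so it mainly reflects the quality of the ranking or reranking step.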

from deepeval.metrics import ContextualPrecisionMetric
contextual_precision = ContextualPrecisionMetric()
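
To make the ranking idea concrete, here is a minimal sketch of a weighted cumulative precision score, which is how the DeepEval documentation describes contextual precision as far as we understand it. In DeepEval the per-chunk relevance verdicts come from an LLM judge; here they are hard-coded 0/1 values, and the function name is our own.

```python
def weighted_cumulative_precision(relevance):
    """Score a ranked list of 0/1 relevance verdicts (rank 1 first)."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0

    score = 0.0
    relevant_so_far = 0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            relevant_so_far += 1
            # Precision at rank k, counted only at ranks holding a relevant chunk
            score += relevant_so_far / k
    return score / total_relevant


# Same chunks, different order: relevant chunks first vs. relevant chunks last
good_order = weighted_cumulative_precision([1, 1, 0])
bad_order = weighted_cumulative_precision([0, 1, 1])
print(good_order, bad_order)
```

The same two relevant chunks score higher when they appear at the top of the list, which is why this metric responds mainly to reranking settings.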

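Contextual Recall:

This metric measures whether the retrieved context contains all the chunks of information relevant to the input. In other words, it checks whether the retriever missed key information, so that the returned content is as complete as possible.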

from deepeval.metrics import ContextualRecallMetric
contextual_recall = ContextualRecallMetric()

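Contextual Relevancy:

This metric measures the overall relevance of the retrieved chunks to the input query. It checks that the returned content not only contains relevant information but is also, as a whole, useful for generating a high-quality LLM response. A higher contextual relevancy score indicates that the retrieval system is supplying more precise and useful context.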

from deepeval.metrics import ContextualRelevancyMetric
contextual_relevancy = ContextualRelevancyMetric()

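In short, the contextual relevancy metric evaluates how relevant the information in the retrieval context is to the user input of your RAG application.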
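
Both contextual recall and contextual relevancy can be thought of as simple ratios over per-statement verdicts. The sketch below assumes that view: recall as the share of expected-output statements that can be attributed to the retrieval context, and relevancy as the share of retrieved statements that are relevant to the input. The verdict lists are hypothetical; in DeepEval an LLM judge produces them.

```python
def ratio(verdicts):
    """Fraction of True verdicts; 0.0 for an empty list."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


# Hypothetical verdicts: can each statement in the expected output
# be attributed to something in the retrieval context?
recall_verdicts = [True, True, False]

# Hypothetical verdicts: is each statement in the retrieval context
# relevant to the user input?
relevancy_verdicts = [True, False, False, False]

contextual_recall_score = ratio(recall_verdicts)
contextual_relevancy_score = ratio(relevancy_verdicts)
print(contextual_recall_score, contextual_relevancy_score)
```

This also shows why the two can pull against each other: retrieving more chunks tends to cover more of the expected output while adding statements that are not relevant to the input.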

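Together, these three metrics help ensure that the retriever fetches the right amount of the right information, in the right order, and gives the LLM clean, well-structured data from which to generate accurate output.

Ideally, you would find the hyperparameter combination that maximizes all three metrics. In some cases, however, raising recall will inevitably lower relevancy, so finding a balance between these factors is key to achieving the best performance.

If you need custom metrics for a specific use case, consider G-Eval and DAG. These tools let you define precise metrics with specific evaluation criteria.

The following resources may help you better understand how these metrics are calculated:

How contextual precision is calculated
How contextual recall is calculated
How contextual relevancy is calculated
Retrieval evaluation in RAG applications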

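Elasticsearch hyperparameters to optimize

Elasticsearch offers a great deal of flexibility for information retrieval in RAG pipelines, with many hyperparameters that can be tuned to optimize retrieval performance. This section covers some of the key ones.

Pre-retrieval:

To optimize the structure of your data before inserting it into the Elasticsearch vector database, you can tune parameters such as chunk size and chunk overlap. Choosing the right embedding model is also critical for efficient and meaningful vector representations.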
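
As an illustration of what chunk size and chunk overlap control, here is a minimal character-based chunker. It is a sketch under simple assumptions (fixed-width character windows, our own function name); real pipelines often split on tokens or sentences instead.

```python
def chunk_text(text, chunk_size=200, chunk_overlap=50):
    """Split text into fixed-size character windows that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)
```

Larger chunks carry more context per hit but dilute relevancy; more overlap reduces the chance of cutting a fact in half at the cost of a bigger index.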

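During retrieval:

Elasticsearch gives you full control over the retrieval process. You can configure the similarity function, set the number of candidates for the approximate search, and then run kNN over those candidates to select the top-K most relevant results.

You can also define the retrieval strategy: semantic retrieval (based on vector embeddings), text retrieval (based on query rules), or hybrid retrieval (a combination of the two).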
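
One possible shape for a hybrid request is sketched below: a full-text match clause and a kNN clause in the same search request. This assumes an 8.x cluster that accepts a top-level knn section alongside query, and the "text" and "embedding" field names used in this article's index. The helper name is ours, and how the two scores are combined depends on your Elasticsearch version and settings, so check the documentation before relying on it.

```python
def build_hybrid_search(query_text, query_vector, top_k=3, num_candidates=10):
    """Build search arguments that combine text matching with kNN."""
    return {
        # Text retrieval: match the query against the raw chunk text
        "query": {"match": {"text": query_text}},
        # Semantic retrieval: approximate kNN over the embedding field
        "knn": {
            "field": "embedding",
            "query_vector": query_vector,
            "k": top_k,
            "num_candidates": num_candidates,
        },
        "size": top_k,
    }


# Placeholder vector; in practice this would be the embedding of the query
request = build_hybrid_search("How does Elasticsearch work?", [0.1] * 384)
print(sorted(request))
```

With a live cluster, the result could be passed on as client.search(index="knowledge_base", **request).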

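Post-retrieval:

Once results are retrieved, Elasticsearch allows further refinement, such as reranking. You can choose a reranking model, define a reranking window, set a minimum score threshold, and more, so that the most relevant results are returned first.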
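
To show how the reranking window and minimum score threshold interact, here is a client-side sketch. It is not Elasticsearch's reranking API: the rerank helper is our own, and word_overlap is a toy stand-in for a real reranking model.

```python
def rerank(query, candidates, score_fn, window=5, min_score=0.0):
    """Rescore the first `window` candidates and drop low scorers."""
    scored = [(text, score_fn(query, text)) for text in candidates[:window]]
    kept = [(text, score) for text, score in scored if score >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


def word_overlap(query, text):
    """Toy scorer: fraction of query words that appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


candidates = [
    "Scalable architecture allows Elasticsearch to handle massive volumes of data.",
    "Vector search enables high-precision semantic retrieval.",
    "Elasticsearch is a distributed search engine.",
]
reranked = rerank("elasticsearch search engine", candidates, word_overlap,
                  window=3, min_score=0.5)
print(reranked)
```

A wider window gives the reranker more candidates to promote, while a higher threshold trades recall for precision.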

…

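Different hyperparameters have a stronger effect on particular retrieval metrics. For example, if contextual relevancy is low, the cause may lie in settings such as top-K. Mapping specific hyperparameters to individual retrieval metrics lets you tune your RAG pipeline more efficiently and target the optimization precisely.

The table below maps hyperparameters to the retrieval metrics they affect: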

Metric	Hyperparameter
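Contextual precision	Reranking model, reranking window, reranking threshold
Contextual recall	Retrieval strategy (text vs. embedding), embedding model, candidate count, similarity function, top-K
Contextual relevancy	top-K, chunk size, chunk overlap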
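
The mapping above can also be kept in code so that low scores point directly at the knobs to revisit. This is a small sketch; the 0.7 threshold and the example scores are illustrative assumptions, not recommended values.

```python
TUNING_MAP = {
    "contextual_precision": ["reranking model", "reranking window", "reranking threshold"],
    "contextual_recall": ["retrieval strategy", "embedding model", "candidate count",
                          "similarity function", "top-K"],
    "contextual_relevancy": ["top-K", "chunk size", "chunk overlap"],
}


def hyperparameters_to_tune(scores, threshold=0.7):
    """List the hyperparameters tied to every metric scoring below threshold."""
    to_tune = []
    for metric, score in scores.items():
        if score < threshold:
            for name in TUNING_MAP[metric]:
                if name not in to_tune:  # keep order, avoid duplicates
                    to_tune.append(name)
    return to_tune


suggestions = hyperparameters_to_tune({
    "contextual_precision": 0.63,
    "contextual_recall": 0.93,
    "contextual_relevancy": 0.52,
})
print(suggestions)
```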

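In the next section, we walk through code examples showing how to evaluate and optimize our Elasticsearch retriever. We will use "all-MiniLM-L6-v2" to embed the text documents, set top-K to 3, and configure the number of candidates to 10.

Setting up RAG with an Elastic retriever

First, connect to your local or cloud-based Elastic cluster: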

from elasticsearch import Elasticsearch

# Create the client instance
client = Elasticsearch(
    # For local development
    # hosts=["http://localhost:9200"]
    cloud_id=ELASTIC_CLOUD_ID,
    api_key=ELASTIC_API_KEY,
)

Next, create an Elasticsearch index with the appropriate type mappings to store the text and its embedding as a dense vector.

if not client.indices.exists(index="knowledge_base"):
    client.indices.create(
        index="knowledge_base",
        mappings={
            "properties": {
                "text": {
                    "type": "text"
                },
                "embedding": {
                    "type": "dense_vector",
                    "dims": 384,
                    "index": "true",
                    "similarity": "cosine"
                }
            }
        }
    )

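To insert document chunks into the Elastic index, first encode them as vectors with an embedding model. In this example, we use "all-MiniLM-L6-v2".

# Requires: pip install sentence-transformers
from sentence_transformers import SentenceTransformer

# Load the embedding model (384-dimensional output, matching "dims" in the mapping)
model = SentenceTransformer("all-MiniLM-L6-v2")

# Example document chunks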
document_chunks = [
    "Elasticsearch is a distributed search engine.",
    "RAG improves AI-generated responses with retrieved context.",
    "Vector search enables high-precision semantic retrieval.",
    "Elasticsearch uses dense vector and sparse vector similarity for semantic search.",
    "Scalable architecture allows Elasticsearch to handle massive volumes of data.",
    "Document chunking can help improve retrieval performance.",
    "Elasticsearch supports a wide range of search features."
    # Add more document chunks as needed...
]
operations = []
for i, chunk in enumerate(document_chunks):
    operations.append({"index": {"_index": "knowledge_base", "_id": i}})
    # Convert the document chunk to an embedding vector
    operations.append({
        "text": chunk,
        "embedding": model.encode(chunk).tolist()
    })

client.bulk(index="knowledge_base", operations=operations, refresh=True)

Finally, define a retriever function that searches your Elasticsearch client on behalf of your RAG pipeline.

def search(input, top_k=3):
    # Encode the query using the model
    input_embedding = model.encode(input).tolist()

    # Search the Elasticsearch index using kNN on the "embedding" field
    res = client.search(index="knowledge_base", body={
        "knn": {
            "field": "embedding",
            "query_vector": input_embedding,
            "k": top_k,  # Retrieve the top k matches
            "num_candidates": 10  # Controls search speed vs accuracy
        }
    })

    # Return the text of each hit (an empty list if there are no hits)
    return [hit["_source"]["text"] for hit in res["hits"]["hits"]]

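Evaluating your RAG retriever

With the Elasticsearch retriever set up, you can begin evaluating it as part of your RAG pipeline. The evaluation involves two steps:

Prepare an input query and the expected LLM response, then use the input to generate a response from your RAG pipeline, creating an LLMTestCase that contains the input, actual output, expected output, and retrieval context.
Evaluate the test case with a selection of retrieval metrics.

Preparing a test case

Here we prepare an input question, "How does Elasticsearch work?", along with the corresponding expected output, "Elasticsearch uses dense vector and sparse vector similarity for semantic search."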

input = "How does Elasticsearch work?"
expected_output = "Elasticsearch uses dense vector and sparse vector similarity for semantic search."
retrieval_context = search(input)

# Use an f-string so the question and the retrieved context are actually substituted
prompt = f"""
Answer the user question based on the supporting context

User Question:
{input}

Supporting Context:
{retrieval_context}
"""
actual_output = generate(prompt) # hypothetical function, replace with your own LLM
print(actual_output)

Let's examine the actual output generated by our RAG pipeline:

“Elasticsearch indexes document chunks using an inverted index for fast full-text search and retrieval.”

Finally, combine all the test case parameters into a single LLM test case.

from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="How does Elasticsearch work?",
    actual_output=actual_output,
    retrieval_context=retrieval_context,
    expected_output=expected_output,
)

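Running the evaluation

To run the evaluation on your Elastic retriever, pass the test case and the metrics we defined earlier to the evaluate function.

from deepeval import evaluate

evaluate(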
   [test_case],
   metrics=[contextual_recall, contextual_precision, contextual_relevancy]
)

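Optimizing the retriever

Once the test case has been evaluated, we can start analyzing the results. Below are example evaluation results for the test case we created, along with other hypothetical queries that users might send to your RAG system.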

Query	Contextual precision	Contextual recall	Contextual relevancy
"How does Elasticsearch work?"	0.63	0.93	0.52
"Explain Elasticsearch's indexing method."	0.57	0.87	0.49
"What makes Elasticsearch efficient for search?"	0.65	0.90	0.55

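Contextual precision is poor → some of the retrieved context may be too generic or off-topic.
Contextual recall is strong → Elasticsearch is retrieving enough relevant documents.
Contextual relevancy is inconsistent → the quality of the retrieved documents varies across queries.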
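
When there are many queries, it can help to aggregate the scores before reading them. The sketch below averages the example results from the table above per metric and picks out the weakest one; the dictionary layout and short metric keys are our own choices.

```python
from statistics import mean

# Example scores copied from the results table above
results = {
    "How does Elasticsearch work?": {"precision": 0.63, "recall": 0.93, "relevancy": 0.52},
    "Explain Elasticsearch's indexing method.": {"precision": 0.57, "recall": 0.87, "relevancy": 0.49},
    "What makes Elasticsearch efficient for search?": {"precision": 0.65, "recall": 0.90, "relevancy": 0.55},
}

# Average each metric across all queries, rounded for readability
averages = {
    metric: round(mean(scores[metric] for scores in results.values()), 2)
    for metric in ("precision", "recall", "relevancy")
}

# The metric with the lowest average is the first candidate for tuning
weakest = min(averages, key=averages.get)
print(averages, weakest)
```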

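Improving retrieval quality

As noted earlier, each metric is affected by specific retrieval hyperparameters. Because contextual precision is poor and contextual relevancy is inconsistent, the reranker hyperparameters, along with top-K, chunk size, and chunk overlap, are the clear candidates for improvement.

Here is an example of iterating over top-K with a simple for loop.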

# Example of running multiple test cases with different retrieval settings
query = "How does Elasticsearch work?"

# top_k must be an integer; values stay within num_candidates (10) used by search()
for top_k in [1, 3, 5, 7]:
    retrieval_context = search(query, top_k)
    test_case = LLMTestCase(
        input=query,
        # generate() is the hypothetical LLM call; adapt it to take the query and context
        actual_output=generate(query, retrieval_context),
        retrieval_context=retrieval_context,
        expected_output="Elasticsearch is an optimized vector database for AI applications.",
    )

    evaluate([test_case], metrics=[contextual_recall, contextual_precision, contextual_relevancy])

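This for loop helps identify the top-K value that yields the best metric scores. Apply the same approach to every hyperparameter that affects your retrieval system's relevancy and precision scores, and you will be able to determine the best combination.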
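
Extending the loop to several hyperparameters at once amounts to a grid search. The sketch below assumes a score_fn that returns a single number per configuration; in practice it would rebuild or query the index with those settings, run the evaluation, and aggregate the metric scores, none of which is shown here. The fake_score function and the parameter values are hypothetical stand-ins so the example runs on its own.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every combination in param_grid and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def fake_score(params):
    """Stand-in for a real evaluation run; peaks at top_k=3, chunk_size=256."""
    return 1.0 - abs(params["top_k"] - 3) * 0.1 - abs(params["chunk_size"] - 256) / 1000


param_grid = {"top_k": [1, 3, 5, 7], "chunk_size": [128, 256, 512]}
best_params, best_score = grid_search(param_grid, fake_score)
print(best_params, best_score)
```

The number of combinations grows multiplicatively with each added hyperparameter, so it is usually worth starting with the ones mapped to your weakest metric.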