Unleash Your Metadata: Self-Querying Retrievers with Elasticsearch

news2025/7/2 12:27:10

By Josh Asres, Elastic

Learn how to use Elasticsearch's self-querying retriever to improve the relevance of semantic search with structured filters.

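In AI-powered search, efficiently finding the right data in massive datasets is critical. Traditional keyword-based search often falls short on natural-language queries, which is where semantic search comes in. But when you want to combine semantic search with the ability to filter on structured metadata such as dates and numeric values, self-querying retrievers come into play.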

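Self-querying retrievers offer a powerful way to use metadata for more precise, fine-grained searches. Combined with Elasticsearch's search and indexing capabilities, they let developers improve the relevance of RAG applications. This blog post explores the concept of self-querying retrievers, shows how to integrate them with Elasticsearch using LangChain and Python, and explains how they can make your search more powerful.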

What are self-querying retrievers?

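Self-querying retrievers are a LangChain feature that bridges the gap between natural-language queries and structured metadata filtering. Instead of relying solely on keyword matches against document content, they use a large language model (LLM) together with Elasticsearch's vector search to parse the user's natural-language query and extract the relevant metadata filters. For example, a user might ask: "Find science fiction movies released after 2000 with a rating above 8". A traditional search engine struggles to find the implied meaning without matching keywords, while semantic search alone understands the context of the query but cannot apply the date and rating filters needed for the best answer. A self-querying retriever, by contrast, analyzes the query, identifies the metadata fields (genre, year, rating), and generates a structured query that Elasticsearch can understand and execute efficiently. The result is a more intuitive, user-friendly search experience in which users can express complex search criteria, filters included, in plain English.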

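All of this runs through an LLM chain: the LLM parses the natural-language query to extract the filters, and the resulting structured filters are then applied to the documents in Elasticsearch, which hold both embeddings and metadata.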
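To make the "structured filter" step more concrete, here is a minimal sketch of how extracted comparisons could be turned into an Elasticsearch bool filter. The `Comparison` class and the `to_es_filter` helper are hypothetical names made up for illustration; they are not LangChain's internals. The sketch also assumes that metadata lives under a `metadata` object in each indexed document and that string fields have a `.keyword` sub-field, as Elasticsearch's default dynamic mapping would create.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass
class Comparison:
    """Hypothetical stand-in for one filter condition extracted by the LLM."""
    field: str   # metadata field name, e.g. "year"
    op: str      # one of: "eq", "gt", "gte", "lt", "lte"
    value: Any


def to_es_clause(c: Comparison) -> dict:
    """Translate a single comparison into an Elasticsearch query clause."""
    key = f"metadata.{c.field}"  # assumption: metadata is nested under "metadata"
    if c.op == "eq":
        # Assumption: exact string matches go against the ".keyword" sub-field.
        if isinstance(c.value, str):
            key = f"{key}.keyword"
        return {"term": {key: c.value}}
    if c.op in {"gt", "gte", "lt", "lte"}:
        return {"range": {key: {c.op: c.value}}}
    raise ValueError(f"unsupported operator: {c.op}")


def to_es_filter(comparisons: list[Comparison]) -> dict:
    """AND all comparisons together inside a bool query."""
    return {"bool": {"must": [to_es_clause(c) for c in comparisons]}}


# "Find science fiction movies released after 2000 with a rating above 8"
parsed = [
    Comparison("genre", "eq", "science fiction"),
    Comparison("year", "gt", 2000),
    Comparison("rating", "gt", 8),
]
es_filter = to_es_filter(parsed)
print(json.dumps(es_filter, indent=2))
```

The semantic part of the query ("science fiction movies") would still be handled by the vector search; a filter like this one only narrows the set of candidate documents.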

Implementing a self-querying retriever

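Integrating a self-querying retriever with Elasticsearch involves a few key steps. In our Python example, we use LangChain's AzureChatOpenAI and AzureOpenAIEmbeddings together with ElasticsearchStore. We start by importing the LangChain libraries and setting up the LLM and the embedding model used to create the vectors: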

from langchain_openai import AzureOpenAIEmbeddings, AzureChatOpenAI
from langchain_elasticsearch import ElasticsearchStore
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain_core.documents import Document
import os

llm = AzureChatOpenAI(
   azure_endpoint=os.environ["AZURE_ENDPOINT"],
   deployment_name=os.environ["AZURE_OPENAI_DEPLOYMENT_NAME"],
   model_name="gpt-4",
   api_version="2024-02-15-preview"
)


embeddings = AzureOpenAIEmbeddings(
   azure_endpoint=os.environ["AZURE_ENDPOINT"],
   model="text-embedding-ada-002"
)

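In my example, I use Azure OpenAI for the LLM (gpt-4) and text-embedding-ada-002 for the embeddings. However, this should work with any cloud-based LLM as well as local LLMs such as Llama 3. The same goes for the embedding model; I chose OpenAI's simply because we are already using gpt-4.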

Next, we define our documents with metadata so that they can be indexed into Elasticsearch with the established metadata fields:

# --- Define Metadata Attributes ---
metadata_field_info = [
   AttributeInfo(
       name="year",
       description="The year the movie was released",
       type="integer",
   ),
   AttributeInfo(
       name="rating",
       description="The rating of the movie (out of 10)",
       type="float",
   ),
   AttributeInfo(
       name="genre",
       description="The genre of the movie",
       type="string",
   ),
   AttributeInfo(
       name="director",
       description="The director of the movie",
       type="string",
   ),
   AttributeInfo(
       name="title",
       description="The title of the movie",
       type="string",
   )
]

docs = [
   Document(
       page_content="Following clues to the origin of mankind, a team finds a structure on a distant moon, but they soon realize they are not alone.",
       metadata={"year": 2012, "rating": 7.7, "genre": "science fiction", "title": "Prometheus"},
   ),
   # ...more documents
]

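Next, we add them to an Elasticsearch index. The es_store.add_embeddings function adds the documents to the index named in the ELASTIC_INDEX_NAME variable; if no index with that name exists in the cluster, it is created. In my example I use an Elastic Cloud deployment, but this works with a self-managed cluster as well: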

es_store = ElasticsearchStore(
   es_cloud_id=ELASTIC_CLOUD_ID,
   es_user=ELASTIC_USERNAME,
   es_password=ELASTIC_PASSWORD,
   index_name=ELASTIC_INDEX_NAME,
   embedding=embeddings,
)
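
# Assumed preparation step (not shown in the original): split the documents
# into texts and metadata, then embed the texts with the model set up earlier.
texts = [doc.page_content for doc in docs]
metadatas = [doc.metadata for doc in docs]
doc_embeddings = embeddings.embed_documents(texts)
es_store.add_embeddings(text_embeddings=list(zip(texts, doc_embeddings)), metadatas=metadatas)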
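
The add_embeddings call takes (text, vector) pairs plus a parallel list of metadata dictionaries. The sketch below only illustrates that shape. It uses plain dictionaries instead of LangChain Document objects and a made-up `fake_embed` function instead of a real embedding model, so the vectors are meaningless placeholders; both names are assumptions for illustration.

```python
# Two sample documents taken from this post, as plain dicts for illustration.
sample_docs = [
    {
        "page_content": "Following clues to the origin of mankind, a team finds a structure on a distant moon, but they soon realize they are not alone.",
        "metadata": {"year": 2012, "rating": 7.7, "genre": "science fiction", "title": "Prometheus"},
    },
    {
        "page_content": "The aging patriarch of an organized crime dynasty transfers control of his clandestine empire to his reluctant son.",
        "metadata": {"year": 1972, "rating": 9.2, "genre": "crime", "title": "The Godfather"},
    },
]


def fake_embed(text: str, dim: int = 4) -> list[float]:
    """Hypothetical placeholder for a real embedding model; NOT semantically meaningful."""
    return [float((len(text) * (i + 1)) % 7) for i in range(dim)]


# Keep texts, metadata and vectors in the same order so they stay aligned.
sample_texts = [d["page_content"] for d in sample_docs]
sample_metadatas = [d["metadata"] for d in sample_docs]
sample_vectors = [fake_embed(t) for t in sample_texts]

# This is the shape passed as text_embeddings: one (text, vector) tuple per document.
text_embeddings = list(zip(sample_texts, sample_vectors))

for (text, vector), metadata in zip(text_embeddings, sample_metadatas):
    print(metadata["title"], len(vector), text[:40])
```

Because the three lists are matched by position, building them from the same docs list in a single pass keeps each vector attached to the right metadata.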

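We then create the self-querying retriever, which takes the user's query, uses the LLM (the Azure OpenAI model we set up earlier) to interpret it, and builds an Elasticsearch query that combines semantic search with metadata filters. All of this is executed by docs = retriever.invoke(query):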

# --- Create the self-querying Retriever (Using your LLM) ---
retriever = SelfQueryRetriever.from_llm(
   llm,
   es_store,
   "Search for movies",
   metadata_field_info,
   verbose=True,
)

while True:
   # Prompt the user for a query
   query = input("\nEnter your search query (or type 'exit' to quit): ")
   if query.lower() == 'exit':
       break
  
   # Execute the query and print the results
   print(f"\nQuery: {query}")
   docs = retriever.invoke(query)
   print(f"Found {len(docs)} documents:")
   for doc in docs:
       print(doc.page_content)
       print(doc.metadata)
       print("-" * 20)

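And there we have it! The query is executed against the Elasticsearch index, returning the documents that best match both the content and the metadata criteria. This lets users issue natural-language queries, as in the example below: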

Query: What is a highly rated movie from the 1970s?
Found 3 documents:
The aging patriarch of an organized crime dynasty transfers control of his clandestine empire to his reluctant son.
{'year': 1972, 'rating': 9.2, 'genre': 'crime', 'title': 'The Godfather'}
--------------------
Three men walk into the Zone, three men walk out of the Zone
{'year': 1979, 'director': 'Andrei Tarkovsky', 'genre': 'thriller', 'rating': 9.9, 'title': 'Stalker'}
--------------------
Four armed men hijack a New York City subway car and demand a ransom for the passengers
{'year': 1974, 'rating': 7.6, 'director': 'Joseph Sargent', 'genre': 'action', 'title': 'The Taking of Pelham One Two Three'}
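
As a rough illustration of the metadata side of that query, the sketch below checks the returned metadata against a year range for "the 1970s". The filter itself is an assumption: the output above does not show what the LLM actually generated, and the `matches` helper is a hypothetical local stand-in for the filtering that Elasticsearch performs.

```python
import operator

# Metadata copied from the sample output above.
results = [
    {"year": 1972, "rating": 9.2, "genre": "crime", "title": "The Godfather"},
    {"year": 1979, "director": "Andrei Tarkovsky", "genre": "thriller", "rating": 9.9, "title": "Stalker"},
    {"year": 1974, "rating": 7.6, "director": "Joseph Sargent", "genre": "action", "title": "The Taking of Pelham One Two Three"},
]

# Assumed filter for "from the 1970s": 1970 <= year < 1980.
decade_filter = {"year": {"gte": 1970, "lt": 1980}}

OPS = {"gt": operator.gt, "gte": operator.ge, "lt": operator.lt, "lte": operator.le}


def matches(metadata: dict, flt: dict) -> bool:
    """Return True if every range bound in the filter holds for the metadata."""
    return all(
        field in metadata and OPS[op](metadata[field], bound)
        for field, bounds in flt.items()
        for op, bound in bounds.items()
    )


matching_titles = [m["title"] for m in results if matches(m, decade_filter)]
print(matching_titles)

# A document outside the decade, such as Prometheus (2012), would be filtered out.
prometheus = {"year": 2012, "rating": 7.7, "genre": "science fiction", "title": "Prometheus"}
print(matches(prometheus, decade_filter))
```

How "highly rated" is interpreted is left to the LLM, so the sketch deliberately does not assume a rating threshold.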