[Possibly the Smoothest LangChain Tutorial on the Web] Part 17: Advanced LangChain, Retrievers

Life cannot be like cooking, where you wait until every ingredient is prepared before anything goes into the pot.

01 Introduction to Retrievers

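A retriever is an interface that returns documents in response to an unstructured query. It is more general than a vector store: a retriever can be backed by a vector store, but it can also be backed by something else. (The retrievers covered here are almost all tied to vector stores.)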

A retriever takes a string query as input and returns a list of documents.

The top-level retriever interface looks like this:

class BaseRetriever(RunnableSerializable[RetrieverInput, RetrieverOutput], ABC):
    
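    # Fetch the documents relevant to the query
    @abstractmethod
    def _get_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        ...  # abstract: subclasses provide the implementation

    # Async variant: fetch the documents relevant to the query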
    async def _aget_relevant_documents(
        self, query: str, *, run_manager: AsyncCallbackManagerForRetrieverRun
    ) -> List[Document]:
        return await run_in_executor(
            None,
            self._get_relevant_documents,
            query,
            run_manager=run_manager.get_sync(),
        )

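If we later need custom retrieval logic, we can subclass this interface and implement these two methods. Only the synchronous method is abstract; as the code above shows, the async one defaults to running the synchronous method in an executor.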
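To make the subclassing pattern concrete, here is a minimal sketch that does not import LangChain at all. `Document`, `ToyBaseRetriever`, and `KeywordRetriever` are hypothetical stand-ins written for this illustration; the real `BaseRetriever` also handles callbacks, configuration, and async execution.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Document:
    """Hypothetical stand-in for LangChain's Document."""
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


class ToyBaseRetriever(ABC):
    """Hypothetical stand-in for BaseRetriever: invoke() delegates to the hook."""

    def invoke(self, query: str) -> List[Document]:
        return self._get_relevant_documents(query)

    @abstractmethod
    def _get_relevant_documents(self, query: str) -> List[Document]:
        ...


class KeywordRetriever(ToyBaseRetriever):
    """Returns up to k documents whose text contains the query (case-insensitive)."""

    def __init__(self, docs: List[Document], k: int = 2) -> None:
        self.docs = docs
        self.k = k

    def _get_relevant_documents(self, query: str) -> List[Document]:
        needle = query.lower()
        hits = [d for d in self.docs if needle in d.page_content.lower()]
        return hits[: self.k]


docs = [
    Document("I like apples"),
    Document("I like oranges"),
    Document("Apples and oranges are fruits"),
]
retriever = KeywordRetriever(docs, k=2)
results = retriever.invoke("apples")
print([d.page_content for d in results])
```

The caller only ever uses `invoke`; the retrieval strategy lives entirely in `_get_relevant_documents`, which is the same division of labor the interface above suggests.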

02 Retrievers in LangChain

Vector store-backed retriever

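A vector store-backed retriever uses a vector store to retrieve documents. It is a lightweight wrapper around the vector store class that exposes the store's search methods, such as similarity search and maximal marginal relevance (MMR), through the retriever interface.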

A demo:

from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

# Load the document
loader = TextLoader("index.txt")
documents = loader.load()

# Split the document into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

# Create the embedding model
embeddings = OpenAIEmbeddings()

# Build a FAISS vector store from the chunks
db = FAISS.from_documents(texts, embeddings)

# Turn the vector store into a retriever
retriever = db.as_retriever()

# 1. Basic similarity retrieval
docs = retriever.invoke("what did he say about ketanji brown jackson")

# 2. Maximal marginal relevance (MMR) retrieval
retriever = db.as_retriever(search_type="mmr")
docs = retriever.invoke("what did he say about ketanji brown jackson")

# 3. Similarity score threshold retrieval
# Results scoring below 0.5 are not returned
retriever = db.as_retriever(
    search_type="similarity_score_threshold", search_kwargs={"score_threshold": 0.5}
)
docs = retriever.invoke("what did he say about ketanji brown jackson")

# 4. Specify top-k
retriever = db.as_retriever(search_kwargs={"k": 1})
docs = retriever.invoke("what did he say about ketanji brown jackson")
len(docs)  # returns 1: only the single most similar result is returned
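
The MMR search type used above trades relevance against diversity. The following is a rough pure-Python sketch of the idea, not the FAISS or LangChain implementation; the toy vectors are made up, and `lambda_mult` borrows the name of LangChain's search kwarg (0.5 weighs relevance and diversity equally).

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(
    query_vec: Sequence[float],
    doc_vecs: List[Sequence[float]],
    k: int = 2,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Greedily pick k document indices, penalizing similarity to earlier picks."""
    selected: List[int] = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def mmr_score(i: int) -> float:
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


query_vec = [1.0, 0.0]
# Documents 0 and 1 are exact duplicates; document 2 is less similar but different.
doc_vecs = [[1.0, 0.1], [1.0, 0.1], [0.8, -0.6]]
picked = mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5)
print(picked)
```

A plain similarity search would return the two duplicates; here the second pick skips the duplicate in favor of the different document.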

Let's take a brief look at the source code:

class VectorStoreRetriever(BaseRetriever):
    
    # The concrete implementation
    # Under the hood it is still a vector search
    def _get_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        if self.search_type == "similarity":
            # This is a plain similarity search
            docs = self.vectorstore.similarity_search(query, **self.search_kwargs)
        elif self.search_type == "similarity_score_threshold":
            # Similarity search that also returns relevance scores
            docs_and_similarities = (
                self.vectorstore.similarity_search_with_relevance_scores(
                    query, **self.search_kwargs
                )
            )
            docs = [doc for doc, _ in docs_and_similarities]
        elif self.search_type == "mmr":
            # Maximal marginal relevance search
            docs = self.vectorstore.max_marginal_relevance_search(
                query, **self.search_kwargs
            )
        else:
            raise ValueError(f"search_type of {self.search_type} not allowed.")
        return docs

    # Async variant of the same logic
    async def _aget_relevant_documents(
        self, query: str, *, run_manager: AsyncCallbackManagerForRetrieverRun
    ) -> List[Document]:
        if self.search_type == "similarity":
            docs = await self.vectorstore.asimilarity_search(
                query, **self.search_kwargs
            )
        elif self.search_type == "similarity_score_threshold":
            docs_and_similarities = (
                await self.vectorstore.asimilarity_search_with_relevance_scores(
                    query, **self.search_kwargs
                )
            )
            docs = [doc for doc, _ in docs_and_similarities]
        elif self.search_type == "mmr":
            docs = await self.vectorstore.amax_marginal_relevance_search(
                query, **self.search_kwargs
            )
        else:
            raise ValueError(f"search_type of {self.search_type} not allowed.")
        return docs

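As the source shows, this retriever is essentially a vector similarity search. The other retrievers below likewise put their real logic in _get_relevant_documents and _aget_relevant_documents; we will not walk through each one, and interested readers can read the source themselves.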

MultiQueryRetriever

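MultiQueryRetriever automates query tuning: it uses a large language model (LLM) to rewrite the user's query from several different perspectives, which works around the limitations of distance-based vector retrieval and yields a richer set of results.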

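MultiQueryRetriever targets two problems with distance-based vector retrieval: small changes in query wording can produce different results, and embeddings may fail to capture the semantics of the data. It has the LLM generate several variants of the query, retrieves relevant documents for each variant, and takes the unique union of all results to form a more complete set of relevant documents.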

Usage:

from langchain_community.embeddings import HuggingFaceEmbeddings
from transformers.utils import is_torch_cuda_available, is_torch_mps_available
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.retrievers.multi_query import MultiQueryRetriever

# Load the Baidu Baike entry for the film 让子弹飞 (Let the Bullets Fly)
loader = WebBaseLoader("https://baike.baidu.com/item/%E8%AE%A9%E5%AD%90%E5%BC%B9%E9%A3%9E/5358")
data = loader.load()

# Split the document into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
splits = text_splitter.split_documents(data)

# VectorDB
# Embedding model (raw string keeps the backslashes in the Windows path literal)
EMBEDDING_DEVICE = "cuda" if is_torch_cuda_available() else "mps" if is_torch_mps_available() else "cpu"
embedding = HuggingFaceEmbeddings(model_name=r'D:\models\m3e-base', model_kwargs={'device': EMBEDDING_DEVICE})
vectordb = FAISS.from_documents(documents=splits, embedding=embedding)


# chat_model is assumed to be a chat model instance created elsewhere (not shown in this article)
question = "电影让子弹飞的影评怎么样？"
retriever_from_llm = MultiQueryRetriever.from_llm(
    retriever=vectordb.as_retriever(), llm=chat_model
)

unique_docs = retriever_from_llm.invoke(question)

Here the LLM's generalization ability is used to generate three questions from the original one:

[
'1. 对于电影《让子弹飞》的评论，观众们的评价如何？',
'2. 我想了解关于电影《让子弹飞》的专业影评，请问有哪些值得一看的观点？',
'3. 有没有人分享他们对《让子弹飞》这部电影的看法？特别是对于其艺术价值和娱乐性的评价如何？'
]

Each of the three questions is then run through similarity search, and the results are merged into a unique union.
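
The "retrieve per variant, then take the unique union" step can be sketched in a few lines. This is only an illustration under stated assumptions: `fake_generate_queries` stands in for the LLM call and `fake_search` for the vector store lookup, and both are hypothetical.

```python
from typing import Callable, Dict, List


def multi_query_retrieve(
    question: str,
    generate_queries: Callable[[str], List[str]],
    search: Callable[[str], List[str]],
) -> List[str]:
    """Run one search per generated query and keep each document once, in first-seen order."""
    seen = set()
    merged: List[str] = []
    for query in generate_queries(question):
        for doc in search(query):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


# Hypothetical stand-in for the LLM: always returns three fixed rewrites.
def fake_generate_queries(question: str) -> List[str]:
    return ["q1", "q2", "q3"]


# Hypothetical stand-in for the vector store: a lookup table of canned results.
FAKE_INDEX: Dict[str, List[str]] = {
    "q1": ["doc A", "doc B"],
    "q2": ["doc B", "doc C"],
    "q3": ["doc A", "doc D"],
}


def fake_search(query: str) -> List[str]:
    return FAKE_INDEX.get(query, [])


unique_docs = multi_query_retrieve("original question", fake_generate_queries, fake_search)
print(unique_docs)
```

Six hits across three queries collapse to four unique documents, which is why the merged result can be broader than any single query's result.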

Contextual Compression

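Contextual compression is a technique for making retrieval more efficient: before documents are returned, they are compressed and filtered using the context of the query, so that only information relevant to the query comes back.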

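Traditional retrieval has a limitation: the retrieved documents may contain a lot of irrelevant information, which leads to more expensive LLM calls and poorer responses. Contextual compression addresses this by using the query context before documents are returned; it can shorten the content of a document or drop irrelevant documents entirely.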

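A contextual compression retriever needs two components: a base retriever and a document compressor. The base retriever fetches the initial documents from the collection, and the document compressor then compresses and filters them to return more precise information.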

Without a contextual compression retriever:

from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from transformers.utils import is_torch_cuda_available, is_torch_mps_available
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter

EMBEDDING_DEVICE = "cuda" if is_torch_cuda_available() else "mps" if is_torch_mps_available() else "cpu"
embeddings = HuggingFaceEmbeddings(model_name=r'D:\models\m3e-base', model_kwargs={'device': EMBEDDING_DEVICE})

documents = TextLoader("index.txt",encoding='utf-8',autodetect_encoding=True).load()
text_splitter = CharacterTextSplitter(chunk_size=100, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
retriever = FAISS.from_documents(texts, embeddings).as_retriever()

docs = retriever.invoke("狮子王的经典台词")

"""
[Document(page_content='"永远不要小看自己，因为你永远不知道自己有多强大。" ——《狮子王》（The Lion King, 1994）\n\n"我会回来的。" ——《终结者》（The Terminator, 1984）', metadata={'source': 'index.txt'}),
 Document(page_content='这些台词不仅在电影中令人难忘，也在现实生活中激励着许多人。', metadata={'source': 'index.txt'}),
 Document(page_content='电影中的经典台词往往能够深入人心，成为人们记忆中的一部分。以下是一些电影中的经典台词：\n\n"生活就像一盒巧克力，你永远不知道你会得到什么。" ——《阿甘正传》（Forrest Gump, 1994）', metadata={'source': 'index.txt'}),
 Document(page_content='"不要让别人告诉你你能做什么，不能做什么。如果你有一个梦想，就去捍卫它。" ——《当幸福来敲门》（The Pursuit of Happyness, 2006）', metadata={'source': 'index.txt'})]
"""

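Now let's wrap the base retriever with a ContextualCompressionRetriever. We add an LLMChainExtractor, which iterates over the initially returned documents and extracts from each one only the content relevant to the query.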

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# llm_model is assumed to be an LLM instance created elsewhere (not shown in this article)
compressor = LLMChainExtractor.from_llm(llm_model)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)

compressed_docs = compression_retriever.invoke(
    "狮子王的经典台词"
)

"""
[Document(page_content='"永远不要小看自己，因为你永远不知道自己有多强大。" ——《狮子王》（The Lion King, 1994）', metadata={'source': 'index.txt'})]
"""

Clearly, only one result is returned this time, and it is the query-relevant portion extracted from the original Document: [Document(page_content='"永远不要小看自己，因为你永远不知道自己有多强大。" ——《狮子王》（The Lion King, 1994）\n\n"我会回来的。" ——《终结者》（The Terminator, 1984）', metadata={'source': 'index.txt'})].

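LLMChainFilter is a slightly simpler but more robust compressor: it uses an LLM chain to decide which of the initially retrieved documents to filter out and which to return, without modifying the document contents.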

from langchain.retrievers.document_compressors import LLMChainFilter

_filter = LLMChainFilter.from_llm(llm_model)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=_filter, base_retriever=retriever
)

compressed_docs = compression_retriever.invoke(
    "狮子王的经典台词"
)

"""
[Document(page_content='"永远不要小看自己，因为你永远不知道自己有多强大。" ——《狮子王》（The Lion King, 1994）\n\n"我会回来的。" ——《终结者》（The Terminator, 1984）', metadata={'source': 'index.txt'}),
 Document(page_content='这些台词不仅在电影中令人难忘，也在现实生活中激励着许多人。', metadata={'source': 'index.txt'})]
"""

Clearly, this step only filtered the list of documents; it did not change the content of the documents that were kept.

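Making an extra LLM call for every retrieved document is expensive and slow. EmbeddingsFilter offers a cheaper and faster option: it embeds the documents and the query and returns only those documents whose embeddings are sufficiently similar to the query.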

from langchain.retrievers.document_compressors import EmbeddingsFilter

EMBEDDING_DEVICE = "cuda" if is_torch_cuda_available() else "mps" if is_torch_mps_available() else "cpu"
embeddings = HuggingFaceEmbeddings(model_name=r'D:\models\m3e-base', model_kwargs={'device': EMBEDDING_DEVICE})

embeddings_filter = EmbeddingsFilter(embeddings=embeddings, similarity_threshold=0.76)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=embeddings_filter, base_retriever=retriever
)

compressed_docs = compression_retriever.invoke(
    "狮子王的经典台词"
)

"""
[_DocumentWithState(page_content='"永远不要小看自己，因为你永远不知道自己有多强大。" ——《狮子王》（The Lion King, 1994）\n\n"我会回来的。" ——《终结者》（The Terminator, 1984）', metadata={'source': 'index.txt'}, state={'embedded_doc': [xx,xx,xx...], 'query_similarity_score': 0.7641647574958407})]
"""

Here the score filter leaves a single result that meets the threshold.
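
The filtering idea itself is small enough to sketch without any library. The snippet below is an illustration under assumptions, not the EmbeddingsFilter source: the two-dimensional vectors are made-up stand-ins for real embeddings, and keeping scores at or above the threshold is a choice made for this sketch.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filter_by_similarity(
    query_vec: Sequence[float],
    docs: List[Tuple[str, Sequence[float]]],
    threshold: float,
) -> List[Tuple[str, float]]:
    """Keep (text, score) pairs whose similarity to the query is at or above the threshold."""
    kept = []
    for text, vec in docs:
        score = cosine_similarity(query_vec, vec)
        if score >= threshold:
            kept.append((text, score))
    return kept


# Made-up 2-d "embeddings"; real embedding vectors have hundreds of dimensions.
query_vec = [1.0, 0.0]
docs = [
    ("lion king quote", [0.9, 0.1]),
    ("closing remark", [0.5, 0.5]),
    ("unrelated text", [0.0, 1.0]),
]
kept = filter_by_similarity(query_vec, docs, threshold=0.76)
print(kept)
```

No LLM call is involved: the cost is one embedding per document plus a dot product, which is why this option is cheaper than the LLM-based compressors.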

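By combining different compressors and transformers, you can build flexible and efficient document-processing pipelines to meet different retrieval needs.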

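Below we create a compressor pipeline that first splits documents into smaller chunks, then removes redundant documents, and finally filters by relevance to the query.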

from langchain.retrievers.document_compressors import DocumentCompressorPipeline
from langchain_community.document_transformers import EmbeddingsRedundantFilter
from langchain_text_splitters import CharacterTextSplitter

EMBEDDING_DEVICE = "cuda" if is_torch_cuda_available() else "mps" if is_torch_mps_available() else "cpu"
embeddings = HuggingFaceEmbeddings(model_name=r'D:\models\m3e-base', model_kwargs={'device': EMBEDDING_DEVICE})

# Split the documents into smaller chunks, drop redundant ones, then filter by relevance to the query.
splitter = CharacterTextSplitter(chunk_size=100, chunk_overlap=0)
redundant_filter = EmbeddingsRedundantFilter(embeddings=embeddings)
relevant_filter = EmbeddingsFilter(embeddings=embeddings, similarity_threshold=0.76)
pipeline_compressor = DocumentCompressorPipeline(
    transformers=[splitter, redundant_filter, relevant_filter]
)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=pipeline_compressor, base_retriever=retriever
)

compressed_docs = compression_retriever.invoke(
    "狮子王的经典台词"
)

"""
[_DocumentWithState(
    page_content='"永远不要小看自己，因为你永远不知道自己有多强大。" ——《狮子王》（The Lion King, 1994）\n\n"我会回来的。" ——《终结者》（The Terminator, 1984）', 
    metadata={'source': 'index.txt'}, 
    state={
        'embedded_doc': [xx,xx,xx...], 
        'query_similarity_score': 0.7641647574958407
        }
    )
]
"""

Ensemble Retriever

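Ensemble Retriever is a composite retriever: it runs several retrievers at once, such as BM25 and FAISS, merges their results, and reranks them with the Reciprocal Rank Fusion (RRF) algorithm to improve retrieval performance. This combines the strengths of sparse and dense retrievers: the former are good at keyword matching and the latter at semantic similarity, which gives the so-called "hybrid search".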
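
The fusion step can be sketched in pure Python. This is a minimal illustration of weighted Reciprocal Rank Fusion, not the EnsembleRetriever source; the constant `c = 60` is a commonly used RRF default and is an assumption here, as are the toy ranked lists.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def weighted_rrf(
    ranked_lists: Sequence[Sequence[str]],
    weights: Sequence[float],
    c: int = 60,
) -> List[str]:
    """Fuse several ranked lists: each hit adds weight / (rank + c) to its document's score."""
    scores: Dict[str, float] = defaultdict(float)
    for docs, weight in zip(ranked_lists, weights):
        for rank, doc in enumerate(docs, start=1):
            scores[doc] += weight / (rank + c)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)


sparse_hits = ["a", "b", "c"]  # e.g. from a keyword retriever
dense_hits = ["b", "c", "d"]   # e.g. from a vector retriever
fused = weighted_rrf([sparse_hits, dense_hits], weights=[0.5, 0.5])
print(fused)
```

Documents that appear in both lists ("b" and "c") collect two contributions and rise above "a", even though "a" was ranked first by one retriever.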

Example code using EnsembleRetriever:

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS

EMBEDDING_DEVICE = "cuda" if is_torch_cuda_available() else "mps" if is_torch_mps_available() else "cpu"
embedding = HuggingFaceEmbeddings(model_name=r'D:\models\m3e-base', model_kwargs={'device': EMBEDDING_DEVICE})

doc_list_1 = [
    "I like apples",
    "I like oranges",
    "Apples and oranges are fruits",
]

# initialize the bm25 retriever and faiss retriever
bm25_retriever = BM25Retriever.from_texts(
    doc_list_1, metadatas=[{"source": 1}] * len(doc_list_1)
)
bm25_retriever.k = 2

doc_list_2 = [
    "You like apples",
    "You like oranges",
]

faiss_vectorstore = FAISS.from_texts(
    doc_list_2, embedding, metadatas=[{"source": 2}] * len(doc_list_2)
)
faiss_retriever = faiss_vectorstore.as_retriever(search_kwargs={"k": 2})

# initialize the ensemble retriever
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, faiss_retriever], weights=[0.5, 0.5]
)

docs = ensemble_retriever.invoke("apples")

"""
[Document(page_content='I like apples', metadata={'source': 1}),
 Document(page_content='You like apples', metadata={'source': 2}),
 Document(page_content='Apples and oranges are fruits', metadata={'source': 1}),
 Document(page_content='You like oranges', metadata={'source': 2})]
"""

Long-Context Reorder

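Language models can perform worse when the information they need sits inside a long context. Once more than 10 retrieved documents are included, the model tends to overlook some of the provided documents. To mitigate this, the documents can be reordered after retrieval: placing the most relevant documents at the beginning and end of the context helps the model make better use of them when answering the query.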
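
One way to produce such an ordering is sketched below. It is an illustration written for this article rather than the LongContextReorder source, and the function name is hypothetical; the input is assumed to be sorted from most to least relevant.

```python
from typing import List, TypeVar

T = TypeVar("T")


def reorder_for_long_context(docs_by_relevance: List[T]) -> List[T]:
    """Push the least relevant items to the middle and the most relevant to both ends."""
    front: List[T] = []
    back: List[T] = []
    # Walk from least to most relevant, alternating between the two halves,
    # so that later (more relevant) items end up at the outer edges.
    for i, doc in enumerate(reversed(docs_by_relevance)):
        if i % 2 == 1:
            back.append(doc)
        else:
            front.insert(0, doc)
    return front + back


# Ranks 0..9, where 0 is the most relevant document.
ranks = list(range(10))
reordered = reorder_for_long_context(ranks)
print(reordered)
```

In the result, ranks 1 and 0 sit at the two ends while ranks 9 and 8 meet in the middle, which is the pattern the paragraph above describes.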

from langchain.chains import LLMChain, StuffDocumentsChain
from langchain_community.document_transformers import (
    LongContextReorder,
)
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import PromptTemplate
from transformers.utils import is_torch_cuda_available, is_torch_mps_available

# Get embeddings.
EMBEDDING_DEVICE = "cuda" if is_torch_cuda_available() else "mps" if is_torch_mps_available() else "cpu"
embedding = HuggingFaceEmbeddings(model_name=r'D:\models\m3e-base', model_kwargs={'device': EMBEDDING_DEVICE})

texts = [
    "Basquetball is a great sport.",
    "Fly me to the moon is one of my favourite songs.",
    "The Celtics are my favourite team.",
    "This is a document about the Boston Celtics",
    "I simply love going to the movies",
    "The Boston Celtics won the game by 20 points",
    "This is just a random text.",
    "Elden Ring is one of the best games in the last 15 years.",
    "L. Kornet is one of the best Celtics players.",
    "Larry Bird was an iconic NBA player.",
]

# Create a retriever
retriever = FAISS.from_texts(texts, embedding=embedding).as_retriever(
    search_kwargs={"k": 10}
)
query = "What can you tell me about the Celtics?"

# Get relevant documents ordered by relevance score
docs = retriever.invoke(query)

# Results in the original relevance order
"""
[Document(page_content='The Celtics are my favourite team.'),
 Document(page_content='This is a document about the Boston Celtics'),
 Document(page_content='The Boston Celtics won the game by 20 points'),
 Document(page_content='L. Kornet is one of the best Celtics players.'),
 Document(page_content='This is just a random text.'),
 Document(page_content='I simply love going to the movies'),
 Document(page_content='Basquetball is a great sport.'),
 Document(page_content='Fly me to the moon is one of my favourite songs.'),
 Document(page_content='Elden Ring is one of the best games in the last 15 years.'),
 Document(page_content='Larry Bird was an iconic NBA player.')]
"""

# ================= Reordering ====================
# Reorder the documents:
# Less relevant document will be at the middle of the list and more
# relevant elements at beginning / end.
reordering = LongContextReorder()
reordered_docs = reordering.transform_documents(docs)

# Confirm that the 4 relevant documents are at beginning and end.
# After reordering, the most relevant documents appear at the beginning and end
"""
[Document(page_content='This is a document about the Boston Celtics'),
 Document(page_content='L. Kornet is one of the best Celtics players.'),
 Document(page_content='I simply love going to the movies'),
 Document(page_content='Fly me to the moon is one of my favourite songs.'),
 Document(page_content='Larry Bird was an iconic NBA player.'),
 Document(page_content='Elden Ring is one of the best games in the last 15 years.'),
 Document(page_content='Basquetball is a great sport.'),
 Document(page_content='This is just a random text.'),
 Document(page_content='The Boston Celtics won the game by 20 points'),
 Document(page_content='The Celtics are my favourite team.')]
"""

MultiVector Retriever

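LangChain provides MultiVectorRetriever, which lets you store multiple vectors per document so that documents can be retrieved more accurately. There are several ways to create these vectors: splitting a document into smaller chunks (this is what ParentDocumentRetriever does), generating a summary of each document and embedding the summary, or creating hypothetical questions that the document could answer and embedding those questions. You can also add embeddings manually for finer control over retrieval.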

First, load the documents as usual. The chunk_size here should be set fairly large, because these chunks serve as the parent documents.

from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = TextLoader("index.txt",encoding='utf-8',autodetect_encoding=True)
docs = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=300)
docs = text_splitter.split_documents(docs)

"""
[Document(page_content='电影中的经典台词往往能够深入人心，成为人们记忆中的一部分。以下是一些电影中的经典台词：\n\n"生活就像一盒巧克力，你永远不知道你会得到什么。" ——《阿甘正传》（Forrest Gump, 1994）\n\n"永远不要小看自己，因为你永远不知道自己有多强大。" ——《狮子王》（The Lion King, 1994）\n\n"我会回来的。" ——《终结者》（The Terminator, 1984）\n\n"你不能改变过去，但你可以改变未来。" ——《回到未来》（Back to the Future, 1985）\n\n"即使世界末日来临，我也要和你一起度过。" ——《泰坦尼克号》（Titanic, 1997）', metadata={'source': 'index.txt'}),
 Document(page_content='"我会回来的。" ——《终结者》（The Terminator, 1984）\n\n"你不能改变过去，但你可以改变未来。" ——《回到未来》（Back to the Future, 1985）\n\n"即使世界末日来临，我也要和你一起度过。" ——《泰坦尼克号》（Titanic, 1997）\n\n"不要让别人告诉你你能做什么，不能做什么。如果你有一个梦想，就去捍卫它。" ——《当幸福来敲门》（The Pursuit of Happyness, 2006）\n\n"我将永远爱你。" ——《保镖》（The Bodyguard, 1992）', metadata={'source': 'index.txt'}),
 Document(page_content='"即使世界末日来临，我也要和你一起度过。" ——《泰坦尼克号》（Titanic, 1997）\n\n"不要让别人告诉你你能做什么，不能做什么。如果你有一个梦想，就去捍卫它。" ——《当幸福来敲门》（The Pursuit of Happyness, 2006）\n\n"我将永远爱你。" ——《保镖》（The Bodyguard, 1992）\n\n"这就是生活，小家伙。" ——《美丽人生》（Life is Beautiful, 1997）\n\n"我有一个梦想。" ——《我有一个梦想》（I Have a Dream, 1963）', metadata={'source': 'index.txt'}),
 Document(page_content='"我将永远爱你。" ——《保镖》（The Bodyguard, 1992）\n\n"这就是生活，小家伙。" ——《美丽人生》（Life is Beautiful, 1997）\n\n"我有一个梦想。" ——《我有一个梦想》（I Have a Dream, 1963）\n\n"你只需要跟随你的黄砖路。" ——《绿野仙踪》（The Wizard of Oz, 1939）\n\n这些台词不仅在电影中令人难忘，也在现实生活中激励着许多人。', metadata={'source': 'index.txt'})]
"""

Smaller chunks example:

from langchain.retrievers import MultiVectorRetriever
from langchain_core.stores import InMemoryByteStore
from langchain_community.embeddings import HuggingFaceEmbeddings
from transformers.utils import is_torch_cuda_available, is_torch_mps_available
from langchain_community.vectorstores import FAISS

EMBEDDING_DEVICE = "cuda" if is_torch_cuda_available() else "mps" if is_torch_mps_available() else "cpu"
embedding = HuggingFaceEmbeddings(model_name=r'D:\models\m3e-base', model_kwargs={'device': EMBEDDING_DEVICE})

# The vectorstore to use to index the child chunks
vectorstore = FAISS.from_documents(docs,embedding)

# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"

# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)

import uuid

doc_ids = [str(uuid.uuid4()) for _ in docs]

# Four parent document ids are generated here
"""
['acff96e4-e86d-4305-8446-ddb45de574b3',
 '333dfffd-f245-46cd-b1f1-ebe7bffe25c1',
 'ad7e4e35-ba91-4add-b03b-733d7805e9d2',
 'd1ed38d3-a301-45fc-be4c-380e0a2480cf']
"""

from langchain_text_splitters import RecursiveCharacterTextSplitter

# The splitter to use to create smaller chunks
child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=100,chunk_overlap=10)
sub_docs = []
for i, doc in enumerate(docs):
    _id = doc_ids[i]
    # Split each parent document again to form child documents
    _sub_docs = child_text_splitter.split_documents([doc])
    for _doc in _sub_docs:
        _doc.metadata[id_key] = _id
    sub_docs.extend(_sub_docs)

# Add the child documents to the vector store
retriever.vectorstore.add_documents(sub_docs)
# Map parent ids to parent documents so they can be looked up at query time
retriever.docstore.mset(list(zip(doc_ids, docs)))

# Query
documents = retriever.vectorstore.similarity_search("狮子王")

"""
[Document(page_content='"永远不要小看自己，因为你永远不知道自己有多强大。" ——《狮子王》（The Lion King, 1994）\n\n"我会回来的。" ——《终结者》（The Terminator, 1984）', metadata={'source': 'index.txt', 'doc_id': 'acff96e4-e86d-4305-8446-ddb45de574b3'}),
 Document(page_content='"不要让别人告诉你你能做什么，不能做什么。如果你有一个梦想，就去捍卫它。" ——《当幸福来敲门》（The Pursuit of Happyness, 2006）', metadata={'source': 'index.txt', 'doc_id': '333dfffd-f245-46cd-b1f1-ebe7bffe25c1'}),
 Document(page_content='"不要让别人告诉你你能做什么，不能做什么。如果你有一个梦想，就去捍卫它。" ——《当幸福来敲门》（The Pursuit of Happyness, 2006）', metadata={'source': 'index.txt', 'doc_id': 'ad7e4e35-ba91-4add-b03b-733d7805e9d2'}),
 Document(page_content='电影中的经典台词往往能够深入人心，成为人们记忆中的一部分。以下是一些电影中的经典台词：\n\n"生活就像一盒巧克力，你永远不知道你会得到什么。" ——《阿甘正传》（Forrest Gump, 1994）\n\n"永远不要小看自己，因为你永远不知道自己有多强大。" ——《狮子王》（The Lion King, 1994）\n\n"我会回来的。" ——《终结者》（The Terminator, 1984）\n\n"你不能改变过去，但你可以改变未来。" ——《回到未来》（Back to the Future, 1985）\n\n"即使世界末日来临，我也要和你一起度过。" ——《泰坦尼克号》（Titanic, 1997）', metadata={'source': 'index.txt'})]
"""

Summary example:

import uuid

from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

# Build the chain with LCEL (chat_model is assumed to be defined elsewhere)
chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("总结下面的一段文本:\n\n{doc}")
    | chat_model
    | StrOutputParser()
)

# Summarize each original document
summaries = chain.batch(docs, {"max_concurrency": 5})

# Embedding model (raw string keeps the backslashes in the Windows path literal)
EMBEDDING_DEVICE = "cuda" if is_torch_cuda_available() else "mps" if is_torch_mps_available() else "cpu"
embedding = HuggingFaceEmbeddings(model_name=r'D:\models\m3e-base', model_kwargs={'device': EMBEDDING_DEVICE})

# The vectorstore to use to index the child chunks
vectorstore = FAISS.from_documents(docs,embedding)
# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]

# Build the list of summary Documents
summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]

# Add the summary Documents to the vector store
retriever.vectorstore.add_documents(summary_docs)
# Map each id to its original document
retriever.docstore.mset(list(zip(doc_ids, docs)))

# Run the vector search
sub_docs = vectorstore.similarity_search("泰坦尼克号")

"""
[Document(page_content='"我会回来的。" ——《终结者》（The Terminator, 1984）\n\n"你不能改变过去，但你可以改变未来。" ——《回到未来》（Back to the Future, 1985）\n\n"即使世界末日来临，我也要和你一起度过。" ——《泰坦尼克号》（Titanic, 1997）\n\n"不要让别人告诉你你能做什么，不能做什么。如果你有一个梦想，就去捍卫它。" ——《当幸福来敲门》（The Pursuit of Happyness, 2006）\n\n"我将永远爱你。" ——《保镖》（The Bodyguard, 1992）', metadata={'source': 'index.txt'}),
 Document(page_content='"即使世界末日来临，我也要和你一起度过。" ——《泰坦尼克号》（Titanic, 1997）\n\n"不要让别人告诉你你能做什么，不能做什么。如果你有一个梦想，就去捍卫它。" ——《当幸福来敲门》（The Pursuit of Happyness, 2006）\n\n"我将永远爱你。" ——《保镖》（The Bodyguard, 1992）\n\n"这就是生活，小家伙。" ——《美丽人生》（Life is Beautiful, 1997）\n\n"我有一个梦想。" ——《我有一个梦想》（I Have a Dream, 1963）', metadata={'source': 'index.txt'}),
 Document(page_content='这些引述表达了不同的情感和生活理念：\n\n1. 《泰坦尼克号》中的台词体现了坚定不移的爱情承诺，即使在极端困境中也不放弃对彼此的陪伴。\n\n2. 《当幸福来敲门》传递的是鼓舞人心的信息，鼓励人们追求自己的梦想，不受他人的限制或质疑。\n\n3. 《保镖》中的经典表白表达了永恒不变的爱情誓言，让人感受到深情与承诺的力量。\n\n4. 《美丽人生》则以一种乐观和智慧的方式揭示生活的真谛，提醒我们无论处境如何，都要以积极的态度面对。\n\n5. 最后，《我有一个梦想》是马丁·路德·金的著名演讲，体现了对平等、自由和平等权利的强烈渴望，激励人们为理想而奋斗。', metadata={'doc_id': 'b40d73bb-6def-4418-8137-49edaaf6131f'}),
 Document(page_content='这段文本讲述了电影中的经典台词对人们的影响，列举了几个深入人心的例句。这些台词包括《阿甘正传》中关于生活不确定性的哲理，“永远不要小看自己”的鼓舞人心语句，《终结者》中的承诺，《回到未来》关于把握未来的警醒，以及《泰坦尼克号》中表达的爱情誓言。这些台词不仅在电影中起着重要的剧情推动作用，也在观众心中留下了深刻的印象。', metadata={'doc_id': 'eeb6a418-5cd0-4218-9f53-b739713c7c51'})]
"""

Hypothetical Queries example:

# Define the function schema (OpenAI-specific function calling)
functions = [
    {
        "name": "hypothetical_questions",
        "description": "Generate hypothetical questions",
        "parameters": {
            "type": "object",
            "properties": {
                "questions": {
                    "type": "array",
                    "items": {"type": "string"},
                },
            },
            "required": ["questions"],
        },
    }
]

from langchain.output_parsers.openai_functions import JsonKeyOutputFunctionsParser

# Build the chain with LCEL; the model is forced to call the predefined function
chain = (
    {"doc": lambda x: x.page_content}
    # Only asking for 3 hypothetical questions, but this could be adjusted
    | ChatPromptTemplate.from_template(
        "Generate a list of exactly 3 hypothetical questions that the below document could be used to answer:\n\n{doc}"
    )
    | chat_model.bind(
        functions=functions, function_call={"name": "hypothetical_questions"}
    )
    | JsonKeyOutputFunctionsParser(key_name="questions")
)

# Run the function-calling chain
hypothetical_questions = chain.batch(docs, {"max_concurrency": 5})

# Embedding model (raw string keeps the backslashes in the Windows path literal)
EMBEDDING_DEVICE = "cuda" if is_torch_cuda_available() else "mps" if is_torch_mps_available() else "cpu"
embedding = HuggingFaceEmbeddings(model_name=r'D:\models\m3e-base', model_kwargs={'device': EMBEDDING_DEVICE})

# Vector store
vectorstore = FAISS.from_documents(docs,embedding)

# Storage for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"

# Define the retriever
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]

# Build the list of question Documents
question_docs = []
for i, question_list in enumerate(hypothetical_questions):
    question_docs.extend(
        [Document(page_content=s, metadata={id_key: doc_ids[i]}) for s in question_list]
    )

# Add the question Documents to the vector store
retriever.vectorstore.add_documents(question_docs)
# Map each id to its original document
retriever.docstore.mset(list(zip(doc_ids, docs)))

# Run the vector search
sub_docs = vectorstore.similarity_search("狮子王")

Parent Document Retriever

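ParentDocumentRetriever balances precision against context: it splits the data into small chunks and stores them, and at query time it can either return the precise small chunks or look up their parent IDs and return the larger documents.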

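It addresses a tension that arises when splitting documents for retrieval: you want small documents so that their embeddings accurately reflect their meaning, but you also want documents long enough to keep the context of each chunk. ParentDocumentRetriever first splits the data into small chunks and stores them; at retrieval time it fetches the small chunks, then looks up their parent IDs and returns the larger documents.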
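
The child-to-parent lookup can be sketched without any vector store. In the snippet below, substring matching is a deliberately crude stand-in for embedding search, and all names and texts are hypothetical; only the "search small chunks, return their parents" flow is the point.

```python
from typing import Dict, List, Tuple


def split_text(text: str, size: int) -> List[str]:
    """Fixed-width splitter, a simplified stand-in for a real text splitter."""
    return [text[i:i + size] for i in range(0, len(text), size)]


parents: Dict[str, str] = {
    "p1": "The Lion King is an animated film. It was released in 1994.",
    "p2": "The Terminator is a science fiction film. It was released in 1984.",
}

# Index step: every child chunk remembers the id of the parent it came from.
children: List[Tuple[str, str]] = [
    (chunk, parent_id)
    for parent_id, text in parents.items()
    for chunk in split_text(text, size=20)
]


def search_children(query: str) -> List[Tuple[str, str]]:
    """Substring match as a stand-in for similarity search over child chunks."""
    return [(chunk, pid) for chunk, pid in children if query.lower() in chunk.lower()]


def retrieve_parents(query: str) -> List[str]:
    """Find matching children, then return each distinct parent once."""
    parent_ids: List[str] = []
    for _, pid in search_children(query):
        if pid not in parent_ids:
            parent_ids.append(pid)
    return [parents[pid] for pid in parent_ids]


print(search_children("Lion"))
print(retrieve_parents("Lion"))
```

The match happens on a 20-character chunk, but the caller gets back the whole parent text, so the surrounding context is preserved.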

Example code:

from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = TextLoader("index.txt",encoding='utf-8',autodetect_encoding=True)
docs = loader.load()

from langchain.retrievers import ParentDocumentRetriever
from langchain_core.stores import InMemoryByteStore, InMemoryStore
from langchain_community.embeddings import HuggingFaceEmbeddings
from transformers.utils import is_torch_cuda_available, is_torch_mps_available
from langchain_community.vectorstores import FAISS

EMBEDDING_DEVICE = "cuda" if is_torch_cuda_available() else "mps" if is_torch_mps_available() else "cpu"
embedding = HuggingFaceEmbeddings(model_name=r'D:\models\m3e-base', model_kwargs={'device': EMBEDDING_DEVICE})

# This text splitter is used to create the child documents
child_splitter = RecursiveCharacterTextSplitter(chunk_size=100,chunk_overlap=10)

# The vectorstore to use to index the child chunks
vectorstore = FAISS.from_documents(docs,embedding)

# The storage layer for the parent documents
store = InMemoryStore()

# The retriever (empty to start)
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

retriever.add_documents(docs, ids=None)

"""
list(store.yield_keys()) outputs the following: there is only one id, because the whole file was treated as a single Document

['0d15e752-fb35-4190-a15c-54aadad82e47']
"""

# Retrieve the small chunks
sub_docs = vectorstore.similarity_search("狮子王")
"""
[Document(page_content='"永远不要小看自己，因为你永远不知道自己有多强大。" ——《狮子王》（The Lion King, 1994）\n\n"我会回来的。" ——《终结者》（The Terminator, 1984）', metadata={'source': 'index.txt', 'doc_id': '6dbe9646-9544-4fa6-94d9-325242d9c443'}),
 Document(page_content='"不要让别人告诉你你能做什么，不能做什么。如果你有一个梦想，就去捍卫它。" ——《当幸福来敲门》（The Pursuit of Happyness, 2006）', metadata={'source': 'index.txt', 'doc_id': '6dbe9646-9544-4fa6-94d9-325242d9c443'}),
 Document(page_content='这些台词不仅在电影中令人难忘，也在现实生活中激励着许多人。', metadata={'source': 'index.txt', 'doc_id': '6dbe9646-9544-4fa6-94d9-325242d9c443'}),
 Document(page_content='"你不能改变过去，但你可以改变未来。" ——《回到未来》（Back to the Future, 1985）', metadata={'source': 'index.txt', 'doc_id': '6dbe9646-9544-4fa6-94d9-325242d9c443'})]
"""

# Retrieve the full document
retrieved_docs = retriever.invoke("狮子王")
"""
[Document(page_content='电影中的经典台词往往能够深入人心，成为人们记忆中的一部分。以下是一些电影中的经典台词：\n\n"生活就像一盒巧克力，你永远不知道你会得到什么。" ——《阿甘正传》（Forrest Gump, 1994）\n\n"永远不要小看自己，因为你永远不知道自己有多强大。" ——《狮子王》（The Lion King, 1994）\n\n"我会回来的。" ——《终结者》（The Terminator, 1984）\n\n"你不能改变过去，但你可以改变未来。" ——《回到未来》（Back to the Future, 1985）\n\n"即使世界末日来临，我也要和你一起度过。" ——《泰坦尼克号》（Titanic, 1997）\n\n"不要让别人告诉你你能做什么，不能做什么。如果你有一个梦想，就去捍卫它。" ——《当幸福来敲门》（The Pursuit of Happyness, 2006）\n\n"我将永远爱你。" ——《保镖》（The Bodyguard, 1992）\n\n"这就是生活，小家伙。" ——《美丽人生》（Life is Beautiful, 1997）\n\n"我有一个梦想。" ——《我有一个梦想》（I Have a Dream, 1963）\n\n"你只需要跟随你的黄砖路。" ——《绿野仙踪》（The Wizard of Oz, 1939）\n\n这些台词不仅在电影中令人难忘，也在现实生活中激励着许多人。', metadata={'source': 'index.txt'})]
"""

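Sometimes the full documents are too large to retrieve as they are. In that case, what we really want is to first split the raw documents into larger chunks, and then split those into smaller chunks. We index the smaller chunks, but at retrieval time we return the larger chunks (still not the full documents).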

# This text splitter is used to create the parent documents
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=300)
# This text splitter is used to create the child documents
# It should create documents smaller than the parent
child_splitter = RecursiveCharacterTextSplitter(chunk_size=100,chunk_overlap=10)
# The vectorstore to use to index the child chunks
EMBEDDING_DEVICE = "cuda" if is_torch_cuda_available() else "mps" if is_torch_mps_available() else "cpu"
embedding = HuggingFaceEmbeddings(model_name=r'D:\models\m3e-base', model_kwargs={'device': EMBEDDING_DEVICE})
vectorstore = FAISS.from_documents(docs,embedding)
# The storage layer for the parent documents
store = InMemoryStore()

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

retriever.add_documents(docs)

"""
Here len(list(store.yield_keys())) returns 4, meaning there are 4 parent document ids
"""

sub_docs = vectorstore.similarity_search("狮子王")
"""
[Document(page_content='"永远不要小看自己，因为你永远不知道自己有多强大。" ——《狮子王》（The Lion King, 1994）\n\n"我会回来的。" ——《终结者》（The Terminator, 1984）', metadata={'source': 'index.txt', 'doc_id': '8f4c2878-0cb8-4a86-94e1-f80bc977de6e'}),
 Document(page_content='"不要让别人告诉你你能做什么，不能做什么。如果你有一个梦想，就去捍卫它。" ——《当幸福来敲门》（The Pursuit of Happyness, 2006）', metadata={'source': 'index.txt', 'doc_id': 'bd4631ac-4b97-47af-b4c6-58f87d10c79e'}),
 Document(page_content='"不要让别人告诉你你能做什么，不能做什么。如果你有一个梦想，就去捍卫它。" ——《当幸福来敲门》（The Pursuit of Happyness, 2006）', metadata={'source': 'index.txt', 'doc_id': 'c81dbd21-cff4-4208-aab1-9bf499ad1b0c'}),
 Document(page_content='这些台词不仅在电影中令人难忘，也在现实生活中激励着许多人。', metadata={'source': 'index.txt', 'doc_id': '3d4c986c-c259-4005-8d39-f99f886d267a'})]
"""

retrieved_docs = retriever.invoke("狮子王")
"""
[Document(page_content='电影中的经典台词往往能够深入人心，成为人们记忆中的一部分。以下是一些电影中的经典台词：\n\n"生活就像一盒巧克力，你永远不知道你会得到什么。" ——《阿甘正传》（Forrest Gump, 1994）\n\n"永远不要小看自己，因为你永远不知道自己有多强大。" ——《狮子王》（The Lion King, 1994）\n\n"我会回来的。" ——《终结者》（The Terminator, 1984）\n\n"你不能改变过去，但你可以改变未来。" ——《回到未来》（Back to the Future, 1985）\n\n"即使世界末日来临，我也要和你一起度过。" ——《泰坦尼克号》（Titanic, 1997）', metadata={'source': 'index.txt'}),
 Document(page_content='"我会回来的。" ——《终结者》（The Terminator, 1984）\n\n"你不能改变过去，但你可以改变未来。" ——《回到未来》（Back to the Future, 1985）\n\n"即使世界末日来临，我也要和你一起度过。" ——《泰坦尼克号》（Titanic, 1997）\n\n"不要让别人告诉你你能做什么，不能做什么。如果你有一个梦想，就去捍卫它。" ——《当幸福来敲门》（The Pursuit of Happyness, 2006）\n\n"我将永远爱你。" ——《保镖》（The Bodyguard, 1992）', metadata={'source': 'index.txt'}),
 Document(page_content='"即使世界末日来临，我也要和你一起度过。" ——《泰坦尼克号》（Titanic, 1997）\n\n"不要让别人告诉你你能做什么，不能做什么。如果你有一个梦想，就去捍卫它。" ——《当幸福来敲门》（The Pursuit of Happyness, 2006）\n\n"我将永远爱你。" ——《保镖》（The Bodyguard, 1992）\n\n"这就是生活，小家伙。" ——《美丽人生》（Life is Beautiful, 1997）\n\n"我有一个梦想。" ——《我有一个梦想》（I Have a Dream, 1963）', metadata={'source': 'index.txt'}),
 Document(page_content='"我将永远爱你。" ——《保镖》（The Bodyguard, 1992）\n\n"这就是生活，小家伙。" ——《美丽人生》（Life is Beautiful, 1997）\n\n"我有一个梦想。" ——《我有一个梦想》（I Have a Dream, 1963）\n\n"你只需要跟随你的黄砖路。" ——《绿野仙踪》（The Wizard of Oz, 1939）\n\n这些台词不仅在电影中令人难忘，也在现实生活中激励着许多人。', metadata={'source': 'index.txt'})]
"""

Self Query

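A self-query retriever automatically builds a structured query and applies it to the vector store: it extracts filters from the user's query and applies them to the metadata of the stored documents.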

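A self-query retriever converts a natural-language query into a structured query and applies it to the underlying vector store (VectorStore), such as Chroma. Besides semantic similarity matching on the user's query, it can extract filter conditions on document metadata from the query and execute them.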

The self-query retriever places requirements on the vector store; some vector stores are not supported (FAISS is not).
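
To show what "applying a structured filter to metadata" means, here is a small library-free sketch. The `Comparison` class and the comparator names are hypothetical simplifications written for this illustration, and the structured query is written by hand where an LLM would normally produce it.

```python
import operator
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Comparison:
    """Hypothetical structured filter of the form: attribute, comparator, value."""
    attribute: str
    comparator: str  # one of "gt", "lt", "eq"
    value: Any


COMPARATORS = {"gt": operator.gt, "lt": operator.lt, "eq": operator.eq}


def apply_filter(docs: List[Dict[str, Any]], flt: Comparison) -> List[Dict[str, Any]]:
    """Keep documents whose metadata has the attribute and satisfies the comparison."""
    compare = COMPARATORS[flt.comparator]
    return [
        d for d in docs
        if flt.attribute in d["metadata"]
        and compare(d["metadata"][flt.attribute], flt.value)
    ]


movies = [
    {"page_content": "dinosaurs", "metadata": {"year": 1993, "rating": 7.7}},
    {"page_content": "dreams", "metadata": {"year": 2006, "rating": 8.6}},
    {"page_content": "toys", "metadata": {"year": 1995}},
    {"page_content": "the Zone", "metadata": {"year": 1979, "rating": 9.9}},
]

# For a query such as "I want to watch a movie rated higher than 8.5",
# the LLM would be expected to produce a filter roughly like this one.
structured = Comparison(attribute="rating", comparator="gt", value=8.5)
matches = apply_filter(movies, structured)
print([m["page_content"] for m in matches])
```

Documents without a `rating` field are skipped rather than treated as errors, which is one possible design choice for missing metadata.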

from langchain_core.documents import Document
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever

# Create a list of sample documents
docs = [
    Document(
        page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata={"year": 1993, "rating": 7.7, "genre": "science fiction"},
    ),
    Document(
        page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
        metadata={"year": 2010, "director": "Christopher Nolan", "rating": 8.2},
    ),
    Document(
        page_content="A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
        metadata={"year": 2006, "director": "Satoshi Kon", "rating": 8.6},
    ),
    Document(
        page_content="A bunch of normal-sized women are supremely wholesome and some men pine after them",
        metadata={"year": 2019, "director": "Greta Gerwig", "rating": 8.3},
    ),
    Document(
        page_content="Toys come alive and have a blast doing so",
        metadata={"year": 1995, "genre": "animated"},
    ),
    Document(
        page_content="Three men walk into the Zone, three men walk out of the Zone",
        metadata={
            "year": 1979,
            "director": "Andrei Tarkovsky",
            "genre": "thriller",
            "rating": 9.9,
        },
    ),
]

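# Define the embedding model (raw string keeps the backslashes in the Windows path literal)
EMBEDDING_DEVICE = "cuda" if is_torch_cuda_available() else "mps" if is_torch_mps_available() else "cpu"
embedding = HuggingFaceEmbeddings(model_name=r'D:\models\m3e-base', model_kwargs={'device': EMBEDDING_DEVICE})
# FAISS is not supported by the self-query retriever, so Chroma is used here
# (assumes the chromadb package is installed)
from langchain_community.vectorstores import Chroma
vectorstore = Chroma.from_documents(docs, embedding)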

# Define the metadata field descriptions
metadata_field_info = [
    AttributeInfo(
        name="genre",
        description="The genre of the movie. One of ['science fiction', 'comedy', 'drama', 'thriller', 'romance', 'action', 'animated']",
        type="string",
    ),
    AttributeInfo(
        name="year",
        description="The year the movie was released",
        type="integer",
    ),
    AttributeInfo(
        name="director",
        description="The name of the movie director",
        type="string",
    ),
    AttributeInfo(
        name="rating", description="A 1-10 rating for the movie", type="float"
    ),
]
document_content_description = "Brief summary of a movie"

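# Create the self-query retriever
# (llm_model is assumed to be an LLM instance created elsewhere; it is not shown in this article)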
retriever = SelfQueryRetriever.from_llm(
    llm_model,
    vectorstore,
    document_content_description,
    metadata_field_info,
)

# Retrieve
retriever.invoke("I want to watch a movie rated higher than 8.5")

"""
[Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'director': 'Andrei Tarkovsky', 'genre': 'thriller', 'rating': 9.9, 'year': 1979}),
 Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'director': 'Satoshi Kon', 'rating': 8.6, 'year': 2006})]
"""

Time-weighted vector store retriever

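The time-weighted vector store retriever combines semantic similarity with how recently an object was last accessed. Its scoring formula is semantic_similarity + (1.0 - decay_rate) ^ hours_passed, where ^ denotes exponentiation and hours_passed is the number of hours since the object was last accessed through the retriever (not since it was created). As a result, frequently accessed objects stay "fresh".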
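
The scoring formula translates directly into code. The function below is a sketch of the formula only, with a hypothetical name and made-up inputs; it does not reproduce how the retriever stores or updates access times.

```python
from datetime import datetime, timedelta


def time_weighted_score(
    semantic_similarity: float,
    last_accessed_at: datetime,
    now: datetime,
    decay_rate: float,
) -> float:
    """semantic_similarity + (1.0 - decay_rate) ** hours_passed"""
    hours_passed = (now - last_accessed_at).total_seconds() / 3600
    return semantic_similarity + (1.0 - decay_rate) ** hours_passed


now = datetime(2024, 1, 2, 12, 0, 0)

# Same semantic similarity, different last-access times.
fresh = time_weighted_score(0.5, now, now, decay_rate=0.5)
stale = time_weighted_score(0.5, now - timedelta(hours=2), now, decay_rate=0.5)
print(fresh, stale)
```

With a decay rate of 0.5, the recency bonus halves every hour, so a document touched just now scores 1.5 while the same document untouched for two hours scores 0.75. A decay rate near 0 makes the bonus fade very slowly, and a rate of 1 removes it as soon as any time has passed.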