[LangChain] Retrievers: Contextual Compression

news2025/1/9 15:37:34

LangChain Study Notes

  • [LangChain] Retrievers
  • [LangChain] Retrievers: MultiQueryRetriever
  • [LangChain] Retrievers: Contextual Compression

Contextual Compression

  • LangChain Study Notes
  • Overview
  • Content
  • Using a vanilla vector store retriever
  • Adding contextual compression with an LLMChainExtractor
  • More built-in compressors: filters
    • LLMChainFilter
    • EmbeddingsFilter
  • Stringing compressors and document transformers together
  • Summary

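Overview

One challenge with retrieval is that we usually don't know which specific queries the document storage system will face at the time the data is ingested.

This means the information most relevant to a query may be buried in a document that contains a lot of irrelevant text.

Passing the full document through the application can lead to more expensive LLM calls and poorer responses.

Contextual compression is meant to fix this.

The idea is simple: instead of immediately returning retrieved documents as-is, we compress them using the context of the given query, so that only the relevant information is returned.

"Compression" here refers both to compressing the contents of an individual document and to filtering out whole documents.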

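To use the contextual compression retriever, we need:

  • a base retriever
  • a document compressor

The contextual compression retriever passes the query to the base retriever, takes the initial documents, and passes them to the document compressor. The document compressor takes the list of documents and shortens it, either by reducing the contents of individual documents or by dropping documents altogether.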
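The flow described above can be sketched in plain Python. This is only an illustration and not LangChain's implementation: `Doc`, `compression_retrieve`, `toy_retriever`, and `keyword_compressor` are hypothetical names, and keyword matching stands in for whatever a real compressor would do.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Doc:
    """Toy stand-in for a document object (hypothetical, for illustration only)."""
    page_content: str


# A base retriever maps a query to candidate documents.
BaseRetriever = Callable[[str], List[Doc]]
# A compressor takes the candidates plus the query and returns a shorter list.
Compressor = Callable[[List[Doc], str], List[Doc]]


def compression_retrieve(query: str, base_retriever: BaseRetriever, compressor: Compressor) -> List[Doc]:
    """Fetch candidates from the base retriever, then compress them against the query."""
    candidates = base_retriever(query)
    return compressor(candidates, query)


def toy_retriever(query: str) -> List[Doc]:
    # Ignores the query and returns a fixed corpus; a real retriever would rank by similarity.
    return [
        Doc("The nominee is a former public defender. Border security needs funding."),
        Doc("Pell Grants should be increased."),
    ]


def keyword_compressor(docs: List[Doc], query: str) -> List[Doc]:
    """Keep only sentences that share a word with the query; drop documents left empty."""
    terms = {w.lower() for w in query.split()}
    kept = []
    for doc in docs:
        sentences = [s.strip() for s in doc.page_content.split(".") if s.strip()]
        relevant = [s for s in sentences if terms & {w.lower() for w in s.split()}]
        if relevant:
            kept.append(Doc(". ".join(relevant) + "."))
    return kept


result = compression_retrieve("nominee", toy_retriever, keyword_compressor)
for doc in result:
    print(doc.page_content)
```

Both kinds of shortening show up here: the first document loses its unrelated sentence, and the second document is dropped entirely.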

Content

# Helper function for printing docs

def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))

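Using a vanilla vector store retriever

Let's start by initializing a simple vector store retriever and storing the 2023 State of the Union speech (in chunks). Given an example question, the retriever returns one or two relevant documents and several irrelevant ones. Even the relevant documents contain a lot of irrelevant information.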

from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.document_loaders import TextLoader
from langchain.vectorstores import FAISS
# Load the document
documents = TextLoader('../../../state_of_the_union.txt').load()
# Create the splitter
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
# Split the document into chunks
texts = text_splitter.split_documents(documents)
# Build the index and wrap it as a retriever
retriever = FAISS.from_documents(texts, OpenAIEmbeddings()).as_retriever()
# Run the query
docs = retriever.get_relevant_documents("What did the president say about Ketanji Brown Jackson")
# Pretty-print the results
pretty_print_docs(docs)

Output:

    Document 1:
    
    Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 
    
    Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 
    
    One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 
    
    And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
    ----------------------------------------------------------------------------------------------------
    Document 2:
    
    A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. 
    
    And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. 
    
    We can do both. At our border, we’ve installed new technology like cutting-edge scanners to better detect drug smuggling.  
    
    We’ve set up joint patrols with Mexico and Guatemala to catch more human traffickers.  
    
    We’re putting in place dedicated immigration judges so families fleeing persecution and violence can have their cases heard faster. 
    
    We’re securing commitments and supporting partners in South and Central America to host more refugees and secure their own borders.
    ----------------------------------------------------------------------------------------------------
    Document 3:
    
    And for our LGBTQ+ Americans, let’s finally get the bipartisan Equality Act to my desk. The onslaught of state laws targeting transgender Americans and their families is wrong. 
    
    As I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential. 
    
    While it often appears that we never agree, that isn’t true. I signed 80 bipartisan bills into law last year. From preventing government shutdowns to protecting Asian-Americans from still-too-common hate crimes to reforming military justice. 
    
    And soon, we’ll strengthen the Violence Against Women Act that I first wrote three decades ago. It is important for us to show the nation that we can come together and do big things. 
    
    So tonight I’m offering a Unity Agenda for the Nation. Four big things we can do together.  
    
    First, beat the opioid epidemic.
    ----------------------------------------------------------------------------------------------------
    Document 4:
    
    Tonight, I’m announcing a crackdown on these companies overcharging American businesses and consumers. 
    
    And as Wall Street firms take over more nursing homes, quality in those homes has gone down and costs have gone up.  
    
    That ends on my watch. 
    
    Medicare is going to set higher standards for nursing homes and make sure your loved ones get the care they deserve and expect. 
    
    We’ll also cut costs and keep the economy going strong by giving workers a fair shot, provide more training and apprenticeships, hire them based on their skills not degrees. 
    
    Let’s pass the Paycheck Fairness Act and paid leave.  
    
    Raise the minimum wage to $15 an hour and extend the Child Tax Credit, so no one has to raise a family in poverty. 
    
    Let’s increase Pell Grants and increase our historic support of HBCUs, and invest in what Jill—our First Lady who teaches full-time—calls America’s best-kept secret: community colleges.

Adding contextual compression with an LLMChainExtractor

Now let's wrap our base retriever with a ContextualCompressionRetriever. We'll add an LLMChainExtractor, which iterates over the initially returned documents and extracts from each one only the content relevant to the query.

from langchain.llms import OpenAI
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
# Create the LLM
llm = OpenAI(temperature=0)
# Build the LLMChainExtractor from the LLM
compressor = LLMChainExtractor.from_llm(llm)
# Build the contextual compression retriever
compression_retriever = ContextualCompressionRetriever(base_compressor=compressor, base_retriever=retriever)
# Run the query
compressed_docs = compression_retriever.get_relevant_documents("What did the president say about Ketanji Brown Jackson")
# Pretty-print the results
pretty_print_docs(compressed_docs)

Output:

    Document 1:
    
    "One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 
    
    And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence."
    ----------------------------------------------------------------------------------------------------
    Document 2:
    
    "A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans."

More built-in compressors: filters

LLMChainFilter

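LLMChainFilter is a slightly simpler but more robust compressor. It uses an LLM chain to decide which of the initially retrieved documents to filter out and which to return, without manipulating the document contents.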

from langchain.retrievers.document_compressors import LLMChainFilter

# Build the LLMChainFilter
_filter = LLMChainFilter.from_llm(llm)
# Build the contextual compression retriever
compression_retriever = ContextualCompressionRetriever(base_compressor=_filter, base_retriever=retriever)
# Run the query
compressed_docs = compression_retriever.get_relevant_documents("What did the president say about Ketanji Brown Jackson")
# Pretty-print the results
pretty_print_docs(compressed_docs)

Output:

    Document 1:
    
    Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 
    
    Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 
    
    One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 
    
    And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
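
To make the "keep or drop, but never edit" behavior concrete, here is a minimal sketch in plain Python. It is not how LLMChainFilter is implemented: `filter_documents` and `stub_judge` are hypothetical names, and the stub replaces the LLM's yes/no judgment with a simple word check.

```python
from typing import Callable, List


def filter_documents(docs: List[str], query: str, is_relevant: Callable[[str, str], bool]) -> List[str]:
    """Return the documents judged relevant, with their text left untouched."""
    return [doc for doc in docs if is_relevant(query, doc)]


def stub_judge(query: str, doc: str) -> bool:
    # Hypothetical stand-in for the LLM's yes/no answer:
    # true when every query word appears somewhere in the document.
    text = doc.lower()
    return all(word in text for word in query.lower().split())


docs = [
    "I nominated Judge Ketanji Brown Jackson. Also, pass the Disclose Act.",
    "Raise the minimum wage to $15 an hour.",
]
kept = filter_documents(docs, "Ketanji Brown Jackson", stub_judge)
print(kept)
```

The surviving document still contains its unrelated sentence about the Disclose Act, which mirrors the output above: a filter returns whole documents, whereas an extractor trims them.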

EmbeddingsFilter

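Making an extra LLM call for each retrieved document is expensive and slow. EmbeddingsFilter offers a cheaper and faster option: it embeds the documents and the query and returns only those documents whose embeddings are sufficiently similar to the query's.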

from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers.document_compressors import EmbeddingsFilter
# Create the embeddings model
embeddings = OpenAIEmbeddings()
# Build the EmbeddingsFilter
embeddings_filter = EmbeddingsFilter(embeddings=embeddings, similarity_threshold=0.76)
# Build the contextual compression retriever
compression_retriever = ContextualCompressionRetriever(base_compressor=embeddings_filter, base_retriever=retriever)
# Run the query
compressed_docs = compression_retriever.get_relevant_documents("What did the president say about Ketanji Brown Jackson")
# Pretty-print the results
pretty_print_docs(compressed_docs)

Output:

    Document 1:
    
    Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 
    
    Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 
    
    One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 
    
    And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
    ----------------------------------------------------------------------------------------------------
    Document 2:
    
    A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. 
    
    And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. 
    
    We can do both. At our border, we’ve installed new technology like cutting-edge scanners to better detect drug smuggling.  
    
    We’ve set up joint patrols with Mexico and Guatemala to catch more human traffickers.  
    
    We’re putting in place dedicated immigration judges so families fleeing persecution and violence can have their cases heard faster. 
    
    We’re securing commitments and supporting partners in South and Central America to host more refugees and secure their own borders.
    ----------------------------------------------------------------------------------------------------
    Document 3:
    
    And for our LGBTQ+ Americans, let’s finally get the bipartisan Equality Act to my desk. The onslaught of state laws targeting transgender Americans and their families is wrong. 
    
    As I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential. 
    
    While it often appears that we never agree, that isn’t true. I signed 80 bipartisan bills into law last year. From preventing government shutdowns to protecting Asian-Americans from still-too-common hate crimes to reforming military justice. 
    
    And soon, we’ll strengthen the Violence Against Women Act that I first wrote three decades ago. It is important for us to show the nation that we can come together and do big things. 
    
    So tonight I’m offering a Unity Agenda for the Nation. Four big things we can do together.  
    
    First, beat the opioid epidemic.
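
The thresholding idea can be sketched without any embedding model. The example below assumes cosine similarity as the similarity measure and uses hand-made 2-D vectors in place of real embeddings; `cosine_similarity` and `filter_by_similarity` are illustrative helpers, not LangChain functions.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def filter_by_similarity(
    query_vec: Sequence[float],
    doc_vecs: List[Tuple[str, Sequence[float]]],
    threshold: float,
) -> List[str]:
    """Keep the documents whose vector is at least `threshold` similar to the query vector."""
    return [text for text, vec in doc_vecs if cosine_similarity(query_vec, vec) >= threshold]


# Made-up vectors: the closer a vector points toward the query, the more "relevant" it is.
query_vec = [1.0, 0.0]
docs = [
    ("nomination", [0.9, 0.1]),   # nearly parallel to the query
    ("immigration", [0.6, 0.8]),  # cosine similarity 0.6
    ("wages", [0.0, 1.0]),        # orthogonal to the query
]
kept = filter_by_similarity(query_vec, docs, threshold=0.76)
print(kept)
```

Raising the threshold makes the filter stricter and lowering it admits more documents, so the value 0.76 used above is a tuning choice rather than a fixed rule.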

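Stringing compressors and document transformers together

With DocumentCompressorPipeline we can also easily combine multiple compressors in sequence. Besides compressors, we can add BaseDocumentTransformers to the pipeline; these perform no contextual compression and simply apply a transformation to a set of documents.

For example, TextSplitters can be used as document transformers to split documents into smaller pieces, and EmbeddingsRedundantFilter can be used to filter out redundant documents based on the embedding similarity between documents.

Below we create a compressor pipeline that first splits the documents into smaller chunks, then removes redundant documents, and finally filters by relevance to the query.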

from langchain.document_transformers import EmbeddingsRedundantFilter
from langchain.retrievers.document_compressors import DocumentCompressorPipeline
from langchain.text_splitter import CharacterTextSplitter
# Create the splitter
splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=0, separator=". ")
# Build the redundancy filter (EmbeddingsRedundantFilter)
redundant_filter = EmbeddingsRedundantFilter(embeddings=embeddings)
# Build the relevance filter (EmbeddingsFilter)
relevant_filter = EmbeddingsFilter(embeddings=embeddings, similarity_threshold=0.76)
# Build the compressor pipeline
pipeline_compressor = DocumentCompressorPipeline(
    transformers=[splitter, redundant_filter, relevant_filter]
)
# Build the contextual compression retriever
compression_retriever = ContextualCompressionRetriever(base_compressor=pipeline_compressor, base_retriever=retriever)
# Run the query
compressed_docs = compression_retriever.get_relevant_documents("What did the president say about Ketanji Brown Jackson")
# Pretty-print the results
pretty_print_docs(compressed_docs)

Output:

    Document 1:
    
    One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 
    
    And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson
    ----------------------------------------------------------------------------------------------------
    Document 2:
    
    As I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential. 
    
    While it often appears that we never agree, that isn’t true. I signed 80 bipartisan bills into law last year
    ----------------------------------------------------------------------------------------------------
    Document 3:
    
    A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder
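
The split, de-duplicate, then filter sequence can be sketched as a chain of plain functions. This is a simplified illustration under stated assumptions: every name below is hypothetical, exact case-insensitive matching stands in for embedding-based redundancy detection, and word overlap stands in for embedding-based relevance.

```python
from typing import Callable, List

# Each stage takes the current documents and the query, and returns new documents.
Stage = Callable[[List[str], str], List[str]]


def split_sentences(docs: List[str], query: str) -> List[str]:
    """Transformer: split each document on '. ' into smaller pieces (ignores the query)."""
    return [s.strip() for doc in docs for s in doc.split(". ") if s.strip()]


def drop_duplicates(docs: List[str], query: str) -> List[str]:
    """Transformer: drop pieces that repeat an earlier one, ignoring case (ignores the query)."""
    seen, unique = set(), []
    for doc in docs:
        key = doc.lower()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


def keep_relevant(docs: List[str], query: str) -> List[str]:
    """Compressor: keep pieces that share at least one word with the query."""
    terms = set(query.lower().split())
    return [doc for doc in docs if terms & set(doc.lower().split())]


def run_pipeline(stages: List[Stage], docs: List[str], query: str) -> List[str]:
    """Apply the stages in order, feeding each stage's output to the next."""
    for stage in stages:
        docs = stage(docs, query)
    return docs


docs = [
    "I nominated Judge Jackson. Secure the border",
    "i nominated judge jackson. Raise the minimum wage",
]
result = run_pipeline([split_sentences, drop_duplicates, keep_relevant], docs, "judge jackson")
print(result)
```

Order matters: splitting first lets the later stages work on small pieces, which is why the output above consists of short fragments rather than the original 1000-character chunks.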

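Summary

When searching documents, only a small share of what is retrieved is relevant; most of it is not.
A contextual compression retriever lets us return only the relevant part.

Main steps:

  1. Build a regular retriever: retriever = FAISS.from_documents(texts, OpenAIEmbeddings()).as_retriever()
  2. Build a contextual compression retriever: ContextualCompressionRetriever(base_compressor=embeddings_filter, base_retriever=retriever)

The second step is where the variety lies. Its first argument, base_compressor, can take several forms:
① LLMChainExtractor: extracts and condenses the relevant content
② LLMChainFilter: LLM-based filtering of whole documents
③ EmbeddingsFilter: filtering by embedding similarity
④ DocumentCompressorPipeline: a pipeline that chains multiple compressors and document transformers

Reference: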

https://python.langchain.com/docs/modules/data_connection/retrievers/how_to/contextual_compression/
