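Protecting sensitive and PII information in RAG with Elasticsearch and LlamaIndex

Author: Srikanth Manvi, Elastic

In this post, we look at ways to protect personally identifiable information (PII) and sensitive data when using public LLMs in a RAG (Retrieval-Augmented Generation) flow. We explore masking PII and sensitive data with open source libraries and regular expressions, as well as using local LLMs to mask data before invoking a public LLM.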

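Before we begin, let's review some of the terminology used in this post.

Terminology

LlamaIndex is a leading data framework for building LLM (Large Language Model) applications. It provides abstractions for the various stages of building a RAG (Retrieval-Augmented Generation) application. Frameworks such as LlamaIndex and LangChain provide abstractions so that applications are not tightly coupled to the API of any specific LLM.

Elasticsearch is offered by Elastic, the company behind it. Elasticsearch is a search and analytics engine that supports full-text search for precision, vector search for semantic understanding, and hybrid search that combines the two. It is a distributed, RESTful search and analytics engine, a scalable data store, and a vector database. The Elasticsearch capabilities used in this blog are available in the free and open version of Elasticsearch.

Retrieval-Augmented Generation (RAG) is an AI technique/pattern in which LLMs are provided with external knowledge to generate responses to user queries. This allows LLM responses to be tailored to a specific context rather than being generic.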

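Embeddings are numerical representations of the meaning of text/media. They are lower-dimensional representations of high-dimensional information.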
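To make "numerical representation of meaning" concrete, here is a minimal sketch that compares embeddings with cosine similarity. The three-dimensional vectors are hand-written, hypothetical values chosen purely for illustration; a real embedding model produces vectors with hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, not produced by any real model.
burst_pipe = [0.9, 0.1, 0.0]
water_leak = [0.8, 0.2, 0.1]
car_theft = [0.0, 0.2, 0.9]

# Texts with related meaning should end up closer together.
related = cosine_similarity(burst_pipe, water_leak)
unrelated = cosine_similarity(burst_pipe, car_theft)
print(related > unrelated)
```

A vector database such as Elasticsearch performs this kind of comparison at scale when it looks for the documents most similar to a query.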

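RAG and data protection

Generally, large language models (LLMs) are good at generating responses based on the information available in the model, which may have been trained on internet data. However, for queries about information the model does not contain, the LLM needs to be provided with external knowledge or specific details. That information could live in your database or internal knowledge systems. Retrieval-Augmented Generation (RAG) is a technique in which, for a given user query, you first retrieve relevant context/information from a system external to the LLM (for example, your database) and then send that context along with the user query to the LLM to generate a more specific and relevant response.

This makes RAG very effective for question answering, content creation, and anywhere a deep understanding of context and detail is needed.

As a result, in a RAG pipeline you run the risk of exposing internal information, such as PII (personally identifiable information) and sensitive information (names, dates of birth, account numbers, and so on), to a public LLM.

While your data is secure when using a vector database like Elasticsearch (through levers such as role-based access control and document-level security), care must be taken when sending data outside to public LLMs.

Protecting personally identifiable information (PII) and sensitive data is crucial when using large language models (LLMs) for several reasons: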

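Privacy compliance: Many regions have strict regulations, such as the General Data Protection Regulation (GDPR) in Europe or the California Consumer Privacy Act (CCPA) in the United States, that mandate the protection of personal data. Complying with these laws is necessary to avoid legal consequences and fines.
User trust: Ensuring the confidentiality and integrity of sensitive information builds user trust. Users are more likely to use and interact with systems that they believe protect their privacy.
Data security: Protecting against data breaches is essential. Without adequate safeguards, sensitive data exposed to LLMs is vulnerable to theft or misuse, which can lead to harm such as identity theft or financial fraud.
Ethical considerations: Ethically, it is important to respect users' privacy and handle their data responsibly. Mishandling PII can lead to discrimination, stigmatization, or other negative societal impacts.
Business reputation: Companies that fail to protect sensitive data can suffer reputational damage, with long-term negative effects on their business, including loss of customers and revenue.
Reducing the risk of misuse: Handling sensitive data securely helps prevent malicious use of the data or the model, such as training models on biased data or using data to manipulate or harm individuals.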

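Overall, robust protection of PII and sensitive data is essential to ensure legal compliance, maintain user trust, secure data, uphold ethical standards, protect business reputation, and reduce the risk of misuse.

Quick recap

In the previous post, we discussed how to implement a Q&A experience using a RAG technique, with Elasticsearch as the vector database and with LlamaIndex and a locally running Mistral LLM. Here we build on that.

Reading the previous post is optional, as we will now quickly recap what we did there.

We had a sample dataset of call center conversations between agents and customers of a fictional home insurance company. We built a simple RAG application that answers questions such as "What kind of water related issues are customers filing claims for?"

At a high level, the flow looks like this.

During the indexing phase, we load and index documents using a LlamaIndex pipeline. Documents are chunked and stored, along with their embeddings, in the Elasticsearch vector database.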

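During the query phase, when a user asks a question, LlamaIndex retrieves the top K documents most similar to the query. These top K documents are sent, along with the query, to the locally running Mistral LLM, which generates the response sent back to the user. Feel free to go through the previous post or explore the code.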
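The query phase can be sketched without any framework. The following is a simplified illustration, not how LlamaIndex implements it: the chunk vectors are hypothetical two-dimensional values, and the helper names top_k and build_prompt are made up for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vec, chunks, k):
    """Return the k chunk texts whose vectors are most similar to query_vec."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return [c["text"] for c in ranked[:k]]


def build_prompt(question, context_chunks):
    """Combine the retrieved context and the user question into one prompt."""
    context = "\n---\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Toy "vector store" with hypothetical embeddings.
chunks = [
    {"text": "Customer reports a burst pipe in the basement.", "vector": [0.9, 0.1]},
    {"text": "Customer asks about theft coverage.", "vector": [0.1, 0.9]},
    {"text": "Customer reports flooding after heavy rain.", "vector": [0.8, 0.3]},
]
query_vec = [1.0, 0.0]  # hypothetical embedding of the user question

retrieved = top_k(query_vec, chunks, k=2)
prompt = build_prompt("What water related issues are customers reporting?", retrieved)
print(prompt)
```

Whatever text ends up in the prompt is what the LLM sees, which is why the retrieved chunks are the natural place to apply masking.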

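In the previous post, we ran the LLM locally. In production, however, you may want to use an external LLM provided by companies such as OpenAI, Mistral, or Anthropic. This could be because your use case needs a larger foundation model, or because running locally is not an option given enterprise production needs such as scalability, availability, and performance.

Introducing an external LLM into your RAG pipeline exposes you to the risk of inadvertently leaking sensitive information and PII to the LLM. In this post, we explore options for masking PII as part of the RAG pipeline before documents are sent to an external LLM.

RAG with a public LLM

Before discussing how to protect your PII and sensitive information in a RAG pipeline, we first build a simple RAG application using LlamaIndex, the Elasticsearch vector database, and the OpenAI LLM.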

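Prerequisites

We need the following.

Elasticsearch up and running as a vector database to store the embeddings. Follow the instructions in the article on installing Elasticsearch.
An OpenAI API key.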

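Simple RAG application

For reference, the entire code can be found in this GitHub repository (branch: protecting-pii). Cloning the repository is optional, as we go through the code below.

In your favourite IDE, create a new Python application with the following 3 files.

index.py, where the code related to indexing data goes.
query.py, where the code related to querying and LLM interaction goes.
.env, where configuration properties such as API keys go.

We need to install a few packages. We begin by creating a new Python virtual environment in the root folder of the application.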

python3 -m venv .venv

Activate the virtual environment and install the required packages below.

source .venv/bin/activate
pip install llama-index
pip install llama-index-embeddings-openai
pip install llama-index-embeddings-huggingface
pip install llama-index-vector-stores-elasticsearch
pip install sentence-transformers
pip install python-dotenv
pip install openai

Configure the OpenAI and Elasticsearch connection properties in the .env file.

OPENAI_API_KEY="REPLACEME"
ELASTIC_CLOUD_ID="REPLACEME"
ELASTIC_API_KEY="REPLACEME"
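A forgotten placeholder in .env tends to surface later as a confusing authentication error. As an optional safeguard, here is a small sketch that checks the three settings before anything else runs. The helper name missing_settings is hypothetical and not part of the original code; in the application you would call load_dotenv('.env') first and then pass os.environ.

```python
import os

REQUIRED_SETTINGS = ("OPENAI_API_KEY", "ELASTIC_CLOUD_ID", "ELASTIC_API_KEY")


def missing_settings(env, required=REQUIRED_SETTINGS):
    """Return the names that are unset, empty, or still the REPLACEME placeholder."""
    return [name for name in required
            if not env.get(name) or env.get(name) == "REPLACEME"]


# Assumes the .env contents were already loaded into the process environment.
missing = missing_settings(os.environ)
if missing:
    print("Missing configuration:", ", ".join(missing))
```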

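Indexing data

Download the conversations.json file, which contains conversations between customers and call center agents of our fictional home insurance company. Place the file in the root directory of the application, alongside the 2 Python files and the .env file created earlier. Below is an example of the file's contents.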

{
"conversation_id": 103,
"customer_name": "Sophia Jones",
"agent_name": "Emily Wilson",
"policy_number": "JKL0123",
"conversation": "Customer: Hi, I'm Sophia Jones. My Date of Birth is November 15th, 1985, Address is 303 Cedar St, Miami, FL 33101, and my Policy Number is JKL0123.\nAgent: Hello, Sophia. How may I assist you today?\nCustomer: Hello, Emily. I have a question about my policy.\nCustomer: There's been a break-in at my home, and some valuable items are missing. Are they covered?\nAgent: Let me check your policy for coverage related to theft.\nAgent: Yes, theft of personal belongings is covered under your policy.\nCustomer: That's a relief. I'll need to file a claim for the stolen items.\nAgent: We'll assist you with the claim process, Sophia. Is there anything else I can help you with?\nCustomer: No, that's all for now. Thank you for your assistance, Emily.\nAgent: You're welcome, Sophia. Please feel free to reach out if you have any further questions or concerns.\nCustomer: I will. Have a great day!\nAgent: You too, Sophia. Take care.",
"summary": "A customer inquires about coverage for stolen items after a break-in at home, and the agent confirms that theft of personal belongings is covered under the policy. The agent offers assistance with the claim process, resulting in the customer expressing relief and gratitude."
}

Paste the following code into index.py to index the data.

index.py

# index.py
# pip install sentence-transformers
# pip install llama-index-embeddings-openai
# pip install llama-index-embeddings-huggingface

import json
import os
from dotenv import load_dotenv
from llama_index.core import Document
from llama_index.core import Settings
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.elasticsearch import ElasticsearchStore


def get_documents_from_file(file):
   """Reads a json file and returns list of Documents"""

   with open(file=file, mode='rt') as f:
       conversations_dict = json.loads(f.read())

   # Build Document objects using fields of interest.
   documents = [Document(text=item['conversation'],
                         metadata={"conversation_id": item['conversation_id']})
                for
                item in conversations_dict]
   return documents

# Load .env file contents into env
load_dotenv('.env')
Settings.embed_model = HuggingFaceEmbedding(
   model_name="BAAI/bge-small-en-v1.5"
)

# ElasticsearchStore is a VectorStore that
# takes care of Elasticsearch Index and Data management.
# It is defined at module level so that the query scripts can import it.
es_vector_store = ElasticsearchStore(index_name="convo_index",
                                     vector_field='conversation_vector',
                                     text_field='conversation',
                                     es_cloud_id=os.getenv("ELASTIC_CLOUD_ID"),
                                     es_api_key=os.getenv("ELASTIC_API_KEY"))


def main():
   # LlamaIndex Pipeline configured to take care of chunking, embedding
   # and storing the embeddings in the vector store.
   llamaindex_pipeline = IngestionPipeline(
       transformations=[
           SentenceSplitter(chunk_size=350, chunk_overlap=50),
           Settings.embed_model
       ],
       vector_store=es_vector_store
   )

   # Load data from a json file into a list of LlamaIndex Documents
   documents = get_documents_from_file(file="conversations.json")
   llamaindex_pipeline.run(documents=documents)
   print(".....Indexing Data Completed.....\n")

if __name__ == "__main__":
   main()

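Run the code above (for example, with python index.py) and confirm that an index named convo_index is created in Elasticsearch with the embeddings stored in it.

If you need an explanation of the LlamaIndex IngestionPipeline, refer to the section "Create IngestionPipeline" in the previous post.

Querying

In the previous post, we used a local LLM for querying.

In this post, we use a public LLM, OpenAI, as shown below.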

query.py

# query.py
from llama_index.core import VectorStoreIndex, QueryBundle, Settings
from llama_index.llms.openai import OpenAI
from index import es_vector_store

# Public LLM where we send user query and Related Documents
llm = OpenAI()

index = VectorStoreIndex.from_vector_store(es_vector_store)

# This query_engine, for a given user query retrieves top 10 similar documents from
# Elasticsearch vector database and sends the documents along with the user query to the LLM.
# Note that documents are sent as-is. So any PII/Sensitive data is sent to the LLM.
query_engine = index.as_query_engine(llm, similarity_top_k=10)

query="Give me summary of water related claims that customers raised."
bundle = QueryBundle(query, embedding=Settings.embed_model.get_query_embedding(query))
result = query_engine.query(bundle)
print(result)

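The code above prints the response from OpenAI, as shown below.

Customers have raised various water-related claims, including issues such as water damage in basements, burst pipes, and hail damage to roofs, as well as claims denied for reasons such as lack of timely notification, maintenance issues, gradual wear and tear, and pre-existing damage. In each case, customers expressed frustration with the claim denials and sought fair evaluations and decisions on their claims.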

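Masking PII in RAG

What we have covered so far involves sending documents as-is to OpenAI along with the user query.

In a RAG pipeline, after the relevant context is retrieved from the vector store, we have the opportunity to mask PII and sensitive information before sending the query and context to the LLM.

There are several ways to mask PII before it is sent to an external LLM, each with its own advantages. We look at some of the options below.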

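Using NLP libraries such as spacy.io or Presidio (an open source library maintained by Microsoft).
Using the out-of-the-box NERPIINodePostprocessor from LlamaIndex.
Using local LLMs via PIINodePostprocessor.

Once you have implemented your masking logic in any of the ways above, you can configure the LlamaIndex query engine with a PostProcessor (either your own custom PostProcessor or any of the out-of-the-box PostProcessors from LlamaIndex).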

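Using NLP libraries

As part of the RAG pipeline, we can mask sensitive data using NLP libraries. We use the spacy.io package in this demonstration.

Create a new file query_masking_nlp.py and add the code below.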

query_masking_nlp.py

# query_masking_nlp.py

# pip install spacy
# python3 -m spacy download en_core_web_sm
import re
from typing import List, Optional

import spacy
from llama_index.core import VectorStoreIndex, QueryBundle, Settings
from llama_index.core.postprocessor.types import BaseNodePostprocessor
from llama_index.core.schema import NodeWithScore
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.openai import OpenAI
from index import es_vector_store

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

# Compile regex patterns for performance
phone_pattern = re.compile(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b')
email_pattern = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')
date_pattern = re.compile(r'\b(\d{1,2}[-/]\d{1,2}[-/]\d{2,4}|\d{2,4}[-/]\d{1,2}[-/]\d{1,2})\b')
dob_pattern = re.compile(
r"(January|February|March|April|May|June|July|August|September|October|November|December)\s(\d{1,2})(st|nd|rd|th),\s(\d{4})")
address_pattern = re.compile(r'\d+\s+[\w\s]+\,\s+[A-Za-z]+\,\s+[A-Z]{2}\s+\d{5}(-\d{4})?')
zip_code_pattern =  re.compile(r'\b\d{5}(?:-\d{4})?\b')
policy_number_pattern = re.compile(r"\b[A-Z]{3}\d{4}\b")  # 3 uppercase letters followed by 4 digits, in our case e.g. XYZ9876

Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# match = re.match(policy_number_pattern, "XYZ9876")
# print(match)


def mask_pii(text):
   """
   Masks Personally Identifiable Information (PII) in the given
   text using pre-defined regex patterns and spaCy's named entity recognition.
   Args:
       text (str): The input text containing potential PII.
   Returns:
       str: The text with PII masked.
   """

   # Process the text with spaCy for NER
   doc = nlp(text)

   # Mask entities identified by spaCy NER (e.g First/Last Names etc)
   for ent in doc.ents:
       if ent.label_ in ["PERSON", "ORG", "GPE"]:
           text = text.replace(ent.text, '[MASKED]')

   # Apply regex patterns after NER to avoid overlapping issues
   text = phone_pattern.sub('[PHONE MASKED]', text)
   text = email_pattern.sub('[EMAIL MASKED]', text)
   text = date_pattern.sub('[DATE MASKED]', text)
   text = address_pattern.sub('[ADDRESS MASKED]', text)
   text = dob_pattern.sub('[DOB MASKED]', text)
   text = zip_code_pattern.sub('[ZIP MASKED]', text)
   text = policy_number_pattern.sub('[POLICY MASKED]', text)

   return text


class CustomPostProcessor(BaseNodePostprocessor):
   """
   Custom Postprocessor which masks Personally Identifiable Information (PII).
   PostProcessor is called on the Documents before they are sent to the LLM.
   """
   def _postprocess_nodes(
           self, nodes: List[NodeWithScore], query_bundle: Optional[QueryBundle]
   ) -> List[NodeWithScore]:
       # Masks PII
       for n in nodes:
          n.node.set_content(mask_pii(n.text))
       return nodes

   
# Use Public LLM to send user query and Related Documents
llm = OpenAI()
index = VectorStoreIndex.from_vector_store(es_vector_store)

# This query_engine, for a given user query retrieves top 10 similar documents from
# Elasticsearch vector database and sends the documents along with the user query to the LLM.
# Note that documents are masked based on custom logic defined in CustomPostProcessor._postprocess_nodes.
query_engine = index.as_query_engine(llm, similarity_top_k=10, node_postprocessors=[CustomPostProcessor()])



query = "Give me summary of water related claims that customers raised."
bundle = QueryBundle(query, embedding=Settings.embed_model.get_query_embedding(query))
response = query_engine.query(bundle)
print(response)

The response from the LLM is shown below.

Customers have raised various water-related claims, including issues such as water damage in basements, burst pipes, hail damage to roofs, and flooding during heavy rainfall. These claims have led to frustrations due to claim denials based on reasons such as lack of timely notification, maintenance issues, gradual wear and tear, and pre-existing damage. Customers have expressed disappointment, stress, and financial burden as a result of these claim denials, seeking fair evaluations and thorough reviews of their claims. Some customers have also faced delays in claim processing, causing further dissatisfaction with the service provided by the insurance company.

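In the code above, we provide a CustomPostProcessor when creating the LlamaIndex QueryEngine.

The logic invoked by the QueryEngine is defined in the _postprocess_nodes method of CustomPostProcessor. We use the spaCy library to detect named entities and then use some regular expressions to replace those names, along with other sensitive information, before the documents are sent to the LLM.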
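The regular-expression half of that logic can be tried on its own, without spaCy or LlamaIndex. The sketch below is a simplified illustration: it keeps only three patterns, and the policy number pattern uses word boundaries, which assumes policy numbers always look like three uppercase letters followed by four digits. The function name mask_with_regex is made up for this example.

```python
import re

# Month-name dates such as "October 12th, 1984".
DOB = re.compile(
    r"(January|February|March|April|May|June|July|August|September|"
    r"October|November|December)\s\d{1,2}(st|nd|rd|th),\s\d{4}")
# Assumed policy number format: three uppercase letters plus four digits.
POLICY = re.compile(r"\b[A-Z]{3}\d{4}\b")
# US ZIP codes, optionally in ZIP+4 form.
ZIP_CODE = re.compile(r"\b\d{5}(?:-\d{4})?\b")


def mask_with_regex(text):
    """Replace dates of birth, policy numbers and ZIP codes with placeholders."""
    text = DOB.sub("[DOB MASKED]", text)
    text = POLICY.sub("[POLICY MASKED]", text)
    text = ZIP_CODE.sub("[ZIP MASKED]", text)
    return text


sample = ("DOB is October 12th, 1984, and I live at 456 Cedar St, "
          "Smalltown, NY 34567. My Policy Number is TUV8901.")
print(mask_with_regex(sample))
```

Regular expressions handle well-structured values like these, but they cannot recognize names or free-form street addresses, which is why the full example combines them with named entity recognition.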

Below are part of an original conversation and the masked conversation produced by CustomPostProcessor.

Original text:

Customer: Hi, I'm Matthew Lopez, DOB is October 12th, 1984, and I live at 456 Cedar St, Smalltown, NY 34567. My Policy Number is TUV8901. Agent: Good afternoon, Matthew. How can I assist you today? Customer: Hello, I'm extremely disappointed with your company's decision to deny my claim.

Text masked by CustomPostProcessor:

Customer: Hi, I'm [MASKED], [MASKED] is [DOB MASKED], and I live at 456 Cedar St, [MASKED], [MASKED] 34567. My Policy Number is [MASKED]. Agent: Good afternoon, [MASKED]. How can I assist you today? Customer: Hello, I'm extremely disappointed with your company's decision to deny my claim.

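Note:

Identifying and masking PII and sensitive information is not a simple task. Covering the various formats and semantics of sensitive information requires a good understanding of your domain and data. In the masked sample above, for instance, the street address and ZIP code are still visible. While the code presented above may work for some use cases, you may need to modify it based on your needs and testing.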
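One simple way to test a masking step is to check its output against values you already know are sensitive, for example fields from the source record. The sketch below is a hypothetical helper (find_leaks is not part of LlamaIndex or the original code) applied to the masked sample shown above.

```python
def find_leaks(masked_text, known_values):
    """Return the known sensitive values that still appear in masked_text."""
    lowered = masked_text.lower()
    return [value for value in known_values if value.lower() in lowered]


# Masked text taken from the sample above.
masked = ("Customer: Hi, I'm [MASKED], [MASKED] is [DOB MASKED], and I live at "
          "456 Cedar St, [MASKED], [MASKED] 34567. My Policy Number is [MASKED].")

# Values known to be sensitive for this conversation.
known = ["Matthew Lopez", "October 12th, 1984", "456 Cedar St", "34567", "TUV8901"]

print(find_leaks(masked, known))
```

A check like this only catches values you can enumerate in advance, so it complements rather than replaces a review of the masking rules themselves.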

Using LlamaIndex's out-of-the-box NERPIINodePostprocessor

LlamaIndex makes protecting PII in a RAG pipeline easier with its NERPIINodePostprocessor.

from llama_index.core import VectorStoreIndex, QueryBundle, Settings
from llama_index.core.postprocessor import NERPIINodePostprocessor
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.openai import OpenAI
from index import es_vector_store

Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Use Public LLM to send user query and Related Documents
llm = OpenAI()

ner_processor = NERPIINodePostprocessor()
index = VectorStoreIndex.from_vector_store(es_vector_store)

# This query_engine, for a given user query retrieves top 10 similar documents from
# Elasticsearch vector database and sends the documents along with the user query to the LLM.
# Note that documents are masked using the NERPIINodePostprocessor so that PII/Sensitive data is not sent to the LLM.
query_engine = index.as_query_engine(llm, similarity_top_k=10, node_postprocessors=[ner_processor])

query = "Give me summary of fire related claims that customers raised."
bundle = QueryBundle(query, embedding=Settings.embed_model.get_query_embedding(query))
response = query_engine.query(bundle)
print(response)

The response is shown below:

Customers have raised fire-related claims regarding damage to their properties. In one case, a claim for fire damage to a garage was denied due to arson being excluded from coverage. Another customer filed a claim for fire damage to their home, which was covered under their policy. Additionally, a customer reported a kitchen fire and was assured that fire damage was covered.

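Using local LLMs via PIINodePostprocessor

We can also leverage an LLM running locally or in your private network to do the masking before data is sent to a public LLM.

We use Mistral running on Ollama on the local machine to do the masking.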
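The two-stage idea, masking locally and answering publicly, can be sketched independently of any model. In the illustration below, both model calls are stand-in functions so the example runs on its own; answer_with_masking, fake_local_mask, and fake_public_llm are hypothetical names, not LlamaIndex APIs.

```python
def answer_with_masking(question, documents, mask_fn, answer_fn):
    """Mask every document locally, then pass only masked text to answer_fn."""
    masked_docs = [mask_fn(doc) for doc in documents]
    context = "\n---\n".join(masked_docs)
    return answer_fn(f"Context:\n{context}\n\nQuestion: {question}")


# Stand-in for the local masking model (assumption: it replaces names).
def fake_local_mask(text):
    return text.replace("Sophia Jones", "[NAME]")


# Stand-in for the public LLM; it records every prompt it receives.
seen_by_public_llm = []


def fake_public_llm(prompt):
    seen_by_public_llm.append(prompt)
    return "stub answer"


result = answer_with_masking(
    "What did the customer report?",
    ["Customer: Hi, I'm Sophia Jones. There's been a break-in at my home."],
    fake_local_mask,
    fake_public_llm,
)
print(seen_by_public_llm[0])
```

The point of the structure is that raw documents never reach the public model: only the output of the local masking step does.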

Running Mistral locally

Download and install Ollama. After installing Ollama, run this command to download and run Mistral.

ollama run mistral

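Downloading and running the model locally for the first time may take a few minutes. Verify that Mistral is running by asking a question such as "Write a poem about clouds" and checking that the poem is to your liking. Keep Ollama running, as we need to interact with the Mistral model later through code.

Create a new file named query_masking_local_LLM.py and add the code below.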

query_masking_local_LLM.py

# pip install llama-index-llms-ollama
from llama_index.core import VectorStoreIndex, QueryBundle, Settings
from llama_index.core.postprocessor import PIINodePostprocessor
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.llms.openai import OpenAI
from index import es_vector_store

Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Use Public LLM to send user query and Related Documents and Local LLM to mask
public_llm = OpenAI()
local_llm = Ollama(model="mistral")

pii_processor = PIINodePostprocessor(llm=local_llm)
index = VectorStoreIndex.from_vector_store(es_vector_store)

# This query_engine, for a given user query retrieves top 10 similar documents from
# Elasticsearch vector database and sends the documents along with the user query to the public LLM.
# Note that documents are masked using the local llm via PIINodePostprocessor
# so that PII/Sensitive data is not sent to the public LLM.
query_engine = index.as_query_engine(public_llm, similarity_top_k=10, node_postprocessors=[pii_processor])


query = "Give me summary of fire related claims that customers raised."
bundle = QueryBundle(query, embedding=Settings.embed_model.get_query_embedding(query))
result = query_engine.query(bundle)
print(result)

The response is shown below:

Customers have raised fire-related claims regarding damage to their properties. In one case, a claim for fire damage to a garage was denied due to arson being excluded from coverage. Another customer filed a claim for fire damage to their home, which was covered under their policy. Additionally, a customer reported a kitchen fire and was assured that fire damage was covered.