Search for documents with similar texts

news2025/7/12 21:55:20

Background:

I have a document with three attributes: tags, location, and text.

Currently, I am indexing all of them using LangChain/pgvector/embeddings.

The results are satisfactory, but I want to know if there is a better way. I need to find one or more documents with a specific tag and location, where the text can vary drastically while still meaning the same thing. That is why I thought of using embeddings and a vector database.

Would this also be a case for using RAG (Retrieval-Augmented Generation) to "teach" the LLM some common abbreviations that it doesn't know?

import pandas as pd

from langchain_core.documents import Document
from langchain_postgres import PGVector
from langchain_openai.embeddings import OpenAIEmbeddings

connection = "postgresql+psycopg://langchain:langchain@localhost:5432/langchain"
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
collection_name = "notas_v0"

vectorstore = PGVector(
    embeddings=embeddings,
    collection_name=collection_name,
    connection=connection,
    use_jsonb=True,
)


# START INDEX

# df = pd.read_csv("notes.csv")
# df = df.dropna()  # .head(10000)
# df["tags"] = df["tags"].apply(
#     lambda x: [tag.strip() for tag in x.split(",") if tag.strip()]
# )


# long_texts = df["Texto Longo"].tolist()
# wc = df["Centro Trabalho Responsável"].tolist()
# notes = df["Nota"].tolist()
# tags = df["tags"].tolist()

# documents = list(
#     map(
#         lambda x: Document(
#             page_content=x[0], metadata={"wc": x[1], "note": x[2], "tags": x[3]}
#         ),
#         zip(long_texts, wc, notes, tags),
#     )
# )

# print(
#     [
#         vectorstore.add_documents(documents=documents[i : i + 100])
#         for i in range(0, len(documents), 100)
#     ]
# )
# print("Done.")

### END INDEX

### BEGIN QUERY

result = vectorstore.similarity_search_with_relevance_scores(
    "EVTD202301222707",
    filter={"note": {"$in": ["15310116"]}, "tags": {"$in": ["abcd", "xyz"]}},
    k=10, # Limit of results
)

### END QUERY

Solution:

There is one primary unknown here: the approximate or average number of tokens in the "text" part of your input.

Scenario 1: You do not have a very long input (say, somewhere around 512 tokens)

In this case, to get better results, you can train your own embedding model. My other answer on that topic has some information about it.

Once you have the right embedding model, you index the corresponding text vectors in your RAG pipeline. A couple of other steps apply to all scenarios, so I will add them at the end.
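
To make the indexing step concrete, here is a minimal in-memory sketch of what a vector store does with those text vectors. It is only an illustration: the class name `InMemoryVectorIndex` and the hand-written 3-dimensional vectors are assumptions that stand in for real embeddings, and in practice a store such as pgvector would do this work.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryVectorIndex:
    """Toy index (hypothetical): stores (doc_id, vector) pairs, ranks by cosine."""

    def __init__(self) -> None:
        self._items: list[tuple[str, list[float]]] = []

    def add(self, doc_id: str, vector: list[float]) -> None:
        self._items.append((doc_id, vector))

    def search(self, query_vector: list[float], k: int = 3) -> list[tuple[str, float]]:
        scored = [(doc_id, cosine(query_vector, vec)) for doc_id, vec in self._items]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hand-written vectors stand in for the output of the embedding model.
index = InMemoryVectorIndex()
index.add("note-1", [0.9, 0.1, 0.0])
index.add("note-2", [0.0, 0.2, 0.9])
index.add("note-3", [0.7, 0.6, 0.1])

top = index.search([1.0, 0.0, 0.0], k=2)
print(top)
```

The query vector would come from the same embedding model as the indexed vectors; mixing models makes the similarity scores meaningless.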

Scenario 2: You have a very long input per document; say every "text" input is huge (around 8,000 tokens, though the exact number can be anything). In this case you can leverage symbolic (lexical) search instead of vector search. The reasoning is that, in any language, two texts that mean the same thing or share a context will almost certainly share many words. It would be very rare to find two 10-page texts on the same topic without substantial word overlap.

So you can combine symbolic search with vector-based validators and an LLM service that accepts long-context prompts: first find good candidates via symbolic search, then pass them to the long-context LLM for the remaining steps.
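
As a rough sketch of the candidate-finding step, the code below uses BM25, one common choice of symbolic ranking function (the answer does not name a specific one, so this is an assumption). The sample notes, the query, and the prompt wording are all made up for illustration; the vector-based validation and the actual LLM call are left out.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-word characters."""
    return re.findall(r"\w+", text.lower())


class BM25:
    """Okapi BM25 over a small in-memory corpus."""

    def __init__(self, corpus: list[str], k1: float = 1.5, b: float = 0.75) -> None:
        self.k1, self.b = k1, b
        self.docs = [Counter(tokenize(text)) for text in corpus]
        self.lengths = [sum(doc.values()) for doc in self.docs]
        self.avg_len = sum(self.lengths) / max(len(self.lengths), 1)
        # Document frequency: in how many documents each term appears.
        self.df = Counter(term for doc in self.docs for term in doc)

    def score(self, query: str, i: int) -> float:
        doc, n = self.docs[i], len(self.docs)
        total = 0.0
        for term in set(tokenize(query)):
            tf = doc[term]
            if tf == 0:
                continue
            idf = math.log(1 + (n - self.df[term] + 0.5) / (self.df[term] + 0.5))
            norm = tf + self.k1 * (1 - self.b + self.b * self.lengths[i] / self.avg_len)
            total += idf * tf * (self.k1 + 1) / norm
        return total

    def top_k(self, query: str, k: int) -> list[int]:
        """Indices of the k best-scoring documents, dropping zero-overlap ones."""
        scores = [self.score(query, i) for i in range(len(self.docs))]
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return [i for i in ranked[:k] if scores[i] > 0]


# Made-up sample notes for illustration only.
corpus = [
    "Pump P-101 is leaking oil at the mechanical seal and needs a seal replacement.",
    "Replace the burnt light bulb in the control room corridor.",
    "Oil leak detected on pump seal; schedule seal replacement for pump P-101.",
]
query = "oil leaking from pump seal"

bm25 = BM25(corpus)
candidates = bm25.top_k(query, k=2)

# Hypothetical next step: place the candidates into a long-context prompt.
prompt = "Which of these notes describe the same issue as the query?\n"
prompt += f"Query: {query}\n"
prompt += "\n".join(f"[{i}] {corpus[i]}" for i in candidates)
print(prompt)
```

Only the notes that share words with the query survive as candidates, which keeps the long-context prompt small even when each document is large.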

Steps applicable to all scenarios:

1. Your JSON object should contain "tags" and "location" alongside "text" and its vector:

{
  "text": "some text",
  "text_embedding": [...],  # not applicable in symbolic search
  "location": "loc",
  "tags": []
}

2. This way, when you get matches from either vector search or symbolic search, you will be able to further filter or sort them by other properties such as tags and location.
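
A minimal sketch of that post-filtering step follows. The `Match` class and `filter_matches` function are hypothetical names, and matching on "at least one of the requested tags" is an assumption modelled on the `$in` filter in the question. With pgvector the same filter can instead be pushed into the query, as the question's `filter=` argument already does.

```python
from dataclasses import dataclass, field


@dataclass
class Match:
    """One search hit, mirroring the JSON object above plus a relevance score."""

    text: str
    location: str
    tags: list[str] = field(default_factory=list)
    score: float = 0.0


def filter_matches(
    matches: list[Match], location: str, wanted_tags: list[str]
) -> list[Match]:
    """Keep hits at `location` with at least one wanted tag, best score first."""
    wanted = set(wanted_tags)
    kept = [m for m in matches if m.location == location and wanted & set(m.tags)]
    return sorted(kept, key=lambda m: m.score, reverse=True)


# Made-up hits, as they might come back from either search method.
hits = [
    Match("oil leak on pump seal", "plant-a", ["pump", "leak"], 0.71),
    Match("pump seal leaking oil", "plant-a", ["pump"], 0.88),
    Match("pump seal leaking oil", "plant-b", ["pump"], 0.93),
    Match("light bulb replaced", "plant-a", ["electrical"], 0.40),
]

for match in filter_matches(hits, location="plant-a", wanted_tags=["pump", "valve"]):
    print(f"{match.score:.2f}  {match.text}")
```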

Please comment if you have further questions!