Potential problems with text splitting
- Chunks that are too coarse can make retrieval imprecise
- Chunks that are too fine can leave each chunk without enough context
- The answer to a question may span two adjacent chunks
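First, as a baseline, index the raw paragraphs directly (one paragraph per chunk) and see what the bot returns: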
# Create a vector database object
# (MyVectorDBConnector, get_embeddings, get_completion and RAG_Bot are
# assumed to be defined earlier in this notebook.)
vector_db = MyVectorDBConnector("demo_text_split", get_embeddings)
# Add the documents to the vector database
vector_db.add_documents(paragraphs)
# Create a RAG bot
bot = RAG_Bot(
    vector_db,
    llm_api=get_completion
)
#%%
# user_query = "Does llama 2 have a commercial license?"
user_query = "How many parameters does llama 2 chat have?"
search_results = vector_db.search(user_query, 2)
for doc in search_results['documents'][0]:
    print(doc + "\n")
print("====Reply====")
bot.chat(user_query)
====Reply====
llama 2 chat has 70B parameters.
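The answer is incomplete: llama 2 chat actually comes in 7B, 13B and 70B sizes (see the improved result at the end of this section). The paragraph-level chunks apparently did not put all three sizes into a single retrievable piece.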
Improvement: split the text at a fixed granularity, with partial overlap between adjacent chunks, so that each chunk keeps more complete context.
from nltk.tokenize import sent_tokenize  # requires: nltk.download('punkt')

def split_text(paragraphs, chunk_size=300, overlap_size=100):
    '''Split the text into chunks of about chunk_size characters,
    with about overlap_size characters of overlap between neighbors.'''
    sentences = [s.strip() for p in paragraphs for s in sent_tokenize(p)]
    chunks = []
    i = 0
    while i < len(sentences):
        chunk = sentences[i]
        overlap = ''
        prev = i - 1
        # Walk backwards, collecting the overlap from preceding sentences
        while prev >= 0 and len(sentences[prev]) + len(overlap) <= overlap_size:
            overlap = sentences[prev] + ' ' + overlap
            prev -= 1
        chunk = overlap + chunk
        nxt = i + 1
        # Walk forwards, extending the chunk until it reaches chunk_size
        while nxt < len(sentences) and len(sentences[nxt]) + len(chunk) <= chunk_size:
            chunk = chunk + ' ' + sentences[nxt]
            nxt += 1
        chunks.append(chunk)
        i = nxt
    return chunks
Note: sent_tokenize here is an English-only implementation; for a Chinese equivalent, see chinese_utils.py.
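As a rough idea of what such a Chinese splitter might look like (a minimal sketch only; the actual chinese_utils.py may be implemented differently), one can split after Chinese sentence-ending punctuation:

import re

def split_sentences_zh(text):
    # Minimal sketch: split after Chinese sentence-ending punctuation
    # (。！？…), keeping the delimiter attached to its sentence.
    parts = re.split(r'(?<=[。！？…])', text)
    return [s.strip() for s in parts if s.strip()]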
chunks = split_text(paragraphs, 300, 100)
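A quick sanity check (assuming paragraphs is already loaded earlier in the notebook): print the first few chunks and confirm that consecutive chunks share overlapping sentences.

for c in chunks[:3]:
    print(len(c), c[:80], '...')

Splitting at sentence boundaries (rather than at raw character offsets) keeps each chunk readable, and the backward overlap means a fact that straddles a boundary still appears whole in at least one chunk.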
#%%
# Create a vector database object
vector_db = MyVectorDBConnector("demo_text_split", get_embeddings)
# Add the overlapping chunks to the vector database
vector_db.add_documents(chunks)
# Create a RAG bot
bot = RAG_Bot(
    vector_db,
    llm_api=get_completion
)
#%%
# user_query = "Does llama 2 have a commercial license?"
user_query = "How many parameters does llama 2 chat have?"
search_results = vector_db.search(user_query, 2)
for doc in search_results['documents'][0]:
    print(doc + "\n")
response = bot.chat(user_query)
print("====Reply====")
print(response)
====Reply====
llama 2 chat has 7B, 13B and 70B parameters.
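With overlapping, sentence-aligned chunks, the retrieved context now covers all three model sizes, so the answer is complete; compare the 70B-only answer from the baseline split above.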