LLM RAG in Practice | Text Transformation: Text Splitters, Chinese Title Enhancement, and an Advanced Ingestion Pipeline


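ThinkRAG's LLM RAG in Practice series takes a deep look at building a local LLM knowledge-base question-answering system with the LlamaIndex framework. The series covers knowledge-base management, retrieval optimization, local model deployment, and related topics, using code and examples to show how to build a production-grade system with fast retrieval and intelligent Q&A over a local knowledge base.

LLM-based RAG systems keep multiplying while the capabilities of the underlying models keep converging. What decides whether a system actually works in production is therefore how the raw data is processed.

LlamaIndex, a data framework built specifically for RAG, provides a solid set of components for loading documents, splitting text, computing vector embeddings, and so on. A common problem, however, is that LlamaIndex is not optimized for Chinese.

For example, as I mentioned in the previous article, the default tokenizer of the BM25 retriever is designed for English, so we had to change it to use the Chinese word-segmentation library jieba as the tokenizer.

In practice, then, we need to customize and tune the components LlamaIndex provides in order to get the best data-processing results.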

Text Splitter

Let's start with the text splitter.

Text splitting means breaking text into appropriately sized chunks along its natural boundaries, such as headings, paragraphs, and punctuation.
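
To make the idea concrete, here is a minimal sketch of a sentence-based splitter written with only the standard library. It is not how LlamaIndex or spaCy splits text: the function name `split_into_chunks` and the character-based `chunk_size` are assumptions for illustration, and it has no chunk overlap.

```python
import re


def split_into_chunks(text: str, chunk_size: int = 50) -> list[str]:
    """Greedily pack whole sentences into chunks of at most chunk_size characters.

    Hypothetical helper for illustration only. A single sentence longer than
    chunk_size is kept whole and becomes its own oversized chunk.
    """
    # Split right after Chinese or Western sentence-ending punctuation or a
    # newline, so each sentence keeps its closing punctuation mark.
    sentences = [s.strip() for s in re.split(r"(?<=[。！？!?\n])", text) if s.strip()]

    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and len(current) + len(sentence) > chunk_size:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks


text = "文本分割很重要。它影响检索效果！分块太小会丢失上下文。"
chunks = split_into_chunks(text, chunk_size=16)
```

Here the first two sentences fit into one chunk and the third starts a new one. Production splitters add refinements on top of this, such as overlap between neighboring chunks.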

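LlamaIndex ships several text splitters. I use the default Sentence Splitter in development; in production I recommend the Spacy Text Splitter, which handles Chinese better.

The code is as follows: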

```python
from config import DEV_MODE
from llama_index.core import Settings


def create_text_splitter(chunk_size=2048, chunk_overlap=512):
    if DEV_MODE:
        # Development environment: LlamaIndex's default sentence splitter
        from llama_index.core.node_parser import SentenceSplitter

        sentence_splitter = SentenceSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
        )

        return sentence_splitter
    else:
        # Production environment: LangChain's spaCy splitter with a Chinese
        # pipeline, wrapped so that LlamaIndex can use it as a node parser
        from langchain.text_splitter import SpacyTextSplitter
        from llama_index.core.node_parser import LangchainNodeParser

        spacy_text_splitter = LangchainNodeParser(SpacyTextSplitter(
            pipeline="zh_core_web_sm",
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
        ))

        return spacy_text_splitter


Settings.text_splitter = create_text_splitter()
```

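Before constructing the Spacy Text Splitter through the Langchain Node Parser, we first need to install spacy with pip, then download the zh_core_web_sm model and specify it as the pipeline; it has better support for Chinese.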

```shell
pip install spacy
spacy download zh_core_web_sm
```

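Note that the **chunk size** is critical: it directly affects the quality of retrieval and of the generated answers.

If, in your actual use case, most user questions ask for a summary, for example "What are the three characteristics of a process?", the chunks should not be too small; otherwise the answers tend to be incomplete.

I recommend a chunk size of 1024 or 2048.

Besides, today's LLMs have fairly large context windows. Moonshot AI, for example, offers three APIs with context windows of 8K, 32K, and 128K, enough to hold several retrieved chunks.

For example, if we retrieve the top 3 chunks (TopK = 3), we can merge them and pass them to the LLM in a single call to generate the answer. This reduces the number of LLM calls and makes better use of the model's long-context understanding and generation abilities.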
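
The merging step can be sketched as follows. This is only an illustration under stated assumptions: `build_prompt`, its prompt wording, and the `max_context_chars` budget are hypothetical names of mine, and counting characters is a crude stand-in for counting tokens against the model's real context window.

```python
def build_prompt(question: str, chunks: list[str], max_context_chars: int = 6000) -> str:
    """Merge retrieved chunks into one prompt so the LLM is called only once.

    Hypothetical helper: chunks are added in retrieval order until the next one
    would exceed the character budget.
    """
    selected, used = [], 0
    for i, chunk in enumerate(chunks, start=1):
        if used + len(chunk) > max_context_chars:
            break  # stop before overflowing the context budget
        selected.append(f"[{i}] {chunk}")
        used += len(chunk)

    context = "\n\n".join(selected)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


top_k_chunks = ["流程具有目标性。", "流程具有整体性。", "流程具有层次性。"]
prompt = build_prompt("流程有哪三个特征", top_k_chunks)
```

With three chunks of at most 2048 tokens each, the merged context stays around 6K tokens, which is why even the 8K window mentioned above leaves room for the question and the answer.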

Chinese Title Enhancement

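When building a RAG system, two excellent open-source projects are worth learning from: Langchain-chatchat and QAnything. Both use Chinese title enhancement, and both include a zh_title_enhance module in their code.

Chinese title enhancement means identifying and tagging whether each chunk produced by splitting is a title or subheading of the article. A title is effectively a summary of the text that follows it, and it is also a good marker for splitting text.

The code of this module is as follows: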

```python
from llama_index.core.schema import BaseNode  # modified based on Document in Langchain
from typing import List
import re


def under_non_alpha_ratio(text: str, threshold: float = 0.5):
    """Checks if the proportion of non-alpha characters in the text snippet exceeds a given
    threshold. This helps prevent text like "-----------BREAK---------" from being tagged
    as a title or narrative text. The ratio does not count spaces.

    Parameters
    ----------
    text
        The input string to test
    threshold
        If the proportion of alpha characters falls below this threshold, the function
        returns True
    """
    if len(text) == 0:
        return False

    alpha_count = len([char for char in text if char.strip() and char.isalpha()])
    total_count = len([char for char in text if char.strip()])
    try:
        ratio = alpha_count / total_count
        return ratio < threshold
    except ZeroDivisionError:
        # The text contains only whitespace
        return False


def is_possible_title(
        text: str,
        title_max_word_length: int = 20,
        non_alpha_threshold: float = 0.5,
) -> bool:
    """Checks to see if the text passes all of the checks for a valid title.

    Parameters
    ----------
    text
        The input text to check
    title_max_word_length
        The maximum number of characters a title can contain
    non_alpha_threshold
        The minimum proportion of alpha characters the text needs to be considered a title
    """

    # If the text length is zero, it is not a title
    if len(text) == 0:
        print("Not a title. Text is empty.")
        return False

    # If the text ends with punctuation, it is not a title
    ENDS_IN_PUNCT_PATTERN = r"[^\w\s]\Z"
    ENDS_IN_PUNCT_RE = re.compile(ENDS_IN_PUNCT_PATTERN)
    if ENDS_IN_PUNCT_RE.search(text) is not None:
        return False

    # The text length must not exceed the set value, which is 20 by default.
    # Length is measured in characters, since Chinese text is not space-delimited.
    if len(text) > title_max_word_length:
        return False

    # If too few of the characters are alphabetic, it is not a title.
    if under_non_alpha_ratio(text, threshold=non_alpha_threshold):
        return False

    # NOTE(robinson) - Prevent flagging salutations like "To My Dearest Friends," as titles
    if text.endswith((",", ".", "，", "。")):
        return False

    if text.isnumeric():
        print(f"Not a title. Text is all numeric:\n\n{text}")  # type: ignore
        return False

    # The first 5 characters must contain at least one numeric character.
    if len(text) < 5:
        text_5 = text
    else:
        text_5 = text[:5]
    numeric_in_text_5 = sum(list(map(lambda x: x.isnumeric(), list(text_5))))
    if not numeric_in_text_5:
        return False

    return True


def zh_title_enhance(docs: List[BaseNode]) -> List[BaseNode]:  # modified based on Document in Langchain
    title = None
    if len(docs) > 0:
        for doc in docs:
            if is_possible_title(doc.text):  # modified based on doc.page_content in Langchain
                doc.metadata['category'] = 'cn_Title'
                title = doc.text
            elif title:
                # Prefix means: "The following text relates to ({title})."
                doc.text = f"下文与({title})有关。{doc.text}"
        return docs
    else:
        print("No nodes to process")
        return docs  # return the empty list rather than None
```

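Reading the code, we can see that Chinese title enhancement is implemented very simply: a set of rules decides whether a piece of text is a title, for example whether the text ends with punctuation, whether it is longer than 20 characters, whether too large a share of its characters are non-alphabetic, and whether its first five characters contain a number.

If the text is judged to be a title, category is set to cn_Title in its metadata, which can later be used for metadata filtering.
If it is judged not to be a title, the text is marked as related to the preceding title, so that retrieval can later bring back the relevant content more completely.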
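
The tagging and prefixing behavior described above can be tried out with a small self-contained sketch. It is not the module itself: `Node` is a hypothetical stand-in for a LlamaIndex node, and `looks_like_title` applies only a simplified subset of the rules (length, trailing punctuation, a number near the start).

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a LlamaIndex text node."""
    text: str
    metadata: dict = field(default_factory=dict)


def looks_like_title(text: str, max_len: int = 20) -> bool:
    """Simplified subset of the title rules, for illustration only."""
    if not text or len(text) > max_len:
        return False
    if re.search(r"[^\w\s]\Z", text):  # ends with punctuation
        return False
    # At least one numeric character within the first five characters
    return any(ch.isnumeric() for ch in text[:5])


def tag_titles(nodes: list[Node]) -> list[Node]:
    current_title = None
    for node in nodes:
        if looks_like_title(node.text):
            node.metadata["category"] = "cn_Title"
            current_title = node.text
        elif current_title:
            # Prefix means: "The following text relates to (title)."
            node.text = f"下文与({current_title})有关。{node.text}"
    return nodes


nodes = tag_titles([
    Node("1.1 流程的定义"),
    Node("流程是一组将输入转化为输出的相互关联的活动。"),
    Node("流程有三个特征：目标性、整体性和层次性。"),
])

# Metadata filtering: keep only the nodes tagged as titles
titles = [n.text for n in nodes if n.metadata.get("category") == "cn_Title"]
```

One consequence of the number rule is worth keeping in mind: a heading without any number, such as "文本分割器", is not recognized as a title, so these rules work best on documents with numbered headings.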

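Because Langchain-chatchat and QAnything use the Langchain framework while we use LlamaIndex, some of the code has to be modified accordingly.

Going one step further, I wrapped it in a TransformComponent object named ChineseTitleExtractor, which makes it easy to call from LlamaIndex's ingestion pipeline (Ingestion Pipeline).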

```python
import re
from llama_index.core.schema import TransformComponent


class ChineseTitleExtractor(TransformComponent):
    def __call__(self, nodes, **kwargs):
        # zh_title_enhance is the function shown earlier, assumed to live in the same module
        nodes = zh_title_enhance(nodes)
        return nodes
```

Advanced Ingestion Pipeline

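LlamaIndex provides an ingestion pipeline (Ingestion Pipeline). It processes documents loaded from various sources (PDFs, web pages, and so on) into text chunks in a uniform way, calls the embedding model to generate vectors, and stores them in a vector database. At the same time, the text chunks are also stored in the document store.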

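So, in the ingestion pipeline, we need to configure the following:

- An embedding model, such as BAAI/bge-small-zh-v1.5
- A text splitter, such as the Sentence Splitter or the Spacy Text Splitter
- A document store database, such as MongoDB or Redis
- A vector database, such as Chroma or LanceDB

The code is as follows: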

```python
from llama_index.core import Settings
from llama_index.core.ingestion import IngestionPipeline, DocstoreStrategy
from server.splitters import ChineseTitleExtractor
from server.stores.strage_context import STORAGE_CONTEXT
from server.stores.ingestion_cache import INGESTION_CACHE


class AdvancedIngestionPipeline(IngestionPipeline):
    def __init__(
        self,
    ):
        # Initialize the embedding model, text splitter
        embed_model = Settings.embed_model
        text_splitter = Settings.text_splitter

        # Call the super class's __init__ method with the necessary arguments
        super().__init__(
            transformations=[
                text_splitter,
                embed_model,
                # Transformations run in list order, so this step runs after embedding
                ChineseTitleExtractor(),  # modified Chinese title enhance: zh_title_enhance
            ],
            docstore=STORAGE_CONTEXT.docstore,
            vector_store=STORAGE_CONTEXT.vector_store,
            cache=INGESTION_CACHE,
            docstore_strategy=DocstoreStrategy.UPSERTS,  # UPSERTS: Update or insert
        )

    # If you need to override the run method or add new methods, you can do so here
    def run(self, documents):
        nodes = super().run(documents=documents)
        print(f"Ingested {len(nodes)} Nodes")
        print(f"Load {len(self.docstore.docs)} documents into docstore")
        return nodes
```

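As the code shows, we built a new class, AdvancedIngestionPipeline, on top of LlamaIndex's IngestionPipeline.

In the text transformations (transformations), besides configuring the embedding model and the text splitter, we also added the ChineseTitleExtractor for Chinese title enhancement introduced above.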
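
To show what such a pipeline does conceptually, here is a toy sketch using only the standard library. It is not LlamaIndex's implementation: `TinyDocstore` and `run_pipeline` are hypothetical names, and the hash comparison only imitates the idea behind the UPSERTS strategy, where a document is reprocessed only if it is new or its content has changed.

```python
import hashlib


class TinyDocstore:
    """Hypothetical in-memory stand-in that remembers one content hash per document id."""

    def __init__(self):
        self.hashes: dict[str, str] = {}

    def upsert(self, doc_id: str, text: str) -> bool:
        """Return True if the document is new or changed and must be (re)processed."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if self.hashes.get(doc_id) == digest:
            return False  # unchanged: skip the transformations
        self.hashes[doc_id] = digest
        return True


def run_pipeline(docs: dict[str, str], store: TinyDocstore, transformations) -> list[str]:
    """Apply each transformation in list order to every new or changed document."""
    processed = []
    for doc_id, text in docs.items():
        if not store.upsert(doc_id, text):
            continue
        items = [text]
        for transform in transformations:
            items = transform(items)
        processed.extend(items)
    return processed


def split_sentences(items):
    return [s for text in items for s in text.split("。") if s]


def add_prefix(items):
    return [f"[chunk] {s}" for s in items]


store = TinyDocstore()
docs = {"doc-1": "第一句。第二句。"}
first_run = run_pipeline(docs, store, [split_sentences, add_prefix])
second_run = run_pipeline(docs, store, [split_sentences, add_prefix])
```

The first run processes the document and the second run skips it because nothing changed. The sketch also makes the ordering visible: each step only sees the output of the steps listed before it.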

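In this way, by choosing and building the text splitter, Chinese title enhancement, and the advanced ingestion pipeline, we arrive at a complete text-processing mechanism suited to Chinese, which lays a solid foundation for an LLM RAG system.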

Fully Open Source: ThinkRAG

All of the code above can be found in the open-source project ThinkRAG:

https://github.com/wzdavid/ThinkRAG

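ThinkRAG is an LLM knowledge-base RAG system that I developed on the LlamaIndex framework with a Streamlit front end. It can be deployed locally and run offline.

The project is open source on GitHub under the MIT license, and I will keep improving its code and documentation.

In the LLM RAG in Practice series, I will describe in detail how each ThinkRAG feature is implemented, along with the relevant code.

Follow along and get in touch!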

References:

https://zhuanlan.zhihu.com/p/638827267

https://docs.llamaindex.ai/en/stable/examples/ingestion/advanced_ingestion_pipeline/

https://github.com/chatchat-space/Langchain-Chatchat

https://github.com/netease-youdao/QAnything

https://github.com/wzdavid/ThinkRAG
