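Advanced RAG 03: Using RAGAs + LlamaIndex for RAG Evaluation

Summary

The article first introduces the three main parts of RAG evaluation: the input query, the retrieved context, and the response generated by the LLM.

It then covers the RAG evaluation metrics proposed by RAGAs (Faithfulness, Answer Relevance, and Context Relevance) and two additional metrics from the RAGAs website: Context Precision and Context Recall. It explains how each metric is computed and gives some examples.

Finally, it walks through the main process of RAG evaluation with RAGAs and LlamaIndex.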

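Key points

RAG evaluation is a complex process that needs several metrics to judge how effective a RAG application is. RAGAs and LlamaIndex provide an effective way to run this evaluation and can help developers improve the performance of their RAG applications.

If you have built a Retrieval Augmented Generation (RAG) application for a real business system, its effectiveness matters. In other words, you need to evaluate how well the RAG performs.

If the existing RAG turns out not to be effective enough, you will likely try methods to improve it, and you need evaluation again to check whether those improvements actually work.

This article first introduces the RAG evaluation metrics proposed in the RAGAs (Retrieval Augmented Generation Assessment) paper, a framework for evaluating RAG pipelines. It then explains how to implement the whole evaluation process with RAGAs + LlamaIndex.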

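RAG evaluation metrics

Simply put, the RAG process involves three main parts: the input query, the retrieved context, and the response generated by the LLM. These three elements form the most important triad in the RAG process and depend on one another.

Therefore, as shown in Figure 1, RAG effectiveness can be evaluated by measuring the relevance between the elements of this triad.

Figure 1: RAG effectiveness can be evaluated by measuring the relevance between these three elements.

The RAGAs paper mentions three metrics in total: Faithfulness, Answer Relevance, and Context Relevance. These metrics do not require human-annotated datasets or reference answers.

In addition, the RAGAs website introduces two more metrics: Context Precision and Context Recall.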

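Faithfulness/Groundedness

Faithfulness means that the answer is grounded in the given context.

This matters for avoiding hallucinations and for ensuring that the retrieved context can serve as the justification for the generated answer.

A low score indicates that the LLM's response does not adhere to the retrieved knowledge, so a hallucinated answer becomes more likely.

To estimate faithfulness, we first use an LLM to extract a set of statements S(a(q)). The prompt is as follows: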

Given a question and answer, create one or more statements from each sentence in the given answer.
question: [question]
answer: [answer]

After S(a(q)) is generated, the LLM determines whether each statement s_i can be inferred from c(q). This verification step uses the following prompt:

Consider the given context and following statements, then determine whether they are supported by the information present in the context. Provide a brief explanation for each statement before arriving at the verdict (Yes/No). Provide a final verdict for each statement in order at the end in the given format. Do not deviate from the specified format.

statement: [statement 1]
...
statement: [statement n]

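The final faithfulness score F is computed as F = |V| / |S|, where |V| is the number of statements the LLM judged to be supported and |S| is the total number of statements.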
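The ratio F = |V| / |S| can be sketched in a few lines of Python. This is only an illustration, not the RAGAs implementation: the function name faithfulness_score and the sample verdicts are hypothetical, and it assumes the LLM's final verdicts have already been parsed into "Yes"/"No" strings, one per statement.

```python
def faithfulness_score(verdicts):
    """Return F = |V| / |S| given one Yes/No verdict per statement."""
    if not verdicts:
        raise ValueError("at least one statement is required")
    # |V|: statements the LLM judged to be supported by the context.
    supported = sum(1 for v in verdicts if v.strip().lower() == "yes")
    # |S|: total number of statements extracted from the answer.
    return supported / len(verdicts)


# Hypothetical verdicts for four statements extracted from one answer.
verdicts = ["Yes", "Yes", "No", "Yes"]
score = faithfulness_score(verdicts)
print(score)  # 0.75
```

Three of the four statements are supported, so the answer scores 0.75. An answer in which every statement is backed by the context scores 1.0.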

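Answer Relevance

This metric measures how relevant the generated answer is to the query. The higher the score, the better the relevance.

To estimate the relevance of an answer, we prompt the LLM to generate n potential questions q_i based on the given answer a(q), as follows: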

Generate a question for the given answer.
answer: [answer]

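We then obtain embeddings for all the questions using a text embedding model.

For each q_i, we compute its similarity sim(q, q_i) to the original question q, which is the cosine similarity between the corresponding embeddings. The answer relevance score AR for question q is then computed as:

AR = (1/n) Σ_{i=1}^{n} sim(q, q_i)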
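The averaging step can be sketched as follows. This is a minimal sketch under stated assumptions, not the RAGAs code: the helper names cosine_similarity and answer_relevance are hypothetical, and the two-dimensional vectors stand in for real embeddings that a text embedding model would produce.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def answer_relevance(question_emb, generated_embs):
    """Mean cosine similarity between q and the n generated questions q_i."""
    sims = [cosine_similarity(question_emb, e) for e in generated_embs]
    return sum(sims) / len(sims)


# Toy embeddings (assumed values): the original question and n = 3 questions
# generated from the answer.
q = [1.0, 0.0]
generated = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ar = answer_relevance(q, generated)
print(round(ar, 3))  # 0.569
```

The three similarities are 1.0, 0.0, and about 0.707, so their mean is about 0.569. Generated questions that drift away from the original question pull the score down.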

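Context Relevance

This metric measures retrieval quality: it assesses how well the retrieved context supports the query. A low score means that a lot of irrelevant content was retrieved, which may affect the final answer generated by the LLM.

To estimate context relevance, an LLM extracts a set of key sentences (S_ext) from the context c(q). These sentences are the ones crucial for answering the question. The prompt is as follows: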

Please extract relevant sentences from the provided context that can potentially help answer the following question. 
If no relevant sentences are found, or if you believe the question cannot be answered from the given context, 
return the phrase "Insufficient Information". 
While extracting candidate sentences you’re not allowed to make any changes to sentences from given context.

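Then, in RAGAs, relevance is computed at the sentence level with the following formula:

CR = (number of extracted sentences) / (total number of sentences in c(q))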
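The sentence-level ratio can be illustrated with a short Python sketch. It is not the RAGAs implementation: the function name context_relevance and the sample sentences are hypothetical, the context is assumed to be already split into sentences, and an "Insufficient Information" reply from the LLM is assumed to be mapped to an empty extraction list.

```python
def context_relevance(extracted_sentences, context_sentences):
    """Return the share of context sentences that were extracted as relevant."""
    if not context_sentences:
        raise ValueError("the context must contain at least one sentence")
    return len(extracted_sentences) / len(context_sentences)


# Hypothetical retrieved context, already split into sentences.
context = [
    "TinyLlama is a compact 1.1B language model.",
    "It was pretrained on around 1 trillion tokens.",
    "The appendix lists the hardware used.",
    "The authors thank their reviewers.",
]
# Assume the LLM extracted the first two sentences as relevant to the question.
extracted = context[:2]
cr = context_relevance(extracted, context)
print(cr)  # 0.5
```

Half of the retrieved sentences help answer the question, so the score is 0.5. An empty extraction list gives 0.0.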

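Context Recall

This metric measures how consistent the retrieved context is with the annotated answer.

It is computed from the ground truth and the retrieved context; higher values indicate better performance.

To use it, you need to provide ground truth data. The formula is:

Context Recall = |GT sentences that can be attributed to the context| / |sentences in GT|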
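As a rough illustration, context recall is the share of ground-truth sentences that the retrieved context can account for. The sketch below assumes an LLM has already decided, for each ground-truth sentence, whether it can be attributed to the context. The function name context_recall and the sample flags are hypothetical.

```python
def context_recall(attributed):
    """attributed: one boolean per ground-truth sentence, True if the
    sentence can be attributed to the retrieved context."""
    if not attributed:
        raise ValueError("the ground truth must contain at least one sentence")
    return sum(attributed) / len(attributed)


# Hypothetical attribution flags for a three-sentence ground-truth answer.
attributed = [True, True, False]
recall = context_recall(attributed)
print(round(recall, 3))  # 0.667
```

Two of the three ground-truth sentences are covered by the retrieved context, which gives a recall of about 0.667.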

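Context Precision

This metric is relatively complex. It measures whether all the relevant contexts, i.e. those containing ground-truth facts, are ranked at the top. A higher score indicates higher precision.

The formula is as follows:

Context Precision@K = Σ_{k=1}^{K} (Precision@k × v_k) / (total number of relevant items in the top K results)
Precision@k = true positives@k / (true positives@k + false positives@k)

Here K is the total number of retrieved chunks, and v_k ∈ {0, 1} indicates whether the chunk at rank k is relevant.

The advantage of context precision is that it is sensitive to ranking. Its drawback is that if very few relevant chunks are recalled but they all rank high, the score is still high. It therefore needs to be considered together with the other metrics to judge the overall effect.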
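Both the ranking sensitivity and the drawback show up in a small Python sketch of Context Precision@K. This is an illustration under stated assumptions, not the RAGAs code: the function name context_precision_at_k is hypothetical, and the input is assumed to be a list of 0/1 relevance flags for the retrieved chunks in rank order.

```python
def context_precision_at_k(relevance):
    """relevance: 0/1 flags for the top-K retrieved chunks, in rank order."""
    hits = 0        # relevant chunks seen so far (true positives@k)
    weighted = 0.0  # running sum of Precision@k * v_k
    for k, v_k in enumerate(relevance, start=1):
        if v_k:
            hits += 1
            weighted += hits / k  # Precision@k, counted only at relevant ranks
    # Normalize by the number of relevant chunks in the top K.
    return weighted / hits if hits else 0.0


print(context_precision_at_k([1, 1, 0, 0]))            # 1.0
print(round(context_precision_at_k([0, 0, 1, 1]), 3))  # 0.417
print(context_precision_at_k([1, 0, 0, 0]))            # 1.0
```

The first two calls contain the same two relevant chunks, but ranking them first scores 1.0 while ranking them last scores about 0.417. The third call shows the drawback: a single relevant chunk at rank 1 also scores 1.0, even though only one relevant chunk was retrieved.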

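Using RAGAs + LlamaIndex for RAG evaluation

The main process is shown in Figure 6.

Figure 6: The main process of RAG evaluation with RAGAs + LlamaIndex.

Environment configuration

Install ragas with pip install ragas, then check the installed version.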

(py) Florian:~ Florian$ pip list | grep ragas
ragas                        0.0.22

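Note that if you install the latest version (v0.1.0rc1) with pip install git+https://github.com/explodinggradients/ragas.git, LlamaIndex is not supported.

Next, import the relevant libraries and set the environment variables and global variables.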

import os
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_KEY"
dir_path = "YOUR_DIR_PATH"

from llama_index import VectorStoreIndex, SimpleDirectoryReader
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_relevancy,
    context_recall,
    context_precision
)

from ragas.llama_index import evaluate

The directory contains a single PDF file, the paper “TinyLlama: An Open Source Small Language Model”.

(py) Florian:~ Florian$ ls /Users/Florian/Downloads/pdf_test/
tinyllama.pdf

Build a simple RAG query engine with LlamaIndex

documents = SimpleDirectoryReader(dir_path).load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

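By default, LlamaIndex uses OpenAI models, but the LLM and the embedding model can easily be configured through ServiceContext.

Build the evaluation dataset

Since some metrics require a human-annotated dataset, I wrote a few questions and their corresponding answers myself.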

eval_questions = [
    "Can you provide a concise description of the TinyLlama model?",
    "I would like to know the speed optimizations that TinyLlama has made.",
    "Why TinyLlama uses Grouped-query Attention?",
    "Is the TinyLlama model open source?",
    "Tell me about starcoderdata dataset",
]
eval_answers = [
    "TinyLlama is a compact 1.1B language model pretrained on around 1 trillion tokens for approximately 3 epochs. Building on the architecture and tokenizer of Llama 2, TinyLlama leverages various advances contributed by the open-source community (e.g., FlashAttention), achieving better computational efficiency. Despite its relatively small size, TinyLlama demonstrates remarkable performance in a series of downstream tasks. It significantly outperforms existing open-source language models with comparable sizes.",
    "During training, our codebase has integrated FSDP to leverage multi-GPU and multi-node setups efficiently. Another critical improvement is the integration of Flash Attention, an optimized attention mechanism. We have replaced the fused SwiGLU module from the xFormers (Lefaudeux et al., 2022) repository with the original SwiGLU module, further enhancing the efficiency of our codebase. With these features, we can reduce the memory footprint, enabling the 1.1B model to fit within 40GB of GPU RAM.",  
    "To reduce memory bandwidth overhead and speed up inference, we use grouped-query attention in our model. We have 32 heads for query attention and use 4 groups of key-value heads. With this technique, the model can share key and value representations across multiple heads without sacrificing much performance",
    "Yes, TinyLlama is open-source",
    "This dataset was collected to train StarCoder (Li et al., 2023), a powerful opensource large code language model. It comprises approximately 250 billion tokens across 86 programming languages. In addition to code, it also includes GitHub issues and text-code pairs that involve natural languages.",
]
# Wrap each answer in a list: one list of ground-truth answers per question.
eval_answers = [[a] for a in eval_answers]

Select metrics and run the RAGAs evaluation

metrics = [
    faithfulness,
    answer_relevancy,
    context_relevancy,
    context_precision,
    context_recall,
]

result = evaluate(query_engine, metrics, eval_questions, eval_answers)
result.to_pandas().to_csv('YOUR_CSV_PATH', sep=',')

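Note that RAGAs uses OpenAI models by default.

If you want to use another LLM (such as Gemini) for evaluation together with LlamaIndex, I could not find any workable way to do so in RAGAs 0.0.22, even after debugging the RAGAs source code.

Final code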

import os
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_KEY"
dir_path = "YOUR_DIR_PATH"

from llama_index import VectorStoreIndex, SimpleDirectoryReader
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_relevancy,
    context_recall,
    context_precision
)

from ragas.llama_index import evaluate

documents = SimpleDirectoryReader(dir_path).load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

eval_questions = [
    "Can you provide a concise description of the TinyLlama model?",
    "I would like to know the speed optimizations that TinyLlama has made.",
    "Why TinyLlama uses Grouped-query Attention?",
    "Is the TinyLlama model open source?",
    "Tell me about starcoderdata dataset",
]
eval_answers = [
    "TinyLlama is a compact 1.1B language model pretrained on around 1 trillion tokens for approximately 3 epochs. Building on the architecture and tokenizer of Llama 2, TinyLlama leverages various advances contributed by the open-source community (e.g., FlashAttention), achieving better computational efficiency. Despite its relatively small size, TinyLlama demonstrates remarkable performance in a series of downstream tasks. It significantly outperforms existing open-source language models with comparable sizes.",
    "During training, our codebase has integrated FSDP to leverage multi-GPU and multi-node setups efficiently. Another critical improvement is the integration of Flash Attention, an optimized attention mechanism. We have replaced the fused SwiGLU module from the xFormers (Lefaudeux et al., 2022) repository with the original SwiGLU module, further enhancing the efficiency of our codebase. With these features, we can reduce the memory footprint, enabling the 1.1B model to fit within 40GB of GPU RAM.",  
    "To reduce memory bandwidth overhead and speed up inference, we use grouped-query attention in our model. We have 32 heads for query attention and use 4 groups of key-value heads. With this technique, the model can share key and value representations across multiple heads without sacrificing much performance",
    "Yes, TinyLlama is open-source",
    "This dataset was collected to train StarCoder (Li et al., 2023), a powerful opensource large code language model. It comprises approximately 250 billion tokens across 86 programming languages. In addition to code, it also includes GitHub issues and text-code pairs that involve natural languages.",
]
# Wrap each answer in a list: one list of ground-truth answers per question.
eval_answers = [[a] for a in eval_answers]

metrics = [
    faithfulness,
    answer_relevancy,
    context_relevancy,
    context_precision,
    context_recall,
]

result = evaluate(query_engine, metrics, eval_questions, eval_answers)
result.to_pandas().to_csv('YOUR_CSV_PATH', sep=',')