Evaluate RAG with LlamaIndex


In this notebook we will look at how to build a RAG pipeline and evaluate it with LlamaIndex. It has the following three sections.

  1. Understanding Retrieval-Augmented Generation (RAG).
  2. Building RAG with LlamaIndex.
  3. Evaluating RAG with LlamaIndex.

Retrieval-Augmented Generation (RAG)

Large Language Models (LLMs) are trained on vast datasets, but these datasets will not include your specific data. Retrieval-Augmented Generation (RAG) addresses this by dynamically incorporating your data during the generation process. This is done not by altering the LLM's training data, but by allowing the model to access and utilize your data in real time to provide more tailored and contextually relevant responses.

In RAG, your data is loaded and prepared for queries, or "indexed". User queries act on the index, which filters your data down to the most relevant context. This context and your query then go to the LLM along with a prompt, and the LLM provides a response.

Even if what you are building is a chatbot or an agent, you will want to know RAG techniques for getting data into your application.

RAG Overview

Stages within RAG

There are five key stages within RAG, which will in turn be a part of any larger application you build. These are:

Loading: This refers to getting your data from where it lives (whether text files, PDFs, another website, a database, or an API) into your pipeline. LlamaHub provides hundreds of connectors to choose from.

Indexing: This means creating a data structure that allows for querying the data. For LLMs this nearly always means creating vector embeddings (numerical representations of the meaning of your data), along with numerous other metadata strategies to make it easy to accurately find contextually relevant data.

Storing: Once your data is indexed, you will want to store your index, along with any other metadata, to avoid having to re-index it (see the short sketch after this list).

Querying: For any given indexing strategy there are many ways you can utilize LLMs and LlamaIndex data structures to query, including sub-queries, multi-step queries, and hybrid strategies.

Evaluation: A critical step in any pipeline is checking how effective it is relative to other strategies, or when you make changes. Evaluation provides objective measures of how accurate, faithful, and fast your responses to queries are.
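
To connect these stages to concrete calls, here is a minimal sketch (not part of the notebook that follows) of how they map onto LlamaIndex; the persist directory "./storage" and the reload step are illustrative assumptions.

# Minimal sketch of the five stages with LlamaIndex. Paths are illustrative assumptions.
from llama_index import VectorStoreIndex, SimpleDirectoryReader
from llama_index import StorageContext, load_index_from_storage

# Loading: read documents from disk
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()

# Indexing: build vector embeddings over the documents
index = VectorStoreIndex.from_documents(documents)

# Storing: persist the index so it does not have to be rebuilt every run
index.storage_context.persist(persist_dir="./storage")

# ...later, reload the stored index instead of re-indexing
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)

# Querying: ask questions against the index (evaluation is covered later in this notebook)
print(index.as_query_engine().query("What did the author work on?"))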

Build RAG System

Now that we understand the significance of a RAG system, let's build a simple RAG pipeline.

!pip install llama-index

# The nest_asyncio module enables the nesting of asynchronous functions within an already running async loop.
# This is necessary because Jupyter notebooks inherently operate in an asynchronous loop.
# By applying nest_asyncio, we can run additional async functions within this existing loop without conflicts.
import nest_asyncio

nest_asyncio.apply()

from llama_index.evaluation import generate_question_context_pairs
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.node_parser import SimpleNodeParser
from llama_index.evaluation import RetrieverEvaluator
from llama_index.llms import OpenAI

import os
import pandas as pd

Set Your OpenAI API Key

Before using the OpenAI API, you need to set up your API key. You can obtain an API key from OpenAI's official website. Copy and paste your API key into the code cell below and run it to set the key.

os.environ['OPENAI_API_KEY'] = 'YOUR OPENAI API KEY'

Let's use the text of Paul Graham's essay to build the RAG pipeline.

Download Data

!mkdir -p 'data/paul_graham/'
!curl 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -o 'data/paul_graham/paul_graham_essay.txt'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 75042  100 75042    0     0   190k      0 --:--:-- --:--:-- --:--:--  190k

Load the data and build an index.

documents = SimpleDirectoryReader("./data/paul_graham/").load_data()

# Define an LLM
llm = OpenAI(model="gpt-4")

# Build the index with a chunk_size of 512
node_parser = SimpleNodeParser.from_defaults(chunk_size=512)
nodes = node_parser.get_nodes_from_documents(documents)
vector_index = VectorStoreIndex(nodes)

Build a QueryEngine and start querying.

query_engine = vector_index.as_query_engine()

response_vector = query_engine.query("What did the author do growing up?")

Check the response.

response_vector.response

'The author wrote short stories and worked on programming, specifically on an IBM 1401 computer using an early version of Fortran.'

By default, it retrieves two similar nodes/chunks. You can modify that with vector_index.as_query_engine(similarity_top_k=k).
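
For example, a hypothetical variant that retrieves five chunks instead of two might look like this (the variable names are illustrative, and this engine is not used later in the notebook):

# Illustrative only: retrieve the top 5 most similar chunks instead of the default 2
query_engine_top5 = vector_index.as_query_engine(similarity_top_k=5)
response_top5 = query_engine_top5.query("What did the author do growing up?")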

Let's check the text in each of the retrieved nodes.

# First retrieved node
response_vector.source_nodes[0].get_text()

'What I Worked On\n\nFebruary 2021\n\nBefore college the two main things I worked on, outside of school, were writing and programming. I didn\'t write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.\n\nThe first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district\'s 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain\'s lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.\n\nThe language we used was an early version of Fortran. You had to type programs on punch cards, then stack them in the card reader and press a button to load the program into memory and run it. The result would ordinarily be to print something on the spectacularly loud printer.\n\nI was puzzled by the 1401. I couldn\'t figure out what to do with it. And in retrospect there\'s not much I could have done with it. The only form of input to programs was data stored on punched cards, and I didn\'t have any data stored on punched cards. The only other option was to do things that didn\'t rely on any input, like calculate approximations of pi, but I didn\'t know enough math to do anything interesting of that type. So I\'m not surprised I can\'t remember any programs I wrote, because they can\'t have done much. My clearest memory is of the moment I learned it was possible for programs not to terminate, when one of mine didn\'t. On a machine without time-sharing, this was a social as well as a technical error, as the data center manager\'s expression made clear.\n\nWith microcomputers, everything changed.'

# Second retrieved node
response_vector.source_nodes[1].get_text()

"It felt like I was doing life right. I remember that because I was slightly dismayed at how novel it felt. The good news is that I had more moments like this over the next few years.\n\nIn the summer of 2016 we moved to England. We wanted our kids to see what it was like living in another country, and since I was a British citizen by birth, that seemed the obvious choice. We only meant to stay for a year, but we liked it so much that we still live there. So most of Bel was written in England.\n\nIn the fall of 2019, Bel was finally finished. Like McCarthy's original Lisp, it's a spec rather than an implementation, although like McCarthy's Lisp it's a spec expressed as code.\n\nNow that I could write essays again, I wrote a bunch about topics I'd had stacked up. I kept writing essays through 2020, but I also started to think about other things I could work on. How should I choose what to do? Well, how had I chosen what to work on in the past? I wrote an essay for myself to answer that question, and I was surprised how long and messy the answer turned out to be. If this surprised me, who'd lived it, then I thought perhaps it would be interesting to other people, and encouraging to those with similarly messy lives. So I wrote a more detailed version for others to read, and this is the last sentence of it.\n\n\n\n\n\n\n\n\n\nNotes\n\n[1] My experience skipped a step in the evolution of computers: time-sharing machines with interactive OSes. I went straight from batch processing to microcomputers, which made microcomputers seem all the more exciting.\n\n[2] Italian words for abstract concepts can nearly always be predicted from their English cognates (except for occasional traps like polluzione). It's the everyday words that differ. So if you string together a lot of abstract concepts with a few simple verbs, you can make a little Italian go a long way.\n\n[3] I lived at Piazza San Felice 4, so my walk to the Accademia went straight down the spine of old Florence: past the Pitti, across the bridge, past Orsanmichele, between the Duomo and the Baptistery, and then up Via Ricasoli to Piazza San Marco."

We have built a RAG pipeline and now need to evaluate its performance. We can assess our RAG system/query engine using LlamaIndex's core evaluation modules. Let's examine how to leverage these tools to quantify the quality of our retrieval-augmented generation system.

Evaluation

Evaluation should serve as the primary metric for assessing your RAG application. It determines whether the pipeline will produce accurate responses given the data sources and a range of queries.

While it is beneficial to examine individual queries and responses at the start, this approach becomes impractical as the volume of edge cases and failures grows. Instead, it may be more effective to establish a suite of summary metrics or automated evaluations. These tools can provide insights into overall system performance and indicate specific areas that may require closer scrutiny.

In a RAG system, evaluation focuses on two critical aspects:

  • Retrieval Evaluation: This assesses the accuracy and relevance of the information retrieved by the system.
  • Response Evaluation: This measures the quality and appropriateness of the responses generated by the system based on the retrieved information.

Question-Context Pair Generation:

For the evaluation of a RAG system, it is essential to have queries that can fetch the correct context and subsequently generate an appropriate response. LlamaIndex offers a generate_question_context_pairs module specifically for crafting question and context pairs, which can be used for both the retrieval and response evaluation of the RAG system. For more details on question generation, please refer to the documentation.

qa_dataset = generate_question_context_pairs(
    nodes,
    llm=llm,
    num_questions_per_chunk=2
)

100%|██████████| 58/58 [06:26<00:00,  6.67s/it]
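
Before running the evaluations, it can help to spot-check and persist the generated dataset. The sketch below assumes the dataset exposes queries, corpus, and relevant_docs mappings together with a save_json helper; treat these attribute names as assumptions if your LlamaIndex version differs.

# Spot-check one generated question and the chunk it was generated from,
# then persist the dataset so it can be reused across evaluation runs.
sample_id, sample_query = list(qa_dataset.queries.items())[0]
print(sample_query)
print(qa_dataset.corpus[qa_dataset.relevant_docs[sample_id][0]][:200])

qa_dataset.save_json("pg_eval_dataset.json")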

Retrieval Evaluation:

We are now prepared to conduct our retrieval evaluations. We will execute our RetrieverEvaluator using the evaluation dataset we have generated.

We first create the Retriever and then define display_results, which presents the outcomes of the evaluation over the dataset in a table.

Let's create the retriever.

retriever = vector_index.as_retriever(similarity_top_k=2)

Define the RetrieverEvaluator. We use Hit Rate and Mean Reciprocal Rank (MRR) metrics to evaluate our retriever.

Hit Rate:

Hit rate calculates the fraction of queries where the correct answer is found within the top-k retrieved documents. In simpler terms, it is about how often our system gets it right within the top few guesses.

Mean Reciprocal Rank (MRR):

For each query, MRR evaluates the system's accuracy by looking at the rank of the highest-placed relevant document. Specifically, it is the average of the reciprocals of these ranks across all queries. So, if the first relevant document is the top result, the reciprocal rank is 1; if it is second, the reciprocal rank is 1/2, and so on.
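
To make the two metrics concrete, here is a small self-contained sketch (toy data, not part of the notebook) that computes hit rate and MRR for three hypothetical queries:

# Toy example: for each query we have the ids the retriever returned (in rank order)
# and the single id that is actually relevant.
retrieved = [["n3", "n7"], ["n1", "n4"], ["n9", "n2"]]
expected = ["n7", "n5", "n9"]

hits = [1.0 if exp in ret else 0.0 for ret, exp in zip(retrieved, expected)]
reciprocal_ranks = [
    1.0 / (ret.index(exp) + 1) if exp in ret else 0.0
    for ret, exp in zip(retrieved, expected)
]

print(sum(hits) / len(hits))                          # hit rate: 2/3 ≈ 0.67
print(sum(reciprocal_ranks) / len(reciprocal_ranks))  # MRR: (0.5 + 0 + 1) / 3 = 0.5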

Let's check these metrics to assess the performance of our retriever.

retriever_evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=retriever
)

# Evaluate
eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)

Let's define a function to display the retrieval evaluation results in a table.

def display_results(name, eval_results):
    """Display results from evaluate."""

    metric_dicts = []
    for eval_result in eval_results:
        metric_dict = eval_result.metric_vals_dict
        metric_dicts.append(metric_dict)

    full_df = pd.DataFrame(metric_dicts)

    hit_rate = full_df["hit_rate"].mean()
    mrr = full_df["mrr"].mean()

    metric_df = pd.DataFrame(
        {"Retriever Name": [name], "Hit Rate": [hit_rate], "MRR": [mrr]}
    )

    return metric_df

display_results("OpenAI Embedding Retriever", eval_results)

               Retriever Name  Hit Rate      MRR
0  OpenAI Embedding Retriever  0.758621  0.62069

Observation:

The retriever with OpenAI embeddings demonstrates a hit rate of 0.7586, while the MRR of 0.6206 suggests there is room for improvement in ensuring that the most relevant results appear at the top. The observation that MRR is less than the hit rate indicates that the top-ranking result is not always the most relevant one. Improving MRR could involve the use of rerankers, which refine the order of retrieved documents. For a deeper dive into how rerankers can optimize retrieval metrics, refer to the detailed discussion in our blog posts.
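
As a rough illustration of that idea (not something used elsewhere in this notebook), the sketch below over-retrieves candidates and lets a cross-encoder reranker reorder them before answering; the SentenceTransformerRerank import path and the model name are assumptions about the installed LlamaIndex and sentence-transformers versions.

# Illustrative reranking setup: over-retrieve 10 candidates, then let a
# cross-encoder reranker keep only the 2 most relevant ones.
# Assumes sentence-transformers is installed and this import path exists in your version.
from llama_index.postprocessor import SentenceTransformerRerank

reranker = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-6-v2", top_n=2
)
reranking_query_engine = vector_index.as_query_engine(
    similarity_top_k=10, node_postprocessors=[reranker]
)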

Response Evaluation:

  1. FaithfulnessEvaluator: Measures whether the response from a query engine matches any of the source nodes, which is useful for checking whether the response is hallucinated.
  2. RelevancyEvaluator: Measures whether the response together with the source nodes matches the query.

# Get the list of queries from the dataset created above

queries = list(qa_dataset.queries.values())

FaithfulnessEvaluator

Let's start with FaithfulnessEvaluator.

We will use gpt-3.5-turbo for generating responses to a given query and gpt-4 for the evaluation.

Let's create a service_context separately for gpt-3.5-turbo and gpt-4.

# gpt-3.5-turbo
gpt35 = OpenAI(temperature=0, model="gpt-3.5-turbo")
service_context_gpt35 = ServiceContext.from_defaults(llm=gpt35)

# GPT-4
gpt4 = OpenAI(temperature=0, model="gpt-4")
service_context_gpt4 = ServiceContext.from_defaults(llm=gpt4)

Create a QueryEngine with the gpt-3.5-turbo service_context to generate responses for the queries.

vector_index = VectorStoreIndex(nodes, service_context=service_context_gpt35)
query_engine = vector_index.as_query_engine()

Create a FaithfulnessEvaluator.

from llama_index.evaluation import FaithfulnessEvaluator
faithfulness_gpt4 = FaithfulnessEvaluator(service_context=service_context_gpt4)

Let's evaluate on one question.

eval_query = queries[10]

eval_query

"Based on the author's experience and observations, why did he consider the AI practices during his first year of grad school as a hoax? Provide specific examples from the text to support your answer."

Generate the response first and then run the faithfulness evaluator.

response_vector = query_engine.query(eval_query)

# Compute the faithfulness evaluation

eval_result = faithfulness_gpt4.evaluate_response(response=response_vector)

# You can check the `passing` attribute in `eval_result` to see if it passed the evaluation.
eval_result.passing

True

RelevancyEvaluator

RelevancyEvaluator is useful for measuring whether the response and source nodes (retrieved context) match the query. It helps to see whether the response actually answers the query.

Instantiate a RelevancyEvaluator for relevancy evaluation with gpt-4.

from llama_index.evaluation import RelevancyEvaluator

relevancy_gpt4 = RelevancyEvaluator(service_context=service_context_gpt4)

Let's do the relevancy evaluation for one of the queries.

# Pick a query
query = queries[10]

query

"Based on the author's experience and observations, why did he consider the AI practices during his first year of grad school as a hoax? Provide specific examples from the text to support your answer."

# Generate the response.
# response_vector has the response and source nodes (retrieved context).
response_vector = query_engine.query(query)

# Relevancy evaluation
eval_result = relevancy_gpt4.evaluate_response(
    query=query, response=response_vector
)

# You can check the `passing` attribute in `eval_result` to see if it passed the evaluation.
eval_result.passing

True

# You can get the feedback for the evaluation.
eval_result.feedback

'YES'

Batch Evaluator:

Now that we have done the faithfulness and relevancy evaluations independently, LlamaIndex also has a BatchEvalRunner to compute multiple evaluations in a batch-wise manner.

from llama_index.evaluation import BatchEvalRunner

# Let's pick the top 10 queries to evaluate
batch_eval_queries = queries[:10]

# Initiate BatchEvalRunner to compute the faithfulness and relevancy evaluations.
runner = BatchEvalRunner(
    {"faithfulness": faithfulness_gpt4, "relevancy": relevancy_gpt4},
    workers=8,
)

# Compute the evaluations
eval_results = await runner.aevaluate_queries(
    query_engine, queries=batch_eval_queries
)

# Let's get faithfulness score

faithfulness_score = sum(result.passing for result in eval_results['faithfulness']) / len(eval_results['faithfulness'])

faithfulness_score

1.0

# Let's get relevancy score

relevancy_score = sum(result.passing for result in eval_results['relevancy']) / len(eval_results['relevancy'])

relevancy_score

1.0
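
When the scores dip below 1.0, it helps to see which queries failed. The snippet below collects the per-query batch results into a dataframe for inspection; the query, passing, and feedback attribute names on each result are assumptions about the EvaluationResult objects returned by the runner.

# Collect per-query pass/fail and feedback into a dataframe for inspection.
# Attribute names (query, passing, feedback) are assumptions about EvaluationResult.
rows = []
for faith, rel in zip(eval_results["faithfulness"], eval_results["relevancy"]):
    rows.append(
        {
            "query": rel.query,
            "faithfulness_passing": faith.passing,
            "relevancy_passing": rel.passing,
            "relevancy_feedback": rel.feedback,
        }
    )
pd.DataFrame(rows)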

Observation:

A faithfulness score of 1.0 signifies that the generated answers contain no hallucinations and are entirely grounded in the retrieved context.

A relevancy score of 1.0 suggests that the generated answers are consistently aligned with the retrieved context and the queries.

Conclusion

In this notebook, we have explored how to build and evaluate a RAG pipeline using LlamaIndex, with a specific focus on evaluating the retrieval system and the responses generated within the pipeline.

LlamaIndex offers a variety of other evaluation modules as well, which you can explore further here.