Downloading a LlamaDataset from LlamaHub
You can browse our available benchmark datasets at llamahub.ai. This notebook guide describes how to download a dataset and its source text documents. In particular, download_llama_dataset will download the evaluation dataset (i.e., the LabelledRagDataset) as well as the Document's of the source text files that were used to build the evaluation dataset.

Finally, this notebook also demonstrates an end-to-end workflow of downloading an evaluation dataset, making predictions against it with your own RAG pipeline (query engine), and then evaluating those predictions.
%pip install llama-index-llms-openai
from llama_index.core.llama_dataset import download_llama_dataset

# download and install dependencies
rag_dataset, documents = download_llama_dataset(
    "PaulGrahamEssayDataset", "./paul_graham"
)
github url: https://raw.githubusercontent.com/nerdai/llama-hub/datasets/llama_hub/llama_datasets/library.json
github url: https://media.githubusercontent.com/media/run-llama/llama_datasets/main/llama_datasets/paul_graham_essay/rag_dataset.json
github url: https://media.githubusercontent.com/media/run-llama/llama_datasets/main/llama_datasets/paul_graham_essay/source.txt
rag_dataset.to_pandas()[:5]
| | query | reference_contexts | reference_answer | reference_answer_by | query_by |
|---|---|---|---|---|---|
| 0 | In the essay, the author mentions his early ex... | [What I Worked On\n\nFebruary 2021\n\nBefore c... | The first computer the author used for program... | ai (gpt-4) | ai (gpt-4) |
| 1 | The author switched his major from philosophy ... | [What I Worked On\n\nFebruary 2021\n\nBefore c... | The two specific influences that led the autho... | ai (gpt-4) | ai (gpt-4) |
| 2 | In the essay, the author discusses his initial... | [I couldn't have put this into words when I wa... | The two main influences that initially drew th... | ai (gpt-4) | ai (gpt-4) |
| 3 | The author mentions his shift of interest towa... | [I couldn't have put this into words when I wa... | The author shifted his interest towards Lisp a... | ai (gpt-4) | ai (gpt-4) |
| 4 | In the essay, the author mentions his interest... | [So I looked around to see what I could salvag... | The author in the essay is Paul Graham, who wa... | ai (gpt-4) | ai (gpt-4) |
With documents, you can build your own RAG pipeline, then make predictions and perform evaluations in order to compare against the baselines listed in the DatasetCard associated with this dataset on llamahub.ai.
Predictions
NOTE: The rest of this notebook demonstrates how to manually make predictions and run the subsequent evaluations, for illustrative purposes. Alternatively, you can use the RagEvaluatorPack, which handles both predicting and evaluating with a RAG system that you supply.
from llama_index.core import VectorStoreIndex

# a basic RAG pipeline, using default settings
index = VectorStoreIndex.from_documents(documents=documents)
query_engine = index.as_query_engine()
You can now create predictions and perform the evaluations manually, or download the PredictAndEvaluatePack to do all of this with a single line of code.
import nest_asyncio
nest_asyncio.apply()
# manually
prediction_dataset = await rag_dataset.amake_predictions_with(
    query_engine=query_engine, show_progress=True
)
100%|███████████████████████████████████████████████████████| 44/44 [00:08<00:00, 4.90it/s]
prediction_dataset.to_pandas()[:5]
| | response | contexts |
|---|---|---|
| 0 | The author mentions that the first computer he... | [What I Worked On\n\nFebruary 2021\n\nBefore c... |
| 1 | The author switched his major from philosophy ... | [I couldn't have put this into words when I wa... |
| 2 | The author mentions two main influences that i... | [I couldn't have put this into words when I wa... |
| 3 | The author mentions that he shifted his intere... | [So I looked around to see what I could salvag... |
| 4 | The author mentions his interest in both compu... | [What I Worked On\n\nFebruary 2021\n\nBefore c... |
Evaluation
Now that we have the predictions, we can evaluate them along two dimensions:

- The generated response: how well the predicted response matches the reference answer.
- The retrieved contexts: how well the contexts retrieved for the prediction match the reference contexts.
NOTE: For the retrieved contexts, we cannot use standard retrieval metrics such as hit rate and mean reciprocal rank, since doing so would require us to have the same index that was used to generate the ground-truth data. However, creating a LabelledRagDataset does not necessarily require an index at all. As such, we will use semantic similarity between the prediction's contexts and the reference contexts as our measure of goodness.
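To make that metric concrete, here is a minimal sketch of what such a similarity score boils down to: embed the predicted and reference context text and compare the two vectors with cosine similarity. The snippet assumes OpenAIEmbedding as the embedding model and uses placeholder strings for the two texts; the actual scoring later in this notebook is done by the SemanticSimilarityEvaluator.

import numpy as np

from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding()

# placeholder texts standing in for the joined prediction.contexts
# and example.reference_contexts
predicted_vec = embed_model.get_text_embedding("predicted context text")
reference_vec = embed_model.get_text_embedding("reference context text")

# cosine similarity between the two embedding vectors
score = np.dot(predicted_vec, reference_vec) / (
    np.linalg.norm(predicted_vec) * np.linalg.norm(reference_vec)
)
print(score)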
import tqdm
For evaluating the response, we will use the LLM-As-A-Judge pattern. Specifically, we will use the CorrectnessEvaluator, FaithfulnessEvaluator, and RelevancyEvaluator.

For evaluating the goodness of the retrieved contexts, we will use the SemanticSimilarityEvaluator.
# instantiate the gpt-4 judges
from llama_index.llms.openai import OpenAI
from llama_index.core.evaluation import (
    CorrectnessEvaluator,
    FaithfulnessEvaluator,
    RelevancyEvaluator,
    SemanticSimilarityEvaluator,
)

judges = {}

judges["correctness"] = CorrectnessEvaluator(
    llm=OpenAI(temperature=0, model="gpt-4"),
)

judges["relevancy"] = RelevancyEvaluator(
    llm=OpenAI(temperature=0, model="gpt-4"),
)

judges["faithfulness"] = FaithfulnessEvaluator(
    llm=OpenAI(temperature=0, model="gpt-4"),
)

judges["semantic_similarity"] = SemanticSimilarityEvaluator()
Loop over the (labelled_example, prediction) pairs and perform the evaluations on each of them individually.
evals = {
"correctness": [],
"relevancy": [],
"faithfulness": [],
"context_similarity": [],
}
for example, prediction in tqdm.tqdm(
zip(rag_dataset.examples, prediction_dataset.predictions)
):
correctness_result = judges["correctness"].evaluate(
query=example.query,
response=prediction.response,
reference=example.reference_answer,
)
relevancy_result = judges["relevancy"].evaluate(
query=example.query,
response=prediction.response,
contexts=prediction.contexts,
)
faithfulness_result = judges["faithfulness"].evaluate(
query=example.query,
response=prediction.response,
contexts=prediction.contexts,
)
semantic_similarity_result = judges["semantic_similarity"].evaluate(
query=example.query,
response="\n".join(prediction.contexts),
reference="\n".join(example.reference_contexts),
)
evals["correctness"].append(correctness_result)
evals["relevancy"].append(relevancy_result)
evals["faithfulness"].append(faithfulness_result)
evals["context_similarity"].append(semantic_similarity_result)
44it [07:15, 9.90s/it]
import json

# saving evaluations
evaluations_objects = {
    "context_similarity": [e.dict() for e in evals["context_similarity"]],
    "correctness": [e.dict() for e in evals["correctness"]],
    "faithfulness": [e.dict() for e in evals["faithfulness"]],
    "relevancy": [e.dict() for e in evals["relevancy"]],
}

with open("evaluations.json", "w") as json_file:
    json.dump(evaluations_objects, json_file)
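If you want to resume the analysis in a later session, the saved results can be loaded back from disk. Below is a minimal sketch under the assumption that EvaluationResult (the pydantic model returned by the evaluators) can be reconstructed directly from its dict representation; the file name matches the evaluations.json written above.

import json

from llama_index.core.evaluation import EvaluationResult

# reload the serialized evaluation results written above
with open("evaluations.json", "r") as json_file:
    evaluations_objects = json.load(json_file)

# rebuild EvaluationResult objects from their dict representations
evals = {
    key: [EvaluationResult(**result) for result in results]
    for key, results in evaluations_objects.items()
}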
Now, we can use our notebook utility functions to view these evaluation results.
import pandas as pd
from llama_index.core.evaluation.notebook_utils import get_eval_results_df
deep_eval_df, mean_correctness_df = get_eval_results_df(
["base_rag"] * len(evals["correctness"]),
evals["correctness"],
metric="correctness",
)
deep_eval_df, mean_relevancy_df = get_eval_results_df(
["base_rag"] * len(evals["relevancy"]),
evals["relevancy"],
metric="relevancy",
)
_, mean_faithfulness_df = get_eval_results_df(
["base_rag"] * len(evals["faithfulness"]),
evals["faithfulness"],
metric="faithfulness",
)
_, mean_context_similarity_df = get_eval_results_df(
["base_rag"] * len(evals["context_similarity"]),
evals["context_similarity"],
metric="context_similarity",
)
mean_scores_df = pd.concat(
[
mean_correctness_df.reset_index(),
mean_relevancy_df.reset_index(),
mean_faithfulness_df.reset_index(),
mean_context_similarity_df.reset_index(),
],
axis=0,
ignore_index=True,
)
mean_scores_df = mean_scores_df.set_index("index")
mean_scores_df.index = mean_scores_df.index.set_names(["metrics"])
mean_scores_df
rag | base_rag |
---|---|
metrics | |
mean_correctness_score | 4.238636 |
mean_relevancy_score | 0.977273 |
mean_faithfulness_score | 0.977273 |
mean_context_similarity_score | 0.933568 |
From this example, we can see that the basic RAG pipeline performs quite well against the evaluation benchmark (rag_dataset)! For completeness, to perform the above steps using the RagEvaluatorPack instead, use the code provided below:
from llama_index.core.llama_pack import download_llama_pack

RagEvaluatorPack = download_llama_pack("RagEvaluatorPack", "./pack")
rag_evaluator_pack = RagEvaluatorPack(
    query_engine=query_engine, rag_dataset=rag_dataset, show_progress=True
)

############################################################################
# NOTE: If you have a lower tier subscription for the OpenAI API, such as #
# Usage Tier 1, then you will need to use a different batch_size and      #
# sleep_time_in_seconds. For Usage Tier 1, settings that seem to work     #
# well are batch_size=5 and sleep_time_in_seconds=15 (as of December      #
# 2023).                                                                   #
############################################################################

benchmark_df = await rag_evaluator_pack.arun(
    batch_size=20,  # batches the number of openai api calls to make
    sleep_time_in_seconds=1,  # seconds to sleep before making an api call
)