Downloading a LlamaDataset from LlamaHub

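You can browse the available benchmark datasets at llamahub.ai. This notebook guide shows how to download a dataset together with its source text documents. Specifically, download_llama_dataset downloads the evaluation dataset (a LabelledRagDataset) as well as the Document objects for the source text files that were originally used to build it.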

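Finally, this notebook also demonstrates the end-to-end workflow: download an evaluation dataset, make predictions on it with your own RAG pipeline (query engine), and then evaluate those predictions.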

In [ ]:

%pip install llama-index-llms-openai

In [ ]:

from llama_index.core.llama_dataset import download_llama_dataset

# download and install dependencies
rag_dataset, documents = download_llama_dataset(
    "PaulGrahamEssayDataset", "./paul_graham"
)

github url: https://raw.githubusercontent.com/nerdai/llama-hub/datasets/llama_hub/llama_datasets/library.json
github url: https://media.githubusercontent.com/media/run-llama/llama_datasets/main/llama_datasets/paul_graham_essay/rag_dataset.json
github url: https://media.githubusercontent.com/media/run-llama/llama_datasets/main/llama_datasets/paul_graham_essay/source.txt

In [ ]:

rag_dataset.to_pandas()[:5]

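Out[ ]:

	query	reference_contexts	reference_answer	reference_answer_by	query_by
0	In the essay, the author mentions his early ...	[What I Worked On\n\nFebruary 2021\n\nBefore ...	The first computer the author used for programming ...	ai (gpt-4)	ai (gpt-4)
1	The author switched his major from philosophy to ...	[What I Worked On\n\nFebruary 2021\n\nBefore c...	The two specific influences that led the author to make this switch were ...	ai (gpt-4)	ai (gpt-4)
2	In the essay, the author discusses his initial ...	[I couldn't put this into words when I was young ...	The two main influences that initially drew the author ...	ai (gpt-4)	ai (gpt-4)
3	The author mentions that his interests shifted to ...	[I couldn't put this into words when I was young ...	The author shifted his interests to Lisp and ...	ai (gpt-4)	ai (gpt-4)
4	In the essay, the author mentions his interest in ...	[So I looked around to see what I could salvage ...	The author of the essay is Paul Graham, who ...	ai (gpt-4)	ai (gpt-4)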

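With documents, you can build your own RAG pipeline, make predictions, and run evaluations, then compare the results against the benchmarks listed in the DatasetCard associated with the dataset on llamahub.ai.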

Predictions

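Note: the rest of this notebook shows how to make predictions and run the subsequent evaluation manually, for illustration only. Alternatively, you can use the RagEvaluatorPack, which handles both prediction and evaluation for the RAG system you provide.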

In [ ]:

from llama_index.core import VectorStoreIndex

# a basic RAG pipeline, uses defaults
index = VectorStoreIndex.from_documents(documents=documents)
query_engine = index.as_query_engine()

You can now create predictions and run the evaluation manually, or download the PredictAndEvaluatePack to do both in a single line of code.

In [ ]:

import nest_asyncio

nest_asyncio.apply()

In [ ]:

# manually
prediction_dataset = await rag_dataset.amake_predictions_with(
    query_engine=query_engine, show_progress=True
)

100%|███████████████████████████████████████████████████████| 44/44 [00:08<00:00,  4.90it/s]

In [ ]:

prediction_dataset.to_pandas()[:5]

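Out[ ]:

	response	contexts
0	The author mentions that the first computer he used ...	[What I Worked On\n\nFebruary 2021\n\nBefore ...
1	The author switched his major from philosophy to ...	[I couldn't put this into words when I was young ...
2	The author mentions two main influences ...	[I couldn't put this into words when I was young ...
3	The author mentions that he shifted his interests to ...	[So I looked around to see what I could salvage ...
4	The author mentions his interest in computers ...	[What I Worked On\n\nFebruary 2021\n\nBefore ...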
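
Conceptually, making predictions means running every query in the dataset through the query engine and recording the response text together with the retrieved contexts. The following is a minimal sketch of that idea only. `Prediction`, `StubQueryEngine`, and `make_predictions` are hypothetical stand-ins written for illustration; they are not LlamaIndex classes and do not reproduce how amake_predictions_with is implemented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Prediction:
    """Hypothetical container mirroring the two columns shown above."""

    response: str
    contexts: List[str]


class StubQueryEngine:
    """Hypothetical stand-in for a RAG query engine (no embeddings, no LLM)."""

    def __init__(self, corpus: List[str]) -> None:
        self.corpus = corpus

    def query(self, query: str) -> Prediction:
        # "Retrieve" every passage that shares at least one word with the query.
        words = set(query.lower().split())
        contexts = [p for p in self.corpus if words & set(p.lower().split())]
        # "Generate" a response by echoing the first retrieved passage.
        response = contexts[0] if contexts else "No relevant context found."
        return Prediction(response=response, contexts=contexts)


def make_predictions(queries: List[str], engine: StubQueryEngine) -> List[Prediction]:
    """Run each query through the engine, keeping results in query order."""
    return [engine.query(q) for q in queries]


corpus = ["the author wrote short stories", "lisp was used for the startup"]
engine = StubQueryEngine(corpus)
predictions = make_predictions(["author stories", "why lisp"], engine)
```

Keeping the predictions in the same order as the queries matters, because the evaluation later pairs each labelled example with its prediction by position.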

Evaluation

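Now that we have predictions, we can evaluate them along two dimensions:

The generated response: how well the predicted response matches the reference answer.
The retrieved contexts: how well the contexts retrieved at prediction time match the reference contexts.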

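Note: for the retrieved contexts, we cannot use standard retrieval metrics such as hit rate and mean reciprocal rank, because computing them requires the same index that was used to generate the ground-truth data. A LabelledRagDataset, however, need not be created from an index at all. We therefore use the semantic similarity between the predicted contexts and the reference contexts as the measure of goodness.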
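
To make the idea of semantic similarity concrete: both texts are mapped to embedding vectors and the vectors are then compared, commonly with cosine similarity. The sketch below assumes cosine similarity and uses a toy bag-of-words `embed` function as a hypothetical stand-in for a real embedding model, so the score it produces is illustrative only.

```python
import math
from collections import Counter
from typing import Dict


def embed(text: str) -> Dict[str, float]:
    """Toy embedding: word counts (hypothetical stand-in for a real model)."""
    return {word: float(count) for word, count in Counter(text.lower().split()).items()}


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up context lists; each list is joined into one string before comparison.
predicted_contexts = ["the cat sat on the mat", "it was a sunny day"]
reference_contexts = ["the cat lay on the mat", "it was a sunny morning"]

score = cosine_similarity(
    embed("\n".join(predicted_contexts)),
    embed("\n".join(reference_contexts)),
)
```

Unlike hit rate or mean reciprocal rank, this comparison needs only the text of the contexts, not shared node identifiers from a common index.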

In [ ]:

import tqdm

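To evaluate the responses, we use the LLM-as-a-judge pattern. Specifically, we use three evaluators: CorrectnessEvaluator, FaithfulnessEvaluator, and RelevancyEvaluator.

To evaluate the goodness of the retrieved contexts, we use SemanticSimilarityEvaluator.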

In [ ]:

# instantiate the gpt-4 judge
from llama_index.llms.openai import OpenAI
from llama_index.core.evaluation import (
    CorrectnessEvaluator,
    FaithfulnessEvaluator,
    RelevancyEvaluator,
    SemanticSimilarityEvaluator,
)

judges = {}

judges["correctness"] = CorrectnessEvaluator(
    llm=OpenAI(temperature=0, model="gpt-4"),
)

judges["relevancy"] = RelevancyEvaluator(
    llm=OpenAI(temperature=0, model="gpt-4"),
)

judges["faithfulness"] = FaithfulnessEvaluator(
    llm=OpenAI(temperature=0, model="gpt-4"),
)

judges["semantic_similarity"] = SemanticSimilarityEvaluator()

Loop through the (labelled_example, prediction) pairs and evaluate each pair individually.

In [ ]:

evals = {
    "correctness": [],
    "relevancy": [],
    "faithfulness": [],
    "context_similarity": [],
}

for example, prediction in tqdm.tqdm(
    zip(rag_dataset.examples, prediction_dataset.predictions)
):
    correctness_result = judges["correctness"].evaluate(
        query=example.query,
        response=prediction.response,
        reference=example.reference_answer,
    )

    relevancy_result = judges["relevancy"].evaluate(
        query=example.query,
        response=prediction.response,
        contexts=prediction.contexts,
    )

    faithfulness_result = judges["faithfulness"].evaluate(
        query=example.query,
        response=prediction.response,
        contexts=prediction.contexts,
    )

    semantic_similarity_result = judges["semantic_similarity"].evaluate(
        query=example.query,
        response="\n".join(prediction.contexts),
        reference="\n".join(example.reference_contexts),
    )

    evals["correctness"].append(correctness_result)
    evals["relevancy"].append(relevancy_result)
    evals["faithfulness"].append(faithfulness_result)
    evals["context_similarity"].append(semantic_similarity_result)

44it [07:15,  9.90s/it]
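
The loop above pairs examples with predictions by position, and `zip` stops silently at the shorter input. A small guard can make a length mismatch fail loudly instead. `paired` below is a hypothetical helper, not part of LlamaIndex, and it assumes both inputs are sequences in the same order.

```python
from typing import Iterator, Sequence, Tuple, TypeVar

E = TypeVar("E")
P = TypeVar("P")


def paired(examples: Sequence[E], predictions: Sequence[P]) -> Iterator[Tuple[E, P]]:
    """Return an iterator of (example, prediction) pairs.

    Raises ValueError immediately if the two sequences differ in length,
    rather than silently dropping the unmatched tail as plain zip would.
    """
    if len(examples) != len(predictions):
        raise ValueError(
            f"{len(examples)} examples but {len(predictions)} predictions"
        )
    return zip(examples, predictions)


# Toy data standing in for dataset examples and predictions.
pairs = list(paired(["q1", "q2"], ["a1", "a2"]))
```

Because the lengths are known up front, the same value could also be passed to tqdm as `total=len(examples)` so that the progress bar shows a percentage rather than a bare iteration count.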

In [ ]:

import json

# saving evaluations
evaluations_objects = {
    "context_similarity": [e.dict() for e in evals["context_similarity"]],
    "correctness": [e.dict() for e in evals["correctness"]],
    "faithfulness": [e.dict() for e in evals["faithfulness"]],
    "relevancy": [e.dict() for e in evals["relevancy"]],
}

with open("evaluations.json", "w") as json_file:
    json.dump(evaluations_objects, json_file)
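
Once saved, the results can be reloaded without re-running the judges. The sketch below round-trips a small, made-up results dictionary through a temporary JSON file and averages the `score` field per metric. The numbers are placeholders rather than results from this notebook, `mean_scores` is a hypothetical helper, and the sketch assumes each saved entry carries a `score` key that is either numeric or `None`.

```python
import json
import os
import tempfile
from statistics import mean
from typing import Dict, List, Optional


def mean_scores(evaluations: Dict[str, List[dict]]) -> Dict[str, Optional[float]]:
    """Average the "score" field per metric, ignoring entries without a score."""
    summary: Dict[str, Optional[float]] = {}
    for metric, results in evaluations.items():
        scores = [r["score"] for r in results if r.get("score") is not None]
        summary[metric] = mean(scores) if scores else None
    return summary


# Placeholder data shaped like the saved dictionary (values are made up).
example_evaluations = {
    "correctness": [{"score": 4.0}, {"score": 5.0}],
    "faithfulness": [{"score": 1.0}, {"score": None}],
}

# Write to a temporary file and read it back, mimicking a later session.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "evaluations.json")
    with open(path, "w") as json_file:
        json.dump(example_evaluations, json_file)
    with open(path) as json_file:
        reloaded = json.load(json_file)

summary = mean_scores(reloaded)
```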

Now we can use the notebook utility functions to view these evaluations.

In [ ]:

import pandas as pd
from llama_index.core.evaluation.notebook_utils import get_eval_results_df

deep_eval_df, mean_correctness_df = get_eval_results_df(
    ["base_rag"] * len(evals["correctness"]),
    evals["correctness"],
    metric="correctness",
)
deep_eval_df, mean_relevancy_df = get_eval_results_df(
    ["base_rag"] * len(evals["relevancy"]),
    evals["relevancy"],
    metric="relevancy",
)
_, mean_faithfulness_df = get_eval_results_df(
    ["base_rag"] * len(evals["faithfulness"]),
    evals["faithfulness"],
    metric="faithfulness",
)
_, mean_context_similarity_df = get_eval_results_df(
    ["base_rag"] * len(evals["context_similarity"]),
    evals["context_similarity"],
    metric="context_similarity",
)

mean_scores_df = pd.concat(
    [
        mean_correctness_df.reset_index(),
        mean_relevancy_df.reset_index(),
        mean_faithfulness_df.reset_index(),
        mean_context_similarity_df.reset_index(),
    ],
    axis=0,
    ignore_index=True,
)
mean_scores_df = mean_scores_df.set_index("index")
mean_scores_df.index = mean_scores_df.index.set_names(["metrics"])

In [ ]:

mean_scores_df

Out[ ]:

rag	base_rag
metrics
mean_correctness_score	4.238636
mean_relevancy_score	0.977273
mean_faithfulness_score	0.977273
mean_context_similarity_score	0.933568

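In this simple example, the basic RAG pipeline performs fairly well against the evaluation benchmark (rag_dataset). For completeness, the code below carries out the same steps with the RagEvaluatorPack: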

In [ ]:

from llama_index.core.llama_pack import download_llama_pack

RagEvaluatorPack = download_llama_pack("RagEvaluatorPack", "./pack")
rag_evaluator = RagEvaluatorPack(
    query_engine=query_engine, rag_dataset=rag_dataset, show_progress=True
)

# NOTE: If you have a lower-tier OpenAI API subscription, such as Usage Tier 1,
# you will need a different batch_size and sleep_time_in_seconds.
# For Usage Tier 1, settings that seemed to work well were batch_size=5
# and sleep_time_in_seconds=15 (as of December 2023).

benchmark_df = await rag_evaluator.arun(
    batch_size=20,  # batches the number of openai api calls to make
    sleep_time_in_seconds=1,  # seconds to sleep before making an api call
)
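
The batch_size and sleep_time_in_seconds arguments help keep the volume of API calls under a rate limit. The general pattern can be sketched as follows. `run_in_batches` and `fake_api_call` are hypothetical helpers for illustration; they show one common way to batch and pause, and are not taken from the pack's internals.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def run_in_batches(
    items: Sequence[T],
    worker: Callable[[T], Awaitable[R]],
    batch_size: int,
    sleep_time_in_seconds: float,
) -> List[R]:
    """Process items concurrently within a batch, pausing between batches."""
    results: List[R] = []
    for start in range(0, len(items), batch_size):
        batch = items[start : start + batch_size]
        # Calls within one batch run concurrently; gather preserves input order.
        results.extend(await asyncio.gather(*(worker(item) for item in batch)))
        # Pause before the next batch, but not after the final one.
        if start + batch_size < len(items):
            await asyncio.sleep(sleep_time_in_seconds)
    return results


async def fake_api_call(n: int) -> int:
    """Hypothetical stand-in for a rate-limited API request."""
    await asyncio.sleep(0)
    return n * 2


# asyncio.run is used because this sketch is a plain script; inside a
# notebook cell you would instead write: await run_in_batches(...)
doubled = asyncio.run(
    run_in_batches(
        list(range(5)), fake_api_call, batch_size=2, sleep_time_in_seconds=0.01
    )
)
```

A smaller batch size and a longer sleep reduce the request rate at the cost of a longer total run time, which is the trade-off behind the Usage Tier 1 settings mentioned in the comment above.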