Retrieval Evaluation¶
This notebook uses our RetrieverEvaluator to evaluate the quality of any Retriever module defined in LlamaIndex.
We specify a set of different evaluation metrics, including hit rate and MRR. For any given question, these metrics compare the quality of the retrieved results against the ground-truth context.
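For intuition, here is a minimal sketch of how these two metrics behave for a single query. The variable names below are illustrative only, not LlamaIndex APIs.
# Hypothetical example: hit rate and MRR for one query.
retrieved_ids = ["node_3", "node_7"]  # retriever results, in ranked order
expected_ids = {"node_7"}  # ground-truth context for the query

# Hit rate: 1.0 if any expected id appears in the retrieved list, else 0.0.
hit_rate = float(any(rid in expected_ids for rid in retrieved_ids))

# MRR: reciprocal rank of the first relevant result (0.0 if none is found).
mrr = next(
    (1.0 / (rank + 1) for rank, rid in enumerate(retrieved_ids) if rid in expected_ids),
    0.0,
)
print(hit_rate, mrr)  # 1.0 0.5 -- the relevant chunk was retrieved at rank 2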
To ease the burden of creating an evaluation dataset from scratch, we can rely on synthetic data generation.
Setup¶
Here we load data (the Paul Graham essay) and parse it into nodes. We then index this data with a simple vector index and get a retriever.
%pip install llama-index-llms-openai
import nest_asyncio
nest_asyncio.apply()
from llama_index.core.evaluation import generate_question_context_pairs
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.openai import OpenAI
Download Data
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
node_parser = SentenceSplitter(chunk_size=512)
nodes = node_parser.get_nodes_from_documents(documents)
# By default, node ids are set to random uuids. To ensure the same ids on each run, we set them manually.
for idx, node in enumerate(nodes):
    node.id_ = f"node_{idx}"
llm = OpenAI(model="gpt-4")
vector_index = VectorStoreIndex(nodes)
retriever = vector_index.as_retriever(similarity_top_k=2)
Try out Retrieval¶
We'll try out retrieval over this simple dataset.
retrieved_nodes = retriever.retrieve("What did the author do growing up?")
from llama_index.core.response.notebook_utils import display_source_node
for node in retrieved_nodes:
    display_source_node(node, source_length=1000)
Node ID: node_0
Similarity: 0.8181379514114543
Text: What I Worked On
February 2021
Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.
The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.
The language we used was an early version of Fortran. You had to type programs on punch cards, then stack them in ...
Node ID: node_52
Similarity: 0.8143530600618721
Text: It felt like I was doing life right. I remember that because I was slightly dismayed at how novel it felt. The good news is that I had more moments like this over the next few years.
In the summer of 2016 we moved to England. We wanted our kids to see what it was like living in another country, and since I was a British citizen by birth, that seemed the obvious choice. We only meant to stay for a year, but we liked it so much that we still live there. So most of Bel was written in England.
In the fall of 2019, Bel was finally finished. Like McCarthy's original Lisp, it's a spec rather than an implementation, although like McCarthy's Lisp it's a spec expressed as code.
Now that I could write essays again, I wrote a bunch about topics I'd had stacked up. I kept writing essays through 2020, but I also started to think about other things I could work on. How should I choose what to do? Well, how had I chosen what to work on in the past? I wrote an essay for myself to answer that ques...
Build an Evaluation Dataset of (query, context) Pairs¶
Here we build a simple evaluation dataset over the existing text corpus.
We use our generate_question_context_pairs function to generate a set of (question, context) pairs over a given unstructured text corpus. This uses the LLM to auto-generate questions from each context chunk.
We get back an EmbeddingQAFinetuneDataset object. At a high level, this contains a set of ids mapping to queries and relevant doc chunks, as well as the corpus itself (see the sketch after the sample query output below).
from llama_index.core.evaluation import (
    generate_question_context_pairs,
    EmbeddingQAFinetuneDataset,
)
qa_dataset = generate_question_context_pairs(
    nodes, llm=llm, num_questions_per_chunk=2
)
queries = qa_dataset.queries.values()
print(list(queries)[2])
"Describe the transition from using the IBM 1401 to microcomputers, as mentioned in the text. What were the key differences in terms of user interaction and programming capabilities?"
# [optional] save
qa_dataset.save_json("pg_eval_dataset.json")
# [optional] load
qa_dataset = EmbeddingQAFinetuneDataset.from_json("pg_eval_dataset.json")
Retrieval Evaluation with RetrieverEvaluator¶
We're now ready to run our retrieval evals. We'll run our RetrieverEvaluator over the evaluation dataset that we generated.
Below, we define a display_results helper that summarizes the results of running our retriever over the dataset.
include_cohere_rerank = True
if include_cohere_rerank:
    !pip install cohere -q
from llama_index.core.evaluation import RetrieverEvaluator
metrics = ["mrr", "hit_rate"]
if include_cohere_rerank:
    metrics.append(
        "cohere_rerank_relevancy"  # requires COHERE_API_KEY environment variable to be set
    )
retriever_evaluator = RetrieverEvaluator.from_metric_names(
    metrics, retriever=retriever
)
# try it out on a sample query
sample_id, sample_query = list(qa_dataset.queries.items())[0]
sample_expected = qa_dataset.relevant_docs[sample_id]
eval_result = retriever_evaluator.evaluate(sample_query, sample_expected)
print(eval_result)
Query: In the context provided, the author describes his early experiences with programming on an IBM 1401. Based on his description, what were some of the limitations and challenges he faced while trying to write programs on this machine? Metrics: {'mrr': 1.0, 'hit_rate': 1.0, 'cohere_rerank_relevancy': 0.99620515}
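Each RetrievalEvalResult also exposes its metrics as a plain dict via metric_vals_dict, which is what the display_results helper below relies on.
# metrics as a plain dict, e.g. {'mrr': 1.0, 'hit_rate': 1.0, ...}
print(eval_result.metric_vals_dict)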
# try it out on an entire dataset
eval_results = await retriever_evaluator.evaluate_dataset(qa_dataset)
import pandas as pd
def display_results(name, eval_results):
    """Display results from evaluate."""

    metric_dicts = []
    for eval_result in eval_results:
        metric_dict = eval_result.metric_vals_dict
        metric_dicts.append(metric_dict)

    full_df = pd.DataFrame(metric_dicts)

    hit_rate = full_df["hit_rate"].mean()
    mrr = full_df["mrr"].mean()

    columns = {"retrievers": [name], "hit_rate": [hit_rate], "mrr": [mrr]}

    if include_cohere_rerank:
        crr_relevancy = full_df["cohere_rerank_relevancy"].mean()
        columns.update({"cohere_rerank_relevancy": [crr_relevancy]})

    metric_df = pd.DataFrame(columns)

    return metric_df
display_results("top-2 eval", eval_results)
| | retrievers | hit_rate | mrr | cohere_rerank_relevancy |
|---|---|---|---|---|
| 0 | top-2 eval | 0.801724 | 0.685345 | 0.946009 |
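As a usage note, the same helper makes it easy to compare retriever configurations side by side. A sketch (assuming the variables defined above are still in scope) comparing against a top-5 retriever:
# Hypothetical comparison: evaluate a top-5 retriever on the same dataset.
retriever_top5 = vector_index.as_retriever(similarity_top_k=5)
evaluator_top5 = RetrieverEvaluator.from_metric_names(
    metrics, retriever=retriever_top5
)
eval_results_top5 = await evaluator_top5.evaluate_dataset(qa_dataset)
display_results("top-5 eval", eval_results_top5)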