Tonic Validate Evaluators¶
This notebook has some basic usage examples showing how to use Tonic Validate's RAG metrics with LlamaIndex. To use these evaluators, you need to have tonic_validate installed, which you can install via pip install tonic-validate.
%pip install llama-index-evaluation-tonic-validate
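Tonic Validate scores its metrics with an LLM (OpenAI models by default), so an OpenAI API key needs to be available before running the evaluators. A minimal setup sketch, assuming the key is supplied via the standard OPENAI_API_KEY environment variable:
import os

# Tonic Validate uses OpenAI under the hood to score its metrics, so the
# OPENAI_API_KEY environment variable must be set before evaluating.
os.environ["OPENAI_API_KEY"] = "sk-..."  # replace with your own key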
import json
import pandas as pd
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.evaluation.tonic_validate import (
AnswerConsistencyEvaluator,
AnswerSimilarityEvaluator,
AugmentationAccuracyEvaluator,
AugmentationPrecisionEvaluator,
RetrievalPrecisionEvaluator,
TonicValidateEvaluator,
)
Single Question Example¶
In this example, we have a single question where the reference correct answer does not match the LLM's response. There are two retrieved context chunks, one of which contains the correct answer.
question = "What makes Sam Altman a good founder?"
reference_answer = "He is smart and has a great force of will."
llm_answer = "He is a good founder because he is smart."
retrieved_context_list = [
"Sam Altman is a good founder. He is very smart.",
"What makes Sam Altman such a good founder is his great force of will.",
]
The answer similarity score is a score between 0 and 5 that measures how well the LLM answer matches the reference answer. In this case, the two answers do not match perfectly, so the answer similarity score is not a perfect 5.
answer_similarity_evaluator = AnswerSimilarityEvaluator()
score = await answer_similarity_evaluator.aevaluate(
question,
llm_answer,
retrieved_context_list,
reference_response=reference_answer,
)
score
EvaluationResult(query='What makes Sam Altman a good founder?', contexts=['Sam Altman is a good founder. He is very smart.', 'What makes Sam Altman such a good founder is his great force of will.'], response='He is a good founder because he is smart.', passing=None, feedback=None, score=4.0, pairwise_source=None, invalid_result=False, invalid_reason=None)
The answer consistency score is between 0.0 and 1.0 and measures whether the answer contains any information that does not appear in the retrieved context. In this case, the answer's information does appear in the retrieved context, so the score is 1.
answer_consistency_evaluator = AnswerConsistencyEvaluator()
score = await answer_consistency_evaluator.aevaluate(
question, llm_answer, retrieved_context_list
)
score
EvaluationResult(query='What makes Sam Altman a good founder?', contexts=['Sam Altman is a good founder. He is very smart.', 'What makes Sam Altman such a good founder is his great force of will.'], response='He is a good founder because he is smart.', passing=None, feedback=None, score=1.0, pairwise_source=None, invalid_result=False, invalid_reason=None)
Augmentation accuracy measures the percentage of the retrieved context that appears in the answer. In this case, one of the two retrieved contexts appears in the answer, so this score is 0.5.
augmentation_accuracy_evaluator = AugmentationAccuracyEvaluator()
score = await augmentation_accuracy_evaluator.aevaluate(
question, llm_answer, retrieved_context_list
)
score
EvaluationResult(query='What makes Sam Altman a good founder?', contexts=['Sam Altman is a good founder. He is very smart.', 'What makes Sam Altman such a good founder is his great force of will.'], response='He is a good founder because he is smart.', passing=None, feedback=None, score=0.5, pairwise_source=None, invalid_result=False, invalid_reason=None)
Augmentation precision measures whether the relevant retrieved context is included in the answer. Both retrieved contexts are relevant, but only one is included in the answer, so this score is 0.5.
augmentation_precision_evaluator = AugmentationPrecisionEvaluator()
score = await augmentation_precision_evaluator.aevaluate(
question, llm_answer, retrieved_context_list
)
score
EvaluationResult(query='What makes Sam Altman a good founder?', contexts=['Sam Altman is a good founder. He is very smart.', 'What makes Sam Altman such a good founder is his great force of will.'], response='He is a good founder because he is smart.', passing=None, feedback=None, score=0.5, pairwise_source=None, invalid_result=False, invalid_reason=None)
Retrieval precision measures the percentage of the retrieved context that is relevant to answering the question. In this case, both retrieved contexts are relevant to answering the question, so the score is 1.0.
retrieval_precision_evaluator = RetrievalPrecisionEvaluator()
score = await retrieval_precision_evaluator.aevaluate(
question, llm_answer, retrieved_context_list
)
score
EvaluationResult(query='What makes Sam Altman a good founder?', contexts=['Sam Altman is a good founder. He is very smart.', 'What makes Sam Altman such a good founder is his great force of will.'], response='He is a good founder because he is smart.', passing=None, feedback=None, score=1.0, pairwise_source=None, invalid_result=False, invalid_reason=None)
TonicValidateEvaluator can calculate all of Tonic Validate's metrics at once.
tonic_validate_evaluator = TonicValidateEvaluator()
scores = await tonic_validate_evaluator.aevaluate(
question,
llm_answer,
retrieved_context_list,
reference_response=reference_answer,
)
scores.score_dict
{'answer_consistency': 1.0, 'answer_similarity': 4.0, 'augmentation_accuracy': 0.5, 'augmentation_precision': 0.5, 'retrieval_precision': 1.0}
You can also use TonicValidateEvaluator to evaluate many queries and responses at once and return a tonic_validate Run object that can be logged to the Tonic Validate UI (validate.tonic.ai).
To do this, you put the questions, LLM answers, retrieved context lists, and reference answers into lists and call evaluate_run (here, its async variant aevaluate_run).
tonic_validate_evaluator = TonicValidateEvaluator()
scores = await tonic_validate_evaluator.aevaluate_run(
[question], [llm_answer], [retrieved_context_list], [reference_answer]
)
scores.run_data[0].scores
{'answer_consistency': 1.0, 'answer_similarity': 3.0, 'augmentation_accuracy': 0.5, 'augmentation_precision': 0.5, 'retrieval_precision': 1.0}
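The Run object returned by aevaluate_run is what gets logged to the Tonic Validate UI. A hedged sketch of that upload using tonic_validate's ValidateApi, where the API key and project ID are placeholders you would create at validate.tonic.ai:
from tonic_validate import ValidateApi

# Placeholder credentials; generate an API key and create a project in the
# Tonic Validate UI (validate.tonic.ai) to get real values.
validate_api = ValidateApi("your-tonic-validate-api-key")
validate_api.upload_run("your-project-id", scores)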
Labelled RAG Dataset Example¶
Let's use the EvaluatingLlmSurveyPaperDataset and evaluate the default LlamaIndex RAG system using Tonic Validate's answer similarity score. EvaluatingLlmSurveyPaperDataset is a LabelledRagDataset, so it contains a reference correct answer for each question. The dataset contains 276 questions and reference answers about the paper "Evaluating Large Language Models: A Comprehensive Survey".
We will use TonicValidateEvaluator with the answer similarity score metric to evaluate the default RAG system's responses on this dataset.
!llamaindex-cli download-llamadataset EvaluatingLlmSurveyPaperDataset --download-dir ./data
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00, 2.09it/s]
Successfully downloaded EvaluatingLlmSurveyPaperDataset to ./data
from llama_index.core import SimpleDirectoryReader
from llama_index.core.llama_dataset import LabelledRagDataset
from llama_index.core import VectorStoreIndex
rag_dataset = LabelledRagDataset.from_json("./data/rag_dataset.json")
documents = SimpleDirectoryReader(input_dir="./data/source_files").load_data(
num_workers=4
)  # load data in parallel
index = VectorStoreIndex.from_documents(documents=documents)
query_engine = index.as_query_engine()
predictions_dataset = rag_dataset.make_predictions_with(query_engine)
questions, retrieved_context_lists, reference_answers, llm_answers = zip(
*[
(e.query, e.reference_contexts, e.reference_answer, p.response)
for e, p in zip(rag_dataset.examples, predictions_dataset.predictions)
]
)
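As a quick sanity check before scoring (an illustrative snippet, not part of the original notebook), you can confirm that the dataset examples and the predictions line up:
# Inspect the first aligned example: the question, its reference answer,
# and the RAG system's answer should all refer to the same query.
print(questions[0])
print(reference_answers[0])
print(llm_answers[0])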
from tonic_validate.metrics import AnswerSimilarityMetric
tonic_validate_evaluator = TonicValidateEvaluator(
metrics=[AnswerSimilarityMetric()], model_evaluator="gpt-4-1106-preview"
)
scores = await tonic_validate_evaluator.aevaluate_run(
    questions, llm_answers, retrieved_context_lists, reference_answers
)
overall_scores gives the average scores over the 276 questions in the dataset.
scores.overall_scores
{'answer_similarity': 2.2644927536231885}
Using pandas and matplotlib, we can plot a histogram of the similarity scores.
import matplotlib.pyplot as plt
import pandas as pd
score_list = [x.scores["answer_similarity"] for x in scores.run_data]
value_counts = pd.Series(score_list).value_counts()
fig, ax = plt.subplots()
ax.bar(list(value_counts.index), list(value_counts))
ax.set_title("Answer Similarity Score Value Counts")
plt.show()
Since 0 is the most common score, there is plenty of room for improvement. This makes sense, as we are using the default parameters. We could improve these results by tuning the many possible RAG parameters to optimize this score.
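For instance, one simple parameter to experiment with is how many context chunks the query engine retrieves per question. A hypothetical tuning sketch (the value of similarity_top_k is illustrative, not a recommendation):
# Retrieve more context chunks per query than the default, then rebuild the
# predictions and re-run the evaluation to see if answer similarity improves.
query_engine = index.as_query_engine(similarity_top_k=5)
predictions_dataset = rag_dataset.make_predictions_with(query_engine)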