Evaluation Using the Prometheus Model¶
Evaluation is a crucial aspect of iterating over your RAG (Retrieval-Augmented Generation) pipeline. The process has relied heavily on GPT-4. However, a new open-source model named Prometheus has recently emerged as an alternative for evaluation.
In this notebook, we will demonstrate how to use the Prometheus model for evaluation and how to integrate it with the LlamaIndex abstractions.
If you are unfamiliar with the Prometheus model, you may find the paper summary prepared by Andrei insightful. Note that the model requires the scores to be included within the prompt for the evaluation to work effectively. For more details, you can refer to the specific prompts outlined in this notebook.
We will demonstrate correctness evaluation with the Prometheus model using two datasets from Llama Datasets. If you have not explored Llama Datasets yet, I recommend taking some time to read about them here.
- Paul Graham Essay
- Llama2
Note: We show the analysis with the original Prometheus model here. You can re-run the analysis with the quantized version of the model.¶
%pip install llama-index-llms-openai
%pip install llama-index-llms-huggingface
# attach to the same event loop
import nest_asyncio

nest_asyncio.apply()
Download Datasets¶
from llama_index.core.llama_dataset import download_llama_dataset
paul_graham_rag_dataset, paul_graham_documents = download_llama_dataset(
"PaulGrahamEssayDataset", "./data/paul_graham"
)
llama2_rag_dataset, llama2_documents = download_llama_dataset(
"Llama2PaperDataset", "./data/llama2"
)
Define the Prometheus LLM hosted on HuggingFace.¶
We hosted the model on an HF Inference Endpoint using an Nvidia A10G GPU.
from llama_index.llms.huggingface import HuggingFaceInferenceAPI
HF_TOKEN = "YOUR HF TOKEN"
HF_ENDPOINT_URL = (
"https://q3yljc2cypyrvw3i.us-east-1.aws.endpoints.huggingface.cloud"
)
prometheus_llm = HuggingFaceInferenceAPI(
model_name=HF_ENDPOINT_URL,
token=HF_TOKEN,
temperature=0.1,
do_sample=True,
top_p=0.95,
top_k=40,
repetition_penalty=1.1,
)
/opt/homebrew/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html from .autonotebook import tqdm as notebook_tqdm
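If you prefer not to use a hosted Inference Endpoint, you can also load the model locally through HuggingFaceLLM from the same llama-index-llms-huggingface package. The sketch below is only illustrative and not part of the original setup: it assumes you have a large enough GPU, and the checkpoint name kaist-ai/prometheus-13b-v1.0 is an assumption about which Prometheus weights you want to serve.
from llama_index.llms.huggingface import HuggingFaceLLM

# Illustrative local alternative to the Inference Endpoint above
# (assumed checkpoint name and sufficient GPU memory).
prometheus_llm_local = HuggingFaceLLM(
    model_name="kaist-ai/prometheus-13b-v1.0",
    tokenizer_name="kaist-ai/prometheus-13b-v1.0",
    max_new_tokens=512,
    generate_kwargs={"temperature": 0.1, "do_sample": True, "top_p": 0.95},
    device_map="auto",
)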
Prompt Templates¶
We will use the same prompt templates for both the Prometheus model and GPT-4 to keep the comparison consistent.
Correctness Eval Prompt¶
prometheus_correctness_eval_prompt_template = """###Task Description: An instruction (might include an Input inside it), a query, a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing the evaluation criteria are given.
1. Write a detailed feedback that assesses the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing the feedback, write a score that is either 1 or 2 or 3 or 4 or 5. You should refer to the score rubric.
3. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (1 or 2 or 3 or 4 or 5)"
4. Please do not generate any other opening, closing, or explanations.
5. Only evaluate on common things between the generated answer and the reference answer. Don't evaluate on things which are present in the reference answer but not in the generated answer.

###The instruction to evaluate: Your task is to evaluate the generated answer and reference answer for the query: {query}

###Generated answer to evaluate: {generated_answer}

###Reference Answer (Score 5): {reference_answer}

###Score Rubrics:
Score 1: If the generated answer is not relevant to the user query and the reference answer.
Score 2: If the generated answer is according to the reference answer but not relevant to the user query.
Score 3: If the generated answer is relevant to the user query and the reference answer but contains mistakes.
Score 4: If the generated answer is relevant to the user query and has the exact same metrics as the reference answer, but it is not as concise.
Score 5: If the generated answer is relevant to the user query and fully correct according to the reference answer.

###Feedback:"""
prometheus_correctness_eval_prompt_template = """###Task Description: An instruction (might include an Input inside it), a query, a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing the evaluation criteria are given.
1. Write a detailed feedback that assesses the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing the feedback, write a score that is either 1 or 2 or 3 or 4 or 5. You should refer to the score rubric.
3. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (1 or 2 or 3 or 4 or 5)"
4. Please do not generate any other opening, closing, or explanations.
5. Only evaluate on common things between the generated answer and the reference answer. Don't evaluate on things which are present in the reference answer but not in the generated answer.

###The instruction to evaluate: Your task is to evaluate the generated answer and reference answer for the query: {query}

###Generated answer to evaluate: {generated_answer}

###Reference Answer (Score 5): {reference_answer}

###Score Rubrics:
Score 1: If the generated answer is not relevant to the user query and the reference answer.
Score 2: If the generated answer is correct according to the reference answer but not relevant to the user query.
Score 3: If the generated answer is relevant to the user query and correct according to the reference answer but has some mistakes in facts.
Score 4: If the generated answer is relevant to the user query and has the exact same metrics and correctness as the reference answer, but it is not as concise.
Score 5: If the generated answer is relevant to the user query and fully correct according to the reference answer.

###Feedback:"""
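If you want to see the exact prompt the Prometheus model receives, you can render the template with some placeholder values. The snippet below is purely illustrative; the query and answers are made up.
print(
    prometheus_correctness_eval_prompt_template.format(
        query="What is the capital of France?",  # hypothetical query
        generated_answer="Paris is the capital of France.",  # hypothetical answer
        reference_answer="The capital of France is Paris.",  # hypothetical reference
    )
)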
Faithfulness Eval Prompt¶
prometheus_faithfulness_eval_prompt_template = """###Task Description: An instruction (might include an Input inside it), an information, a context, and a score rubric representing the evaluation criteria are given.
1. You are provided with an evaluation task with the help of the information and the context.
2. Write a detailed feedback based on the evaluation task and the given score rubric, not evaluating in general.
3. After writing the feedback, write a score that is YES or NO. You should refer to the score rubric.
4. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (YES or NO)"
5. Please do not generate any other opening, closing, or explanations.

###The instruction to evaluate: Your task is to evaluate if the given piece of information is supported by the context.

###Information: {query_str}

###Context: {context_str}

###Score Rubrics:
Score YES: If the given piece of information is supported by the context.
Score NO: If the given piece of information is not supported by the context.

###Feedback:"""

prometheus_faithfulness_refine_prompt_template = """###Task Description: An instruction (might include an Input inside it), an information, a context information, an existing answer, and a score rubric representing the evaluation criteria are given.
1. You are provided with an evaluation task with the help of the information, the context information, and the existing answer.
2. Write a detailed feedback based on the evaluation task and the given score rubric, not evaluating in general.
3. After writing the feedback, write a score that is YES or NO. You should refer to the score rubric.
4. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (YES or NO)"
5. Please do not generate any other opening, closing, or explanations.

###The instruction to evaluate: If the information is present in the context and also provided with an existing answer.

###Existing answer: {existing_answer}

###Information: {query_str}

###Context: {context_msg}

###Score Rubrics:
Score YES: If the existing answer is already YES or if the information is present in the context.
Score NO: If the existing answer is NO and the information is not present in the context.

###Feedback:"""
Relevancy Eval Prompt¶
prometheus_relevancy_eval_prompt_template = """###Task Description: An instruction (might include an Input inside it), a query with response, a context, and a score rubric representing the evaluation criteria are given.
1. You are provided with an evaluation task with the help of the query with response and the context.
2. Write a detailed feedback based on the evaluation task and the given score rubric, not evaluating in general.
3. After writing the feedback, write a score that is YES or NO. You should refer to the score rubric.
4. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (YES or NO)"
5. Please do not generate any other opening, closing, or explanations.

###The instruction to evaluate: Your task is to evaluate if the response for the query is in line with the context information provided.

###Query and Response: {query_str}

###Context: {context_str}

###Score Rubrics:
Score YES: If the response for the query is in line with the context information provided.
Score NO: If the response for the query is not in line with the context information provided.

###Feedback:"""

prometheus_relevancy_refine_prompt_template = """###Task Description: An instruction (might include an Input inside it), a query with response, a context, an existing answer, and a score rubric representing the evaluation criteria are given.
1. You are provided with an evaluation task with the help of the query with response, the context, and the existing answer.
2. Write a detailed feedback based on the evaluation task and the given score rubric, not evaluating in general.
3. After writing the feedback, write a score that is YES or NO. You should refer to the score rubric.
4. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (YES or NO)"
5. Please do not generate any other opening, closing, or explanations.

###The instruction to evaluate: Your task is to evaluate if the response for the query is in line with the context information provided.

###Query and Response: {query_str}

###Context: {context_str}

###Score Rubrics:
Score YES: If the existing answer is already YES or if the response for the query is in line with the context information provided.
Score NO: If the existing answer is NO and the response for the query is in line with the context information provided.

###Feedback:"""
Set the OpenAI API key for indexing:
- Open the OpenAI website and log in to your account.
- Go to the API key management page.
- Create a new API key or use an existing one.
- Copy the API key into the appropriate place in your Python file.
- Save the file and re-run to make sure the key is set.
import os
os.environ["OPENAI_API_KEY"] = "YOUR OPENAI API KEY"
from llama_index.llms.openai import OpenAI
gpt4_llm = OpenAI("gpt-4")
Define the parser function¶
It will be used with the CorrectnessEvaluator.
from typing import Tuple
import re


def parser_function(output_str: str) -> Tuple[float, str]:
    # Pattern to match the feedback and the result
    # This pattern looks for any text ending with '[RESULT]' followed by a number
    pattern = r"(.+?) \[RESULT\] (\d)"

    # Using regex to find all matches
    matches = re.findall(pattern, output_str)

    # Check if any match is found
    if matches:
        # Assuming there is only one match in the text, extract feedback and score
        feedback, score = matches[0]
        score = float(score.strip()) if score is not None else score
        return score, feedback.strip()
    else:
        return None, None
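As a quick sanity check, here is what the parser returns on a hypothetical Prometheus-style output string. The string below is made up and only illustrates the expected "Feedback: ... [RESULT] <score>" format.
sample_output = (
    "Feedback: The generated answer is relevant to the user query but misses "
    "some details present in the reference answer. [RESULT] 3"
)

score, feedback = parser_function(sample_output)
print(score)  # 3.0
print(feedback)  # the feedback text preceding [RESULT]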
Define the Correctness, Faithfulness, and Relevancy Evaluators¶
from llama_index.core.evaluation import (
    CorrectnessEvaluator,
    FaithfulnessEvaluator,
    RelevancyEvaluator,
)
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler
import tiktoken

# CorrectnessEvaluator with the Prometheus model
prometheus_correctness_evaluator = CorrectnessEvaluator(
    llm=prometheus_llm,
    parser_function=parser_function,
    eval_template=prometheus_correctness_eval_prompt_template,
)

# FaithfulnessEvaluator with the Prometheus model
prometheus_faithfulness_evaluator = FaithfulnessEvaluator(
    llm=prometheus_llm,
    eval_template=prometheus_faithfulness_eval_prompt_template,
    refine_template=prometheus_faithfulness_refine_prompt_template,
)

# RelevancyEvaluator with the Prometheus model
prometheus_relevancy_evaluator = RelevancyEvaluator(
    llm=prometheus_llm,
    eval_template=prometheus_relevancy_eval_prompt_template,
    refine_template=prometheus_relevancy_refine_prompt_template,
)

# Set the encoding model to `gpt-4` for token counting
token_counter = TokenCountingHandler(
    tokenizer=tiktoken.encoding_for_model("gpt-4").encode
)

callback_manager = CallbackManager([token_counter])
gpt4_llm.callback_manager = callback_manager

# CorrectnessEvaluator with the GPT-4 model
gpt4_correctness_evaluator = CorrectnessEvaluator(
    llm=gpt4_llm,
    # parser_function=parser_function,
)

# FaithfulnessEvaluator with the GPT-4 model
gpt4_faithfulness_evaluator = FaithfulnessEvaluator(
    llm=gpt4_llm,
    eval_template=prometheus_faithfulness_eval_prompt_template,
    refine_template=prometheus_faithfulness_refine_prompt_template,
)

# RelevancyEvaluator with the GPT-4 model
gpt4_relevancy_evaluator = RelevancyEvaluator(
    llm=gpt4_llm,
    eval_template=prometheus_relevancy_eval_prompt_template,
    refine_template=prometheus_relevancy_refine_prompt_template,
)

# Create dictionaries of evaluators
prometheus_evaluators = {
    "correctness": prometheus_correctness_evaluator,
    "faithfulness": prometheus_faithfulness_evaluator,
    "relevancy": prometheus_relevancy_evaluator,
}

gpt4_evaluators = {
    "correctness": gpt4_correctness_evaluator,
    "faithfulness": gpt4_faithfulness_evaluator,
    "relevancy": gpt4_relevancy_evaluator,
}
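Before running the full batch evaluation, you can sanity-check a single evaluator with a one-off call. The example below is only a sketch: the query, response, and reference strings are made up, and the call goes against the hosted Prometheus endpoint defined earlier.
sample_result = prometheus_correctness_evaluator.evaluate(
    query="Who is Paul Graham?",  # hypothetical query
    response="Paul Graham is an essayist and co-founder of Y Combinator.",  # hypothetical answer
    reference="Paul Graham is a programmer, essayist, and co-founder of Y Combinator.",  # hypothetical reference
)
print(sample_result.score, sample_result.passing)
print(sample_result.feedback)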
Let's create a function to build the query_engine and rag_dataset for the different datasets.¶
from llama_index.core.llama_dataset import LabelledRagDataset
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
def create_query_engine_rag_dataset(dataset_path):
rag_dataset = LabelledRagDataset.from_json(
f"{dataset_path}/rag_dataset.json"
)
documents = SimpleDirectoryReader(
input_dir=f"{dataset_path}/source_files"
).load_data()
index = VectorStoreIndex.from_documents(documents=documents)
query_engine = index.as_query_engine()
return query_engine, rag_dataset
Function definition to run batch evaluations¶
This function runs the batch evaluation with the evaluators defined above.
from llama_index.core.evaluation import BatchEvalRunner
async def batch_eval_runner(
evaluators, query_engine, questions, reference=None, num_workers=8
):
batch_runner = BatchEvalRunner(
evaluators, workers=num_workers, show_progress=True
)
eval_results = await batch_runner.aevaluate_queries(
query_engine, queries=questions, reference=reference
)
return eval_results
Function to check the distribution of scores¶
from collections import Counter
from typing import List, Dict


def get_scores_distribution(scores: List[float]) -> Dict[str, float]:
    # Count the occurrences of each score
    score_counts = Counter(scores)

    # Total number of scores
    total_scores = len(scores)

    # Calculate the percentage distribution
    percentage_distribution = {
        score: (count / total_scores) * 100
        for score, count in score_counts.items()
    }

    return percentage_distribution
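A tiny illustrative example of the helper on a hypothetical list of scores:
get_scores_distribution([1.0, 3.0, 3.0, 5.0])
# {1.0: 25.0, 3.0: 50.0, 5.0: 25.0}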
Function to check the correctness, faithfulness, and relevancy evaluation scores¶
def get_eval_results(key, eval_results):
results = eval_results[key]
correct = 0
for result in results:
if result.passing:
correct += 1
score = correct / len(results)
print(f"{key} Score: {round(score, 2)}")
return score
Function to compute the Hamming Distance.¶
def hamming_distance(list1, list2):
if len(list1) != len(list2):
raise ValueError("Lists must be of the same length")
return sum(el1 != el2 for el1, el2 in zip(list1, list2))
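The observations later in the notebook report the Hamming distance as a percentage of identical scores. The small optional helper below is not part of the original notebook; it just makes that conversion explicit.
def percentage_agreement(list1, list2):
    # Fraction of positions where the two score lists agree, as a percentage.
    return (1 - hamming_distance(list1, list2) / len(list1)) * 100


# e.g. a Hamming distance of 10 over 44 scores corresponds to ~77% agreement.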
Evaluation on the Paul Graham Essay Text¶
query_engine, rag_dataset = create_query_engine_rag_dataset(
"./data/paul_graham"
)
# Get the questions for evaluation
questions = [example.query for example in rag_dataset.examples]

# Get the reference answers for evaluation
reference = [[example.reference_answer] for example in rag_dataset.examples]
Compute the Correctness, Faithfulness, and Relevancy Evaluations¶
prometheus_eval_results = await batch_eval_runner(
prometheus_evaluators, query_engine, questions, reference
)
100%|██████████| 44/44 [00:30<00:00, 1.43it/s] 100%|██████████| 132/132 [01:56<00:00, 1.13it/s]
gpt4_eval_results = await batch_eval_runner(
gpt4_evaluators, query_engine, questions, reference
)
100%|██████████| 44/44 [00:26<00:00, 1.66it/s] 100%|██████████| 132/132 [02:32<00:00, 1.16s/it]
Correctness evaluation score distribution with the Prometheus Evaluator.¶
prometheus_scores = [
result.score for result in prometheus_eval_results["correctness"]
]
get_scores_distribution(prometheus_scores)
{3.0: 50.0, 1.0: 43.18181818181818, 5.0: 2.272727272727273, 4.0: 4.545454545454546}
Correctness evaluation score distribution with the GPT-4 Evaluator.¶
gpt4_scores = [result.score for result in gpt4_eval_results["correctness"]]
get_scores_distribution(gpt4_scores)
{4.5: 50.0, 5.0: 34.090909090909086, 2.5: 9.090909090909092, 4.0: 2.272727272727273, 3.5: 4.545454545454546}
Feedback comparison between Prometheus and GPT-4¶
Here we compare the correctness feedback produced by the Prometheus and GPT-4 evaluators for the same example.
Prometheus反馈¶
- Prometheus是一个开源的监控系统,它提供了丰富的数据模型和查询语言,可以用于实时监控和警报。
- Prometheus的反馈主要集中在系统的性能指标和运行状况,例如CPU利用率、内存使用情况、请求延迟等。
- Prometheus通过收集和分析这些指标数据,可以帮助用户了解系统的运行情况,并及时发现和解决问题。
GPT-4反馈¶
- GPT-4是由OpenAI开发的自然语言处理模型,具有强大的文本生成能力,可以用于生成文章、对话等。
- GPT-4的反馈主要体现在生成的文本质量和逻辑连贯性上,用户可以根据生成的内容来评估模型的表现。
- GPT-4通过不断的训练和优化,可以提供更准确、更自然的文本生成,从而改善用户的体验。
总的来说,Prometheus和GPT-4的反馈机制针对的是不同的应用场景,分别关注系统性能和文本生成质量,用户可以根据自己的需求选择合适的系统来满足特定的需求。
query = prometheus_eval_results["correctness"][0].query
response = prometheus_eval_results["correctness"][0].response
reference_answer = reference[0][0]

# prometheus feedback and score
prometheus_feedback = prometheus_eval_results["correctness"][0].feedback
prometheus_score = prometheus_eval_results["correctness"][0].score

# gpt4 feedback and score
gpt4_feedback = gpt4_eval_results["correctness"][0].feedback
gpt4_score = gpt4_eval_results["correctness"][0].score
print(f"Query: {query} \n\n")
print(f"Generated Answer: {response} \n\n")
print(f"Reference Answer: {reference_answer} \n\n")
print(
f"Prometheus Feedback: {prometheus_feedback} \n\n {prometheus_score} \n\n"
)
print(f"GPT-4 Feedback: {gpt4_feedback} \n\n {gpt4_score}")
Query: In the essay, the author mentions his early experiences with programming. Describe the first computer he used for programming, the language he used, and the challenges he faced. Generated Answer: The author mentions that the first computer he used for programming was the IBM 1401, which was located in the basement of his junior high school. He used an early version of Fortran as the programming language. The author faced challenges in figuring out what to do with the computer, as the only form of input was data stored on punched cards, and he didn't have any. Additionally, he didn't know enough math to do anything interesting with the computer. Reference Answer: The first computer the author used for programming was the IBM 1401, which was used by his school district for data processing. He started using it in 9th grade, around the age of 13 or 14. The programming language he used was an early version of Fortran. The author faced several challenges while using this computer. The only form of input to programs was data stored on punched cards, and he didn't have any data stored on punched cards. The only other option was to do things that didn't rely on any input, like calculate approximations of pi, but he didn't know enough math to do anything interesting of that type. Therefore, he couldn't figure out what to do with it and in retrospect, he believes there's not much he could have done with it. Prometheus Feedback: The generated response is relevant to the user query and correctly describes the first computer the author used for programming, the programming language he used, and the challenges he faced. However, it has some inaccuracies in the details. The author did not use the IBM 1401 in the basement of his junior high school, but rather in 9th grade, around the age of 13 or 14. The author did not have any data stored on punched cards, but the only form of input was data stored on punched cards. The author did not know enough math to do anything interesting with the computer, but he didn't know enough math to do anything interesting of that type. So the overall score is 3. 3.0 GPT-4 Feedback: The generated answer is highly relevant and almost completely accurate. It correctly identifies the first computer the author used (IBM 1401), the programming language (Fortran), and the challenges he faced (lack of input data and insufficient math knowledge). However, it omits the detail about the author's age and grade level when he started programming, which was included in the reference answer. 4.5
Observation:¶
The feedback from Prometheus is more detailed: it points out the specific details that are omitted in the generated answer and penalizes the response with a score of 3.0. The feedback from GPT-4 is more generic and less specific, and it still gives a 4.5 score even though some details are missing.
Prometheus Faithfulness and Relevancy Scores.¶
_ = get_eval_results("faithfulness", prometheus_eval_results)
_ = get_eval_results("relevancy", prometheus_eval_results)
faithfulness Score: 0.75 relevancy Score: 0.86
GPT-4 Faithfulness and Relevancy Scores.¶
_ = get_eval_results("faithfulness", gpt4_eval_results)
_ = get_eval_results("relevancy", gpt4_eval_results)
faithfulness Score: 0.98 relevancy Score: 0.95
Hamming Distance comparison between Prometheus and GPT-4¶
(Lower is better)
prometheus_faithfulness_scores = [
result.score for result in prometheus_eval_results["faithfulness"]
]
prometheus_relevancy_scores = [
result.score for result in prometheus_eval_results["relevancy"]
]
gpt4_faithfulness_scores = [
result.score for result in gpt4_eval_results["faithfulness"]
]
gpt4_relevancy_scores = [
result.score for result in gpt4_eval_results["relevancy"]
]
faithfulness_hamming_distance = hamming_distance(
prometheus_faithfulness_scores, gpt4_faithfulness_scores
)
relevancy_hamming_distance = hamming_distance(
prometheus_relevancy_scores, gpt4_relevancy_scores
)
print(f"Faithfulness Hamming Distance: {faithfulness_hamming_distance}")
print(f"Relevancy Hamming Distance: {relevancy_hamming_distance}")
Faithfulness Hamming Distance: 10 Relevancy Hamming Distance: 8
Observation:¶
The comparison shows that roughly 77% of the faithfulness scores and 81% of the relevancy scores are identical between the Prometheus and GPT-4 evaluations. This indicates a decent amount of agreement between the Prometheus and GPT-4 models for faithfulness and relevancy scoring.
GPT-4 Cost Analysis¶
prompt_token_count = token_counter.prompt_llm_token_count
completion_token_count = token_counter.completion_llm_token_count
total_cost_paul_graham_essay = (
prompt_token_count * 0.03 + completion_token_count * 0.06
) / 1000
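# Optional (not in the original notebook): inspect the token counts and the
# computed cost for this run before the counters are reset below.
print(f"Prompt tokens: {prompt_token_count}, Completion tokens: {completion_token_count}")
print(f"GPT-4 evaluation cost (Paul Graham essay): ${total_cost_paul_graham_essay:.3f}")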
token_counter.reset_counts()
Evaluation on the Llama2 Paper¶
query_engine, rag_dataset = create_query_engine_rag_dataset("./data/llama2")
questions = [example.query for example in rag_dataset.examples]
reference = [[example.reference_answer] for example in rag_dataset.examples]
Compute the Correctness, Faithfulness, and Relevancy Evaluations¶
prometheus_eval_results = await batch_eval_runner(
prometheus_evaluators, query_engine, questions, reference
)
100%|██████████| 100/100 [01:02<00:00, 1.61it/s] 100%|██████████| 300/300 [04:34<00:00, 1.09it/s]
gpt4_eval_results = await batch_eval_runner(
gpt4_evaluators, query_engine, questions, reference
)
100%|██████████| 100/100 [01:06<00:00, 1.51it/s] 100%|██████████| 300/300 [06:22<00:00, 1.27s/it]
Correctness evaluation score distribution with the Prometheus Evaluator.¶
prometheus_scores = [
result.score for result in prometheus_eval_results["correctness"]
]
get_scores_distribution(prometheus_scores)
{3.0: 56.00000000000001, 1.0: 26.0, 5.0: 9.0, 4.0: 8.0, 2.0: 1.0}
Correctness evaluation score distribution with the GPT-4 Evaluator.¶
gpt4_scores = [result.score for result in gpt4_eval_results["correctness"]]
get_scores_distribution(gpt4_scores)
{4.5: 57.99999999999999, 1.0: 6.0, 4.0: 12.0, 5.0: 10.0, 2.0: 5.0, 3.5: 5.0, 2.5: 3.0, 3.0: 1.0}
Correctness feedback comparison between Prometheus and GPT-4.¶
query = prometheus_eval_results["correctness"][0].query
response = prometheus_eval_results["correctness"][0].response
reference_answer = reference[0][0]

# Prometheus feedback and score
prometheus_feedback = prometheus_eval_results["correctness"][0].feedback
prometheus_score = prometheus_eval_results["correctness"][0].score

# GPT4 feedback and score
gpt4_feedback = gpt4_eval_results["correctness"][0].feedback
gpt4_score = gpt4_eval_results["correctness"][0].score

print(f"Query: {query} \n\n")
print(f"Generated Answer: {response} \n\n")
print(f"Reference Answer: {reference_answer} \n\n")
print(
    f"Prometheus Feedback: {prometheus_feedback} \n\n {prometheus_score} \n\n"
)
print(f"GPT-4 Feedback: {gpt4_feedback} \n\n {gpt4_score}")
Query: Based on the abstract of "Llama 2: Open Foundation and Fine-Tuned Chat Models," what are the two primary objectives achieved in this work, and what is the range of parameters for the large language models developed? Generated Answer: The two primary objectives achieved in this work are the development and release of Llama 2, a collection of pretrained and fine-tuned large language models (LLMs), and the optimization of these models for dialogue use cases. The range of parameters for the large language models developed is from 7 billion to 70 billion. Reference Answer: The two primary objectives achieved in the work described in the abstract of "Llama 2: Open Foundation and Fine-Tuned Chat Models" are: 1. The development and release of a collection of pretrained and fine-tuned large language models (LLMs) specifically optimized for dialogue use cases. 2. The demonstration that these fine-tuned LLMs, referred to as Llama 2-Chat, outperform open-source chat models on most benchmarks tested and may be a suitable substitute for closed-source models, particularly in terms of helpfulness and safety based on human evaluations. The range of parameters for the large language models developed in this work is from 7 billion to 70 billion parameters. Prometheus Feedback: The generated response is relevant to the user query and correctly identifies the two primary objectives of the work described in the abstract of "Llama 2: Open Foundation and Fine-Tuned Chat Models." However, it does not mention the demonstration of the fine-tuned LLMs outperforming open-source chat models on most benchmarks tested, which is a key point in the reference response. The range of parameters for the large language models developed is correctly identified, but the response does not mention the specific models referred to as Llama 2-Chat. So the overall score is 3. 3.0 GPT-4 Feedback: The generated answer is relevant and almost fully correct. It correctly identifies the two primary objectives and the range of parameters for the large language models. However, it misses the detail about Llama 2-Chat outperforming other models on most benchmarks and potentially being a suitable substitute for closed-source models. 4.5
Observation:¶
The feedback from Prometheus is more precise than GPT-4's: it penalizes the answer with a score of 3.0, while GPT-4 gives it a 4.5.
Prometheus Faithfulness and Relevancy Scores.¶
_ = get_eval_results("faithfulness", prometheus_eval_results)
_ = get_eval_results("relevancy", prometheus_eval_results)
faithfulness Score: 0.39 relevancy Score: 0.57
GPT-4 Faithfulness and Relevancy Scores.¶
_ = get_eval_results("faithfulness", gpt4_eval_results)
_ = get_eval_results("relevancy", gpt4_eval_results)
faithfulness Score: 0.93 relevancy Score: 0.98
Let's now compute the Hamming distance between Prometheus and GPT-4 for this dataset. The Hamming distance counts the number of positions at which two equal-length sequences differ, so lower is better.
prometheus_faithfulness_scores = [
result.score for result in prometheus_eval_results["faithfulness"]
]
prometheus_relevancy_scores = [
result.score for result in prometheus_eval_results["relevancy"]
]
gpt4_faithfulness_scores = [
result.score for result in gpt4_eval_results["faithfulness"]
]
gpt4_relevancy_scores = [
result.score for result in gpt4_eval_results["relevancy"]
]
faithfulness_hamming_distance = hamming_distance(
prometheus_faithfulness_scores, gpt4_faithfulness_scores
)
relevancy_hamming_distance = hamming_distance(
prometheus_relevancy_scores, gpt4_relevancy_scores
)
print(f"Faithfulness Hamming Distance: {faithfulness_hamming_distance}")
print(f"Relevancy Hamming Distance: {relevancy_hamming_distance}")
Faithfulness Hamming Distance: 58 Relevancy Hamming Distance: 41
Observation:¶
The comparison shows that roughly 42% of the faithfulness scores and 59% of the relevancy scores are identical between the Prometheus and GPT-4 evaluations on this dataset. This indicates only a moderate level of agreement between the Prometheus and GPT-4 models for faithfulness and relevancy scoring here.
Faithfulness and Relevancy feedback comparison between Prometheus and GPT-4¶
# Get the query
query = questions[0]

# Get the response/generated answer for the query
response = prometheus_eval_results["faithfulness"][0].response

# Get the retrieved contexts, as they are used for faithfulness and relevancy
contexts = prometheus_eval_results["faithfulness"][0].contexts

# Get the faithfulness and relevancy feedback from the prometheus model
prometheus_faithfulness_feedback = prometheus_eval_results["faithfulness"][
    0
].feedback
prometheus_relevancy_feedback = prometheus_eval_results["relevancy"][
    0
].feedback

# Get the faithfulness and relevancy feedback from the gpt4 model
gpt4_faithfulness_feedback = gpt4_eval_results["faithfulness"][0].feedback
gpt4_relevancy_feedback = gpt4_eval_results["relevancy"][0].feedback

# Get the faithfulness and relevancy scores from the prometheus model
prometheus_faithfulness_score = prometheus_eval_results["faithfulness"][
    0
].score
prometheus_relevancy_score = prometheus_eval_results["relevancy"][0].score

# Get the faithfulness and relevancy scores from the gpt4 model
gpt4_faithfulness_score = gpt4_eval_results["faithfulness"][0].score
gpt4_relevancy_score = gpt4_eval_results["relevancy"][0].score
print(f"Query: {query} \n\n")
print(f"Generated Answer: {response}")
Query: Based on the abstract of "Llama 2: Open Foundation and Fine-Tuned Chat Models," what are the two primary objectives achieved in this work, and what is the range of parameters for the large language models developed? Generated Answer: The two primary objectives achieved in this work are the development and release of Llama 2, a collection of pretrained and fine-tuned large language models (LLMs), and the optimization of these models for dialogue use cases. The range of parameters for the large language models developed is from 7 billion to 70 billion.
print(f"Context-1: {contexts[0]}")
Context-1: Llama 2 : Open Foundation and Fine-Tuned Chat Models Hugo Touvron∗Louis Martin†Kevin Stone† Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra Prajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen Guillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller Cynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou Hakan Inan Marcin Kardas Viktor Kerkez Madian Khabsa Isabel Kloumann Artem Korenev Punit Singh Koura Marie-Anne Lachaux Thibaut Lavril Jenya Lee Diana Liskovich Yinghai Lu Yuning Mao Xavier Martinet Todor Mihaylov Pushkar Mishra Igor Molybog Yixin Nie Andrew Poulton Jeremy Reizenstein Rashi Rungta Kalyan Saladi Alan Schelten Ruan Silva Eric Michael Smith Ranjan Subramanian Xiaoqing Ellen Tan Binh Tang Ross Taylor Adina Williams Jian Xiang Kuan Puxin Xu Zheng Yan Iliyan Zarov Yuchen Zhang Angela Fan Melanie Kambadur Sharan Narang Aurelien Rodriguez Robert Stojnic Sergey Edunov Thomas Scialom∗ GenAI, Meta Abstract In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat , are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on ourhumanevaluationsforhelpfulnessandsafety,maybeasuitablesubstituteforclosed- source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs. ∗Equal contribution, corresponding authors: {tscialom, htouvron}@meta.com †Second author Contributions for all the authors can be found in Section A.1.arXiv:2307.09288v2 [cs.CL] 19 Jul 2023
print(f"Context-2: {contexts[1]}")
Context-2: (2021)alsoilluminatesthedifficultiestiedtochatbot-oriented LLMs, with concerns ranging from privacy to misleading expertise claims. Deng et al. (2023) proposes a taxonomic framework to tackle these issues, and Bergman et al. (2022) delves into the balance between potential positive and negative impacts from releasing dialogue models. InvestigationsintoredteamingrevealspecificchallengesintunedLLMs,withstudiesbyGangulietal.(2022) and Zhuoet al. (2023) showcasing a variety ofsuccessful attack typesand their effects onthe generation of harmful content. National security agencies and various researchers, such as (Mialon et al., 2023), have also raisedredflagsaroundadvancedemergentmodelbehaviors,cyberthreats,andpotentialmisuseinareaslike biological warfare. Lastly, broader societal issues like job displacement due to accelerated AI research and an over-reliance on LLMs leading to training data degradation are also pertinent considerations (Acemoglu andRestrepo,2018;AutorandSalomons,2018;Webb,2019;Shumailovetal.,2023). Wearecommittedto continuing our work engaging with the broader policy, academic, and industry community on these issues. 7 Conclusion Inthisstudy,wehaveintroduced Llama 2,anewfamilyofpretrainedandfine-tunedmodelswithscales of7billionto70billionparameters. Thesemodelshavedemonstratedtheircompetitivenesswithexisting open-source chat models, as well as competency that is equivalent to some proprietary models on evaluation setsweexamined,althoughtheystilllagbehindothermodelslikeGPT-4. Wemeticulouslyelaboratedonthe methodsandtechniquesappliedinachievingourmodels,withaheavyemphasisontheiralignmentwiththe principlesofhelpfulnessandsafety. Tocontributemoresignificantlytosocietyandfosterthepaceofresearch, wehaveresponsiblyopenedaccessto Llama 2 andLlama 2-Chat . Aspartofourongoingcommitmentto transparency and safety, we plan to make further improvements to Llama 2-Chat in future work. 36
print(
f"Prometheus Faithfulness Feedback: {prometheus_faithfulness_feedback}\n\n"
)
print(f"Prometheus Faithfulness Score: {prometheus_faithfulness_score}\n\n")
print(f"Prometheus Relevancy Feedback: {prometheus_relevancy_feedback}\n\n")
print(f"Prometheus Relevancy Score: {prometheus_relevancy_score}")
Prometheus Faithfulness Feedback: The information provided in the context is not supported by the given information. The context is about the development and release of Llama 2, a collection of pretrained and fine-tuned large language models (LLMs), and the optimization of these models for dialogue use cases. However, the information provided in the context does not align with the given information. The context does not mention the range of parameters for the large language models developed, which is the primary objective mentioned in the information. The context only talks about the development and release of Llama 2 and its optimization for dialogue use cases, but it does not provide any information about the range of parameters for the large language models developed. So the overall score is NO. [RESULT] NO Prometheus Faithfulness Score: 0.0 Prometheus Relevancy Feedback: The response is not in line with the context information provided. The query asked for the two primary objectives achieved in the work and the range of parameters for the large language models developed. However, the response provided the abstract of the paper and mentioned the authors, which is not relevant to the query. The response also did not mention the two primary objectives achieved in the work or the range of parameters for the large language models developed. So the overall score is NO. [RESULT] NO Prometheus Relevancy Score: 0.0
If you compare the feedback with the contexts, you will see that the range of parameters is mentioned in both the context and the response, yet the feedback claims that the model could not find such information.
print(f"GPT-4 Faithfulness Feedback: {gpt4_faithfulness_feedback}\n\n")
print(f"GPT-4 Faithfulness Score: {gpt4_faithfulness_score}\n\n")
print(f"GPT-4 Relevancy Feedback: {gpt4_relevancy_feedback}\n\n")
print(f"GPT-4 Relevancy Score: {gpt4_relevancy_score}")
GPT-4 Faithfulness Feedback: The given piece of information is well supported by the context. The context clearly states that Llama 2, a collection of pretrained and fine-tuned large language models (LLMs), was developed and released. It also mentions that these models range in scale from 7 billion to 70 billion parameters. Furthermore, the context confirms that these models are optimized for dialogue use cases. Therefore, the information provided is accurate and is corroborated by the context. [RESULT] YES GPT-4 Faithfulness Score: 1.0 GPT-4 Relevancy Feedback: The response accurately reflects the context provided. The response correctly identifies the two primary objectives of the work as the development and release of Llama 2, a collection of pretrained and fine-tuned large language models (LLMs), and the optimization of these models for dialogue use cases. This is in line with the information provided in the abstract of the context. The response also correctly states the range of parameters for the large language models developed as being from 7 billion to 70 billion, which is also confirmed in the context. Therefore, the response is in line with the context information provided. [RESULT] YES GPT-4 Relevancy Score: 1.0
GPT-4 is able to evaluate this example correctly, while the Prometheus model is not.¶
GPT-4 Cost Analysis¶
prompt_token_count = token_counter.prompt_llm_token_count
completion_token_count = token_counter.completion_llm_token_count
total_cost_llama2 = (
prompt_token_count * 0.03 + completion_token_count * 0.06
) / 1000
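The combined figure used in the next section is simply the sum of the two per-dataset totals. A small optional snippet (not part of the original notebook) to print it:
print(
    f"Total GPT-4 evaluation cost: "
    f"${total_cost_paul_graham_essay + total_cost_llama2:.2f}"
)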
Total Cost Analysis¶
Prometheus Model - $2.167 for 144 queries (44 for the Paul Graham essay and 100 for the Llama2 paper), roughly $0.015 per query.¶
GPT-4 Model - $22 (total_cost_paul_graham_essay + total_cost_llama2), roughly $0.15 per query.¶
Observation:¶
- Evaluation cost (approx.): $2.167 with the Prometheus model vs. $22 with GPT-4.
- Although the Prometheus model gives more detailed feedback than GPT-4, it occasionally produces incorrect feedback, so it should be applied with caution.
- The Prometheus model penalizes scores more strictly than GPT-4 when the generated answer misses facts that are present in the reference answer.
- Compared to GPT-4, the faithfulness and relevancy feedback from Prometheus shows more hallucination/misinterpretation.
- The agreement between Prometheus and GPT-4 on faithfulness and relevancy scores differs across the two datasets, so it should be used cautiously in production.
Note: The HF endpoint is served on AWS with an Nvidia A100G (1x GPU, 80 GB) at $6.5 per hour. We used the original Prometheus model for the analysis here. We also ran a similar analysis with the GPTQ-quantized version of the Prometheus model and observed more hallucination in its feedback compared to the original (unquantized) model. Thanks to the authors of the paper and to Tom Jobbins for the quantized version of the model.