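Evaluate a Hugging Face LLM with mlflow.evaluate()
This guide shows how to load a pre-trained Hugging Face pipeline, log it to MLflow, and use mlflow.evaluate() to compute built-in metrics as well as custom LLM-judged metrics for the model.
For details, see the "Evaluate with MLflow" documentation.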
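Start MLflow Server
You can either:
Start a local tracking server by running mlflow ui in the same directory as your notebook.
Use a remote tracking server, as described in the tracking server overview.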
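If you go with the local server, a quick reachability check before logging anything can save a confusing failure later. The sketch below is a minimal example, assuming the server started by mlflow ui listens on its default address http://127.0.0.1:5000 and answers on a /health path. The helper name tracking_server_is_healthy is hypothetical and not part of MLflow.

```python
import http.client
import os
import urllib.request


def tracking_server_is_healthy(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET <base_url>/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, or a malformed reply.
        return False


# Assumed default address of a server started with `mlflow ui`.
TRACKING_URI = "http://127.0.0.1:5000"

if tracking_server_is_healthy(TRACKING_URI):
    # MLflow reads this variable when no tracking URI is set in code.
    os.environ["MLFLOW_TRACKING_URI"] = TRACKING_URI
    print(f"Tracking server reachable at {TRACKING_URI}")
else:
    print(f"No tracking server at {TRACKING_URI}; start one with `mlflow ui`")
```

Setting the environment variable rather than calling into MLflow keeps the check free of third-party imports; calling mlflow.set_tracking_uri() in the notebook should work just as well.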
Install necessary dependencies
[ ]:
%pip install -q mlflow transformers torch torchvision evaluate datasets openai tiktoken fastapi rouge_score textstat
[2]:
# Necessary imports
import warnings
import pandas as pd
from datasets import load_dataset
from transformers import pipeline
import mlflow
from mlflow.metrics.genai import EvaluationExample, answer_correctness, make_genai_metric
[3]:
# Disable FutureWarnings
warnings.filterwarnings("ignore", category=FutureWarning)
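Load a pretrained Hugging Face pipeline
Here we load a text generation pipeline, but you could also use a text summarization or question answering pipeline.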
[4]:
mpt_pipeline = pipeline("text-generation", model="mosaicml/mpt-7b-chat")
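Log the model using MLflow
We log our pipeline as an MLflow Model, which follows a standard format that lets you save a model in different "flavors" that different downstream tools can understand. In this case, the model is of the transformers "flavor".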
[5]:
mlflow.set_experiment("Evaluate Hugging Face Text Pipeline")
# Define the signature
signature = mlflow.models.infer_signature(
    model_input="What are the three primary colors?",
    model_output="The three primary colors are red, yellow, and blue.",
)
# Log the model using mlflow
with mlflow.start_run():
    model_info = mlflow.transformers.log_model(
        transformers_model=mpt_pipeline,
        artifact_path="mpt-7b",
        signature=signature,
        registered_model_name="mpt-7b-chat",
    )
Successfully registered model 'mpt-7b-chat'.
Created version '1' of model 'mpt-7b-chat'.
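Load evaluation data
Load a dataset from the Hugging Face Hub to use for evaluation.
The data fields in the dataset below represent:
instruction: Describes the task that the model should perform. Each row in the dataset is a unique instruction (task) to be performed.
input: Optional contextual information that relates to the task defined in the instruction field. For example, for the instruction "Identify the odd one out", the input supplies the list of items to select the outlier from: "Twitter, Instagram, Telegram".
output: The answer to the instruction (taking the optional input context into account), as generated by the original evaluation model (text-davinci-003 from OpenAI).
text: The final text produced by applying the instruction, input, and output to the prompt template, which is sent to the model for fine-tuning.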
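To make the relationship between the four fields concrete, here is a minimal sketch of how text could be rebuilt from the other three. The template strings are an assumption based on the published Alpaca prompt format and on the truncated text values visible in the dataset preview; build_text is a hypothetical helper, not part of the dataset or of MLflow.

```python
# Assumed Alpaca-style templates: one for rows with an input, one without.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{context}\n\n"
    "### Response:\n{output}"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{output}"
)


def build_text(instruction: str, output: str, context: str = "") -> str:
    """Fill the template that matches whether the row has an input."""
    template = PROMPT_WITH_INPUT if context else PROMPT_NO_INPUT
    return template.format(instruction=instruction, context=context, output=output)


with_input = build_text(
    "Identify the odd one out.", "Telegram", "Twitter, Instagram, Telegram"
)
without_input = build_text(
    "What are the three primary colors?",
    "The three primary colors are red, blue, and yellow.",
)
print(with_input)
```

Only the instruction column is sent to the model in this guide, so text is not needed for evaluation; the sketch is just meant to show where it comes from.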
[7]:
dataset = load_dataset("tatsu-lab/alpaca")
eval_df = pd.DataFrame(dataset["train"])
eval_df.head(10)
[7]:
| | instruction | input | output | text |
|---|---|---|---|---|
| 0 | Give three tips for staying healthy. | | 1.Eat a balanced diet and make sure to include... | Below is an instruction that describes a task.... |
| 1 | What are the three primary colors? | | The three primary colors are red, blue, and ye... | Below is an instruction that describes a task.... |
| 2 | Describe the structure of an atom. | | An atom is made up of a nucleus, which contain... | Below is an instruction that describes a task.... |
| 3 | How can we reduce air pollution? | | There are a number of ways to reduce air pollu... | Below is an instruction that describes a task.... |
| 4 | Describe a time when you had to make a difficu... | | I had to make a difficult decision when I was ... | Below is an instruction that describes a task.... |
| 5 | Identify the odd one out. | Twitter, Instagram, Telegram | Telegram | Below is an instruction that describes a task,... |
| 6 | Explain why the following fraction is equivale... | 4/16 | The fraction 4/16 is equivalent to 1/4 because... | Below is an instruction that describes a task,... |
| 7 | Write a short story in third person narration ... | | John was at a crossroads in his life. He had j... | Below is an instruction that describes a task.... |
| 8 | Render a 3D model of a house | | <nooutput> This type of instruction cannot be ... | Below is an instruction that describes a task.... |
| 9 | Evaluate this sentence for spelling and gramma... | He finnished his meal and left the resturant | He finished his meal and left the restaurant. | Below is an instruction that describes a task,... |
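Define metrics
Since we are evaluating how well our model answers a given instruction, we may want to choose some metrics that measure this, on top of any built-in metrics that mlflow.evaluate() provides.
Let's measure how well our model does on the following two metrics:
Is the answer correct? Here we use the predefined metric answer_correctness.
Is the answer fluent, clear, and concise? We will define a custom metric, answer_quality, to measure this.
We need to pass both metrics to the extra_metrics argument of mlflow.evaluate() to assess the quality of our model.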
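What is an evaluation metric?
An evaluation metric encapsulates any quantitative or qualitative measure you want to calculate for your model. For each model type, mlflow.evaluate() automatically calculates a set of built-in metrics; the MLflow documentation lists which built-in metrics are calculated for each model type. You can also pass in any additional metrics you want calculated. MLflow provides a set of predefined metrics, also listed in its documentation, or you can define your own custom metrics. In this example, we use a combination of the predefined metric mlflow.metrics.genai.answer_correctness and a custom metric for quality evaluation.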
Let's load our predefined metric. In this case, we use answer_correctness with GPT-4 as the judge.
[9]:
answer_correctness_metric = answer_correctness(model="openai:/gpt-4")
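Now we want to create a custom LLM-judged metric named answer_quality using make_genai_metric(). We need to provide a metric definition and a grading rubric, as well as some examples for the LLM judge to use.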
[8]:
# The definition explains what "answer quality" entails
answer_quality_definition = """Please evaluate answer quality for the provided output on the following criteria:
fluency, clarity, and conciseness. Each of the criteria is defined as follows:
- Fluency measures how naturally and smoothly the output reads.
- Clarity measures how understandable the output is.
- Conciseness measures the brevity and efficiency of the output without compromising meaning.
The more fluent, clear, and concise a text, the higher the score it deserves.
"""
# The grading prompt explains what each possible score means
answer_quality_grading_prompt = """Answer quality: Below are the details for different scores:
- Score 1: The output is entirely incomprehensible and cannot be read.
- Score 2: The output conveys some meaning, but needs lots of improvement in fluency, clarity, and conciseness.
- Score 3: The output is understandable but still needs improvement.
- Score 4: The output performs well on two of fluency, clarity, and conciseness, but could be improved on one of these criteria.
- Score 5: The output reads smoothly, is easy to understand, and clear. There is no clear way to improve the output on these criteria.
"""
# We provide an example of a "bad" output
example1 = EvaluationExample(
    input="What is MLflow?",
    output="MLflow is an open-source platform. For managing machine learning workflows, it "
    "including experiment tracking model packaging versioning and deployment as well as a platform "
    "simplifying for on the ML lifecycle.",
    score=2,
    justification="The output is difficult to understand and demonstrates extremely low clarity. "
    "However, it still conveys some meaning so this output deserves a score of 2.",
)
# We also provide an example of a "good" output
example2 = EvaluationExample(
    input="What is MLflow?",
    output="MLflow is an open-source platform for managing machine learning workflows, including "
    "experiment tracking, model packaging, versioning, and deployment.",
    score=5,
    justification="The output is easily understandable, clear, and concise. It deserves a score of 5.",
)
answer_quality_metric = make_genai_metric(
    name="answer_quality",
    definition=answer_quality_definition,
    grading_prompt=answer_quality_grading_prompt,
    version="v1",
    examples=[example1, example2],
    model="openai:/gpt-4",
    greater_is_better=True,
)
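The definition, the grading prompt, and the examples all serve as context for the judge model. The sketch below only illustrates that idea in plain Python. The layout of the assembled prompt and the names JudgeExample and build_judge_prompt are hypothetical, and the result should not be read as MLflow's actual prompt template.

```python
from dataclasses import dataclass


@dataclass
class JudgeExample:
    """Hypothetical stand-in for one few-shot example shown to the judge."""

    question: str
    answer: str
    score: int
    justification: str


def build_judge_prompt(definition, grading_prompt, examples, question, answer):
    """Concatenate the rubric, the few-shot examples, and the pair to grade."""
    parts = [definition.strip(), grading_prompt.strip(), "Examples:"]
    for ex in examples:
        parts.append(
            f"Input: {ex.question}\n"
            f"Output: {ex.answer}\n"
            f"Score: {ex.score}\n"
            f"Justification: {ex.justification}"
        )
    parts.append(
        f"Now grade this pair.\nInput: {question}\nOutput: {answer}\n"
        "Reply with a score from 1 to 5 and a short justification."
    )
    # Blank lines separate the sections so the judge can tell them apart.
    return "\n\n".join(parts)


prompt = build_judge_prompt(
    definition="Answer quality covers fluency, clarity, and conciseness.",
    grading_prompt="Score 1 is incomprehensible; score 5 leaves nothing to improve.",
    examples=[
        JudgeExample(
            question="What is MLflow?",
            answer="MLflow is an open-source platform for managing ML workflows.",
            score=5,
            justification="Clear and concise.",
        )
    ],
    question="What are the three primary colors?",
    answer="Red, yellow, and blue.",
)
print(prompt)
```

Seen this way, the examples act as calibration points: one low-scoring and one high-scoring answer to the same question give the judge a concrete sense of the scale.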
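Evaluate
We need to set our OpenAI API key, since we are using GPT-4 as the judge for our LLM-judged metrics.
To set your private key safely, either export it from a command-line terminal for your current session or, to make it available in all of your user sessions, add the following entry to your preferred shell configuration file (e.g., .bashrc, .zshrc):
OPENAI_API_KEY=<your openai API key>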
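A missing key may only surface after the model predictions have already been computed, so it can be worth checking for it up front. A minimal sketch of such a check follows; require_env is a hypothetical helper, and the only assumption is that the key is read from the OPENAI_API_KEY environment variable, as described above.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Export it in your shell before starting the notebook."
        )
    return value


# Usage in the notebook (left commented out so this sketch never raises on import):
# require_env("OPENAI_API_KEY")
```

The helper returns the value but never prints it, which keeps the key out of notebook output.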
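Now we can call mlflow.evaluate(). Just to test it out, we use the first 10 rows of the data. With the "text" model type, toxicity and readability are calculated as built-in metrics. We also pass the two metrics defined above to the extra_metrics parameter so that they are evaluated as well.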
[14]:
with mlflow.start_run():
    results = mlflow.evaluate(
        model_info.model_uri,
        eval_df.head(10),
        evaluators="default",
        model_type="text",
        targets="output",
        extra_metrics=[answer_correctness_metric, answer_quality_metric],
        evaluator_config={"col_mapping": {"inputs": "instruction"}},
    )
2023/12/28 11:57:30 INFO mlflow.store.artifact.artifact_repo: The progress bar can be disabled by setting the environment variable MLFLOW_ENABLE_ARTIFACTS_PROGRESS_BAR to false
2023/12/28 12:00:25 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.
2023/12/28 12:00:25 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2023/12/28 12:02:23 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...
Using default facebook/roberta-hate-speech-dynabench-r4-target checkpoint
2023/12/28 12:02:43 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: token_count
2023/12/28 12:02:43 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: toxicity
2023/12/28 12:02:44 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: flesch_kincaid_grade_level
2023/12/28 12:02:44 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: ari_grade_level
2023/12/28 12:02:44 INFO mlflow.models.evaluation.default_evaluator: Evaluating metrics: answer_correctness
2023/12/28 12:02:53 INFO mlflow.models.evaluation.default_evaluator: Evaluating metrics: answer_quality
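View results
results.metrics is a dictionary containing the aggregate values of all calculated metrics. See the MLflow documentation for details on the built-in metrics for each model type.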
[15]:
results.metrics
[15]:
{'toxicity/v1/mean': 0.00809656630299287,
'toxicity/v1/variance': 0.0004603014839856817,
'toxicity/v1/p90': 0.010559113975614286,
'toxicity/v1/ratio': 0.0,
'flesch_kincaid_grade_level/v1/mean': 4.9,
'flesch_kincaid_grade_level/v1/variance': 6.3500000000000005,
'flesch_kincaid_grade_level/v1/p90': 6.829999999999998,
'ari_grade_level/v1/mean': 4.1899999999999995,
'ari_grade_level/v1/variance': 16.6329,
'ari_grade_level/v1/p90': 7.949999999999998,
'answer_correctness/v1/mean': 1.5,
'answer_correctness/v1/variance': 1.45,
'answer_correctness/v1/p90': 2.299999999999999,
'answer_quality/v1/mean': 2.4,
'answer_quality/v1/variance': 1.44,
'answer_quality/v1/p90': 4.1}
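The keys in this output follow a name/version/aggregate pattern, which makes the flat dictionary easy to regroup per metric. The sketch below assumes that pattern holds for every key with a slash; group_metrics is a hypothetical helper, and the sample values are copied from the output above.

```python
from collections import defaultdict


def group_metrics(metrics: dict) -> dict:
    """Turn {'name/version/aggregate': value} into {'name': {'aggregate': value}}."""
    grouped = defaultdict(dict)
    for key, value in metrics.items():
        if "/" not in key:
            # Keys without the assumed pattern are kept under a generic label.
            grouped[key]["value"] = value
            continue
        name, *_, aggregate = key.split("/")
        grouped[name][aggregate] = value
    return dict(grouped)


sample_metrics = {
    "toxicity/v1/mean": 0.00809656630299287,
    "toxicity/v1/ratio": 0.0,
    "answer_correctness/v1/mean": 1.5,
    "answer_correctness/v1/variance": 1.45,
    "answer_quality/v1/mean": 2.4,
    "answer_quality/v1/variance": 1.44,
    "answer_quality/v1/p90": 4.1,
}

grouped = group_metrics(sample_metrics)
for name, aggregates in grouped.items():
    summary = ", ".join(f"{agg}={val:.3g}" for agg, val in aggregates.items())
    print(f"{name}: {summary}")
```

In the notebook, passing results.metrics instead of sample_metrics would give the same per-metric view.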
We can also view the eval_results_table, which shows the metrics for each row of data.
[16]:
results.tables["eval_results_table"]
[16]:
| | instruction | input | text | output | outputs | token_count | toxicity/v1/score | flesch_kincaid_grade_level/v1/score | ari_grade_level/v1/score | answer_correctness/v1/score | answer_correctness/v1/justification | answer_quality/v1/score | answer_quality/v1/justification |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Give three tips for staying healthy. | | Below is an instruction that describes a task.... | 1.Eat a balanced diet and make sure to include... | Give three tips for staying healthy.\n1. Eat a... | 19 | 0.000446 | 4.1 | 4.0 | 2 | The output provided by the model only includes... | 3 | The output is understandable and fluent but it... |
| 1 | What are the three primary colors? | | Below is an instruction that describes a task.... | The three primary colors are red, blue, and ye... | What are the three primary colors?\nThe three ... | 19 | 0.000217 | 5.0 | 4.9 | 5 | The output provided by the model is completely... | 5 | The model's output is fluent, clear, and conci... |
| 2 | Describe the structure of an atom. | | Below is an instruction that describes a task.... | An atom is made up of a nucleus, which contain... | Describe the structure of an atom.\nAn atom is... | 18 | 0.000139 | 3.1 | 2.2 | 1 | The output provided by the model is incomplete... | 2 | The output is incomplete and lacks clarity, ma... |
| 3 | How can we reduce air pollution? | | Below is an instruction that describes a task.... | There are a number of ways to reduce air pollu... | How can we reduce air pollution?\nThere are ma... | 18 | 0.000140 | 5.0 | 5.5 | 1 | The output provided by the model is completely... | 1 | The output is entirely incomprehensible and ca... |
| 4 | Describe a time when you had to make a difficu... | | Below is an instruction that describes a task.... | I had to make a difficult decision when I was ... | Describe a time when you had to make a difficu... | 18 | 0.000159 | 5.2 | 2.9 | 1 | The output provided by the model is completely... | 2 | The output is incomplete and lacks clarity, ma... |
| 5 | Identify the odd one out. | Twitter, Instagram, Telegram | Below is an instruction that describes a task,... | Telegram | Identify the odd one out.\n\n1. A car\n2. A tr... | 18 | 0.072345 | 0.1 | -5.4 | 1 | The output provided by the model is completely... | 2 | The output is not clear and lacks fluency. The... |
| 6 | Explain why the following fraction is equivale... | 4/16 | Below is an instruction that describes a task,... | The fraction 4/16 is equivalent to 1/4 because... | Explain why the following fraction is equivale... | 23 | 0.000320 | 6.4 | 7.6 | 1 | The output provided by the model is completely... | 2 | The output is not clear and does not answer th... |
| 7 | Write a short story in third person narration ... | | Below is an instruction that describes a task.... | John was at a crossroads in his life. He had j... | Write a short story in third person narration ... | 20 | 0.000247 | 10.7 | 11.1 | 1 | The output provided by the model is completely... | 1 | The output is exactly the same as the input, a... |
| 8 | Render a 3D model of a house | | Below is an instruction that describes a task.... | <nooutput> This type of instruction cannot be ... | Render a 3D model of a house in Blender - Blen... | 19 | 0.003694 | 5.2 | 2.7 | 1 | The output provided by the model is completely... | 2 | The output is partially understandable but lac... |
| 9 | Evaluate this sentence for spelling and gramma... | He finnished his meal and left the resturant | Below is an instruction that describes a task,... | He finished his meal and left the restaurant. | Evaluate this sentence for spelling and gramma... | 18 | 0.003260 | 4.2 | 6.4 | 1 | The output provided by the model is completely... | 4 | The output is fluent and clear, but it is not ... |
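The aggregate values in results.metrics can be cross-checked against the per-row scores in this table. The sketch below assumes that variance means population variance and that p90 uses linear interpolation between neighboring ranks. Both are inferred from the reported numbers rather than taken from MLflow's source, and percentile is a hypothetical helper.

```python
import math
import statistics


def percentile(values, q):
    """Percentile with linear interpolation between ranks (q in [0, 100])."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower, upper = math.floor(rank), math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Per-row scores copied from the table above (rows 0 to 9).
answer_quality_scores = [3, 5, 2, 1, 2, 2, 2, 1, 2, 4]
answer_correctness_scores = [2, 5, 1, 1, 1, 1, 1, 1, 1, 1]

for name, scores in [
    ("answer_quality", answer_quality_scores),
    ("answer_correctness", answer_correctness_scores),
]:
    print(
        name,
        "mean:", statistics.mean(scores),
        "variance:", statistics.pvariance(scores),
        "p90:", round(percentile(scores, 90), 4),
    )
```

Under these assumptions the computed values agree with the reported aggregates (2.4, 1.44, and 4.1 for answer_quality; 1.5, 1.45, and 2.3 for answer_correctness), up to floating-point rounding.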
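View results in UI
Finally, we can view our evaluation results in the MLflow UI. Selecting our experiment in the left sidebar brings us to the following page. We can see that one run logged our model "mpt-7b-chat", and the other run holds the dataset we evaluated.
We click the Evaluation tab and hide any irrelevant runs.
We can now choose which columns to group by, as well as which column to compare. In the following example, we look at the answer correctness score for each input-output pair, but we could choose any other metric to compare.
Finally, we get to the following view, where we can see the justification and score for answer correctness for each row.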