Evaluate a Hugging Face LLM with mlflow.evaluate()

This guide shows how to load a pre-trained Hugging Face pipeline, log it to MLflow, and use mlflow.evaluate() to evaluate the model with built-in metrics as well as custom LLM-judged metrics.

For more details, read the LLM Evaluation with MLflow documentation.


Start the MLflow Server

You can either:

  • Start a local tracking server by running mlflow ui in the same directory as your notebook (see the sketch after this list).

  • Use a tracking server, as described in this overview.
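
If you start a local server with mlflow ui, a minimal sketch for pointing the MLflow client at it might look like the following (this assumes the default port 5000; adjust the URI if you use a remote tracking server instead):

[ ]:
import mlflow

# Point the client at the tracking server started with `mlflow ui`
# (it serves on http://localhost:5000 by default).
mlflow.set_tracking_uri("http://localhost:5000")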

Install Necessary Dependencies

[ ]:
%pip install -q mlflow transformers torch torchvision evaluate datasets openai tiktoken fastapi rouge_score textstat
[2]:
# Necessary imports
import warnings

import pandas as pd
from datasets import load_dataset
from transformers import pipeline

import mlflow
from mlflow.metrics.genai import EvaluationExample, answer_correctness, make_genai_metric
[3]:
# Disable FutureWarnings
warnings.filterwarnings("ignore", category=FutureWarning)

Load a Pre-trained Hugging Face Pipeline

Here we load a text-generation pipeline, but you can also use either a text-summarization or question-answering pipeline.

[4]:
mpt_pipeline = pipeline("text-generation", model="mosaicml/mpt-7b-chat")
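
Before logging the model, it can be useful to sanity-check the pipeline with a quick generation. This is only a sketch; the prompt and the max_new_tokens value are arbitrary choices:

[ ]:
# Quick sanity check: generate a short completion with the loaded pipeline
result = mpt_pipeline("What are the three primary colors?", max_new_tokens=50)
print(result[0]["generated_text"])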

Log the Model with MLflow

We log our pipeline as an MLflow Model, which follows a standard format that lets you save a model in different "flavors" that can be understood by different downstream tools. In this case, the model is of the transformers "flavor".

[5]:
mlflow.set_experiment("Evaluate Hugging Face Text Pipeline")

# Define the signature
signature = mlflow.models.infer_signature(
    model_input="What are the three primary colors?",
    model_output="The three primary colors are red, yellow, and blue.",
)

# Log the model using mlflow
with mlflow.start_run():
    model_info = mlflow.transformers.log_model(
        transformers_model=mpt_pipeline,
        artifact_path="mpt-7b",
        signature=signature,
        registered_model_name="mpt-7b-chat",
    )
Successfully registered model 'mpt-7b-chat'.
Created version '1' of model 'mpt-7b-chat'.
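
Since the model was logged with a signature, it can also be loaded back as a generic pyfunc model for inference. A minimal sketch (note that this re-loads the full 7B model, so it is memory-intensive):

[ ]:
# Load the logged pipeline back as a generic pyfunc model and run a prediction
loaded_model = mlflow.pyfunc.load_model(model_info.model_uri)
print(loaded_model.predict("What are the three primary colors?"))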

Load Evaluation Data

Load a dataset from the Hugging Face Hub to use for evaluation.

The data fields in the dataset below represent:

  • instruction: describes the task the model should perform. Each row in the dataset is a unique instruction (task) to perform.

  • input: optional context information relevant to the task defined in the instruction field. For example, for the instruction "Identify the odd one out", the input context provides the list of items from which to pick the anomaly: "Twitter, Instagram, Telegram".

  • output: the answer to the instruction (given the optional input context), as generated by the original evaluation model (OpenAI's text-davinci-003).

  • text: the final full text that results from applying the instruction, input, and output to the prompt template used; this text was sent to the model for fine-tuning (a sketch of how it is assembled follows this list).
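
For intuition, here is a hedged sketch of how the text field is assembled from the other three fields. The exact template wording below is an assumption based on the Alpaca dataset card, not something this notebook defines:

[ ]:
# Sketch of the Alpaca-style prompt template (wording is an assumption)
def build_text(instruction, input_, output):
    if input_:
        return (
            "Below is an instruction that describes a task, paired with an input that "
            "provides further context. Write a response that appropriately completes "
            f"the request.\n\n### Instruction:\n{instruction}\n\n"
            f"### Input:\n{input_}\n\n### Response:\n{output}"
        )
    return (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n### Response:\n{output}"
    )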

[7]:
dataset = load_dataset("tatsu-lab/alpaca")
eval_df = pd.DataFrame(dataset["train"])
eval_df.head(10)
[7]:
|   | instruction | input | output | text |
|---|-------------|-------|--------|------|
| 0 | Give three tips for staying healthy. |  | 1.Eat a balanced diet and make sure to include... | Below is an instruction that describes a task.... |
| 1 | What are the three primary colors? |  | The three primary colors are red, blue, and ye... | Below is an instruction that describes a task.... |
| 2 | Describe the structure of an atom. |  | An atom is made up of a nucleus, which contain... | Below is an instruction that describes a task.... |
| 3 | How can we reduce air pollution? |  | There are a number of ways to reduce air pollu... | Below is an instruction that describes a task.... |
| 4 | Describe a time when you had to make a difficu... |  | I had to make a difficult decision when I was ... | Below is an instruction that describes a task.... |
| 5 | Identify the odd one out. | Twitter, Instagram, Telegram | Telegram | Below is an instruction that describes a task,... |
| 6 | Explain why the following fraction is equivale... | 4/16 | The fraction 4/16 is equivalent to 1/4 because... | Below is an instruction that describes a task,... |
| 7 | Write a short story in third person narration ... |  | John was at a crossroads in his life. He had j... | Below is an instruction that describes a task.... |
| 8 | Render a 3D model of a house |  | <nooutput> This type of instruction cannot be ... | Below is an instruction that describes a task.... |
| 9 | Evaluate this sentence for spelling and gramma... | He finnished his meal and left the resturant | He finished his meal and left the restaurant. | Below is an instruction that describes a task,... |

Define Metrics

Since we are evaluating how well our model answers a given instruction, we may want to choose some metrics that help measure this, in addition to any built-in metrics that mlflow.evaluate() provides.

Let's measure how well our model performs with the following two metrics:

  • Is the answer correct? Here we use the predefined metric answer_correctness.

  • Is the answer fluent, clear, and concise? We will define a custom metric answer_quality to measure this.

We need to pass both of these metrics into the extra_metrics argument of mlflow.evaluate() in order to assess the quality of our model.

What is an Evaluation Metric?

An evaluation metric encapsulates any quantitative or qualitative measure you want to compute for your model. For each model type, mlflow.evaluate() automatically computes a set of built-in metrics. Refer here for which built-in metrics are computed for each model type. You can also pass in any additional metrics you want to compute. MLflow provides a set of predefined metrics, which you can find here, or you can define your own custom metrics. In this example, we will use a combination of the predefined metric mlflow.metrics.genai.answer_correctness and a custom metric for quality assessment.

Let's load our predefined metric. In this case, we are using answer_correctness with GPT-4.

[9]:
answer_correctness_metric = answer_correctness(model="openai:/gpt-4")

Now we want to create a custom LLM-judged metric named answer_quality using make_genai_metric(). We need to define the metric's definition and grading rubric, along with a few examples for the LLM judge to use.

[8]:
# The definition explains what "answer quality" entails
answer_quality_definition = """Please evaluate answer quality for the provided output on the following criteria:
fluency, clarity, and conciseness. Each of the criteria is defined as follows:
  - Fluency measures how naturally and smoothly the output reads.
  - Clarity measures how understandable the output is.
  - Conciseness measures the brevity and efficiency of the output without compromising meaning.
The more fluent, clear, and concise a text, the higher the score it deserves.
"""

# The grading prompt explains what each possible score means
answer_quality_grading_prompt = """Answer quality: Below are the details for different scores:
  - Score 1: The output is entirely incomprehensible and cannot be read.
  - Score 2: The output conveys some meaning, but needs significant improvement in fluency, clarity, and conciseness.
  - Score 3: The output is understandable but still needs improvement.
  - Score 4: The output performs well on two of fluency, clarity, and conciseness, but could be improved on one of these criteria.
  - Score 5: The output reads smoothly, is easy to understand, and is clear. There is no obvious way to improve the output on these criteria.
"""

# We provide an example of a "bad" output
example1 = EvaluationExample(
    input="What is MLflow?",
    output="MLflow is an open-source platform. For managing machine learning workflows, it "
    "including experiment tracking model packaging versioning and deployment as well as a platform "
    "simplifying for on the ML lifecycle.",
    score=2,
    justification="The output is difficult to understand and demonstrates extremely low clarity. "
    "However, it still conveys some meaning so this output deserves a score of 2.",
)

# We also provide an example of a "good" output
example2 = EvaluationExample(
    input="What is MLflow?",
    output="MLflow is an open-source platform for managing machine learning workflows, including "
    "experiment tracking, model packaging, versioning, and deployment.",
    score=5,
    justification="The output is easily understandable, clear, and concise. It deserves a score of 5.",
)

answer_quality_metric = make_genai_metric(
    name="answer_quality",
    definition=answer_quality_definition,
    grading_prompt=answer_quality_grading_prompt,
    version="v1",
    examples=[example1, example2],
    model="openai:/gpt-4",
    greater_is_better=True,
)

Evaluate

We need to set our OpenAI API key, since we are using GPT-4 as the judge for our LLM-judged metrics.

In order to set your private key securely, be sure to export the key via the command-line terminal of your current instance, or, to add it permanently to all user-based sessions, configure the environment management configuration file of your choice (e.g., .bashrc, .zshrc) to contain the following entry:

OPENAI_API_KEY=<your openai API key>
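
From inside the notebook, a quick sanity check (a sketch; it only verifies that the variable is set, not that the key is valid) might look like:

[ ]:
import os

# Fail fast if the key was not exported before the notebook was started
assert "OPENAI_API_KEY" in os.environ, "OPENAI_API_KEY is not set"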

Now we can call mlflow.evaluate(). To try things out, we use the first 10 rows of the data. With the "text" model type, the built-in toxicity and readability metrics are computed automatically. We also pass the two metrics we defined above into the extra_metrics argument; the col_mapping entry in evaluator_config tells the evaluator to use our instruction column as the inputs for these metrics.

[14]:
with mlflow.start_run():
    results = mlflow.evaluate(
        model_info.model_uri,
        eval_df.head(10),
        evaluators="default",
        model_type="text",
        targets="output",
        extra_metrics=[answer_correctness_metric, answer_quality_metric],
        evaluator_config={"col_mapping": {"inputs": "instruction"}},
    )
2023/12/28 11:57:30 INFO mlflow.store.artifact.artifact_repo: The progress bar can be disabled by setting the environment variable MLFLOW_ENABLE_ARTIFACTS_PROGRESS_BAR to false
2023/12/28 12:00:25 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.
2023/12/28 12:00:25 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2023/12/28 12:02:23 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...
Using default facebook/roberta-hate-speech-dynabench-r4-target checkpoint
2023/12/28 12:02:43 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: token_count
2023/12/28 12:02:43 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: toxicity
2023/12/28 12:02:44 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: flesch_kincaid_grade_level
2023/12/28 12:02:44 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: ari_grade_level
2023/12/28 12:02:44 INFO mlflow.models.evaluation.default_evaluator: Evaluating metrics: answer_correctness
2023/12/28 12:02:53 INFO mlflow.models.evaluation.default_evaluator: Evaluating metrics: answer_quality

View Results

results.metrics is a dictionary containing the aggregated values of all computed metrics. For details on the built-in metrics for each model type, refer here.

[15]:
results.metrics
[15]:
{'toxicity/v1/mean': 0.00809656630299287,
 'toxicity/v1/variance': 0.0004603014839856817,
 'toxicity/v1/p90': 0.010559113975614286,
 'toxicity/v1/ratio': 0.0,
 'flesch_kincaid_grade_level/v1/mean': 4.9,
 'flesch_kincaid_grade_level/v1/variance': 6.3500000000000005,
 'flesch_kincaid_grade_level/v1/p90': 6.829999999999998,
 'ari_grade_level/v1/mean': 4.1899999999999995,
 'ari_grade_level/v1/variance': 16.6329,
 'ari_grade_level/v1/p90': 7.949999999999998,
 'answer_correctness/v1/mean': 1.5,
 'answer_correctness/v1/variance': 1.45,
 'answer_correctness/v1/p90': 2.299999999999999,
 'answer_quality/v1/mean': 2.4,
 'answer_quality/v1/variance': 1.44,
 'answer_quality/v1/p90': 4.1}

We can also view the eval_results_table, which shows us the metrics for each row of data.

[16]:
results.tables["eval_results_table"]
[16]:
|   | instruction | input | text | output | outputs | token_count | toxicity/v1/score | flesch_kincaid_grade_level/v1/score | ari_grade_level/v1/score | answer_correctness/v1/score | answer_correctness/v1/justification | answer_quality/v1/score | answer_quality/v1/justification |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Give three tips for staying healthy. |  | Below is an instruction that describes a task.... | 1.Eat a balanced diet and make sure to include... | Give three tips for staying healthy.\n1. Eat a... | 19 | 0.000446 | 4.1 | 4.0 | 2 | The output provided by the model only includes... | 3 | The output is understandable and fluent but it... |
| 1 | What are the three primary colors? |  | Below is an instruction that describes a task.... | The three primary colors are red, blue, and ye... | What are the three primary colors?\nThe three ... | 19 | 0.000217 | 5.0 | 4.9 | 5 | The output provided by the model is completely... | 5 | The model's output is fluent, clear, and conci... |
| 2 | Describe the structure of an atom. |  | Below is an instruction that describes a task.... | An atom is made up of a nucleus, which contain... | Describe the structure of an atom.\nAn atom is... | 18 | 0.000139 | 3.1 | 2.2 | 1 | The output provided by the model is incomplete... | 2 | The output is incomplete and lacks clarity, ma... |
| 3 | How can we reduce air pollution? |  | Below is an instruction that describes a task.... | There are a number of ways to reduce air pollu... | How can we reduce air pollution?\nThere are ma... | 18 | 0.000140 | 5.0 | 5.5 | 1 | The output provided by the model is completely... | 1 | The output is entirely incomprehensible and ca... |
| 4 | Describe a time when you had to make a difficu... |  | Below is an instruction that describes a task.... | I had to make a difficult decision when I was ... | Describe a time when you had to make a difficu... | 18 | 0.000159 | 5.2 | 2.9 | 1 | The output provided by the model is completely... | 2 | The output is incomplete and lacks clarity, ma... |
| 5 | Identify the odd one out. | Twitter, Instagram, Telegram | Below is an instruction that describes a task,... | Telegram | Identify the odd one out.\n\n1. A car\n2. A tr... | 18 | 0.072345 | 0.1 | -5.4 | 1 | The output provided by the model is completely... | 2 | The output is not clear and lacks fluency. The... |
| 6 | Explain why the following fraction is equivale... | 4/16 | Below is an instruction that describes a task,... | The fraction 4/16 is equivalent to 1/4 because... | Explain why the following fraction is equivale... | 23 | 0.000320 | 6.4 | 7.6 | 1 | The output provided by the model is completely... | 2 | The output is not clear and does not answer th... |
| 7 | Write a short story in third person narration ... |  | Below is an instruction that describes a task.... | John was at a crossroads in his life. He had j... | Write a short story in third person narration ... | 20 | 0.000247 | 10.7 | 11.1 | 1 | The output provided by the model is completely... | 1 | The output is exactly the same as the input, a... |
| 8 | Render a 3D model of a house |  | Below is an instruction that describes a task.... | <nooutput> This type of instruction cannot be ... | Render a 3D model of a house in Blender - Blen... | 19 | 0.003694 | 5.2 | 2.7 | 1 | The output provided by the model is completely... | 2 | The output is partially understandable but lac... |
| 9 | Evaluate this sentence for spelling and gramma... | He finnished his meal and left the resturant | Below is an instruction that describes a task,... | He finished his meal and left the restaurant. | Evaluate this sentence for spelling and gramma... | 18 | 0.003260 | 4.2 | 6.4 | 1 | The output provided by the model is completely... | 4 | The output is fluent and clear, but it is not ... |
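
Since the eval_results_table is a pandas DataFrame, we can filter it to dig into individual rows. A small sketch that pulls out the judge's justifications for low-scoring answers (the threshold of 2 is an arbitrary choice):

[ ]:
# Inspect the LLM judge's justifications for low answer-quality rows
eval_table = results.tables["eval_results_table"]
low_quality = eval_table[eval_table["answer_quality/v1/score"] <= 2]
print(low_quality[["instruction", "answer_quality/v1/justification"]])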

View Results in the UI

Finally, we can view our evaluation results in the MLflow UI. Selecting our experiment in the left sidebar brings us to the following page. We can see that one run logged our model "mpt-7b-chat", while another run logged the dataset we evaluated.

|evaluation main page|

We click on the Evaluation tab and hide any irrelevant runs.

|evaluation filtering|

We can now choose which columns to group by and which column to compare. In the example below, we are looking at the answer-correctness score for each input-output pair, but we could choose any of the other metrics to compare.

|evaluation selection|

Finally, we arrive at the following view, where we can see the justification and score for answer correctness on each row.

|evaluation comparison|