MLflow LLM Evaluation

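Since the arrival of ChatGPT, large language models (LLMs) have shown strong text-generation capabilities in areas such as question answering, translation, and text summarization. Evaluating an LLM differs somewhat from evaluating a traditional machine learning model, because there is often no single ground truth to compare against. MLflow provides the mlflow.evaluate() API to help you evaluate your LLMs.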

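MLflow's LLM evaluation functionality consists of three main components:

A model to evaluate: an MLflow pyfunc model, a URI pointing to a registered MLflow model, or any Python callable that represents your model, for example a HuggingFace text summarization pipeline.
Metrics: the metrics to compute. LLM evaluation uses LLM metrics.
Evaluation data: the data your model is evaluated on. It can be a pandas DataFrame, a Python list, a numpy array, or an mlflow.data.dataset.Dataset() instance.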

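Full Notebook Guides and Examples

If you are interested in detailed, use-case-oriented guides that showcase the simplicity and power of MLflow's evaluation functionality for LLMs, see the notebook collection below: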

View the Notebook Guides

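Quickstart

Below is a simple example that gives a quick overview of how MLflow LLM evaluation works. It builds a simple question-answering model by wrapping "openai/gpt-4" with a custom prompt. You can paste it into IPython or a local editor and run it, installing any missing dependencies as prompted. Running the code requires an OpenAI API key; if you do not have one, follow the OpenAI guide to set it up.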

export OPENAI_API_KEY='your-api-key-here'

import mlflow
import openai
import os
import pandas as pd
from getpass import getpass

eval_data = pd.DataFrame(
    {
        "inputs": [
            "What is MLflow?",
            "What is Spark?",
        ],
        "ground_truth": [
            "MLflow is an open-source platform for managing the end-to-end machine learning (ML) "
            "lifecycle. It was developed by Databricks, a company that specializes in big data and "
            "machine learning solutions. MLflow is designed to address the challenges that data "
            "scientists and machine learning engineers face when developing, training, and deploying "
            "machine learning models.",
            "Apache Spark is an open-source, distributed computing system designed for big data "
            "processing and analytics. It was developed in response to limitations of the Hadoop "
            "MapReduce computing model, offering improvements in speed and ease of use. Spark "
            "provides libraries for various tasks such as data ingestion, processing, and analysis "
            "through its components like Spark SQL for structured data, Spark Streaming for "
            "real-time data processing, and MLlib for machine learning tasks",
        ],
    }
)

with mlflow.start_run() as run:
    system_prompt = "Answer the following question in two sentences"
    # Wrap "gpt-4" as an MLflow model.
    logged_model_info = mlflow.openai.log_model(
        model="gpt-4",
        task=openai.chat.completions,
        artifact_path="model",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "{question}"},
        ],
    )

    # Use predefined question-answering metrics to evaluate our model.
    results = mlflow.evaluate(
        logged_model_info.model_uri,
        eval_data,
        targets="ground_truth",
        model_type="question-answering",
    )
    print(f"See aggregated evaluation results below: \n{results.metrics}")

    # Evaluation result for each data record is available in `results.tables`.
    eval_table = results.tables["eval_results_table"]
    print(f"See evaluation table below: \n{eval_table}")

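LLM Evaluation Metrics

There are two types of LLM evaluation metrics in MLflow:

Metrics that rely on SaaS models (e.g., OpenAI) for scoring, such as mlflow.metrics.genai.answer_relevance(). These metrics are created with mlflow.metrics.genai.make_genai_metric(). For each data record, they send a prompt containing the following information to the SaaS model under the hood, and extract the score from the model's response:
- Metric definition.
- Metric grading criteria.
- Reference examples.
- Input data/context.
- Model output.
- [Optional] Ground truth.
More details on how these fields are set can be found in the "Creating Custom LLM Evaluation Metrics" section.
Function-based per-row metrics. These metrics compute a score for each data record (a row, in Pandas/Spark DataFrame terms) using a function such as Rouge (mlflow.metrics.rougeL()) or Flesch Kincaid (mlflow.metrics.flesch_kincaid_grade_level()). They are similar to traditional metrics.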
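
To make the idea of a function-based per-row metric concrete, here is a minimal sketch in plain Python. It scores each row independently and then aggregates the per-row scores. The function names and the whitespace stripping are assumptions made for illustration; this is not how MLflow implements its built-in metrics.

```python
from statistics import mean, pvariance


def exact_match_rows(predictions, targets):
    """Return one score per row: 1.0 if the prediction equals the target, else 0.0."""
    # Stripping surrounding whitespace is an assumption of this sketch.
    return [float(p.strip() == t.strip()) for p, t in zip(predictions, targets)]


def aggregate(scores):
    """Summarize per-row scores into dataset-level numbers."""
    return {"mean": mean(scores), "variance": pvariance(scores)}


predictions = ["MLflow", "Spark ", "Delta Lake"]
targets = ["MLflow", "Spark", "Kafka"]

row_scores = exact_match_rows(predictions, targets)  # one score per data record
summary = aggregate(row_scores)                      # aggregated over the dataset
print(row_scores, summary)
```

The per-row list corresponds to what an evaluation table would hold for each record, while the summary corresponds to the aggregated metrics reported for the whole dataset.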

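Selecting Metrics to Evaluate

There are two ways to select metrics for evaluating your model:

Use the default metrics for a predefined model type.
Use a custom list of metrics.

Using Default Metrics for Predefined Model Types

MLflow LLM evaluation includes default collections of metrics for pre-selected tasks, e.g., "question-answering". Depending on the LLM use case you are evaluating, these predefined collections can greatly simplify running an evaluation. To use the default metrics for a pre-selected task, specify the model_type argument in mlflow.evaluate(), as in the following example: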

results = mlflow.evaluate(
    model,
    eval_data,
    targets="ground_truth",
    model_type="question-answering",
)

The supported LLM model types and their associated metrics are listed below:

Question answering: model_type="question-answering":
- exact-match
- toxicity ¹
- ari_grade_level ²
- flesch_kincaid_grade_level ²
Text summarization: model_type="text-summarization":
- ROUGE ³
- toxicity ¹
- ari_grade_level ²
- flesch_kincaid_grade_level ²
Text models: model_type="text":
- toxicity ¹
- ari_grade_level ²
- flesch_kincaid_grade_level ²
Retrievers: model_type="retriever":
- precision_at_k ⁴
- recall_at_k ⁴
- ndcg_at_k ⁴

¹ Requires the packages evaluate, torch, and transformers.

² Requires the package textstat.

³ Requires the packages evaluate, nltk, and rouge-score.

⁴ All retriever metrics have a default retriever_k value of 3, which can be overridden by specifying retriever_k in the evaluator_config argument.
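
As a rough illustration of what the retriever metrics measure and how k affects them, the following plain-Python sketch computes precision and recall over the top k retrieved document IDs. The function names mirror the metric names, but the implementation and its edge-case handling are assumptions of this sketch, not MLflow's code.

```python
def precision_at_k(retrieved, relevant, k=3):
    """Fraction of the top-k retrieved document IDs that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    return hits / len(top_k)


def recall_at_k(retrieved, relevant, k=3):
    """Fraction of the relevant document IDs that appear in the top-k retrieved."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant)


retrieved = ["doc1", "doc4", "doc2", "doc3"]  # ranked retriever output
relevant = {"doc1", "doc2", "doc3"}           # ground-truth relevant documents

p3 = precision_at_k(retrieved, relevant, k=3)
r3 = recall_at_k(retrieved, relevant, k=3)
r4 = recall_at_k(retrieved, relevant, k=4)
print(p3, r3, r4)
```

With k=3 only two of the three relevant documents are found; raising k to 4 recovers the third, which is why the choice of retriever_k changes the reported scores.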

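Using a Custom List of Metrics

Using the predefined metrics associated with a given model type is not the only way to generate scoring metrics for LLM evaluation in MLflow. You can specify a custom list of metrics in the extra_metrics argument of mlflow.evaluate().

To add metrics to the default metrics list of a predefined model type, keep the model_type argument and add your metrics to extra_metrics: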
```
results = mlflow.evaluate(
    model,
    eval_data,
    targets="ground_truth",
    model_type="question-answering",
    extra_metrics=[mlflow.metrics.latency()],
)
```
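The code above evaluates your model using all the default metrics for "question-answering", plus mlflow.metrics.latency().

To disable the default metric calculation and compute only your selected metrics, remove the model_type argument and define the desired metrics: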

results = mlflow.evaluate(
    model,
    eval_data,
    targets="ground_truth",
    extra_metrics=[mlflow.metrics.toxicity(), mlflow.metrics.latency()],
)

The full reference for supported evaluation metrics is available in the MLflow API documentation.

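Metrics with LLM as the Judge

MLflow offers a few pre-canned metrics that use an LLM as the judge. Although they work differently under the hood, the usage is the same: put these metrics in the extra_metrics argument of mlflow.evaluate(). The pre-canned metrics are:

mlflow.metrics.genai.answer_similarity(): Use this metric when you want to evaluate how similar the model-generated output is to the information in the ground_truth. High scores mean that your model outputs contain information similar to the ground_truth, while low scores mean that the outputs may disagree with the ground_truth.
mlflow.metrics.genai.answer_correctness(): Use this metric when you want to evaluate how factually correct the model-generated output is based on the information in the ground_truth. High scores mean that your model outputs contain information similar to the ground_truth and that this information is correct, while low scores mean that the outputs may disagree with the ground_truth or that the information in the outputs is incorrect. Note that this builds on answer_similarity.
mlflow.metrics.genai.answer_relevance(): Use this metric when you want to evaluate how relevant the model-generated output is to the input (context is ignored). High scores mean that your model outputs are about the same subject as the input, while low scores mean that the outputs may be off-topic.
mlflow.metrics.genai.relevance(): Use this metric when you want to evaluate how relevant the model-generated output is with respect to both the input and the context. High scores mean that the model has understood the context and correctly extracted relevant information from it, while low scores mean that the output has completely ignored the question and the context and could be hallucinating.
mlflow.metrics.genai.faithfulness(): Use this metric when you want to evaluate how faithful the model-generated output is to the provided context. High scores mean that the outputs contain information consistent with the context, while low scores mean that the outputs may disagree with the context (input is ignored).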

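Selecting the LLM-as-judge Model

By default, llm-as-judge metrics use openai:/gpt-4 as the judge. You can change the default judge model by passing an override to the model argument within the metric definition, as shown below. In addition to OpenAI models, you can use any endpoint via MLflow Deployments. Use mlflow.deployments.set_deployments_target() to set the target deployment client.

To use an endpoint hosted by a local MLflow AI Gateway, you can use the following code.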

import mlflow
from mlflow.deployments import set_deployments_target

set_deployments_target("http://localhost:5000")
my_answer_similarity = mlflow.metrics.genai.answer_similarity(
    model="endpoints:/my-endpoint"
)

To use an endpoint hosted on Databricks, you can use the following code.

import mlflow
from mlflow.deployments import set_deployments_target

set_deployments_target("databricks")
llama2_answer_similarity = mlflow.metrics.genai.answer_similarity(
    model="endpoints:/databricks-llama-2-70b-chat"
)

For more information on how various models perform as judges, please refer to this blog post.

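Creating Custom LLM Evaluation Metrics

Creating LLM-as-judge Evaluation Metrics (Category 1)

You can also create your own SaaS LLM evaluation metrics with the MLflow API mlflow.metrics.genai.make_genai_metric(), which needs the following information:

name: the name of your custom metric.
definition: describes what the metric does.
grading_prompt: describes the scoring criteria.
examples: a few input/output examples with scores, used as a reference for the LLM judge.
model: the identifier of the LLM judge, in the format "openai:/gpt-4" or "endpoints:/databricks-llama-2-70b-chat".
parameters: extra parameters to send to the LLM judge, e.g., temperature for "openai:/gpt-4o-mini".
aggregations: a list of options for aggregating the per-row scores using numpy functions.
greater_is_better: indicates whether a higher score means a better model.

Under the hood, definition, grading_prompt, and examples, together with the evaluation data and model output, are composed into one long prompt and sent to the LLM. If you are familiar with prompt engineering, the SaaS LLM evaluation metric is basically trying to compose a "right" prompt containing instructions, data, and model output so that the LLM, e.g., GPT-4, can output the information we want.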
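
The composition step described above can be pictured with a small, self-contained sketch. The Example class, the build_judge_prompt function, and the prompt layout are all hypothetical and chosen for illustration only; the actual template MLflow sends to the judge is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Example:
    # Hypothetical stand-in for a graded reference example.
    input: str
    output: str
    score: int
    justification: str


def build_judge_prompt(definition, grading_prompt, examples, input_text, output_text):
    """Concatenate the metric fields and one data record into a single prompt."""
    parts = [
        "Metric definition:\n" + definition,
        "Grading criteria:\n" + grading_prompt,
    ]
    for i, ex in enumerate(examples, start=1):
        parts.append(
            f"Example {i}:\nInput: {ex.input}\nOutput: {ex.output}\n"
            f"Score: {ex.score}\nJustification: {ex.justification}"
        )
    # The record being judged comes last, after the instructions and references.
    parts.append(f"Input:\n{input_text}")
    parts.append(f"Output:\n{output_text}")
    parts.append("Reply with a score and a justification.")
    return "\n\n".join(parts)


prompt = build_judge_prompt(
    definition="Professionalism refers to a formal, respectful style of communication.",
    grading_prompt="Score 0 is extremely casual; score 4 is noticeably formal.",
    examples=[Example("What is MLflow?", "It's like a Swiss Army knife!", 2, "Casual tone.")],
    input_text="What is Spark?",
    output_text="Apache Spark is a distributed computing system.",
)
print(prompt)
```

One such prompt would be built per data record, which is why judge-based metrics cost one LLM call per row.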

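Now let's create a custom GenAI metric called "professionalism", which measures how professional our model outputs are.

Let's first create a few examples with scores; these will be the reference samples the LLM judge uses. To create these examples, we use the mlflow.metrics.genai.EvaluationExample() class, which has 4 fields:

input: the input text.
output: the output text.
score: the score for the output in the context of the input.
justification: why we gave this score for the data.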

professionalism_example_score_2 = mlflow.metrics.genai.EvaluationExample(
    input="What is MLflow?",
    output=(
        "MLflow is like your friendly neighborhood toolkit for managing your machine learning projects. It helps "
        "you track experiments, package your code and models, and collaborate with your team, making the whole ML "
        "workflow smoother. It's like your Swiss Army knife for machine learning!"
    ),
    score=2,
    justification=(
        "The response is written in a casual tone. It uses contractions, filler words such as 'like', and "
        "exclamation points, which make it sound less professional. "
    ),
)
professionalism_example_score_4 = mlflow.metrics.genai.EvaluationExample(
    input="What is MLflow?",
    output=(
        "MLflow is an open-source platform for managing the end-to-end machine learning (ML) lifecycle. It was "
        "developed by Databricks, a company that specializes in big data and machine learning solutions. MLflow is "
        "designed to address the challenges that data scientists and machine learning engineers face when "
        "developing, training, and deploying machine learning models."
    ),
    score=4,
    justification=("The response is written in a formal language and a neutral tone. "),
)

Now let's define the professionalism metric; you will see how each field is set up.

professionalism = mlflow.metrics.genai.make_genai_metric(
    name="professionalism",
    definition=(
        "Professionalism refers to the use of a formal, respectful, and appropriate style of communication that is "
        "tailored to the context and audience. It often involves avoiding overly casual language, slang, or "
        "colloquialisms, and instead using clear, concise, and respectful language."
    ),
    grading_prompt=(
        "Professionalism: If the answer is written using a professional tone, below are the details for different scores: "
        "- Score 0: Language is extremely casual, informal, and may include slang or colloquialisms. Not suitable for "
        "professional contexts. "
        "- Score 1: Language is casual but generally respectful and avoids strong informality or slang. Acceptable in "
        "some informal professional settings. "
        "- Score 2: Language is overall formal but still has casual words/phrases. Borderline for professional contexts. "
        "- Score 3: Language is balanced and avoids extreme informality or formality. Suitable for most professional contexts. "
        "- Score 4: Language is noticeably formal, respectful, and avoids casual elements. Appropriate for formal "
        "business or academic settings. "
    ),
    examples=[professionalism_example_score_2, professionalism_example_score_4],
    model="openai:/gpt-4o-mini",
    parameters={"temperature": 0.0},
    aggregations=["mean", "variance"],
    greater_is_better=True,
)

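Creating Heuristic-based LLM Evaluation Metrics (Category 2)

This is very similar to creating a custom traditional metric, except that it returns an mlflow.metrics.MetricValue() instance. Basically, you need to:

Implement an eval_fn that defines your scoring logic. It must take two arguments, predictions and targets, and must return an mlflow.metrics.MetricValue() instance.
Pass eval_fn and the other arguments to the mlflow.metrics.make_metric API to create the metric.

The following code creates a dummy per-row metric called "over_10_chars": if the model output is longer than 10 characters, the score is "yes"; otherwise it is "no".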

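from mlflow.metrics import MetricValue, make_metric


def eval_fn(predictions, targets):
    # Score each row "yes" if the output is longer than 10 characters.
    scores = []
    for prediction in predictions:
        if len(prediction) > 10:
            scores.append("yes")
        else:
            scores.append("no")
    return MetricValue(
        scores=scores,
        # The scores are strings, so numeric aggregations such as the mean do
        # not apply. "yes_ratio" is an illustrative aggregate chosen here.
        aggregate_results={"yes_ratio": scores.count("yes") / max(len(scores), 1)},
    )


# Create an EvaluationMetric object.
passing_code_metric = make_metric(
    eval_fn=eval_fn, greater_is_better=False, name="over_10_chars"
)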

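To create a custom metric that depends on other metrics, include those metrics' names as arguments after predictions and targets. These can be names of built-in metrics or of other custom metrics. Make sure your metrics do not have accidental circular dependencies, or the evaluation will fail.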
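
To see why a circular dependency makes evaluation impossible, consider the following plain-Python sketch. It reads each metric's dependencies from its function signature and orders the metrics so that dependencies are computed first. The helper names metric_dependencies and evaluation_order are hypothetical, and this is only an illustration of the idea, not MLflow's resolver.

```python
import inspect


def metric_dependencies(eval_fn):
    """Names of other metrics an eval_fn depends on: every parameter after the first two."""
    params = list(inspect.signature(eval_fn).parameters)
    return params[2:]


def evaluation_order(metrics):
    """Order metrics so dependencies come first; raise ValueError on a cycle."""
    order, done, in_progress = [], set(), set()

    def visit(name):
        if name in done:
            return
        if name in in_progress:
            raise ValueError(f"Circular dependency involving metric '{name}'")
        in_progress.add(name)
        for dep in metric_dependencies(metrics[name]):
            if dep in metrics:  # names not in the dict are skipped in this sketch
                visit(dep)
        in_progress.discard(name)
        done.add(name)
        order.append(name)

    for name in metrics:
        visit(name)
    return order


# Placeholder metric functions; only their signatures matter here.
def toxicity(predictions, targets):
    return None


def over_10_chars(predictions, targets):
    return None


def toxic_or_over_10_chars(predictions, targets, toxicity, over_10_chars):
    return None


order = evaluation_order(
    {
        "toxic_or_over_10_chars": toxic_or_over_10_chars,
        "over_10_chars": over_10_chars,
        "toxicity": toxicity,
    }
)
print(order)
```

If two metrics named each other as arguments, no valid order would exist and the sketch raises a ValueError, which mirrors the failure the text warns about.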

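The following code creates a dummy per-row metric called "toxic_or_over_10_chars": if the model output is longer than 10 characters or the toxicity score is greater than 0.5, the score is "yes"; otherwise it is "no".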

def eval_fn(predictions, targets, toxicity, over_10_chars):
    scores = []
    for i in range(len(predictions)):
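        # over_10_chars scores are the strings "yes"/"no", so compare explicitly;
        # a non-empty string such as "no" would otherwise always be truthy.
        if toxicity.scores[i] > 0.5 or over_10_chars.scores[i] == "yes":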
            scores.append("yes")
        else:
            scores.append("no")
    return MetricValue(scores=scores)


# Create an EvaluationMetric object.
toxic_or_over_10_chars_metric = make_metric(
    eval_fn=eval_fn, greater_is_better=False, name="toxic_or_over_10_chars"
)

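Preparing Your LLM for Evaluation

To evaluate your LLM with mlflow.evaluate(), your LLM must be one of the following types:

An mlflow.pyfunc.PyFuncModel() instance, or a URI pointing to a logged mlflow.pyfunc.PyFuncModel model. We generally call this an MLflow model.
A Python function that takes string inputs and outputs a single string. Your callable must match the signature of mlflow.pyfunc.PyFuncModel.predict() (without the params argument); briefly, it should:
- Have data as its only argument, which can be a pandas.DataFrame, numpy.ndarray, Python list, dictionary, or scipy matrix.
- Return one of pandas.DataFrame, pandas.Series, numpy.ndarray, or list.
An MLflow Deployments endpoint URI pointing to a local MLflow AI Gateway, Databricks Foundation Models API, or External Models in Databricks Model Serving.
Set model=None and put the model outputs in the data argument. Only applicable when the data is a Pandas DataFrame.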
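
The shape required of a Python callable can be shown with a minimal stand-in. The echo_model function below is hypothetical and calls no real LLM; it only demonstrates the contract of a single data argument in and a list out, here using a plain Python list as the data.

```python
def echo_model(data):
    """Hypothetical stand-in model: takes a single `data` argument and returns a list."""
    # This sketch accepts a plain Python list of strings. A function evaluated
    # against a pandas DataFrame would receive the DataFrame instead and would
    # need to read the input column from it.
    return [f"You asked: {question}" for question in data]


outputs = echo_model(["What is MLflow?", "What is Spark?"])
print(outputs)
```

The returned list has one entry per input record, which is what lets per-row metrics line predictions up with their inputs and targets.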

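Evaluating with an MLflow Model

For detailed instructions on how to convert your model into an mlflow.pyfunc.PyFuncModel instance, please read this doc. In short, to evaluate your model as an MLflow model, we recommend the following steps:

Package your LLM as an MLflow model and log it to the MLflow server with log_model. Each flavor (openai, pytorch, ...) has its own log_model API, e.g., mlflow.openai.log_model():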

with mlflow.start_run():
    system_prompt = "Answer the following question in two sentences"
    # Wrap "gpt-4o-mini" as an MLflow model.
    logged_model_info = mlflow.openai.log_model(
        model="gpt-4o-mini",
        task=openai.chat.completions,
        artifact_path="model",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "{question}"},
        ],
    )

Use the URI of the logged model as the model instance in mlflow.evaluate():

results = mlflow.evaluate(
    logged_model_info.model_uri,
    eval_data,
    targets="ground_truth",
    model_type="question-answering",
)

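Evaluating with a Custom Function

As of MLflow 2.8.0, mlflow.evaluate() supports evaluating a Python function without requiring the model to be logged to MLflow. This is useful when you do not want to log the model and just want to evaluate it. The following example uses mlflow.evaluate() to evaluate a function. You also need to set up OpenAI authentication to run the code below.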

eval_data = pd.DataFrame(
    {
        "inputs": [
            "What is MLflow?",
            "What is Spark?",
        ],
        "ground_truth": [
            "MLflow is an open-source platform for managing the end-to-end machine learning (ML) lifecycle. It was developed by Databricks, a company that specializes in big data and machine learning solutions. MLflow is designed to address the challenges that data scientists and machine learning engineers face when developing, training, and deploying machine learning models.",
            "Apache Spark is an open-source, distributed computing system designed for big data processing and analytics. It was developed in response to limitations of the Hadoop MapReduce computing model, offering improvements in speed and ease of use. Spark provides libraries for various tasks such as data ingestion, processing, and analysis through its components like Spark SQL for structured data, Spark Streaming for real-time data processing, and MLlib for machine learning tasks",
        ],
    }
)


def openai_qa(inputs):
    answers = []
    system_prompt = "Please answer the following question in formal language."
    for index, row in inputs.iterrows():
        completion = openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": system_prompt},
                # Send the question from the "inputs" column of the current row.
                {"role": "user", "content": row["inputs"]},
            ],
        )
        answers.append(completion.choices[0].message.content)

    return answers


with mlflow.start_run() as run:
    results = mlflow.evaluate(
        openai_qa,
        eval_data,
        model_type="question-answering",
    )

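Evaluating with an MLflow Deployments Endpoint

For MLflow >= 2.11.0, mlflow.evaluate() supports evaluating a model endpoint by passing the MLflow Deployments endpoint URI directly to the model argument. This is particularly useful when you want to evaluate a deployed model hosted by a local MLflow AI Gateway, Databricks Foundation Models API, or External Models in Databricks Model Serving, without implementing custom prediction logic to wrap it as an MLflow model or a Python function.

Before calling mlflow.evaluate() with an endpoint URI, do not forget to set the target deployment client with mlflow.deployments.set_deployments_target(), as shown in the examples below. Otherwise, you will see an error message like MlflowException: No deployments target has been set...

Hint

When you want to use an endpoint that is not hosted by an MLflow AI Gateway or Databricks, you can create a custom Python function following the Evaluating with a Custom Function guide and use it as the model argument.

Supported Input Data Formats

When using the URI of an MLflow Deployments endpoint as the model, the input data can be in one of the following formats: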

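Data Format	Example	Additional Notes
A pandas DataFrame with a string column.	pd.DataFrame( { "inputs": [ "What is MLflow?", "What is Spark?", ] } )	For this input format, MLflow constructs the request payload appropriate for your model endpoint type. For example, if your model is a chat endpoint (`llm/v1/chat`), MLflow wraps your input string in the chat messages format, like `{"messages": [{"role": "user", "content": "What is MLflow?"}]}`. If you want to customize the request payload, e.g., to include a system prompt, use the next format.
A pandas DataFrame with a dictionary column.	pd.DataFrame( { "inputs": [ { "messages": [ {"role": "system", "content": "Please answer."}, {"role": "user", "content": "What is MLflow?"}, ], "max_tokens": 100, }, # ... more dictionary records ] } )	In this format, the dictionary should have the correct request format for your model endpoint. See the MLflow Deployments documentation for more information about the request format for different model endpoint types.
A list of input strings.	[ "What is MLflow?", "What is Spark?", ]	`mlflow.evaluate()` also accepts a list as input.
A list of request payloads (dictionaries).	[ { "messages": [ {"role": "system", "content": "Please answer."}, {"role": "user", "content": "What is MLflow?"}, ], "max_tokens": 100, }, # ... more dictionary records ]	As with the Pandas DataFrame input, the dictionary should have the correct request format for your model endpoint.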
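
The difference between string inputs and dictionary inputs for a chat endpoint can be sketched as follows. The to_chat_payload helper is hypothetical and written only to illustrate the wrapping described in the table; it is not part of the MLflow API.

```python
def to_chat_payload(record):
    """Wrap a plain string as a chat request; pass dictionaries through unchanged."""
    if isinstance(record, str):
        return {"messages": [{"role": "user", "content": record}]}
    if isinstance(record, dict):
        # Assumed to already be in the endpoint's request format.
        return record
    raise TypeError(f"Unsupported input record type: {type(record).__name__}")


custom_request = {
    "messages": [
        {"role": "system", "content": "Please answer."},
        {"role": "user", "content": "What is MLflow?"},
    ],
    "max_tokens": 100,
}

payloads = [to_chat_payload(r) for r in ["What is Spark?", custom_request]]
print(payloads)
```

A string record gains the chat message wrapper, while a dictionary record keeps its system prompt and max_tokens setting exactly as written.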

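Passing Inference Parameters

You can pass additional inference parameters, such as max_tokens, temperature, and n, to the model endpoint by setting the inference_params argument in mlflow.evaluate(). The inference_params argument is a dictionary of parameters to pass to the model endpoint. The specified parameters are applied to all input records in the evaluation dataset.

Note

When your input is a dictionary representing a request payload, it can also include parameters such as max_tokens. If inference_params and the input data contain overlapping parameters, the values in inference_params take precedence.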
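
The precedence rule in the note can be illustrated with a plain dictionary merge. The apply_inference_params function is a hypothetical helper that shows the stated behavior; it does not reproduce how MLflow merges the values internally.

```python
def apply_inference_params(payload, inference_params):
    """Return a new payload in which inference_params override overlapping keys."""
    # Later entries win in a dict merge, so inference_params take precedence.
    return {**payload, **inference_params}


payload = {
    "messages": [{"role": "user", "content": "What is MLflow?"}],
    "max_tokens": 100,
}
merged = apply_inference_params(payload, {"max_tokens": 50, "temperature": 0.0})
print(merged)
```

Here the record asks for max_tokens of 100 but the merged request carries 50, and the original record is left unmodified.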

Examples

Chat Endpoint Hosted by a Local MLflow AI Gateway

import mlflow
from mlflow.deployments import set_deployments_target
import pandas as pd

# Point the client to the local MLflow AI Gateway
set_deployments_target("http://localhost:5000")

eval_data = pd.DataFrame(
    {
        # Input data must be a string column and named "inputs".
        "inputs": [
            "What is MLflow?",
            "What is Spark?",
        ],
        # Additional ground truth data for evaluating the answer
        "ground_truth": [
            "MLflow is an open-source platform ....",
            "Apache Spark is an open-source, ...",
        ],
    }
)


with mlflow.start_run() as run:
    results = mlflow.evaluate(
        model="endpoints:/my-chat-endpoint",
        data=eval_data,
        targets="ground_truth",
        inference_params={"max_tokens": 100, "temperature": 0.0},
        model_type="question-answering",
    )

Completion Endpoint Hosted on the Databricks Foundation Models API

import mlflow
from mlflow.deployments import set_deployments_target
import pandas as pd

# Point the client to Databricks Foundation Models API
set_deployments_target("databricks")

eval_data = pd.DataFrame(
    {
        # Input data must be a string column and named "inputs".
        "inputs": [
            "Write 3 reasons why you should use MLflow?",
            "Can you explain the difference between classification and regression?",
        ],
    }
)


with mlflow.start_run() as run:
    results = mlflow.evaluate(
        model="endpoints:/databricks-mpt-7b-instruct",
        data=eval_data,
        inference_params={"max_tokens": 100, "temperature": 0.0},
        model_type="text",
    )

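Evaluating External Models in Databricks Model Serving can be done in the same way; you just need to specify a different URI that points to the serving endpoint, such as "endpoints:/your-chat-endpoint".

Evaluating with a Static Dataset

For MLflow >= 2.8.0, mlflow.evaluate() supports evaluating a static dataset without specifying a model. This is useful when you have saved the model outputs to a column in a Pandas DataFrame or an MLflow PandasDataset and want to evaluate the static dataset without re-running the model.

If you are using a Pandas DataFrame, you must specify the column name that contains the model outputs using the top-level predictions argument of mlflow.evaluate():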

import mlflow
import pandas as pd

eval_data = pd.DataFrame(
    {
        "inputs": [
            "What is MLflow?",
            "What is Spark?",
        ],
        "ground_truth": [
            "MLflow is an open-source platform for managing the end-to-end machine learning (ML) lifecycle. "
            "It was developed by Databricks, a company that specializes in big data and machine learning solutions. "
            "MLflow is designed to address the challenges that data scientists and machine learning engineers "
            "face when developing, training, and deploying machine learning models.",
            "Apache Spark is an open-source, distributed computing system designed for big data processing and "
            "analytics. It was developed in response to limitations of the Hadoop MapReduce computing model, "
            "offering improvements in speed and ease of use. Spark provides libraries for various tasks such as "
            "data ingestion, processing, and analysis through its components like Spark SQL for structured data, "
            "Spark Streaming for real-time data processing, and MLlib for machine learning tasks",
        ],
        "predictions": [
            "MLflow is an open-source platform that provides handy tools to manage Machine Learning workflow "
            "lifecycle in a simple way",
            "Spark is a popular open-source distributed computing system designed for big data processing and analytics.",
        ],
    }
)

with mlflow.start_run() as run:
    results = mlflow.evaluate(
        data=eval_data,
        targets="ground_truth",
        predictions="predictions",
        extra_metrics=[mlflow.metrics.genai.answer_similarity()],
        evaluators="default",
    )
    print(f"See aggregated evaluation results below: \n{results.metrics}")

    eval_table = results.tables["eval_results_table"]
    print(f"See evaluation table below: \n{eval_table}")

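Viewing Evaluation Results

Viewing Evaluation Results via Code

mlflow.evaluate() returns the evaluation results as an mlflow.models.EvaluationResult() instance. To see the scores of the selected metrics, you can check:

metrics: stores the aggregated results, such as the mean/variance across the evaluation dataset. Let's take a second pass on the code example above and focus on printing the aggregated results.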

with mlflow.start_run() as run:
    results = mlflow.evaluate(
        data=eval_data,
        targets="ground_truth",
        predictions="predictions",
        extra_metrics=[mlflow.metrics.genai.answer_similarity()],
        evaluators="default",
    )
    print(f"See aggregated evaluation results below: \n{results.metrics}")

tables["eval_results_table"]: stores the per-row evaluation results.

with mlflow.start_run() as run:
    results = mlflow.evaluate(
        data=eval_data,
        targets="ground_truth",
        predictions="predictions",
        extra_metrics=[mlflow.metrics.genai.answer_similarity()],
        evaluators="default",
    )
    print(
        f"See per-data evaluation results below: \n{results.tables['eval_results_table']}"
    )

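Viewing Evaluation Results via the MLflow UI

Your evaluation results are automatically logged to the MLflow server, so you can view them directly in the MLflow UI. To view the evaluation results in the MLflow UI, follow these steps:

Go to the experiment view of your MLflow experiment.
Select the "Evaluation" tab.
Select the runs whose evaluation results you want to check.
Select the metrics from the dropdown menu on the right side.

See the screenshot below for a clearer picture: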