mlflow.metrics

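The mlflow.metrics module helps you measure your models quantitatively and qualitatively.

class mlflow.metrics.EvaluationMetric(eval_fn, name, greater_is_better, long_name=None, version=None, metric_details=None, metric_metadata=None, genai_metric_args=None)

An evaluation metric.

Parameters:

eval_fn – A function that computes the metric. Its expected signature is given under make_metric later in this page.
name – The name of the metric.
greater_is_better – Whether a higher value of the metric is better.
long_name – (Optional) The long name of the metric. For example, "mean_squared_error" for "mse".
version – (Optional) The metric version. For example, v1.
metric_details – (Optional) A description of the metric and how it is calculated.
metric_metadata – (Optional) A dictionary containing metadata for the metric.
genai_metric_args – (Optional) A dictionary containing the arguments the user specified when calling make_genai_metric or make_genai_metric_from_prompt. These arguments are persisted so that the same metric object can be deserialized later.

These EvaluationMetrics are used by the mlflow.evaluate() API. They are either computed automatically depending on the model_type or specified via the extra_metrics parameter.

The following code demonstrates how to use mlflow.evaluate() with an EvaluationMetric.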

import mlflow
import pandas as pd
from mlflow.metrics.genai import EvaluationExample, answer_similarity

eval_df = pd.DataFrame(
    {
        "inputs": [
            "What is MLflow?",
        ],
        "ground_truth": [
            "MLflow is an open-source platform for managing the end-to-end machine learning lifecycle. It was developed by Databricks, a company that specializes in big data and machine learning solutions. MLflow is designed to address the challenges that data scientists and machine learning engineers face when developing, training, and deploying machine learning models.",
        ],
    }
)

example = EvaluationExample(
    input="What is MLflow?",
    output="MLflow is an open-source platform for managing machine "
    "learning workflows, including experiment tracking, model packaging, "
    "versioning, and deployment, simplifying the ML lifecycle.",
    score=4,
    justification="The definition effectively explains what MLflow is "
    "its purpose, and its developer. It could be more concise for a 5-score.",
    grading_context={
        "ground_truth": "MLflow is an open-source platform for managing "
        "the end-to-end machine learning (ML) lifecycle. It was developed by Databricks, "
        "a company that specializes in big data and machine learning solutions. MLflow is "
        "designed to address the challenges that data scientists and machine learning "
        "engineers face when developing, training, and deploying machine learning models."
    },
)
answer_similarity_metric = answer_similarity(examples=[example])
# `logged_model` is assumed to be the return value of an earlier log_model() call (not shown).
results = mlflow.evaluate(
    logged_model.model_uri,
    eval_df,
    targets="ground_truth",
    model_type="question-answering",
    extra_metrics=[answer_similarity_metric],
)

Information about how an EvaluationMetric is calculated, such as the grading prompt used, is available via the metric_details attribute.

import mlflow
from mlflow.metrics.genai import relevance

my_relevance_metric = relevance()
print(my_relevance_metric.metric_details)

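Evaluation results are stored as MetricValue objects. Aggregate results are logged to the MLflow run as metrics, while per-example results are logged to the MLflow run as artifacts in the form of an evaluation table.

class mlflow.metrics.MetricValue(scores=None, justifications=None, aggregate_results=None)

Note

Experimental: This class may change or be removed in a future release without warning.

The value of a metric.

Parameters:

scores – The metric value for each row.
justifications – The justification for each score, if applicable.
aggregate_results – A dictionary mapping the name of each aggregation to its value.

We provide the following built-in factory functions for creating EvaluationMetrics to evaluate models. These metrics are computed automatically depending on the model_type. For more information on the model_type parameter, see the mlflow.evaluate() API.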

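Regressor Metrics

mlflow.metrics.mae() → EvaluationMetric

This function will create a metric for evaluating mae.

This metric computes an aggregate score for the mean absolute error for regression.

mlflow.metrics.mape() → EvaluationMetric

This function will create a metric for evaluating mape.

This metric computes an aggregate score for the mean absolute percentage error for regression.

mlflow.metrics.max_error() → EvaluationMetric

This function will create a metric for evaluating max_error.

This metric computes an aggregate score for the maximum residual error for regression.

mlflow.metrics.mse() → EvaluationMetric

This function will create a metric for evaluating mse.

This metric computes an aggregate score for the mean squared error for regression.

mlflow.metrics.rmse() → EvaluationMetric

This function will create a metric for evaluating the square root of mse.

This metric computes an aggregate score for the root mean squared error for regression.

mlflow.metrics.r2_score() → EvaluationMetric

This function will create a metric for evaluating r2_score.

This metric computes an aggregate score for the coefficient of determination. R2 ranges from negative infinity to 1 and measures the percentage of variance explained by the predictor variables in a regression.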
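As a rough illustration of what the regressor metrics above measure, the following plain-Python sketch computes each quantity from its textbook definition. It is not MLflow's implementation; the function name regression_metrics and the sample values are made up for this example, and it assumes non-zero, non-constant targets.

```python
import math


def regression_metrics(y_true, y_pred):
    """Textbook definitions of the regression metrics (illustrative only)."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "mae": sum(abs(e) for e in errors) / n,
        # Assumes no target is zero; otherwise mape is undefined.
        "mape": sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n,
        "max_error": max(abs(e) for e in errors),
        "mse": mse,
        "rmse": math.sqrt(mse),
        # Assumes the targets are not all identical (ss_tot > 0).
        "r2_score": 1 - ss_res / ss_tot,
    }


scores = regression_metrics([3.0, 5.0, 2.0, 8.0], [2.0, 5.0, 4.0, 7.0])
print(scores)
```

Note that rmse is simply the square root of mse, which is why the two metrics always rank models in the same order.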

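Classifier Metrics

mlflow.metrics.precision_score() → EvaluationMetric

This function will create a metric for evaluating precision for classification.

This metric computes an aggregate score between 0 and 1 for the precision of a classification task.

mlflow.metrics.recall_score() → EvaluationMetric

This function will create a metric for evaluating recall for classification.

This metric computes an aggregate score between 0 and 1 for the recall of a classification task.

mlflow.metrics.f1_score() → EvaluationMetric

This function will create a metric for evaluating f1_score for binary classification.

This metric computes an aggregate score between 0 and 1 for the F1 score (F-measure) of a classification task. The F1 score is defined as 2 * (precision * recall) / (precision + recall).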
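The F1 definition above can be checked by hand with a small sketch. This is a plain-Python illustration for the binary case, not the code MLflow runs; the helper name binary_precision_recall_f1 and the labels are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
def binary_precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for one positive class (illustrative only)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Convention for this sketch: a zero denominator yields 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f1


precision, recall, f1 = binary_precision_recall_f1(
    y_true=[1, 0, 1, 1, 0],
    y_pred=[1, 1, 1, 0, 0],
)
print(precision, recall, f1)
```

Here there are two true positives, one false positive, and one false negative, so precision, recall, and F1 all equal 2/3.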

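Text Metrics

mlflow.metrics.ari_grade_level() → EvaluationMetric

Note

Experimental: This function may change or be removed in a future release without warning.

This function will create a metric for calculating the automated readability index using textstat.

This metric outputs a number that approximates the grade level needed to comprehend the text, which will likely range from around 0 to 15 (although it is not limited to this range).

Aggregations calculated for this metric:

mean

mlflow.metrics.flesch_kincaid_grade_level() → EvaluationMetric

Note

Experimental: This function may change or be removed in a future release without warning.

This function will create a metric for calculating the Flesch-Kincaid grade level using textstat.

This metric outputs a number that approximates the grade level needed to comprehend the text, which will likely range from around 0 to 15 (although it is not limited to this range).

Aggregations calculated for this metric:

mean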

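Question Answering Metrics

Includes all of the Text Metrics above, plus the following:

mlflow.metrics.exact_match() → EvaluationMetric

Note

Experimental: This function may change or be removed in a future release without warning.

This function will create a metric for calculating accuracy using sklearn.

This metric only computes an aggregate score, which ranges from 0 to 1.

mlflow.metrics.rouge1() → EvaluationMetric

Note

Experimental: This function may change or be removed in a future release without warning.

This function will create a metric for evaluating rouge1.

The score ranges from 0 to 1, where a higher score indicates higher similarity. rouge1 uses unigram-based scoring to calculate similarity.

Aggregations calculated for this metric:

mean

mlflow.metrics.rouge2() → EvaluationMetric

Note

Experimental: This function may change or be removed in a future release without warning.

This function will create a metric for evaluating rouge2.

The score ranges from 0 to 1, where a higher score indicates higher similarity. rouge2 uses bigram-based scoring to calculate similarity.

Aggregations calculated for this metric:

mean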
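To make "unigram-based" and "bigram-based" scoring concrete, here is a minimal sketch of the ROUGE-N F-measure built on clipped n-gram overlap. It is an illustration only: it assumes whitespace tokenization and lowercasing, and that the F-measure is the reported variant, so values from the library MLflow actually uses may differ. The function names are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(prediction, reference, n):
    """ROUGE-N F-measure with naive whitespace tokenization (a sketch)."""
    pred = ngram_counts(prediction.lower().split(), n)
    ref = ngram_counts(reference.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


prediction = "the cat sat on the mat"
reference = "the cat lay on the mat"
rouge1 = rouge_n_f1(prediction, reference, n=1)
rouge2 = rouge_n_f1(prediction, reference, n=2)
print(rouge1, rouge2)
```

The two sentences share five of six unigrams but only three of five bigrams, so the bigram score is lower: longer n-grams penalize word-order and wording differences more heavily.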

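mlflow.metrics.rougeL() → EvaluationMetric

Note

Experimental: This function may change or be removed in a future release without warning.

This function will create a metric for evaluating rougeL.

The score ranges from 0 to 1, where a higher score indicates higher similarity. rougeL uses longest-common-subsequence-based scoring to calculate similarity.

Aggregations calculated for this metric:

mean

mlflow.metrics.rougeLsum() → EvaluationMetric

Note

Experimental: This function may change or be removed in a future release without warning.

This function will create a metric for evaluating rougeLsum.

The score ranges from 0 to 1, where a higher score indicates higher similarity. rougeLsum uses longest-common-subsequence-based scoring to calculate similarity.

Aggregations calculated for this metric:

mean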
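Longest-common-subsequence scoring can be sketched in a few lines. The example below computes an LCS-based F-measure over whitespace tokens; it is a simplified illustration under those assumptions (tokenization, lowercasing, F-measure as the reported value), not the exact computation behind the metrics above, and the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction, reference):
    """LCS-based F-measure with naive whitespace tokenization (a sketch)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


rouge_l = rouge_l_f1("the cat sat on the mat", "the cat lay on the mat")
print(rouge_l)
```

Unlike n-gram overlap, a subsequence does not need to be contiguous, so the shared words "the cat ... on the mat" count as a single match of length five.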

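mlflow.metrics.toxicity() → EvaluationMetric

Note

Experimental: This function may change or be removed in a future release without warning.

This function will create a metric for evaluating toxicity using the model roberta-hate-speech-dynabench-r4, which defines hate as "abusive speech targeting specific group characteristics, such as ethnic origin, religion, gender, or sexual orientation."

The score ranges from 0 to 1, where scores closer to 1 are more toxic. The default threshold for a text to be considered "toxic" is 0.5.

Aggregations calculated for this metric:

ratio (of toxic input texts)

mlflow.metrics.token_count() → EvaluationMetric

Note

Experimental: This function may change or be removed in a future release without warning.

This function will create a metric for calculating token_count. Token count is calculated using tiktoken with the cl100k_base tokenizer.

mlflow.metrics.latency() → EvaluationMetric

Note

Experimental: This function may change or be removed in a future release without warning.

This function will create a metric for calculating latency. Latency is determined by the time it takes to generate a prediction for a given input. Note that computing latency requires each row to be predicted sequentially, which will likely slow down the evaluation process.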

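Retriever Metrics

The following metrics are built-in metrics for the 'retriever' model type, meaning they will be automatically calculated with a default retriever_k value of 3.

To evaluate document retrieval models, it is recommended to use a dataset with the following columns:

Input queries
Retrieved relevant doc IDs
Ground-truth doc IDs

Alternatively, you can provide a function through the model parameter to represent your retrieval model. The function should take a Pandas DataFrame containing input queries and ground-truth relevant doc IDs, and return a DataFrame with a column of retrieved relevant doc IDs.

A "doc ID" is a string or integer that uniquely identifies a document. Each row of the retrieved and ground-truth doc ID columns should consist of a list or numpy array of doc IDs.

Parameters:

targets: A string specifying the column name of the ground-truth relevant doc IDs.
predictions: A string specifying the column name of the retrieved relevant doc IDs, either in the static dataset or in the DataFrame returned by the model function.
retriever_k: A positive integer specifying the number of retrieved doc IDs to consider for each input query. retriever_k defaults to 3. You can change retriever_k through the mlflow.evaluate() API: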

# with a model and using `evaluator_config`
mlflow.evaluate(
    model=retriever_function,
    data=data,
    targets="ground_truth",
    model_type="retriever",
    evaluators="default",
    evaluator_config={"retriever_k": 5}
)

# with a static dataset and using `extra_metrics`
mlflow.evaluate(
    data=data,
    predictions="predictions_param",
    targets="targets_param",
    model_type="retriever",
    extra_metrics = [
        mlflow.metrics.precision_at_k(5),
        mlflow.metrics.precision_at_k(6),
        mlflow.metrics.recall_at_k(5),
        mlflow.metrics.ndcg_at_k(5)
    ]
)

Note: In the second method, it is recommended to also omit model_type. Otherwise, precision@3 and recall@3 will be calculated in addition to precision@5, precision@6, recall@5, and ndcg@5.

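mlflow.metrics.precision_at_k(k) → EvaluationMetric

Note

Experimental: This function may change or be removed in a future release without warning.

This function will create a metric for calculating precision_at_k for retriever models.

This metric computes a score between 0 and 1 for each row, representing the precision of the retriever model at the given k value. If no relevant documents are retrieved, the score is 0. In all other cases, let x = min(k, number of retrieved doc IDs); the precision at k is then calculated as follows:

precision_at_k = (number of relevant retrieved doc IDs among the top x ranked documents) / x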
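The per-row rule above can be written out as a short function. This is a plain-Python sketch that follows the description literally and is not taken from MLflow's source; the function name mirrors the metric only for readability, and the doc IDs are made up.

```python
def precision_at_k(retrieved, ground_truth, k):
    """Per-row precision at k, following the description above (a sketch)."""
    truth = set(ground_truth)
    # Slicing gives the top x documents, where x = min(k, len(retrieved)).
    top = list(retrieved)[:k]
    relevant = sum(1 for doc_id in top if doc_id in truth)
    if relevant == 0:
        # Also covers the case where nothing was retrieved at all.
        return 0.0
    return relevant / len(top)


score = precision_at_k(
    retrieved=["a", "b", "c", "d"],
    ground_truth=["a", "c", "x"],
    k=3,
)
print(score)
```

Because the denominator is x rather than k, a retriever that returns a single correct document still scores 1.0 at k = 3.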

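mlflow.metrics.recall_at_k(k) → EvaluationMetric

Note

Experimental: This function may change or be removed in a future release without warning.

This function will create a metric for calculating recall_at_k for retriever models.

This metric computes a score between 0 and 1 for each row, representing the recall ability of the retriever model at the given k value. If no ground-truth doc IDs are provided and no documents are retrieved, the score is 1. However, if no ground-truth doc IDs are provided and documents are retrieved, the score is 0. In all other cases, the recall at k is calculated as follows:

recall_at_k = (number of unique relevant retrieved doc IDs among the top k ranked documents) / (number of ground-truth doc IDs)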
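The recall rule, including its two edge cases, might be sketched as follows. This is an illustration written from the description above rather than MLflow's code; it assumes the ground-truth list contains no duplicate IDs, and the sample doc IDs are made up.

```python
def recall_at_k(retrieved, ground_truth, k):
    """Per-row recall at k, following the description above (a sketch)."""
    # Assumption: ground-truth IDs are unique, so the set size is their count.
    truth = set(ground_truth)
    top = list(retrieved)[:k]
    if not truth:
        # Nothing to find: perfect if nothing was retrieved, else 0.
        return 1.0 if not top else 0.0
    # A set intersection counts each relevant doc ID only once.
    hits = truth.intersection(top)
    return len(hits) / len(truth)


score = recall_at_k(retrieved=["a", "a", "b"], ground_truth=["a", "c"], k=3)
print(score)
```

The duplicate "a" in the retrieved list is counted once, so one of the two ground-truth documents is found and the score is 0.5.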

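mlflow.metrics.ndcg_at_k(k) → EvaluationMetric

Note

Experimental: This function may change or be removed in a future release without warning.

This function will create a metric for evaluating NDCG@k for retriever models.

The NDCG score is capable of handling non-binary notions of relevance. However, for simplicity, binary relevance is used here: documents in the ground truth have a relevance score of 1, and documents not in the ground truth have a relevance score of 0.

The NDCG score is calculated using sklearn.metrics.ndcg_score, with the following edge cases handled on top of the sklearn implementation:

If no ground-truth doc IDs are provided and no documents are retrieved, the score is 1.
If no ground-truth doc IDs are provided and documents are retrieved, the score is 0.
If ground-truth doc IDs are provided and no documents are retrieved, the score is 0.
If duplicate doc IDs are retrieved and those duplicate doc IDs are in the ground truth, they will be treated as different documents. For example, if the ground-truth doc IDs are [1, 2] and the retrieved doc IDs are [1, 1, 1, 3], the score will be equivalent to ground-truth doc IDs [10, 11, 12, 2] and retrieved doc IDs [10, 11, 12, 3].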
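For intuition about binary-relevance NDCG and the first three edge cases, here is a textbook rank-based sketch in plain Python. It is deliberately not the sklearn-based computation described above: results may differ from MLflow's (for example in how tied scores are handled), and it assumes the retrieved list contains no duplicate IDs. The function name and sample IDs are illustrative.

```python
import math


def ndcg_at_k(retrieved, ground_truth, k):
    """Textbook binary-relevance NDCG@k (a sketch, not MLflow's computation)."""
    truth = set(ground_truth)
    retrieved = list(retrieved)
    if not truth:
        return 1.0 if not retrieved else 0.0
    if not retrieved:
        return 0.0
    # Gain is 1 for a ground-truth document and 0 otherwise.
    gains = [1.0 if doc_id in truth else 0.0 for doc_id in retrieved[:k]]
    dcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))
    # Ideal ranking: all relevant documents first, capped at k positions.
    ideal_hits = min(k, len(truth))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg


score = ndcg_at_k(retrieved=[1, 3, 2], ground_truth=[1, 2], k=3)
print(score)
```

In this example the second relevant document appears at rank 3 instead of rank 2, so its contribution is discounted and the score falls below 1.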

Users can create their own EvaluationMetric using the make_metric factory function.

mlflow.metrics.make_metric(*, eval_fn, greater_is_better, name=None, long_name=None, version=None, metric_details=None, metric_metadata=None, genai_metric_args=None)

A factory function to create an EvaluationMetric object.

Parameters:

eval_fn –

A function that computes the metric with the following signature:

def eval_fn(
    predictions: pandas.Series,
    targets: pandas.Series,
    metrics: Dict[str, MetricValue],
    **kwargs,
) -> Union[float, MetricValue]:
    """
    Args:
        predictions: A pandas Series containing the predictions made by the model.
        targets: (Optional) A pandas Series containing the corresponding labels
            for the predictions made on that input.
        metrics: (Optional) A dictionary containing the metrics calculated by the
            default evaluator.  The keys are the names of the metrics and the values
            are the metric values.  To access the MetricValue for the metrics
            calculated by the system, make sure to specify the type hint for this
            parameter as Dict[str, MetricValue].  Refer to the DefaultEvaluator
            behavior section for what metrics will be returned based on the type of
            model (i.e. classifier or regressor).
        kwargs: Includes a list of args that are used to compute the metric. These
            args could be information coming from input data, model outputs,
            other metrics, or parameters specified in the `evaluator_config`
            argument of the `mlflow.evaluate` API.

    Returns: MetricValue with per-row scores, per-row justifications, and aggregate
        results.
    """
    ...

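greater_is_better – Whether a higher value of the metric is better.
name – The name of the metric. This argument must be specified if eval_fn is a lambda function or the eval_fn.__name__ attribute is not available.
long_name – (Optional) The long name of the metric. For example, "mean_squared_error" for "mse".
version – (Optional) The metric version. For example, v1.
metric_details – (Optional) A description of the metric and how it is calculated.
metric_metadata – (Optional) A dictionary containing metadata for the metric.
genai_metric_args – (Optional) A dictionary containing the arguments the user specified when calling make_genai_metric or make_genai_metric_from_prompt. These arguments are persisted so that the same metric object can be deserialized later.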
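As a minimal sketch of an eval_fn that fits the signature above, the function below returns a single float. The name mean_absolute_gap is hypothetical, and plain lists stand in for the pandas Series that would be passed during evaluation, so the example runs without MLflow installed.

```python
def mean_absolute_gap(predictions, targets, metrics=None, **kwargs):
    """Hypothetical eval_fn: average absolute difference per row.

    Only iterates over its inputs, so it accepts pandas Series or plain lists.
    The `metrics` and `kwargs` arguments are accepted but unused here.
    """
    pairs = list(zip(predictions, targets))
    return sum(abs(p - t) for p, t in pairs) / len(pairs)


# Plain lists stand in for the Series supplied during evaluation.
score = mean_absolute_gap([2.0, 5.0, 4.0], [3.0, 5.0, 2.0])
print(score)
```

Such a function could then be wrapped with make_metric(eval_fn=mean_absolute_gap, greater_is_better=False) and supplied through extra_metrics. Since it is a named function rather than a lambda, the name argument can be left out, as described in the parameter list above.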


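Generative AI Metrics

We also provide generative AI ("genai") EvaluationMetrics for evaluating text models. These metrics use an LLM to evaluate the quality of a model's output text. Note that your use of a third-party LLM service (e.g., OpenAI) for evaluation may be subject to and governed by the LLM service's terms of use. The following factory functions help you customize the genai metric to your use case.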

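mlflow.metrics.genai.answer_correctness(model: str | None = None, metric_version: str | None = None, examples: List[EvaluationExample] | None = None, metric_metadata: Dict[str, Any] | None = None) → EvaluationMetric

Note

Experimental: This function may change or be removed in a future release without warning.

This function will create a genai metric used to evaluate the answer correctness of an LLM using the model provided. Answer correctness will be assessed by the accuracy of the provided output based on the ground_truth, which should be specified in the targets column.

The targets eval_arg must be provided as part of the input dataset or output predictions. This can be mapped to a column of a different name using col_mapping in the evaluator_config parameter, or by using the targets parameter in mlflow.evaluate().

An MlflowException will be raised if the specified version for this metric does not exist.

Parameters:

model – Model URI of an openai or gateway judge model in the format of "openai:/gpt-4" or "gateway:/my-route". Defaults to "openai:/gpt-4". Your use of a third-party LLM service (e.g., OpenAI) for evaluation may be subject to and governed by the LLM service's terms of use.
metric_version – The version of the answer correctness metric to use. Defaults to the latest version.
examples – A list of examples to help the judge model evaluate answer correctness. It is highly recommended to add examples to be used as a reference when evaluating new results.
metric_metadata – (Optional) Dictionary of metadata to be attached to the EvaluationMetric object. Useful for model evaluators that require additional information to determine how to evaluate this metric.

Returns:

A metric object.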

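mlflow.metrics.genai.answer_relevance(model: str | None = None, metric_version: str | None = 'v1', examples: List[EvaluationExample] | None = None, metric_metadata: Dict[str, Any] | None = None) → EvaluationMetric

Note

Experimental: This function may change or be removed in a future release without warning.

This function will create a genai metric used to evaluate the answer relevance of an LLM using the model provided. Answer relevance will be assessed based on the appropriateness and applicability of the output with respect to the input.

An MlflowException will be raised if the specified version for this metric does not exist.

Parameters:

model – Model URI of an openai or gateway judge model in the format of "openai:/gpt-4" or "gateway:/my-route". Defaults to "openai:/gpt-4". Your use of a third-party LLM service (e.g., OpenAI) for evaluation may be subject to and governed by the LLM service's terms of use.
metric_version – The version of the answer relevance metric to use. Defaults to the latest version.
examples – A list of examples to help the judge model evaluate answer relevance. It is highly recommended to add examples to be used as a reference when evaluating new results.
metric_metadata – (Optional) Dictionary of metadata to be attached to the EvaluationMetric object. Useful for model evaluators that require additional information to determine how to evaluate this metric.

Returns:

A metric object.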

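mlflow.metrics.genai.answer_similarity(model: str | None = None, metric_version: str | None = None, examples: List[EvaluationExample] | None = None, metric_metadata: Dict[str, Any] | None = None) → EvaluationMetric

Note

Experimental: This function may change or be removed in a future release without warning.

This function will create a genai metric used to evaluate the answer similarity of an LLM using the model provided. Answer similarity will be assessed by the semantic similarity of the output to the ground_truth, which should be specified in the targets column.

The targets eval_arg must be provided as part of the input dataset or output predictions. This can be mapped to a column of a different name using col_mapping in the evaluator_config parameter, or by using the targets parameter in mlflow.evaluate().

An MlflowException will be raised if the specified version for this metric does not exist.

Parameters:

model – (Optional) Model URI of an openai or gateway judge model in the format of "openai:/gpt-4" or "gateway:/my-route". Defaults to "openai:/gpt-4". Your use of a third-party LLM service (e.g., OpenAI) for evaluation may be subject to and governed by the LLM service's terms of use.
metric_version – (Optional) The version of the answer similarity metric to use. Defaults to the latest version.
examples – (Optional) A list of examples to help the judge model evaluate answer similarity. It is highly recommended to add examples to be used as a reference when evaluating new results.
metric_metadata – (Optional) Dictionary of metadata to be attached to the EvaluationMetric object. Useful for model evaluators that require additional information to determine how to evaluate this metric.

Returns:

A metric object.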

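mlflow.metrics.genai.faithfulness(model: str | None = None, metric_version: str | None = 'v1', examples: List[EvaluationExample] | None = None, metric_metadata: Dict[str, Any] | None = None) → EvaluationMetric

Note

Experimental: This function may change or be removed in a future release without warning.

This function will create a genai metric used to evaluate the faithfulness of an LLM using the model provided. Faithfulness will be assessed based on how factually consistent the output is with the context.

The context eval_arg must be provided as part of the input dataset or output predictions. This can be mapped to a column of a different name using col_mapping in the evaluator_config parameter.

An MlflowException will be raised if the specified version for this metric does not exist.

Parameters:

model – Model URI of an openai or gateway judge model in the format of "openai:/gpt-4" or "gateway:/my-route". Defaults to "openai:/gpt-4". Your use of a third-party LLM service (e.g., OpenAI) for evaluation may be subject to and governed by the LLM service's terms of use.
metric_version – The version of the faithfulness metric to use. Defaults to the latest version.
examples – A list of examples to help the judge model evaluate faithfulness. It is highly recommended to add examples to be used as a reference when evaluating new results.
metric_metadata – (Optional) Dictionary of metadata to be attached to the EvaluationMetric object. Useful for model evaluators that require additional information to determine how to evaluate this metric.

Returns:

A metric object.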

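mlflow.metrics.genai.make_genai_metric_from_prompt(name: str, judge_prompt: str | None = None, model: str | None = 'openai:/gpt-4', parameters: Dict[str, Any] | None = None, aggregations: List[str] | None = None, greater_is_better: bool = True, max_workers: int = 10, metric_metadata: Dict[str, Any] | None = None) → EvaluationMetric

Note

Experimental: This function may change or be removed in a future release without warning.

Create a genai metric used to evaluate an LLM using an LLM as a judge in MLflow. This produces a metric using only the supplied judge prompt, without any pre-written system prompt. This can be useful for use cases that are not covered by the full grading prompt in any EvaluationModel version.

Parameters:

name – The name of the metric.
judge_prompt – The entire prompt to be used for the judge model. The prompt will be minimally wrapped in formatting instructions to ensure scores can be parsed. The prompt may use f-string formatting to include variables. Corresponding variables must be passed as keyword arguments into the resulting metric's eval function.
model – (Optional) Model URI of an openai, gateway, or deployments judge model in the format of "openai:/gpt-4", "gateway:/my-route", or "endpoints:/databricks-llama-2-70b-chat". Defaults to "openai:/gpt-4". If using Azure OpenAI, the OPENAI_DEPLOYMENT_NAME environment variable will take precedence. Your use of a third-party LLM service (e.g., OpenAI) for evaluation may be subject to and governed by the LLM service's terms of use.
parameters – (Optional) Parameters for the LLM used to compute the metric. By default, the temperature is set to 0.0, max_tokens to 200, and top_p to 1.0. We recommend setting the temperature to 0.0 for the LLM used as a judge to ensure consistent results.
aggregations – (Optional) The list of options used to aggregate the scores. Currently supported options are: min, max, mean, median, variance, p90.
greater_is_better – (Optional) Whether the metric is better when it is greater.
max_workers – (Optional) The maximum number of workers to use for judge scoring. Defaults to 10 workers.
metric_metadata – (Optional) Dictionary of metadata to be attached to the EvaluationMetric object. Useful for model evaluators that require additional information to determine how to evaluate this metric.

Returns:

A metric object.

Example for creating a genai metric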

from mlflow.metrics.genai import make_genai_metric_from_prompt

metric = make_genai_metric_from_prompt(
    name="ease_of_understanding",
    judge_prompt=(
        "You must evaluate the output of a bot based on how easy it is to "
        "understand its outputs. "
        "Evaluate the bot's output from the perspective of a layperson. "
        "The bot was provided with this input: {input} and this output: {output}."
    ),
    model="openai:/gpt-4",
    parameters={"temperature": 0.0},
    aggregations=["mean", "variance", "p90"],
    greater_is_better=True,
)
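To show what the aggregations option summarizes, the following standard-library sketch reduces a list of per-row scores to the supported aggregation names. It is not MLflow's implementation: the use of population variance and of linear interpolation for p90 are assumptions made for this example, and the helper names are hypothetical.

```python
import statistics


def percentile(values, q):
    """q-th percentile using linear interpolation between closest ranks."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def aggregate(scores, aggregations):
    """Map each requested aggregation name to its value (illustrative only)."""
    functions = {
        "min": min,
        "max": max,
        "mean": statistics.mean,
        "median": statistics.median,
        # Assumption: population variance rather than sample variance.
        "variance": statistics.pvariance,
        "p90": lambda values: percentile(values, 90),
    }
    return {name: functions[name](scores) for name in aggregations}


summary = aggregate([3, 4, 4, 5, 2], ["mean", "variance", "p90"])
print(summary)
```

For judge scores on a 1 to 5 scale, the mean gives the headline number, while variance and p90 indicate how consistent the judged outputs are.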

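mlflow.metrics.genai.relevance(model: str | None = None, metric_version: str | None = None, examples: List[EvaluationExample] | None = None, metric_metadata: Dict[str, Any] | None = None) → EvaluationMetric

This function will create a genai metric used to evaluate the relevance of an LLM using the model provided. Relevance will be assessed by the appropriateness, significance, and applicability of the output with respect to the input and the context.

The context eval_arg must be provided as part of the input dataset or output predictions. This can be mapped to a column of a different name using col_mapping in the evaluator_config parameter.

An MlflowException will be raised if the specified version for this metric does not exist.

Parameters:

model – (Optional) Model URI of an openai or gateway judge model in the format of "openai:/gpt-4" or "gateway:/my-route". Defaults to "openai:/gpt-4". Your use of a third-party LLM service (e.g., OpenAI) for evaluation may be subject to and governed by the LLM service's terms of use.
metric_version – (Optional) The version of the relevance metric to use. Defaults to the latest version.
examples – (Optional) A list of examples to help the judge model evaluate relevance. It is highly recommended to add examples to be used as a reference when evaluating new results.
metric_metadata – (Optional) Dictionary of metadata to be attached to the EvaluationMetric object. Useful for model evaluators that require additional information to determine how to evaluate this metric.

Returns:

A metric object.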

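mlflow.metrics.genai.retrieve_custom_metrics(run_id: str, name: str | None = None, version: str | None = None) → List[EvaluationMetric]

Retrieve the custom metrics created by users through make_genai_metric() or make_genai_metric_from_prompt() that are associated with a particular evaluation run.

Parameters:

run_id – The unique identifier for the run.
name – (Optional) The name of the custom metric to retrieve. If None, retrieve all metrics.
version – (Optional) The version of the custom metric to retrieve. If None, retrieve all metrics.

Returns:

A list of EvaluationMetric objects that match the retrieval criteria.

Example for retrieving a custom genai metric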

import pandas as pd

import mlflow
from mlflow.metrics.genai.genai_metric import (
    make_genai_metric_from_prompt,
    retrieve_custom_metrics,
)

eval_df = pd.DataFrame(
    {
        "inputs": ["foo"],
        "ground_truth": ["bar"],
    }
)
with mlflow.start_run() as run:
    system_prompt = "Answer the following question in two sentences"
    basic_qa_model = mlflow.openai.log_model(
        model="gpt-4o-mini",
        task="chat.completions",
        artifact_path="model",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "{question}"},
        ],
    )
    custom_metric = make_genai_metric_from_prompt(
        name="custom llm judge",
        judge_prompt="This is a custom judge prompt.",
        greater_is_better=False,
        parameters={"temperature": 0.0},
    )
    results = mlflow.evaluate(
        basic_qa_model.model_uri,
        eval_df,
        targets="ground_truth",
        model_type="question-answering",
        evaluators="default",
        extra_metrics=[custom_metric],
    )
metrics = retrieve_custom_metrics(
    run_id=run.info.run_id,
    name="custom llm judge",
)

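You can also create your own generative AI EvaluationMetrics using the make_genai_metric factory function.

mlflow.metrics.genai.make_genai_metric(name: str, definition: str, grading_prompt: str, examples: List[EvaluationExample] | None = None, version: str | None = 'v1', model: str | None = 'openai:/gpt-4', grading_context_columns: List[str] | str | None = None, include_input: bool = True, parameters: Dict[str, Any] | None = None, aggregations: List[str] | None = None, greater_is_better: bool = True, max_workers: int = 10, metric_metadata: Dict[str, Any] | None = None) → EvaluationMetric

Note

Experimental: This function may change or be removed in a future release without warning.

Create a genai metric used to evaluate an LLM using an LLM as a judge in MLflow. The full grading prompt is stored in the metric_details field of the EvaluationMetric object.

Parameters:

name – The name of the metric.
definition – The definition of the metric.
grading_prompt – The grading criteria of the metric.
examples – (Optional) Examples of the metric.
version – (Optional) The version of the metric. Currently supported versions are: v1.
model – (Optional) Model URI of an openai, gateway, or deployments judge model in the format of "openai:/gpt-4", "gateway:/my-route", or "endpoints:/databricks-llama-2-70b-chat". Defaults to "openai:/gpt-4". If using Azure OpenAI, the OPENAI_DEPLOYMENT_NAME environment variable will take precedence. Your use of a third-party LLM service (e.g., OpenAI) for evaluation may be subject to and governed by the LLM service's terms of use.
grading_context_columns – (Optional) The name of the grading context column, or a list of grading context column names, required to compute the metric. The grading_context_columns are used by the LLM as a judge as additional information to compute the metric. The columns are extracted from the input dataset or output predictions based on the col_mapping in the evaluator_config passed to mlflow.evaluate(). They can also be the names of other evaluated metrics.
include_input – (Optional) Whether to include the input when computing the metric.
parameters – (Optional) Parameters for the LLM used to compute the metric. By default, the temperature is set to 0.0, max_tokens to 200, and top_p to 1.0. We recommend setting the temperature to 0.0 for the LLM used as a judge to ensure consistent results.
aggregations – (Optional) The list of options used to aggregate the scores. Currently supported options are: min, max, mean, median, variance, p90.
greater_is_better – (Optional) Whether the metric is better when it is greater.
max_workers – (Optional) The maximum number of workers to use for judge scoring. Defaults to 10 workers.
metric_metadata – (Optional) Dictionary of metadata to be attached to the EvaluationMetric object. Useful for model evaluators that require additional information to determine how to evaluate this metric.

Returns:

A metric object.

Example for creating a genai metric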

from mlflow.metrics.genai import EvaluationExample, make_genai_metric

example = EvaluationExample(
    input="What is MLflow?",
    output=(
        "MLflow is an open-source platform for managing machine "
        "learning workflows, including experiment tracking, model packaging, "
        "versioning, and deployment, simplifying the ML lifecycle."
    ),
    score=4,
    justification=(
        "The definition effectively explains what MLflow is "
        "its purpose, and its developer. It could be more concise for a 5-score."
    ),
    grading_context={
        "targets": (
            "MLflow is an open-source platform for managing "
            "the end-to-end machine learning (ML) lifecycle. It was developed by "
            "Databricks, a company that specializes in big data and machine learning "
            "solutions. MLflow is designed to address the challenges that data "
            "scientists and machine learning engineers face when developing, training, "
            "and deploying machine learning models."
        )
    },
)
metric = make_genai_metric(
    name="answer_correctness",
    definition=(
        "Answer correctness is evaluated on the accuracy of the provided output based on "
        "the provided targets, which is the ground truth. Scores can be assigned based on "
        "the degree of semantic similarity and factual correctness of the provided output "
        "to the provided targets, where a higher score indicates higher degree of accuracy."
    ),
    grading_prompt=(
        "Answer correctness: Below are the details for different scores: "
        "- Score 1: The output is completely incorrect. It is completely different from "
        "or contradicts the provided targets. "
        "- Score 2: The output demonstrates some degree of semantic similarity and "
        "includes partially correct information. However, the output still has significant "
        "discrepancies with the provided targets or inaccuracies. "
        "- Score 3: The output addresses a couple of aspects of the input accurately, "
        "aligning with the provided targets. However, there are still omissions or minor "
        "inaccuracies. "
        "- Score 4: The output is mostly correct. It provides mostly accurate information, "
        "but there may be one or more minor omissions or inaccuracies. "
        "- Score 5: The output is correct. It demonstrates a high degree of accuracy and "
        "semantic similarity to the targets."
    ),
    examples=[example],
    version="v1",
    model="openai:/gpt-4",
    grading_context_columns=["targets"],
    parameters={"temperature": 0.0},
    aggregations=["mean", "variance", "p90"],
    greater_is_better=True,
)

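When using generative AI EvaluationMetrics, it is important to pass in an EvaluationExample.

class mlflow.metrics.genai.EvaluationExample(output: str, score: float, justification: str, input: str | None = None, grading_context: Dict[str, str] | str | None = None)

Note

Experimental: This class may change or be removed in a future release without warning.

Stores a sample example used for few-shot learning during LLM evaluation.

Parameters:

input – The input provided to the model.
output – The output generated by the model.
score – The score given by the evaluator.
justification – The justification given by the evaluator.
grading_context – The grading_context provided to the evaluator for evaluation. Either a dictionary of grading context column names and grading context strings, or a single grading context string.

Example for creating an EvaluationExample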

from mlflow.metrics.genai import EvaluationExample

example = EvaluationExample(
    input="What is MLflow?",
    output="MLflow is an open-source platform for managing machine "
    "learning workflows, including experiment tracking, model packaging, "
    "versioning, and deployment, simplifying the ML lifecycle.",
    score=4,
    justification="The definition effectively explains what MLflow is "
    "its purpose, and its developer. It could be more concise for a 5-score.",
    grading_context={
        "ground_truth": "MLflow is an open-source platform for managing "
        "the end-to-end machine learning (ML) lifecycle. It was developed by Databricks, "
        "a company that specializes in big data and machine learning solutions. MLflow is "
        "designed to address the challenges that data scientists and machine learning "
        "engineers face when developing, training, and deploying machine learning models."
    },
)
print(str(example))

Output

Input: What is MLflow?
Provided output: "MLflow is an open-source platform for managing machine "
    "learning workflows, including experiment tracking, model packaging, "
    "versioning, and deployment, simplifying the ML lifecycle."
Provided ground_truth: "MLflow is an open-source platform for managing "
    "the end-to-end machine learning (ML) lifecycle. It was developed by Databricks, "
    "a company that specializes in big data and machine learning solutions. MLflow is "
    "designed to address the challenges that data scientists and machine learning "
    "engineers face when developing, training, and deploying machine learning models."
Score: 4
Justification: "The definition effectively explains what MLflow is "
    "its purpose, and its developer. It could be more concise for a 5-score."

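Users must set the appropriate environment variables for the LLM service they use for evaluation. For example, if you are using OpenAI's API, you must set the OPENAI_API_KEY environment variable. If you are using Azure OpenAI, you must also set the OPENAI_API_TYPE, OPENAI_API_VERSION, OPENAI_API_BASE, and OPENAI_DEPLOYMENT_NAME environment variables. See the Azure OpenAI documentation for details. Users do not need to set these environment variables if they are using a gateway route.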
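Before starting an evaluation, it can help to check that the variables listed above are present. The sketch below is a hypothetical pre-flight helper and not part of MLflow; the service labels ("openai", "azure_openai", "gateway") and the function name are invented for this example, while the variable names come from the paragraph above.

```python
import os

# Environment variables named in the text, grouped by illustrative labels.
REQUIRED_ENV_VARS = {
    "openai": ["OPENAI_API_KEY"],
    "azure_openai": [
        "OPENAI_API_KEY",
        "OPENAI_API_TYPE",
        "OPENAI_API_VERSION",
        "OPENAI_API_BASE",
        "OPENAI_DEPLOYMENT_NAME",
    ],
    "gateway": [],
}


def missing_env_vars(service, environ=None):
    """Return the required variables that are unset or empty for a service."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV_VARS[service] if not environ.get(name)]


# A dictionary stands in for os.environ so the example is reproducible.
missing = missing_env_vars(
    "azure_openai",
    {"OPENAI_API_KEY": "placeholder-key", "OPENAI_API_TYPE": "azure"},
)
print(missing)
```

Calling missing_env_vars("openai") with no second argument checks the real process environment instead.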