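MLflow Dataset Tracking Tutorial

The mlflow.data module is an integral part of the MLflow ecosystem, designed to enhance your machine learning workflow. It lets you record and retrieve dataset information during model training and evaluation by building on MLflow's tracking capabilities.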

Key Interfaces

Two main abstract components are associated with the mlflow.data module: Dataset and DatasetSource.

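Dataset

The Dataset abstraction is a metadata tracking object that holds information about a given logged dataset.

The information stored in a Dataset object includes features, targets, and predictions, along with metadata such as the dataset's name, digest (hash), schema, and profile. You can log this metadata with the mlflow.log_input() API. The module provides functions for constructing mlflow.data.dataset.Dataset objects from a variety of data types.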

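This abstract class has several concrete implementations, including mlflow.data.pandas_dataset.PandasDataset, mlflow.data.numpy_dataset.NumpyDataset, and mlflow.data.spark_dataset.SparkDataset.

The following example shows how to construct an mlflow.data.pandas_dataset.PandasDataset object from a Pandas DataFrame: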

import mlflow.data
import pandas as pd
from mlflow.data.pandas_dataset import PandasDataset


dataset_source_url = "https://raw.githubusercontent.com/mlflow/mlflow/master/tests/datasets/winequality-white.csv"
raw_data = pd.read_csv(dataset_source_url, delimiter=";")

# Create an instance of a PandasDataset
dataset = mlflow.data.from_pandas(
    raw_data, source=dataset_source_url, name="wine quality - white", targets="quality"
)
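To make the idea of a digest more concrete, here is a minimal sketch of a short content fingerprint. It is not MLflow's own algorithm: the helper short_digest, the choice of SHA-256, and the sample rows are assumptions made purely for illustration. It only demonstrates the property that matters for tracking, namely that identical content yields the same digest and changed content yields a different one.

```python
import hashlib


def short_digest(rows, length=8):
    """Return a short hex fingerprint of an iterable of rows.

    Hypothetical helper for illustration; MLflow computes its digest differently.
    """
    hasher = hashlib.sha256()
    for row in rows:
        # Serialize each row deterministically before feeding it to the hash
        hasher.update(repr(tuple(row)).encode("utf-8"))
    # Keep only the first few hex characters, similar in spirit to a short digest
    return hasher.hexdigest()[:length]


rows = [(7.0, 0.27, 6), (6.3, 0.30, 6)]
changed_rows = [(7.0, 0.27, 6), (6.3, 0.30, 5)]

same_content_matches = short_digest(rows) == short_digest(list(rows))
changed_content_matches = short_digest(rows) == short_digest(changed_rows)

print(f"Digest: {short_digest(rows)}")
print(f"Same content, same digest: {same_content_matches}")
print(f"Changed content, same digest: {changed_content_matches}")
```

A short digest like this makes it easy to tell at a glance whether two runs were given the same data.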

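DatasetSource

The DatasetSource is a component of a given Dataset object that provides lineage back to the original source of the data.

The DatasetSource component of a Dataset represents where the dataset came from, such as a directory in S3, a Delta table, or a URL. The Dataset references it so that the origin of the data can be traced. You can retrieve the DatasetSource of a logged dataset either by accessing the source property of the Dataset object or by calling the mlflow.data.get_source() API.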

Tip

For many of the autologging-enabled flavors supported in MLflow, the dataset's source is logged automatically when the dataset itself is logged.

Note

The example shown below is purely for instructional purposes, as logging a dataset outside of a training run is not a common practice.

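Example Usage

The following example demonstrates how to use the mlflow.log_input() API to log a training dataset, retrieve its information, and fetch its data source: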

import mlflow
import pandas as pd
from mlflow.data.pandas_dataset import PandasDataset


dataset_source_url = "https://raw.githubusercontent.com/mlflow/mlflow/master/tests/datasets/winequality-white.csv"
raw_data = pd.read_csv(dataset_source_url, delimiter=";")

# Create an instance of a PandasDataset
dataset = mlflow.data.from_pandas(
    raw_data, source=dataset_source_url, name="wine quality - white", targets="quality"
)

# Log the Dataset to an MLflow run by using the `log_input` API
with mlflow.start_run() as run:
    mlflow.log_input(dataset, context="training")

# Retrieve the run information
logged_run = mlflow.get_run(run.info.run_id)

# Retrieve the Dataset object
logged_dataset = logged_run.inputs.dataset_inputs[0].dataset

# View some of the recorded Dataset information
print(f"Dataset name: {logged_dataset.name}")
print(f"Dataset digest: {logged_dataset.digest}")
print(f"Dataset profile: {logged_dataset.profile}")
print(f"Dataset schema: {logged_dataset.schema}")

The standard output of the code snippet above is as follows:

Dataset name: wine quality - white
Dataset digest: 2a1e42c4
Dataset profile: {"num_rows": 4898, "num_elements": 58776}
Dataset schema: {"mlflow_colspec": [
    {"type": "double", "name": "fixed acidity"},
    {"type": "double", "name": "volatile acidity"},
    {"type": "double", "name": "citric acid"},
    {"type": "double", "name": "residual sugar"},
    {"type": "double", "name": "chlorides"},
    {"type": "double", "name": "free sulfur dioxide"},
    {"type": "double", "name": "total sulfur dioxide"},
    {"type": "double", "name": "density"},
    {"type": "double", "name": "pH"},
    {"type": "double", "name": "sulphates"},
    {"type": "double", "name": "alcohol"},
    {"type": "long", "name": "quality"}
    ]}
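The profile and schema printed above are JSON strings, so they can be parsed with the standard library. The sketch below copies the values from that output and checks one relationship that can be read off the numbers: num_elements equals the number of rows multiplied by the number of columns (4898 × 12 = 58776). The variable names are illustrative only.

```python
import json

# Values copied from the output shown above; both fields are JSON strings
profile_json = '{"num_rows": 4898, "num_elements": 58776}'
schema_json = """{"mlflow_colspec": [
    {"type": "double", "name": "fixed acidity"},
    {"type": "double", "name": "volatile acidity"},
    {"type": "double", "name": "citric acid"},
    {"type": "double", "name": "residual sugar"},
    {"type": "double", "name": "chlorides"},
    {"type": "double", "name": "free sulfur dioxide"},
    {"type": "double", "name": "total sulfur dioxide"},
    {"type": "double", "name": "density"},
    {"type": "double", "name": "pH"},
    {"type": "double", "name": "sulphates"},
    {"type": "double", "name": "alcohol"},
    {"type": "long", "name": "quality"}
    ]}"""

profile = json.loads(profile_json)
columns = json.loads(schema_json)["mlflow_colspec"]

# Collect the column names and pick out the non-double columns
column_names = [column["name"] for column in columns]
integer_columns = [column["name"] for column in columns if column["type"] == "long"]

# The element count is the row count multiplied by the column count
elements_match = profile["num_rows"] * len(columns) == profile["num_elements"]

print(f"Number of columns: {len(columns)}")
print(f"Integer-typed columns: {integer_columns}")
print(f"num_rows x columns equals num_elements: {elements_match}")
```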

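We can navigate to the MLflow UI to see what the logged dataset looks like.

When we want to load the dataset back from the location where it is stored (calling load downloads the data locally), we access the dataset's source through the following API: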

# Loading the dataset's source
dataset_source = mlflow.data.get_source(logged_dataset)

local_dataset = dataset_source.load()

print(f"The local file where the data has been downloaded to: {local_dataset}")

# Load the data again
loaded_data = pd.read_csv(local_dataset, delimiter=";")

The print statement above resolves to the local file that was created when load was called.

The local file where the data has been downloaded to:
/var/folders/cd/n8n0rm2x53l_s0xv_j_xklb00000gp/T/tmpuxwtrul1/winequality-white.csv

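Using Datasets with Other MLflow Features

The mlflow.data module plays a crucial role in associating datasets with MLflow runs. Beyond the obvious value of keeping a record of the dataset that was used during training alongside the run, MLflow includes integrations that can directly consume a dataset logged with the mlflow.log_input() API.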

How to Use a Dataset with MLflow Evaluate

Note

The integration of Datasets with MLflow evaluate was introduced in MLflow 2.8.0. Earlier versions do not have this functionality.
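If code needs to guard against older installations, a version check along these lines is one option. This is a minimal sketch using only the standard library; the helper names parse_release and supports_dataset_evaluate are hypothetical, and the simple parser treats a pre-release such as 2.8.0rc1 the same as 2.8.0.

```python
import re

# Minimum MLflow release with the Dataset and evaluate integration, per the note above
MIN_VERSION = (2, 8, 0)


def parse_release(version):
    """Extract the leading numeric components, e.g. '2.8.0rc1' -> (2, 8, 0)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    # A missing patch component is treated as 0
    return tuple(int(part or 0) for part in match.groups())


def supports_dataset_evaluate(version):
    """Return True if the given version string is at least MIN_VERSION."""
    return parse_release(version) >= MIN_VERSION


for candidate in ["2.7.1", "2.8.0", "2.10.2"]:
    print(candidate, supports_dataset_evaluate(candidate))
```

In practice the string passed in would typically be mlflow.__version__. Comparing integer tuples rather than raw strings avoids ordering "2.10" before "2.8".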

To see how this integration works, let's look at a fairly simple and typical classification task.

import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import xgboost

import mlflow
from mlflow.data.pandas_dataset import PandasDataset


dataset_source_url = "https://raw.githubusercontent.com/mlflow/mlflow/master/tests/datasets/winequality-white.csv"
raw_data = pd.read_csv(dataset_source_url, delimiter=";")

# Extract the features and target data separately
y = raw_data["quality"]
X = raw_data.drop("quality", axis=1)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=17
)

# Create a label encoder object
le = LabelEncoder()

# Fit and transform the target variable
y_train_encoded = le.fit_transform(y_train)
y_test_encoded = le.transform(y_test)

# Fit an XGBoost classifier on the training data split (quality has more than two classes)
model = xgboost.XGBClassifier().fit(X_train, y_train_encoded)

# Build the Evaluation Dataset from the test set
y_test_pred = model.predict(X=X_test)

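# Copy the test features so that adding columns does not modify X_test,
# which is reused later as the model's input example
eval_data = X_test.copy()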
eval_data["label"] = y_test

# Assign the decoded predictions to the Evaluation Dataset
eval_data["predictions"] = le.inverse_transform(y_test_pred)

# Create the PandasDataset for use in mlflow evaluate
pd_dataset = mlflow.data.from_pandas(
    eval_data, predictions="predictions", targets="label"
)

mlflow.set_experiment("White Wine Quality")

# Log the Dataset, model, and execute an evaluation run using the configured Dataset
with mlflow.start_run() as run:
    mlflow.log_input(pd_dataset, context="training")

    mlflow.xgboost.log_model(
        artifact_path="white-wine-xgb", xgb_model=model, input_example=X_test
    )

    result = mlflow.evaluate(data=pd_dataset, predictions=None, model_type="classifier")

Note

Using the mlflow.evaluate() API automatically logs the dataset used for the evaluation to the MLflow run. An explicit call to log the input is not required.

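Navigating to the MLflow UI, we can see how the dataset, the model, the metrics, and a classification-specific confusion matrix are all logged to the run.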
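For readers unfamiliar with the confusion matrix mentioned above, the sketch below shows how one can be tallied by hand from a label column and a predictions column. It is a plain-Python illustration, not how MLflow computes its artifacts; the helper confusion_counts and the sample values are hypothetical.

```python
from collections import Counter


def confusion_counts(labels, predictions):
    """Count (actual, predicted) pairs; hypothetical helper for illustration."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    return Counter(zip(labels, predictions))


# Made-up wine quality labels and model predictions
labels = [5, 6, 6, 7, 5, 6]
predictions = [5, 6, 5, 7, 6, 6]

counts = confusion_counts(labels, predictions)

# Pairs where actual and predicted agree lie on the diagonal of the matrix
correct = sum(n for (actual, predicted), n in counts.items() if actual == predicted)
accuracy = correct / len(labels)

for (actual, predicted), n in sorted(counts.items()):
    print(f"actual={actual} predicted={predicted} count={n}")
print(f"accuracy={accuracy:.3f}")
```

Each off-diagonal pair, such as an actual 6 predicted as 5, shows which classes the model tends to confuse.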