Combine predictors using stacking

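Stacking is a method for blending estimators. In this strategy, several base estimators are each fitted on the training data, and a final estimator is then trained on the stacked predictions of these base estimators.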

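In this example, we illustrate the use case in which different regressors are stacked together and a final penalized linear regressor outputs the prediction. We compare the performance of each individual regressor with that of the stacking strategy. Stacking slightly improves the overall performance.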
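As a minimal sketch of the strategy described above, the snippet below stacks two regressors on synthetic data and uses `RidgeCV` as the final penalized linear regressor. The synthetic data, the choice of base estimators, and the names `X_demo`, `y_demo`, `base_estimators`, and `stacked` are assumptions made for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import train_test_split

# Synthetic regression data stands in for a real dataset (assumption).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, noise=10.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, random_state=0
)

# Base estimators: each one is fitted on the training data on its own.
base_estimators = [
    ("lasso", LassoCV()),
    ("forest", RandomForestRegressor(n_estimators=20, random_state=0)),
]

# The final estimator is trained on cross-validated predictions
# of the base estimators.
stacked = StackingRegressor(
    estimators=base_estimators, final_estimator=RidgeCV()
)
stacked.fit(X_train, y_train)

# transform() returns the stacked predictions: one column per base estimator.
stacked_predictions = stacked.transform(X_test)
test_r2 = stacked.score(X_test, y_test)
```

Here `stacked_predictions` is the input that the final estimator sees at prediction time, which is why it has one column per base estimator.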

# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

Download the dataset

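We will use the `Ames Housing`_ dataset, which was first compiled by Dean De Cock and became better known after it was used in a Kaggle challenge. It is a set of 1460 residential homes in Ames, Iowa, each described by 80 features. We will use it to predict the final logarithmic price of the houses. In this example we will use only the 20 most interesting features, chosen using GradientBoostingRegressor(), and limit the number of entries (we won't go into the details of how to select the most interesting features here).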
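The text does not say how the features were chosen. One common approach, shown here purely as an assumption, is to rank features by the impurity-based `feature_importances_` of a fitted `GradientBoostingRegressor` and keep the top few. The synthetic data and the names `feature_names`, `top_k`, and `selected` are hypothetical.

```python
import numpy as np

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data with 8 features, only 3 of which carry signal (assumption).
X_sel, y_sel = make_regression(
    n_samples=200, n_features=8, n_informative=3, random_state=0
)
feature_names = np.array([f"feature_{i}" for i in range(X_sel.shape[1])])

# Fit the model and rank the features by decreasing importance.
gbr = GradientBoostingRegressor(random_state=0).fit(X_sel, y_sel)
order = np.argsort(gbr.feature_importances_)[::-1]

# Keep the names of the top_k highest-ranked features.
top_k = 3
selected = feature_names[order[:top_k]]
```

Impurity-based importances can be biased toward high-cardinality features, so a ranking like this is best treated as a starting point rather than a definitive selection.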

The Ames housing dataset is not shipped with scikit-learn, so we will fetch it from `OpenML`_.

import numpy as np

from sklearn.datasets import fetch_openml
from sklearn.utils import shuffle


def load_ames_housing():
    df = fetch_openml(name="house_prices", as_frame=True)
    X = df.data
    y = df.target

    features = [
        "YrSold",
        "HeatingQC",
        "Street",
        "YearRemodAdd",
        "Heating",
        "MasVnrType",
        "BsmtUnfSF",
        "Foundation",
        "MasVnrArea",
        "MSSubClass",
        "ExterQual",
        "Condition2",
        "GarageCars",
        "GarageType",
        "OverallQual",
        "TotalBsmtSF",
        "BsmtFinSF1",
        "HouseStyle",
        "MiscFeature",
        "MoSold",
    ]

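    # Keep only the selected features, then shuffle reproducibly.
    X = X.loc[:, features]
    X, y = shuffle(X, y, random_state=0)

    # Limit the number of entries and return the log of the sale price.
    X = X.iloc[:600]
    y = y.iloc[:600]
    return X, np.log(y)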


X, y = load_ames_housing()

Measure and plot the results

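Now we can use the Ames Housing dataset to make predictions. We check the performance of each individual predictor as well as that of the stack of regressors.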

import time

import matplotlib.pyplot as plt

from sklearn.metrics import PredictionErrorDisplay
from sklearn.model_selection import cross_val_predict, cross_validate

fig, axs = plt.subplots(2, 2, figsize=(9, 7))
axs = np.ravel(axs)

for ax, (name, est) in zip(
    axs, estimators + [("Stacking Regressor", stacking_regressor)]
):
    scorers = {"R2": "r2", "MAE": "neg_mean_absolute_error"}

    start_time = time.time()
    scores = cross_validate(
        est, X, y, scoring=list(scorers.values()), n_jobs=-1, verbose=0
    )
    elapsed_time = time.time() - start_time

    y_pred = cross_val_predict(est, X, y, n_jobs=-1, verbose=0)
    scores = {
        key: (
            f"{np.abs(np.mean(scores[f'test_{value}'])):.2f} +- "
            f"{np.std(scores[f'test_{value}']):.2f}"
        )
        for key, value in scorers.items()
    }

    display = PredictionErrorDisplay.from_predictions(
        y_true=y,
        y_pred=y_pred,
        kind="actual_vs_predicted",
        ax=ax,
        scatter_kwargs={"alpha": 0.2, "color": "tab:blue"},
        line_kwargs={"color": "tab:red"},
    )
    ax.set_title(f"{name}\nEvaluation in {elapsed_time:.2f} seconds")

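    # Add the scores to the legend through invisible artists; a distinct
    # loop variable avoids shadowing the estimator `name` above.
    for metric_name, score in scores.items():
        ax.plot([], [], " ", label=f"{metric_name}: {score}")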
    ax.legend(loc="upper left")

plt.suptitle("Single predictors versus stacked predictors")
plt.tight_layout()
plt.subplots_adjust(top=0.9)
plt.show()
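The score handling in the loop above relies on how `cross_validate` names its outputs when several scorers are passed: each scorer string appears under the key `test_<scorer>`. The sketch below reproduces that step in isolation on synthetic data. The data, the `RidgeCV` model, and the names `cv_results` and `summary` are assumptions for illustration.

```python
import numpy as np

from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_validate

# Synthetic data and a simple model stand in for the real ones (assumption).
X_cv, y_cv = make_regression(
    n_samples=100, n_features=5, noise=5.0, random_state=0
)
scorers = {"R2": "r2", "MAE": "neg_mean_absolute_error"}

# With a list of scorer names, results are stored under "test_<scorer>".
cv_results = cross_validate(
    RidgeCV(), X_cv, y_cv, scoring=list(scorers.values())
)

# "neg_mean_absolute_error" is never positive, so the absolute value of the
# mean recovers the usual MAE; the standard deviation is unaffected by sign.
summary = {}
for label, scorer_name in scorers.items():
    fold_scores = cv_results[f"test_{scorer_name}"]
    summary[label] = (
        f"{np.abs(np.mean(fold_scores)):.2f} +- {np.std(fold_scores):.2f}"
    )
```

Taking the absolute value is convenient for the negated MAE, but it would also hide the sign of a negative mean R2, so it is worth checking the raw fold scores when a model performs poorly.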