make_pipeline#

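make_pipeline is a shorthand for Pipeline. To set up a Pipeline we need to create tuples that pair a step name with a transformer or estimator. With make_pipeline we only pass the transformers and estimators, and the names are added automatically.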
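As a minimal sketch of the difference, the snippet below builds the same two-step pipeline both ways. It uses scikit-learn's own `Pipeline` and `make_pipeline` as stand-ins, an assumption made so the snippet runs without Feature-engine; the step names printed later on this page follow the same lowercase class name pattern. `StandardScaler` is only an illustrative transformer.

```python
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# Explicit form: every step is a (name, estimator) tuple.
explicit = Pipeline(
    steps=[
        ("standardscaler", StandardScaler()),
        ("lasso", Lasso(random_state=10)),
    ]
)

# Shorthand form: pass the estimators only; names are generated
# from the lowercased class names.
shorthand = make_pipeline(StandardScaler(), Lasso(random_state=10))

explicit_names = [name for name, _ in explicit.steps]
shorthand_names = [name for name, _ in shorthand.steps]
print(explicit_names)
print(shorthand_names)
```

Both lists contain the same names, so the two objects can be addressed in the same way afterwards.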

Setting up a Pipeline with make_pipeline#

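In the following example, we set up a Pipeline that first drops rows with missing data, then replaces categories with ordinal numbers, and finally fits a Lasso regression model.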

import numpy as np
import pandas as pd
from feature_engine.imputation import DropMissingData
from feature_engine.encoding import OrdinalEncoder
from feature_engine.pipeline import make_pipeline

from sklearn.linear_model import Lasso

X = pd.DataFrame(
    dict(
        x1=[2, 1, 1, 0, np.nan],
        x2=["a", np.nan, "b", np.nan, "a"],
    )
)
y = pd.Series([1, 2, 3, 4, 5])

pipe = make_pipeline(
    DropMissingData(),
    OrdinalEncoder(encoding_method="arbitrary"),
    Lasso(random_state=10),
)
# fit the pipeline, then predict
pipe.fit(X, y)
preds_pipe = pipe.predict(X)
preds_pipe

In the output, we see the predictions made by the pipeline:

array([2., 2.])
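Only two predictions come back even though `X` has five rows. A plausible reading, sketched below with plain pandas and scikit-learn, is that only rows 0 and 2 are free of missing values, so the other three rows are dropped before the model sees the data. Here `dropna` and `pd.factorize` are stand-ins for `DropMissingData` and `OrdinalEncoder`, not the Feature-engine implementations, and the integer codes that `OrdinalEncoder` assigns may differ.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

X = pd.DataFrame(
    dict(
        x1=[2, 1, 1, 0, np.nan],
        x2=["a", np.nan, "b", np.nan, "a"],
    )
)
y = pd.Series([1, 2, 3, 4, 5])

# Stand-in for DropMissingData: keep complete rows and realign y.
X_clean = X.dropna()
y_clean = y.loc[X_clean.index]

# Stand-in for arbitrary ordinal encoding: integer codes in order of appearance.
X_clean = X_clean.assign(x2=pd.factorize(X_clean["x2"])[0])

# With the default alpha=1.0 the penalty shrinks both coefficients to zero
# on this tiny dataset, so every prediction equals the mean of y_clean.
model = Lasso(random_state=10).fit(X_clean, y_clean)
preds = model.predict(X_clean)

print(X_clean.index.tolist())
print(preds)
```

The two remaining targets are 1 and 3, whose mean is 2, which matches the predictions shown above.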

The step names were assigned automatically:

print(pipe)

Pipeline(steps=[('dropmissingdata', DropMissingData()),
                ('ordinalencoder', OrdinalEncoder(encoding_method='arbitrary')),
                ('lasso', Lasso(random_state=10))])

The pipeline returned by make_pipeline has exactly the same characteristics as Pipeline. For additional guidance, check the Pipeline documentation.
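One practical consequence is that the generated names can be used like hand-written ones. The sketch below assumes that the Feature-engine pipeline follows scikit-learn's `named_steps` and `set_params` conventions, and it uses scikit-learn's `make_pipeline` with an illustrative `StandardScaler` as a stand-in.

```python
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), Lasso(random_state=10))

# Look up a step by its generated name.
lasso_step = pipe.named_steps["lasso"]

# Change a step's hyperparameter with the "<step name>__<parameter>" syntax.
pipe.set_params(lasso__alpha=0.1)
print(pipe.named_steps["lasso"].alpha)
```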

Forecasting#

Let's set up another pipeline to do direct forecasting:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.linear_model import Lasso
from sklearn.metrics import root_mean_squared_error
from sklearn.multioutput import MultiOutputRegressor

from feature_engine.timeseries.forecasting import (
    LagFeatures,
    WindowFeatures,
)
from feature_engine.pipeline import make_pipeline

We'll use the Australian electricity demand dataset described here:

Godahewa, Rakshitha, Bergmeir, Christoph, Webb, Geoff, Hyndman, Rob, & Montero-Manso, Pablo. (2021). Australian Electricity Demand Dataset (Version 1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.4659727

url = "https://raw.githubusercontent.com/tidyverts/tsibbledata/master/data-raw/vic_elec/VIC2015/demand.csv"
df = pd.read_csv(url)

df.drop(columns=["Industrial"], inplace=True)

# Convert the integer Date to an actual date with datetime type
df["date"] = df["Date"].apply(
    lambda x: pd.Timestamp("1899-12-30") + pd.Timedelta(x, unit="days")
)

# Create a timestamp from the integer Period representing 30 minute intervals
df["date_time"] = df["date"] + \
    pd.to_timedelta((df["Period"] - 1) * 30, unit="m")

df.dropna(inplace=True)

# Rename columns
df = df[["date_time", "OperationalLessIndustrial"]]

df.columns = ["date_time", "demand"]

# Resample to hourly
df = (
    df.set_index("date_time")
    .resample("h")
    .agg({"demand": "sum"})
)

print(df.head())

Here, we see the first rows of the data:

                          demand
date_time
2002-01-01 00:00:00  6919.366092
2002-01-01 01:00:00  7165.974188
2002-01-01 02:00:00  6406.542994
2002-01-01 03:00:00  5815.537828
2002-01-01 04:00:00  5497.732922
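The date handling in the loading code can be checked on a few made-up rows. The sketch below assumes, as the code comments above suggest, that `Date` is a spreadsheet-style serial day number counted from 1899-12-30 and that `Period` counts 30-minute intervals starting at 1. The demand values are invented, and `pd.to_datetime` with `origin` is used as a vectorized alternative to the row-wise `apply`.

```python
import pandas as pd

# Made-up rows: Date is a serial day number and Period counts
# 30-minute intervals within the day, starting at 1.
toy = pd.DataFrame(
    {
        "Date": [37257, 37257, 37257, 37257],
        "Period": [1, 2, 3, 4],
        "demand": [100.0, 110.0, 120.0, 130.0],
    }
)

# Vectorized equivalent of adding Date days to 1899-12-30.
date = pd.to_datetime(toy["Date"], unit="D", origin=pd.Timestamp("1899-12-30"))

# Period 1 starts at 00:00, period 2 at 00:30, and so on.
toy["date_time"] = date + pd.to_timedelta((toy["Period"] - 1) * 30, unit="min")

# Summing the two half-hour readings in each hour gives hourly demand.
hourly = toy.set_index("date_time").resample("h").agg({"demand": "sum"})
print(hourly)
```

Serial day 37257 maps to 2002-01-01, which agrees with the first timestamp in the real data.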

We'll predict the next 3 hours of energy demand using direct forecasting. Let's create the target variable:

horizon = 3
y = pd.DataFrame(index=df.index)
for h in range(horizon):
    y[f"h_{h}"] = df.shift(periods=-h, freq="h")
y.dropna(inplace=True)
df = df.loc[y.index]
print(y.head())

This is our target variable:

                             h_0          h_1          h_2
date_time
2002-01-01 00:00:00  6919.366092  7165.974188  6406.542994
2002-01-01 01:00:00  7165.974188  6406.542994  5815.537828
2002-01-01 02:00:00  6406.542994  5815.537828  5497.732922
2002-01-01 03:00:00  5815.537828  5497.732922  5385.851060
2002-01-01 04:00:00  5497.732922  5385.851060  5574.731890
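In direct forecasting, a separate model is trained for each step of the horizon, so every row of the target needs the values observed at that hour and at the following hours. The sketch below reproduces the construction on a made-up series. It shifts the values with `shift(-h)` instead of shifting the index with `freq="h"`, which should be equivalent on a regular hourly index without gaps.

```python
import pandas as pd

index = pd.date_range("2002-01-01", periods=5, freq="h")
demand = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0], index=index)

horizon = 3
targets = pd.DataFrame(index=index)
for h in range(horizon):
    # shift(-h) pulls the value observed h hours later onto the current row.
    targets[f"h_{h}"] = demand.shift(-h)

# The last horizon - 1 rows have no complete future, so they are dropped.
targets = targets.dropna()
print(targets)
```

Each row reads left to right as consecutive hours, the same staircase pattern visible in the real target above.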

Next, we split the data into a training set and a test set:

end_train = '2014-12-31 23:59:59'
X_train = df.loc[:end_train]
y_train = y.loc[:end_train]

begin_test = '2014-12-31 17:59:59'
X_test = df.loc[begin_test:]
y_test = y.loc[begin_test:]
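The test set starts six hours before the training set ends. This is presumably deliberate: the largest lag used further down is 6 hours, so the first forecast of 2015 needs the last six hours of 2014 as input. The sketch below illustrates the effect on a made-up hourly index, with a plain pandas `shift` standing in for the lag transformer.

```python
import pandas as pd

index = pd.date_range("2014-12-31 12:00", "2015-01-01 06:00", freq="h")
toy = pd.DataFrame({"demand": range(len(index))}, index=index)

begin_test = "2014-12-31 17:59:59"
test = toy.loc[begin_test:]

# Largest lag used later in the pipeline.
lag_6 = test["demand"].shift(6).dropna()

print(test.index[0])   # first row kept by the slice
print(lag_6.index[0])  # first row with a complete 6-hour lag
```

The slice starts at 18:00 on 31 December, and the first row with a complete 6-hour lag falls at midnight on 1 January.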

Next, we set up LagFeatures and WindowFeatures to create features from lags and windows:

lagf = LagFeatures(
    variables=["demand"],
    periods=[1, 3, 6],
    missing_values="ignore",
    drop_na=True,
)


winf = WindowFeatures(
    variables=["demand"],
    window=["3h"],
    freq="1h",
    functions=["mean"],
    missing_values="ignore",
    drop_original=True,
    drop_na=True,
)
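To make the two feature types concrete, here is a rough pandas sketch of a 1-hour lag and a 3-hour window mean on a made-up series. It is an approximation of the idea, not the Feature-engine implementation, and the column names are chosen for illustration only.

```python
import pandas as pd

index = pd.date_range("2002-01-01", periods=6, freq="h")
toy = pd.DataFrame({"demand": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=index)

features = pd.DataFrame(index=index)

# Lag feature: the value observed 1 hour earlier.
features["demand_lag_1"] = toy["demand"].shift(1)

# Window feature: mean over a 3-hour window, then moved forward by 1 hour
# so that each row only sees values from before its own timestamp.
window_mean = toy["demand"].rolling("3h").mean().shift(freq="1h")
features["demand_window_3h_mean"] = window_mean.reindex(index)

print(features)
```

At 03:00 the lag is the 02:00 value and the window mean averages the 00:00, 01:00 and 02:00 values. The first row has no history at all, which is the kind of row that `drop_na=True` removes.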

We wrap the Lasso regression in a multi-output regressor so that it can predict multiple targets:

lasso = MultiOutputRegressor(Lasso(random_state=0, max_iter=10))
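`MultiOutputRegressor` fits one copy of the wrapped estimator per target column, which is what direct forecasting calls for. The sketch below shows this on random made-up data; the `alpha` value is arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 4))  # 50 rows, 4 features
Y_toy = rng.normal(size=(50, 3))  # 3 targets, e.g. 3 forecast steps

model = MultiOutputRegressor(Lasso(alpha=0.01, random_state=0))
model.fit(X_toy, Y_toy)

# One fitted Lasso per target column.
n_models = len(model.estimators_)
pred_shape = model.predict(X_toy).shape
print(n_models, pred_shape)
```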

Now, we assemble the Pipeline:

pipe = make_pipeline(lagf, winf, lasso)

print(pipe)

The step names were assigned automatically:

Pipeline(steps=[('lagfeatures',
                 LagFeatures(drop_na=True, missing_values='ignore',
                             periods=[1, 3, 6], variables=['demand'])),
                ('windowfeatures',
                 WindowFeatures(drop_na=True, drop_original=True, freq='1h',
                                functions=['mean'], missing_values='ignore',
                                variables=['demand'], window=['3h'])),
                ('multioutputregressor',
                 MultiOutputRegressor(estimator=Lasso(max_iter=10,
                                                      random_state=0)))])

Let's fit the pipeline:

pipe.fit(X_train, y_train)

Now, we can make forecasts for the test set:

forecast = pipe.predict(X_test)

# One column per forecast step; reuse the array computed above
forecasts = pd.DataFrame(
    forecast,
    columns=[f"step_{i+1}" for i in range(horizon)],
)

print(forecasts.head())

We see the 3-hour-ahead energy demand forecasts for each hour:

        step_1       step_2       step_3
8031.043352  8262.804811  8484.551733
7017.158081  7160.568853  7496.282999
6587.938171  6806.903940  7212.741943
6503.807479  6789.946587  7195.796841
6646.981390  6970.501840  7308.359237
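A natural next step is to score each forecast step separately. The sketch below computes a column-wise RMSE with NumPy on made-up numbers; the values are not results from this dataset. With the real data, the target would first have to be aligned with the forecasts, because the rows without complete lags are dropped during transformation and the forecast therefore has fewer rows than `X_test`.

```python
import numpy as np
import pandas as pd

# Made-up numbers, only to show the shape of the calculation.
actual = pd.DataFrame(
    {"h_0": [10.0, 12.0], "h_1": [11.0, 13.0], "h_2": [12.0, 14.0]}
)
predicted = pd.DataFrame(
    {"step_1": [11.0, 11.0], "step_2": [11.0, 13.0], "step_3": [15.0, 10.0]}
)

# Column-wise RMSE: one error per forecast step.
errors = actual.to_numpy() - predicted.to_numpy()
rmse_per_step = np.sqrt((errors ** 2).mean(axis=0))
print(rmse_per_step)
```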