make_pipeline#
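make_pipeline is a shorthand for Pipeline. To set up a Pipeline, we need to create tuples that pair a step name with a transformer or estimator. With make_pipeline, we only pass the sequence of transformers and estimators, and the names are added automatically.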
Setting up a Pipeline with make_pipeline#
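In the following example, we set up a Pipeline that first drops missing data, then replaces categories with ordinal numbers, and finally fits a Lasso regression model.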
import numpy as np
import pandas as pd
from feature_engine.imputation import DropMissingData
from feature_engine.encoding import OrdinalEncoder
from feature_engine.pipeline import make_pipeline
from sklearn.linear_model import Lasso
X = pd.DataFrame(
    dict(
        x1=[2, 1, 1, 0, np.nan],
        x2=["a", np.nan, "b", np.nan, "a"],
    )
)
y = pd.Series([1, 2, 3, 4, 5])
pipe = make_pipeline(
    DropMissingData(),
    OrdinalEncoder(encoding_method="arbitrary"),
    Lasso(random_state=10),
)
# fit the pipeline, then predict
pipe.fit(X, y)
preds_pipe = pipe.predict(X)
preds_pipe
In the output, we see the predictions made by the pipeline for the two rows that have no missing values:
array([2., 2.])
The names of the pipeline steps are assigned automatically:
print(pipe)
Pipeline(steps=[('dropmissingdata', DropMissingData()),
('ordinalencoder', OrdinalEncoder(encoding_method='arbitrary')),
('lasso', Lasso(random_state=10))])
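The step names in the printout above are the lowercased class names of each step. The following is a minimal pure-Python sketch of such a naming rule, not the library's implementation. The helper `auto_step_names` and the three placeholder classes are hypothetical, and the numeric suffix for repeated classes is an assumption modeled on the scikit-learn convention.

```python
from collections import Counter


def auto_step_names(steps):
    """Return one name per step: the lowercased class name.

    Hypothetical helper for illustration. Repeated class names get a
    numeric suffix (assumption, modeled on scikit-learn's convention).
    """
    names = [type(step).__name__.lower() for step in steps]
    totals = Counter(names)
    seen = Counter()
    unique = []
    for name in names:
        if totals[name] > 1:
            seen[name] += 1
            unique.append(f"{name}-{seen[name]}")
        else:
            unique.append(name)
    return unique


# Placeholder classes standing in for the real transformers and estimator.
class DropMissingData:
    pass


class OrdinalEncoder:
    pass


class Lasso:
    pass


names = auto_step_names([DropMissingData(), OrdinalEncoder(), Lasso()])
repeated = auto_step_names([Lasso(), Lasso()])
print(names)
print(repeated)
```

With distinct classes, the sketch reproduces the three names shown in the printed pipeline.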
The pipeline returned by make_pipeline has exactly the same characteristics as Pipeline. Hence, for additional guidance, check out the Pipeline documentation.
Forecasting#
Let's set up another pipeline to carry out direct forecasting:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.metrics import root_mean_squared_error
from sklearn.multioutput import MultiOutputRegressor
from feature_engine.timeseries.forecasting import (
    LagFeatures,
    WindowFeatures,
)
from feature_engine.pipeline import make_pipeline
We'll use the Australian electricity demand dataset described here:
Godahewa, Rakshitha, Bergmeir, Christoph, Webb, Geoff, Hyndman, Rob, & Montero-Manso, Pablo. (2021). Australian Electricity Demand Dataset (Version 1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.4659727
url = "https://raw.githubusercontent.com/tidyverts/tsibbledata/master/data-raw/vic_elec/VIC2015/demand.csv"
df = pd.read_csv(url)
df.drop(columns=["Industrial"], inplace=True)
# Convert the integer Date to an actual date with datetime type
df["date"] = df["Date"].apply(
    lambda x: pd.Timestamp("1899-12-30") + pd.Timedelta(x, unit="days")
)
# Create a timestamp from the integer Period representing 30 minute intervals
df["date_time"] = df["date"] + \
    pd.to_timedelta((df["Period"] - 1) * 30, unit="m")
df.dropna(inplace=True)
# Rename columns
df = df[["date_time", "OperationalLessIndustrial"]]
df.columns = ["date_time", "demand"]
# Resample to hourly
df = (
    df.set_index("date_time")
    .resample("h")
    .agg({"demand": "sum"})
)
print(df.head())
Here, we see the first rows of data:
demand
date_time
2002-01-01 00:00:00 6919.366092
2002-01-01 01:00:00 7165.974188
2002-01-01 02:00:00 6406.542994
2002-01-01 03:00:00 5815.537828
2002-01-01 04:00:00 5497.732922
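The date handling in the loading code can be checked on a single value. The sketch below applies the same two conversions to one day count and two period numbers. The day count 37257 is a hypothetical input chosen so that the result lands on the first date shown in the output above. That the integer dates follow the spreadsheet serial-date convention is an assumption suggested by the 1899-12-30 origin.

```python
import pandas as pd

# Origin used in the loading code; it matches the spreadsheet
# serial-date convention (assumption about how the file was produced).
origin = pd.Timestamp("1899-12-30")

serial_date = 37257  # hypothetical integer Date value
date = origin + pd.Timedelta(serial_date, unit="days")

# Period runs over the 30-minute intervals of a day: period 1 starts at
# 00:00, and period 48 (the last half hour of a day) starts at 23:30.
first_period = date + pd.to_timedelta((1 - 1) * 30, unit="min")
last_period = date + pd.to_timedelta((48 - 1) * 30, unit="min")

print(date)
print(first_period)
print(last_period)
```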
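We'll predict the next 3 hours of energy demand using direct forecasting. Because the features we create later only use past values of the series, the targets are the demand at the current hour and at the following two hours. Let's create the target variable: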
horizon = 3
y = pd.DataFrame(index=df.index)
for h in range(horizon):
    y[f"h_{h}"] = df.shift(periods=-h, freq="h")
y.dropna(inplace=True)
df = df.loc[y.index]
print(y.head())
This is our target variable:
h_0 h_1 h_2
date_time
2002-01-01 00:00:00 6919.366092 7165.974188 6406.542994
2002-01-01 01:00:00 7165.974188 6406.542994 5815.537828
2002-01-01 02:00:00 6406.542994 5815.537828 5497.732922
2002-01-01 03:00:00 5815.537828 5497.732922 5385.851060
2002-01-01 04:00:00 5497.732922 5385.851060 5574.731890
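The way the target columns are built is easier to follow on a tiny series. The sketch below repeats the same shifting logic on five made-up hourly values. The toy data and the names `toy` and `targets` are illustrative and not part of the original example.

```python
import pandas as pd

# Five made-up hourly demand values.
index = pd.date_range("2024-01-01", periods=5, freq="h")
toy = pd.DataFrame({"demand": [10.0, 20.0, 30.0, 40.0, 50.0]}, index=index)

horizon = 3
targets = pd.DataFrame(index=toy.index)
for h in range(horizon):
    # Shifting the index back by h hours places the value observed at
    # t + h on row t, so each row holds the next `horizon` values.
    targets[f"h_{h}"] = toy["demand"].shift(periods=-h, freq="h")

# The last rows have no future values and are dropped.
targets = targets.dropna()
print(targets)
```

Each row of `targets` holds the value at that hour followed by the next two, which is the layout of the target table above. The final `horizon - 1` rows are lost because their future values do not exist.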
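Next, we split the data into a training set and a test set. The test set starts 6 hours before the end of the training period, so that the lag features, which look back up to 6 hours, can be computed for the first forecasting point in the test set: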
end_train = '2014-12-31 23:59:59'
X_train = df.loc[:end_train]
y_train = y.loc[:end_train]
begin_test = '2014-12-31 17:59:59'
X_test = df.loc[begin_test:]
y_test = y.loc[begin_test:]
Next, we set up LagFeatures and WindowFeatures to create features from lags and windows:
lagf = LagFeatures(
    variables=["demand"],
    periods=[1, 3, 6],
    missing_values="ignore",
    drop_na=True,
)
winf = WindowFeatures(
    variables=["demand"],
    window=["3h"],
    freq="1h",
    functions=["mean"],
    missing_values="ignore",
    drop_original=True,
    drop_na=True,
)
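To get a feel for what these two transformers produce, the sketch below builds comparable features with plain pandas on a toy series. It approximates the idea and is not the transformers' actual implementation. The toy data and the feature column names are assumptions made for this illustration.

```python
import pandas as pd

# Eight made-up hourly demand values.
index = pd.date_range("2024-01-01", periods=8, freq="h")
toy = pd.DataFrame(
    {"demand": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0]},
    index=index,
)

feats = pd.DataFrame(index=toy.index)

# Lag features: the demand observed 1, 3 and 6 hours earlier.
for lag in (1, 3, 6):
    feats[f"demand_lag_{lag}"] = toy["demand"].shift(lag)

# Window feature: mean over a 3-hour window, moved forward by 1 hour so
# that the window for row t only contains values observed before t.
rolled = toy["demand"].rolling("3h").mean()
feats["demand_window_3h_mean"] = rolled.shift(periods=1, freq="h")

# Rows without a complete history are dropped, and the original column
# is not carried over, so only past information remains.
feats = feats.dropna()
print(feats)
```

The first six rows disappear because the 6-hour lag is not available for them. This is the reason the test set needs to start a few hours before the first point we want to forecast.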
We wrap the Lasso regression within a multi-output regressor to predict multiple targets:
lasso = MultiOutputRegressor(Lasso(random_state=0, max_iter=10))
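As a small illustration of what the wrapper does, the sketch below fits MultiOutputRegressor on random toy data with three target columns. It fits one copy of the base estimator per target. The toy arrays and the `alpha` value are assumptions made for the example and are unrelated to the electricity data.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 4))

# Three made-up targets, each a different function of the features.
Y_toy = np.column_stack(
    [X_toy[:, 0], 2.0 * X_toy[:, 1], X_toy.sum(axis=1)]
)

model = MultiOutputRegressor(Lasso(alpha=0.01, random_state=0))
model.fit(X_toy, Y_toy)

# One fitted Lasso per target column.
n_fitted = len(model.estimators_)

# Predictions have one column per target.
preds_toy = model.predict(X_toy[:5])
print(n_fitted, preds_toy.shape)
```

In direct forecasting, each of these per-target models corresponds to one step of the forecasting horizon.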
Now, we assemble the Pipeline:
pipe = make_pipeline(lagf, winf, lasso)
print(pipe)
The names of the steps are assigned automatically:
Pipeline(steps=[('lagfeatures',
LagFeatures(drop_na=True, missing_values='ignore',
periods=[1, 3, 6], variables=['demand'])),
('windowfeatures',
WindowFeatures(drop_na=True, drop_original=True, freq='1h',
functions=['mean'], missing_values='ignore',
variables=['demand'], window=['3h'])),
('multioutputregressor',
MultiOutputRegressor(estimator=Lasso(max_iter=10,
random_state=0)))])
Let's fit the Pipeline:
pipe.fit(X_train, y_train)
Now, we can make forecasts for the test set:
forecast = pipe.predict(X_test)
forecasts = pd.DataFrame(
    forecast,
    columns=[f"step_{i+1}" for i in range(horizon)],
)
print(forecasts.head())
We see the 3-hour-ahead energy demand forecasts for each hour:
step_1 step_2 step_3
0 8031.043352 8262.804811 8484.551733
1 7017.158081 7160.568853 7496.282999
2 6587.938171 6806.903940 7212.741943
3 6503.807479 6789.946587 7195.796841
4 6646.981390 6970.501840 7308.359237
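The imports at the start of this section include root_mean_squared_error, which suggests evaluating the forecasts per step of the horizon. The sketch below shows that calculation with NumPy on made-up numbers. The arrays are illustrative and are not results from the model above. With the real data, the rows of y_test would first need to be aligned with the rows that remain after the lag and window transformers drop incomplete observations.

```python
import numpy as np

# Made-up true values and forecasts: rows are forecasting points,
# columns are the steps of the horizon.
y_true = np.array([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]])
y_pred = np.array([[1.0, 2.0, 5.0], [2.0, 5.0, 4.0]])

# Root mean squared error computed separately for each step.
rmse_per_step = np.sqrt(((y_true - y_pred) ** 2).mean(axis=0))

for step, value in enumerate(rmse_per_step, start=1):
    print(f"step_{step}: {value:.3f}")
```

Reporting one error per step makes it visible whether accuracy degrades as the forecast reaches further ahead.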