Exogenous features

Use exogenous regressors for training and predicting

import lightgbm as lgb
import pandas as pd
from mlforecast import MLForecast
from mlforecast.lag_transforms import ExpandingMean, RollingMean
from mlforecast.utils import generate_daily_series, generate_prices_for_series

Data setup

series = generate_daily_series(
    100, equal_ends=True, n_static_features=2
).rename(columns={'static_1': 'product_id'})
series.head()
unique_id ds y static_0 product_id
0 id_00 2000-10-05 39.811983 79 45
1 id_00 2000-10-06 103.274013 79 45
2 id_00 2000-10-07 176.574744 79 45
3 id_00 2000-10-08 258.987900 79 45
4 id_00 2000-10-09 344.940404 79 45

Use existing exogenous features

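In mlforecast the required columns are the series identifier, the time and the target. Any extra columns you have, like static_0 and product_id here, are considered static and are replicated when the features for the next timestamp are built. You can disable this by passing static_features to MLForecast.preprocess or MLForecast.fit, which keeps only the columns you list there as static. Keep in mind that every feature in the input dataframe is used for training, so you must provide the future values of the exogenous features to MLForecast.predict through the X_df argument.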
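As a rough illustration of what "replicated" means here, the following pandas sketch builds the next-timestamp rows by hand: it takes the last row of each series, keeps the static columns and advances the date by one day. This is not mlforecast's internal implementation. The toy frame, the daily frequency and the helper name next_step_static are assumptions made up for the example.

```python
import pandas as pd

# Toy data: two series with one static column each (values are made up).
toy = pd.DataFrame({
    'unique_id': ['a', 'a', 'b', 'b'],
    'ds': pd.to_datetime(['2000-01-01', '2000-01-02', '2000-01-01', '2000-01-02']),
    'y': [1.0, 2.0, 3.0, 4.0],
    'static_0': [79, 79, 12, 12],
})

def next_step_static(df, static_cols):
    """Hypothetical helper: carry static columns forward to the next day."""
    # Last observed row of every series.
    last = df.sort_values(['unique_id', 'ds']).groupby('unique_id').tail(1)
    out = last[['unique_id', 'ds', *static_cols]].copy()
    # Assumes daily data: the next timestamp is one day after the last one.
    out['ds'] = out['ds'] + pd.Timedelta(days=1)
    return out.reset_index(drop=True)

next_rows = next_step_static(toy, ['static_0'])
print(next_rows)
```

A column such as price cannot be carried forward this way because its value changes over time, which is why dynamic features have to be supplied explicitly for the forecast dates.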

Consider the following example. Suppose we have a catalog of prices for each id and date.

prices_catalog = generate_prices_for_series(series)
prices_catalog.head()
ds unique_id price
0 2000-10-05 id_00 0.548814
1 2000-10-06 id_00 0.715189
2 2000-10-07 id_00 0.602763
3 2000-10-08 id_00 0.544883
4 2000-10-09 id_00 0.423655

And suppose you have already merged these prices into your series dataframe.

series_with_prices = series.merge(prices_catalog, how='left')
series_with_prices.head()
unique_id ds y static_0 product_id price
0 id_00 2000-10-05 39.811983 79 45 0.548814
1 id_00 2000-10-06 103.274013 79 45 0.715189
2 id_00 2000-10-07 176.574744 79 45 0.602763
3 id_00 2000-10-08 258.987900 79 45 0.544883
4 id_00 2000-10-09 344.940404 79 45 0.423655
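Because a left merge silently leaves NaN wherever the catalog has no matching row, it can be worth checking the result before training. The sketch below uses small made-up frames in place of series and prices_catalog; the explicit on= keys and the validate='one_to_one' check are additions for illustration, not something the original example requires.

```python
import pandas as pd

# Toy frames standing in for `series` and `prices_catalog` (values are made up).
toy_series = pd.DataFrame({
    'unique_id': ['a', 'a', 'b'],
    'ds': pd.to_datetime(['2000-01-01', '2000-01-02', '2000-01-01']),
    'y': [1.0, 2.0, 3.0],
})
toy_prices = pd.DataFrame({
    'unique_id': ['a', 'a', 'a', 'b', 'b'],
    'ds': pd.to_datetime(
        ['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-01', '2000-01-02']
    ),
    'price': [0.5, 0.7, 0.6, 0.4, 0.9],
})

merged = toy_series.merge(
    toy_prices,
    on=['unique_id', 'ds'],   # spell out the join keys
    how='left',               # keep every training row
    validate='one_to_one',    # raise if a key is duplicated on either side
)

# Any NaN here means a training row has no price in the catalog.
missing = int(merged['price'].isna().sum())
print(merged)
print('rows without a price:', missing)
```

The catalog rows for dates after the end of each series are dropped by the left merge; those are the rows that are later needed at prediction time.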

This dataframe will be passed to MLForecast.fit (or MLForecast.preprocess). However, since the price is dynamic, we have to tell that method that only static_0 and product_id are static.

fcst = MLForecast(
    models=lgb.LGBMRegressor(n_jobs=1, random_state=0, verbosity=-1),
    freq='D',
    lags=[7],
    lag_transforms={
        1: [ExpandingMean()],
        7: [RollingMean(window_size=14)],
    },
    date_features=['dayofweek', 'month'],
    num_threads=2,
)
fcst.fit(series_with_prices, static_features=['static_0', 'product_id'])
MLForecast(models=[LGBMRegressor], freq=D, lag_features=['lag7', 'expanding_mean_lag1', 'rolling_mean_lag7_window_size14'], date_features=['dayofweek', 'month'], num_threads=2)

The features used for training are stored in MLForecast.ts.features_order_. As you can see, price was used for training.

fcst.ts.features_order_
['static_0',
 'product_id',
 'price',
 'lag7',
 'expanding_mean_lag1',
 'rolling_mean_lag7_window_size14',
 'dayofweek',
 'month']

So, to update the price at each timestep, we just call MLForecast.predict with our forecast horizon and pass the prices catalog through X_df.

preds = fcst.predict(h=7, X_df=prices_catalog)
preds.head()
unique_id ds LGBMRegressor
0 id_00 2001-05-15 418.930093
1 id_00 2001-05-16 499.487368
2 id_00 2001-05-17 20.321885
3 id_00 2001-05-18 102.310778
4 id_00 2001-05-19 185.340281
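Since the dataframe passed through X_df has to contain a row for every series and every date in the horizon, a quick coverage check before predicting can save a confusing error later. The following is a minimal sketch that assumes daily data and the default column names; missing_future_rows is a hypothetical helper written for this example and is not part of mlforecast.

```python
import pandas as pd

def missing_future_rows(train, X_df, h, id_col='unique_id', time_col='ds'):
    """Hypothetical helper: (id, date) pairs needed for an h-day forecast
    that are absent from X_df. Assumes daily frequency."""
    last = train.groupby(id_col)[time_col].max()
    # Every (series, future date) combination the forecast will need.
    needed = pd.DataFrame(
        [
            (uid, end + pd.Timedelta(days=step))
            for uid, end in last.items()
            for step in range(1, h + 1)
        ],
        columns=[id_col, time_col],
    )
    check = needed.merge(X_df[[id_col, time_col]], how='left', indicator=True)
    absent = check['_merge'] == 'left_only'
    return check.loc[absent, [id_col, time_col]].reset_index(drop=True)

# Toy data (made up): series 'b' lacks a price for the second forecast day.
train = pd.DataFrame({
    'unique_id': ['a', 'a', 'b', 'b'],
    'ds': pd.to_datetime(['2000-01-01', '2000-01-02', '2000-01-01', '2000-01-02']),
    'y': [1.0, 2.0, 3.0, 4.0],
})
catalog = pd.DataFrame({
    'unique_id': ['a', 'a', 'b'],
    'ds': pd.to_datetime(['2000-01-03', '2000-01-04', '2000-01-03']),
    'price': [0.6, 0.5, 0.9],
})

gaps = missing_future_rows(train, catalog, h=2)
print(gaps)
```

An empty result means the catalog covers the whole horizon for every series.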

Generating exogenous features

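Nixtla provides some utilities to generate exogenous features for both training and forecasting, such as statsforecast's mstl_decomposition or the transform_exog function. We also have utilsforecast's fourier function, which we'll demonstrate here.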

from sklearn.linear_model import LinearRegression
from utilsforecast.feature_engineering import fourier

Suppose you start with some data like the one above, where we have a few static features.

series.head()
unique_id ds y static_0 product_id
0 id_00 2000-10-05 39.811983 79 45
1 id_00 2000-10-06 103.274013 79 45
2 id_00 2000-10-07 176.574744 79 45
3 id_00 2000-10-08 258.987900 79 45
4 id_00 2000-10-09 344.940404 79 45

Now we'd like to add some Fourier terms to model the seasonality. We can do that with the following:

transformed_df, future_df = fourier(series, freq='D', season_length=7, k=2, h=7)

This provides an extended training dataset.

transformed_df.head()
unique_id ds y static_0 product_id sin1_7 sin2_7 cos1_7 cos2_7
0 id_00 2000-10-05 39.811983 79 45 0.781832 0.974928 0.623490 -0.222521
1 id_00 2000-10-06 103.274013 79 45 0.974928 -0.433884 -0.222521 -0.900969
2 id_00 2000-10-07 176.574744 79 45 0.433884 -0.781831 -0.900969 0.623490
3 id_00 2000-10-08 258.987900 79 45 -0.433884 0.781832 -0.900969 0.623490
4 id_00 2000-10-09 344.940404 79 45 -0.974928 0.433884 -0.222521 -0.900969

Along with the future values of the features.

future_df.head()
unique_id ds sin1_7 sin2_7 cos1_7 cos2_7
0 id_00 2001-05-15 -0.781828 -0.974930 0.623494 -0.222511
1 id_00 2001-05-16 0.000006 0.000011 1.000000 1.000000
2 id_00 2001-05-17 0.781835 0.974925 0.623485 -0.222533
3 id_00 2001-05-18 0.974927 -0.433895 -0.222527 -0.900963
4 id_00 2001-05-19 0.433878 -0.781823 -0.900972 0.623500
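To make the column names concrete, the sketch below computes the same kind of terms directly with NumPy: for harmonic i and time step t, the sine and cosine of 2πit/season_length. It assumes the time index counts the rows of a series starting at 1, which reproduces the first rows shown above for id_00; under that assumption the first forecast date is step 223, given 222 daily rows from 2000-10-05 to 2001-05-14. The function fourier_terms is written for this example and is not the utilsforecast implementation.

```python
import numpy as np

def fourier_terms(t, season_length, k):
    """Return sin/cos terms for harmonics 1..k at integer time steps t."""
    t = np.asarray(t, dtype=float)
    terms = {}
    for i in range(1, k + 1):
        angle = 2 * np.pi * i * t / season_length
        terms[f'sin{i}_{season_length}'] = np.sin(angle)
        terms[f'cos{i}_{season_length}'] = np.cos(angle)
    return terms

# First five training steps of id_00 (assumed to be t = 1..5).
train_terms = fourier_terms(np.arange(1, 6), season_length=7, k=2)
# First five forecast steps (assumed to be t = 223..227).
future_terms = fourier_terms(np.arange(223, 228), season_length=7, k=2)

for name in ['sin1_7', 'sin2_7', 'cos1_7', 'cos2_7']:
    print(name, np.round(train_terms[name], 6), np.round(future_terms[name], 6))
```

The values agree with the tables above to about five decimal places. Because the terms depend only on the time index, they can be computed for any future date, which is what makes them usable as exogenous features at prediction time.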

We can now train using only these features (and the static ones).

fcst2 = MLForecast(models=LinearRegression(), freq='D')
fcst2.fit(transformed_df, static_features=['static_0', 'product_id'])
MLForecast(models=[LinearRegression], freq=D, lag_features=[], date_features=[], num_threads=1)

And provide the future values to the predict method.

fcst2.predict(h=7, X_df=future_df).head()
unique_id ds LinearRegression
0 id_00 2001-05-15 275.822342
1 id_00 2001-05-16 262.258117
2 id_00 2001-05-17 238.195850
3 id_00 2001-05-18 240.997814
4 id_00 2001-05-19 262.247123

import numpy as np
from mlforecast.callbacks import SaveFeatures

# check that the prices are passed correctly at each step
first_pred_date = series_with_prices['ds'].max() + pd.offsets.Day()
save_feats = SaveFeatures()
fcst.predict(7, X_df=prices_catalog, before_predict_callback=save_feats)
for h, actual in enumerate(save_feats._inputs):
    expected = prices_catalog.loc[prices_catalog['ds'].eq(first_pred_date + h * pd.offsets.Day())]
    np.testing.assert_allclose(
        actual['price'].values,
        expected['price'].values,
    )
preds2 = fcst.predict(7, X_df=prices_catalog)
preds3 = fcst.predict(7, new_df=series_with_prices, X_df=prices_catalog)

pd.testing.assert_frame_equal(preds, preds2)
pd.testing.assert_frame_equal(preds, preds3)
# check that cross validation works with exogenous features
# without passing any extra information
fcst.cross_validation(
    series_with_prices,
    h=7,
    n_windows=2,
    static_features=['static_0', 'product_id'],
);
