# Exogenous features

Use exogenous regressors for training and predicting.
```python
import lightgbm as lgb
import pandas as pd

from mlforecast import MLForecast
from mlforecast.lag_transforms import ExpandingMean, RollingMean
from mlforecast.utils import generate_daily_series, generate_prices_for_series
```
## Data setup
```python
series = generate_daily_series(
    100, equal_ends=True, n_static_features=2
).rename(columns={'static_1': 'product_id'})
series.head()
```
|   | unique_id | ds | y | static_0 | product_id |
|---|---|---|---|---|---|
| 0 | id_00 | 2000-10-05 | 39.811983 | 79 | 45 |
| 1 | id_00 | 2000-10-06 | 103.274013 | 79 | 45 |
| 2 | id_00 | 2000-10-07 | 176.574744 | 79 | 45 |
| 3 | id_00 | 2000-10-08 | 258.987900 | 79 | 45 |
| 4 | id_00 | 2000-10-09 | 344.940404 | 79 | 45 |
## Use existing exogenous features
In mlforecast the required columns are the series identifier, time and target. Any extra columns you have, like `static_0` and `product_id` here, are considered to be static and are replicated when constructing the features for the next timestamp. You can disable this by passing `static_features` to `MLForecast.preprocess` or `MLForecast.fit`, which will make it only keep the columns you define there as static. Keep in mind that all features in your input dataframe will be used for training, so you'll have to provide the future values of exogenous features to `MLForecast.predict` through the `X_df` argument.
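To make that default concrete, here is a minimal sketch (the `sketch`, `prep_default` and `prep_restricted` names are illustrative, not part of this tutorial): without `static_features`, every extra column is treated as static, while restricting the set turns the remaining columns into dynamic exogenous features whose future values you would have to supply through `X_df`.

```python
# Illustrative sketch: how static_features changes what preprocess treats as static.
sketch = MLForecast(models=lgb.LGBMRegressor(), freq='D', lags=[1])

# Default: both 'static_0' and 'product_id' are treated as static features.
prep_default = sketch.preprocess(series)

# Restricted: only 'static_0' is static; 'product_id' is now a dynamic exogenous
# feature, so its future values would have to be passed via X_df at predict time.
prep_restricted = sketch.preprocess(series, static_features=['static_0'])
```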
Consider the following example. Suppose that we have a prices catalog for each id and date.
```python
prices_catalog = generate_prices_for_series(series)
prices_catalog.head()
```
|   | ds | unique_id | price |
|---|---|---|---|
| 0 | 2000-10-05 | id_00 | 0.548814 |
| 1 | 2000-10-06 | id_00 | 0.715189 |
| 2 | 2000-10-07 | id_00 | 0.602763 |
| 3 | 2000-10-08 | id_00 | 0.544883 |
| 4 | 2000-10-09 | id_00 | 0.423655 |
And you have already merged these prices into your series dataframe.
```python
series_with_prices = series.merge(prices_catalog, how='left')
series_with_prices.head()
```
|   | unique_id | ds | y | static_0 | product_id | price |
|---|---|---|---|---|---|---|
| 0 | id_00 | 2000-10-05 | 39.811983 | 79 | 45 | 0.548814 |
| 1 | id_00 | 2000-10-06 | 103.274013 | 79 | 45 | 0.715189 |
| 2 | id_00 | 2000-10-07 | 176.574744 | 79 | 45 | 0.602763 |
| 3 | id_00 | 2000-10-08 | 258.987900 | 79 | 45 | 0.544883 |
| 4 | id_00 | 2000-10-09 | 344.940404 | 79 | 45 | 0.423655 |
This dataframe will be passed to `MLForecast.fit` (or `MLForecast.preprocess`). However, since the price is dynamic we have to tell that method that only `static_0` and `product_id` are static.
```python
fcst = MLForecast(
    models=lgb.LGBMRegressor(n_jobs=1, random_state=0, verbosity=-1),
    freq='D',
    lags=[7],
    lag_transforms={
        1: [ExpandingMean()],
        7: [RollingMean(window_size=14)],
    },
    date_features=['dayofweek', 'month'],
    num_threads=2,
)
fcst.fit(series_with_prices, static_features=['static_0', 'product_id'])
```
```
MLForecast(models=[LGBMRegressor], freq=D, lag_features=['lag7', 'expanding_mean_lag1', 'rolling_mean_lag7_window_size14'], date_features=['dayofweek', 'month'], num_threads=2)
```
The features used for training are stored in `MLForecast.ts.features_order_`. As you can see, `price` was used for training.
```python
fcst.ts.features_order_
```

```
['static_0',
 'product_id',
 'price',
 'lag7',
 'expanding_mean_lag1',
 'rolling_mean_lag7_window_size14',
 'dayofweek',
 'month']
```
So in order to update the price at each timestep we just call `MLForecast.predict` with our forecast horizon and additionally pass the prices catalog through `X_df`.
```python
preds = fcst.predict(h=7, X_df=prices_catalog)
preds.head()
```
|   | unique_id | ds | LGBMRegressor |
|---|---|---|---|
| 0 | id_00 | 2001-05-15 | 418.930093 |
| 1 | id_00 | 2001-05-16 | 499.487368 |
| 2 | id_00 | 2001-05-17 | 20.321885 |
| 3 | id_00 | 2001-05-18 | 102.310778 |
| 4 | id_00 | 2001-05-19 | 185.340281 |
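As a side note (a sketch, not part of the original example, under the assumption that `X_df` only needs rows covering the forecast dates): passing just the future portion of the catalog should give the same predictions as passing the full catalog. The `future_prices` name is illustrative.

```python
# Sketch: restrict X_df to the rows that cover the forecast horizon.
future_prices = prices_catalog[prices_catalog['ds'] > series_with_prices['ds'].max()]
preds_future_only = fcst.predict(h=7, X_df=future_prices)  # expected to match preds above
```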
## Generating exogenous features
Nixtla provides some utilities to generate exogenous features for both training and forecasting, such as statsforecast's mstl_decomposition or the transform_exog function. We also have the fourier function from utilsforecast, which we'll demonstrate here.
```python
from sklearn.linear_model import LinearRegression
from utilsforecast.feature_engineering import fourier
```
Suppose you start with some data like the one above, where we have a couple of static features.
```python
series.head()
```
|   | unique_id | ds | y | static_0 | product_id |
|---|---|---|---|---|---|
| 0 | id_00 | 2000-10-05 | 39.811983 | 79 | 45 |
| 1 | id_00 | 2000-10-06 | 103.274013 | 79 | 45 |
| 2 | id_00 | 2000-10-07 | 176.574744 | 79 | 45 |
| 3 | id_00 | 2000-10-08 | 258.987900 | 79 | 45 |
| 4 | id_00 | 2000-10-09 | 344.940404 | 79 | 45 |
Now we'd like to add some Fourier terms to model the seasonality. We can do that with the following:
```python
transformed_df, future_df = fourier(series, freq='D', season_length=7, k=2, h=7)
```
This provides an extended training dataset.
```python
transformed_df.head()
```
|   | unique_id | ds | y | static_0 | product_id | sin1_7 | sin2_7 | cos1_7 | cos2_7 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | id_00 | 2000-10-05 | 39.811983 | 79 | 45 | 0.781832 | 0.974928 | 0.623490 | -0.222521 |
| 1 | id_00 | 2000-10-06 | 103.274013 | 79 | 45 | 0.974928 | -0.433884 | -0.222521 | -0.900969 |
| 2 | id_00 | 2000-10-07 | 176.574744 | 79 | 45 | 0.433884 | -0.781831 | -0.900969 | 0.623490 |
| 3 | id_00 | 2000-10-08 | 258.987900 | 79 | 45 | -0.433884 | 0.781832 | -0.900969 | 0.623490 |
| 4 | id_00 | 2000-10-09 | 344.940404 | 79 | 45 | -0.974928 | 0.433884 | -0.222521 | -0.900969 |
Along with the future values of the features.
```python
future_df.head()
```
|   | unique_id | ds | sin1_7 | sin2_7 | cos1_7 | cos2_7 |
|---|---|---|---|---|---|---|
| 0 | id_00 | 2001-05-15 | -0.781828 | -0.974930 | 0.623494 | -0.222511 |
| 1 | id_00 | 2001-05-16 | 0.000006 | 0.000011 | 1.000000 | 1.000000 |
| 2 | id_00 | 2001-05-17 | 0.781835 | 0.974925 | 0.623485 | -0.222533 |
| 3 | id_00 | 2001-05-18 | 0.974927 | -0.433895 | -0.222527 | -0.900963 |
| 4 | id_00 | 2001-05-19 | 0.433878 | -0.781823 | -0.900972 | 0.623500 |
We can now train using only these features (plus the static ones).
```python
fcst2 = MLForecast(models=LinearRegression(), freq='D')
fcst2.fit(transformed_df, static_features=['static_0', 'product_id'])
```
```
MLForecast(models=[LinearRegression], freq=D, lag_features=[], date_features=[], num_threads=1)
```
And provide the future values to the predict method.
```python
fcst2.predict(h=7, X_df=future_df).head()
```
|   | unique_id | ds | LinearRegression |
|---|---|---|---|
| 0 | id_00 | 2001-05-15 | 275.822342 |
| 1 | id_00 | 2001-05-16 | 262.258117 |
| 2 | id_00 | 2001-05-17 | 238.195850 |
| 3 | id_00 | 2001-05-18 | 240.997814 |
| 4 | id_00 | 2001-05-19 | 262.247123 |
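The `mstl_decomposition` helper from statsforecast mentioned earlier follows the same train/future pattern. The sketch below is an illustration under assumptions (it assumes statsforecast is installed and exposes `mstl_decomposition(df, model, freq, h)` in its `feature_engineering` module; `fcst3` and the other names are illustrative), not part of this tutorial.

```python
from statsforecast.feature_engineering import mstl_decomposition
from statsforecast.models import MSTL

# Decompose each series into trend and seasonal components to use as features.
mstl_train, mstl_future = mstl_decomposition(
    series[['unique_id', 'ds', 'y']], model=MSTL(season_length=7), freq='D', h=7
)
fcst3 = MLForecast(models=LinearRegression(), freq='D')
fcst3.fit(mstl_train)
mstl_preds = fcst3.predict(h=7, X_df=mstl_future)
```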
```python
import numpy as np

from mlforecast.callbacks import SaveFeatures

# check that the prices are passed correctly to the model at each horizon
first_pred_date = series_with_prices['ds'].max() + pd.offsets.Day()
save_feats = SaveFeatures()
fcst.predict(7, X_df=prices_catalog, before_predict_callback=save_feats)
for h, actual in enumerate(save_feats._inputs):
    expected = prices_catalog.loc[prices_catalog['ds'].eq(first_pred_date + h * pd.offsets.Day())]
    np.testing.assert_allclose(
        actual['price'].values,
        expected['price'].values,
    )
```
```python
# predictions are the same whether we rely on the stored series or pass new_df
preds2 = fcst.predict(7, X_df=prices_catalog)
preds3 = fcst.predict(7, new_df=series_with_prices, X_df=prices_catalog)
pd.testing.assert_frame_equal(preds, preds2)
pd.testing.assert_frame_equal(preds, preds3)
```
```python
# cross validation can also be computed with exogenous regressors,
# without providing any extra information
fcst.cross_validation(
    series_with_prices,
    h=7,
    n_windows=2,
    static_features=['static_0', 'product_id'],
);
```