Lag transformations


Compute features based on lags

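mlforecast lets you define transformations on the lags to use as features. These are provided through the lag_transforms argument, a dict whose keys are the lags and whose values are lists of transformations to apply to that lag.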
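
To make "a transformation on a lag" concrete, here is a minimal pure-Python sketch of the idea: shift the series first, then aggregate over a trailing window. The helpers `lag` and `rolling_mean` are hypothetical names used for illustration only; they are not mlforecast functions and make no claim about how the library computes its features.

```python
from statistics import mean


def lag(values, k):
    """Shift a series k steps into the past; the first k positions have no value."""
    return ([None] * k + values)[:len(values)]


def rolling_mean(values, window_size):
    """Mean over the trailing window; None until the window is full of real values."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - window_size + 1): i + 1]
        if len(window) < window_size or any(v is None for v in window):
            out.append(None)
        else:
            out.append(mean(window))
    return out


y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
# "Lag 1, rolling mean over 3 samples": shift by one step, then average.
feature = rolling_mean(lag(y, 1), window_size=3)
print(feature)
```

In this sketch the feature at position t only uses values strictly before t, and the first defined value appears at index lag + window_size - 1 (here 1 + 3 - 1 = 3).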

Data setup

import numpy as np

from mlforecast import MLForecast
from mlforecast.utils import generate_daily_series
data = generate_daily_series(10)

Built-in transformations

The built-in lag transformations are in the mlforecast.lag_transforms module.

from mlforecast.lag_transforms import RollingMean, ExpandingStd
fcst = MLForecast(
    models=[],
    freq='D',
    lag_transforms={
        1: [ExpandingStd()],
        7: [RollingMean(window_size=7, min_samples=1), RollingMean(window_size=14)]
    },
)

Once you have defined your transformations, you can use MLForecast.preprocess to see what they look like.

fcst.preprocess(data).head(2)
unique_id ds y expanding_std_lag1 rolling_mean_lag7_window_size7_min_samples1 rolling_mean_lag7_window_size14
20 id_0 2000-01-21 6.319961 1.956363 3.234486 3.283064
21 id_0 2000-01-22 0.071677 2.028545 3.256055 3.291068

Extending the built-in transformations

You can combine built-in transformations with the Combine class, which takes two transformations and an operator.

import operator

from mlforecast.lag_transforms import Combine
fcst = MLForecast(
    models=[],
    freq='D',
    lag_transforms={
        1: [
            RollingMean(window_size=7),
            RollingMean(window_size=14),
            Combine(
                RollingMean(window_size=7),
                RollingMean(window_size=14),
                operator.truediv,
            )
        ],
    },
)
prep = fcst.preprocess(data)
prep.head(2)
unique_id ds y rolling_mean_lag1_window_size7 rolling_mean_lag1_window_size14 rolling_mean_lag1_window_size7_truediv_rolling_mean_lag1_window_size14
14 id_0 2000-01-15 0.435006 3.234486 3.283064 0.985204
15 id_0 2000-01-16 1.489309 3.256055 3.291068 0.989361
np.testing.assert_allclose(
    prep['rolling_mean_lag1_window_size7'] / prep['rolling_mean_lag1_window_size14'],
    prep['rolling_mean_lag1_window_size7_truediv_rolling_mean_lag1_window_size14']
)
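
The check above says that the combined column equals one feature divided by the other, element by element. A minimal sketch of that behaviour in plain Python follows; `combine` is a hypothetical helper and the input values are made up for illustration, so this is not the library's implementation.

```python
import operator


def combine(left, right, op):
    """Apply a binary operator element-wise, keeping None where either side is missing."""
    return [
        None if a is None or b is None else op(a, b)
        for a, b in zip(left, right)
    ]


# Made-up feature values; None marks positions where a window is not yet full.
short_mean = [None, 2.0, 3.0, 4.0]
long_mean = [None, None, 2.0, 4.0]

ratio = combine(short_mean, long_mean, operator.truediv)
print(ratio)
```

The combined feature is only defined where both inputs are defined, which is why the longer window determines where the first usable row appears.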

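If you want one of the transformations in Combine to be applied to a different lag, you can use the Offset class, which applies the offset first and then the transformation.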

from mlforecast.lag_transforms import Offset
fcst = MLForecast(
    models=[],
    freq='D',
    lag_transforms={
        1: [
            RollingMean(window_size=7),
            Combine(
                RollingMean(window_size=7),
                Offset(RollingMean(window_size=7), n=1),
                operator.truediv,
            )
        ],
        2: [RollingMean(window_size=7)]
    },
)
prep = fcst.preprocess(data)
prep.head(2)
unique_id ds y rolling_mean_lag1_window_size7 rolling_mean_lag1_window_size7_truediv_rolling_mean_lag2_window_size7 rolling_mean_lag2_window_size7
8 id_0 2000-01-09 1.462798 3.326081 0.998331 3.331641
9 id_0 2000-01-10 2.035518 3.360938 1.010480 3.326081
np.testing.assert_allclose(
    prep['rolling_mean_lag1_window_size7'] / prep['rolling_mean_lag2_window_size7'],
    prep['rolling_mean_lag1_window_size7_truediv_rolling_mean_lag2_window_size7']
)
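
The assertion above relies on an offset of 1 applied at lag 1 lining up with the same transformation at lag 2. The following plain-Python sketch shows why that holds; `lag`, `rolling_mean` and `offset_then` are hypothetical helpers written for this illustration, not mlforecast APIs.

```python
from statistics import mean


def lag(values, k):
    """Shift a series k steps into the past."""
    return ([None] * k + values)[:len(values)]


def rolling_mean(values, window_size):
    """Trailing mean; None until the window holds window_size real values."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - window_size + 1): i + 1]
        full = len(window) == window_size and None not in window
        out.append(mean(window) if full else None)
    return out


def offset_then(transform, values, n):
    """Shift by n extra steps first, then apply the transformation."""
    return transform(lag(values, n))


y = [float(v) for v in range(1, 11)]

# Offset of 1 applied on top of lag 1 ...
via_offset = offset_then(lambda v: rolling_mean(v, 3), lag(y, 1), n=1)
# ... matches the same rolling mean computed directly at lag 2.
direct = rolling_mean(lag(y, 2), 3)
print(via_offset == direct)
```
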
from sklearn.linear_model import LinearRegression
fcst = MLForecast(
    models=[LinearRegression()],
    freq='D',
    lag_transforms={
        1: [
            RollingMean(window_size=7),
            RollingMean(window_size=14),
            Combine(
                RollingMean(window_size=7),
                RollingMean(window_size=14),
                operator.truediv,
            )
        ],
    },
)
fcst.fit(data)
fcst.predict(2);

numba-based transformations

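The window-ops package provides transformations defined as numba JIT-compiled functions. We use numba because it makes these transformations very fast and lets them bypass Python's GIL, which allows us to run them concurrently with multithreading.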

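The main benefit of these transformations is that they are very easy to implement. However, they can be very slow when their values need to be updated in the predict step, because the function has to be called again on the complete history and only the last value is kept. If performance is a concern, try to use the built-in transformations, or set keep_last_n in MLForecast.preprocess or MLForecast.fit to the minimum number of samples that your transformations require.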
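
The reasoning behind trimming the history can be sketched in plain Python. The helper `rolling_mean_last` is hypothetical and only imitates the "recompute everything, keep the last value" pattern described above; it is not how mlforecast is implemented.

```python
from statistics import mean


def rolling_mean_last(history, window_size):
    """Recompute the whole rolling mean and return only the final value."""
    full = [
        mean(history[i - window_size + 1: i + 1])
        for i in range(window_size - 1, len(history))
    ]
    return full[-1]


history = [float(v) for v in range(1, 1001)]
window_size = 7

# Same final value, but the second call does far less work.
on_full_history = rolling_mean_last(history, window_size)
on_trimmed = rolling_mean_last(history[-window_size:], window_size)
print(on_full_history, on_trimmed)
```

In this sketch a 7-sample window only needs the last 7 samples to produce the next value. A window applied to a larger lag would need correspondingly more samples, and an expanding transformation depends on the whole history, so trimming would change its result.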

from numba import njit
from window_ops.expanding import expanding_mean
from window_ops.shift import shift_array
@njit
def ratio_over_previous(x, offset=1):
    """Computes the ratio between the current value and its `offset` lag"""
    return x / shift_array(x, offset=offset)

@njit
def diff_over_previous(x, offset=1):
    """Computes the difference between the current value and its `offset` lag"""
    return x - shift_array(x, offset=offset)

If your function takes more arguments than the input array, you can provide a tuple like: (func, arg1, arg2, ...)
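
One way to picture the tuple form is a small dispatcher that accepts either a bare function or a (func, arg1, arg2, ...) tuple. The names `apply_transform` and `scale` below are hypothetical and exist only for this illustration; the sketch does not reproduce mlforecast's internal handling.

```python
def apply_transform(spec, x):
    """Call a bare function, or a (func, arg1, arg2, ...) tuple, on the input x."""
    if isinstance(spec, tuple):
        func, *args = spec
        return func(x, *args)
    return spec(x)


def scale(x, factor=1.0):
    """Hypothetical transformation: multiply every value by `factor`."""
    return [v * factor for v in x]


x = [1.0, 2.0]
default_call = apply_transform(scale, x)         # uses factor=1.0
tuple_call = apply_transform((scale, 3.0), x)    # passes factor=3.0 positionally
print(default_call, tuple_call)
```

The extra tuple elements are passed positionally after the input array, so their order has to follow the function's signature.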

fcst = MLForecast(
    models=[],
    freq='D',
    lags=[1, 2, 3],
    lag_transforms={
        1: [expanding_mean, ratio_over_previous, (ratio_over_previous, 2)],  # the second ratio sets offset=2
        2: [diff_over_previous],
    },
)
prep = fcst.preprocess(data)
prep.head(2)
unique_id ds y lag1 lag2 lag3 expanding_mean_lag1 ratio_over_previous_lag1 ratio_over_previous_lag1_offset2 diff_over_previous_lag2
3 id_0 2000-01-04 3.481831 2.445887 1.218794 0.322947 1.329209 2.006809 7.573645 0.895847
4 id_0 2000-01-05 4.191721 3.481831 2.445887 1.218794 1.867365 1.423546 2.856785 1.227093

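As you can see, the name of the function is used as the transformation name, followed by the _lag suffix and the lag number. If the function has other arguments and they are not set to their default values, they are included as well, as with offset=2 here.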

np.testing.assert_allclose(prep['lag1'] / prep['lag2'], prep['ratio_over_previous_lag1'])
np.testing.assert_allclose(prep['lag1'] / prep['lag3'], prep['ratio_over_previous_lag1_offset2'])
np.testing.assert_allclose(prep['lag2'] - prep['lag3'], prep['diff_over_previous_lag2'])
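
The naming pattern seen in the column headers can be reproduced with a short sketch. `feature_name` is a hypothetical helper that imitates the observed names and is not taken from mlforecast. The functions are plain Python stand-ins without @njit, since only their signatures matter here.

```python
import inspect


def feature_name(func, lag, *args):
    """Build '<func>_lag<lag>' plus '_<param><value>' for each non-default extra argument."""
    name = f"{func.__name__}_lag{lag}"
    # Skip the first parameter, which is the input array.
    params = list(inspect.signature(func).parameters.values())[1:]
    for param, value in zip(params, args):
        if value != param.default:
            name += f"_{param.name}{value}"
    return name


def ratio_over_previous(x, offset=1):
    return x  # placeholder body; only the signature is used


def diff_over_previous(x, offset=1):
    return x  # placeholder body; only the signature is used


names = [
    feature_name(ratio_over_previous, 1),
    feature_name(ratio_over_previous, 1, 2),
    feature_name(diff_over_previous, 2),
]
print(names)
```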
