GroupTimeSeriesSplit:一个与scikit-learn兼容的具有分组的时间序列验证版本

一个与scikit-learn兼容的时间序列交叉验证器,支持不重叠的组。

> `from mlxtend.evaluate import GroupTimeSeriesSplit`

概述

时间序列任务在机器学习中需要特殊类型的验证,因为对象的时间顺序对于更公平地评估机器学习模型的质量是重要的。
此外,不同任务中分割数据的时间单位也可能不同 - 小时、天、月等。

在这里,我们使用时间序列验证,并支持能够灵活配置的分组以及其他参数:

这个 GroupTimeSeriesSplit 的实现灵感来源于 scikit-learn 的 TimeSeriesSplit,但它有几个优势:

需要考虑几个特性:

在我们下面的例子中展示GroupTimeSeriesSplit的用法之前,先设置一个DummyClassifier,我们将在接下来的部分中重复使用它。同时,让我们导入我们将在后续示例中使用的库:

import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

from mlxtend.evaluate.time_series import (
    GroupTimeSeriesSplit,
    plot_splits,
    print_cv_info,
    print_split_info,
)

准备示例数据

对于以下示例,我们正在创建一个包含16个训练数据点及其相应目标的示例数据集。

特征和目标

假设我们有一个数值特征和目标用于二分类任务。

data = [[0], [7], [6], [4], [4], [8], [0], [6], [2], [0], [5], [9], [7], [7], [7], [7]]
target = [1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0]

X = pd.DataFrame(data, columns=["num_feature"])
y = pd.Series(target, name="target")

分组号码

我们创建6个不同的组,使得第一个训练样本属于组0,接下来的4个样本属于组1,依此类推。
这些组不必按升序排列(如在这个数据集中),但必须是连续的。

groups = np.array([0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 5, 5])
groups

array([0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 5, 5])

请注意,以下是一个 正确 的组排序示例(不排序但连续):

np.array([5, 5, 5, 5, 1, 1, 1, 1, 3, 3, 2, 2, 2, 4, 4, 0])

然而,下面的示例显示了一个 不正确 的组排序(不连续),这与 GroupTimeSeriesSplit 不兼容:

np.array([0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 2, 2, 2, 2])

组名(月份)

我们将根据指定的分组添加月份作为索引,以便于更直观的示例。

months_map = {i: f"2021-0{i+1}" for i in range(6)}
months = np.array([months_map[group] for group in groups])
months

array(['2021-01', '2021-02', '2021-02', '2021-02', '2021-02', '2021-03',
       '2021-03', '2021-03', '2021-04', '2021-04', '2021-05', '2021-05',
       '2021-06', '2021-06', '2021-06', '2021-06'], dtype='<U7')
X = X.set_index(months)

示例 1 -- 多个训练组(指定训练大小)

让我们创建一个时间序列划分,其中训练数据集由3个组组成。我们将使用1个组进行测试。在这种情况下,分割的数量将自动计算,因为训练集和测试集的大小都是指定的。

可视化中的前3行描述了每个分割在组之间的分布。
最后一行可视化了这些组,每种颜色代表不同的组。

cv_args = {"test_size": 1, "train_size": 3}

plot_splits(X, y, groups, **cv_args)
print_split_info(X, y, groups, **cv_args)

png

Train indices: [0 1 2 3 4 5 6 7]
Test indices: [8 9]
Train length: 8
Test length: 2
Train groups: [0 1 1 1 1 2 2 2]
Test groups: [3 3]
Train group size: 3
Test group size: 1
Train group months: ['2021-01' '2021-02' '2021-02' '2021-02' '2021-02' '2021-03' '2021-03'
 '2021-03']
Test group months: ['2021-04' '2021-04']

Train indices: [1 2 3 4 5 6 7 8 9]
Test indices: [10 11]
Train length: 9
Test length: 2
Train groups: [1 1 1 1 2 2 2 3 3]
Test groups: [4 4]
Train group size: 3
Test group size: 1
Train group months: ['2021-02' '2021-02' '2021-02' '2021-02' '2021-03' '2021-03' '2021-03'
 '2021-04' '2021-04']
Test group months: ['2021-05' '2021-05']

Train indices: [ 5  6  7  8  9 10 11]
Test indices: [12 13 14 15]
Train length: 7
Test length: 4
Train groups: [2 2 2 3 3 4 4]
Test groups: [5 5 5 5]
Train group size: 3
Test group size: 1
Train group months: ['2021-03' '2021-03' '2021-03' '2021-04' '2021-04' '2021-05' '2021-05']
Test group months: ['2021-06' '2021-06' '2021-06' '2021-06']

请注意,如果我们为训练集和测试集都指定了组的数量,则拆分大小会自动确定,并且拆分的数量会随着组的大小而自然变化。例如,增加训练组的数量将自然导致拆分数量减少,如下所示。

cv_args = {"test_size": 1, "train_size": 4}

plot_splits(X, y, groups, **cv_args)

png

在简历中的使用

下面的示例演示了我们如何使用时间序列切分器与scikit-learn,即使用 cross_val_score

cv = GroupTimeSeriesSplit(**cv_args)
clf = DummyClassifier(strategy="most_frequent")

scores = cross_val_score(clf, X, y, groups=groups, scoring="accuracy", cv=cv)
print_cv_info(cv, X, y, groups, clf, scores)

Split number: 1
Train true target: [1 0 1 0 1 0 0 1]
Train predicted target: [0 0 0 0 0 0 0 0]
Test true target: [1 1]
Test predicted target: [0 0]
Accuracy: 0.0

Split number: 2
Train true target: [0 1 0 1 0 0 1 1 1]
Train predicted target: [1 1 1 1 1 1 1 1 1]
Test true target: [0 1]
Test predicted target: [1 1]
Accuracy: 0.5

Split number: 3
Train true target: [0 0 1 1 1 0 1]
Train predicted target: [1 1 1 1 1 1 1]
Test true target: [1 0 0 0]
Test predicted target: [1 1 1 1]
Accuracy: 0.25

示例 2 -- 多个训练组(指定分割数)

现在让我们看看一个例子,在这个例子中我们不指定训练组的数量。在这里,我们将数据集分割为测试集大小(2组)和指定的分割数量(3组),这对于自动计算训练集大小是足够的。

cv_args = {"test_size": 2, "n_splits": 3}

plot_splits(X, y, groups, **cv_args)
print_split_info(X, y, groups, **cv_args)

png

Train indices: [0 1 2 3 4]
Test indices: [5 6 7 8 9]
Train length: 5
Test length: 5
Train groups: [0 1 1 1 1]
Test groups: [2 2 2 3 3]
Train group size: 2
Test group size: 2
Train group months: ['2021-01' '2021-02' '2021-02' '2021-02' '2021-02']
Test group months: ['2021-03' '2021-03' '2021-03' '2021-04' '2021-04']

Train indices: [1 2 3 4 5 6 7]
Test indices: [ 8  9 10 11]
Train length: 7
Test length: 4
Train groups: [1 1 1 1 2 2 2]
Test groups: [3 3 4 4]
Train group size: 2
Test group size: 2
Train group months: ['2021-02' '2021-02' '2021-02' '2021-02' '2021-03' '2021-03' '2021-03']
Test group months: ['2021-04' '2021-04' '2021-05' '2021-05']

Train indices: [5 6 7 8 9]
Test indices: [10 11 12 13 14 15]
Train length: 5
Test length: 6
Train groups: [2 2 2 3 3]
Test groups: [4 4 5 5 5 5]
Train group size: 2
Test group size: 2
Train group months: ['2021-03' '2021-03' '2021-03' '2021-04' '2021-04']
Test group months: ['2021-05' '2021-05' '2021-06' '2021-06' '2021-06' '2021-06']

在简历中的使用

再来看一下在scikit-learn环境中使用cross_val_score的效果:

cv = GroupTimeSeriesSplit(**cv_args)
clf = DummyClassifier(strategy="most_frequent")

scores = cross_val_score(clf, X, y, groups=groups, scoring="accuracy", cv=cv)
print_cv_info(cv, X, y, groups, clf, scores)

Split number: 1
Train true target: [1 0 1 0 1]
Train predicted target: [1 1 1 1 1]
Test true target: [0 0 1 1 1]
Test predicted target: [1 1 1 1 1]
Accuracy: 0.6

Split number: 2
Train true target: [0 1 0 1 0 0 1]
Train predicted target: [0 0 0 0 0 0 0]
Test true target: [1 1 0 1]
Test predicted target: [0 0 0 0]
Accuracy: 0.25

Split number: 3
Train true target: [0 0 1 1 1]
Train predicted target: [1 1 1 1 1]
Test true target: [0 1 1 0 0 0]
Test predicted target: [1 1 1 1 1 1]
Accuracy: 0.33

示例 3 -- 定义训练和测试数据集之间的间隙大小

GroupTimeSeriesSplit 允许您指定一个大于 1 的间隙大小,以在训练和测试折叠之间跳过指定数量的组(默认间隙大小为 0)。在下面的示例中,我们使用 1 组的间隙来说明这一点。

cv_args = {"test_size": 1, "n_splits": 3, "gap_size": 1}

plot_splits(X, y, groups, **cv_args)
print_split_info(X, y, groups, **cv_args)

png

Train indices: [0 1 2 3 4]
Test indices: [8 9]
Train length: 5
Test length: 2
Train groups: [0 1 1 1 1]
Test groups: [3 3]
Train group size: 2
Test group size: 1
Train group months: ['2021-01' '2021-02' '2021-02' '2021-02' '2021-02']
Test group months: ['2021-04' '2021-04']

Train indices: [1 2 3 4 5 6 7]
Test indices: [10 11]
Train length: 7
Test length: 2
Train groups: [1 1 1 1 2 2 2]
Test groups: [4 4]
Train group size: 2
Test group size: 1
Train group months: ['2021-02' '2021-02' '2021-02' '2021-02' '2021-03' '2021-03' '2021-03']
Test group months: ['2021-05' '2021-05']

Train indices: [5 6 7 8 9]
Test indices: [12 13 14 15]
Train length: 5
Test length: 4
Train groups: [2 2 2 3 3]
Test groups: [5 5 5 5]
Train group size: 2
Test group size: 1
Train group months: ['2021-03' '2021-03' '2021-03' '2021-04' '2021-04']
Test group months: ['2021-06' '2021-06' '2021-06' '2021-06']

在简历中的使用

下面的示例展示了在使用 cross_val_score 的 scikit-learn 环境中,效果是怎样的:

cv = GroupTimeSeriesSplit(**cv_args)
clf = DummyClassifier(strategy="most_frequent")

scores = cross_val_score(clf, X, y, groups=groups, scoring="accuracy", cv=cv)
print_cv_info(cv, X, y, groups, clf, scores)

Split number: 1
Train true target: [1 0 1 0 1]
Train predicted target: [1 1 1 1 1]
Test true target: [1 1]
Test predicted target: [1 1]
Accuracy: 1.0

Split number: 2
Train true target: [0 1 0 1 0 0 1]
Train predicted target: [0 0 0 0 0 0 0]
Test true target: [0 1]
Test predicted target: [0 0]
Accuracy: 0.5

Split number: 3
Train true target: [0 0 1 1 1]
Train predicted target: [1 1 1 1 1]
Test true target: [1 0 0 0]
Test predicted target: [1 1 1 1]
Accuracy: 0.25

API

GroupTimeSeriesSplit(test_size, train_size=None, n_splits=None, gap_size=0, shift_size=1, window_type='rolling')

Group time series cross-validator.

Parameters

Examples

For usage examples, please see https://rasbt.github.io/mlxtend/user_guide/evaluate/GroupTimeSeriesSplit/

Methods


get_n_splits(X=None, y=None, groups=None)

Returns the number of splitting iterations in the cross-validator.

Parameters

Returns


split(X, y=None, groups=None)

Generate indices to split data into training and test set.

Parameters

Yields