GroupKFold

class sklearn.model_selection.GroupKFold(n_splits=5)

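K-fold iterator variant with non-overlapping groups.

Each group appears exactly once in the test set across all folds (the number of distinct groups must be at least equal to the number of folds).

The folds are approximately balanced, in the sense that the number of samples is roughly the same in each test fold.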
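The two properties above can be checked directly. The following is a minimal sketch on made-up toy data (the group labels and sizes are assumptions chosen only for illustration): it confirms that no group is shared between the train and test side of a split, that each group lands in a single test fold, and it prints the test-fold sizes so the balance can be inspected.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical toy data: 12 samples spread unevenly over 5 groups.
groups = np.array([0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 4, 4])
X = np.arange(len(groups)).reshape(-1, 1)

test_fold_of_group = {}  # group label -> index of the fold that tests it
test_sizes = []          # number of test samples per fold

for fold, (train_idx, test_idx) in enumerate(
    GroupKFold(n_splits=3).split(X, groups=groups)
):
    # No group is shared between the train and test side of a split.
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
    test_sizes.append(len(test_idx))
    for g in np.unique(groups[test_idx]):
        # A group must not have been seen in an earlier test fold.
        assert int(g) not in test_fold_of_group
        test_fold_of_group[int(g)] = fold

print(test_fold_of_group)  # which fold tests which group
print(test_sizes)          # test-fold sizes, expected to be roughly equal
```

Because whole groups are never split, the fold sizes can only be as even as the group sizes allow.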

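Read more in the User Guide.

For a visualisation of cross-validation behaviour and a comparison between common scikit-learn split methods, refer to Visualizing cross-validation behavior in scikit-learn.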

Parameters:

n_splits (int, default=5): Number of folds. Must be at least 2.

Changed in version 0.22: the default value of n_splits changed from 3 to 5.
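As a rough illustration of the constraints on n_splits, the sketch below triggers both failure modes on a tiny made-up dataset. It assumes a reasonably recent scikit-learn release, where both cases raise ValueError; the exact error messages may differ between versions, so they are printed rather than compared.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.zeros((4, 1))
groups = np.array([0, 0, 1, 1])  # hypothetical data with only 2 distinct groups

errors = []

# n_splits below 2 is rejected when the splitter is constructed.
try:
    GroupKFold(n_splits=1)
except ValueError as exc:
    errors.append(str(exc))

# n_splits above the number of distinct groups is rejected when splitting.
try:
    list(GroupKFold(n_splits=3).split(X, groups=groups))
except ValueError as exc:
    errors.append(str(exc))

for message in errors:
    print(message)
```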

See also

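LeaveOneGroupOut: Splits the data according to an explicit, domain-specific stratification of the dataset.
StratifiedKFold: Takes class information into account to avoid building folds with imbalanced class proportions (for binary or multiclass classification tasks).

Notes

Groups appear in an arbitrary order throughout the folds.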

Examples

>>> import numpy as np
>>> from sklearn.model_selection import GroupKFold
>>> X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
>>> y = np.array([1, 2, 3, 4, 5, 6])
>>> groups = np.array([0, 0, 2, 2, 3, 3])
>>> group_kfold = GroupKFold(n_splits=2)
>>> group_kfold.get_n_splits(X, y, groups)
2
>>> print(group_kfold)
GroupKFold(n_splits=2)
>>> for i, (train_index, test_index) in enumerate(group_kfold.split(X, y, groups)):
...     print(f"Fold {i}:")
...     print(f"  Train: index={train_index}, group={groups[train_index]}")
...     print(f"  Test:  index={test_index}, group={groups[test_index]}")
Fold 0:
  Train: index=[2 3], group=[2 2]
  Test:  index=[0 1 4 5], group=[0 0 3 3]
Fold 1:
  Train: index=[0 1 4 5], group=[0 0 3 3]
  Test:  index=[2 3], group=[2 2]

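get_metadata_routing()

Get the metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:

routing (MetadataRequest): A MetadataRequest encapsulating routing information.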

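get_n_splits(X=None, y=None, groups=None)

Returns the number of splitting iterations in the cross-validator.

Parameters:

X (object): Always ignored, exists for compatibility.
y (object): Always ignored, exists for compatibility.
groups (object): Always ignored, exists for compatibility.

Returns:

n_splits (int): The number of splitting iterations in the cross-validator.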

set_split_request(*, groups: bool | None | str = '$UNCHANGED$') → GroupKFold

Request metadata passed to the split method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config). Please see the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to split if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to split.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

groups (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED): Metadata routing for the groups parameter in split.

Returns:

self (object): The updated object.

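split(X, y=None, groups=None)

Generate indices to split data into training and test sets.

Parameters:

X (array-like of shape (n_samples, n_features)): Training data, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,), default=None): The target variable for supervised learning problems.
groups (array-like of shape (n_samples,)): Group labels for the samples, used while splitting the dataset into train/test sets.

Yields:

train (ndarray): The training set indices for that split.
test (ndarray): The testing set indices for that split.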
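In practice split is often called indirectly by a helper such as cross_val_score, which forwards the group labels to the splitter. The sketch below shows one way this might look. The synthetic data, the choice of LogisticRegression, and the six "subjects" are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = (X[:, 0] > 0).astype(int)         # synthetic binary target
groups = np.repeat(np.arange(6), 10)  # 6 hypothetical subjects, 10 samples each

# The group labels are passed to cross_val_score, which hands them on
# to GroupKFold.split, so no subject is in both train and test data.
scores = cross_val_score(
    LogisticRegression(), X, y, groups=groups, cv=GroupKFold(n_splits=3)
)
print(scores.shape)  # one score per fold
```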

Gallery examples

Release Highlights for scikit-learn 1.4

Visualizing cross-validation behavior in scikit-learn