GroupShuffleSplit#

class sklearn.model_selection.GroupShuffleSplit(n_splits=5, *, test_size=None, train_size=None, random_state=None)#

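Shuffle-Group(s)-Out cross-validation iterator.

Provides randomized train/test indices to split data according to a third-party provided group. This group information can be used to encode arbitrary domain-specific stratifications of the samples as integers.

For instance, the groups could be the year of collection of the samples, which allows cross-validation against time-based splits.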
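To make the idea concrete, here is a minimal pure-Python sketch of a single group-wise random split, using collection years as groups. The helper `group_shuffle_split_sketch` and its parameters are hypothetical names introduced for illustration only. It is not scikit-learn's implementation, and it only guarantees that a year never appears on both sides of the split.

```python
import math
import random


def group_shuffle_split_sketch(groups, test_fraction=0.2, seed=0):
    """Hypothetical helper: one random split that keeps every group intact."""
    rng = random.Random(seed)
    unique_groups = sorted(set(groups))
    rng.shuffle(unique_groups)
    # The fraction applies to the number of groups, rounded up.
    n_test = math.ceil(test_fraction * len(unique_groups))
    test_groups = set(unique_groups[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


# Made-up collection years, one per sample.
years = [2019, 2019, 2020, 2020, 2020, 2021, 2022, 2022, 2023, 2023]
train_idx, test_idx = group_shuffle_split_sketch(years, test_fraction=0.4, seed=1)
train_years = {years[i] for i in train_idx}
test_years = {years[i] for i in test_idx}
```

Whole years end up in either the training or the test side, so a model is never evaluated on samples from a year it was trained on.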

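The difference between LeavePGroupsOut and GroupShuffleSplit is that the former generates splits using all subsets of size p unique groups, whereas GroupShuffleSplit generates a user-determined number of random test splits, each with a user-determined fraction of unique groups.

For example, a less computationally intensive alternative to LeavePGroupsOut(p=10) would be GroupShuffleSplit(test_size=10, n_splits=100).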
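The cost difference can be checked by counting splits. The sketch below assumes a made-up dataset with 20 groups. Note that in the LeavePGroupsOut constructor the value written as p above is passed as `n_groups`.

```python
from math import comb

import numpy as np
from sklearn.model_selection import GroupShuffleSplit, LeavePGroupsOut

# Illustrative data: 20 distinct groups with 3 samples each.
groups = np.repeat(np.arange(20), 3)

# Exhaustive: one split per subset of 10 groups out of 20.
lpgo = LeavePGroupsOut(n_groups=10)
n_exhaustive = lpgo.get_n_splits(groups=groups)

# Randomized: a fixed, user-chosen number of splits with 10 test groups each.
gss = GroupShuffleSplit(n_splits=100, test_size=10, random_state=0)
n_random = gss.get_n_splits()

print(n_exhaustive, comb(20, 10), n_random)
```

With 20 groups the exhaustive iterator would need C(20, 10) model fits per evaluation, while the randomized one needs only the 100 requested.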

Note: The parameters test_size and train_size refer to groups, and not to samples as in ShuffleSplit.
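A short sketch of what this means in practice, assuming three made-up groups of unequal size: with `test_size=1` every split holds out exactly one group, while the number of held-out samples depends on which group was drawn.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Illustrative groups of unequal size: "a" has 6 samples, "b" and "c" have 2.
groups = np.array(["a"] * 6 + ["b"] * 2 + ["c"] * 2)
X = np.zeros((len(groups), 1))

gss = GroupShuffleSplit(n_splits=5, test_size=1, random_state=0)

test_group_counts = []
test_sample_counts = []
for train_idx, test_idx in gss.split(X, groups=groups):
    # test_size=1 means one test group, not one test sample.
    test_group_counts.append(len(np.unique(groups[test_idx])))
    test_sample_counts.append(len(test_idx))

print(test_group_counts, test_sample_counts)
```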

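Read more in the User Guide.

For a visualisation of cross-validation behaviour and a comparison between common scikit-learn split methods, refer to Visualizing cross-validation behavior in scikit-learn.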

Parameters:

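n_splits (int, default=5): Number of re-shuffling and splitting iterations.
test_size (float or int, default=None): If float, should be between 0.0 and 1.0 and represent the proportion of groups to include in the test split (rounded up). If int, represents the absolute number of test groups. If None, the value is set to the complement of the train size; if train_size is also None, it is set to 0.2.
train_size (float or int, default=None): If float, should be between 0.0 and 1.0 and represent the proportion of groups to include in the train split. If int, represents the absolute number of train groups. If None, the value is automatically set to the complement of the test size.
random_state (int, RandomState instance or None, default=None): Controls the randomness of the training and testing indices produced. Pass an int for reproducible output across multiple function calls. See Glossary.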
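The following sketch illustrates how the size parameters translate into group counts, assuming a made-up dataset with 5 groups. The helper `count_groups` is a hypothetical name used only here.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Illustrative data: 5 groups with 4 samples each.
groups = np.repeat(np.arange(5), 4)
X = np.zeros((len(groups), 1))


def count_groups(splitter):
    """Hypothetical helper: (n_train_groups, n_test_groups) of the first split."""
    train_idx, test_idx = next(splitter.split(X, groups=groups))
    return len(np.unique(groups[train_idx])), len(np.unique(groups[test_idx]))


# Float test_size is rounded up: ceil(0.25 * 5) = 2 test groups, 3 left for training.
float_counts = count_groups(GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0))

# Integer test_size is an absolute number of groups.
int_counts = count_groups(GroupShuffleSplit(n_splits=1, test_size=1, random_state=0))

# Both sizes left as None: the test share falls back to 0.2, i.e. 1 of 5 groups.
default_counts = count_groups(GroupShuffleSplit(n_splits=1, random_state=0))

print(float_counts, int_counts, default_counts)
```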

See also

ShuffleSplit: Shuffles samples to create independent test/train sets.
LeavePGroupsOut: Train set leaves out all possible subsets of p groups.

Examples

>>> import numpy as np
>>> from sklearn.model_selection import GroupShuffleSplit
>>> X = np.ones(shape=(8, 2))
>>> y = np.ones(shape=(8, 1))
>>> groups = np.array([1, 1, 2, 2, 2, 3, 3, 3])
>>> print(groups.shape)
(8,)
>>> gss = GroupShuffleSplit(n_splits=2, train_size=.7, random_state=42)
>>> gss.get_n_splits()
2
>>> print(gss)
GroupShuffleSplit(n_splits=2, random_state=42, test_size=None, train_size=0.7)
>>> for i, (train_index, test_index) in enumerate(gss.split(X, y, groups)):
...     print(f"Fold {i}:")
...     print(f"  Train: index={train_index}, group={groups[train_index]}")
...     print(f"  Test:  index={test_index}, group={groups[test_index]}")
Fold 0:
  Train: index=[2 3 4 5 6 7], group=[2 2 2 3 3 3]
  Test:  index=[0 1], group=[1 1]
Fold 1:
  Train: index=[0 1 5 6 7], group=[1 1 3 3 3]
  Test:  index=[2 3 4], group=[2 2 2]

get_metadata_routing()#

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:

routing (MetadataRequest): A MetadataRequest encapsulating routing information.

get_n_splits(X=None, y=None, groups=None)#

Returns the number of splitting iterations in the cross-validator.

Parameters:

X (object): Always ignored, exists for compatibility.
y (object): Always ignored, exists for compatibility.
groups (object): Always ignored, exists for compatibility.

Returns:

n_splits (int): Returns the number of splitting iterations in the cross-validator.
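As a quick check of the "always ignored" wording, the sketch below calls the method with and without arguments. The arrays passed in are arbitrary placeholders.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

gss = GroupShuffleSplit(n_splits=7, test_size=0.3)

# The result is the n_splits value given to the constructor ...
n_no_args = gss.get_n_splits()

# ... and does not change when data is passed in.
n_with_args = gss.get_n_splits(X=np.zeros((4, 2)), y=None, groups=[0, 0, 1, 1])

print(n_no_args, n_with_args)
```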

set_split_request(*, groups: bool | None | str = '$UNCHANGED$') → GroupShuffleSplit#

Request metadata passed to the split method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config). Please see the User Guide on how the routing mechanism works.

The options for each parameter are:

True : metadata is requested, and passed to split if provided. The request is ignored if metadata is not provided.
False : metadata is not requested and the meta-estimator will not pass it to split.
None : metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str : metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

groups (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED): Metadata routing for groups parameter in split.

Returns:

self (object): The updated object.

split(X, y=None, groups=None)#

Generate indices to split data into training and test set.

Parameters:

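X (array-like of shape (n_samples, n_features)): Training data, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,), default=None): The target variable for supervised learning problems.
groups (array-like of shape (n_samples,)): Group labels for the samples used while splitting the dataset into train/test set.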

Yields:

train (ndarray): The training set indices for that split.
test (ndarray): The testing set indices for that split.
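The yielded index arrays can be used to slice the data directly, or the splitter can be handed to a helper such as `cross_val_score` together with the group labels. The sketch below uses made-up random data and a LogisticRegression model purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupShuffleSplit, cross_val_score

# Illustrative data: 60 samples in 10 groups of 6, with both classes in each group.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = np.tile([0, 1], 30)
groups = np.repeat(np.arange(10), 6)

gss = GroupShuffleSplit(n_splits=4, test_size=0.2, random_state=0)

# Manual use: index the arrays with the yielded train/test indices.
shared_groups = []
for train_idx, test_idx in gss.split(X, y, groups):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # Record any group that appears on both sides (expected: none).
    shared_groups.append(set(groups[train_idx]) & set(groups[test_idx]))

# Helper use: pass the splitter as cv and the labels as groups.
scores = cross_val_score(LogisticRegression(), X, y, groups=groups, cv=gss)
print(scores)
```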

Notes

Randomized CV splitters may return different results for each call of split. You can make the results identical by setting random_state to an integer.
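A small sketch of the reproducibility point, using made-up data. The helper `collect` is a hypothetical name that materializes all splits as plain lists so they can be compared.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Illustrative data: 6 groups with 2 samples each.
groups = np.repeat(np.arange(6), 2)
X = np.zeros((len(groups), 1))


def collect(splitter):
    """Hypothetical helper: all (train, test) index pairs as plain lists."""
    return [(tr.tolist(), te.tolist()) for tr, te in splitter.split(X, groups=groups)]


# Two splitters built with the same integer seed produce the same splits.
seeded_a = collect(GroupShuffleSplit(n_splits=3, test_size=2, random_state=42))
seeded_b = collect(GroupShuffleSplit(n_splits=3, test_size=2, random_state=42))

# Calling split twice on one integer-seeded splitter also repeats the splits.
cv = GroupShuffleSplit(n_splits=3, test_size=2, random_state=42)
first_call, second_call = collect(cv), collect(cv)

print(seeded_a == seeded_b, first_call == second_call)
```

With `random_state=None`, repeated calls are free to differ, so comparisons across runs are only meaningful with a fixed seed.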

Gallery examples#

Visualizing cross-validation behavior in scikit-learn