MiniBatchKMeans#

class sklearn.cluster.MiniBatchKMeans(n_clusters=8, *, init='k-means++', max_iter=100, batch_size=1024, verbose=0, compute_labels=True, random_state=None, tol=0.0, max_no_improvement=10, init_size=None, n_init='auto', reassignment_ratio=0.01)#

Mini-Batch K-Means clustering.

Read more in the User Guide.

Parameters:

n_clusters : int, default=8

The number of clusters to form as well as the number of centroids to generate.

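init : {'k-means++', 'random'}, callable or array-like of shape (n_clusters, n_features), default='k-means++'

Method for initialization:

'k-means++' : selects initial cluster centroids using sampling based on an empirical probability distribution of the points' contribution to the overall inertia. This technique speeds up convergence. The algorithm implemented is "greedy k-means++". It differs from the vanilla k-means++ by making several trials at each sampling step and choosing the best centroid among them.

'random' : choose n_clusters observations (rows) at random from the data for the initial centroids.

If an array is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.

If a callable is passed, it should take arguments X, n_clusters and a random state and return an initialization.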
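The array and callable forms of init can be sketched as follows. This is a minimal illustration on synthetic data; the helper name pick_random_rows and the data are assumptions made for the example, and it assumes the random state handed to the callable is a NumPy RandomState instance.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Two well-separated synthetic blobs (illustrative data, not from the docs).
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])

# Option 1: pass explicit starting centers of shape (n_clusters, n_features).
start = np.array([[0.0, 0.0], [5.0, 5.0]])
km_array = MiniBatchKMeans(n_clusters=2, init=start, n_init=1,
                           random_state=0).fit(X)

# Option 2: pass a callable taking X, n_clusters and a random state.
# `pick_random_rows` is a hypothetical helper written for this sketch.
def pick_random_rows(X, n_clusters, random_state):
    idx = random_state.choice(X.shape[0], size=n_clusters, replace=False)
    return X[idx]

km_callable = MiniBatchKMeans(n_clusters=2, init=pick_random_rows, n_init=1,
                              random_state=0).fit(X)

print(km_array.cluster_centers_.shape)     # (2, 2)
print(km_callable.cluster_centers_.shape)  # (2, 2)
```

Passing n_init=1 explicitly keeps both variants to a single initialization, which makes the two runs easier to compare.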

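max_iter : int, default=100

Maximum number of iterations over the complete dataset before stopping, independently of any early stopping criterion heuristics.

batch_size : int, default=1024

Size of the mini batches. For faster computations, you can set batch_size greater than 256 * number of cores to enable parallelism on all cores.

Changed in version 1.0: batch_size default changed from 100 to 1024.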

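verbose : int, default=0

Verbosity mode.

compute_labels : bool, default=True

Compute label assignment and inertia for the complete dataset once the minibatch optimization has converged in fit.

random_state : int, RandomState instance or None, default=None

Determines random number generation for centroid initialization and random reassignment. Use an int to make the randomness deterministic. See Glossary.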

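tol : float, default=0.0

Control early stopping based on the relative center changes, as measured by a smoothed, variance-normalized mean of the squared center position changes. This early stopping heuristic is closer to the one used for the batch variant of the algorithm but induces a slight computational and memory overhead over the inertia heuristic.

To disable convergence detection based on normalized center change, set tol to 0.0 (default).

max_no_improvement : int, default=10

Control early stopping based on the consecutive number of mini batches that do not yield an improvement on the smoothed inertia.

To disable convergence detection based on inertia, set max_no_improvement to None.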

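init_size : int, default=None

Number of samples to randomly sample for speeding up the initialization (sometimes at the expense of accuracy): the algorithm is initialized by running a batch KMeans on a random subset of the data. This needs to be larger than n_clusters.

If None, the heuristic is init_size = 3 * batch_size, falling back to init_size = 3 * n_clusters if 3 * batch_size < n_clusters.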

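n_init : 'auto' or int, default='auto'

Number of random initializations that are tried. In contrast to KMeans, the algorithm is only run once, using the best of the n_init initializations as measured by inertia. Several runs are recommended for sparse high-dimensional problems (see Clustering sparse data with k-means).

When n_init='auto', the number of runs depends on the value of init: 3 if using init='random' or init is a callable; 1 if using init='k-means++' or init is an array-like.

Added in version 1.2: Added 'auto' option for n_init.

Changed in version 1.4: Default value for n_init changed to 'auto'.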
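The 'auto' rule for n_init can be written out as a small function. This is only a sketch of the documented rule; resolve_n_init is a hypothetical helper for illustration and is not part of scikit-learn.

```python
def resolve_n_init(n_init, init):
    """Hypothetical helper mirroring the documented 'auto' rule (not sklearn API)."""
    if n_init != "auto":
        return int(n_init)       # an explicit integer is used as given
    if isinstance(init, str):
        # 'k-means++' needs a single run, 'random' gets three
        return 1 if init == "k-means++" else 3
    if callable(init):
        return 3                 # callable initializers are treated like 'random'
    return 1                     # array-like of initial centers


print(resolve_n_init("auto", "k-means++"))                # 1
print(resolve_n_init("auto", "random"))                   # 3
print(resolve_n_init("auto", lambda X, k, rs: X[:k]))     # 3
print(resolve_n_init("auto", [[0.0, 0.0], [1.0, 1.0]]))   # 1
print(resolve_n_init(5, "random"))                        # 5
```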

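reassignment_ratio : float, default=0.01

Control the fraction of the maximum number of counts for a center to be reassigned. A higher value means that low count centers are more easily reassigned, which means that the model will take longer to converge, but should converge to a better clustering. However, too high a value may cause convergence issues, especially with a small batch size.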

Attributes:

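cluster_centers_ : ndarray of shape (n_clusters, n_features)

Coordinates of cluster centers.

labels_ : ndarray of shape (n_samples,)

Labels of each point (if compute_labels is set to True).

inertia_ : float

The value of the inertia criterion associated with the chosen partition if compute_labels is set to True. If compute_labels is set to False, it is an approximation of the inertia based on an exponentially weighted average of the batch inertiae. The inertia is defined as the sum of squared distances of samples to their cluster center, weighted by the sample weights if provided.

n_iter_ : int

Number of iterations over the full dataset.

n_steps_ : int

Number of minibatches processed.

Added in version 1.0.

n_features_in_ : int

Number of features seen during fit.

Added in version 0.24.

feature_names_in_ : ndarray of shape (n_features_in_,)

Names of features seen during fit. Defined only when X has feature names that are all strings.

Added in version 1.0.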
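To make the definition of inertia_ concrete, the following sketch recomputes it by hand from labels_ and cluster_centers_ after a fit with compute_labels=True. The data is synthetic and chosen only for illustration; no sample weights are used.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

X = np.random.RandomState(0).rand(500, 3)  # illustrative data
km = MiniBatchKMeans(n_clusters=4, batch_size=100, n_init=1,
                     compute_labels=True, random_state=0).fit(X)

# Squared distance of every sample to the center it was assigned to.
assigned_centers = km.cluster_centers_[km.labels_]
manual_inertia = float(((X - assigned_centers) ** 2).sum())

print(km.cluster_centers_.shape, km.labels_.shape)  # (4, 3) (500,)
print(np.isclose(km.inertia_, manual_inertia))
```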

See also

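KMeans: The classic implementation of the clustering method based on Lloyd's algorithm. It consumes the whole set of input data at each iteration.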

Notes

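See https://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf

When there are too few points in the dataset, some centers may be duplicated, which means that the number of requested clusters and the number of returned clusters will not always match. One solution is to set reassignment_ratio=0, which prevents reassignments of clusters that are too small.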

Examples

>>> from sklearn.cluster import MiniBatchKMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [4, 2], [4, 0], [4, 4],
...               [4, 5], [0, 1], [2, 2],
...               [3, 2], [5, 5], [1, -1]])
>>> # manually fit on batches
>>> kmeans = MiniBatchKMeans(n_clusters=2,
...                          random_state=0,
...                          batch_size=6,
...                          n_init="auto")
>>> kmeans = kmeans.partial_fit(X[0:6,:])
>>> kmeans = kmeans.partial_fit(X[6:12,:])
>>> kmeans.cluster_centers_
array([[3.375, 3.  ],
       [0.75 , 0.5 ]])
>>> kmeans.predict([[0, 0], [4, 4]])
array([1, 0], dtype=int32)
>>> # fit on the whole data
>>> kmeans = MiniBatchKMeans(n_clusters=2,
...                          random_state=0,
...                          batch_size=6,
...                          max_iter=10,
...                          n_init="auto").fit(X)
>>> kmeans.cluster_centers_
array([[3.55102041, 2.48979592],
       [1.06896552, 1.        ]])
>>> kmeans.predict([[0, 0], [4, 4]])
array([1, 0], dtype=int32)

fit(X, y=None, sample_weight=None)#

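Compute the centroids on X by chunking it into mini-batches.

Parameters:

X : {array-like, sparse matrix} of shape (n_samples, n_features)

Training instances to cluster. Note that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous. If a sparse matrix is passed, a copy will be made if it is not in CSR format.

y : Ignored

Not used, present here for API consistency by convention.

sample_weight : array-like of shape (n_samples,), default=None

The weights for each observation in X. If None, all observations are assigned equal weight. sample_weight is not used during initialization if init is a callable or a user-provided array.

Added in version 0.20.

Returns:

self : object

Fitted estimator.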

fit_predict(X, y=None, sample_weight=None)#

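Compute cluster centers and predict cluster index for each sample.

Convenience method; equivalent to calling fit(X) followed by predict(X).

Parameters:

X : {array-like, sparse matrix} of shape (n_samples, n_features)

New data to transform.

y : Ignored

Not used, present here for API consistency by convention.

sample_weight : array-like of shape (n_samples,), default=None

The weights for each observation in X. If None, all observations are assigned equal weight.

Returns:

labels : ndarray of shape (n_samples,)

Index of the cluster each sample belongs to.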
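A quick way to check the stated equivalence is to call fit_predict and then predict on the same data with the same fitted estimator. This is a sketch on synthetic data, not a substitute for the library's own tests.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

X = np.random.RandomState(0).rand(300, 2)  # illustrative data

km = MiniBatchKMeans(n_clusters=3, batch_size=60, n_init=1, random_state=0)
labels = km.fit_predict(X)  # fit the centers and label X in one call

# Predicting on the same data with the fitted centers should give the same labels.
labels_again = km.predict(X)
print(np.array_equal(labels, labels_again))
```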

fit_transform(X, y=None, sample_weight=None)#

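Compute clustering and transform X to cluster-distance space.

Equivalent to fit(X).transform(X), but more efficiently implemented.

Parameters:

X : {array-like, sparse matrix} of shape (n_samples, n_features)

New data to transform.

y : Ignored

Not used, present here for API consistency by convention.

sample_weight : array-like of shape (n_samples,), default=None

The weights for each observation in X. If None, all observations are assigned equal weight.

Returns:

X_new : ndarray of shape (n_samples, n_clusters)

X transformed in the new space.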

get_feature_names_out(input_features=None)#

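Get output feature names for transformation.

The feature names out will be prefixed by the lowercased class name. For example, if the transformer outputs 3 features, then the feature names out are: ["class_name0", "class_name1", "class_name2"].

Parameters:

input_features : array-like of str or None, default=None

Only used to validate feature names with the names seen in fit.

Returns:

feature_names_out : ndarray of str objects

Transformed feature names.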

get_metadata_routing()#

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:

routing : MetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)#

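Get parameters for this estimator.

Parameters:

deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params : dict

Parameter names mapped to their values.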

partial_fit(X, y=None, sample_weight=None)#

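Update the k-means estimate on a single mini-batch X.

Parameters:

X : {array-like, sparse matrix} of shape (n_samples, n_features)

Training instances to cluster. Note that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous. If a sparse matrix is passed, a copy will be made if it is not in CSR format.

y : Ignored

Not used, present here for API consistency by convention.

sample_weight : array-like of shape (n_samples,), default=None

The weights for each observation in X. If None, all observations are assigned equal weight. sample_weight is not used during initialization if init is a callable or a user-provided array.

Returns:

self : object

Return updated estimator.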
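A typical use of partial_fit is streaming data that does not fit in memory. The sketch below simulates that by splitting an in-memory array into chunks; the chunking with np.array_split and the chunk count are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.RandomState(0)
X = rng.rand(1000, 5)  # stand-in for data arriving in chunks

km = MiniBatchKMeans(n_clusters=4, n_init=1, random_state=0)

# Feed the estimator one chunk at a time; each call updates the centers
# using only the chunk it receives.
for chunk in np.array_split(X, 10):
    km.partial_fit(chunk)

print(km.cluster_centers_.shape)  # (4, 5)
```

The first chunk also drives the initialization, so it should contain at least n_clusters samples.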

predict(X)#

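Predict the closest cluster each sample in X belongs to.

In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.

Parameters:

X : {array-like, sparse matrix} of shape (n_samples, n_features)

New data to predict.

Returns:

labels : ndarray of shape (n_samples,)

Index of the cluster each sample belongs to.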
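The code book view can be illustrated by replacing each sample with its nearest center. This is a minimal sketch on synthetic data; the variable names are chosen for the example.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

X = np.random.RandomState(0).rand(400, 3)  # illustrative data
km = MiniBatchKMeans(n_clusters=8, n_init=1, random_state=0).fit(X)

codes = km.predict(X)                    # index of the closest code per sample
X_quantized = km.cluster_centers_[codes] # look each code up in the code book

print(codes.shape, X_quantized.shape)    # (400,) (400, 3)
```

Storing codes plus the code book instead of X is the basic idea behind vector quantization as a lossy compression scheme.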

score(X, y=None, sample_weight=None)#

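Opposite of the value of X on the K-means objective.

Parameters:

X : {array-like, sparse matrix} of shape (n_samples, n_features)

New data.

y : Ignored

Not used, present here for API consistency by convention.

sample_weight : array-like of shape (n_samples,), default=None

The weights for each observation in X. If None, all observations are assigned equal weight.

Returns:

score : float

Opposite of the value of X on the K-means objective.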
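Since the K-means objective is the (optionally weighted) sum of squared distances to the nearest center, the score can be cross-checked by hand. The sketch below does this with plain NumPy on synthetic data; the weights are random values used only for illustration.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.RandomState(0)
X = rng.rand(200, 3)  # illustrative data
w = rng.rand(200)     # illustrative sample weights

km = MiniBatchKMeans(n_clusters=4, n_init=1, random_state=0).fit(X)

# Squared distance from every sample to every center, then keep the nearest.
sq_dist = ((X[:, None, :] - km.cluster_centers_[None, :, :]) ** 2).sum(axis=2)
nearest_sq_dist = sq_dist.min(axis=1)

unweighted_matches = np.isclose(km.score(X), -nearest_sq_dist.sum())
weighted_matches = np.isclose(km.score(X, sample_weight=w),
                              -(w * nearest_sq_dist).sum())
print(unweighted_matches, weighted_matches)
```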

set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → MiniBatchKMeans#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True : metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False : metadata is not requested and the meta-estimator will not pass it to fit.
None : metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str : metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for sample_weight parameter in fit.

Returns:

self : object

The updated object.

set_output(*, transform=None)#

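Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:

transform : {"default", "pandas", "polars"}, default=None

Configure output of transform and fit_transform.

"default" : Default output format of a transformer
"pandas" : DataFrame output
"polars" : Polars output
None : Transform configuration is unchanged

Added in version 1.4: "polars" option was added.

Returns:

self : estimator instance

Estimator instance.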

set_params(**params)#

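Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.

Parameters:

**params : dict

Estimator parameters.

Returns:

self : estimator instance

Estimator instance.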
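Both the flat and the nested <component>__<parameter> forms can be seen in a short sketch. The step names "scale" and "cluster" are arbitrary labels chosen for the example.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Flat form: set parameters directly on the estimator.
km = MiniBatchKMeans(n_clusters=8)
km.set_params(n_clusters=3, batch_size=256)
print(km.get_params()["n_clusters"])  # 3

# Nested form: address the estimator through its step name in a Pipeline.
pipe = Pipeline([("scale", StandardScaler()), ("cluster", km)])
pipe.set_params(cluster__n_clusters=5)
print(pipe.named_steps["cluster"].n_clusters)  # 5
```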

set_partial_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → MiniBatchKMeans#

Request metadata passed to the partial_fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True : metadata is requested, and passed to partial_fit if provided. The request is ignored if metadata is not provided.
False : metadata is not requested and the meta-estimator will not pass it to partial_fit.
None : metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str : metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for sample_weight parameter in partial_fit.

Returns:

self : object

The updated object.

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → MiniBatchKMeans#

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True : metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False : metadata is not requested and the meta-estimator will not pass it to score.
None : metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str : metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for sample_weight parameter in score.

Returns:

self : object

The updated object.

transform(X)#

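Transform X to a cluster-distance space.

In the new space, each dimension is the distance to the cluster centers. Note that even if X is sparse, the array returned by transform will typically be dense.

Parameters:

X : {array-like, sparse matrix} of shape (n_samples, n_features)

New data to transform.

Returns:

X_new : ndarray of shape (n_samples, n_clusters)

X transformed in the new space.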
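The cluster-distance space can be checked against a direct computation: column j of the output should hold each sample's Euclidean distance to center j. The sketch below compares transform with a plain NumPy calculation on synthetic data.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

X = np.random.RandomState(0).rand(150, 2)  # illustrative data
km = MiniBatchKMeans(n_clusters=3, n_init=1, random_state=0).fit(X)

D = km.transform(X)  # shape (n_samples, n_clusters)

# Direct Euclidean distances from every sample to every center.
manual = np.linalg.norm(X[:, None, :] - km.cluster_centers_[None, :, :], axis=2)

print(D.shape)  # (150, 3)
print(np.allclose(D, manual, atol=1e-6))
```

Taking D.argmin(axis=1) gives the index of the nearest center, which is the quantity predict reports.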

Gallery examples#

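Biclustering documents with the Spectral Co-clustering algorithm

Comparison of the K-Means and MiniBatchKMeans clustering algorithms

Empirical evaluation of the impact of k-means initialization

Comparing different clustering algorithms on toy datasets

Online learning of a dictionary of parts of faces

Compare BIRCH and MiniBatchKMeans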