Kmeans: k-均值聚类
k-means聚类的实现。
> 来自 mlxtend.cluster 的 Kmeans
概述
聚类属于无监督学习的范畴,这是机器学习的一个子领域,在实际应用中我们无法获取真实的标签。在聚类中,我们的目标是根据相似性对样本进行分组(在k均值中:欧几里得距离)。
k均值算法可以总结如下:
- 随机选择k个质心作为初始聚类中心。
- 将每个样本分配给最近的质心 $\mu(j), \; j \in {1,...,k}$。
- 将质心移动到被分配给它的样本的中心。
- 重复步骤2和3,直到聚类分配不再变化,或者达到用户定义的容忍度或最大迭代次数。
参考文献
- MacQueen, J. B. (1967). 用于分类和多变量观察分析的一些方法。第五届伯克利数理统计与概率研讨会会议录。加利福尼亚大学出版社。第281–297页。MR 0214227。Zbl 0214.46201。检索于2009-04-07。
示例 1 - 三个斑点
加载一些示例数据:
import matplotlib.pyplot as plt
from mlxtend.data import three_blobs_data
X, y = three_blobs_data()
plt.scatter(X[:, 0], X[:, 1], c='white')
plt.show()
计算聚类中心:
from mlxtend.cluster import Kmeans
km = Kmeans(k=3,
max_iter=50,
random_seed=1,
print_progress=3)
km.fit(X)
print('Iterations until convergence:', km.iterations_)
print('Final centroids:\n', km.centroids_)
Iteration: 2/50 | Elapsed: 00:00:00 | ETA: 00:00:00
Iterations until convergence: 2
Final centroids:
[[-1.5947298 2.92236966]
[ 2.06521743 0.96137409]
[ 0.9329651 4.35420713]]
可视化聚类成员资格:
y_clust = km.predict(X)
plt.scatter(X[y_clust == 0, 0],
X[y_clust == 0, 1],
s=50,
c='lightgreen',
marker='s',
label='cluster 1')
plt.scatter(X[y_clust == 1,0],
X[y_clust == 1,1],
s=50,
c='orange',
marker='o',
label='cluster 2')
plt.scatter(X[y_clust == 2,0],
X[y_clust == 2,1],
s=50,
c='lightblue',
marker='v',
label='cluster 3')
plt.scatter(km.centroids_[:,0],
km.centroids_[:,1],
s=250,
marker='*',
c='red',
label='centroids')
plt.legend(loc='lower left',
scatterpoints=1)
plt.grid()
plt.show()
API
Kmeans(k, max_iter=10, convergence_tolerance=1e-05, random_seed=None, print_progress=0)
K-means clustering class.
Added in 0.4.1dev
Parameters
-
k
: intNumber of clusters
-
max_iter
: int (default: 10)Number of iterations during cluster assignment. Cluster re-assignment stops automatically when the algorithm converged.
-
convergence_tolerance
: float (default: 1e-05)Compares current centroids with centroids of the previous iteration using the given tolerance (a small positive float)to determine if the algorithm converged early.
-
random_seed
: int (default: None)Set random state for the initial centroid assignment.
-
print_progress
: int (default: 0)Prints progress in fitting to stderr. 0: No output 1: Iterations elapsed 2: 1 plus time elapsed 3: 2 plus estimated time until completion
Attributes
-
centroids_
: 2d-array, shape={k, n_features}Feature values of the k cluster centroids.
-
custers_
: dictionaryThe cluster assignments stored as a Python dictionary; the dictionary keys denote the cluster indeces and the items are Python lists of the sample indices that were assigned to each cluster.
-
iterations_
: intNumber of iterations until convergence.
Examples
For usage examples, please see https://rasbt.github.io/mlxtend/user_guide/classifier/Kmeans/
Methods
fit(X, init_params=True)
Learn model from training data.
Parameters
-
X
: {array-like, sparse matrix}, shape = [n_samples, n_features]Training vectors, where n_samples is the number of samples and n_features is the number of features.
-
init_params
: bool (default: True)Re-initializes model parameters prior to fitting. Set False to continue training with weights from a previous model fitting.
Returns
self
: object
predict(X)
Predict targets from X.
Parameters
-
X
: {array-like, sparse matrix}, shape = [n_samples, n_features]Training vectors, where n_samples is the number of samples and n_features is the number of features.
Returns
-
target_values
: array-like, shape = [n_samples]Predicted target values.