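.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/cluster/plot_adjusted_for_chance_measures.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end `
        to download the full example code or to run this example in your browser via Binder.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_cluster_plot_adjusted_for_chance_measures.py:

==========================================================
Adjustment for chance in clustering performance evaluation
==========================================================

This notebook explores how uniformly distributed random labelings affect the
behavior of some clustering evaluation metrics. To that end, the metrics are
computed with a fixed number of samples and as a function of the number of
clusters assigned by the estimator. The example is divided into two
experiments:

- in the first experiment, the "ground truth labels" are fixed (and therefore
  so is the number of classes), while the "predicted labels" are random;
- in the second experiment, the "ground truth labels" vary and the "predicted
  labels" are random. The "predicted labels" have the same number of classes
  and clusters as the "ground truth labels".

.. GENERATED FROM PYTHON SOURCE LINES 10-14

.. code-block:: Python


    # Authors: The scikit-learn developers
    # SPDX-License-Identifier: BSD-3-Clause

.. GENERATED FROM PYTHON SOURCE LINES 15-34

Defining the list of metrics to evaluate
----------------------------------------

Clustering algorithms are fundamentally unsupervised learning methods.
However, since we assign class labels to the synthetic clusters in this
example, we can use evaluation metrics that leverage this "supervised" ground
truth information to quantify the quality of the resulting clusters. Examples
of such metrics are the following:

- V-measure, the harmonic mean of completeness and homogeneity;

- Rand index, which measures how frequently pairs of data points are grouped
  consistently according to the result of the clustering algorithm and the
  ground truth class assignment;

- Adjusted Rand index (ARI), a chance-adjusted Rand index such that a random
  cluster assignment has an ARI of 0.0 in expectation;

- Mutual Information (MI), an information-theoretic measure that quantifies
  how dependent the two labelings are. Note that the maximum value of MI for
  perfect labelings depends on the number of clusters and samples;

- Normalized Mutual Information (NMI), a Mutual Information normalized to lie
  between 0 (no mutual information, in the limit of a large number of data
  points) and 1 (perfectly matching label assignments, up to a permutation of
  the labels). It is not adjusted for chance: when the number of clustered
  data points is not large enough, the expected values of MI or NMI for
  random labelings can be significantly non-zero;

- Adjusted Mutual Information (AMI), a chance-adjusted Mutual Information.
  As with ARI, a random cluster assignment has an AMI of 0.0 in expectation.

For more information, see the :ref:`clustering_evaluation` module.

.. GENERATED FROM PYTHON SOURCE LINES 34-46

.. code-block:: Python


    from sklearn import metrics

    # (display name, scoring function) pairs evaluated in both experiments
    score_funcs = [
        ("V-measure", metrics.v_measure_score),
        ("Rand index", metrics.rand_score),
        ("ARI", metrics.adjusted_rand_score),
        ("MI", metrics.mutual_info_score),
        ("NMI", metrics.normalized_mutual_info_score),
        ("AMI", metrics.adjusted_mutual_info_score),
    ]

.. GENERATED FROM PYTHON SOURCE LINES 47-51

First experiment: fixed ground truth labels and growing number of clusters
--------------------------------------------------------------------------

We first define a function that creates uniformly distributed random labels.

..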
.. GENERATED FROM PYTHON SOURCE LINES 51-61

.. code-block:: Python


    import numpy as np

    # Shared, seeded generator so that the whole example is reproducible
    rng = np.random.RandomState(0)


    def random_labels(n_samples, n_classes):
        # Draw one label per sample, uniformly from {0, ..., n_classes - 1}
        return rng.randint(low=0, high=n_classes, size=n_samples)


.. GENERATED FROM PYTHON SOURCE LINES 62-63

Another function uses `random_labels` to create a fixed set of ground truth
labels (`labels_a`) distributed over `n_classes`, and then scores several sets
of random "predicted" labels (`labels_b`) to assess the variability of a given
metric at a given `n_clusters`.

.. GENERATED FROM PYTHON SOURCE LINES 63-79

.. code-block:: Python


    def fixed_classes_uniform_labelings_scores(
        score_func, n_samples, n_clusters_range, n_classes, n_runs=5
    ):
        # One row per number of clusters, one column per repeated run
        scores = np.zeros((len(n_clusters_range), n_runs))
        # The ground truth labeling is drawn once and reused for every score
        labels_a = random_labels(n_samples=n_samples, n_classes=n_classes)

        for i, n_clusters in enumerate(n_clusters_range):
            for j in range(n_runs):
                labels_b = random_labels(n_samples=n_samples, n_classes=n_clusters)
                scores[i, j] = score_func(labels_a, labels_b)
        return scores


.. GENERATED FROM PYTHON SOURCE LINES 80-81

In this first example we set the number of classes (the true number of
clusters) to `n_classes=10`. The number of clusters varies over the values
provided by `n_clusters_range`.

.. GENERATED FROM PYTHON SOURCE LINES 81-121

.. code-block:: Python


    import matplotlib.pyplot as plt
    import seaborn as sns

    n_samples = 1000
    n_classes = 10
    n_clusters_range = np.linspace(2, 100, 10).astype(int)
    plots = []
    names = []

    sns.color_palette("colorblind")
    plt.figure(1)

    for marker, (score_name, score_func) in zip("d^vx.,", score_funcs):
        scores = fixed_classes_uniform_labelings_scores(
            score_func, n_samples, n_clusters_range, n_classes=n_classes
        )
        # Mean over runs, with the standard deviation as error bars
        plots.append(
            plt.errorbar(
                n_clusters_range,
                scores.mean(axis=1),
                scores.std(axis=1),
                alpha=0.8,
                linewidth=1,
                marker=marker,
            )[0]
        )
        names.append(score_name)

    plt.title(
        "Clustering measures for random uniform labeling\n"
        f"against reference assignment with {n_classes} classes"
    )
    plt.xlabel(f"Number of clusters (Number of samples is fixed to {n_samples})")
    plt.ylabel("Score value")
    plt.ylim(bottom=-0.05, top=1.05)
    plt.legend(plots, names, bbox_to_anchor=(0.5, 0.5))
    plt.show()



..
.. image-sg:: /auto_examples/cluster/images/sphx_glr_plot_adjusted_for_chance_measures_001.png
   :alt: Clustering measures for random uniform labeling against reference assignment with 10 classes
   :srcset: /auto_examples/cluster/images/sphx_glr_plot_adjusted_for_chance_measures_001.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 122-130

The Rand index saturates for `n_clusters` > `n_classes`. Other non-adjusted
measures, such as the V-measure, increase roughly linearly with the number of
clusters when the number of samples is fixed.

Chance-adjusted measures, such as ARI and AMI, display some random variation
centered around a mean score of 0.0, independently of the number of samples
and clusters.

Second experiment: varying number of classes and clusters
---------------------------------------------------------

In this section we define a similar function that uses several metrics to
score 2 uniformly distributed random labelings. In this case, the number of
classes and the number of assigned clusters are matched for each possible
value in `n_clusters_range`.

.. GENERATED FROM PYTHON SOURCE LINES 130-143

.. code-block:: Python


    def uniform_labelings_scores(score_func, n_samples, n_clusters_range, n_runs=5):
        # One row per number of clusters, one column per repeated run
        scores = np.zeros((len(n_clusters_range), n_runs))

        for i, n_clusters in enumerate(n_clusters_range):
            for j in range(n_runs):
                # Both labelings are random and use the same number of labels
                labels_a = random_labels(n_samples=n_samples, n_classes=n_clusters)
                labels_b = random_labels(n_samples=n_samples, n_classes=n_clusters)
                scores[i, j] = score_func(labels_a, labels_b)
        return scores


.. GENERATED FROM PYTHON SOURCE LINES 144-145

In this case, we use `n_samples=100` to show the effect of having a number of
clusters similar or equal to the number of samples.

.. GENERATED FROM PYTHON SOURCE LINES 145-178

.. code-block:: Python


    n_samples = 100
    n_clusters_range = np.linspace(2, n_samples, 10).astype(int)

    plt.figure(2)

    plots = []
    names = []

    for marker, (score_name, score_func) in zip("d^vx.,", score_funcs):
        scores = uniform_labelings_scores(score_func, n_samples, n_clusters_range)
        # Median over runs, with the standard deviation as error bars
        plots.append(
            plt.errorbar(
                n_clusters_range,
                np.median(scores, axis=1),
                scores.std(axis=1),
                alpha=0.8,
                linewidth=2,
                marker=marker,
            )[0]
        )
        names.append(score_name)

    plt.title(
        "Clustering measures for 2 random uniform labelings\nwith equal number of clusters"
    )
    plt.xlabel(f"Number of clusters (Number of samples is fixed to {n_samples})")
    plt.ylabel("Score value")
    plt.legend(plots, names)
    plt.ylim(bottom=-0.05, top=1.05)
    plt.show()



..
.. image-sg:: /auto_examples/cluster/images/sphx_glr_plot_adjusted_for_chance_measures_002.png
   :alt: Clustering measures for 2 random uniform labelings with equal number of clusters
   :srcset: /auto_examples/cluster/images/sphx_glr_plot_adjusted_for_chance_measures_002.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 179-184

We observe results similar to those of the first experiment: chance-adjusted
metrics stay close to zero, while the other metrics tend to grow with
finer-grained labelings. The mean V-measure of random labelings increases
significantly as the number of clusters approaches the total number of samples
used to compute the measure. Furthermore, raw Mutual Information has no upper
bound, and its scale depends on the dimensions of the clustering problem and
the cardinality of the ground truth classes. This is why its curve goes off
the chart.

Only adjusted measures can therefore be safely used as a consensus index to
evaluate the average stability of clustering algorithms for a given value of k
on various overlapping sub-samples of the dataset.

Non-adjusted clustering evaluation metrics can be misleading because they
output large values for fine-grained labelings. This may suggest that the
labeling has captured meaningful groups when it may in fact be completely
random. In particular, such non-adjusted metrics should not be used to compare
the results of different clustering algorithms that output different numbers
of clusters.


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 0.423 seconds)


.. _sphx_glr_download_auto_examples_cluster_plot_adjusted_for_chance_measures.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/scikit-learn/scikit-learn/main?urlpath=lab/tree/notebooks/auto_examples/cluster/plot_adjusted_for_chance_measures.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_adjusted_for_chance_measures.ipynb `

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_adjusted_for_chance_measures.py `

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_adjusted_for_chance_measures.zip `


.. include:: plot_adjusted_for_chance_measures.recommendations


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_