.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/covariance/plot_mahalanobis_distances.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_covariance_plot_mahalanobis_distances.py>`
        to download the full example code or to run this example in your browser via Binder.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_covariance_plot_mahalanobis_distances.py:


================================================================
Robust covariance estimation and Mahalanobis distances relevance
================================================================

This example shows covariance estimation with Mahalanobis distances on
Gaussian distributed data.

For Gaussian distributed data, the distance of an observation
:math:`x_i` to the mode of the distribution can be computed using its
Mahalanobis distance:

.. math::

    d_{(\mu,\Sigma)}(x_i)^2 = (x_i - \mu)^T\Sigma^{-1}(x_i - \mu)

where :math:`\mu` and :math:`\Sigma` are the location and the covariance of
the underlying Gaussian distribution.

In practice, :math:`\mu` and :math:`\Sigma` are replaced by estimates. The
standard maximum likelihood estimate (MLE) of the covariance is very
sensitive to outliers in the data set, and so are the Mahalanobis distances
computed from it. It is better to use a robust covariance estimator, so that
the estimate resists "erroneous" observations in the data set and the
resulting Mahalanobis distances accurately reflect the true organization of
the observations.

The Minimum Covariance Determinant estimator (MCD) is a robust covariance
estimator with a high breakdown point: it can be used to estimate the
covariance matrix of highly contaminated data sets, with up to
:math:`\frac{n_\text{samples}-n_\text{features}-1}{2}` outliers. The idea
behind the MCD is to find
:math:`\frac{n_\text{samples}+n_\text{features}+1}{2}` observations whose
empirical covariance has the smallest determinant. This yields a "pure"
subset of observations from which the standard estimates of location and
covariance are computed. The MCD was introduced by P. J. Rousseeuw in [1]_.

This example illustrates how Mahalanobis distances are affected by outlying
data. With Mahalanobis distances based on the standard covariance MLE,
observations drawn from a contaminating distribution cannot be distinguished
from observations drawn from the real Gaussian distribution. With MCD-based
Mahalanobis distances, the two populations become distinguishable.
Associated applications include outlier detection, observation ranking and
clustering.

.. note::

    See also :ref:`sphx_glr_auto_examples_covariance_plot_robust_vs_empirical_covariance.py`

.. rubric:: References

.. [1] P. J. Rousseeuw. Least median of squares regression.
    J. Am Stat Ass, 79:871, 1984.
.. [2] Wilson, E. B., & Hilferty, M. M. (1931). The distribution of chi-square.
    Proceedings of the National Academy of Sciences of the United States
    of America, 17, 684-688.

.. GENERATED FROM PYTHON SOURCE LINES 38-42

Generate data
-------------

First, we generate a data set of 125 samples and 2 features. Both features
are Gaussian distributed with a mean of 0, but feature 1 has a standard
deviation of 2 and feature 2 has a standard deviation of 1. Next, the last
25 samples are replaced with Gaussian outlier samples in which feature 1 has
a standard deviation of 1 and feature 2 has a standard deviation of 7.

.. GENERATED FROM PYTHON SOURCE LINES 42-61

.. code-block:: Python


    import numpy as np

    # for consistent results
    np.random.seed(7)

    n_samples = 125
    n_outliers = 25
    n_features = 2

    # generate Gaussian data of shape (125, 2)
    gen_cov = np.eye(n_features)
    gen_cov[0, 0] = 2.0
    X = np.dot(np.random.randn(n_samples, n_features), gen_cov)
    # add some outliers
    outliers_cov = np.eye(n_features)
    outliers_cov[np.arange(1, n_features), np.arange(1, n_features)] = 7.0
    X[-n_outliers:] = np.dot(np.random.randn(n_outliers, n_features), outliers_cov)


.. GENERATED FROM PYTHON SOURCE LINES 62-66

Comparison of results
---------------------

Below, we fit MCD and MLE based covariance estimators to our data and print
the estimated covariance matrices. Note that the variance of feature 2
estimated by the MLE based estimator (7.5) is much higher than the one
estimated by the robust MCD estimator (1.2). This shows that the MCD based
robust estimator is much more resistant to the outlier samples, which were
designed to have a much larger variance in feature 2.

.. GENERATED FROM PYTHON SOURCE LINES 66-81

.. code-block:: Python


    import matplotlib.pyplot as plt

    from sklearn.covariance import EmpiricalCovariance, MinCovDet

    # fit a MCD robust estimator to data
    robust_cov = MinCovDet().fit(X)
    # fit a MLE estimator to data
    emp_cov = EmpiricalCovariance().fit(X)
    print(
        "Estimated covariance matrix:\nMCD (Robust):\n{}\nMLE:\n{}".format(
            robust_cov.covariance_, emp_cov.covariance_
        )
    )


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Estimated covariance matrix:
    MCD (Robust):
    [[ 3.26253567e+00 -3.06695631e-03]
     [-3.06695631e-03  1.22747343e+00]]
    MLE:
    [[ 3.23773583 -0.24640578]
     [-0.24640578  7.51963999]]


.. GENERATED FROM PYTHON SOURCE LINES 82-83

To better visualize the difference, we plot contours of the Mahalanobis
distances calculated by both methods. Notice that the robust MCD based
Mahalanobis distances fit the inliers (black points) much better, whereas
the MLE based distances are more influenced by the outliers (red points).

.. GENERATED FROM PYTHON SOURCE LINES 83-129

.. code-block:: Python


    import matplotlib.lines as mlines

    fig, ax = plt.subplots(figsize=(10, 5))
    # Plot data set
    inlier_plot = ax.scatter(X[:, 0], X[:, 1], color="black", label="inliers")
    outlier_plot = ax.scatter(
        X[:, 0][-n_outliers:], X[:, 1][-n_outliers:], color="red", label="outliers"
    )
    ax.set_xlim(ax.get_xlim()[0], 10.0)
    ax.set_title("Mahalanobis distances of a contaminated data set")

    # Create meshgrid of feature 1 and feature 2 values
    xx, yy = np.meshgrid(
        np.linspace(plt.xlim()[0], plt.xlim()[1], 100),
        np.linspace(plt.ylim()[0], plt.ylim()[1], 100),
    )
    zz = np.c_[xx.ravel(), yy.ravel()]
    # Calculate the MLE based Mahalanobis distances of the meshgrid
    mahal_emp_cov = emp_cov.mahalanobis(zz)
    mahal_emp_cov = mahal_emp_cov.reshape(xx.shape)
    emp_cov_contour = plt.contour(
        xx, yy, np.sqrt(mahal_emp_cov), cmap=plt.cm.PuBu_r, linestyles="dashed"
    )
    # Calculate the MCD based Mahalanobis distances
    mahal_robust_cov = robust_cov.mahalanobis(zz)
    mahal_robust_cov = mahal_robust_cov.reshape(xx.shape)
    robust_contour = ax.contour(
        xx, yy, np.sqrt(mahal_robust_cov), cmap=plt.cm.YlOrBr_r, linestyles="dotted"
    )

    # Add legend
    ax.legend(
        [
            mlines.Line2D([], [], color="tab:blue", linestyle="dashed"),
            mlines.Line2D([], [], color="tab:orange", linestyle="dotted"),
            inlier_plot,
            outlier_plot,
        ],
        ["MLE dist", "MCD dist", "inliers", "outliers"],
        loc="upper right",
        borderaxespad=0,
    )

    plt.show()



.. image-sg:: /auto_examples/covariance/images/sphx_glr_plot_mahalanobis_distances_001.png
   :alt: Mahalanobis distances of a contaminated data set
   :srcset: /auto_examples/covariance/images/sphx_glr_plot_mahalanobis_distances_001.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 130-131

Finally, we highlight the ability of MCD based Mahalanobis distances to
separate outliers. We take the cube root of the squared Mahalanobis
distances returned by ``mahalanobis``, which yields approximately normal
distributions (as suggested by Wilson and Hilferty [2]_), then plot the
values of the inlier and outlier samples with boxplots. The distribution of
the outlier samples is more clearly separated from the distribution of the
inlier samples for the robust MCD based Mahalanobis distances.

.. GENERATED FROM PYTHON SOURCE LINES 131-169

.. code-block:: Python


    fig, (ax1, ax2) = plt.subplots(1, 2)
    plt.subplots_adjust(wspace=0.6)

    # Calculate cubic root of MLE Mahalanobis distances for samples
    emp_mahal = emp_cov.mahalanobis(X - np.mean(X, 0)) ** (0.33)
    # Plot boxplots
    ax1.boxplot([emp_mahal[:-n_outliers], emp_mahal[-n_outliers:]], widths=0.25)
    # Plot individual samples
    ax1.plot(
        np.full(n_samples - n_outliers, 1.26),
        emp_mahal[:-n_outliers],
        "+k",
        markeredgewidth=1,
    )
    ax1.plot(np.full(n_outliers, 2.26), emp_mahal[-n_outliers:], "+k", markeredgewidth=1)
    ax1.axes.set_xticklabels(("inliers", "outliers"), size=15)
    ax1.set_ylabel(r"$\sqrt[3]{\rm{(Mahal. dist.)}}$", size=16)
    ax1.set_title("Using non-robust estimates\n(Maximum Likelihood)")

    # Calculate cubic root of MCD Mahalanobis distances for samples
    robust_mahal = robust_cov.mahalanobis(X - robust_cov.location_) ** (0.33)
    # Plot boxplots
    ax2.boxplot([robust_mahal[:-n_outliers], robust_mahal[-n_outliers:]], widths=0.25)
    # Plot individual samples
    ax2.plot(
        np.full(n_samples - n_outliers, 1.26),
        robust_mahal[:-n_outliers],
        "+k",
        markeredgewidth=1,
    )
    ax2.plot(np.full(n_outliers, 2.26), robust_mahal[-n_outliers:], "+k", markeredgewidth=1)
    ax2.axes.set_xticklabels(("inliers", "outliers"), size=15)
    ax2.set_ylabel(r"$\sqrt[3]{\rm{(Mahal. dist.)}}$", size=16)
    ax2.set_title("Using robust estimates\n(Minimum Covariance Determinant)")

    plt.show()



.. image-sg:: /auto_examples/covariance/images/sphx_glr_plot_mahalanobis_distances_002.png
   :alt: Using non-robust estimates (Maximum Likelihood), Using robust estimates (Minimum Covariance Determinant)
   :srcset: /auto_examples/covariance/images/sphx_glr_plot_mahalanobis_distances_002.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 0.126 seconds)


.. _sphx_glr_download_auto_examples_covariance_plot_mahalanobis_distances.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/scikit-learn/scikit-learn/main?urlpath=lab/tree/notebooks/auto_examples/covariance/plot_mahalanobis_distances.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_mahalanobis_distances.ipynb <plot_mahalanobis_distances.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_mahalanobis_distances.py <plot_mahalanobis_distances.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_mahalanobis_distances.zip <plot_mahalanobis_distances.zip>`


.. include:: plot_mahalanobis_distances.recommendations


.. only:: html

 .. rst-class:: sphx-glr-signature

    Gallery generated by Sphinx-Gallery
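
As a supplementary check of the distance formula given in the introduction,
the squared Mahalanobis distance can be written out by hand and compared
with the value returned by ``EmpiricalCovariance.mahalanobis``. This is only
a minimal sketch: the names ``X_demo``, ``est``, ``manual_sq`` and
``sklearn_sq`` and the small synthetic data set are assumptions introduced
for illustration and are not part of the example above.

.. code-block:: Python

    import numpy as np

    from sklearn.covariance import EmpiricalCovariance

    # Hypothetical demo data: 50 samples, 2 features with std 2 and 1
    rng = np.random.RandomState(0)
    X_demo = rng.randn(50, 2) * np.array([2.0, 1.0])

    est = EmpiricalCovariance().fit(X_demo)

    # (x_i - mu)^T Sigma^{-1} (x_i - mu), evaluated row by row
    centered = X_demo - est.location_
    sigma_inv = np.linalg.inv(est.covariance_)
    manual_sq = np.einsum("ij,jk,ik->i", centered, sigma_inv, centered)

    # The estimator is given the observations and returns squared distances
    sklearn_sq = est.mahalanobis(X_demo)

    print(np.allclose(manual_sq, sklearn_sq))

Because ``mahalanobis`` returns squared distances, the contour code in this
example applies ``np.sqrt`` before plotting.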