.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/linear_model/plot_ard.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end
        <sphx_glr_download_auto_examples_linear_model_plot_ard.py>`
        to download the full example code or to run this example in your
        browser via Binder

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_linear_model_plot_ard.py:

====================================
Comparing Linear Bayesian Regressors
====================================

This example compares two different Bayesian regressors:

- :ref:`automatic_relevance_determination`
- :ref:`bayesian_ridge_regression`

In the first part, we use an :ref:`ordinary_least_squares` (OLS) model as a
baseline for comparing the models' coefficients with respect to the true
coefficients. Thereafter, we show that the estimation of such models is done
by iteratively maximizing the marginal log-likelihood of the observations.

In the last section we plot predictions and uncertainties for the ARD and the
Bayesian Ridge regressions using a polynomial feature expansion to fit a
non-linear relationship between `X` and `y`.

.. GENERATED FROM PYTHON SOURCE LINES 16-19

.. code-block:: Python


    # Author: Arturo Amor


.. GENERATED FROM PYTHON SOURCE LINES 20-27

Models robustness to recover the ground truth weights
=====================================================

Generate synthetic dataset
--------------------------

We generate a dataset where `X` and `y` are linearly linked: 10 of the
features of `X` will be used to generate `y`. The other features are not
useful at predicting `y`. In addition, we generate a dataset where
`n_samples == n_features`. Such a setting is challenging for an OLS model
and potentially leads to arbitrarily large weights. Having a prior on the
weights and a penalty alleviates the problem. Finally, Gaussian noise is
added.

.. GENERATED FROM PYTHON SOURCE LINES 27-39

.. code-block:: Python

    from sklearn.datasets import make_regression

    X, y, true_weights = make_regression(
        n_samples=100,
        n_features=100,
        n_informative=10,
        noise=8,
        coef=True,
        random_state=42,
    )

.. GENERATED FROM PYTHON SOURCE LINES 40-44

Fit the regressors
------------------

We now fit both Bayesian models and the OLS to later compare the models'
coefficients.

.. GENERATED FROM PYTHON SOURCE LINES 44-61

.. code-block:: Python

    import pandas as pd

    from sklearn.linear_model import ARDRegression, BayesianRidge, LinearRegression

    olr = LinearRegression().fit(X, y)
    brr = BayesianRidge(compute_score=True, max_iter=30).fit(X, y)
    ard = ARDRegression(compute_score=True, max_iter=30).fit(X, y)
    df = pd.DataFrame(
        {
            "Weights of true generative process": true_weights,
            "ARDRegression": ard.coef_,
            "BayesianRidge": brr.coef_,
            "LinearRegression": olr.coef_,
        }
    )

.. GENERATED FROM PYTHON SOURCE LINES 62-66

Plot the true and estimated coefficients
----------------------------------------

Now we compare the coefficients of each model with the weights of the true
generative model.

.. GENERATED FROM PYTHON SOURCE LINES 66-82

.. code-block:: Python

    import matplotlib.pyplot as plt
    import seaborn as sns
    from matplotlib.colors import SymLogNorm

    plt.figure(figsize=(10, 6))
    ax = sns.heatmap(
        df.T,
        norm=SymLogNorm(linthresh=10e-4, vmin=-80, vmax=80),
        cbar_kws={"label": "coefficients' values"},
        cmap="seismic_r",
    )
    plt.ylabel("linear model")
    plt.xlabel("coefficients")
    plt.tight_layout(rect=(0, 0, 1, 0.95))
    _ = plt.title("Models' coefficients")

.. image-sg:: /auto_examples/linear_model/images/sphx_glr_plot_ard_001.png
   :alt: Models' coefficients
   :srcset: /auto_examples/linear_model/images/sphx_glr_plot_ard_001.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 83-84

Due to the added noise, none of the models recover the true weights. Indeed,
all models always have more than 10 non-zero coefficients. Compared to the
OLS estimator, the coefficients using a Bayesian Ridge regression are
slightly shifted toward zero, which stabilises them. The ARD regression
provides a sparser solution: some of the non-informative coefficients are set
exactly to zero, while others are shifted closer to zero. Some
non-informative coefficients are still present and retain large values.

.. GENERATED FROM PYTHON SOURCE LINES 87-89

Plot the marginal log-likelihood
--------------------------------

.. GENERATED FROM PYTHON SOURCE LINES 89-101

.. code-block:: Python

    import numpy as np

    ard_scores = -np.array(ard.scores_)
    brr_scores = -np.array(brr.scores_)
    plt.plot(ard_scores, color="navy", label="ARD")
    plt.plot(brr_scores, color="red", label="BayesianRidge")
    plt.ylabel("Log-likelihood")
    plt.xlabel("Iterations")
    plt.xlim(1, 30)
    plt.legend()
    _ = plt.title("Models log-likelihood")

.. image-sg:: /auto_examples/linear_model/images/sphx_glr_plot_ard_002.png
   :alt: Models log-likelihood
   :srcset: /auto_examples/linear_model/images/sphx_glr_plot_ard_002.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 102-110

Indeed, both models minimize the log-likelihood up to an arbitrary cutoff
defined by the `max_iter` parameter.
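Before moving on to the polynomial example, the sparsity observation from the
coefficients section can be checked numerically. The following is a minimal
sketch, not part of the original example, that reuses the estimators fitted
above; the `1e-3` threshold for treating a coefficient as practically zero is
an arbitrary choice.

.. code-block:: Python

    import numpy as np

    for name, coef in [
        ("LinearRegression", olr.coef_),
        ("BayesianRidge", brr.coef_),
        ("ARDRegression", ard.coef_),
    ]:
        # count coefficients that are not practically zero (arbitrary threshold)
        n_large = int(np.sum(np.abs(coef) > 1e-3))
        # mean squared distance to the weights of the true generative process
        mse = np.mean((coef - true_weights) ** 2)
        print(f"{name}: {n_large} coefficients above 1e-3, "
              f"MSE to true weights: {mse:.2f}")

In line with the heatmap, ARDRegression should report the fewest coefficients
above the threshold.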
Bayesian regressions with polynomial feature expansion
=======================================================

Generate synthetic dataset
--------------------------

We create a target that is a non-linear function of the input feature. Noise
following a standard normal distribution, scaled by a factor of 1.35, is
added.

.. GENERATED FROM PYTHON SOURCE LINES 110-132

.. code-block:: Python

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures, StandardScaler

    rng = np.random.RandomState(0)
    n_samples = 110

    # sort the data to make plotting easier later
    X = np.sort(-10 * rng.rand(n_samples) + 10)
    noise = rng.normal(0, 1, n_samples) * 1.35
    y = np.sqrt(X) * np.sin(X) + noise
    full_data = pd.DataFrame({"input_feature": X, "target": y})
    X = X.reshape((-1, 1))

    # extrapolation
    X_plot = np.linspace(10, 10.4, 10)
    y_plot = np.sqrt(X_plot) * np.sin(X_plot)
    X_plot = np.concatenate((X, X_plot.reshape((-1, 1))))
    y_plot = np.concatenate((y - noise, y_plot))

.. GENERATED FROM PYTHON SOURCE LINES 133-137

Fit the regressors
------------------

Here we try a degree 10 polynomial to potentially overfit, though the
Bayesian linear models regularize the size of the polynomial coefficients. As
`fit_intercept=True` by default for
:class:`~sklearn.linear_model.ARDRegression` and
:class:`~sklearn.linear_model.BayesianRidge`,
:class:`~sklearn.preprocessing.PolynomialFeatures` should not introduce an
additional bias feature. By setting `return_std=True`, the Bayesian
regressors return the standard deviation of the posterior distribution for
the model parameters.

.. GENERATED FROM PYTHON SOURCE LINES 137-152

.. code-block:: Python

    ard_poly = make_pipeline(
        PolynomialFeatures(degree=10, include_bias=False),
        StandardScaler(),
        ARDRegression(),
    ).fit(X, y)
    brr_poly = make_pipeline(
        PolynomialFeatures(degree=10, include_bias=False),
        StandardScaler(),
        BayesianRidge(),
    ).fit(X, y)

    y_ard, y_ard_std = ard_poly.predict(X_plot, return_std=True)
    y_brr, y_brr_std = brr_poly.predict(X_plot, return_std=True)

.. GENERATED FROM PYTHON SOURCE LINES 153-155

Plotting polynomial regressions with std errors of the scores
--------------------------------------------------------------

.. GENERATED FROM PYTHON SOURCE LINES 155-180

.. code-block:: Python

    ax = sns.scatterplot(
        data=full_data, x="input_feature", y="target", color="black", alpha=0.75
    )
    ax.plot(X_plot, y_plot, color="black", label="Ground Truth")
    ax.plot(X_plot, y_brr, color="red", label="BayesianRidge with polynomial features")
    ax.plot(X_plot, y_ard, color="navy", label="ARD with polynomial features")
    ax.fill_between(
        X_plot.ravel(),
        y_ard - y_ard_std,
        y_ard + y_ard_std,
        color="navy",
        alpha=0.3,
    )
    ax.fill_between(
        X_plot.ravel(),
        y_brr - y_brr_std,
        y_brr + y_brr_std,
        color="red",
        alpha=0.3,
    )
    ax.legend()
    _ = ax.set_title("Polynomial fit of a non-linear feature")

.. image-sg:: /auto_examples/linear_model/images/sphx_glr_plot_ard_003.png
   :alt: Polynomial fit of a non-linear feature
   :srcset: /auto_examples/linear_model/images/sphx_glr_plot_ard_003.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 181-184

The error bars represent one standard deviation of the predicted Gaussian
distribution of the query points. Notice that the ARD regression captures the
ground truth the best when using the default parameters in both models, but
further reducing the `lambda_init` hyperparameter of the Bayesian Ridge can
reduce its bias (see example
:ref:`sphx_glr_auto_examples_linear_model_plot_bayesian_ridge_curvefit.py`).
Finally, due to the intrinsic limitations of a polynomial regression, both
models fail when extrapolating.
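As noted above, the `lambda_init` hyperparameter of
:class:`~sklearn.linear_model.BayesianRidge` can be lowered to reduce the
bias of the fit. The following is a minimal sketch of such a variant, not
part of the original example: the value `1e-3` is purely illustrative, and
`brr_poly_low_lambda`, `y_brr_low` and `y_brr_low_std` are hypothetical
names.

.. code-block:: Python

    # Sketch (assumption): refit the Bayesian Ridge pipeline with a smaller
    # initial precision of the weights, i.e. weaker initial regularization.
    # The value 1e-3 is illustrative, not taken from the original example.
    brr_poly_low_lambda = make_pipeline(
        PolynomialFeatures(degree=10, include_bias=False),
        StandardScaler(),
        BayesianRidge(lambda_init=1e-3),
    ).fit(X, y)

    y_brr_low, y_brr_low_std = brr_poly_low_lambda.predict(X_plot, return_std=True)

Overlaying `y_brr_low` on the plot above would show how the weaker initial
regularization changes the Bayesian Ridge fit; see
:ref:`sphx_glr_auto_examples_linear_model_plot_bayesian_ridge_curvefit.py`
for a fuller treatment of this hyperparameter.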
.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 1.747 seconds)


.. _sphx_glr_download_auto_examples_linear_model_plot_ard.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/scikit-learn/scikit-learn/main?urlpath=lab/tree/notebooks/auto_examples/linear_model/plot_ard.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_ard.ipynb <plot_ard.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_ard.py <plot_ard.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_ard.zip <plot_ard.zip>`

.. include:: plot_ard.recommendations

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_