.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/ensemble/plot_forest_importances.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end `
        to download the full example code or to run this example in your browser via Binder.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_ensemble_plot_forest_importances.py:

==========================================
Feature importances with a forest of trees
==========================================

This example shows how to use a forest of trees to evaluate the importance of
features on an artificial classification task. The blue bars are the feature
importances of the forest, and the error bars show how much the importances
vary between trees.

As expected, the plot suggests that 3 features are informative, while the
remaining ones are not.

.. GENERATED FROM PYTHON SOURCE LINES 11-14

.. code-block:: Python

    import matplotlib.pyplot as plt

.. GENERATED FROM PYTHON SOURCE LINES 15-18

Data generation and model fitting
---------------------------------
We generate a synthetic dataset with only 3 informative features. We
explicitly do not shuffle the dataset, so that the informative features
correspond to the first three columns of X. In addition, we split the dataset
into training and testing subsets.

.. GENERATED FROM PYTHON SOURCE LINES 18-34

.. code-block:: Python

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    # shuffle=False keeps the 3 informative features in columns 0, 1 and 2.
    X, y = make_classification(
        n_samples=1000,
        n_features=10,
        n_informative=3,
        n_redundant=0,
        n_repeated=0,
        n_classes=2,
        random_state=0,
        shuffle=False,
    )
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

.. GENERATED FROM PYTHON SOURCE LINES 35-36

A random forest classifier is fitted to compute the feature importances.

.. GENERATED FROM PYTHON SOURCE LINES 36-43

.. code-block:: Python

    from sklearn.ensemble import RandomForestClassifier

    # Human-readable labels used later as the index of the importance series.
    feature_names = [f"feature {i}" for i in range(X.shape[1])]
    forest = RandomForestClassifier(random_state=0)
    forest.fit(X_train, y_train)

.. raw:: html

    RandomForestClassifier(random_state=0)


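Permutation importance is reported further down as a decrease in accuracy, so
it can help to know the accuracy of the fitted forest first. The following is a
minimal sketch and not part of the original example: it regenerates the same
data so that it runs on its own, and the names ``train_accuracy`` and
``test_accuracy`` are illustrative.

.. code-block:: Python

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Same data and split as in the example above.
    X, y = make_classification(
        n_samples=1000,
        n_features=10,
        n_informative=3,
        n_redundant=0,
        n_repeated=0,
        n_classes=2,
        random_state=0,
        shuffle=False,
    )
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

    forest = RandomForestClassifier(random_state=0)
    forest.fit(X_train, y_train)

    # For classifiers, score() returns the mean accuracy on the given data.
    train_accuracy = forest.score(X_train, y_train)
    test_accuracy = forest.score(X_test, y_test)
    print(f"Train accuracy: {train_accuracy:.3f}")
    print(f"Test accuracy: {test_accuracy:.3f}")

A large gap between the two values would suggest overfitting, in which case
importances computed on the training data deserve extra caution.
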
.. GENERATED FROM PYTHON SOURCE LINES 44-50

Feature importance based on mean decrease in impurity
-----------------------------------------------------
Feature importances are provided by the fitted attribute
`feature_importances_`. They are computed as the mean, across trees, of the
impurity decrease accumulated within each tree; the standard deviation across
trees gives the error bars.

.. warning::
    Impurity-based feature importances can be misleading for **high
    cardinality** features (many unique values). See
    :ref:`permutation_importance` as an alternative below.

.. GENERATED FROM PYTHON SOURCE LINES 50-61

.. code-block:: Python

    import time

    import numpy as np

    start_time = time.time()
    importances = forest.feature_importances_
    # Spread of the per-tree importances, used as error bars in the plot.
    std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)
    elapsed_time = time.time() - start_time

    print(f"Elapsed time to compute the importances: {elapsed_time:.3f} seconds")

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Elapsed time to compute the importances: 0.003 seconds

.. GENERATED FROM PYTHON SOURCE LINES 62-63

Let's plot the impurity-based importance.

.. GENERATED FROM PYTHON SOURCE LINES 63-74

.. code-block:: Python

    import pandas as pd

    forest_importances = pd.Series(importances, index=feature_names)

    fig, ax = plt.subplots()
    forest_importances.plot.bar(yerr=std, ax=ax)
    ax.set_title("Feature importances using MDI")
    ax.set_ylabel("Mean decrease in impurity")
    fig.tight_layout()

.. image-sg:: /auto_examples/ensemble/images/sphx_glr_plot_forest_importances_001.png
   :alt: Feature importances using MDI
   :srcset: /auto_examples/ensemble/images/sphx_glr_plot_forest_importances_001.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 75-80

We observe that, as expected, the first three features are found to be
important.

Feature importance based on feature permutation
-----------------------------------------------
Permutation feature importance overcomes the limitations of impurity-based
feature importance: it has no bias toward high-cardinality features and it can
be computed on a held-out test set.

.. GENERATED FROM PYTHON SOURCE LINES 80-91

.. code-block:: Python

    from sklearn.inspection import permutation_importance

    start_time = time.time()
    # Evaluated on the test set; each feature is permuted n_repeats times.
    result = permutation_importance(
        forest, X_test, y_test, n_repeats=10, random_state=42, n_jobs=2
    )
    elapsed_time = time.time() - start_time
    print(f"Elapsed time to compute the importances: {elapsed_time:.3f} seconds")

    forest_importances = pd.Series(result.importances_mean, index=feature_names)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Elapsed time to compute the importances: 1.332 seconds

.. GENERATED FROM PYTHON SOURCE LINES 92-94

Computing the full permutation importance is more costly. Each feature is
shuffled `n_repeats` times, and the already fitted model is re-scored (not
refitted) on the shuffled data to estimate that feature's importance. See
:ref:`permutation_importance` for more details. We can now plot the importance
ranking.

.. GENERATED FROM PYTHON SOURCE LINES 95-104

.. code-block:: Python

    fig, ax = plt.subplots()
    forest_importances.plot.bar(yerr=result.importances_std, ax=ax)
    ax.set_title("Feature importances using permutation on full model")
    ax.set_ylabel("Mean accuracy decrease")
    fig.tight_layout()
    plt.show()

.. image-sg:: /auto_examples/ensemble/images/sphx_glr_plot_forest_importances_002.png
   :alt: Feature importances using permutation on full model
   :srcset: /auto_examples/ensemble/images/sphx_glr_plot_forest_importances_002.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 105-106

Both methods detect the same features as the most important ones, although the
relative importances differ. As the plots show, MDI is less likely than
permutation importance to fully omit a feature.

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 1.785 seconds)

.. _sphx_glr_download_auto_examples_ensemble_plot_forest_importances.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/scikit-learn/scikit-learn/main?urlpath=lab/tree/notebooks/auto_examples/ensemble/plot_forest_importances.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_forest_importances.ipynb `

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_forest_importances.py `

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_forest_importances.zip `

.. include:: plot_forest_importances.recommendations

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_