permutation_importance

sklearn.inspection.permutation_importance(estimator, X, y, *, scoring=None, n_repeats=5, n_jobs=None, random_state=None, sample_weight=None, max_samples=1.0)

Permutation importance for feature evaluation [BRE].

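The estimator must already be fitted. X can be the dataset used to train the estimator or a hold-out set. The permutation importance of a feature is calculated as follows. First, a baseline metric, defined by scoring, is evaluated on the (possibly different) dataset given by X. Next, a single feature column of that dataset is randomly permuted and the metric is evaluated again. The permutation importance is defined as the difference between the baseline metric and the metric obtained after permuting the feature column.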
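The procedure described above can be sketched by hand. This is a minimal illustration and not the library's implementation: the synthetic dataset, the helper name `manual_permutation_importance`, and the use of the estimator's own `score` method as the metric are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used here only for illustration (an assumption, not from the docs).
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
clf = LogisticRegression().fit(X, y)


def manual_permutation_importance(model, X, y, n_repeats=5, seed=0):
    """Hypothetical helper: baseline score minus score after shuffling one column."""
    rng = np.random.RandomState(seed)
    baseline = model.score(X, y)  # metric on the unpermuted data
    importances = np.empty((X.shape[1], n_repeats))
    for col in range(X.shape[1]):
        for rep in range(n_repeats):
            X_perm = X.copy()
            # Shuffle a single column to break its link with the target.
            X_perm[:, col] = rng.permutation(X_perm[:, col])
            importances[col, rep] = baseline - model.score(X_perm, y)
    return baseline, importances


baseline, importances = manual_permutation_importance(clf, X, y)
print(importances.mean(axis=1))  # one mean importance per feature
```

Each row of `importances` holds the score drops for one feature across repeats, which mirrors the layout of the `importances` attribute returned by `permutation_importance`.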

Read more in the User Guide.

Parameters:

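estimator : object

An estimator that has already been fitted and is compatible with scorer.

X : ndarray or DataFrame, shape (n_samples, n_features)

Data on which permutation importance will be computed.

y : array-like or None, shape (n_samples,) or (n_samples, n_classes)

Targets for supervised learning, or None for unsupervised learning.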

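scoring : str, callable, list, tuple, or dict, default=None

Scorer to use. If scoring represents a single score, one can use:

a single string (see the scoring parameter section of the user guide);
a callable that returns a single value (see Defining your scoring strategy from metric functions).

If scoring represents multiple scores, one can use:

a list or tuple of unique strings;
a callable returning a dictionary where the keys are the metric names and the values are the metric scores;
a dictionary with metric names as keys and callables as values.

Passing multiple scores to scoring is more efficient than calling permutation_importance once per score, because predictions are reused and redundant computation is avoided.

If None, the estimator's default scorer is used.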

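n_repeats : int, default=5

Number of times to permute each feature.

n_jobs : int or None, default=None

Number of jobs to run in parallel. The permutation scores are computed separately for each column, and the work is parallelized over the columns. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

random_state : int, RandomState instance, default=None

Pseudo-random number generator that controls the permutations of each feature. Pass an int to get reproducible results across function calls. See Glossary.

sample_weight : array-like of shape (n_samples,), default=None

Sample weights used in scoring.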

Added in version 0.24.

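max_samples : int or float, default=1.0

The number of samples to draw from X (without replacement) in each repeat.

If int, then draw max_samples samples.
If float, then draw max_samples * X.shape[0] samples.
If max_samples is equal to 1.0 or X.shape[0], all samples will be used.

While using this option may give less accurate importance estimates, it keeps the method tractable when evaluating feature importance on large datasets. Combined with n_repeats, it lets you control the trade-off between computational speed and statistical accuracy.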

Added in version 1.0.

Returns:

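result : Bunch or dict of such instances

Dictionary-like object, with the following attributes.

importances_mean : ndarray of shape (n_features,). Mean of the feature importances over n_repeats.
importances_std : ndarray of shape (n_features,). Standard deviation of the feature importances over n_repeats.
importances : ndarray of shape (n_features, n_repeats). Raw permutation importance scores.

If the scoring parameter contains multiple scoring metrics, result is a dict with scorer names as keys (e.g. 'roc_auc') and Bunch objects like the one above as values.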

References

[BRE]

L. Breiman, “Random Forests”, Machine Learning, 45(1), 5-32, 2001.

Examples

>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.inspection import permutation_importance
>>> X = [[1, 9, 9],[1, 9, 9],[1, 9, 9],
...      [0, 9, 9],[0, 9, 9],[0, 9, 9]]
>>> y = [1, 1, 1, 0, 0, 0]
>>> clf = LogisticRegression().fit(X, y)
>>> result = permutation_importance(clf, X, y, n_repeats=10,
...                                 random_state=0)
>>> result.importances_mean
array([0.4666..., 0.       , 0.       ])
>>> result.importances_std
array([0.2211..., 0.       , 0.       ])
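Since X may also be a hold-out set, a second sketch scores the model on data it was not trained on and ranks the features by mean importance. The synthetic dataset, the split, and the printing format are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data, used here only for illustration.
X, y = make_classification(
    n_samples=400, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Fit on the training split only.
clf = LogisticRegression().fit(X_train, y_train)

# Compute importances on the held-out split.
result = permutation_importance(clf, X_val, y_val, n_repeats=10, random_state=0)

# Feature indices sorted from most to least important by mean score drop.
order = result.importances_mean.argsort()[::-1]
for i in order:
    print(
        f"feature {i}: "
        f"{result.importances_mean[i]:.3f} +/- {result.importances_std[i]:.3f}"
    )
```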

Gallery examples

Release Highlights for scikit-learn 0.22

Permutation Importance vs Random Forest Feature Importance (MDI)